Data Cleaning [01]: Handling Missing Data
Published:
Pandas uses the floating-point value NaN
(Not a Number, np.nan
) to represent missing data, and it provides several API functions related to missing data handling: dropna()
, fillna()
, isnull()
and notnull()
.
Filtering Out Missing Data: DataFrame.dropna()
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameters:
axis
: {0 or ‘index’, 1 or ‘columns’}, default 0- 0, or ‘index’ : Drop rows which contain missing values.
- 1, or ‘columns’ : Drop columns which contain missing value
how
: {‘any’, ‘all’}, default ‘any’- ‘any’ : If any NA values are present, drop that row or column.
- ‘all’ : If all values are NA, drop that row or column
thresh
: int, optionalsubset
: column label or sequence of labels, optional- Labels along other axis to consider
inplace
: bool, default False
import numpy.nan as NA
data = pd.DataFrame([[1,2,3,4,],[NA,3,4,1],[NA,NA,1,2],[NA,NA,NA,3],[NA,NA,NA,NA]])
data
Out[6]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 NaN 3.0 4.0 1.0
2 NaN NaN 1.0 2.0
3 NaN NaN NaN 3.0
4 NaN NaN NaN NaN
The default parameters will drop any rows that contain NA
.
dataCleaned = data.dropna()
dataCleaned
Out[8]:
0 1 2 3
0 1.0 2.0 3.0 4.0
To drop in columns, set axis = 1
, and the result will be an empty DataFrame.
dataCleaned = data.dropna(axis = 1)
dataCleaned
Out[10]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
To drop rows where all the items are NA
only, set how = 'all'
.
dataCleaned = data.dropna(how = 'all')
dataCleaned
Out[12]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 NaN 3.0 4.0 1.0
2 NaN NaN 1.0 2.0
3 NaN NaN NaN 3.0
To drop rows where the number of NA
is greater than 2.
dataCleaned = data.dropna(thresh = 2)
dataCleaned
Out[14]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 NaN 3.0 4.0 1.0
2 NaN NaN 1.0 2.0
To drop rows where there are NA
in columns 3 and 4.
dataCleaned = data.dropna(subset=[2,3])
dataCleaned
Out[17]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 NaN 3.0 4.0 1.0
2 NaN NaN 1.0 2.0
Filtering In Missing Data: DataFrame.fillna()
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
Parameters:
value
: scalar, dict, Series, or DataFramemethod
: {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default Nonepad
/ffill
: propagate last valid observation forward to next validbackfill
/bfill
: use next valid observation to fill gap.
axis
: {0 or ‘index’, 1 or ‘columns’}limit
: int, default None- maximum number of consecutive NaN values to forward/backward fill
downcast
: dict, default is Noneinplace
: bool, default False
Fill all NA
elements with 0.
dataCleaned = data.fillna(0.0)
dataCleaned
Out[19]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 0.0 3.0 4.0 1.0
2 0.0 0.0 1.0 2.0
3 0.0 0.0 0.0 3.0
4 0.0 0.0 0.0 0.0
Propagate non-null values forward.
dataCleaned = data.fillna(method='ffill')
dataCleaned
Out[21]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 1.0 3.0 4.0 1.0
2 1.0 3.0 1.0 2.0
3 1.0 3.0 1.0 3.0
4 1.0 3.0 1.0 3.0
Propagate non-null values backward.
data = data.append({0:1,1:2,2:3,3:4},ignore_index=True)
data
Out[30]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 NaN 3.0 4.0 1.0
2 NaN NaN 1.0 2.0
3 NaN NaN NaN 3.0
4 NaN NaN NaN NaN
5 1.0 2.0 3.0 4.0
dataCleaned = data.fillna(method='bfill')
dataCleaned
Out[32]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 1.0 3.0 4.0 1.0
2 1.0 2.0 1.0 2.0
3 1.0 2.0 3.0 3.0
4 1.0 2.0 3.0 4.0
5 1.0 2.0 3.0 4.0
Fill in each column with different values.
dataCleaned = data.fillna({0:1,1:2,2:3,3:4})
dataCleaned
Out[34]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 1.0 3.0 4.0 1.0
2 1.0 2.0 1.0 2.0
3 1.0 2.0 3.0 3.0
4 1.0 2.0 3.0 4.0
5 1.0 2.0 3.0 4.0
Fill in with another DataFrame where the column names are the same.
fillDF = pd.DataFrame(np.zeros((5,4)),columns = [1,2,3,4])
dataCleaned = data.fillna(fillDF)
dataCleaned
Out[42]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 NaN 3.0 4.0 1.0
2 NaN 0.0 1.0 2.0
3 NaN 0.0 0.0 3.0
4 NaN 0.0 0.0 0.0
5 1.0 2.0 3.0 4.0
Filtering In Missing Data: DataFrame.interpolate()
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
Parameters:
method
: str, default ‘linear’- ‘linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
- ‘time’: Works on daily and higher resolution data to interpolate given length of interval.
- ‘index’, ‘values’: use the actual numerical values of the index.
- ‘pad’: Fill in NaNs using existing values.
- ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘spline’, ‘barycentric’, ‘polynomial’: Passed to scipy.interpolate.interp1d. These methods use the numerical values of the index. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int).
- ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’, ‘cubicspline’: Wrappers around the SciPy interpolation methods of similar names. See Notes.
- ‘from_derivatives’: Refers to scipy.interpolate.BPoly.from_derivatives which replaces ‘piecewise_polynomial’ interpolation method in scipy 0.18.
pad
/ffill
: propagate last valid observation forward to next valid
axis
: {0 or ‘index’, 1 or ‘columns’}, default Nonelimit
: int, default None- maximum number of consecutive NaN values to forward/backward fill
limit_direction
: {‘forward’, ‘backward’, ‘both’}, Optionalinplace
: bool, default Falsedowncast
: optional, ‘infer’ or None, defaults to None
Linear interpolation.
dataCleaned = data.interpolate()
dataCleaned
Out[44]:
0 1 2 3
0 1.0 2.00 3.000000 4.0
1 1.0 3.00 4.000000 1.0
2 1.0 2.75 1.000000 2.0
3 1.0 2.50 1.666667 3.0
4 1.0 2.25 2.333333 3.5
5 1.0 2.00 3.000000 4.0
Interpolate by padding, limit = 2.
dataCleaned = data.interpolate(method = 'pad', limit = 2)
dataCleaned
Out[48]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 1.0 3.0 4.0 1.0
2 1.0 3.0 1.0 2.0
3 NaN 3.0 1.0 3.0
4 NaN NaN 1.0 3.0
5 1.0 2.0 3.0 4.0
Interpolate by polynomial (should be a Series, or columns of the DataFrame).
dataCleaned = data[2].interpolate(method = 'polynomial', order=2)
dataCleaned
Out[63]:
0 3.000000
1 4.000000
2 1.000000
3 -0.615385
4 0.051282
5 3.000000
Name: 2, dtype: float64
Comments