Data Cleaning [01]: Handling Missing Data

6 minute read

Published:

Pandas uses the floating-point value NaN (Not a Number, np.nan) to represent missing data, and it provides several API functions related to missing data handling: dropna(), fillna(), isnull() and notnull().


Filtering Out Missing Data: DataFrame.dropna()

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Parameters:

  • axis: {0 or ‘index’, 1 or ‘columns’}, default 0
    • 0, or ‘index’ : Drop rows which contain missing values.
    • 1, or ‘columns’ : Drop columns which contain missing value
  • how: {‘any’, ‘all’}, default ‘any’
    • ‘any’ : If any NA values are present, drop that row or column.
    • ‘all’ : If all values are NA, drop that row or column
  • thresh: int, optional
  • subset: column label or sequence of labels, optional
    • Labels along other axis to consider
  • inplace: bool, default False
import numpy.nan as NA
data = pd.DataFrame([[1,2,3,4,],[NA,3,4,1],[NA,NA,1,2],[NA,NA,NA,3],[NA,NA,NA,NA]])

data
Out[6]: 
     0    1    2    3
0  1.0  2.0  3.0  4.0
1  NaN  3.0  4.0  1.0
2  NaN  NaN  1.0  2.0
3  NaN  NaN  NaN  3.0
4  NaN  NaN  NaN  NaN 

The default parameters will drop any rows that contain NA.

dataCleaned = data.dropna()

dataCleaned
Out[8]: 
     0    1    2    3
0  1.0  2.0  3.0  4.0

To drop in columns, set axis = 1, and the result will be an empty DataFrame.

dataCleaned = data.dropna(axis = 1)

dataCleaned
Out[10]: 
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

To drop rows where all the items are NA only, set how = 'all'.

dataCleaned = data.dropna(how = 'all')

dataCleaned
Out[12]: 
     0    1    2    3
0  1.0  2.0  3.0  4.0
1  NaN  3.0  4.0  1.0
2  NaN  NaN  1.0  2.0
3  NaN  NaN  NaN  3.0

To drop rows where the number of NA is greater than 2.

dataCleaned = data.dropna(thresh = 2)

dataCleaned
Out[14]: 
     0    1    2    3
0  1.0  2.0  3.0  4.0
1  NaN  3.0  4.0  1.0
2  NaN  NaN  1.0  2.0

To drop rows where there are NA in columns 3 and 4.

dataCleaned = data.dropna(subset=[2,3])

dataCleaned
Out[17]: 
     0    1    2    3
0  1.0  2.0  3.0  4.0
1  NaN  3.0  4.0  1.0
2  NaN  NaN  1.0  2.0

Filtering In Missing Data: DataFrame.fillna()

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

Parameters:

  • value: scalar, dict, Series, or DataFrame
  • method: {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
    • pad/ffill: propagate last valid observation forward to next valid
    • backfill/bfill: use next valid observation to fill gap.
  • axis: {0 or ‘index’, 1 or ‘columns’}
  • limit: int, default None
    • maximum number of consecutive NaN values to forward/backward fill
  • downcast: dict, default is None
  • inplace: bool, default False

Fill all NA elements with 0.

dataCleaned = data.fillna(0.0)

dataCleaned
Out[19]: 
     0    1    2    3
0  1.0  2.0  3.0  4.0
1  0.0  3.0  4.0  1.0
2  0.0  0.0  1.0  2.0
3  0.0  0.0  0.0  3.0
4  0.0  0.0  0.0  0.0

Propagate non-null values forward.

dataCleaned = data.fillna(method='ffill')

dataCleaned
Out[21]: 
     0    1    2    3
0  1.0  2.0  3.0  4.0
1  1.0  3.0  4.0  1.0
2  1.0  3.0  1.0  2.0
3  1.0  3.0  1.0  3.0
4  1.0  3.0  1.0  3.0

Propagate non-null values backward.

data = data.append({0:1,1:2,2:3,3:4},ignore_index=True)

data
Out[30]: 
     0    1    2    3
0  1.0  2.0  3.0  4.0
1  NaN  3.0  4.0  1.0
2  NaN  NaN  1.0  2.0
3  NaN  NaN  NaN  3.0
4  NaN  NaN  NaN  NaN
5  1.0  2.0  3.0  4.0

dataCleaned = data.fillna(method='bfill')

dataCleaned
Out[32]: 
     0    1    2    3
0  1.0  2.0  3.0  4.0
1  1.0  3.0  4.0  1.0
2  1.0  2.0  1.0  2.0
3  1.0  2.0  3.0  3.0
4  1.0  2.0  3.0  4.0
5  1.0  2.0  3.0  4.0

Fill in each column with different values.

dataCleaned = data.fillna({0:1,1:2,2:3,3:4})

dataCleaned
Out[34]: 
     0    1    2    3
0  1.0  2.0  3.0  4.0
1  1.0  3.0  4.0  1.0
2  1.0  2.0  1.0  2.0
3  1.0  2.0  3.0  3.0
4  1.0  2.0  3.0  4.0
5  1.0  2.0  3.0  4.0

Fill in with another DataFrame where the column names are the same.

fillDF = pd.DataFrame(np.zeros((5,4)),columns = [1,2,3,4])

dataCleaned = data.fillna(fillDF)

dataCleaned
Out[42]: 
     0    1    2    3
0  1.0  2.0  3.0  4.0
1  NaN  3.0  4.0  1.0
2  NaN  0.0  1.0  2.0
3  NaN  0.0  0.0  3.0
4  NaN  0.0  0.0  0.0
5  1.0  2.0  3.0  4.0

Filtering In Missing Data: DataFrame.interpolate()

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

Parameters:

  • method: str, default ‘linear’
    • ‘linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
    • ‘time’: Works on daily and higher resolution data to interpolate given length of interval.
    • ‘index’, ‘values’: use the actual numerical values of the index.
    • ‘pad’: Fill in NaNs using existing values.
    • ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘spline’, ‘barycentric’, ‘polynomial’: Passed to scipy.interpolate.interp1d. These methods use the numerical values of the index. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int).
    • ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’, ‘cubicspline’: Wrappers around the SciPy interpolation methods of similar names. See Notes.
    • ‘from_derivatives’: Refers to scipy.interpolate.BPoly.from_derivatives which replaces ‘piecewise_polynomial’ interpolation method in scipy 0.18.pad/ffill: propagate last valid observation forward to next valid
  • axis: {0 or ‘index’, 1 or ‘columns’}, default None
  • limit: int, default None
    • maximum number of consecutive NaN values to forward/backward fill
  • limit_direction: {‘forward’, ‘backward’, ‘both’}, Optional
  • inplace: bool, default False
  • downcast: optional, ‘infer’ or None, defaults to None

Linear interpolation.

dataCleaned = data.interpolate()

dataCleaned
Out[44]: 
     0     1         2    3
0  1.0  2.00  3.000000  4.0
1  1.0  3.00  4.000000  1.0
2  1.0  2.75  1.000000  2.0
3  1.0  2.50  1.666667  3.0
4  1.0  2.25  2.333333  3.5
5  1.0  2.00  3.000000  4.0

Interpolate by padding, limit = 2.

dataCleaned = data.interpolate(method = 'pad', limit = 2)

dataCleaned
Out[48]: 
     0    1    2    3
0  1.0  2.0  3.0  4.0
1  1.0  3.0  4.0  1.0
2  1.0  3.0  1.0  2.0
3  NaN  3.0  1.0  3.0
4  NaN  NaN  1.0  3.0
5  1.0  2.0  3.0  4.0

Interpolate by polynomial (should be a Series, or columns of the DataFrame).

dataCleaned = data[2].interpolate(method = 'polynomial', order=2)

dataCleaned
Out[63]: 
0    3.000000
1    4.000000
2    1.000000
3   -0.615385
4    0.051282
5    3.000000
Name: 2, dtype: float64

Comments