Data Cleaning [02]: Data Transformation

7 minute read

Published:

Pandas provides a bunch of APIs for data removing, replacing, renaming, transforming etc. This post will get a sketch of these functions.


Removing Duplicates

  • data.duplicated()
  • data.drop_duplicates()
    DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
    

    Parameters:

  • subset: column label or sequence of labels, optional
  • keep: {‘first’, ‘last’, False}, default ‘first’
    • first : Drop duplicates except for the first occurrence.
    • last : Drop duplicates except for the last occurrence.
    • False : Drop all duplicates.
  • inplace: bool, default False
  • ignore_index: bool, default False

Dataset:

data = pd.DataFrame({'name':['Saito','Ozaki','Hanazawa','Saito','Kitamura'],'Age':[20,19,19,20,19], 'Grade':['M2','D1','D2','M2','D3'] })

data
Out[65]: 
       name  Age Grade
0     Saito   20    M2
1     Ozaki   19    D1
2  Hanazawa   19    D2
3     Saito   20    M2
4  Kitamura   19    D3

Drop all the duplicates except the first one:

dataDropped = data.drop_duplicates()

dataDropped
Out[67]: 
       name  Age Grade
0     Saito   20    M2
1     Ozaki   19    D1
2  Hanazawa   19    D2
4  Kitamura   19    D3

Reindex the index numbers:

dataDropped = data.drop_duplicates(ignore_index=True)

dataDropped
Out[69]: 
       name  Age Grade
0     Saito   20    M2
1     Ozaki   19    D1
2  Hanazawa   19    D2
3  Kitamura   19    D3

Consider the column Grade only:

dataDropped = data.drop_duplicates(subset = 'Grade')

dataDropped
Out[77]: 
       name  Age Grade
0     Saito   20    M2
1     Ozaki   19    D1
2  Hanazawa   19    D2
4  Kitamura   19    D3

Replacing Values

DataFrame.replace(to_replace=None, value=NoDefault.no_default, inplace=False, limit=None, regex=False, method=NoDefault.no_default)

Parameters:

  • to_replace: str, regex, list, dict, Series, int, float, or None
  • value: scalar, dict, list, str, regex, default None
  • inplace: bool, default False
  • limit: int, default None
  • regex: bool or same types as to_replace, default False
    • If this is True then to_replace must be a string.
    • Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.
  • method: {‘pad’, ‘ffill’, ‘bfill’, None}

Data:

data = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                     'B': [5, 6, 7, 8, 9],
                     'C': ['a', 'b', 'c', 'd', 'e']})

data
Out[79]: 
   A  B  C
0  0  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

data = pd.concat([data,pd.DataFrame({'D':[3,4,5,6,7]})], axis=1)

data
Out[83]: 
   A  B  C  D
0  0  5  a  3
1  1  6  b  4
2  2  7  c  5
3  3  8  d  6
4  4  9  e  7

Simple replacement:

dataReplaced = data.replace(3,0)

dataReplaced
Out[88]: 
   A  B  C  D
0  0  5  a  0
1  1  6  b  4
2  2  7  c  5
3  0  8  d  6
4  4  9  e  7

List-like to_replace:

dataReplaced = data.replace([3,4,5],0)

dataReplaced
Out[90]: 
   A  B  C  D
0  0  0  a  0
1  1  6  b  0
2  2  7  c  0
3  0  8  d  6
4  0  9  e  7

Use method to specify the value for replacement:

dataReplaced = data.replace(3, method = 'bfill')

dataReplaced
Out[92]: 
   A  B  C  D
0  0  5  a  4
1  1  6  b  4
2  2  7  c  5
3  4  8  d  6
4  4  9  e  7

Dict-like to_replace:

# if the value paremeter is None, dicts are be used to specify different replacement values for different existing values. 
dataReplaced = data.replace({3:5, 4:2})

dataReplaced
Out[104]: 
   A  B  C  D
0  0  5  a  5
1  1  6  b  2
2  2  7  c  5
3  5  8  d  6
4  2  9  e  7

# if value parameter is not None, a dict can specify that different values should be replaced in different columns.
# scalar value
dataReplaced = data.replace({3:5, 'A':2, 'B':6}, 2)

dataReplaced
Out[106]: 
   A  B  C  D
0  0  5  a  3
1  1  2  b  4
2  2  7  c  5
3  3  8  d  6
4  4  9  e  7

# dict-like value
dataReplaced = data.replace({3:5, 'A':2, 'B':6},{'A':0,'B':1})

dataReplaced
Out[115]: 
   A  B  C  D
0  0  5  a  3
1  1  1  b  4
2  0  7  c  5
3  3  8  d  6
4  4  9  e  7

# Nest-dict-like ato_replace
dataReplaced = data.replace({'A':{1:2,2:3,3:4}})

dataReplaced
Out[118]: 
   A  B  C  D
0  0  5  a  3
1  2  6  b  4
2  3  7  c  5
3  4  8  d  6
4  4  9  e  7

Regular expression to_replace:

dataRegex = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
                          'B': ['abc', 'bar', 'xyz']})

dataRegex
Out[121]: 
      A    B
0   bat  abc
1   foo  bar
2  bait  xyz

df.replace(to_replace=r'^ba.$', value='new', regex=True)
df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
df.replace(regex=r'^ba.$', value='new')
df.replace(regex={r'^ba.$': 'new'})

Out[129]: 
      A    B
0   new  abc
1   foo  new
2  bait  xyz

Here is a reference to regular expressions.


Transforming Data Using Mapping or Functions

  • Series.map(): Map values of Series according to an input mapping or function.
    Series.map(arg, na_action=None)
    

    Example: ```python data = pd.DataFrame({‘food’: [‘bacon’, ‘pulled pork’, ‘bacon’, ‘Pastrami’, ‘corned beef’, ‘Bacon’, ‘pastrami’, ‘honey ham’, ‘nova lox’], ‘ounces’: [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

data Out[133]: food ounces 0 bacon 4.0 1 pulled pork 3.0 2 bacon 12.0 3 Pastrami 6.0 4 corned beef 7.5 5 Bacon 8.0 6 pastrami 3.0 7 honey ham 5.0 8 nova lox 6.0

mapping correspondence

meat_to_animal = { ‘bacon’: ‘pig’, ‘pulled pork’: ‘pig’, ‘pastrami’: ‘cow’, ‘corned beef’: ‘cow’, ‘honey ham’: ‘pig’, ‘nova lox’: ‘salmon’ }

data[‘animal’] = data[‘food’].str.lower().map(meat_to_animal)

data Out[137]: food ounces animal 0 bacon 4.0 pig 1 pulled pork 3.0 pig 2 bacon 12.0 pig 3 Pastrami 6.0 cow 4 corned beef 7.5 cow 5 Bacon 8.0 pig 6 pastrami 3.0 cow 7 honey ham 5.0 pig 8 nova lox 6.0 salmon


- `Series.apply()`: Invoke function on values of Series.
```python
Series.apply(func, convert_dtype=True, args=(), **kwargs)
data = pd.Series([1,2,3,4])

data
Out[139]: 
0    1
1    2
2    3
3    4
dtype: int64

# lambda function
data.apply(lambda x: x ** 2)
Out[140]: 
0     1
1     4
2     9
3    16
dtype: int64

# user-defined function
def square(x):
    return x ** 2
    
data.apply(square)
Out[142]: 
0     1
1     4
2     9
3    16
dtype: int64
  • DataFrame.applymap(): Apply a function to a Dataframe elementwise.
    DataFrame.applymap(func, na_action=None, **kwargs)
    
  • DataFrame.apply(): Apply a function along an axis of the DataFrame.
    DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
    
  • DataFrame.rename(): Alter axes labels.
    DataFrame.rename(mapper=None, *, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore')
    

Data Binning

  • pandas.cut()
    pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)
    

    ```python pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)

Out[144]: [(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (0.994, 3.0]] Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] < (5.0, 7.0]]

Add labels:
```python
pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, labels=["bad", "medium", "good"])
Out[145]: 
['bad', 'good', 'medium', 'medium', 'good', 'bad']
Categories (3, object): ['bad' < 'medium' < 'good']
  • pandas.qcut(): Quantile-based discretization function.
    pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')
    

Coping With Outliers

data = pd.DataFrame(np.random.randn(20, 4))

# Consider value exceeding 3 as outliers
data[(np.abs(data) > 3).any(1)]
Out[176]: 
          0         1         2        3
7 -3.219183  0.939318 -1.538681 -0.67697

# Cap value outside [-3,3]
data[(np.abs(data) >= 3).any(1)]
Out[180]: 
     0         1         2        3
7 -3.0  0.939318 -1.538681 -0.67697

Dummy Variables

  • pandas.get_dummies(): Convert categorical variable into dummy/indicator variables.
    pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
    

    ```python data = pd.DataFrame({‘key’: [‘b’, ‘b’, ‘a’, ‘c’, ‘a’, ‘b’], ‘value’: range(6)})

data Out[189]: key value 0 b 0 1 b 1 2 a 2 3 c 3 4 a 4 5 b 5

get dummies

dummies = pd.get_dummies(df[‘key’], prefix=’key’)

dummies Out[192]: key_a key_b key_c 0 0 1 0 1 0 1 0 2 1 0 0 3 0 0 1 4 1 0 0 5 0 1 0

join together

data_with_dummy = data[[‘value’]].join(dummies)

data_with_dummy Out[194]: value key_a key_b key_c 0 0 0 1 0 1 1 0 1 0 2 2 1 0 0 3 3 0 0 1 4 4 1 0 0 5 5 0 1 0 ```


Comments