Data Cleaning [02]: Data Transformation
Pandas provides a rich set of APIs for removing, replacing, renaming, and otherwise transforming data. This post gives a sketch of these functions.
Removing Duplicates
- `data.duplicated()`: returns a boolean Series flagging rows that duplicate an earlier row.
- `data.drop_duplicates()`: returns the DataFrame with duplicate rows removed.
```python
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
```
Parameters:
subset
: column label or sequence of labels, optional
keep
: {'first', 'last', False}, default 'first'
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
inplace
: bool, default False
ignore_index
: bool, default False
Dataset:
```python
import pandas as pd

data = pd.DataFrame({'name': ['Saito', 'Ozaki', 'Hanazawa', 'Saito', 'Kitamura'],
                     'Age': [20, 19, 19, 20, 19],
                     'Grade': ['M2', 'D1', 'D2', 'M2', 'D3']})
data
Out[65]:
       name  Age Grade
0     Saito   20    M2
1     Ozaki   19    D1
2  Hanazawa   19    D2
3     Saito   20    M2
4  Kitamura   19    D3
```
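For reference, `duplicated()` only flags duplicates rather than dropping them; a quick sketch on the same data:
```python
data.duplicated()
# 0    False
# 1    False
# 2    False
# 3     True   <- repeats row 0
# 4    False
# dtype: bool
```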
Drop all the duplicates except the first one:
```python
dataDropped = data.drop_duplicates()
dataDropped
Out[67]:
       name  Age Grade
0     Saito   20    M2
1     Ozaki   19    D1
2  Hanazawa   19    D2
4  Kitamura   19    D3
```
Reset the row index with ignore_index:
```python
dataDropped = data.drop_duplicates(ignore_index=True)
dataDropped
Out[69]:
       name  Age Grade
0     Saito   20    M2
1     Ozaki   19    D1
2  Hanazawa   19    D2
3  Kitamura   19    D3
```
Consider only the column Grade:
```python
dataDropped = data.drop_duplicates(subset='Grade')
dataDropped
Out[77]:
       name  Age Grade
0     Saito   20    M2
1     Ozaki   19    D1
2  Hanazawa   19    D2
4  Kitamura   19    D3
```
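keep='last' and keep=False work analogously; for example, keep=False removes every occurrence of a duplicated row (a quick sketch):
```python
# keep=False drops all rows that have a duplicate, including the first occurrence
dataDropped = data.drop_duplicates(keep=False)
dataDropped
# both 'Saito' rows (indices 0 and 3) are removed
```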
Replacing Values
```python
DataFrame.replace(to_replace=None, value=NoDefault.no_default, inplace=False, limit=None, regex=False, method=NoDefault.no_default)
```
Parameters:
to_replace
: str, regex, list, dict, Series, int, float, or None
value
: scalar, dict, list, str, regex, default None
inplace
: bool, default False
limit
: int, default None
regex
: bool or same types as to_replace, default False
- If this is True then to_replace must be a string.
- Alternatively, this could be a regular expression or a list, dict, or array of regular expressions, in which case to_replace must be None.
method
: {'pad', 'ffill', 'bfill', None}
Data:
```python
data = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                     'B': [5, 6, 7, 8, 9],
                     'C': ['a', 'b', 'c', 'd', 'e']})
data
Out[79]:
   A  B  C
0  0  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e
```
```python
data = pd.concat([data, pd.DataFrame({'D': [3, 4, 5, 6, 7]})], axis=1)
data
Out[83]:
   A  B  C  D
0  0  5  a  3
1  1  6  b  4
2  2  7  c  5
3  3  8  d  6
4  4  9  e  7
```
Simple replacement:
```python
dataReplaced = data.replace(3, 0)
dataReplaced
Out[88]:
   A  B  C  D
0  0  5  a  0
1  1  6  b  4
2  2  7  c  5
3  0  8  d  6
4  4  9  e  7
```
List-like to_replace:
```python
dataReplaced = data.replace([3, 4, 5], 0)
dataReplaced
Out[90]:
   A  B  C  D
0  0  0  a  0
1  1  6  b  0
2  2  7  c  0
3  0  8  d  6
4  0  9  e  7
```
Use method to derive the replacement value (here 'bfill' fills each match with the next value in its column):
```python
dataReplaced = data.replace(3, method='bfill')
dataReplaced
Out[92]:
   A  B  C  D
0  0  5  a  4
1  1  6  b  4
2  2  7  c  5
3  4  8  d  6
4  4  9  e  7
```
Dict-like to_replace:
```python
# If the value parameter is None, a dict maps each existing value to its replacement.
dataReplaced = data.replace({3: 5, 4: 2})
dataReplaced
Out[104]:
   A  B  C  D
0  0  5  a  5
1  1  6  b  2
2  2  7  c  5
3  5  8  d  6
4  2  9  e  7
```
```python
# If the value parameter is not None, the dict keys of to_replace are treated as
# column labels, each mapped to the value to be replaced in that column.
# scalar value: 2 in column 'A' and 6 in column 'B' are both replaced by 2
dataReplaced = data.replace({3: 5, 'A': 2, 'B': 6}, 2)
dataReplaced
Out[106]:
   A  B  C  D
0  0  5  a  3
1  1  2  b  4
2  2  7  c  5
3  3  8  d  6
4  4  9  e  7
```
```python
# dict-like value: 2 in column 'A' becomes 0, 6 in column 'B' becomes 1
dataReplaced = data.replace({3: 5, 'A': 2, 'B': 6}, {'A': 0, 'B': 1})
dataReplaced
Out[115]:
   A  B  C  D
0  0  5  a  3
1  1  1  b  4
2  0  7  c  5
3  3  8  d  6
4  4  9  e  7
```
```python
# nested-dict-like to_replace: within column 'A', 1->2, 2->3, 3->4
dataReplaced = data.replace({'A': {1: 2, 2: 3, 3: 4}})
dataReplaced
Out[118]:
   A  B  C  D
0  0  5  a  3
1  2  6  b  4
2  3  7  c  5
3  4  8  d  6
4  4  9  e  7
```
Regular expression to_replace:
```python
dataRegex = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
                          'B': ['abc', 'bar', 'xyz']})
dataRegex
Out[121]:
      A    B
0   bat  abc
1   foo  bar
2  bait  xyz
```
```python
# Each of these replaces matches of the pattern with 'new';
# the dict-keyed form restricts the replacement to column 'A'.
dataRegex.replace(to_replace=r'^ba.$', value='new', regex=True)
dataRegex.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
dataRegex.replace(regex=r'^ba.$', value='new')
dataRegex.replace(regex={r'^ba.$': 'new'})
Out[129]:
      A    B
0   new  abc
1   foo  new
2  bait  xyz
```
Here is a reference to regular expressions.
Transforming Data Using Mapping or Functions
- `Series.map()`: Map values of Series according to an input mapping or function.
```python
Series.map(arg, na_action=None)
```
Example:
```python
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
                              'corned beef', 'Bacon', 'pastrami', 'honey ham',
                              'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[133]:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

# mapping correspondence
meat_to_animal = {'bacon': 'pig',
                  'pulled pork': 'pig',
                  'pastrami': 'cow',
                  'corned beef': 'cow',
                  'honey ham': 'pig',
                  'nova lox': 'salmon'}

data['animal'] = data['food'].str.lower().map(meat_to_animal)
data
Out[137]:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
```
- `Series.apply()`: Invoke function on values of Series.
```python
Series.apply(func, convert_dtype=True, args=(), **kwargs)
```
Example:
```python
data = pd.Series([1, 2, 3, 4])
data
Out[139]:
0    1
1    2
2    3
3    4
dtype: int64

# lambda function
data.apply(lambda x: x ** 2)
Out[140]:
0     1
1     4
2     9
3    16
dtype: int64

# user-defined function
def square(x):
    return x ** 2

data.apply(square)
Out[142]:
0     1
1     4
2     9
3    16
dtype: int64
```
- `DataFrame.applymap()`: Apply a function to a DataFrame elementwise.
```python
DataFrame.applymap(func, na_action=None, **kwargs)
```
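For instance, a minimal sketch (frame and values made up for illustration) that rounds every element of a numeric frame:
```python
df = pd.DataFrame([[1.2345, 2.3456],
                   [3.4567, 4.5678]], columns=['x', 'y'])
df.applymap(lambda v: round(v, 2))  # the lambda runs once per element
```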
- `DataFrame.apply()`: Apply a function along an axis of the DataFrame.
```python
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
```
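A short sketch (column names assumed for illustration) showing both axes:
```python
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})
df.apply(lambda col: col.max() - col.min())        # axis=0: func receives each column
df.apply(lambda row: row['a'] + row['b'], axis=1)  # axis=1: func receives each row
```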
- `DataFrame.rename()`: Alter axes labels.
```python
DataFrame.rename(mapper=None, *, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore')
```
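A quick sketch (labels made up) renaming one column and one index label; callables also work as mappers:
```python
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.rename(columns={'a': 'alpha'}, index={0: 'first'})
df.rename(columns=str.upper)  # apply a function to every column label
```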
Data Binning
- `pandas.cut()`: Bin values into discrete intervals.
```python
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)
```
Example:
```python
import numpy as np

pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
Out[144]:
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (0.994, 3.0]]
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] < (5.0, 7.0]]
```
Add labels:
```python
pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, labels=["bad", "medium", "good"])
Out[145]:
['bad', 'good', 'medium', 'medium', 'good', 'bad']
Categories (3, object): ['bad' < 'medium' < 'good']
```
- `pandas.qcut()`: Quantile-based discretization function.
```python
pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')
```
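Where cut makes equal-width bins, qcut makes bins holding roughly equal counts; a quick sketch:
```python
# eight values split into quartiles: each bin receives two values
pd.qcut(range(8), 4, labels=['q1', 'q2', 'q3', 'q4'])
```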
Coping With Outliers
```python
data = pd.DataFrame(np.random.randn(20, 4))

# Consider any value exceeding 3 in absolute value an outlier
data[(np.abs(data) > 3).any(axis=1)]
Out[176]:
          0         1         2        3
7 -3.219183  0.939318 -1.538681 -0.67697

# Cap values outside [-3, 3]
data[np.abs(data) > 3] = np.sign(data) * 3
data[(np.abs(data) >= 3).any(axis=1)]
Out[180]:
     0         1         2        3
7 -3.0  0.939318 -1.538681 -0.67697
```
Dummy Variables
- `pandas.get_dummies()`: Convert categorical variable into dummy/indicator variables.
```python
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
```
```python
data = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                     'value': range(6)})
data
Out[189]:
  key  value
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5

# get dummies
dummies = pd.get_dummies(data['key'], prefix='key')
dummies
Out[192]:
   key_a  key_b  key_c
0      0      1      0
1      0      1      0
2      1      0      0
3      0      0      1
4      1      0      0
5      0      1      0

# join together
data_with_dummy = data[['value']].join(dummies)
data_with_dummy
Out[194]:
   value  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
```
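One note on the drop_first parameter: with k categories, k-1 indicator columns suffice, since the dropped category is encoded as all zeros (useful to avoid collinearity in regression). A quick sketch on the same data:
```python
pd.get_dummies(data['key'], prefix='key', drop_first=True)
# 'key_a' is dropped; rows where key == 'a' have key_b == key_c == 0
```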