Data Wrangling [03]: Reshaping and Pivot Tables
Published:
Several functions are useful for reshaping and pivoting the tables:
DataFrame.stack()
: Stack the prescribed level(s) from columns to index.DataFrame.unstack()
: Pivot a level of the (necessarily hierarchical) index labels.DataFrame.pivot()
: Return reshaped DataFrame organized by given index / column values.DataFrame.melt()
: Unpivot a DataFrame from wide to long format.DataFrame.explode()
: Transform each element of a list-like to a row, replicating index values.
Reshaping By stack() & unstack()
DataFrame.stack(level=- 1, dropna=True)
Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame. The new inner-most levels are created by pivoting the columns of the current dataframe.
DataFrame.unstack(level=- 1, fill_value=None)
Returns a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels.
Single Level Columns
df = pd.DataFrame([['Tom','M'], ['Jack', 'M'], ['Susan', 'F']], columns = ['Name', 'Gender'], index = ['A','B','C'])
df
Out[710]:
Name Gender
A Tom M
B Jack M
C Susan F
dfStacked = df.stack()
dfStacked
Out[713]:
A Name Tom
Gender M
B Name Jack
Gender M
C Name Susan
Gender F
dtype: object
# the result is a Series
type(dfStacked)
Out[714]: pandas.core.series.Series
# unstack the Series results in a DataFrame
dfStacked.unstack()
Out[715]:
Name Gender
A Tom M
B Jack M
C Susan F
Multiple Levels Columns: Simple
df = pd.DataFrame([['Tom', 'Ford'], ['Jack', 'Browne'], ['Susan', 'Sontag']],
columns = pd.MultiIndex.from_tuples([('Name', 'F. Name'), ('Name', 'S. Name')]),
index = ['A','B','C'])
df
Out[731]:
Name
F. Name S. Name
A Tom Ford
B Jack Browne
C Susan Sontag
df.stack()
Out[732]:
Name
A F. Name Tom
S. Name Ford
B F. Name Jack
S. Name Browne
C F. Name Susan
S. Name Sontag
df.stack().unstack()
Out[733]:
Name
F. Name S. Name
A Tom Ford
B Jack Browne
C Susan Sontag
Multiple Levels Columns: Missing Values
df = pd.DataFrame([['Tom', 'Ford', 'M'], ['Jack', 'Browne', 'M'], ['Susan', 'Sontag', 'F']],
columns = pd.MultiIndex.from_tuples([('Name', 'F. Name'), ('Name', 'S. Name'), ('Other','Gender')]),
index = ['A','B','C'])
# Simple stack will result in missing values
df.stack()
Out[735]:
Name Other
A F. Name Tom NaN
Gender NaN M
S. Name Ford NaN
B F. Name Jack NaN
Gender NaN M
S. Name Browne NaN
C F. Name Susan NaN
Gender NaN F
S. Name Sontag NaN
# In this case, we can designate the levels we want to stack
df.stack([0,1])
Out[738]:
A Name F. Name Tom
S. Name Ford
Other Gender M
B Name F. Name Jack
S. Name Browne
Other Gender M
C Name F. Name Susan
S. Name Sontag
Other Gender F
dtype: object
# The same is with unstack()
df.stack([0,1]).unstack([1,2])
Out[741]:
Name Other
F. Name S. Name Gender
A Tom Ford M
B Jack Browne M
C Susan Sontag F
Reshaping By pivot()
DataFrame.pivot(index=None, columns=None, values=None)
Reshape data based on column values. Uses unique values from specified index / columns to form axes of the resulting DataFrame. This function does not support data aggregation, multiple values will result in a MultiIndex in the columns.
Parameters:
index
: str or object or a list of str, optional- Column to use to make new frame’s index. If None, uses existing index.
columns
: str or object or a list of str- Column to use to make new frame’s columns.
values
: str, object or a list of the previous, optional- Column(s) to use for populating new frame’s values. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns.
Simple Pivoting
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
'baz': [1, 2, 3, 4, 5, 6],
'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
# choose 'foo' as index and 'bar' as columns
df.pivot(index='foo', columns='bar')
Out[743]:
baz zoo
bar A B C A B C
foo
one 1 2 3 x y z
two 4 5 6 q w t
# choose 'baz' as values
df.pivot(index='foo', columns='bar', values='baz')
Out[748]:
bar A B C
foo
one 1 2 3
two 4 5 6
List of Column/Index Names
df = pd.DataFrame({
"lev1": [1, 1, 1, 2, 2, 2],
"lev2": [1, 1, 2, 1, 1, 2],
"lev3": [1, 2, 1, 2, 1, 2],
"lev4": [1, 2, 3, 4, 5, 6],
"values": [0, 1, 2, 3, 4, 5]})
# list of column names
df.pivot(index="lev1", columns=["lev2", "lev3"],values="values")
Out[750]:
lev2 1 2
lev3 1 2 1 2
lev1
1 0.0 1.0 2.0 NaN
2 4.0 3.0 NaN 5.0
# list of index names
df.pivot(index=["lev1",'lev2'], columns=["lev3"],values="values")
Out[753]:
lev3 1 2
lev1 lev2
1 1 0.0 1.0
2 2.0 NaN
2 1 4.0 3.0
2 NaN 5.0
# list of column and stack names
df.pivot(index=["lev1",'lev2'], columns=["lev3", 'lev4'],values="values")
Out[754]:
lev3 1 2 1 2 1 2
lev4 1 2 3 4 5 6
lev1 lev2
1 1 0.0 1.0 NaN NaN NaN NaN
2 NaN NaN 2.0 NaN NaN NaN
2 1 NaN NaN NaN 3.0 4.0 NaN
2 NaN NaN NaN NaN NaN 5.0
Reshaping By melt()
DataFrame.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)
Parameters:
id_vars
: tuple, list, or ndarray, optional- Column(s) to use as identifier variables.
value_vars
: tuple, list, or ndarray, optional- Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_name
: scalar- Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
value_name
: scalar, default ‘value’- Name to use for the ‘value’ column.
col_level
: int or str, optional- If columns are a MultiIndex then use this level to melt.
ignore_index
: bool, default True- If True, original index is ignored. If False, the original index is retained. Index labels will be repeated as necessary.
Simple Case
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
'B': {0: 1, 1: 3, 2: 5},
'C': {0: 2, 1: 4, 2: 6}})
df.melt(id_vars=['A'])
Out[756]:
A variable value
0 a B 1
1 b B 3
2 c B 5
3 a C 2
4 b C 4
5 c C 6
Multi-index Columns
Example 1:
df.columns = pd.MultiIndex.from_tuples([('A','B'), ('A','C'), ('D','E')])
df
Out[780]:
A D
B C E
0 a 1 2
1 b 3 4
2 c 5 6
df.melt(id_vars=[('A','B')])
Out[785]:
(A, B) variable_0 variable_1 value
0 a A C 1
1 b A C 3
2 c A C 5
3 a D E 2
4 b D E 4
5 c D E 6
Example 2:
df.columns = pd.MultiIndex.from_tuples([('A','B'), ('D','C'), ('D','E')])
df
Out[790]:
A D
B C E
0 a 1 2
1 b 3 4
2 c 5 6
df.melt(id_vars=[('A')], col_level=0)
Out[791]:
A variable value
0 a D 1
1 b D 3
2 c D 5
3 a D 2
4 b D 4
5 c D 6
Comments