Data Wrangling [03]: Reshaping and Pivot Tables
Published:
Several functions are useful for reshaping and pivoting the tables:
- DataFrame.stack(): Stack the prescribed level(s) from columns to index.
- DataFrame.unstack(): Pivot a level of the (necessarily hierarchical) index labels.
- DataFrame.pivot(): Return reshaped DataFrame organized by given index / column values.
- DataFrame.melt(): Unpivot a DataFrame from wide to long format.
- DataFrame.explode(): Transform each element of a list-like to a row, replicating index values.
Reshaping By stack() & unstack()
DataFrame.stack(level=- 1, dropna=True)
Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame. The new inner-most levels are created by pivoting the columns of the current dataframe.
DataFrame.unstack(level=- 1, fill_value=None)
Returns a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels.
Single Level Columns
df = pd.DataFrame([['Tom','M'], ['Jack', 'M'], ['Susan', 'F']], columns = ['Name', 'Gender'], index = ['A','B','C'])
df
Out[710]: 
    Name Gender
A    Tom      M
B   Jack      M
C  Susan      F
dfStacked = df.stack()
dfStacked
Out[713]: 
A  Name        Tom
   Gender        M
B  Name       Jack
   Gender        M
C  Name      Susan
   Gender        F
dtype: object
# the result is a Series
type(dfStacked)
Out[714]: pandas.core.series.Series
# unstack the Series results in a DataFrame
dfStacked.unstack()
Out[715]: 
    Name Gender
A    Tom      M
B   Jack      M
C  Susan      F
Multiple Levels Columns: Simple
df = pd.DataFrame([['Tom', 'Ford'], ['Jack', 'Browne'], ['Susan', 'Sontag']], 
                  columns = pd.MultiIndex.from_tuples([('Name', 'F. Name'), ('Name', 'S. Name')]), 
                  index = ['A','B','C'])
df
Out[731]: 
     Name        
  F. Name S. Name
A     Tom    Ford
B    Jack  Browne
C   Susan  Sontag
df.stack()
Out[732]: 
             Name
A F. Name     Tom
  S. Name    Ford
B F. Name    Jack
  S. Name  Browne
C F. Name   Susan
  S. Name  Sontag
df.stack().unstack()
Out[733]: 
     Name        
  F. Name S. Name
A     Tom    Ford
B    Jack  Browne
C   Susan  Sontag
Multiple Levels Columns: Missing Values
df = pd.DataFrame([['Tom', 'Ford', 'M'], ['Jack', 'Browne', 'M'], ['Susan', 'Sontag', 'F']], 
                   columns = pd.MultiIndex.from_tuples([('Name', 'F. Name'), ('Name', 'S. Name'), ('Other','Gender')]), 
                   index = ['A','B','C'])
# Simple stack will result in missing values
df.stack()
Out[735]: 
             Name Other
A F. Name     Tom   NaN
  Gender      NaN     M
  S. Name    Ford   NaN
B F. Name    Jack   NaN
  Gender      NaN     M
  S. Name  Browne   NaN
C F. Name   Susan   NaN
  Gender      NaN     F
  S. Name  Sontag   NaN
# In this case, we can designate the levels we want to stack
df.stack([0,1])
Out[738]: 
A  Name   F. Name       Tom
          S. Name      Ford
   Other  Gender          M
B  Name   F. Name      Jack
          S. Name    Browne
   Other  Gender          M
C  Name   F. Name     Susan
          S. Name    Sontag
   Other  Gender          F
dtype: object
# The same is with unstack()
df.stack([0,1]).unstack([1,2])
Out[741]: 
     Name          Other
  F. Name S. Name Gender
A     Tom    Ford      M
B    Jack  Browne      M
C   Susan  Sontag      F
Reshaping By pivot()
DataFrame.pivot(index=None, columns=None, values=None)
Reshape data based on column values. Uses unique values from specified index / columns to form axes of the resulting DataFrame. This function does not support data aggregation, multiple values will result in a MultiIndex in the columns.
Parameters:
- index: str or object or a list of str, optional- Column to use to make new frame’s index. If None, uses existing index.
 
- columns: str or object or a list of str- Column to use to make new frame’s columns.
 
- values: str, object or a list of the previous, optional- Column(s) to use for populating new frame’s values. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns.
 
Simple Pivoting
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
# choose 'foo' as index and 'bar' as columns
df.pivot(index='foo', columns='bar')
Out[743]: 
    baz       zoo      
bar   A  B  C   A  B  C
foo                    
one   1  2  3   x  y  z
two   4  5  6   q  w  t
# choose 'baz' as values
df.pivot(index='foo', columns='bar', values='baz')
Out[748]: 
bar  A  B  C
foo         
one  1  2  3
two  4  5  6
List of Column/Index Names
df = pd.DataFrame({
       "lev1": [1, 1, 1, 2, 2, 2],
       "lev2": [1, 1, 2, 1, 1, 2],
       "lev3": [1, 2, 1, 2, 1, 2],
       "lev4": [1, 2, 3, 4, 5, 6],
       "values": [0, 1, 2, 3, 4, 5]})
# list of column names
df.pivot(index="lev1", columns=["lev2", "lev3"],values="values")
Out[750]: 
lev2    1         2     
lev3    1    2    1    2
lev1                    
1     0.0  1.0  2.0  NaN
2     4.0  3.0  NaN  5.0
# list of index names
df.pivot(index=["lev1",'lev2'], columns=["lev3"],values="values")
Out[753]: 
lev3         1    2
lev1 lev2          
1    1     0.0  1.0
     2     2.0  NaN
2    1     4.0  3.0
     2     NaN  5.0
# list of column and stack names
df.pivot(index=["lev1",'lev2'], columns=["lev3", 'lev4'],values="values")
Out[754]: 
lev3         1    2    1    2    1    2
lev4         1    2    3    4    5    6
lev1 lev2                              
1    1     0.0  1.0  NaN  NaN  NaN  NaN
     2     NaN  NaN  2.0  NaN  NaN  NaN
2    1     NaN  NaN  NaN  3.0  4.0  NaN
     2     NaN  NaN  NaN  NaN  NaN  5.0
Reshaping By melt()
DataFrame.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)
Parameters:
- id_vars: tuple, list, or ndarray, optional- Column(s) to use as identifier variables.
 
- value_vars: tuple, list, or ndarray, optional- Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
 
- var_name: scalar- Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
 
- value_name: scalar, default ‘value’- Name to use for the ‘value’ column.
 
- col_level: int or str, optional- If columns are a MultiIndex then use this level to melt.
 
- ignore_index: bool, default True- If True, original index is ignored. If False, the original index is retained. Index labels will be repeated as necessary.
 
Simple Case
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
                   
df.melt(id_vars=['A'])
Out[756]: 
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
3  a        C      2
4  b        C      4
5  c        C      6                   
Multi-index Columns
Example 1:
df.columns = pd.MultiIndex.from_tuples([('A','B'), ('A','C'), ('D','E')])
df
Out[780]: 
   A     D
   B  C  E
0  a  1  2
1  b  3  4
2  c  5  6
df.melt(id_vars=[('A','B')])
Out[785]: 
  (A, B) variable_0 variable_1  value
0      a          A          C      1
1      b          A          C      3
2      c          A          C      5
3      a          D          E      2
4      b          D          E      4
5      c          D          E      6
Example 2:
df.columns = pd.MultiIndex.from_tuples([('A','B'), ('D','C'), ('D','E')])
df
Out[790]: 
   A  D   
   B  C  E
0  a  1  2
1  b  3  4
2  c  5  6
df.melt(id_vars=[('A')], col_level=0)
Out[791]: 
   A variable  value
0  a        D      1
1  b        D      3
2  c        D      5
3  a        D      2
4  b        D      4
5  c        D      6

Comments