Data Wrangling [03]: Reshaping and Pivot Tables

6 minute read

Published:

Several functions are useful for reshaping and pivoting the tables:

  • DataFrame.stack(): Stack the prescribed level(s) from columns to index.
  • DataFrame.unstack(): Pivot a level of the (necessarily hierarchical) index labels.
  • DataFrame.pivot(): Return reshaped DataFrame organized by given index / column values.
  • DataFrame.melt(): Unpivot a DataFrame from wide to long format.
  • DataFrame.explode(): Transform each element of a list-like to a row, replicating index values.

Reshaping By stack() & unstack()

DataFrame.stack(level=- 1, dropna=True)

Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame. The new inner-most levels are created by pivoting the columns of the current dataframe.

DataFrame.unstack(level=- 1, fill_value=None)

Returns a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels.

Single Level Columns

df = pd.DataFrame([['Tom','M'], ['Jack', 'M'], ['Susan', 'F']], columns = ['Name', 'Gender'], index = ['A','B','C'])

df
Out[710]: 
    Name Gender
A    Tom      M
B   Jack      M
C  Susan      F

dfStacked = df.stack()

dfStacked
Out[713]: 
A  Name        Tom
   Gender        M
B  Name       Jack
   Gender        M
C  Name      Susan
   Gender        F
dtype: object

# the result is a Series
type(dfStacked)
Out[714]: pandas.core.series.Series

# unstack the Series results in a DataFrame
dfStacked.unstack()
Out[715]: 
    Name Gender
A    Tom      M
B   Jack      M
C  Susan      F

Multiple Levels Columns: Simple

df = pd.DataFrame([['Tom', 'Ford'], ['Jack', 'Browne'], ['Susan', 'Sontag']], 
                  columns = pd.MultiIndex.from_tuples([('Name', 'F. Name'), ('Name', 'S. Name')]), 
                  index = ['A','B','C'])

df
Out[731]: 
     Name        
  F. Name S. Name
A     Tom    Ford
B    Jack  Browne
C   Susan  Sontag

df.stack()
Out[732]: 
             Name
A F. Name     Tom
  S. Name    Ford
B F. Name    Jack
  S. Name  Browne
C F. Name   Susan
  S. Name  Sontag

df.stack().unstack()
Out[733]: 
     Name        
  F. Name S. Name
A     Tom    Ford
B    Jack  Browne
C   Susan  Sontag

Multiple Levels Columns: Missing Values

df = pd.DataFrame([['Tom', 'Ford', 'M'], ['Jack', 'Browne', 'M'], ['Susan', 'Sontag', 'F']], 
                   columns = pd.MultiIndex.from_tuples([('Name', 'F. Name'), ('Name', 'S. Name'), ('Other','Gender')]), 
                   index = ['A','B','C'])

# Simple stack will result in missing values
df.stack()
Out[735]: 
             Name Other
A F. Name     Tom   NaN
  Gender      NaN     M
  S. Name    Ford   NaN
B F. Name    Jack   NaN
  Gender      NaN     M
  S. Name  Browne   NaN
C F. Name   Susan   NaN
  Gender      NaN     F
  S. Name  Sontag   NaN

# In this case, we can designate the levels we want to stack
df.stack([0,1])
Out[738]: 
A  Name   F. Name       Tom
          S. Name      Ford
   Other  Gender          M
B  Name   F. Name      Jack
          S. Name    Browne
   Other  Gender          M
C  Name   F. Name     Susan
          S. Name    Sontag
   Other  Gender          F
dtype: object

# The same is with unstack()
df.stack([0,1]).unstack([1,2])
Out[741]: 
     Name          Other
  F. Name S. Name Gender
A     Tom    Ford      M
B    Jack  Browne      M
C   Susan  Sontag      F

Reshaping By pivot()

DataFrame.pivot(index=None, columns=None, values=None)

Reshape data based on column values. Uses unique values from specified index / columns to form axes of the resulting DataFrame. This function does not support data aggregation, multiple values will result in a MultiIndex in the columns.

Parameters:

  • index: str or object or a list of str, optional
    • Column to use to make new frame’s index. If None, uses existing index.
  • columns: str or object or a list of str
    • Column to use to make new frame’s columns.
  • values: str, object or a list of the previous, optional
    • Column(s) to use for populating new frame’s values. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns.

Simple Pivoting

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})

# choose 'foo' as index and 'bar' as columns
df.pivot(index='foo', columns='bar')
Out[743]: 
    baz       zoo      
bar   A  B  C   A  B  C
foo                    
one   1  2  3   x  y  z
two   4  5  6   q  w  t

# choose 'baz' as values
df.pivot(index='foo', columns='bar', values='baz')
Out[748]: 
bar  A  B  C
foo         
one  1  2  3
two  4  5  6

List of Column/Index Names

df = pd.DataFrame({
       "lev1": [1, 1, 1, 2, 2, 2],
       "lev2": [1, 1, 2, 1, 1, 2],
       "lev3": [1, 2, 1, 2, 1, 2],
       "lev4": [1, 2, 3, 4, 5, 6],
       "values": [0, 1, 2, 3, 4, 5]})

# list of column names
df.pivot(index="lev1", columns=["lev2", "lev3"],values="values")
Out[750]: 
lev2    1         2     
lev3    1    2    1    2
lev1                    
1     0.0  1.0  2.0  NaN
2     4.0  3.0  NaN  5.0

# list of index names
df.pivot(index=["lev1",'lev2'], columns=["lev3"],values="values")
Out[753]: 
lev3         1    2
lev1 lev2          
1    1     0.0  1.0
     2     2.0  NaN
2    1     4.0  3.0
     2     NaN  5.0

# list of column and stack names
df.pivot(index=["lev1",'lev2'], columns=["lev3", 'lev4'],values="values")
Out[754]: 
lev3         1    2    1    2    1    2
lev4         1    2    3    4    5    6
lev1 lev2                              
1    1     0.0  1.0  NaN  NaN  NaN  NaN
     2     NaN  NaN  2.0  NaN  NaN  NaN
2    1     NaN  NaN  NaN  3.0  4.0  NaN
     2     NaN  NaN  NaN  NaN  NaN  5.0

Reshaping By melt()

DataFrame.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)

Parameters:

  • id_vars: tuple, list, or ndarray, optional
    • Column(s) to use as identifier variables.
  • value_vars: tuple, list, or ndarray, optional
    • Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
  • var_name: scalar
    • Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
  • value_name: scalar, default ‘value’
    • Name to use for the ‘value’ column.
  • col_level: int or str, optional
    • If columns are a MultiIndex then use this level to melt.
  • ignore_index: bool, default True
    • If True, original index is ignored. If False, the original index is retained. Index labels will be repeated as necessary.

Simple Case

df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
                   
df.melt(id_vars=['A'])
Out[756]: 
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
3  a        C      2
4  b        C      4
5  c        C      6                   

Multi-index Columns

Example 1:

df.columns = pd.MultiIndex.from_tuples([('A','B'), ('A','C'), ('D','E')])

df
Out[780]: 
   A     D
   B  C  E
0  a  1  2
1  b  3  4
2  c  5  6

df.melt(id_vars=[('A','B')])
Out[785]: 
  (A, B) variable_0 variable_1  value
0      a          A          C      1
1      b          A          C      3
2      c          A          C      5
3      a          D          E      2
4      b          D          E      4
5      c          D          E      6

Example 2:

df.columns = pd.MultiIndex.from_tuples([('A','B'), ('D','C'), ('D','E')])

df
Out[790]: 
   A  D   
   B  C  E
0  a  1  2
1  b  3  4
2  c  5  6

df.melt(id_vars=[('A')], col_level=0)
Out[791]: 
   A variable  value
0  a        D      1
1  b        D      3
2  c        D      5
3  a        D      2
4  b        D      4
5  c        D      6

Comments