Data Wrangling [01]: Multi-Indexing

6 minute read

Published:

Multi-indexing feature provides a way to work with higher dimensional data in a lower dimensional form.


Basics

MultiIndex

DataFrame with multi-index is like excel with multi-headers. There are several ways to construct a MultiIndex object.

# Construct a MultiIndex from tuples
tuples = [(1, 'red'), (1, 'blue'),
          (2, 'red'), (2, 'blue')]
index = pd.MultiIndex.from_tuples(tuples, names=('number', 'color'))

# Construct a MultiIndex from arrays
arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
idnex = pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))

# Construct a MultiIndex from DataFrame
df = pd.DataFrame([[1, 'red'], [1, 'blue'],
                   [2, 'red'], [2, 'blue']],
                  columns=['number', 'color'])
index = pd.MultiIndex.from_frame(df)

# construct a MultiIndex from product
numbers = [1, 2]
colors = ['red', 'blue']
pd.MultiIndex.from_product([numbers, colors], names=['number', 'color'])

# Results
MultiIndex([(1,  'red'),
            (1, 'blue'),
            (2,  'red'),
            (2, 'blue')],
           names=['number', 'color'])

DataFrame with Multiple Index Levels

Let’s construct a DataFrame with multiple index levels.

key1 = ['a', 'b']
key2 = [1, 2]
Index = pd.MultiIndex.from_product([key1, key2], names=['key1', 'key2'])

state = ['Ohio', 'Ohio', 'Colorado']
color = ['Green', 'Red', 'Green']
Column = pd.MultiIndex.from_arrays([state, color], names=('state', 'color'))

 frame = pd.DataFrame(np.arange(12).reshape((4, 3)), index = Index, columns = Column)
 
 frame
 Out[467]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

MultiIndex Indexing - Row

frame.loc['a']
Out[468]: 
state  Ohio     Colorado
color Green Red    Green
key2                    
1         0   1        2
2         3   4        5

# using tuple to index the inner levels
frame.loc[('a',1)]
Out[471]: 
state     color
Ohio      Green    0
          Red      1
Colorado  Green    2
Name: (a, 1), dtype: int32

# use [[]] to return a DataFrame
frame.loc[[('a',1)]]
Out[472]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
a    1        0   1        2

MultiIndex Indexing - Column

frame['Ohio']
Out[473]: 
color      Green  Red
key1 key2            
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10

# frame['Ohio'] is the same as frame.loc[:,'Ohio']
frame.loc[:,'Ohio']
Out[475]: 
color      Green  Red
key1 key2            
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10

# index the inner column label
frame['Ohio']['Red']
Out[474]: 
key1  key2
a     1        1
      2        4
b     1        7
      2       10
Name: Red, dtype: int32

frame.loc[:,'Ohio']['Red']
Out[486]: 
key1  key2
a     1        1
      2        4
b     1        7
      2       10
Name: Red, dtype: int32

frame.loc[:,'Ohio'].loc[:,'Red']
Out[487]: 
key1  key2
a     1        1
      2        4
b     1        7
      2       10
Name: Red, dtype: int32

frame['Ohio'].loc[:,'Red']
Out[488]: 
key1  key2
a     1        1
      2        4
b     1        7
      2       10
Name: Red, dtype: int32

MultiIndex Indexing - Row & Column

frame.loc['a', 'Ohio']
Out[489]: 
color  Green  Red
key2             
1          0    1
2          3    4

frame.loc[('a',1), 'Ohio']
Out[490]: 
color
Green    0
Red      1
Name: (a, 1), dtype: int32

frame.loc[[('a',1)], 'Ohio']
Out[493]: 
color      Green  Red
key1 key2            
a    1         0    1

frame.loc[('a',1),('Ohio','Red')]
Out[513]: 1

# indexing a specific inner index
frame.loc[(['a','b'],1),('Ohio','Red')]
Out[514]: 
key1  key2
a     1       1
b     1       7
Name: (Ohio, Red), dtype: int32

# indexing a specific inner index and column label
frame.loc[(['a','b'],1),(['Ohio','Colorado'],'Green')]
Out[515]: 
state      Ohio Colorado
color     Green    Green
key1 key2               
a    1        0        2
b    1        6        8

# the same as above
frame.loc[(key1,1),(state,'Green')]
Out[531]: 
state      Ohio Colorado
color     Green    Green
key1 key2               
a    1        0        2
b    1        6        8

Reordering and Sorting by Levels

swaplevel()

DataFrame.swaplevel(i=- 2, j=- 1, axis=0)

Parameters:

  • i, j: int or str
    • Levels of the indices to be swapped. Can pass level name as string.
  • axis: {0 or ‘index’, 1 or ‘columns’}, default 0
    • The axis to swap levels on. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
# swap 'key1' and 'key2'
frame.swaplevel(0,1,axis=0)
Out[532]: 
state      Ohio     Colorado
color     Green Red    Green
key2 key1                   
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11

# swap 'state' and 'color'
frame.swaplevel(0,1,axis=1)
Out[533]: 
color     Green  Red    Green
state      Ohio Ohio Colorado
key1 key2                    
a    1        0    1        2
     2        3    4        5
b    1        6    7        8
     2        9   10       11
     
# swap both
frame.swaplevel(0,1,axis=0).swaplevel(0,1,axis=1)
Out[534]: 
color     Green  Red    Green
state      Ohio Ohio Colorado
key2 key1                    
1    a        0    1        2
2    a        3    4        5
1    b        6    7        8
2    b        9   10       11

sort_index()

DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)
frame.sort_index(level=1)
Out[537]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11

frame.sort_index(level=1,ascending=False)
Out[538]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
b    2        9  10       11
a    2        3   4        5
b    1        6   7        8
a    1        0   1        2

frame.sort_index(axis = 1)
Out[540]: 
state     Colorado  Ohio    
color        Green Green Red
key1 key2                   
a    1           2     0   1
     2           5     3   4
b    1           8     6   7
     2          11     9  10

frame.sort_index(axis = 1, level=1, ascending=False)
Out[541]: 
state     Ohio       Colorado
color      Red Green    Green
key1 key2                    
a    1       1     0        2
     2       4     3        5
b    1       7     6        8
     2      10     9       11

sort_values()

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
frame.sort_values(('Ohio','Red'), ascending=False)
Out[545]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
b    2        9  10       11
     1        6   7        8
a    2        3   4        5
     1        0   1        2
     
frame.sort_values(('a',1), axis = 1, ascending=False)
Out[548]: 
state     Colorado Ohio      
color        Green  Red Green
key1 key2                    
a    1           2    1     0
     2           5    4     3
b    1           8    7     6
     2          11   10     9

Summary Statistics by Level

Many descriptive and summary statistics on DataFrame and Series have a level option in which we can specify the level we want to aggregate by on a particular axis. Take the sum() function as an example:

frame.sum()
Out[556]: 
state     color
Ohio      Green    18
          Red      22
Colorado  Green    26
dtype: int64

# frame.sum(level = 1) is also applicable with the warning below.
# __main__:1: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. 
# df.sum(level=1) should use df.groupby(level=1).sum().

frame.groupby(level=0).sum()
Out[563]: 
state  Ohio     Colorado
color Green Red    Green
key1                    
a         3   5        7
b        15  17       19

frame.groupby(level=1).sum()
Out[564]: 
state  Ohio     Colorado
color Green Red    Green
key2                    
1         6   8       10
2        12  14       16


frame.groupby(axis = 1, level=0).sum()
Out[565]: 
state      Colorado  Ohio
key1 key2                
a    1            2     1
     2            5     7
b    1            8    13
     2           11    19

frame.groupby(axis = 1, level=1).sum()
Out[566]: 
color      Green  Red
key1 key2            
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10

Comments