Data Wrangling [01]: Multi-Indexing
Published:
Multi-indexing feature provides a way to work with higher dimensional data in a lower dimensional form.
Basics
MultiIndex
DataFrame with multi-index is like excel with multi-headers. There are several ways to construct a MultiIndex object.
# Construct a MultiIndex from tuples
tuples = [(1, 'red'), (1, 'blue'),
(2, 'red'), (2, 'blue')]
index = pd.MultiIndex.from_tuples(tuples, names=('number', 'color'))
# Construct a MultiIndex from arrays
arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
idnex = pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
# Construct a MultiIndex from DataFrame
df = pd.DataFrame([[1, 'red'], [1, 'blue'],
[2, 'red'], [2, 'blue']],
columns=['number', 'color'])
index = pd.MultiIndex.from_frame(df)
# construct a MultiIndex from product
numbers = [1, 2]
colors = ['red', 'blue']
pd.MultiIndex.from_product([numbers, colors], names=['number', 'color'])
# Results
MultiIndex([(1, 'red'),
(1, 'blue'),
(2, 'red'),
(2, 'blue')],
names=['number', 'color'])
DataFrame with Multiple Index Levels
Let’s construct a DataFrame with multiple index levels.
key1 = ['a', 'b']
key2 = [1, 2]
Index = pd.MultiIndex.from_product([key1, key2], names=['key1', 'key2'])
state = ['Ohio', 'Ohio', 'Colorado']
color = ['Green', 'Red', 'Green']
Column = pd.MultiIndex.from_arrays([state, color], names=('state', 'color'))
frame = pd.DataFrame(np.arange(12).reshape((4, 3)), index = Index, columns = Column)
frame
Out[467]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
MultiIndex Indexing - Row
frame.loc['a']
Out[468]:
state Ohio Colorado
color Green Red Green
key2
1 0 1 2
2 3 4 5
# using tuple to index the inner levels
frame.loc[('a',1)]
Out[471]:
state color
Ohio Green 0
Red 1
Colorado Green 2
Name: (a, 1), dtype: int32
# use [[]] to return a DataFrame
frame.loc[[('a',1)]]
Out[472]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
MultiIndex Indexing - Column
frame['Ohio']
Out[473]:
color Green Red
key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10
# frame['Ohio'] is the same as frame.loc[:,'Ohio']
frame.loc[:,'Ohio']
Out[475]:
color Green Red
key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10
# index the inner column label
frame['Ohio']['Red']
Out[474]:
key1 key2
a 1 1
2 4
b 1 7
2 10
Name: Red, dtype: int32
frame.loc[:,'Ohio']['Red']
Out[486]:
key1 key2
a 1 1
2 4
b 1 7
2 10
Name: Red, dtype: int32
frame.loc[:,'Ohio'].loc[:,'Red']
Out[487]:
key1 key2
a 1 1
2 4
b 1 7
2 10
Name: Red, dtype: int32
frame['Ohio'].loc[:,'Red']
Out[488]:
key1 key2
a 1 1
2 4
b 1 7
2 10
Name: Red, dtype: int32
MultiIndex Indexing - Row & Column
frame.loc['a', 'Ohio']
Out[489]:
color Green Red
key2
1 0 1
2 3 4
frame.loc[('a',1), 'Ohio']
Out[490]:
color
Green 0
Red 1
Name: (a, 1), dtype: int32
frame.loc[[('a',1)], 'Ohio']
Out[493]:
color Green Red
key1 key2
a 1 0 1
frame.loc[('a',1),('Ohio','Red')]
Out[513]: 1
# indexing a specific inner index
frame.loc[(['a','b'],1),('Ohio','Red')]
Out[514]:
key1 key2
a 1 1
b 1 7
Name: (Ohio, Red), dtype: int32
# indexing a specific inner index and column label
frame.loc[(['a','b'],1),(['Ohio','Colorado'],'Green')]
Out[515]:
state Ohio Colorado
color Green Green
key1 key2
a 1 0 2
b 1 6 8
# the same as above
frame.loc[(key1,1),(state,'Green')]
Out[531]:
state Ohio Colorado
color Green Green
key1 key2
a 1 0 2
b 1 6 8
Reordering and Sorting by Levels
swaplevel()
DataFrame.swaplevel(i=- 2, j=- 1, axis=0)
Parameters:
i
,j
: int or str- Levels of the indices to be swapped. Can pass level name as string.
axis
: {0 or ‘index’, 1 or ‘columns’}, default 0- The axis to swap levels on. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
# swap 'key1' and 'key2'
frame.swaplevel(0,1,axis=0)
Out[532]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
# swap 'state' and 'color'
frame.swaplevel(0,1,axis=1)
Out[533]:
color Green Red Green
state Ohio Ohio Colorado
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
# swap both
frame.swaplevel(0,1,axis=0).swaplevel(0,1,axis=1)
Out[534]:
color Green Red Green
state Ohio Ohio Colorado
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
sort_index()
DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)
frame.sort_index(level=1)
Out[537]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11
frame.sort_index(level=1,ascending=False)
Out[538]:
state Ohio Colorado
color Green Red Green
key1 key2
b 2 9 10 11
a 2 3 4 5
b 1 6 7 8
a 1 0 1 2
frame.sort_index(axis = 1)
Out[540]:
state Colorado Ohio
color Green Green Red
key1 key2
a 1 2 0 1
2 5 3 4
b 1 8 6 7
2 11 9 10
frame.sort_index(axis = 1, level=1, ascending=False)
Out[541]:
state Ohio Colorado
color Red Green Green
key1 key2
a 1 1 0 2
2 4 3 5
b 1 7 6 8
2 10 9 11
sort_values()
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
frame.sort_values(('Ohio','Red'), ascending=False)
Out[545]:
state Ohio Colorado
color Green Red Green
key1 key2
b 2 9 10 11
1 6 7 8
a 2 3 4 5
1 0 1 2
frame.sort_values(('a',1), axis = 1, ascending=False)
Out[548]:
state Colorado Ohio
color Green Red Green
key1 key2
a 1 2 1 0
2 5 4 3
b 1 8 7 6
2 11 10 9
Summary Statistics by Level
Many descriptive and summary statistics on DataFrame and Series have a level option in which we can specify the level we want to aggregate by on a particular axis. Take the sum()
function as an example:
frame.sum()
Out[556]:
state color
Ohio Green 18
Red 22
Colorado Green 26
dtype: int64
# frame.sum(level = 1) is also applicable with the warning below.
# __main__:1: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead.
# df.sum(level=1) should use df.groupby(level=1).sum().
frame.groupby(level=0).sum()
Out[563]:
state Ohio Colorado
color Green Red Green
key1
a 3 5 7
b 15 17 19
frame.groupby(level=1).sum()
Out[564]:
state Ohio Colorado
color Green Red Green
key2
1 6 8 10
2 12 14 16
frame.groupby(axis = 1, level=0).sum()
Out[565]:
state Colorado Ohio
key1 key2
a 1 2 1
2 5 7
b 1 8 13
2 11 19
frame.groupby(axis = 1, level=1).sum()
Out[566]:
color Green Red
key1 key2
a 1 2 1
2 8 4
b 1 14 7
2 20 10
Comments