Data Aggregation and Grouping [01]: GroupBy Method
Published:
Group operations involves three stages:
- split: object is split into groups based on one or more keys
- apply: a function is applied to each group, producing a new value
- combine: results of are combined into a result object
Basics
In pandas, we use DataFrame.group()
to realize the three steps.
DataFrame.groupby(by=None,
axis=0,
level=None,
as_index=True,
sort=True,
group_keys=True,
squeeze=NoDefault.no_default,
observed=False, dropna=True)
Parameters:
by
: mapping, function, label, or list of labels- Used to determine the groups for the groupby.
- If
by
is a function, it’s called on each value of the object’s index. - If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups.
- If a list or ndarray of length equal to the selected axis is passed, the values are used to determine the groups.
- Id a list of values indicating column names is passed, the values of the columns will be used to determine the groups.
- If
- Used to determine the groups for the groupby.
axis
: {0 or ‘index’, 1 or ‘columns’}, default 0level
: int, level name, or sequence of such, default None- If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
as_index
: bool, default True- For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
sort
: bool, default True- Sort group keys.
group_keys
: bool, default True- When calling apply, add group keys to index to identify pieces.
observed
: bool, default False- This only applies if any of the groupers are Categoricals.
- If True: only show observed values for categorical groupers.
- If False: show all values for categorical groupers.
- This only applies if any of the groupers are Categoricals.
dropna
: bool, default True- If True, and if group keys contain NA values, NA values together with row/column will be dropped.
- If False, NA values will also be treated as the key in groups.
df = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
grouped = df.groupby("A")
# check the size of the group, there 3 for 'bar' and 5 for 'foo'
grouped.size()
Out[26]:
A
bar 3
foo 5
dtype: int64
# get the mean of 'C' and 'D'
grouped.mean()
Out[27]:
C D
A
bar 0.174952 0.025694
foo 0.134058 0.564041
# We can also split the DataFrame by colum
def get_letter_type(letter):
if letter.lower() in 'aeiou':
return 'vowel'
else:
return 'consonant'
# assign 'A' to the first group and 'B', 'C', 'D' to another group
grouped = df.groupby(get_letter_type, axis=1)
grouped.size()
Out[138]:
consonant 3
vowel 1
dtype: int64
# print
# every element in the GroupBy object is a tuple,
# where the first element is the group name, and the second
# is the group values
for i in grouped:
print(i[0])
print(i[1])
# This is the same as
for name, group in grouped:
print(name)
print(group)
consonant
B C D
0 one 0.697843 0.732219
1 one -0.945606 -0.319682
2 two 0.058214 -0.622784
3 three 0.228985 0.951989
4 two -0.486862 0.886640
5 two -0.168847 -1.489620
6 one 1.976851 -0.322967
7 three -1.314112 -1.147206
vowel
A
0 foo
1 bar
2 foo
3 bar
4 foo
5 bar
6 foo
7 foo
GroupBy Object Attributes
BroupBy object has a lot of attributes, which can be referenced to this page.
Let’s review the above print example using attribute GroupBy.groups
, which is a dict whose keys are the group names and corresponding values being the axis labels belonging to each group.
df.groupby(get_letter_type, axis=1).groups
df.groupby(get_letter_type, axis=1).groups
Out[153]: {'consonant': ['B', 'C', 'D'], 'vowel': ['A']}
Attritbute GroupBy.indices
is similar to GroupBy.groups
, with its values being the indices instead of axis labels.
df.groupby(get_letter_type, axis=1).indices
Out[160]: {'consonant': array([1, 2, 3], dtype=int64), 'vowel': array([0], dtype=int64)}
Finally, attribute GroupBy.get_group()
can be used to retrive the group values.
df.groupby(get_letter_type, axis=1).get_group('vowel')
Out[163]:
A
0 foo
1 bar
2 foo
3 bar
4 foo
5 bar
6 foo
7 foo
df.groupby(get_letter_type, axis=1).get_group('consonant')
Out[165]:
B C D
0 one 0.697843 0.732219
1 one -0.945606 -0.319682
2 two 0.058214 -0.622784
3 three 0.228985 0.951989
4 two -0.486862 0.886640
5 two -0.168847 -1.489620
6 one 1.976851 -0.322967
7 three -1.314112 -1.147206
Grouping With Index Levels
Grouping By Row
arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"]]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3, 3], "B": np.arange(8)}, index=index)
df
Out[174]:
A B
first second
bar one 1 0
two 1 1
baz one 1 2
two 1 3
foo one 2 4
two 2 5
qux one 3 6
two 3 7
# group by the second index
df.groupby('second').get_group('one')
Out[177]:
A B
first second
bar one 1 0
baz one 1 2
foo one 2 4
qux one 3 6
df.groupby('second').get_group('two')
Out[178]:
A B
first second
bar two 1 1
baz two 1 3
foo two 2 5
qux two 3 7
# Group by index level and columns
df.groupby(['second','A']).groups
Out[181]: {('one', 1): [('bar', 'one'), ('baz', 'one')],
('one', 2): [('foo', 'one')],
('one', 3): [('qux', 'one')],
('two', 1): [('bar', 'two'), ('baz', 'two')],
('two', 2): [('foo', 'two')],
('two', 3): [('qux', 'two')]}
df.groupby(['second','A']).sum()
Out[182]:
B
second A
one 1 2
2 4
3 6
two 1 4
2 5
3 7
# alternately, we can use 'pandas.Grouper()' to mix index and columns
df.groupby([pd.Grouper(level=1),'A']).sum()
Out[197]:
B
second A
one 1 2
2 4
3 6
two 1 4
2 5
3 7
Grouping By Column
arrays = [["bar", "bar", "bar", "foo", "foo"],
["one", "two", "three", "one", "two"]]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
df = pd.DataFrame(np.random.randn(4,5), columns=index)
df.groupby('second', axis = 1).groups
Out[186]: {'one': [('bar', 'one'), ('foo', 'one')],
'three': [('bar', 'three')],
'two': [('bar', 'two'), ('foo', 'two')]}
df.groupby('second', axis = 1).count()
Out[187]:
second one three two
0 2 1 2
1 2 1 2
2 2 1 2
3 2 1 2
Grouping With Dicts/Series/Functions
people = pd.DataFrame(np.random.randn(5, 5),
columns=['a', 'b', 'c', 'd', 'e'],
index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
mapping_c = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f' : 'orange'}
mapping_r = {'Joe': 'A', 'Steve': 'B', 'Wes': 'A', 'Jim': 'B', 'Travis': 'A'}
people.groupby(mapping_c, axis=1).groups
Out[211]: {'blue': ['c', 'd'], 'red': ['a', 'b', 'e']}
people.groupby(mapping_r).groups
Out[212]: {'A': ['Joe', 'Wes', 'Travis'], 'B': ['Steve', 'Jim']}
# The same is with Series
# Using Python functions is a more generic way of defining a group mapping compared with a dict or Series.
people.groupby(len).groups
Out[214]: {3: ['Joe', 'Wes', 'Jim'], 5: ['Steve'], 6: ['Travis']}
# Combination of function and list
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).groups
Out[216]: {(3, 'one'): ['Joe', 'Wes'],
(3, 'two'): ['Jim'],
(5, 'one'): ['Steve'],
(6, 'two'): ['Travis']}
Comments