Data Aggregation and Grouping [01]: GroupBy Method

6 minute read

Published: February 10, 2022

Group operations involves three stages:

split: object is split into groups based on one or more keys
apply: a function is applied to each group, producing a new value
combine: results of are combined into a result object

Basics

In pandas, we use DataFrame.group() to realize the three steps.

DataFrame.groupby(by=None, 
                  axis=0, 
                  level=None, 
                  as_index=True, 
                  sort=True, 
                  group_keys=True, 
                  squeeze=NoDefault.no_default, 
                  observed=False, dropna=True)

Parameters:

by: mapping, function, label, or list of labels
- Used to determine the groups for the groupby.
  - If by is a function, it’s called on each value of the object’s index.
  - If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups.
  - If a list or ndarray of length equal to the selected axis is passed, the values are used to determine the groups.
  - Id a list of values indicating column names is passed, the values of the columns will be used to determine the groups.
axis: {0 or ‘index’, 1 or ‘columns’}, default 0
level: int, level name, or sequence of such, default None
- If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
as_index: bool, default True
- For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
sort: bool, default True
- Sort group keys.
group_keys: bool, default True
- When calling apply, add group keys to index to identify pieces.
observed: bool, default False
- This only applies if any of the groupers are Categoricals.
  - If True: only show observed values for categorical groupers.
  - If False: show all values for categorical groupers.
dropna: bool, default True
- If True, and if group keys contain NA values, NA values together with row/column will be dropped.
- If False, NA values will also be treated as the key in groups.

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

grouped = df.groupby("A")

# check the size of the group, there 3 for 'bar' and 5 for 'foo'
grouped.size()
Out[26]: 
A
bar    3
foo    5
dtype: int64

# get the mean of 'C' and 'D'
grouped.mean()
Out[27]: 
            C         D
A                      
bar  0.174952  0.025694
foo  0.134058  0.564041

# We can also split the DataFrame by colum
def get_letter_type(letter):
    if letter.lower() in 'aeiou':
        return 'vowel'
    else:
        return 'consonant'

# assign 'A' to the first group and 'B', 'C', 'D' to another group
grouped = df.groupby(get_letter_type, axis=1)
grouped.size()
Out[138]: 
consonant    3
vowel        1
dtype: int64

# print
# every element in the GroupBy object is a tuple, 
# where the first element is the group name, and the second
# is the group values
for i in grouped:
    print(i[0])
    print(i[1])

# This is the same as
for name, group in grouped:
    print(name)
    print(group)

consonant
       B         C         D
0    one  0.697843  0.732219
1    one -0.945606 -0.319682
2    two  0.058214 -0.622784
3  three  0.228985  0.951989
4    two -0.486862  0.886640
5    two -0.168847 -1.489620
6    one  1.976851 -0.322967
7  three -1.314112 -1.147206
vowel
     A
0  foo
1  bar
2  foo
3  bar
4  foo
5  bar
6  foo
7  foo

GroupBy Object Attributes

BroupBy object has a lot of attributes, which can be referenced to this page.

Let’s review the above print example using attribute GroupBy.groups, which is a dict whose keys are the group names and corresponding values being the axis labels belonging to each group.

df.groupby(get_letter_type, axis=1).groups

df.groupby(get_letter_type, axis=1).groups
Out[153]: {'consonant': ['B', 'C', 'D'], 'vowel': ['A']}

Attritbute GroupBy.indices is similar to GroupBy.groups, with its values being the indices instead of axis labels.

df.groupby(get_letter_type, axis=1).indices
Out[160]: {'consonant': array([1, 2, 3], dtype=int64), 'vowel': array([0], dtype=int64)}

Finally, attribute GroupBy.get_group() can be used to retrive the group values.

df.groupby(get_letter_type, axis=1).get_group('vowel')
Out[163]: 
     A
foo
bar
foo
bar
foo
bar
foo
foo

df.groupby(get_letter_type, axis=1).get_group('consonant')
Out[165]: 
       B         C         D
  one  0.697843  0.732219
  one -0.945606 -0.319682
  two  0.058214 -0.622784
three  0.228985  0.951989
  two -0.486862  0.886640
  two -0.168847 -1.489620
  one  1.976851 -0.322967
three -1.314112 -1.147206

Grouping With Index Levels

Grouping By Row

arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
          ["one", "two", "one", "two", "one", "two", "one", "two"]]

index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3, 3], "B": np.arange(8)}, index=index)

df
Out[174]: 
              A  B
first second      
bar   one     1  0
      two     1  1
baz   one     1  2
      two     1  3
foo   one     2  4
      two     2  5
qux   one     3  6
      two     3  7

# group by the second index
df.groupby('second').get_group('one')
Out[177]: 
              A  B
first second      
bar   one     1  0
baz   one     1  2
foo   one     2  4
qux   one     3  6

df.groupby('second').get_group('two')
Out[178]: 
              A  B
first second      
bar   two     1  1
baz   two     1  3
foo   two     2  5
qux   two     3  7

# Group by index level and columns
df.groupby(['second','A']).groups
Out[181]: {('one', 1): [('bar', 'one'), ('baz', 'one')], 
           ('one', 2): [('foo', 'one')], 
           ('one', 3): [('qux', 'one')], 
           ('two', 1): [('bar', 'two'), ('baz', 'two')], 
           ('two', 2): [('foo', 'two')], 
           ('two', 3): [('qux', 'two')]}
           
df.groupby(['second','A']).sum()
Out[182]: 
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7

# alternately, we can use 'pandas.Grouper()' to mix index and columns
df.groupby([pd.Grouper(level=1),'A']).sum()
Out[197]: 
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7

Grouping By Column

arrays = [["bar", "bar", "bar", "foo", "foo"],
          ["one", "two", "three", "one", "two"]]

index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

df = pd.DataFrame(np.random.randn(4,5), columns=index)

df.groupby('second', axis = 1).groups
Out[186]: {'one': [('bar', 'one'), ('foo', 'one')], 
           'three': [('bar', 'three')], 
           'two': [('bar', 'two'), ('foo', 'two')]}
           
df.groupby('second', axis = 1).count()
Out[187]: 
second  one  three  two
0         2      1    2
1         2      1    2
2         2      1    2
3         2      1    2

Grouping With Dicts/Series/Functions

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
mapping_c = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f' : 'orange'}
mapping_r = {'Joe': 'A', 'Steve': 'B', 'Wes': 'A', 'Jim': 'B', 'Travis': 'A'}

people.groupby(mapping_c, axis=1).groups
Out[211]: {'blue': ['c', 'd'], 'red': ['a', 'b', 'e']}

people.groupby(mapping_r).groups
Out[212]: {'A': ['Joe', 'Wes', 'Travis'], 'B': ['Steve', 'Jim']}

# The same is with Series

# Using Python functions is a more generic way of defining a group mapping compared with a dict or Series.
people.groupby(len).groups
Out[214]: {3: ['Joe', 'Wes', 'Jim'], 5: ['Steve'], 6: ['Travis']}

# Combination of function and list
key_list = ['one', 'one', 'one', 'two', 'two']

people.groupby([len, key_list]).groups
Out[216]: {(3, 'one'): ['Joe', 'Wes'], 
           (3, 'two'): ['Jim'], 
           (5, 'one'): ['Steve'], 
           (6, 'two'): ['Travis']}

Twitter Facebook LinkedIn

Chao Huang