Data Aggregation and Grouping [02]: Data Aggregation


Aggregations refer to any data transformation that produces scalar values from arrays.


Basics

Here are some commonly used aggregation functions:

  • count: Number of non-NA values in the group
  • sum: Sum of non-NA values
  • mean: Mean of non-NA values
  • median: Arithmetic median of non-NA values
  • std, var: Unbiased (n – 1 denominator) standard deviation and variance
  • min, max: Minimum and maximum of non-NA values
  • prod: Product of non-NA values
  • first, last: First and last non-NA values
  • quantile(): Quantiles of the Series
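
As a quick illustration, here is a minimal sketch (with a made-up DataFrame) of calling a few of these optimized methods on a GroupBy object:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'val': [1.0, 2.0, np.nan, 4.0]})
grouped = df.groupby('key')['val']

grouped.count()          # number of non-NA values per group
grouped.sum()            # sum of non-NA values per group
grouped.quantile(0.5)    # 0.5 quantile (median) per group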

Multiple Function Application

Multiple aggregation functions can be applied at once with GroupBy.agg() or DataFrame.agg().

DataFrame.agg(func=None, axis=0, *args, **kwargs)

Parameters:

  • func: function, str, list or dict
    • Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. Accepted combinations are:
      • function
      • string function name
      • list of functions and/or function names, e.g. [np.sum, 'mean']
      • dict of axis labels -> functions, function names or list of such.
  • axis: {0 or 'index', 1 or 'columns'}, default 0

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

df
Out[228]: 
     A      B         C         D
0  foo    one -1.186263  0.576306
1  bar    one  0.053118 -0.349163
2  foo    two  2.884776  0.716290
3  bar  three -1.523466 -0.191066
4  foo    two -1.284458  0.718828
5  bar    two  0.371957 -0.322206
6  foo    one -1.282876 -0.682011
7  foo  three  0.214550  0.453315

# aggregate by list of functions
df.groupby(['A','B'])['C'].agg(['min','max','sum'])
Out[235]: 
                min       max       sum
A   B                                  
bar one    0.053118  0.053118  0.053118
    three -1.523466 -1.523466 -1.523466
    two    0.371957  0.371957  0.371957
foo one   -1.282876 -1.186263 -2.469139
    three  0.214550  0.214550  0.214550
    two   -1.284458  2.884776  1.600318

# different functions for different columns
df.groupby(['A','B']).agg({'C':['min','max'], 'D':['mean','sum']})
Out[239]: 
                  C                   D          
                min       max      mean       sum
A   B                                            
bar one    0.053118  0.053118 -0.349163 -0.349163
    three -1.523466 -1.523466 -0.191066 -0.191066
    two    0.371957  0.371957 -0.322206 -0.322206
foo one   -1.282876 -1.186263 -0.052853 -0.105705
    three  0.214550  0.214550  0.453315  0.453315
    two   -1.284458  2.884776  0.717559  1.435118

# rename the aggregated columns by passing (new_name, function) tuples
df.groupby(['A','B'],as_index=False).agg({'C':[('mi','min'),('ma','max')], 'D':[('me','mean'),('su','sum')]})
Out[241]: 
     A      B         C                   D          
                     mi        ma        me        su
0  bar    one  0.053118  0.053118 -0.349163 -0.349163
1  bar  three -1.523466 -1.523466 -0.191066 -0.191066
2  bar    two  0.371957  0.371957 -0.322206 -0.322206
3  foo    one -1.282876 -1.186263 -0.052853 -0.105705
4  foo  three  0.214550  0.214550  0.453315  0.453315
5  foo    two -1.284458  2.884776  0.717559  1.435118

Aggregation and Grouping Using apply()

Apply function func group-wise and combine the results together.

GroupBy.apply(func, *args, **kwargs)

The function passed to apply must take a dataframe as its first argument and return a DataFrame, Series or scalar. apply will then take care of combining the results back together into a single dataframe or series.

# return a DataFrame
df = pd.DataFrame({'A': 'a a b'.split(),
                   'B': [1,2,3],
                   'C': [4,6,5]})
g = df.groupby('A')

g[['B', 'C']].apply(lambda x: x / x.sum())
Out[251]: 
          B    C
0  0.333333  0.4
1  0.666667  0.6
2  1.000000  1.0

# return Series, and combine to DataFrame
g[['B', 'C']].apply(lambda x: x.max() - x.min())
Out[252]: 
   B  C
A      
a  1  2
b  0  0

# return Scalar, and combine to Series
g.apply(lambda x: x.C.max() - x.B.min())
Out[253]: 
A
a    5
b    2
dtype: int64

User-Defined Function: Examples

Filling Missing Values by Group

# fill with group mean
fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)

# fill with a pre-defined value
fill_values = {'G1': 0.5, 'G2': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
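
The snippets above assume a Series data with missing values and a matching grouping key group_key; a minimal self-contained sketch (names made up for illustration) could look like this:

import numpy as np
import pandas as pd

data = pd.Series(np.random.randn(6),
                 index=['a1', 'a2', 'a3', 'b1', 'b2', 'b3'])
data[['a2', 'b3']] = np.nan                       # introduce missing values
group_key = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']  # one group label per row

# fill each group's NA values with that group's mean
data.groupby(group_key).apply(lambda g: g.fillna(g.mean()))

# fill each group's NA values with a pre-defined value, keyed by group name
fill_values = {'G1': 0.5, 'G2': -1}
data.groupby(group_key).apply(lambda g: g.fillna(fill_values[g.name]))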

Group Weighted Average

get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
grouped.apply(get_wavg)
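
Here grouped is a GroupBy object over a DataFrame that has 'data' and 'weights' columns; a minimal sketch with hypothetical column names:

import numpy as np
import pandas as pd

df = pd.DataFrame({'category': ['x', 'x', 'y', 'y'],
                   'data':     [1.0, 2.0, 3.0, 4.0],
                   'weights':  [0.2, 0.8, 0.5, 0.5]})
grouped = df.groupby('category')

# weighted average of 'data' within each group, weighted by 'weights'
grouped.apply(lambda g: np.average(g['data'], weights=g['weights']))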

# time series data: correlation analysis by year
rets = data.pct_change().dropna()         # percent change relative to the previous row
get_year = lambda x: x.year
spx_corr = lambda x: x.corrwith(x['SPX']) # pairwise correlation of each column with 'SPX'
rets.groupby(get_year).apply(spx_corr)
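
A runnable sketch of the same idea, using a small hypothetical price table with an 'SPX' column and a DatetimeIndex:

import numpy as np
import pandas as pd

# hypothetical daily closing prices; 'SPX' stands in for the benchmark index
data = pd.DataFrame(50 + np.abs(np.random.randn(500, 3)),
                    columns=['AAPL', 'MSFT', 'SPX'],
                    index=pd.date_range('2015-01-01', periods=500))

rets = data.pct_change().dropna()
rets.groupby(lambda ts: ts.year).apply(lambda x: x.corrwith(x['SPX']))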

Group Linear Regression

import statsmodels.api as sm

def regress(data, yvar, xvars):
    # ordinary least squares regression of yvar on xvars for one group
    Y = data[yvar]
    X = data[xvars].copy()   # copy to avoid modifying the original frame
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params

# extra arguments after the function are passed through to it
by_year.apply(regress, 'Y', ['X'])
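
Here by_year is a GroupBy object and 'Y' / ['X'] are placeholder column names. For example, with the hypothetical returns data from the previous sketch, each year's AAPL returns could be regressed on SPX returns:

by_year = rets.groupby(lambda ts: ts.year)
by_year.apply(regress, 'AAPL', ['SPX'])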

pivot_table()

Pivot table aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns.

DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', 
                      fill_value=None, margins=False, dropna=True, margins_name='All', 
                      observed=False, sort=True)
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                   "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})

table = df.pivot_table(values='D', 
                       index=['A', 'B'],
                       columns=['C'], 
                       aggfunc=np.sum)

table
Out[261]: 
C        large  small
A   B                
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0

# multiple value columns, with a different set of aggregation functions per column
table = df.pivot_table(values=['D','E'], 
                       index=['A', 'C'],
                       aggfunc={'D': np.mean,
                                'E': [min, max, np.mean]},
                       fill_value = 0)
                       
table
Out[278]: 
                  D   E              
               mean max      mean min
A   C                                
bar large  5.500000   9  7.500000   6
    small  5.500000   9  8.500000   8
foo large  2.000000   5  4.500000   4
    small  2.333333   6  4.333333   2

crosstab()

Similar to pivot_table() is crosstab(), a simple cross tabulation of two (or more) factors. By default, it computes a frequency table of the factors unless an array of values and an aggregation function are passed.

pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, 
                aggfunc=None, margins=False, margins_name='All', 
                dropna=True, normalize=False)
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                   "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
   
pd.crosstab(df['A'], [df['D']])
Out[302]: 
D    1  2  3  4  5  6  7
A                       
bar  0  0  0  1  1  1  1
foo  1  2  2  0  0  0  0  

# pass a list of keys for the row index; margins=True adds the 'All' totals
pd.crosstab([df['A'],df['C']], [df['D']], margins=True)
Out[306]: 
D          1  2  3  4  5  6  7  All
A   C                              
bar large  0  0  0  1  0  0  1    2
    small  0  0  0  0  1  1  0    2
foo large  0  2  0  0  0  0  0    2
    small  1  0  2  0  0  0  0    3
All        1  2  2  1  1  1  1    9

# compare with groupby
df.groupby(['A','C'])['C'].count()
Out[305]: 
A    C    
bar  large    2
     small    2
foo  large    2
     small    3
Name: C, dtype: int64
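
When an array of values and an aggregation function are passed, crosstab() aggregates instead of counting. For example, summing D for each A/C combination (using the same df as above):

# sum of D for each combination of A (rows) and C (columns)
pd.crosstab(df['A'], df['C'], values=df['D'], aggfunc='sum')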
