Data Aggregation and Grouping [02]: Data Aggregation
Published:
Aggregations refer to any data transformation that produces scalar values from arrays.
Basics
Here are some commonly used aggregation functions:
count
: Number of non-NA values in the groupsum
: Sum of non-NA valuesmean
: Mean of non-NA valuesmedian
: Arithmetic median of non-NA valuesstd
,var
: Unbiased (n – 1 denominator) standard deviation and variancemin
,max
: Minimum and maximum of non-NA valuesprod
: Product of non-NA valuesfirst
,last
: First and last non-NA valuesquantile()
: Quantiles of the Series
Multiple Function Application
GroupBy.agg()
or DataFrame.agg()
.
DataFrame.agg(func=None, axis=0, *args, **kwargs)
Parameters:
func
: function, str, list or dict- Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. Accepted combinations are:
- function
- string function name
- list of functions and/or function names, e.g. [np.sum, ‘mean’]
- dict of axis labels -> functions, function names or list of such.
- Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. Accepted combinations are:
axis
: {0 or ‘index’, 1 or ‘columns’}, default 0
df = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
df
Out[228]:
A B C D
0 foo one -1.186263 0.576306
1 bar one 0.053118 -0.349163
2 foo two 2.884776 0.716290
3 bar three -1.523466 -0.191066
4 foo two -1.284458 0.718828
5 bar two 0.371957 -0.322206
6 foo one -1.282876 -0.682011
7 foo three 0.214550 0.453315
# aggregate by list of functions
df.groupby(['A','B'])['C'].agg(['min','max','sum'])
Out[235]:
min max sum
A B
bar one 0.053118 0.053118 0.053118
three -1.523466 -1.523466 -1.523466
two 0.371957 0.371957 0.371957
foo one -1.282876 -1.186263 -2.469139
three 0.214550 0.214550 0.214550
two -1.284458 2.884776 1.600318
# different functions for different columns
df.groupby(['A','B']).agg({'C':['min','max'], 'D':['mean','sum']})
Out[239]:
C D
min max mean sum
A B
bar one 0.053118 0.053118 -0.349163 -0.349163
three -1.523466 -1.523466 -0.191066 -0.191066
two 0.371957 0.371957 -0.322206 -0.322206
foo one -1.282876 -1.186263 -0.052853 -0.105705
three 0.214550 0.214550 0.453315 0.453315
two -1.284458 2.884776 0.717559 1.435118
# rename the index using bracket '()'
df.groupby(['A','B'],as_index=False).agg({'C':[('mi','min'),('ma','max')], 'D':[('me','mean'),('su','sum')]})
Out[241]:
A B C D
mi ma me su
0 bar one 0.053118 0.053118 -0.349163 -0.349163
1 bar three -1.523466 -1.523466 -0.191066 -0.191066
2 bar two 0.371957 0.371957 -0.322206 -0.322206
3 foo one -1.282876 -1.186263 -0.052853 -0.105705
4 foo three 0.214550 0.214550 0.453315 0.453315
5 foo two -1.284458 2.884776 0.717559 1.435118
Aggregate and Grouping Using apply()
Apply function func group-wise and combine the results together.
GroupBy.apply(func, *args, **kwargs)
The function passed to apply must take a dataframe as its first argument and return a DataFrame, Series or scalar. apply will then take care of combining the results back together into a single dataframe or series.
# return a DataFrame
df = pd.DataFrame({'A': 'a a b'.split(),
'B': [1,2,3],
'C': [4,6,5]})
g = df.groupby('A')
g[['B', 'C']].apply(lambda x: x / x.sum())
Out[251]:
B C
0 0.333333 0.4
1 0.666667 0.6
2 1.000000 1.0
# return Series, and combine to DataFrame
g[['B', 'C']].apply(lambda x: x.max() - x.min())
Out[252]:
B C
A
a 1 2
b 0 0
# return Scalar, and combine to Series
g.apply(lambda x: x.C.max() - x.B.min())
Out[253]:
A
a 5
b 2
dtype: int64
User-Defined Function: Examples
Group Filling Missing Values
# fill with group mean
fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
# fill with a pre-defined value
fill_values = {'G1': 0.5, 'G2': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
Group Weighted Average
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
grouped.apply(get_wavg)
# time series data, corr analysis
rets = data.pct_change().dropna() # Calculate pct_change of each value to previous entry in group.
get_year = lambda x: x.year
spx_corr = lambda x: x.corrwith(x['SPX']) # Compute pairwise correlation.
rets.groupby(get_year).apply(spx_corr)
Group Linear Regression
import statsmodels.api as sm
def regress(data, yvar, xvars):
Y = data[yvar]
X = data[xvars]
X['intercept'] = 1.
result = sm.OLS(Y, X).fit()
return result.params
by_year.apply(regress, 'Y', ['X'])
pivot_table()
Pivot table aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns.
DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean',
fill_value=None, margins=False, dropna=True, margins_name='All',
observed=False, sort=True)
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
"bar", "bar", "bar", "bar"],
"B": ["one", "one", "one", "two", "two",
"one", "one", "two", "two"],
"C": ["small", "large", "large", "small",
"small", "large", "small", "small",
"large"],
"D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
"E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
table = df.pivot_table(values='D',
index=['A', 'B'],
columns=['C'],
aggfunc=np.sum)
table
Out[261]:
C large small
A B
bar one 4.0 5.0
two 7.0 6.0
foo one 4.0 1.0
two NaN 6.0
#
table = df.pivot_table(values=['D','E'],
index=['A', 'C'],
aggfunc={'D': np.mean,
'E': [min, max, np.mean]},
fill_value = 0)
table
Out[278]:
D E
mean max mean min
A C
bar large 5.500000 9 7.500000 6
small 5.500000 9 8.500000 8
foo large 2.000000 5 4.500000 4
small 2.333333 6 4.333333 2
crosstab()
Similar to pivot_table()
is crosstab()
, a simple cross tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None,
aggfunc=None, margins=False, margins_name='All',
dropna=True, normalize=False)
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
"bar", "bar", "bar", "bar"],
"B": ["one", "one", "one", "two", "two",
"one", "one", "two", "two"],
"C": ["small", "large", "large", "small",
"small", "large", "small", "small",
"large"],
"D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
"E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
pd.crosstab(df['A'], [df['D']])
Out[302]:
D 1 2 3 4 5 6 7
A
bar 0 0 0 1 1 1 1
foo 1 2 2 0 0 0 0
# list of index
pd.crosstab([df['A'],df['C']], [df['D']], margins=True)
Out[306]:
D 1 2 3 4 5 6 7 All
A C
bar large 0 0 0 1 0 0 1 2
small 0 0 0 0 1 1 0 2
foo large 0 2 0 0 0 0 0 2
small 1 0 2 0 0 0 0 3
All 1 2 2 1 1 1 1 9
# compare with groupby
df.groupby(['A','C'])['C'].count()
Out[305]:
A C
bar large 2
small 2
foo large 2
small 3
Name: C, dtype: int64
Comments