Data Wrangling [02]: Merging DataFrames
Published:
DataFrames can be merged, concatenated or combined in a number of ways, where
DataFrame.merge()
orpandas.merge()
merges DataFrame or named Series objects with a database-style joinDataFrame.join()
join columns of another DataFrame, which is similiar toDataFrame.merge()
DataFrame.update()
modifies in place using non-NA values from another DataFrame.pandas.cancat()
concatenates pandas objects along a particular axis with optional set logic along the other axesDataFrame.combine()
performs column-wise combine with another DataFrameDataFrame.combine_first()
updates null elements with value in the same location in other
merge()
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
Parameters:
right
: DataFrame or named Serieshow
: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’left
: use only keys from left frame, similar to a SQL left outer join; preserve key order.right
: use only keys from right frame, similar to a SQL right outer join; preserve key order.outer
: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.inner
: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.cross
: creates the cartesian product from both frames, preserves the order of the left keys.
on
: label or list, column or index level names to join on.- If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
left_on
: label or list, or array-likeright_on
: label or list, or array-likeleft_index
: bool, default False- Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.
right_index
: bool, default Falsesort
: bool, default Falsesuffixes
: list-like, default is (“_x”, “_y”)copy
: bool, default Trueindicator
: bool or str, default False- If True, adds a column to the output DataFrame called “_merge” with information on the source of each row.
validate
: str, optional- “one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
- “one_to_many” or “1:m”: check if merge keys are unique in left dataset.
- “many_to_one” or “m:1”: check if merge keys are unique in right dataset.
- “many_to_many” or “m:m”: allowed, but does not result in checks.
Simple Merge
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
'value': [2, 3, 4, 5]})
df1
Out[583]:
lkey value
0 foo 1
1 bar 2
2 baz 3
3 foo 5
df2
Out[584]:
rkey value
0 foo 2
1 bar 3
2 baz 4
3 foo 5
# default key is the common columns in both DataFrames, here is `value`
# default is to use intersection of keys from both DataFrames
df1.merge(df2)
Out[585]:
lkey value rkey
0 bar 2 foo
1 baz 3 bar
2 foo 5 foo
# use union of keys from both DataFrames
df1.merge(df2, how = 'outer')
Out[595]:
lkey value rkey
0 foo 1 NaN
1 bar 2 foo
2 baz 3 bar
3 foo 5 foo
4 NaN 4 baz
# use only keys from the left DataFrame
df1.merge(df2, how = 'left')
Out[596]:
lkey value rkey
0 foo 1 NaN
1 bar 2 foo
2 baz 3 bar
3 foo 5 foo
# use only keys from the right DataFrame
df1.merge(df2, how = 'right')
Out[597]:
lkey value rkey
0 bar 2 foo
1 baz 3 bar
2 NaN 4 baz
3 foo 5 foo
# designate keys from the two DataFrames
# foo(2) -> foo(2), bar -> bar, baz -> barz
df1.merge(df2, left_on='lkey', right_on='rkey')
Out[592]:
lkey value_x rkey value_y
0 foo 1 foo 2
1 foo 1 foo 5
2 foo 5 foo 2
3 foo 5 foo 5
4 bar 2 bar 3
5 baz 3 baz 4
# merge the keys columns using `DataFrame.set_index()`
df1.merge(df2.set_index('rkey'), left_on='lkey', right_on='rkey')
Out[617]:
lkey value_x value_y
0 foo 1 2
1 foo 1 5
2 foo 5 2
3 foo 5 5
4 bar 2 3
5 baz 3 4
Merge With Multiple Keys
df1 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'],
'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'],
'value': [2, 3, 4, 5]})
# intersection of keys
df1.merge(df2)
Out[600]:
key value
0 foo 5
# union of keys
df1.merge(df2, how='outer')
Out[601]:
key value
0 foo 1
1 bar 2
2 baz 3
3 foo 5
4 foo 2
5 bar 3
6 baz 4
Merge With Index
A simple one:
df1 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'],
'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'value': [2, 3, 4, 5]},index = ['foo', 'bar', 'baz', 'foo'])
df1
Out[606]:
key value
0 foo 1
1 bar 2
2 baz 3
3 foo 5
df2
Out[607]:
value
foo 2
bar 3
baz 4
foo 5
df1.merge(df2, left_on='key',right_index=True)
Out[608]:
key value_x value_y
0 foo 1 2
0 foo 1 5
3 foo 5 2
3 foo 5 5
1 bar 2 3
2 baz 3 4
Hierarchically indexed data (example from Python for Data Analysis):
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'key2': [2000, 2001, 2002, 2001, 2002],
'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
[2001, 2000, 2000, 2000, 2001, 2002]],
columns=['event1', 'event2'])
lefth
Out[610]:
key1 key2 data
0 Ohio 2000 0.0
1 Ohio 2001 1.0
2 Ohio 2002 2.0
3 Nevada 2001 3.0
4 Nevada 2002 4.0
righth
Out[611]:
event1 event2
Nevada 2001 0 1
2000 2 3
Ohio 2000 4 5
2000 6 7
2001 8 9
2002 10 11
# intersection of keys
lefth.merge(righth, left_on=['key1','key2'], right_index=True)
Out[614]:
key1 key2 data event1 event2
0 Ohio 2000 0.0 4 5
0 Ohio 2000 0.0 6 7
1 Ohio 2001 1.0 8 9
2 Ohio 2002 2.0 10 11
3 Nevada 2001 3.0 0 1
# union of keys
lefth.merge(righth, left_on=['key1','key2'],right_index=True, how = 'outer')
Out[615]:
key1 key2 data event1 event2
0 Ohio 2000 0.0 4.0 5.0
0 Ohio 2000 0.0 6.0 7.0
1 Ohio 2001 1.0 8.0 9.0
2 Ohio 2002 2.0 10.0 11.0
3 Nevada 2001 3.0 0.0 1.0
4 Nevada 2002 4.0 NaN NaN
4 Nevada 2000 NaN 2.0 3.0
concat()
pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
Parameters:
objs
: a sequence or mapping of Series or DataFrame objectsaxis
: {0/’index’, 1/’columns’}, default 0join
: {‘inner’, ‘outer’}, default ‘outer’ignore_index
: bool, default False- If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1.
keys
: sequence, default None- If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level.
levels
: list of sequences, default None- Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.
names
: list, default None- Names for the levels in the resulting hierarchical index.
verify_integrity
: bool, default Falsesort
: bool, default False
Simple Concat
df1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
},
index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
{
"A": ["A4", "A5", "A6", "A7"],
"B": ["B4", "B5", "B6", "B7"],
"C": ["C4", "C5", "C6", "C7"],
"E": ["D4", "D5", "D6", "D7"],
},
index=[2, 3, 4, 5],
)
# default is to concat in rows
# default is union of column
pd.concat([df1,df2])
Out[630]:
A B C D E
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 NaN
3 A3 B3 C3 D3 NaN
2 A4 B4 C4 NaN D4
3 A5 B5 C5 NaN D5
4 A6 B6 C6 NaN D6
5 A7 B7 C7 NaN D7
# intersection of column
pd.concat([df1,df2], join='inner')
Out[632]:
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
3 A3 B3 C3
2 A4 B4 C4
3 A5 B5 C5
4 A6 B6 C6
5 A7 B7 C7
# concat in columns
# union of index
pd.concat([df1,df2],axis = 1)
Out[643]:
A B C D A B C E
0 A0 B0 C0 D0 NaN NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN NaN
2 A2 B2 C2 D2 A4 B4 C4 D4
3 A3 B3 C3 D3 A5 B5 C5 D5
4 NaN NaN NaN NaN A6 B6 C6 D6
5 NaN NaN NaN NaN A7 B7 C7 D7
# intersection of index
pd.concat([df1,df2],axis = 1,join='inner')
Out[644]:
A B C D A B C E
2 A2 B2 C2 D2 A4 B4 C4 D4
3 A3 B3 C3 D3 A5 B5 C5 D5
Concat with Group Keys
dfC = pd.concat([df1, df2], axis = 0, keys=[('df1'), ('df2')])
Out[651]:
A B C D E
df1 0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 NaN
3 A3 B3 C3 D3 NaN
df2 2 A4 B4 C4 NaN D4
3 A5 B5 C5 NaN D5
4 A6 B6 C6 NaN D6
5 A7 B7 C7 NaN D7
dfC.index
Out[676]:
MultiIndex([('df1', 0),
('df1', 1),
('df1', 2),
('df1', 3),
('df2', 2),
('df2', 3),
('df2', 4),
('df2', 5)],
names=['G', None])
dfC = pd.concat([df1,df2], axis = 1, keys=[('df1'), ('df2')])
Out[655]:
df1 df2
A B C D A B C E
0 A0 B0 C0 D0 NaN NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN NaN
2 A2 B2 C2 D2 A4 B4 C4 D4
3 A3 B3 C3 D3 A5 B5 C5 D5
4 NaN NaN NaN NaN A6 B6 C6 D6
5 NaN NaN NaN NaN A7 B7 C7 D7
dfC.columns
Out[679]:
MultiIndex([('df1', 'A'),
('df1', 'B'),
('df1', 'C'),
('df1', 'D'),
('df2', 'A'),
('df2', 'B'),
('df2', 'C'),
('df2', 'E')],
names=['G', None])
# alternately, we can pass a dict of onjects instead of a list,
# the dict's keys will be used for 'keys' option
dfs = {'df1':df1, 'df2':df2}
pd.concat(dfs)
Out[683]:
A B C D E
df1 0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 NaN
3 A3 B3 C3 D3 NaN
df2 2 A4 B4 C4 NaN D4
3 A5 B5 C5 NaN D5
4 A6 B6 C6 NaN D6
5 A7 B7 C7 NaN D7
pd.concat(dfs,axis=1)
Out[684]:
df1 df2
A B C D A B C E
0 A0 B0 C0 D0 NaN NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN NaN
2 A2 B2 C2 D2 A4 B4 C4 D4
3 A3 B3 C3 D3 A5 B5 C5 D5
4 NaN NaN NaN NaN A6 B6 C6 D6
5 NaN NaN NaN NaN A7 B7 C7 D7
combine() & coimbine_first()
combine()
DataFrame.combine(other, func, fill_value=None, overwrite=True)
Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two.
Parameters:
other
: DataFrame- The DataFrame to merge column-wise.
func
: function- Function that takes two series as inputs and return a Series or a scalar. Used to merge the two dataframes column by columns.
fill_value
: scalar value, default None- The value to fill NaNs with prior to passing any column to the merge func.
overwrite
: bool, default True- If True, columns in self that do not exist in other will be overwritten with NaNs.
df1 = pd.DataFrame({'A': [5, 0], 'B': [1, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
df1
Out[686]:
A B
0 5 1
1 0 4
df2
Out[687]:
A B
0 1 3
1 1 3
# ufunc
df1.combine(df2, np.minimum)
Out[688]:
A B
0 1 1
1 0 3
# lambda function
# choose column with smaller summation
take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2
df1.combine(df2, take_smaller)
Out[700]:
A B
0 1 1
1 1 4
combine_first()
DataFrame.combine_first(other)
Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two.
df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
# combine DataFrames with the same columns
df1.combine_first(df2)
Out[703]:
A B
0 1.0 3.0
1 0.0 4.0
# combine DataFrames with different columns
df3 = pd.DataFrame({'B': [1, 1], 'C': [3, 3]})
df1.combine_first(df3)
Out[705]:
A B C
0 NaN 1.0 3
1 0.0 4.0 3
Comments