Data Wrangling [02]: Merging DataFrames

12 minute read

Published:

DataFrames can be merged, concatenated or combined in a number of ways, where

  • DataFrame.merge() or pandas.merge() merges DataFrame or named Series objects with a database-style join
  • DataFrame.join() join columns of another DataFrame, which is similiar to DataFrame.merge()
  • DataFrame.update() modifies in place using non-NA values from another DataFrame.
  • pandas.cancat() concatenates pandas objects along a particular axis with optional set logic along the other axes
  • DataFrame.combine() performs column-wise combine with another DataFrame
  • DataFrame.combine_first() updates null elements with value in the same location in other

merge()

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Parameters:

  • right: DataFrame or named Series
  • how: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’
    • left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
    • right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
    • outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
    • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
    • cross: creates the cartesian product from both frames, preserves the order of the left keys.
  • on: label or list, column or index level names to join on.
    • If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
  • left_on: label or list, or array-like
  • right_on: label or list, or array-like
  • left_index: bool, default False
    • Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.
  • right_index: bool, default False
  • sort: bool, default False
  • suffixes: list-like, default is (“_x”, “_y”)
  • copy: bool, default True
  • indicator: bool or str, default False
  • If True, adds a column to the output DataFrame called “_merge” with information on the source of each row.
  • validate: str, optional
    • “one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
    • “one_to_many” or “1:m”: check if merge keys are unique in left dataset.
    • “many_to_one” or “m:1”: check if merge keys are unique in right dataset.
    • “many_to_many” or “m:m”: allowed, but does not result in checks.

Simple Merge

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [2, 3, 4, 5]})
                    
df1
Out[583]: 
  lkey  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5

df2
Out[584]: 
  rkey  value
0  foo      2
1  bar      3
2  baz      4
3  foo      5

# default key is the common columns in both DataFrames, here is `value`
# default is to use intersection of keys from both DataFrames
df1.merge(df2)
Out[585]: 
  lkey  value rkey
0  bar      2  foo
1  baz      3  bar
2  foo      5  foo

# use union of keys from both DataFrames
df1.merge(df2, how = 'outer')
Out[595]: 
  lkey  value rkey
0  foo      1  NaN
1  bar      2  foo
2  baz      3  bar
3  foo      5  foo
4  NaN      4  baz

# use only keys from the left DataFrame
df1.merge(df2, how = 'left')
Out[596]: 
  lkey  value rkey
0  foo      1  NaN
1  bar      2  foo
2  baz      3  bar
3  foo      5  foo

# use only keys from the right DataFrame
df1.merge(df2, how = 'right')
Out[597]: 
  lkey  value rkey
0  bar      2  foo
1  baz      3  bar
2  NaN      4  baz
3  foo      5  foo

# designate keys from the two  DataFrames
# foo(2) -> foo(2), bar -> bar, baz -> barz
df1.merge(df2, left_on='lkey', right_on='rkey')
Out[592]: 
  lkey  value_x rkey  value_y
0  foo        1  foo        2
1  foo        1  foo        5
2  foo        5  foo        2
3  foo        5  foo        5
4  bar        2  bar        3
5  baz        3  baz        4

# merge the keys columns using `DataFrame.set_index()`
df1.merge(df2.set_index('rkey'), left_on='lkey', right_on='rkey')
Out[617]: 
  lkey  value_x  value_y
0  foo        1        2
1  foo        1        5
2  foo        5        2
3  foo        5        5
4  bar        2        3
5  baz        3        4

Merge With Multiple Keys

df1 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'],
                    'value': [2, 3, 4, 5]})
                    
# intersection of keys
df1.merge(df2)
Out[600]: 
   key  value
0  foo      5

# union of keys
df1.merge(df2, how='outer')
Out[601]: 
   key  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5
4  foo      2
5  bar      3
6  baz      4

Merge With Index

A simple one:

df1 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'value': [2, 3, 4, 5]},index = ['foo', 'bar', 'baz', 'foo'])

df1
Out[606]: 
   key  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5

df2
Out[607]: 
     value
foo      2
bar      3
baz      4
foo      5

df1.merge(df2, left_on='key',right_index=True)
Out[608]: 
   key  value_x  value_y
0  foo        1        2
0  foo        1        5
3  foo        5        2
3  foo        5        5
1  bar        2        3
2  baz        3        4

Hierarchically indexed data (example from Python for Data Analysis):

lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
            index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
                   [2001, 2000, 2000, 2000, 2001, 2002]],
            columns=['event1', 'event2'])
            
lefth
Out[610]: 
     key1  key2  data
0    Ohio  2000   0.0
1    Ohio  2001   1.0
2    Ohio  2002   2.0
3  Nevada  2001   3.0
4  Nevada  2002   4.0

righth
Out[611]: 
             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11
       
# intersection of keys
lefth.merge(righth, left_on=['key1','key2'], right_index=True)
Out[614]: 
     key1  key2  data  event1  event2
0    Ohio  2000   0.0       4       5
0    Ohio  2000   0.0       6       7
1    Ohio  2001   1.0       8       9
2    Ohio  2002   2.0      10      11
3  Nevada  2001   3.0       0       1

# union of keys
lefth.merge(righth, left_on=['key1','key2'],right_index=True, how = 'outer')
Out[615]: 
     key1  key2  data  event1  event2
0    Ohio  2000   0.0     4.0     5.0
0    Ohio  2000   0.0     6.0     7.0
1    Ohio  2001   1.0     8.0     9.0
2    Ohio  2002   2.0    10.0    11.0
3  Nevada  2001   3.0     0.0     1.0
4  Nevada  2002   4.0     NaN     NaN
4  Nevada  2000   NaN     2.0     3.0

concat()

pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)

Parameters:

  • objs: a sequence or mapping of Series or DataFrame objects
  • axis: {0/’index’, 1/’columns’}, default 0
  • join: {‘inner’, ‘outer’}, default ‘outer’
  • ignore_index: bool, default False
    • If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1.
  • keys: sequence, default None
    • If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level.
  • levels: list of sequences, default None
    • Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.
  • names: list, default None
    • Names for the levels in the resulting hierarchical index.
  • verify_integrity: bool, default False
  • sort: bool, default False

Simple Concat

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)


df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "E": ["D4", "D5", "D6", "D7"],
    },
    index=[2, 3, 4, 5],
)

# default is to concat in rows
# default is union of column
pd.concat([df1,df2])
Out[630]: 
    A   B   C    D    E
0  A0  B0  C0   D0  NaN
1  A1  B1  C1   D1  NaN
2  A2  B2  C2   D2  NaN
3  A3  B3  C3   D3  NaN
2  A4  B4  C4  NaN   D4
3  A5  B5  C5  NaN   D5
4  A6  B6  C6  NaN   D6
5  A7  B7  C7  NaN   D7

# intersection of column
pd.concat([df1,df2], join='inner')
Out[632]: 
    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2
3  A3  B3  C3
2  A4  B4  C4
3  A5  B5  C5
4  A6  B6  C6
5  A7  B7  C7

# concat in columns
# union of index
pd.concat([df1,df2],axis = 1)
Out[643]: 
     A    B    C    D    A    B    C    E
0   A0   B0   C0   D0  NaN  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN  NaN
2   A2   B2   C2   D2   A4   B4   C4   D4
3   A3   B3   C3   D3   A5   B5   C5   D5
4  NaN  NaN  NaN  NaN   A6   B6   C6   D6
5  NaN  NaN  NaN  NaN   A7   B7   C7   D7

# intersection of index
pd.concat([df1,df2],axis = 1,join='inner')
Out[644]: 
    A   B   C   D   A   B   C   E
2  A2  B2  C2  D2  A4  B4  C4  D4
3  A3  B3  C3  D3  A5  B5  C5  D5

Concat with Group Keys

dfC = pd.concat([df1, df2], axis = 0, keys=[('df1'), ('df2')])
Out[651]: 
        A   B   C    D    E
df1 0  A0  B0  C0   D0  NaN
    1  A1  B1  C1   D1  NaN
    2  A2  B2  C2   D2  NaN
    3  A3  B3  C3   D3  NaN
df2 2  A4  B4  C4  NaN   D4
    3  A5  B5  C5  NaN   D5
    4  A6  B6  C6  NaN   D6
    5  A7  B7  C7  NaN   D7
 
dfC.index
Out[676]: 
MultiIndex([('df1', 0),
            ('df1', 1),
            ('df1', 2),
            ('df1', 3),
            ('df2', 2),
            ('df2', 3),
            ('df2', 4),
            ('df2', 5)],
           names=['G', None])
           
dfC = pd.concat([df1,df2], axis = 1, keys=[('df1'), ('df2')])
Out[655]: 
   df1                 df2               
     A    B    C    D    A    B    C    E
0   A0   B0   C0   D0  NaN  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN  NaN
2   A2   B2   C2   D2   A4   B4   C4   D4
3   A3   B3   C3   D3   A5   B5   C5   D5
4  NaN  NaN  NaN  NaN   A6   B6   C6   D6
5  NaN  NaN  NaN  NaN   A7   B7   C7   D7

dfC.columns
Out[679]: 
MultiIndex([('df1', 'A'),
            ('df1', 'B'),
            ('df1', 'C'),
            ('df1', 'D'),
            ('df2', 'A'),
            ('df2', 'B'),
            ('df2', 'C'),
            ('df2', 'E')],
           names=['G', None])
           
# alternately, we can pass a dict of onjects instead of a list,
# the dict's keys will be used for 'keys' option
dfs = {'df1':df1, 'df2':df2}

pd.concat(dfs)
Out[683]: 
        A   B   C    D    E
df1 0  A0  B0  C0   D0  NaN
    1  A1  B1  C1   D1  NaN
    2  A2  B2  C2   D2  NaN
    3  A3  B3  C3   D3  NaN
df2 2  A4  B4  C4  NaN   D4
    3  A5  B5  C5  NaN   D5
    4  A6  B6  C6  NaN   D6
    5  A7  B7  C7  NaN   D7

pd.concat(dfs,axis=1)
Out[684]: 
   df1                 df2               
     A    B    C    D    A    B    C    E
0   A0   B0   C0   D0  NaN  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN  NaN
2   A2   B2   C2   D2   A4   B4   C4   D4
3   A3   B3   C3   D3   A5   B5   C5   D5
4  NaN  NaN  NaN  NaN   A6   B6   C6   D6
5  NaN  NaN  NaN  NaN   A7   B7   C7   D7

combine() & coimbine_first()

combine()

DataFrame.combine(other, func, fill_value=None, overwrite=True)

Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two.

Parameters:

  • other: DataFrame
    • The DataFrame to merge column-wise.
  • func: function
    • Function that takes two series as inputs and return a Series or a scalar. Used to merge the two dataframes column by columns.
  • fill_value: scalar value, default None
    • The value to fill NaNs with prior to passing any column to the merge func.
  • overwrite: bool, default True
    • If True, columns in self that do not exist in other will be overwritten with NaNs.
df1 = pd.DataFrame({'A': [5, 0], 'B': [1, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})

df1
Out[686]: 
   A  B
0  5  1
1  0  4

df2
Out[687]: 
   A  B
0  1  3
1  1  3

# ufunc
df1.combine(df2, np.minimum)
Out[688]: 
   A  B
0  1  1
1  0  3

# lambda function
# choose column with smaller summation
take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2
df1.combine(df2, take_smaller)
Out[700]: 
   A  B
0  1  1
1  1  4

combine_first()

DataFrame.combine_first(other)

Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two.

df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})

# combine DataFrames with the same columns
df1.combine_first(df2)
Out[703]: 
     A    B
0  1.0  3.0
1  0.0  4.0

# combine DataFrames with different columns
df3 = pd.DataFrame({'B': [1, 1], 'C': [3, 3]})
df1.combine_first(df3)
Out[705]: 
     A    B  C
0  NaN  1.0  3
1  0.0  4.0  3

Comments