Combining Datasets: merge and join

노정훈·2023년 7월 24일

Pandas

목록 보기

7/12

# In[1]
class display(object):
    """Display HTML representation of multiple objects"""
    template="""<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}{1}
    """
    def __init__(self,*args):
        self.args=args
    
    def _repr_html_(self):
        return '\n'.join(self.template.format(a,eval(a)._repr_html_()) 
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a+'\n'+repr(eval(a))
                         for a in self.args)

Relational Algebra

The behavior implemented in pd.merge is a subset of what is known as relational algebra, which is a formal set of rules for manipulating relational data that forms the conceptual foundation of operations available in most databases.
The strength of the relational algebra approach is that it proposes several fundamental operations, which become the building blocks of more complicated operations on any dataset.
Pandas implements several of these fundamental building blocks in the pd.merge function and the related join method of Series and DataFrame objects.

Categories of Joins

One-to-One Joins

The simplest type of merge.

# In[2]
df1=pd.DataFrame({'employee':['Bob','Jake','Lisa','Sue'],
'group':['Accounting','Engineering','Engineering','HR']})
df2=pd.DataFrame({'employee':['Lisa','Bob','Jake','Sue'],
'hire_data':[2004,2008,2012,2014]})
display('df1','df2')

# Out[2]
df1

  employee	      group
0	   Bob	 Accounting
1	  Jake	Engineering
2	  Lisa	Engineering
3	   Sue	         HR

df2

  employee	hire_data
0	  Lisa	     2004
1	   Bob	     2008
2	  Jake	     2012
3	   Sue	     2014

To combine this information into a single DataFrame, we can use the pd.merge function.

# In[3]
df3=pd.merge(df1,df2)
df3

# Out[3]
  employee	      group	hire_data
0	   Bob	 Accounting	     2008
1	  Jake	Engineering	     2012
2	  Lisa	Engineering	     2004
3	   Sue	         HR	     2014

pd.merge function recognizes that each DataFrame has an employee column, and automatically joins using this column as a key.
The order of entries in each column is not necessarily maintained: in this case, the order of employee column differs between df1 and df2, and the pd.merge function correctly accounts for this.
The merge in general discards the index, except in the special case of merges by index.

Many-to-One Joins

Many to One joins are joins in which one of the two key columns contains duplicate entries.

# In[4]
df4=pd.DataFrame({'group':['Accounting','Engineering','HR'],
'supervisor':['Carly','Guido','Steve']})
display('df3','df4','pd.merge(df3,df4)')

# Out[4]
df3

  employee	      group	hire_data
0	   Bob	 Accounting	     2008
1	  Jake	Engineering	     2012
2	  Lisa	Engineering	     2004
3	   Sue	         HR	     2014

df4

          group	supervisor
0	 Accounting	     Carly
1	Engineering	     Guido
2	         HR	     Steve

pd.merge(df3,df4)

  employee	      group	hire_data	supervisor
0	   Bob	 Accounting	     2008	     Carly
1	  Jake	Engineering	     2012	     Guido
2	  Lisa	Engineering	     2004	     Guido
3	   Sue	         HR	     2014	     Steve

Many-to-Many Joins

If the key column in both the left and right arrays contains duplicates, then the result is a many to many merge.

# In[5]
df5=pd.DataFrame({'group':['Accounting','Accounting','Engineering','Engineering','HR','HR'],
'skills':['math','spreadsheets','software','math','spreadsheets','organization']})
display('df1','df5','pd.merge(df1,df5)')

# Out[5]
df1

  employee	      group
0	   Bob	 Accounting
1	  Jake	Engineering
2	  Lisa	Engineering
3	   Sue	         HR

df5

          group	      skills
0	 Accounting	        math
1	 Accounting	spreadsheets
2	Engineering	    software
3	Engineering	        math
4	         HR	spreadsheets
5	         HR	organization

pd.merge(df1,df5)

  employee	      group	      skills
0	   Bob	 Accounting	        math
1	   Bob	 Accounting	spreadsheets
2	  Jake	Engineering	    software
3	  Jake	Engineering	        math
4	  Lisa	Engineering	    software
5	  Lisa	Engineering	        math
6	   Sue	         HR	spreadsheets
7	   Sue	         HR	organization

Specification of the Merge Key

The on Keyword

You can explicitly specify the name of the key column using the on keyword, which takes a column name of a list of column names.

# In[6]
display('df1','df2',"pd.merge(df1,df2,on='employee')")

# Out[6]
df1

  employee	      group
0	   Bob	 Accounting
1	  Jake	Engineering
2	  Lisa	Engineering
3	   Sue	         HR

df2

  employee	hire_data
0	  Lisa	     2004
1	   Bob	     2008
2	  Jake	     2012
3	   Sue	     2014

pd.merge(df1,df2,on='employee')

  employee	      group	hire_data
0	   Bob	 Accounting	     2008
1	  Jake	Engineering	     2012
2	  Lisa	Engineering	     2004
3	   Sue	         HR	     2014

This option works only if both the left and right DataFrames have the specified column name.

The left_on and right_on Keywords

At times you may wish to merge two datasets with different column names.

# In[7]
df3=pd.DataFrame({'name':['Bob','Jake','Lisa','Sue'],
'salary':[70000,80000,120000,90000]})
display('df1','df3',"pd.merge(df1,df3,left_on='employee',right_on='name')")

# Out[7]
df1

  employee	      group
0	   Bob	 Accounting
1	  Jake	Engineering
2	  Lisa	Engineering
3	   Sue	         HR

df3

    name	salary
0	 Bob	 70000
1	Jake	 80000
2	Lisa	120000
3	 Sue	 90000

pd.merge(df1,df3,left_on='employee',right_on='name')

  employee	      group	name	salary
0	   Bob	 Accounting	 Bob	 70000
1	  Jake	Engineering	Jake	 80000
2	  Lisa	Engineering	Lisa	120000
3	   Sue	         HR	 Sue	 90000

The result has a redundant column that we can drop if desired by using DataFrame.drop() method.

# In[8]
pd.merge(df1,df3,left_on='employee',right_on='name').drop('name',axis=1)

# Out[8]
  employee	      group	salary
0	   Bob	 Accounting	 70000
1	  Jake	Engineering	 80000
2	  Lisa	Engineering	120000
3	   Sue	         HR	 90000

The left_index and right_index Keywords

Sometimes, rather than merging on a column, you would instead like to merge on an index.
You can use the index as the key for merging by specifying the left_index and/or right_index flags in pd.merge()

# In[9]
df1a=df1.set_index('employee')
df2a=df2.set_index('employee')
display('df1a','df2a')

# Out[9]
df1a

              group
employee	
Bob	     Accounting
Jake	Engineering
Lisa	Engineering
Sue	             HR

df2a

        hire_data
employee	
Lisa	     2004
Bob	         2008
Jake	     2012
Sue	         2014

# In[10]
display('df1a','df2a',"pd.merge(df1a,df2a,left_index=True,right_index=True)")

# Out[10]
df1a

              group
employee	
Bob	     Accounting
Jake	Engineering
Lisa	Engineering
Sue	             HR

df2a

        hire_data
employee	
Lisa	     2004
Bob	         2008
Jake	     2012
Sue	         2014

pd.merge(df1a,df2a,left_index=True,right_index=True)

              group	hire_data
employee		
Bob	     Accounting	     2008
Jake	Engineering	     2012
Lisa	Engineering	     2004
Sue	             HR	     2014

For convenience, Pandas includes the DataFrame.join() method, which performs an index-based merge without extra keywords.

# In[11]
df1a.join(df2a)

# Out[11]
              group	hire_data
employee		
Bob	     Accounting	     2008
Jake	Engineering	     2012
Lisa	Engineering	     2004
Sue	             HR	     2014

If you'd like to mix indices and columns, you can combine left_index with right_on or left_on with right_index to get the desired behavior.

# In[12]
display('df1a','df3',"pd.merge(df1a,df3,left_index=True,right_on='name')")

# Out[12]
df1a

              group
employee	
Bob	     Accounting
Jake	Engineering
Lisa	Engineering
Sue	             HR

df3

    name	salary
0	 Bob     70000
1	Jake	 80000
2	Lisa	120000
3	 Sue	 90000

pd.merge(df1a,df3,left_index=True,right_on='name')

          group	name	salary
0	 Accounting	 Bob	 70000
1	Engineering	Jake	 80000
2	Engineering	Lisa	120000
3	         HR	 Sue     90000

Specifying Set Arithmetic for Joins

# In[13]
df6=pd.DataFrame({'name':['Peter','Paul','Mary'],
'food':['fish','beans','bread']})
df7=pd.DataFrame({'name':['Mary','Joseph'],
'drink':['wine','beer']}, columns=['name','drink'])
display('df6','df7',"pd.merge(df6,df7)")

# Out[13]
df6

     name	 food
0	Peter	 fish
1	 Paul	beans
2	 Mary	bread

df7

      name	drink
0	  Mary	 wine
1	Joseph	 beer

pd.merge(df6,df7)

    name	 food	drink
0	Mary	bread	 wine

By default, the result contains the intersection of the two sets of inputs; this is what is known as an inner join.
We can specify this explicitly using the how keyword, which defaults to inner

# In[14]
pd.merge(df6,df7,how='inner')

# Out[14]
    name	 food	drink
0	Mary	bread	 wine

Other options for the how keyword are outer, left, and right.
outer join returns a join over the union of the input columns and fills in missing values with NAs.

# In[15]
display('df6','df7',"pd.merge(df6,df7,how='outer')")

# Out[15]
df6

     name	 food
0	Peter	 fish
1	 Paul	beans
2	 Mary	bread

df7

      name	drink
0	  Mary	 wine
1	Joseph	 beer

pd.merge(df6,df7,how='outer')

      name	 food	drink
0	 Peter	 fish	  NaN
1	  Paul	beans	  NaN
2	  Mary	bread	 wine
3	Joseph	  NaN	 beer

The left and right join return joins over the left entries and right entries, respectively.

# In[16]
display('df6','df7',"pd.merge(df6,df7,how='left')")

# Out[16]
df6

     name	 food
0	Peter	 fish
1	 Paul	beans
2	 Mary	bread

df7

      name	drink
0	  Mary	 wine
1	Joseph	 beer

pd.merge(df6,df7,how='left')

     name	 food	drink
0	Peter	 fish	  NaN
1	 Paul	beans	  NaN
2	 Mary	bread	 wine

The output rows now correspond to the entries in the left input.

# In[17]
display('df6','df7',"pd.merge(df6,df7,how='right')")

# Out[17]
df6

     name	food
0	Peter	 fish
1	 Paul	beans
2	 Mary	bread

df7

      name	drink
0	  Mary	 wine
1	Joseph	 beer

pd.merge(df6,df7,how='right')

      name	 food	drink
0	  Mary	bread	 wine
1	Joseph	  NaN	 beer

Overlapping Column Names: The suffixes Keyword

# In[18]
df8=pd.DataFrame({'name':['Bob','Jake','Lisa','Sue'],
'rank':[1,2,3,4]})
df9=pd.DataFrame({'name':['Bob','Jake','Lisa','Sue'],
'rank':[3,1,4,2]})
display('df8','df9','pd.merge(df8,df9,on="name")')

# Out[18]
df8

    name rank
0	 Bob	1
1	Jake	2
2	Lisa	3
3	 Sue	4

df9

    name rank
0	 Bob	3
1	Jake	1
2	Lisa	4
3	 Sue	2

pd.merge(df8,df9,on="name")

    name	rank_x	rank_y
0	 Bob	     1	     3
1	Jake	     2	     1
2	Lisa	     3	     4
3	 Sue	     4	     2

Because the output would have two conflicting column names, the merge function automatically appends the suffixes _x and _y to make the output columns unique.
If these defaults are inappropriate, it is possible to specify a custom suffix using the suffixes keyword.

# In[19]
pd.merge(df8,df9,on="name",suffixes=["_L","_R"])

# Out[19]
    name	rank_L	rank_R
0	 Bob	     1	     3
1	Jake	     2	     1
2	Lisa	     3	     4
3	 Sue	     4	     2

Reference these urls if you want to know about join, merge and concat :
1. Combining DataFrame
2. About merge method

노정훈

이전 포스트

Combining Datasets: concat and append

다음 포스트

Combining Datasets: merge and join

Pandas

Relational Algebra

Categories of Joins

One-to-One Joins

Many-to-One Joins

Many-to-Many Joins

Specification of the Merge Key

The on Keyword

The left_on and right_on Keywords

The left_index and right_index Keywords

Specifying Set Arithmetic for Joins

Overlapping Column Names: The suffixes Keyword

Combining Datasets: concat and append

Aggregation and Grouping

0개의 댓글