Combining Datasets: concat and append

노정훈 · July 23, 2023

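# In[1]
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame whose cells are column label + index label."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)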
# In[2]
class display(object):
    """Display HTML representation of multiple objects"""
    template="""<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self,*args):
        self.args=args
    
    def _repr_html_(self):
        return '\n'.join(self.template.format(a,eval(a)._repr_html_()) 
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a+'\n'+repr(eval(a))
                         for a in self.args)

Recall: Concatenation of NumPy Arrays
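
As a quick refresher, the sketch below shows np.concatenate joining plain sequences and then 2-D arrays along a chosen axis. The variable names (flat, grid, stacked_cols) are only illustrative.

```python
import numpy as np

x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]

# Join several 1-D sequences end to end into one array.
flat = np.concatenate([x, y, z])

# For 2-D arrays, the axis argument picks the direction of concatenation.
grid = np.array([[1, 2], [3, 4]])
stacked_cols = np.concatenate([grid, grid], axis=1)

print(flat)
print(stacked_cols)
```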

Simple Concatenation with pd.concat

  • The pd.concat function provides a similar syntax to np.concatenate but offers a number of additional options.
pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, 
levels=None, names=None, verify_integrity=False, sort=False, copy=True)
  • It can be used for a simple concatenation of Series and DataFrame objects, just as np.concatenate can be used for simple concatenation of arrays.
# In[3]
ser1=pd.Series(['A','B','C'],index=[1,2,3])
ser2=pd.Series(['D','E','F'],index=[4,5,6])
pd.concat([ser1,ser2])
# Out[3]
1    A
2    B
3    C
4    D
5    E
6    F
dtype: object

# In[4]
df1=make_df('AB',[1,2])
df2=make_df('AB',[3,4])
display('df1','df2','pd.concat([df1,df2])')
# Out[4]
df1

     A	 B
1	A1	B1
2	A2	B2

df2

     A	 B
3	A3	B3
4	A4	B4

pd.concat([df1,df2])

     A	 B
1	A1	B1
2	A2	B2
3	A3	B3
4	A4	B4
  • Its default behavior is to concatenate row-wise (axis=0) within the DataFrame.
  • Like np.concatenate, pd.concat allows specification of an axis along which concatenation will take place.
# In[5]
df3=make_df('AB',[0,1])
df4=make_df('CD',[0,1])
display('df3','df4',"pd.concat([df3,df4],axis='columns')")
# Out[5]
df3

     A 	 B
0	A0	B0
1	A1	B1

df4

     C	 D
0	C0	D0
1	C1	D1

pd.concat([df3,df4],axis='columns')

     A	 B	 C	 D
0	A0	B0	C0	D0
1	A1	B1	C1	D1

Duplicate Indices

  • One important difference between np.concatenate and pd.concat is that Pandas concatenation preserves indices, even if the result will have duplicate indices.
# In[6]
x=make_df('AB',[0,1])
y=make_df('AB',[2,3])
y.index=x.index
display('x','y','pd.concat([x,y])')
# Out[6]
x

     A	 B
0	A0	B0
1	A1	B1

y

     A	 B
0	A2	B2
1	A3	B3

pd.concat([x,y])

     A	 B
0	A0	B0
1	A1	B1
0	A2	B2
1	A3	B3

Treating repeated indices as an error

  • If you'd like to simply verify that the indices in the result of pd.concat do not overlap, you can include the verify_integrity flag.
  • With this set to True, the concatenation will raise an exception if there are duplicate indices.
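
A minimal sketch of this check, rebuilding x and y by hand so the snippet runs on its own. The overlap_detected flag is a hypothetical name used only for illustration, and the exact wording of the error message varies between pandas versions.

```python
import pandas as pd

# Two frames that share the same index labels 0 and 1.
x = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
y = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]}, index=[0, 1])

try:
    pd.concat([x, y], verify_integrity=True)
    overlap_detected = False
except ValueError as err:
    # pandas reports overlapping index values as a ValueError.
    overlap_detected = True
    print("ValueError:", err)
```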

Ignoring the index

  • Sometimes the index itself does not matter, and you would prefer it to simply be ignored.
  • This option can be specified using the ignore_index flag.
  • With this set to True, the concatenation will create a new integer index for the resulting DataFrame.
# In[7]
display('x','y','pd.concat([x,y],ignore_index=True)')
# Out[7]
x

     A	 B
0	A0	B0
1	A1	B1

y

     A	 B
0	A2	B2
1	A3	B3

pd.concat([x,y],ignore_index=True)

     A	 B
0	A0	B0
1	A1	B1
2	A2	B2
3	A3	B3

Adding MultiIndex keys

  • Another option is to use the keys option to specify a label for the data sources.
  • The result will be a hierarchically indexed object containing the data (here, a DataFrame with a MultiIndex).
# In[8]
display('x','y',"pd.concat([x,y], keys=['x','y'])")
# Out[8]
x

     A	 B
0	A0	B0
1	A1	B1

y

     A	 B
0	A2	B2
1	A3	B3

pd.concat([x,y], keys=['x','y'])

         A	 B
x	0	A0	B0
    1	A1	B1
y	0	A2	B2
    1	A3	B3
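
One reason to add keys is that each source can be pulled back out of the combined result by its label. The sketch below assumes the same x and y as above, rebuilt by hand; combined, from_y, and cell are illustrative names.

```python
import pandas as pd

x = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
y = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]}, index=[0, 1])

combined = pd.concat([x, y], keys=["x", "y"])

# The outer index level holds the source labels, so selecting one label
# returns that source's rows with their original index.
from_y = combined.loc["y"]

# A (source, row) tuple addresses a single row; adding a column gives one cell.
cell = combined.loc[("x", 1), "B"]

print(from_y)
print(cell)
```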

Concatenation with Joins

  • Data from different sources might have different sets of column names, and pd.concat offers several options in this case.
# In[9]
df5=make_df('ABC',[1,2])
df6=make_df('BCD',[3,4])
display('df5','df6','pd.concat([df5,df6])')
# Out[9]
df5

     A	 B	 C
1	A1	B1	C1
2	A2	B2	C2

df6

     B	 C	 D
3	B3	C3	D3
4	B4	C4	D4

pd.concat([df5,df6])

     A	 B	 C	  D
1	 A1	B1	C1	NaN
2	 A2	B2	C2	NaN
3	NaN	B3	C3	 D3
4	NaN	B4	C4	 D4
  • The default behavior is to fill entries for which no data is available with NA values.
  • To change this, we can adjust the join parameter of the concat function.
  • By default, the join is a union of the input columns, but we can change this to an intersection of the columns using join='inner'.
# In[10]
display('df5','df6',"pd.concat([df5,df6], join='inner')")
# Out[10]
df5

     A	 B	 C
1	A1	B1	C1
2	A2	B2	C2

df6

     B	 C	 D
3	B3	C3	D3
4	B4	C4	D4

pd.concat([df5,df6], join='inner')

     B	 C
1	B1	C1
2	B2	C2
3	B3	C3
4	B4	C4
  • Another useful pattern is to use the reindex method before concatenation for finer control over which columns are dropped.
# In[11]
pd.concat([df5, df6.reindex(df5.columns, axis=1)])
# Out[11]
     A	 B	 C
1	A1	B1	C1
2	A2	B2	C2
3	NaN	B3	C3
4	NaN	B4	C4
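
The same pattern works with any explicit column list, not only the columns of one input. In the sketch below, keep is a hypothetical list chosen for illustration, and df5 and df6 are rebuilt by hand so the snippet runs on its own.

```python
import pandas as pd

df5 = pd.DataFrame(
    {"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2"]}, index=[1, 2]
)
df6 = pd.DataFrame(
    {"B": ["B3", "B4"], "C": ["C3", "C4"], "D": ["D3", "D4"]}, index=[3, 4]
)

# Hypothetical choice: keep only columns A and C in the result.
keep = ["A", "C"]

# Reindexing each input first means missing columns become NaN
# and every column outside the list is dropped.
result = pd.concat([df5.reindex(columns=keep), df6.reindex(columns=keep)])
print(result)
```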

The append Method

  • Because direct concatenation is so common, Series and DataFrame objects in pandas versions before 2.0 have an append method that can accomplish the same thing in fewer keystrokes.
# In[12]
display('df1','df2','df1.append(df2)')
# Out[12]
df1

     A	 B
1	A1	B1
2	A2	B2

df2

     A	 B
3	A3	B3
4	A4	B4

df1.append(df2)

     A	 B
1	A1	B1
2	A2	B2
3	A3	B3
4	A4	B4
  • Keep in mind that unlike the append and extend methods of Python lists, the append method in Pandas does not modify the original object; instead it creates a new object with the combined data.
  • It also is not a very efficient method, because it involves creation of a new index and data buffer.
  • Note that the append method was deprecated in pandas 1.4 and removed in pandas 2.0, so df1.append(df2) raises AttributeError on current versions. Use pd.concat instead.
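
As a sketch of the pd.concat replacement, the snippet below reproduces the append result and also shows a common way to avoid repeated copying: collect the pieces in a list and concatenate once. The names combined, pieces, and built are illustrative, and df1 and df2 are rebuilt by hand.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=[1, 2])
df2 = pd.DataFrame({"A": ["A3", "A4"], "B": ["B3", "B4"]}, index=[3, 4])

# Equivalent of the old df1.append(df2); df1 and df2 are left unchanged.
combined = pd.concat([df1, df2])

# When a result is built piece by piece, gather the pieces in a list and
# concatenate once instead of growing a DataFrame inside the loop.
pieces = [
    pd.DataFrame({"A": [f"A{i}"], "B": [f"B{i}"]}, index=[i])
    for i in range(1, 5)
]
built = pd.concat(pieces)

print(combined)
```

Each pd.concat call allocates a new index and data buffer, so a single call over a list generally does less work than one call per piece.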