Pivot Tables

노정훈·2023년 7월 26일
0

Pandas

목록 보기
9/12
  • The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.
  • Pivot tables as essentially a multidimensional version of groupby aggregation.
  • That is, you split-apply-combine, but both the split and the combine happen accross not a one-dimensional index, but across a two-dimensional grid.

Motivating Pivot Tables

  • We'll use the database of passengers on the Titanic, available through the Seaborn library.
# In[1]
titanic=sns.load_dataset('titanic')
titanic.head()
# Out[1]
  survived	pclass	   sex	 age	sibsp	parch	   fare	embarked	class	  who	adult_male	deck	embark_town	alive	alone
0	     0	     3	  male	22.0	    1	    0	 7.2500	       S	Third	  man	      True	 NaN	Southampton	   no	False
1	     1	     1	female	38.0	    1	    0	71.2833	       C	First	woman	     False	   C	  Cherbourg	  yes	False
2	     1	     3	female	26.0	    0	    0	 7.9250	       S	Third	woman	     False	 NaN	Southampton	  yes	 True
3	     1	     1	female	35.0	    1	    0	53.1000	       S	First	woman	     False	   C	Southampton	  yes	False
4	     0	     3	  male	35.0	    0	    0	 8.0500	       S	Third	  man	      True	 NaN	Southampton	   no	 True

Pivot Tables by Hand

  • We might begin by grouping according to sex, survival status, or combination thereof.
# In[2]
titanic.groupby('sex')[['survived']].mean() # make a DataFrame
"""titanic.groupby('sex')['survived'].mean() return Series"""
# Out[2]
        survived
sex	
female	0.742038
male	0.188908

# In[3]
titanic.groupby(['sex','class'])['survived'].aggregate('mean').unstack()
# Out[3]
class	First	    Second	    Third
sex			
female	0.968085	0.921053	0.500000
male	0.368852	0.157407	0.135447
  • The above code gives us a better idea of how both sex and class affected survival, but the code is starting to look a bit garbled(=confused).
  • The long string of code is not particularly easy to read or use.
  • This two-dimensional groupby is common enough that Pandas includes a convenience routine, pivot_table, which succinctly(간단명료한) handles this type of multidimensional aggregation.

Pivot Table Syntax

  • Here is the equivalent to the preceding operation using the DataFrame.pivot_table method
# In[4]
titanic.pivot_table('survived',index='sex',columns='class',aggfunc='mean')
# Out[4]
class	First	    Second	    Third
sex			
female	0.968085	0.921053	0.500000
male	0.368852	0.157407	0.135447

Multilevel Pivot Tables

  • Just as in a groupby, the grouping in pivot tables can be specified with multiple levels and via a number of options.
  • We'll bin the age using the pd.cut function.
# In[5]
age=pd.cut(titanic['age'],[0,18,80])
titanic.pivot_table('survived',['sex',age],'class')
# Out[5]
class	        First	    Second	    Third
sex	   age			
female	(0, 18]	0.909091	1.000000	0.511628
       (18, 80]	0.972973	0.900000	0.423729
male	(0, 18]	0.800000	0.600000	0.215686
       (18, 80]	0.375000	0.071429	0.133663
  • We can apply the same strategy when working with the columns as well; let's add info on the fare paid, using pd.qcut to automatically compute quantiles.
# In[6]
fare=pd.qcut(titanic['fare'],2)
titanic.pivot_table('survived',['sex',age],[fare,'class'])
# Out[6]
fare	       (-0.001, 14.454]	                  (14.454, 512.329]
class	       First	  Second	   Third	   First	  Second	   Third
sex	    age						
female	(0, 18]	 NaN	1.000000	0.714286	0.909091	1.000000	0.318182
       (18, 80]	 NaN	0.880000	0.444444	0.972973	0.914286	0.391304
male	(0, 18]	 NaN	0.000000	0.260870	0.800000	0.818182	0.178571
       (18, 80]	 0.0	0.098039	0.125000	0.391304	0.030303	0.192308
  • The result is a four-dimensional aggregation with hierarchical indices, shown in a grid demonstrating the relationship between the values.

If you don't know about the pd.cut and pd.qcut, please reference this url :
Difference of pd.cut() and pd.qcut()

Additional Pivot Table Options

  • The full call signature of the DataFrame.pivot_table method is as follows.
DataFrame.pivot_table(data, values=None, index=None, columns=None, 
aggfunc='mean', fill_value=None, margins=False, dropna=True, 
margins_name='All', observed=False, sort=True)
  • fill_value and dropna have to do with missing data and are fairly straightforward.

  • The aggfunc keyword controls what type of aggregation is applied, which is mean by default.
  • As with groupby, the aggregation specification can be a string representing one of several common choices like sum , mean, count etc.
  • Or a function that implements an aggregation like np.sum(), min(), sum() etc.
  • Additionally, it can be specified as a dictionary mapping a column to any of the desired options.
# In[7]
titanic.pivot_table(index='sex',columns='class',
aggfunc={'survived':sum,'fare':'mean'})
# Out[7]
        fare	                            survived
class	First	    Second	    Third	    First	Second	Third
sex						
female	106.125798	21.970121	16.118810	91	    70	    72
male	 67.226127	19.741782	12.661633	45	    17	    47
  • We've omitted the values keyword here; when specifying a mapping for aggfunc, this is determined automatically.

  • At times it's useful to compute totals along each grouping. This can be done via the margins keyword.

# In[8]
titanic.pivot_table('survived',index='sex',columns='class',margins=True)
# Out[8]
class	First	    Second	    Third	    All
sex				
female	0.968085	0.921053	0.500000	0.742038
male	0.368852	0.157407	0.135447	0.188908
All	    0.629630	0.472826	0.242363	0.383838
  • The margin label can be specified with the margins_name keyword; it defaults to "All"
profile
노정훈

0개의 댓글