Pivot Tables

노정훈·2023년 7월 26일

Pandas

목록 보기

9/12

The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.
Pivot tables as essentially a multidimensional version of groupby aggregation.
That is, you split-apply-combine, but both the split and the combine happen accross not a one-dimensional index, but across a two-dimensional grid.

Motivating Pivot Tables

We'll use the database of passengers on the Titanic, available through the Seaborn library.

# In[1]
titanic=sns.load_dataset('titanic')
titanic.head()

# Out[1]
  survived	pclass	   sex	 age	sibsp	parch	   fare	embarked	class	  who	adult_male	deck	embark_town	alive	alone
0	     0	     3	  male	22.0	    1	    0	 7.2500	       S	Third	  man	      True	 NaN	Southampton	   no	False
1	     1	     1	female	38.0	    1	    0	71.2833	       C	First	woman	     False	   C	  Cherbourg	  yes	False
2	     1	     3	female	26.0	    0	    0	 7.9250	       S	Third	woman	     False	 NaN	Southampton	  yes	 True
3	     1	     1	female	35.0	    1	    0	53.1000	       S	First	woman	     False	   C	Southampton	  yes	False
4	     0	     3	  male	35.0	    0	    0	 8.0500	       S	Third	  man	      True	 NaN	Southampton	   no	 True

Pivot Tables by Hand

We might begin by grouping according to sex, survival status, or combination thereof.

# In[2]
titanic.groupby('sex')[['survived']].mean() # make a DataFrame
"""titanic.groupby('sex')['survived'].mean() return Series"""

# Out[2]
        survived
sex	
female	0.742038
male	0.188908

# In[3]
titanic.groupby(['sex','class'])['survived'].aggregate('mean').unstack()

# Out[3]
class	First	    Second	    Third
sex			
female	0.968085	0.921053	0.500000
male	0.368852	0.157407	0.135447

The above code gives us a better idea of how both sex and class affected survival, but the code is starting to look a bit garbled(=confused).
The long string of code is not particularly easy to read or use.
This two-dimensional groupby is common enough that Pandas includes a convenience routine, pivot_table, which succinctly(간단명료한) handles this type of multidimensional aggregation.

Pivot Table Syntax

Here is the equivalent to the preceding operation using the DataFrame.pivot_table method

# In[4]
titanic.pivot_table('survived',index='sex',columns='class',aggfunc='mean')

# Out[4]
class	First	    Second	    Third
sex			
female	0.968085	0.921053	0.500000
male	0.368852	0.157407	0.135447

Multilevel Pivot Tables

Just as in a groupby, the grouping in pivot tables can be specified with multiple levels and via a number of options.
We'll bin the age using the pd.cut function.

# In[5]
age=pd.cut(titanic['age'],[0,18,80])
titanic.pivot_table('survived',['sex',age],'class')

# Out[5]
class	        First	    Second	    Third
sex	   age			
female	(0, 18]	0.909091	1.000000	0.511628
       (18, 80]	0.972973	0.900000	0.423729
male	(0, 18]	0.800000	0.600000	0.215686
       (18, 80]	0.375000	0.071429	0.133663

We can apply the same strategy when working with the columns as well; let's add info on the fare paid, using pd.qcut to automatically compute quantiles.

# In[6]
fare=pd.qcut(titanic['fare'],2)
titanic.pivot_table('survived',['sex',age],[fare,'class'])

# Out[6]
fare	       (-0.001, 14.454]	                  (14.454, 512.329]
class	       First	  Second	   Third	   First	  Second	   Third
sex	    age						
female	(0, 18]	 NaN	1.000000	0.714286	0.909091	1.000000	0.318182
       (18, 80]	 NaN	0.880000	0.444444	0.972973	0.914286	0.391304
male	(0, 18]	 NaN	0.000000	0.260870	0.800000	0.818182	0.178571
       (18, 80]	 0.0	0.098039	0.125000	0.391304	0.030303	0.192308

The result is a four-dimensional aggregation with hierarchical indices, shown in a grid demonstrating the relationship between the values.

If you don't know about the pd.cut and pd.qcut, please reference this url :
Difference of pd.cut() and pd.qcut()

Additional Pivot Table Options

The full call signature of the DataFrame.pivot_table method is as follows.

DataFrame.pivot_table(data, values=None, index=None, columns=None, 
aggfunc='mean', fill_value=None, margins=False, dropna=True, 
margins_name='All', observed=False, sort=True)

fill_value and dropna have to do with missing data and are fairly straightforward.

The aggfunc keyword controls what type of aggregation is applied, which is mean by default.
As with groupby, the aggregation specification can be a string representing one of several common choices like sum , mean, count etc.
Or a function that implements an aggregation like np.sum(), min(), sum() etc.
Additionally, it can be specified as a dictionary mapping a column to any of the desired options.

# In[7]
titanic.pivot_table(index='sex',columns='class',
aggfunc={'survived':sum,'fare':'mean'})

# Out[7]
        fare	                            survived
class	First	    Second	    Third	    First	Second	Third
sex						
female	106.125798	21.970121	16.118810	91	    70	    72
male	 67.226127	19.741782	12.661633	45	    17	    47

We've omitted the values keyword here; when specifying a mapping for aggfunc, this is determined automatically.
At times it's useful to compute totals along each grouping. This can be done via the margins keyword.

# In[8]
titanic.pivot_table('survived',index='sex',columns='class',margins=True)

# Out[8]
class	First	    Second	    Third	    All
sex				
female	0.968085	0.921053	0.500000	0.742038
male	0.368852	0.157407	0.135447	0.188908
All	    0.629630	0.472826	0.242363	0.383838

The margin label can be specified with the margins_name keyword; it defaults to "All"

노정훈

이전 포스트

Aggregation and Grouping

다음 포스트

Pivot Tables

Pandas

Motivating Pivot Tables

Pivot Tables by Hand

Pivot Table Syntax

Multilevel Pivot Tables

Additional Pivot Table Options

Aggregation and Grouping

Vectorized String Operations

0개의 댓글