# In[1]
titanic=sns.load_dataset('titanic')
titanic.head()
# Out[1]
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
# In[2]
titanic.groupby('sex')[['survived']].mean() # make a DataFrame
"""titanic.groupby('sex')['survived'].mean() return Series"""
# Out[2]
survived
sex
female 0.742038
male 0.188908
# In[3]
titanic.groupby(['sex','class'])['survived'].aggregate('mean').unstack()
# Out[3]
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
pivot_table
, which succinctly(간단명료한) handles this type of multidimensional aggregation.DataFrame.pivot_table
method# In[4]
titanic.pivot_table('survived',index='sex',columns='class',aggfunc='mean')
# Out[4]
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
pd.cut
function.# In[5]
age=pd.cut(titanic['age'],[0,18,80])
titanic.pivot_table('survived',['sex',age],'class')
# Out[5]
class First Second Third
sex age
female (0, 18] 0.909091 1.000000 0.511628
(18, 80] 0.972973 0.900000 0.423729
male (0, 18] 0.800000 0.600000 0.215686
(18, 80] 0.375000 0.071429 0.133663
pd.qcut
to automatically compute quantiles.# In[6]
fare=pd.qcut(titanic['fare'],2)
titanic.pivot_table('survived',['sex',age],[fare,'class'])
# Out[6]
fare (-0.001, 14.454] (14.454, 512.329]
class First Second Third First Second Third
sex age
female (0, 18] NaN 1.000000 0.714286 0.909091 1.000000 0.318182
(18, 80] NaN 0.880000 0.444444 0.972973 0.914286 0.391304
male (0, 18] NaN 0.000000 0.260870 0.800000 0.818182 0.178571
(18, 80] 0.0 0.098039 0.125000 0.391304 0.030303 0.192308
If you don't know about the pd.cut
and pd.qcut
, please reference this url :
Difference of pd.cut() and pd.qcut()
DataFrame.pivot_table
method is as follows.DataFrame.pivot_table(data, values=None, index=None, columns=None,
aggfunc='mean', fill_value=None, margins=False, dropna=True,
margins_name='All', observed=False, sort=True)
fill_value
and dropna
have to do with missing data and are fairly straightforward.aggfunc
keyword controls what type of aggregation is applied, which is mean by default.sum
, mean
, count
etc.np.sum()
, min()
, sum()
etc.# In[7]
titanic.pivot_table(index='sex',columns='class',
aggfunc={'survived':sum,'fare':'mean'})
# Out[7]
fare survived
class First Second Third First Second Third
sex
female 106.125798 21.970121 16.118810 91 70 72
male 67.226127 19.741782 12.661633 45 17 47
We've omitted the values
keyword here; when specifying a mapping for aggfunc
, this is determined automatically.
At times it's useful to compute totals along each grouping. This can be done via the margins
keyword.
# In[8]
titanic.pivot_table('survived',index='sex',columns='class',margins=True)
# Out[8]
class First Second Third All
sex
female 0.968085 0.921053 0.500000 0.742038
male 0.368852 0.157407 0.135447 0.188908
All 0.629630 0.472826 0.242363 0.383838
margins_name
keyword; it defaults to "All"