# In[1]
import numpy as np
import pandas as pd
index=[('California',2010),('California',2020),('New York',2010),('New York',2020),('Texas',2010),('Texas',2020)]
populations=[37253956,39538223,19378102,20201249,25145561,29145505]
pop=pd.Series(populations,index=index)
pop
# Out[1]
(California, 2010) 37253956
(California, 2020) 39538223
(New York, 2010) 19378102
(New York, 2020) 20201249
(Texas, 2010) 25145561
(Texas, 2020) 29145505
dtype: int64
# In[2]
pop[('California',2020):('Texas',2010)]
# Out[2]
(California, 2020) 39538223
(New York, 2010) 19378102
(New York, 2020) 20201249
(Texas, 2010) 25145561
dtype: int64
# In[3]
index=pd.MultiIndex.from_tuples(index)
The MultiIndex represents multiple levels of indexing (here, the state names and the years), together with the labels for each data point that encode these levels.
If we reindex our series with this MultiIndex, we see the hierarchical representation of the data.
# In[4]
pop=pop.reindex(index)
pop
# Out[4]
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64
Some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.
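To see how this hierarchy is stored, the following minimal sketch rebuilds the same index and inspects its levels and codes attributes. The names tuples and mi are introduced here only for illustration.

```python
import pandas as pd

# Same (state, year) tuples as above; `tuples` and `mi` are illustrative names.
tuples = [('California', 2010), ('California', 2020),
          ('New York', 2010), ('New York', 2020),
          ('Texas', 2010), ('Texas', 2020)]
mi = pd.MultiIndex.from_tuples(tuples)

# Each level stores its unique values only once.
state_level = mi.levels[0].tolist()
year_level = mi.levels[1].tolist()

# The codes are integer positions into the levels, one entry per row.
state_codes = mi.codes[0].tolist()
year_codes = mi.codes[1].tolist()

print(state_level, year_level)
print(state_codes, year_codes)
```

The repeated state names are therefore not stored once per row, which is also why the printed representation can leave them blank.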
We can also use Pandas slicing notation, for example to select all data for which the second index is 2020:
# In[5]
pop[:,2020]
# Out[5]
California 39538223
New York 20201249
Texas 29145505
dtype: int64
The unstack method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame.
# In[6]
pop_df=pop.unstack()
pop_df
# Out[6]
                2010      2020
California  37253956  39538223
New York    19378102  20201249
Texas       25145561  29145505
The stack method provides the opposite operation.
# In[7]
pop_df.stack()
# Out[7]
California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
Texas 2010 25145561
2020 29145505
dtype: int64
# In[8]
pop_df=pd.DataFrame({'total':pop,'under18':[9284094,8898092,4318033,4181528,6879014,7432474]})
pop_df
# Out[8]
                    total  under18
California 2010  37253956  9284094
           2020  39538223  8898092
New York   2010  19378102  4318033
           2020  20201249  4181528
Texas      2010  25145561  6879014
           2020  29145505  7432474
# In[9]
f_u18=pop_df['under18']/pop_df['total']
f_u18.unstack()
# Out[9]
2010 2020
California 0.249211 0.225050
New York 0.222831 0.206994
Texas 0.273568 0.255013
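As a sanity check on the fractions above, the sketch below recomputes one entry by hand and compares it with the value obtained through the multiply indexed DataFrame. It rebuilds pop_df from the same numbers; the names ca_2010 and manual are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]])
pop_df = pd.DataFrame(
    {'total': [37253956, 39538223, 19378102, 20201249, 25145561, 29145505],
     'under18': [9284094, 8898092, 4318033, 4181528, 6879014, 7432474]},
    index=index)

# Element-wise division aligns on the shared MultiIndex.
f_u18 = pop_df['under18'] / pop_df['total']

# Pick one entry by its full (state, year) key and compare it
# with the same ratio computed directly from the raw numbers.
ca_2010 = f_u18.loc[('California', 2010)]
manual = 9284094 / 37253956

print(ca_2010, manual)
print(f_u18.unstack().shape)
```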
For more information about the stack and unstack methods, see these references:
1. Pandas stack/unstack
2. Data reconstruction using stack/unstack
# In[10]
df=pd.DataFrame(np.random.rand(4,2),index=[['a','a','b','b'],[1,2,1,2]],columns=['data1','data2'])
df
# Out[10]
data1 data2
a 1 0.627660 0.158404
2 0.181580 0.043981
b 1 0.297599 0.338398
2 0.592384 0.886842
# In[11]
data={('California',2010):37253956,('California',2020):39538223,('New York',2010):19378102,('New York',2020):20201249,('Texas',2010):25145561,('Texas',2020):29145505}
pd.Series(data)
# Out[11]
California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
Texas 2010 25145561
2020 29145505
dtype: int64
An index can also be built explicitly with the constructor methods of the pd.MultiIndex class.
# In[12]
pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])
# Out[12]
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
# In[13]
pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])
# Out[13]
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
# In[14]
pd.MultiIndex.from_product([['a','b'],[1,2]])
# Out[14]
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
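The three constructors above print the same index. A short sketch that confirms this programmatically, using the equals method of the index (the variable names are illustrative):

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# equals() compares the index entries of the two objects.
all_equal = (from_arrays.equals(from_tuples)
             and from_tuples.equals(from_product))
print(all_equal)
```

Which constructor to use is mostly a matter of the shape the labels already have: parallel arrays, a list of tuples, or the factors of a Cartesian product.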
You can also construct the MultiIndex directly by passing levels (a list of lists containing available index values for each level) and codes (a list of lists that reference these labels).
# In[15]
pd.MultiIndex(levels=[['a','b'],[1,2]],codes=[[0,0,1,1],[0,1,0,1]])
# Out[15]
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
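To make the relationship between levels and codes concrete, the sketch below decodes the codes by hand and compares the result with the entries of the MultiIndex. The name decoded is introduced only for this illustration.

```python
import pandas as pd

levels = [['a', 'b'], [1, 2]]
codes = [[0, 0, 1, 1], [0, 1, 0, 1]]
mi = pd.MultiIndex(levels=levels, codes=codes)

# Each row of the index is found by looking up one code per level.
decoded = [(levels[0][i], levels[1][j])
           for i, j in zip(codes[0], codes[1])]

print(decoded)
print(mi.tolist())
```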
The levels can be named by passing the names argument to any of the previously discussed MultiIndex constructors, or by setting the names attribute of the index.
# In[16]
pop.index.names=['state','year']
pop
# Out[16]
state year
California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
Texas 2010 25145561
2020 29145505
dtype: int64
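The other option mentioned above, passing names to a constructor, could look like the following sketch. It rebuilds the population series from scratch, so the variable names are local to this example.

```python
import pandas as pd

# Name the levels at construction time instead of assigning index.names later.
index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]],
    names=['state', 'year'])
pop = pd.Series(
    [37253956, 39538223, 19378102, 20201249, 25145561, 29145505],
    index=index)

names = list(pop.index.names)

# Once named, a level can be referred to by name rather than by position.
years = pop.index.get_level_values('year').tolist()

print(names)
print(years)
```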
# In[17]
# hierarchical indices and columns
index=pd.MultiIndex.from_product([[2013,2014],[1,2]],names=['year','visit'])
columns=pd.MultiIndex.from_product([['Bob','Guido','Sue'],['HR','Temp']],names=['subject','type'])
# mock some data
data=np.round(np.random.randn(4,6),1)
data[:, ::2]*=10
data+=37
# create the DataFrame
health_data=pd.DataFrame(data,index=index,columns=columns)
health_data
# Out[17]
subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      41.0  37.2  36.0  36.0  43.0  37.9
     2      50.0  35.8  29.0  35.8  41.0  37.8
2014 1      34.0  36.5  34.0  37.2  58.0  38.1
     2      40.0  37.0  43.0  36.0  21.0  38.8
# In[18]
health_data['Guido']
# Out[18]
type HR Temp
year visit
2013 1 36.0 36.0
2 29.0 35.8
2014 1 34.0 37.2
2 43.0 36.0
# In[19]
pop
# Out[19]
state year
California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
Texas 2010 25145561
2020 29145505
dtype: int64
# In[20]
pop['California',2010]
# Out[20]
37253956
# In[21]
pop['California']
# Out[21]
year
2010 37253956
2020 39538223
dtype: int64
# In[22]
pop.loc['California':'New York']
# Out[22]
state year
California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
dtype: int64
# In[23]
pop[:,2010]
# Out[23]
state
California 37253956
New York 19378102
Texas 25145561
dtype: int64
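Selecting on an inner level can also be written with the xs (cross-section) method, which takes the level by name. This is a sketch of an alternative spelling, not something the text above relies on; it rebuilds pop so that it runs on its own.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]],
    names=['state', 'year'])
pop = pd.Series(
    [37253956, 39538223, 19378102, 20201249, 25145561, 29145505],
    index=index)

# Partial slicing: every state, year fixed to 2010.
by_slice = pop.loc[:, 2010]

# The same selection, naming the level explicitly.
by_xs = pop.xs(2010, level='year')

print(by_xs)
```

Naming the level can be easier to read when an index has more than two levels and positional slices become hard to follow.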
# In[24]
pop[pop>22000000]
# Out[24]
state year
California 2010 37253956
2020 39538223
Texas 2010 25145561
2020 29145505
dtype: int64
# In[25]
pop[['California','Texas']]
# Out[25]
state year
California 2010 37253956
2020 39538223
Texas 2010 25145561
2020 29145505
dtype: int64
# In[26]
health_data
# Out[26]
subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      41.0  37.2  36.0  36.0  43.0  37.9
     2      50.0  35.8  29.0  35.8  41.0  37.8
2014 1      34.0  36.5  34.0  37.2  58.0  38.1
     2      40.0  37.0  43.0  36.0  21.0  38.8
# In[27]
health_data['Guido','HR']
# Out[27]
year visit
2013 1 36.0
2 29.0
2014 1 34.0
2 43.0
Name: (Guido, HR), dtype: float64
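As with the single-index case, the loc and iloc indexers can also be used. (The ix indexer mentioned in older tutorials was removed in pandas 1.0.)
# In[28]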
health_data.iloc[:2,:2]
# Out[28]
subject Bob
type HR Temp
year visit
2013 1 41.0 37.2
2 50.0 35.8
Each individual index in loc or iloc can be passed a tuple of multiple indices.
# In[29]
health_data.loc[:,('Bob','HR')]
# Out[29]
year visit
2013 1 41.0
2 50.0
2014 1 34.0
2 40.0
Name: (Bob, HR), dtype: float64
Slices inside these index tuples can be built explicitly with Python's built-in slice function, but a better way in this context is to use an IndexSlice object, which Pandas provides for precisely this situation.
# In[30]
idx=pd.IndexSlice
health_data.loc[idx[:,1],idx[:,'HR']]
# Out[30]
subject Bob Guido Sue
type HR HR HR
year visit
2013 1 41.0 36.0 43.0
2014 1 34.0 34.0 58.0
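For comparison, the sketch below writes the same selection both ways, once with IndexSlice and once with explicit slice(None) objects. It uses a deterministic stand-in for health_data (values from np.arange rather than the random data above), which is an assumption made so that the result is reproducible.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [[2013, 2014], [1, 2]], names=['year', 'visit'])
columns = pd.MultiIndex.from_product(
    [['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type'])

# Stand-in data: 0.0 .. 23.0 laid out row by row.
health_data = pd.DataFrame(
    np.arange(24.0).reshape(4, 6), index=index, columns=columns)

idx = pd.IndexSlice
with_indexslice = health_data.loc[idx[:, 1], idx[:, 'HR']]

# slice(None) plays the role of a bare ":" inside a tuple.
with_slice = health_data.loc[(slice(None), 1), (slice(None), 'HR')]

print(with_indexslice.equals(with_slice))
```

Both forms select the first visit of each year and the HR column of each subject; IndexSlice simply lets the usual colon syntax appear inside the tuples.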
# In[31]
index=pd.MultiIndex.from_product([['a','c','b'],[1,2]])
data=pd.Series(np.random.rand(6),index=index)
data.index.names=['char','int']
data
# Out[31]
char int
a 1 0.601307
2 0.623240
c 1 0.194030
2 0.969886
b 1 0.931100
2 0.700467
dtype: float64
Pandas provides the sort_index method for sorting a MultiIndex; the older sortlevel method was deprecated in favor of sort_index(level=...).
# In[32]
data=data.sort_index()
data
# Out[32]
char int
a 1 0.601307
2 0.623240
b 1 0.931100
2 0.700467
c 1 0.194030
2 0.969886
dtype: float64
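Why sorting matters can be sketched as follows: partial slices on a MultiIndex generally require the index to be sorted. The example uses deterministic values instead of random ones, and the flag names are illustrative. Whether the unsorted slice raises may depend on the pandas version, so the sketch records the outcome instead of assuming it.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['a', 'c', 'b'], [1, 2]], names=['char', 'int'])
data = pd.Series(np.arange(6.0), index=index)

# The labels run a, c, b, so the index is not in sorted order.
was_sorted = data.index.is_monotonic_increasing

# On an unsorted MultiIndex this slice typically raises
# UnsortedIndexError, which is a subclass of KeyError.
try:
    data.loc['a':'b']
    slice_failed = False
except KeyError:
    slice_failed = True

# After sorting, the partial slice is well defined.
sorted_data = data.sort_index()
now_sorted = sorted_data.index.is_monotonic_increasing
partial = sorted_data.loc['a':'b']

print(was_sorted, slice_failed, now_sorted)
print(partial)
```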
# In[33]
data['a':'b']
# Out[33]
char int
a 1 0.601307
2 0.623240
b 1 0.931100
2 0.700467
dtype: float64
With the level option of unstack, the index level named in the level parameter is moved to the columns.
# In[34]
pop.unstack()
# Out[34]
year 2010 2020
state
California 37253956 39538223
New York 19378102 20201249
Texas 25145561 29145505
# In[35]
pop.unstack(level=0) # pop.unstack(level='state')
# Out[35]
state California New York Texas
year
2010 37253956 19378102 25145561
2020 39538223 20201249 29145505
# In[36]
pop.unstack(level=1) # pop.unstack(level='year')
# Out[36]
year 2010 2020
state
California 37253956 39538223
New York 19378102 20201249
Texas 25145561 29145505
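The two unstack calls above are transposes of one another, and a level can be given by name as well as by position. A small sketch that checks both claims on a rebuilt copy of pop (variable names are illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]],
    names=['state', 'year'])
pop = pd.Series(
    [37253956, 39538223, 19378102, 20201249, 25145561, 29145505],
    index=index)

wide_by_state = pop.unstack(level=0)   # states become the columns
wide_by_year = pop.unstack(level=1)    # years become the columns

# A level can be addressed by name instead of by position.
same_by_name = pop.unstack(level='state').equals(wide_by_state)

# Unstacking the other level gives the transposed table.
is_transpose = wide_by_state.equals(wide_by_year.T)

print(same_by_name, is_transpose)
```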
The opposite of unstack is stack, which can be used to recover the original series.
# In[37]
pop.unstack().stack()
# Out[37]
state year
California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
Texas 2010 25145561
2020 29145505
dtype: int64
The index labels can be turned into ordinary columns with the reset_index method.
# In[38]
pop_flat=pop.reset_index(name='population')
pop_flat
# Out[38]
state year population
0 California 2010 37253956
1 California 2020 39538223
2 New York 2010 19378102
3 New York 2020 20201249
4 Texas 2010 25145561
5 Texas 2020 29145505
The reverse can be done with the set_index method of the DataFrame, which returns a multiply indexed DataFrame.
# In[39]
pop_flat.set_index(['state','year'])
# Out[39]
population
state year
California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
Texas 2010 25145561
2020 29145505
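Taken together, reset_index and set_index form a round trip. The sketch below checks that on a rebuilt copy of the population series; giving the series the name population is a choice made here so that the flattened column gets that name without the name argument.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]],
    names=['state', 'year'])
pop = pd.Series(
    [37253956, 39538223, 19378102, 20201249, 25145561, 29145505],
    index=index, name='population')

# Flatten: the index levels become ordinary columns.
pop_flat = pop.reset_index()

# Rebuild the hierarchy from those columns and pull the data column back out.
restored = pop_flat.set_index(['state', 'year'])['population']

print(list(pop_flat.columns))
print(restored.equals(pop))
```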