Hierarchical Indexing

노정훈 · July 23, 2023

Pandas


A Multiply Indexed Series

Bad Way

# In[1]
import numpy as np
import pandas as pd

index=[('California',2010),('California',2020),('New York',2010),('New York',2020),('Texas',2010),('Texas',2020)]
populations=[37253956,39538223,19378102,20201249,25145561,29145505]
pop=pd.Series(populations,index=index)
pop
# Out[1]
(California, 2010)    37253956
(California, 2020)    39538223
(New York, 2010)      19378102
(New York, 2020)      20201249
(Texas, 2010)         25145561
(Texas, 2020)         29145505
dtype: int64
  • With this indexing scheme, you can straightforwardly index or slice the series based on this tuple index.
# In[2]
pop[('California',2020):('Texas',2010)]
# Out[2]
(California, 2020)    39538223
(New York, 2010)      19378102
(New York, 2020)      20201249
(Texas, 2010)         25145561
dtype: int64
  • But the convenience ends there. If you need, for example, to select all values from 2020, you'll have to do some messy munging to make it happen.
  • The result will not be as clean as the slicing syntax we've learned in Pandas.
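To make the "messy munging" concrete, here is a minimal sketch of selecting every 2020 value from the tuple-indexed series. It rebuilds the series from In[1]; the names `is_2020` and `pop_2020` are illustrative and not part of the original post.

```python
import pandas as pd

# Rebuild the tuple-indexed series from In[1]
index = [('California', 2010), ('California', 2020),
         ('New York', 2010), ('New York', 2020),
         ('Texas', 2010), ('Texas', 2020)]
populations = [37253956, 39538223, 19378102, 20201249, 25145561, 29145505]
pop = pd.Series(populations, index=index)

# Inspect every tuple label by hand and build a Boolean mask for the year 2020
is_2020 = [label[1] == 2020 for label in pop.index]
pop_2020 = pop[is_2020]
print(pop_2020)
```

This works, but it loops over the labels in Python and hides the intent, which is what the MultiIndex approach in the next section avoids.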

Better Way: Pandas MultiIndex

  • We can create a multi-index from the tuples.
# In[3]
index=pd.MultiIndex.from_tuples(index)
  • The MultiIndex represents multiple levels of indexing as well as multiple labels for each data point which encode these levels.

  • If we reindex our series with MultiIndex, we see the hierarchical representation of the data.

# In[4]
pop=pop.reindex(index)
pop
# Out[4]
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64
  • Some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.

  • We can now use Pandas slicing notation to access all data for which the second index is 2020.

# In[5]
pop[:,2020]
# Out[5]
California    39538223
New York      20201249
Texas         29145505
dtype: int64

MultiIndex as Extra Dimension

  • The unstack method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame.
# In[6]
pop_df=pop.unstack()
pop_df
# Out[6]
                2010      2020
California  37253956  39538223
New York    19378102  20201249
Texas       25145561  29145505
  • The stack method provides the opposite operation.
# In[7]
pop_df.stack()
# Out[7]
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64
  • Just as we were able to use multi-indexing to manipulate two-dimensional data within a one-dimensional Series, we can also use it to manipulate data of three or more dimensions in a Series or DataFrame.
  • Each extra level in a multi-index represents an extra dimension of data.
  • We might want to add another column of data for each state and year (here, the population under 18); with a MultiIndex this is as easy as adding another column to the DataFrame.
# In[8]
pop_df=pd.DataFrame({'total':pop,'under18':[9284094,8898092,4318033,4181528,6879014,7432474]})
pop_df
# Out[8]
                    total  under18
California 2010  37253956  9284094
           2020  39538223  8898092
New York   2010  19378102  4318033
           2020  20201249  4181528
Texas      2010  25145561  6879014
           2020  29145505  7432474
  • In addition, all the ufuncs and other functionality work with hierarchical indices as well.
# In[9]
f_u18=pop_df['under18']/pop_df['total']
f_u18.unstack()
# Out[9]
                2010      2020
California  0.249211  0.225050
New York    0.222831  0.206994
Texas       0.273568  0.255013

For more information about the stack and unstack methods, see these references:
1. Pandas stack/unstack
2. Data reconstruction using stack/unstack

Methods of MultiIndex Creation

  • The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor.
# In[10]
df=pd.DataFrame(np.random.rand(4,2),index=[['a','a','b','b'],[1,2,1,2]],columns=['data1','data2'])
df
# Out[10]
        data1     data2
a 1  0.627660  0.158404
  2  0.181580  0.043981
b 1  0.297599  0.338398
  2  0.592384  0.886842
  • If you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default.
# In[11]
data={('California',2010):37253956,('California',2020):39538223,('New York',2010):19378102,('New York',2020):20201249,('Texas',2010):25145561,('Texas',2020):29145505}
pd.Series(data)
# Out[11]
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64

Explicit MultiIndex Constructors

  • For more flexibility in how the index is constructed, you can instead use the constructor methods available in the pd.MultiIndex class.
  • You can construct a MultiIndex from a simple list of arrays giving the index values within each level.
# In[12]
pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])
# Out[12]
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
  • You can construct it from a list of tuples giving the multiple index values of each point.
# In[13]
pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])
# Out[13]
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
  • You can even construct it from a Cartesian product (데카르트 곱) of single indices.
# In[14]
pd.MultiIndex.from_product([['a','b'],[1,2]])
# Out[14]
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
  • Similarly, you can construct a MultiIndex directly using its internal encoding by passing levels (a list of lists containing the available index values for each level) and codes (a list of lists that reference these labels).
# In[15]
pd.MultiIndex(levels=[['a','b'],[1,2]],codes=[[0,0,1,1],[0,1,0,1]])
# Out[15]
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
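As a quick check of the constructors above, the following sketch builds the same index four ways and then inspects the internal encoding through the `levels` and `codes` attributes. The variable names are illustrative only.

```python
import pandas as pd

# Four constructions of the same two-level index
from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
explicit = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                         codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

# Compare each alternative against the from_arrays result
same = all(from_arrays.equals(other)
           for other in (from_tuples, from_product, explicit))
print(same)

# levels holds the unique values per level; codes holds integer positions into them
print(from_arrays.levels)
print(from_arrays.codes)
```

Each entry in `codes` is a position into the matching entry of `levels`, which is why the explicit constructor can describe the index without repeating the labels.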

MultiIndex Level Names

  • Sometimes it is convenient to name the levels of the MultiIndex.
  • This can be accomplished by passing the names argument to any of the previously discussed MultiIndex constructors, or by setting the names attribute of the index.
# In[16]
pop.index.names=['state','year']
pop
# Out[16]
state       year
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64
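The cell above sets the names attribute after the fact. A minimal sketch of the other option, passing names directly to a constructor, might look like this (the variable name `named_index` is illustrative):

```python
import pandas as pd

# Name the levels at construction time instead of assigning index.names later
named_index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]],
    names=['state', 'year'],
)
print(named_index.names)
```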

MultiIndex for Columns

# In[17]
# hierarchical indices and columns
index=pd.MultiIndex.from_product([[2013,2014],[1,2]],names=['year','visit'])
columns=pd.MultiIndex.from_product([['Bob','Guido','Sue'],['HR','Temp']],names=['subject','type'])

# mock some data
data=np.round(np.random.randn(4,6),1)
data[:, ::2]*=10
data+=37

# create the DataFrame
health_data=pd.DataFrame(data,index=index,columns=columns)
health_data
# Out[17]
subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      41.0  37.2  36.0  36.0  43.0  37.9
     2      50.0  35.8  29.0  35.8  41.0  37.8
2014 1      34.0  36.5  34.0  37.2  58.0  38.1
     2      40.0  37.0  43.0  36.0  21.0  38.8
  • This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number.
  • We can index the top-level column by the person's name and get a full DataFrame containing just that person's information.
# In[18]
health_data['Guido']
# Out[18]
type          HR  Temp
year visit
2013 1      36.0  36.0
     2      29.0  35.8
2014 1      34.0  37.2
     2      43.0  36.0
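One way to see the four dimensions is to count the index levels on both axes. The sketch below rebuilds a frame with the same layout as health_data, using zeros as placeholder values, and lists the level names. The variable `dims` is illustrative.

```python
import numpy as np
import pandas as pd

# Same row and column layout as health_data; the values are placeholders
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
health_data = pd.DataFrame(np.zeros((4, 6)), index=index, columns=columns)

# Two row levels plus two column levels give four dimensions in total
dims = list(health_data.index.names) + list(health_data.columns.names)
print(dims)
print(health_data.index.nlevels + health_data.columns.nlevels)
```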

Indexing and Slicing a MultiIndex

Multiply Indexed Series

# In[19]
pop
# Out[19]
state       year
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64
  • We can access single elements by indexing with multiple terms.
# In[20]
pop['California',2010]
# Out[20]
37253956
  • The MultiIndex also supports partial indexing, or indexing just one of the levels in the index.
# In[21]
pop['California']
# Out[21]
year
2010    37253956
2020    39538223
dtype: int64
  • Partial slicing is available as well, as long as the MultiIndex is sorted.
# In[22]
pop.loc['California':'New York']
# Out[22]
state       year
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
dtype: int64
  • With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index.
# In[23]
pop[:,2010]
# Out[23]
state
California    37253956
New York      19378102
Texas         25145561
dtype: int64
  • Other types of indexing and selection work as well, such as selection with a Boolean mask and selection with a list of labels (fancy indexing).
# In[24]
pop[pop>22000000]
# Out[24]
state       year
California  2010    37253956
            2020    39538223
Texas       2010    25145561
            2020    29145505
dtype: int64

# In[25]
pop[['California','Texas']]
# Out[25]
state       year
California  2010    37253956
            2020    39538223
Texas       2010    25145561
            2020    29145505
dtype: int64

Multiply Indexed DataFrames

# In[26]
health_data
# Out[26]
subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      41.0  37.2  36.0  36.0  43.0  37.9
     2      50.0  35.8  29.0  35.8  41.0  37.8
2014 1      34.0  36.5  34.0  37.2  58.0  38.1
     2      40.0  37.0  43.0  36.0  21.0  38.8
  • The syntax used for multiply indexed Series applies to the columns.
# In[27]
health_data['Guido','HR']
# Out[27]
year  visit
2013  1        36.0
      2        29.0
2014  1        34.0
      2        43.0
Name: (Guido, HR), dtype: float64
  • Also, as with the single-index case, we can use the loc and iloc indexers. (The older ix indexer was removed in Pandas 1.0.)
# In[28]
health_data.iloc[:2,:2]
# Out[28]
subject      Bob
type          HR  Temp
year visit
2013 1      41.0  37.2
     2      50.0  35.8
  • These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in loc and iloc can be passed a tuple of multiple indices.
# In[29]
health_data.loc[:,('Bob','HR')]
# Out[29]
year  visit
2013  1        41.0
      2        50.0
2014  1        34.0
      2        40.0
Name: (Bob, HR), dtype: float64
  • Working with slices within these index tuples is not especially convenient.
  • If you try to create a slice within a tuple, it will lead to a syntax error.
  • You could get around this by building the desired slice explicitly using Python's built-in slice function, but a better way in this context is to use an IndexSlice object, which Pandas provides for precisely this situation.
# In[30]
idx=pd.IndexSlice
health_data.loc[idx[:,1],idx[:,'HR']]
# Out[30]
subject      Bob Guido   Sue
type          HR    HR    HR
year visit
2013 1      41.0  36.0  43.0
2014 1      34.0  34.0  58.0
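For comparison, here is a sketch of the built-in slice workaround mentioned above, next to the IndexSlice form. It rebuilds a frame with the same layout as health_data using random mock values; `with_slice` and `with_idx` are illustrative names.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like health_data (values are random mock data)
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
health_data = pd.DataFrame(np.random.randn(4, 6), index=index, columns=columns)

# slice(None) is the explicit spelling of a bare ":" inside a tuple
with_slice = health_data.loc[(slice(None), 1), (slice(None), 'HR')]

# pd.IndexSlice allows the ":" notation directly
idx = pd.IndexSlice
with_idx = health_data.loc[idx[:, 1], idx[:, 'HR']]

# Both spellings select the first visit of each year, HR columns only
print(with_slice.equals(with_idx))
```

Both forms select the same rows and columns; the IndexSlice version simply reads closer to ordinary slicing syntax.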

Rearranging Multi-Indexes

Sorted and Unsorted Indices

  • Many of the MultiIndex slicing operations will fail if the index is not sorted.
# In[31]
index=pd.MultiIndex.from_product([['a','c','b'],[1,2]])
data=pd.Series(np.random.rand(6),index=index)
data.index.names=['char','int']
data
# Out[31]
char  int
a     1      0.601307
      2      0.623240
c     1      0.194030
      2      0.969886
b     1      0.931100
      2      0.700467
dtype: float64
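  • You can't take a partial slice of this index.
  • For various reasons, partial slices and other similar operations require the levels in the MultiIndex to be in sorted order.
  • Pandas provides the sort_index method of Series and DataFrame for this purpose. (The older sortlevel method was deprecated and is not available on Series or DataFrame in recent Pandas versions.)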
# In[32]
data=data.sort_index()
data
# Out[32]
char  int
a     1      0.601307
      2      0.623240
b     1      0.931100
      2      0.700467
c     1      0.194030
      2      0.969886
dtype: float64
  • With the index sorted in this way, partial slicing will work as expected.
# In[33]
data['a':'b']
# Out[33]
char  int
a     1      0.601307
      2      0.623240
b     1      0.931100
      2      0.700467
dtype: float64
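The failure on the unsorted index and the fix can be reproduced in one self-contained sketch. It uses loc for the label slice and catches KeyError, on the assumption that the error Pandas raises here is a KeyError subclass. The exact message may vary between versions, so none is shown.

```python
import numpy as np
import pandas as pd

# Same unsorted layout as In[31]: the first level is in the order a, c, b
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
data = pd.Series(np.random.rand(6), index=index)

# A partial slice on the unsorted index is expected to raise
try:
    data.loc['a':'b']
    failed = False
except KeyError as err:
    failed = True
    print(type(err).__name__, err)

# After sorting the index, the same slice succeeds
sorted_data = data.sort_index()
print(sorted_data.loc['a':'b'])
```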

Stacking and Unstacking Indices

  • It is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use.
  • If you pass the level option, the specified index level is moved to the columns.
# In[34]
pop.unstack()
# Out[34]
year            2010      2020
state
California  37253956  39538223
New York    19378102  20201249
Texas       25145561  29145505

# In[35]
pop.unstack(level=0) # pop.unstack(level='state')
# Out[35]
state  California  New York     Texas
year
2010     37253956  19378102  25145561
2020     39538223  20201249  29145505

# In[36]
pop.unstack(level=1) # pop.unstack(level='year')
# Out[36]
year            2010      2020
state
California  37253956  39538223
New York    19378102  20201249
Texas       25145561  29145505
  • The opposite of unstack is stack, which can be used to recover the original series.
# In[37]
pop.unstack().stack()
# Out[37]
state       year
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64

Index Setting and Resetting

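  • Another way to rearrange hierarchical data is to turn the index labels into columns with the reset_index method. The name argument sets the name of the resulting data column.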
# In[38]
pop_flat=pop.reset_index(name='population')
pop_flat
# Out[38]
        state  year  population
0  California  2010    37253956
1  California  2020    39538223
2    New York  2010    19378102
3    New York  2020    20201249
4       Texas  2010    25145561
5       Texas  2020    29145505
  • A common pattern is to build a MultiIndex from column values.
  • This can be done with the set_index method of the DataFrame, which returns a multiply indexed DataFrame.
# In[39]
pop_flat.set_index(['state','year'])
# Out[39]
                 population
state      year
California 2010    37253956
           2020    39538223
New York   2010    19378102
           2020    20201249
Texas      2010    25145561
           2020    29145505
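As a sanity check on the two methods, this sketch flattens the population series with reset_index and then rebuilds it with set_index, confirming that the round trip preserves the index and the values. The name `restored` is illustrative.

```python
import pandas as pd

# Rebuild the named, multiply indexed population series
data = {('California', 2010): 37253956, ('California', 2020): 39538223,
        ('New York', 2010): 19378102, ('New York', 2020): 20201249,
        ('Texas', 2010): 25145561, ('Texas', 2020): 29145505}
pop = pd.Series(data)
pop.index.names = ['state', 'year']

# Flatten the index into columns, then rebuild the MultiIndex from those columns
pop_flat = pop.reset_index(name='population')
restored = pop_flat.set_index(['state', 'year'])['population']

print(restored.index.equals(pop.index))
print((restored.values == pop.values).all())
```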