Hierarchical Indexing

노정훈 · July 23, 2023

Pandas


A Multiply Indexed Series

Bad Way

# In[1]
import numpy as np
import pandas as pd

# use (state, year) tuples as index labels
index=[('California',2010),('California',2020),('New York',2010),('New York',2020),('Texas',2010),('Texas',2020)]
populations=[37253956,39538223,19378102,20201249,25145561,29145505]
pop=pd.Series(populations,index=index)
pop
# Out[1]
(California, 2010)    37253956
(California, 2020)    39538223
(New York, 2010)      19378102
(New York, 2020)      20201249
(Texas, 2010)         25145561
(Texas, 2020)         29145505
dtype: int64
  • With this indexing scheme, you can straightforwardly index or slice the series based on this tuple index
# In[2]
pop[('California',2020):('Texas',2010)]
# Out[2]
(California, 2020)    39538223
(New York, 2010)      19378102
(New York, 2020)      20201249
(Texas, 2010)         25145561
dtype: int64
  • But the convenience ends there. For example, if you want to select all values from 2020, you'll need to do some messy munging to make it happen.
  • The result is not as clean as the slicing syntax we've learned in Pandas.
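To make the "messy munging" concrete, here is a minimal sketch of one way to pull out the 2020 values from the tuple-indexed series. The names pop_tuples, mask and pop_2020 are made up for this illustration, and a Boolean mask is only one of several possible workarounds.

```python
import pandas as pd

index = [('California', 2010), ('California', 2020),
         ('New York', 2010), ('New York', 2020),
         ('Texas', 2010), ('Texas', 2020)]
populations = [37253956, 39538223, 19378102, 20201249, 25145561, 29145505]
pop_tuples = pd.Series(populations, index=index)

# Scan every tuple label by hand and keep the rows whose second item is 2020
mask = [label[1] == 2020 for label in pop_tuples.index]
pop_2020 = pop_tuples[mask]
print(pop_2020)
```

This works, but the selection logic lives in a hand-written loop rather than in the index itself, which is the problem the MultiIndex solves.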

Better Way: Pandas MultiIndex

  • We can create a multi-index from the tuples.
# In[3]
index=pd.MultiIndex.from_tuples(index)
  • The MultiIndex represents multiple levels of indexing as well as multiple labels for each data point which encode these levels.

  • If we reindex our series with MultiIndex, we see the hierarchical representation of the data.

# In[4]
pop=pop.reindex(index)
pop
# Out[4]
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64
  • Some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.

  • To access all data for which the second index is 2020, we can use the Pandas slicing notation.

# In[5]
pop[:,2020]
# Out[5]
California    39538223
New York      20201249
Texas         29145505
dtype: int64

MultiIndex as Extra Dimension

  • The unstack method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame.
# In[6]
pop_df=pop.unstack()
pop_df
# Out[6]
                2010      2020
California  37253956  39538223
New York    19378102  20201249
Texas       25145561  29145505
  • The stack method provides the opposite operation.
# In[7]
pop_df.stack()
# Out[7]
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64
  • Just as we were able to use multi-indexing to manipulate two-dimensional data within a one-dimensional Series, we can also use it to manipulate data of three or more dimensions in a Series or DataFrame.
  • Each extra level in a multi-index represents an extra dimension of data.
  • For example, we might want to add another column of demographic data for each state in each year (say, the population under 18); with a MultiIndex this is as easy as adding another column to the DataFrame.
# In[8]
pop_df=pd.DataFrame({'total':pop,'under18':[9284094,8898092,4318033,4181528,6879014,7432474]})
pop_df
# Out[8]
                    total  under18
California 2010  37253956  9284094
           2020  39538223  8898092
New York   2010  19378102  4318033
           2020  20201249  4181528
Texas      2010  25145561  6879014
           2020  29145505  7432474
  • In addition, all the ufuncs and other functionality work with hierarchical indices as well. Here we compute the fraction of people under 18 by year.
# In[9]
f_u18=pop_df['under18']/pop_df['total']
f_u18.unstack()
# Out[9]
                2010      2020
California  0.249211  0.225050
New York    0.222831  0.206994
Texas       0.273568  0.255013

For more information about the stack and unstack methods, see these references:
1. Pandas stack/unstack
2. Data reconstruction using stack/unstack

Methods of MultiIndex Creation

  • The most straightforward way to construct a multiply indexed Series and DataFrame is to simply pass a list of two or more index arrays to the constructor.
# In[10]
df=pd.DataFrame(np.random.rand(4,2),index=[['a','a','b','b'],[1,2,1,2]],columns=['data1','data2'])
df
# Out[10]
        data1     data2
a 1  0.627660  0.158404
  2  0.181580  0.043981
b 1  0.297599  0.338398
  2  0.592384  0.886842
  • If you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default.
# In[11]
data={('California',2010):37253956,('California',2020):39538223,('New York',2010):19378102,('New York',2020):20201249,('Texas',2010):25145561,('Texas',2020):29145505}
pd.Series(data)
# Out[11]
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64

Explicit MultiIndex Constructors

  • For more flexibility in how the index is constructed, you can instead use the constructor methods available in the pd.MultiIndex class.
  • You can construct a MultiIndex from a simple list of arrays giving the index values within each level.
# In[12]
pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])
# Out[12]
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
  • You can construct it from a list of tuples giving the multiple index values of each point.
# In[13]
pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])
# Out[13]
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
  • You can even construct it from a Cartesian product of single indices.
# In[14]
pd.MultiIndex.from_product([['a','b'],[1,2]])
# Out[14]
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
  • Similarly, you can construct a MultiIndex directly using its internal encoding by passing levels (a list of lists containing available index values for each level) and codes (a list of lists that reference these labels).
# In[15]
pd.MultiIndex(levels=[['a','b'],[1,2]],codes=[[0,0,1,1],[0,1,0,1]])
# Out[15]
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
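To see how levels and codes relate, the sketch below builds an index with from_product and then inspects its levels and codes attributes. The variable names mi and rebuilt are chosen just for this example.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# levels: the unique values available at each level
print(mi.levels)
# codes: for each row, the position of its label within the matching level
print(mi.codes)

# Feeding the same levels and codes back in rebuilds an equal index
rebuilt = pd.MultiIndex(levels=mi.levels, codes=mi.codes)
print(rebuilt.equals(mi))
```

Each code is simply a position into the corresponding level, so the row ('b', 1) is stored as code 1 in the first level and code 0 in the second.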

MultiIndex Level Names

  • Sometimes it is convenient to name the levels of the MultiIndex
  • This can be accomplished by passing the names argument to any of the previously discussed MultiIndex constructors, or by setting the names attribute of the index.
# In[16]
pop.index.names=['state','year']
pop
# Out[16]
state       year
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64
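The cell above sets the names attribute after the fact. As a small sketch of the other route, the names argument can be passed when the index is constructed; the variable named is hypothetical and the data is the same state/year layout used throughout.

```python
import pandas as pd

# Name the levels at construction time instead of assigning index.names later
named = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]],
    names=['state', 'year'],
)
print(named.names)
```

Naming the levels early is handy because methods such as unstack accept a level name in place of a level number.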

MultiIndex for Columns

# In[17]
# hierarchical indices and columns
index=pd.MultiIndex.from_product([[2013,2014],[1,2]],names=['year','visit'])
columns=pd.MultiIndex.from_product([['Bob','Guido','Sue'],['HR','Temp']],names=['subject','type'])

# mock some data
data=np.round(np.random.randn(4,6),1)
data[:, ::2]*=10
data+=37

# create the DataFrame
health_data=pd.DataFrame(data,index=index,columns=columns)
health_data
# Out[17]
subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      41.0  37.2  36.0  36.0  43.0  37.9
     2      50.0  35.8  29.0  35.8  41.0  37.8
2014 1      34.0  36.5  34.0  37.2  58.0  38.1
     2      40.0  37.0  43.0  36.0  21.0  38.8
  • This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number.
  • We can index the top-level column by the person's name and get a full DataFrame containing just that person's information.
# In[18]
health_data['Guido']
# Out[18]
type          HR  Temp
year visit
2013 1      36.0  36.0
     2      29.0  35.8
2014 1      34.0  37.2
     2      43.0  36.0

Indexing and Slicing a MultiIndex

Multiply Indexed Series

# In[19]
pop
# Out[19]
state       year
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64
  • We can access single elements by indexing with multiple terms.
# In[20]
pop['California',2010]
# Out[20]
37253956
  • The MultiIndex also supports partial indexing, or indexing just one of the levels in the index.
# In[21]
pop['California']
# Out[21]
year
2010    37253956
2020    39538223
dtype: int64
  • Partial slicing is available as well, as long as the MultiIndex is sorted.
# In[22]
pop.loc['California':'New York']
# Out[22]
state       year
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
dtype: int64
  • With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index
# In[23]
pop[:,2010]
# Out[23]
state
California    37253956
New York      19378102
Texas         25145561
dtype: int64
  • Other types of indexing and selection work as well, for example selection based on Boolean masks (In[24]) and fancy indexing (In[25]).
# In[24]
pop[pop>22000000]
# Out[24]
state       year
California  2010    37253956
            2020    39538223
Texas       2010    25145561
            2020    29145505
dtype: int64

# In[25]
pop[['California','Texas']]
# Out[25]
state       year
California  2010    37253956
            2020    39538223
Texas       2010    25145561
            2020    29145505
dtype: int64

Multiply Indexed DataFrames

# In[26]
health_data
# Out[26]
subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      41.0  37.2  36.0  36.0  43.0  37.9
     2      50.0  35.8  29.0  35.8  41.0  37.8
2014 1      34.0  36.5  34.0  37.2  58.0  38.1
     2      40.0  37.0  43.0  36.0  21.0  38.8
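  • Columns are primary in a DataFrame, and the syntax used for multiply indexed Series applies to the columns. For example, we can recover Guido's heart rate data with a simple operation.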
# In[27]
health_data['Guido','HR']
# Out[27]
year  visit
2013  1        36.0
      2        29.0
2014  1        34.0
      2        43.0
Name: (Guido, HR), dtype: float64
  • Also, as with the single-index case, we can use the loc and iloc indexers (the older ix indexer has been removed from current Pandas versions).
# In[28]
health_data.iloc[:2,:2]
# Out[28]
subject      Bob
type          HR  Temp
year visit
2013 1      41.0  37.2
     2      50.0  35.8
  • These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in loc and iloc can be passed a tuple of multiple indices.
# In[29]
health_data.loc[:,('Bob','HR')]
# Out[29]
year  visit
2013  1        41.0
      2        50.0
2014  1        34.0
      2        40.0
Name: (Bob, HR), dtype: float64
  • Working with slices within these index tuples is not especially convenient.
  • Trying to create a slice within a tuple, as in health_data.loc[(:,1),(:,'HR')], leads to a syntax error.
  • You could get around this by building the desired slice explicitly using Python's built-in slice function, but a better way in this context is to use an IndexSlice object, which Pandas provides for precisely this situation.
# In[30]
idx=pd.IndexSlice
health_data.loc[idx[:,1],idx[:,'HR']]
# Out[30]
subject      Bob Guido   Sue
type          HR    HR    HR
year visit
2013 1      41.0  36.0  43.0
2014 1      34.0  34.0  58.0
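For comparison, here is a sketch of the workaround with Python's built-in slice function, next to the IndexSlice form. The frame demo is a stand-in with the same index and column layout as health_data but filled with made-up sequential numbers so the result is reproducible.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Placeholder values 0..23, not real measurements
demo = pd.DataFrame(np.arange(24.0).reshape(4, 6), index=index, columns=columns)

# slice(None) stands in for the bare ':' that is not allowed inside a tuple
with_slice = demo.loc[(slice(None), 1), (slice(None), 'HR')]

# The same selection written with IndexSlice
idx = pd.IndexSlice
with_idx = demo.loc[idx[:, 1], idx[:, 'HR']]

print(with_slice.equals(with_idx))
```

Both forms select the first visit of each year and the HR column of each subject; IndexSlice just keeps the familiar colon syntax.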

Rearranging Multi-Indexes

Sorted and Unsorted Indices

  • Many of the MultiIndex slicing operations will fail if the index is not sorted.
# In[31]
index=pd.MultiIndex.from_product([['a','c','b'],[1,2]])
data=pd.Series(np.random.rand(6),index=index)
data.index.names=['char','int']
data
# Out[31]
char  int
a     1      0.601307
      2      0.623240
c     1      0.194030
      2      0.969886
b     1      0.931100
      2      0.700467
dtype: float64
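  • You can't take a partial slice of this index: an attempt such as data['a':'b'] results in an error.
  • For various reasons, partial slices and other similar operations require the levels in the MultiIndex to be in sorted order.
  • Pandas provides the sort_index method to sort the index; the sortlevel method mentioned in older material has been removed from current Pandas versions.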
# In[32]
data=data.sort_index()
data
# Out[32]
char  int
a     1      0.601307
      2      0.623240
b     1      0.931100
      2      0.700467
c     1      0.194030
      2      0.969886
dtype: float64
  • With the index sorted in this way, partial slicing will work as expected.
# In[33]
data['a':'b']
# Out[33]
char  int
a     1      0.601307
      2      0.623240
b     1      0.931100
      2      0.700467
dtype: float64
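The failure on an unsorted index described in this section can be reproduced with a short sketch. The names unsorted, failed and sorted_data are made up here, the values are sequential placeholders rather than random numbers, and catching KeyError assumes that the error Pandas raises for this case (UnsortedIndexError) remains a KeyError subclass.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]], names=['char', 'int'])
unsorted = pd.Series(np.arange(6.0), index=index)

try:
    unsorted.loc['a':'b']  # partial slice on an index that is not sorted
    failed = False
except KeyError as err:  # assumption: UnsortedIndexError subclasses KeyError
    failed = True
    print(type(err).__name__, err)

# Sorting the index makes the same partial slice valid
sorted_data = unsorted.sort_index()
print(sorted_data.index.is_monotonic_increasing)
print(sorted_data.loc['a':'b'])
```

Checking index.is_monotonic_increasing before slicing is a cheap way to find out whether a sort_index call is needed.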

Stacking and Unstacking Indices

  • It is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use.
  • With the level option, the index level named in the level parameter becomes the columns.
# In[34]
pop.unstack()
# Out[34]
year            2010      2020
state
California  37253956  39538223
New York    19378102  20201249
Texas       25145561  29145505

# In[35]
pop.unstack(level=0) # pop.unstack(level='state')
# Out[35]
state  California  New York     Texas
year
2010     37253956  19378102  25145561
2020     39538223  20201249  29145505

# In[36]
pop.unstack(level=1) # pop.unstack(level='year')
# Out[36]
year            2010      2020
state
California  37253956  39538223
New York    19378102  20201249
Texas       25145561  29145505
  • The opposite of unstack is stack, which can be used to recover the original series
# In[37]
pop.unstack().stack()
# Out[37]
state       year
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64

Index Setting and Resetting

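  • Another way to rearrange hierarchical data is to turn the index labels into columns with the reset_index method; for a Series, the name argument sets the column name for the data values.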
# In[38]
pop_flat=pop.reset_index(name='population')
pop_flat
# Out[38]
        state  year  population
0  California  2010    37253956
1  California  2020    39538223
2    New York  2010    19378102
3    New York  2020    20201249
4       Texas  2010    25145561
5       Texas  2020    29145505
  • A common pattern is to build a MultiIndex from the column values.
  • This can be done with the set_index method of the DataFrame, which returns a multiply indexed DataFrame
# In[39]
pop_flat.set_index(['state','year'])
# Out[39]
                 population
state      year
California 2010    37253956
           2020    39538223
New York   2010    19378102
           2020    20201249
Texas      2010    25145561
           2020    29145505
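As a quick check that reset_index and set_index undo each other, the sketch below flattens the population series and rebuilds it. The names flat and restored are chosen for this example only.

```python
import pandas as pd

data = {('California', 2010): 37253956, ('California', 2020): 39538223,
        ('New York', 2010): 19378102, ('New York', 2020): 20201249,
        ('Texas', 2010): 25145561, ('Texas', 2020): 29145505}
pop = pd.Series(data)
pop.index.names = ['state', 'year']

# Index levels become ordinary columns, the values go into 'population'
flat = pop.reset_index(name='population')

# Rebuild the MultiIndex from those columns and pull the values back out
restored = flat.set_index(['state', 'year'])['population']

print(restored.equals(pop))
```

This round trip is useful when data arrives as a flat table (for example from a CSV file) and the hierarchical form is wanted for slicing.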