Data Indexing and Selection

노정훈 · July 19, 2023

Data Selection in Series

  • As we saw, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.

Series as Dictionary

  • Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values.
# In[1]
import pandas as pd
data=pd.Series([0.25,0.5,0.75,1.0],index=['a','b','c','d'])
data
# Out[1]
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
  • We can also use dictionary-like Python expressions and methods to examine the keys/indices and values.
# In[2]
'a' in data
# Out[2]
True

# In[3]
data.keys()
# Out[3]
Index(['a', 'b', 'c', 'd'], dtype='object')

# In[4]
list(data.items())
# Out[4]
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
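As a small sketch of the dictionary analogy, the snippet below builds a Series directly from a plain dict and checks that the same keys and items come back. The names `prices` and `s` are illustrative and do not appear in the original post.

```python
import pandas as pd

# Hypothetical data, used only for illustration
prices = {'a': 0.25, 'b': 0.5, 'c': 0.75, 'd': 1.0}
s = pd.Series(prices)

# Membership tests look at the index (the "keys"), not at the values
has_key = 'a' in s       # True
has_value = 0.25 in s    # False: 0.25 is a value, not an index label

# keys() returns the index; items() yields (label, value) pairs
round_trip = dict(s.items())
```

Note that `in` behaves as it does for a dict: it tests the keys, so a value that is present in the Series still reports False.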
  • A Series object can also be modified with dictionary-like syntax.
# In[5]
data['e']=1.25
data
# Out[5]
a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

Series as one-dimensional array

# In[6]
# slicing by explicit index
data['a':'c']
# Out[6]
a    0.25
b    0.50
c    0.75
dtype: float64

# In[7]
# slicing by implicit integer index
data[0:2]
# Out[7]
a    0.25
b    0.50
dtype: float64
  • Notice that when slicing with an explicit index, the final index is included in the slice, while when slicing with an implicit index, the final index is excluded from the slice.
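The difference in endpoint handling can be made explicit with the loc and iloc indexers, which are covered in the next section. This is only a sketch; the Series `s` is rebuilt here so that the block runs on its own.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Label-based slice: the end label 'c' is included
by_label = s.loc['a':'c']

# Position-based slice: the end position 2 is excluded
by_position = s.iloc[0:2]

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```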
# In[8]
# masking
data[(data>0.3)&(data<0.8)]
# Out[8]
b    0.50
c    0.75
dtype: float64

# In[9]
# fancy indexing
data[['a','e']]
# Out[9]
a    0.25
e    1.25
dtype: float64

Indexers: loc and iloc

# In[10]
data=pd.Series(['a','b','c'],index=[1,3,5])
data
# Out[10]
1    a
3    b
5    c
dtype: object
  • Because integer indexes make it easy to confuse labels with positions, Pandas provides special indexer attributes that explicitly expose a particular indexing scheme.
  • These are not methods, but attributes that expose a particular slicing interface to the data in the Series.
  • The loc attribute allows indexing and slicing that always references the explicit index.
# In[11]
data.loc[1]
# Out[11]
'a'

# In[12]
data.loc[1:3]
# Out[12]
1    a
3    b
dtype: object
  • The iloc attribute allows indexing and slicing that always references the implicit Python-style index.
# In[13]
data.iloc[1]
# Out[13]
'b'

# In[14]
data.iloc[1:3]
# Out[14]
3    b
5    c
dtype: object
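To see why the distinction matters for an integer index, the following sketch contrasts a label that exists with a position that does not. It reuses the same three-element Series; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# Label 3 and position 1 happen to point at the same element
same_element = s.loc[3] == s.iloc[1]

# Label 5 exists, but position 5 is past the end of a 3-element Series
last_by_label = s.loc[5]
try:
    s.iloc[5]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True
```

Writing `loc` or `iloc` every time makes the intent visible to the reader, which is the main reason to prefer them over bare brackets when the index is made of integers.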

Data Selection in DataFrames

  • As we saw, a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index.

DataFrame as Dictionary

# In[15]
area=pd.Series({'California':423967,'Texas':695662,'Florida':170312,'New York':141297,'Pennsylvania':119280})
pop=pd.Series({'California':39538223,'Texas':29145505,'Florida':21538187,'New York':20201249,'Pennsylvania':13002700})
data=pd.DataFrame({'area':area,'pop':pop})
data
# Out[15]
                area       pop
California    423967  39538223
Texas         695662  29145505
Florida       170312  21538187
New York      141297  20201249
Pennsylvania  119280  13002700
  • The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name (like data['area'])
  • Equivalently, we can use attribute-style access with column names that are strings.
# In[16]
data.area
# Out[16]
California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64
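  • Keep in mind that attribute-style access does not work in all cases.
  • For example, if the column names are not strings, or if they conflict with methods of the DataFrame, attribute-style access is not possible. The DataFrame has a pop() method, so data.pop refers to that method rather than to the 'pop' column.
  • You should also avoid the temptation to assign columns via attributes: use data['pop']=z rather than data.pop=z.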
# In[17]
data.pop is data['pop']
# Out[17]
False
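The sketch below checks both pitfalls on a two-row frame. The attribute name `total` is hypothetical and chosen only because it does not clash with anything on a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662], 'pop': [39538223, 29145505]},
                  index=['California', 'Texas'])

# 'area' does not clash with any DataFrame attribute, so both forms agree
area_matches = df.area.equals(df['area'])

# 'pop' clashes with the DataFrame.pop method, so the attribute is the method
pop_is_method = callable(df.pop) and not isinstance(df.pop, pd.Series)

# Assigning through a new attribute name sets a plain Python attribute,
# not a column ('total' is a made-up name for this illustration)
df.total = 0
has_total_column = 'total' in df.columns
```

Dictionary-style access (`df['pop']`) works for any column name, so it is the safer default in code that does not control its column names.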
  • Dictionary-style syntax can also be used to modify the object, for example to add a new column.
# In[18]
data['density']=data['pop']/data['area']
data
# Out[18]
                area       pop     density
California    423967  39538223   93.257784
Texas         695662  29145505   41.896072
Florida       170312  21538187  126.463121
New York      141297  20201249  142.970120
Pennsylvania  119280  13002700  109.009893

DataFrame as two-dimensional array

  • We can examine the raw underlying data array using the values attribute.
# In[19]
data.values
# Out[19]
array([[4.23967000e+05, 3.95382230e+07, 9.32577842e+01],
       [6.95662000e+05, 2.91455050e+07, 4.18960717e+01],
       [1.70312000e+05, 2.15381870e+07, 1.26463121e+02],
       [1.41297000e+05, 2.02012490e+07, 1.42970120e+02],
       [1.19280000e+05, 1.30027000e+07, 1.09009893e+02]])
  • Many familiar array-like operations can be done on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns.
# In[20]
data.T
# Out[20]
           California         Texas       Florida      New York  Pennsylvania
area     4.239670e+05  6.956620e+05  1.703120e+05  1.412970e+05  1.192800e+05
pop      3.953822e+07  2.914550e+07  2.153819e+07  2.020125e+07  1.300270e+07
density  9.325778e+01  4.189607e+01  1.264631e+02  1.429701e+02  1.090099e+02
  • When it comes to indexing a DataFrame object, however, the dictionary-style indexing of columns means we cannot simply treat it as a NumPy array.
  • In particular, passing a single index to an array accesses a row.
# In[21]
data.values[0]
# Out[21]
array([4.23967000e+05, 3.95382230e+07, 9.32577842e+01])
  • And passing a single index to a DataFrame accesses a column.
# In[22]
data['area']
# Out[22]
California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64
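The contrast can be checked side by side. This sketch uses `to_numpy()` in place of `values`, which the pandas documentation recommends, and a three-row frame built only for the example.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 170312],
                   'pop': [39538223, 29145505, 21538187]},
                  index=['California', 'Texas', 'Florida'])

arr = df.to_numpy()

# A single index on the NumPy array selects the first row
first_row = arr[0]

# A single key on the DataFrame selects a column
area_column = df['area']

# An integer key is looked up among the column labels, so this fails
try:
    df[0]
    raised = False
except KeyError:
    raised = True
```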
  • For array-style indexing, we therefore use the loc and iloc indexers.
  • Using the iloc indexer, we can index the underlying array as if it were a simple NumPy array, but the DataFrame index and column labels are maintained in the result.
# In[23]
data.iloc[:3,:2]
# Out[23]
              area       pop
California  423967  39538223
Texas       695662  29145505
Florida     170312  21538187
  • Similarly, using the loc indexer we can index the underlying data in an array-like style but using the explicit index and column names.
# In[24]
data.loc[:'Florida',:'pop']
# Out[24]
              area       pop
California  423967  39538223
Texas       695662  29145505
Florida     170312  21538187
  • Any of the familiar NumPy-style data access patterns can be used within these indexers. For example, we can combine masking and fancy indexing.
# In[25]
data.loc[data.density>120,['pop','density']]
# Out[25]
               pop     density
Florida   21538187  126.463121
New York  20201249  142.970120
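The same selection can be written in two steps, and a positional counterpart with iloc is possible as well. The sketch below assumes the mask is converted to a NumPy array first, since iloc takes a boolean array rather than a labeled boolean Series.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 170312, 141297, 119280],
                   'pop': [39538223, 29145505, 21538187, 20201249, 13002700]},
                  index=['California', 'Texas', 'Florida', 'New York', 'Pennsylvania'])
df['density'] = df['pop'] / df['area']

# Step 1: a boolean mask over the rows
dense = df['density'] > 120

# Step 2: rows where the mask is True, restricted to two named columns
subset = df.loc[dense, ['pop', 'density']]

# Positional counterpart: boolean array for rows, integer positions for columns
subset_iloc = df.iloc[dense.to_numpy(), [1, 2]]
```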
  • Any of these indexing conventions may also be used to set or modify values.
# In[26]
data.iloc[0,2]=90
data
# Out[26]
                area       pop     density
California    423967  39538223   90.000000
Texas         695662  29145505   41.896072
Florida       170312  21538187  126.463121
New York      141297  20201249  142.970120
Pennsylvania  119280  13002700  109.009893
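Assignment works the same way through loc, either for a single labeled cell or for every cell a mask selects. In this sketch the cap of 120 is a made-up threshold, used only to show mask-based assignment.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 170312, 141297, 119280],
                   'pop': [39538223, 29145505, 21538187, 20201249, 13002700]},
                  index=['California', 'Texas', 'Florida', 'New York', 'Pennsylvania'])
df['density'] = df['pop'] / df['area']

# Set one cell by row label and column label
df.loc['California', 'density'] = 90

# Set every cell selected by a mask (hypothetical cap at 120)
df.loc[df['density'] > 120, 'density'] = 120
```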

Additional Indexing Conventions

  • While indexing with a single key refers to columns, slicing refers to rows.
# In[27]
data['Florida':'New York']
# Out[27]
            area       pop     density
Florida   170312  21538187  126.463121
New York  141297  20201249  142.970120
  • Such slices can also refer to rows by number rather than by index.
# In[28]
data[1:3]
# Out[28]
           area       pop     density
Texas    695662  29145505   41.896072
Florida  170312  21538187  126.463121
  • Similarly, direct masking operations are interpreted row-wise rather than column-wise.
# In[29]
data[data.density>120]
# Out[29]
            area       pop     density
Florida   170312  21538187  126.463121
New York  141297  20201249  142.970120
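Each of these bare-bracket shortcuts has a more explicit spelling with loc or iloc. The following sketch writes the three row selections above in that form; it rebuilds the frame so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 170312, 141297, 119280],
                   'pop': [39538223, 29145505, 21538187, 20201249, 13002700]},
                  index=['California', 'Texas', 'Florida', 'New York', 'Pennsylvania'])
df['density'] = df['pop'] / df['area']

# Row slice by label (end label included)
rows_by_label = df.loc['Florida':'New York']

# Row slice by position (end position excluded)
rows_by_position = df.iloc[1:3]

# Row selection by boolean mask
rows_by_mask = df.loc[df['density'] > 120]
```

The explicit forms are longer, but they state whether labels or positions are meant, which avoids having to remember that bare brackets switch from columns to rows when given a slice or a mask.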