Data Indexing and Selection

노정훈 · July 19, 2023

Pandas

Data Selection in Series

As we saw, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.

Series as Dictionary

Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values.

# In[1]
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

# Out[1]
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

We can also use dictionary-like Python expressions and methods to examine the keys/indices and values.

# In[2]
'a' in data

# Out[2]
True

# In[3]
data.keys()

# Out[3]
Index(['a', 'b', 'c', 'd'], dtype='object')

# In[4]
list(data.items())

# Out[4]
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

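A Series object can also be modified with dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value.

# In[5]
data['e'] = 1.25
data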

# Out[5]
a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64
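The dictionary analogy can be checked side by side with a real dictionary. The following is a minimal sketch, assuming pandas is imported as pd; the names `series` and `as_dict` are illustrative and do not appear in the original example.

```python
import pandas as pd

# Same labels and values as the example above.
series = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# A plain dictionary with the same keys and values, for comparison.
as_dict = {'a': 0.25, 'b': 0.5, 'c': 0.75, 'd': 1.0}

# Membership tests look at the index labels, just as `in` looks at dict keys.
print('a' in series, 'a' in as_dict)  # True True

# `in` does not search the values.
print(0.25 in series)  # False

# Assigning to a new label extends the Series, like adding a dict key.
series['e'] = 1.25
as_dict['e'] = 1.25
print(list(series.items()) == list(as_dict.items()))  # True
```

One practical consequence is that checking whether a value is present requires looking at series.values rather than using `in` on the Series itself.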

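Series as one-dimensional array

A Series also provides array-style item selection through the same basic mechanisms as NumPy arrays: slices, masking, and fancy indexing.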

# In[6]
# slicing by explicit index
data['a':'c']

# Out[6]
a    0.25
b    0.50
c    0.75
dtype: float64

# In[7]
# slicing by implicit integer index
data[0:2]

# Out[7]
a    0.25
b    0.50
dtype: float64

Notice that when slicing with an explicit index, the final index is included in the slice, while when slicing with an implicit index, the final index is excluded from the slice.
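The difference between the two slice styles is easy to verify by counting the returned labels. This is a small sketch that rebuilds the four-element Series under the illustrative name `series`; it is meant as a check of the rule above, not as a recommended idiom.

```python
import pandas as pd

series = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Label-based slice: both endpoints, 'a' and 'c', are included.
by_label = series['a':'c']

# Position-based slice: the stop position 2 is excluded.
by_position = series[0:2]

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```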

# In[8]
# masking
data[(data > 0.3) & (data < 0.8)]

# Out[8]
b    0.50
c    0.75
dtype: float64

# In[9]
# fancy indexing
data[['a', 'e']]

# Out[9]
a    0.25
e    1.25
dtype: float64
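It can help to look at the intermediate objects that masking and fancy indexing rely on. The sketch below assumes the five-element Series from the running example, rebuilt here as `series` (an illustrative name), and stores the boolean mask in a separate variable before applying it.

```python
import pandas as pd

series = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
                   index=['a', 'b', 'c', 'd', 'e'])

# The comparison produces a boolean Series aligned on the same index.
mask = (series > 0.3) & (series < 0.8)
print(mask.tolist())  # [False, True, True, False, False]

# Indexing with the mask keeps only the entries where it is True.
print(series[mask].index.tolist())  # ['b', 'c']

# Fancy indexing takes a list of labels and returns them in the order given.
print(series[['e', 'a']].tolist())  # [1.25, 0.25]
```

The parentheses around each comparison are required because `&` binds more tightly than `>` and `<` in Python.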

Indexers: `loc` and `iloc`

# In[10]
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

# Out[10]
1    a
3    b
5    c
dtype: object

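These slicing and indexing conventions can be a source of confusion. If a Series has an explicit integer index like the one above, an indexing operation such as data[1] uses the explicit index, while a slicing operation such as data[1:3] uses the implicit Python-style index. Because of this potential confusion, Pandas provides special indexer attributes that explicitly expose particular indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.

The loc attribute allows indexing and slicing that always references the explicit index.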

# In[11]
data.loc[1]

# Out[11]
'a'

# In[12]
data.loc[1:3]

# Out[12]
1    a
3    b
dtype: object

The iloc attribute allows indexing and slicing that always references the implicit Python-style index.

# In[13]
data.iloc[1]

# Out[13]
'b'

# In[14]
data.iloc[1:3]

# Out[14]
3    b
5    c
dtype: object
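The following sketch puts plain indexing, loc, and iloc next to each other on the same integer-indexed Series. The name `series` is illustrative, and the printed values are what we would expect from the rules described in this section.

```python
import pandas as pd

series = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# Plain indexing with a single integer uses the explicit labels...
print(series[1])              # a  (label 1)

# ...while plain slicing with integers uses implicit positions.
print(series[1:3].tolist())   # ['b', 'c']  (positions 1 and 2)

# loc is always label-based, and its slices include the end label.
print(series.loc[1:3].tolist())   # ['a', 'b']

# iloc is always position-based, and its slices exclude the stop position.
print(series.iloc[1:3].tolist())  # ['b', 'c']

# Label 0 does not exist, so loc raises KeyError where iloc succeeds.
try:
    series.loc[0]
except KeyError:
    print('no label 0; position 0 holds', series.iloc[0])
```

Using loc and iloc throughout makes the intent explicit, which is especially helpful when the index consists of integers.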

Data Selection in DataFrames

As we saw, a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index.

DataFrame as Dictionary

# In[15]
area = pd.Series({'California': 423967, 'Texas': 695662, 'Florida': 170312,
                  'New York': 141297, 'Pennsylvania': 119280})
pop = pd.Series({'California': 39538223, 'Texas': 29145505, 'Florida': 21538187,
                 'New York': 20201249, 'Pennsylvania': 13002700})
data = pd.DataFrame({'area': area, 'pop': pop})
data

# Out[15]
                area       pop
California    423967  39538223
Texas         695662  29145505
Florida       170312  21538187
New York      141297  20201249
Pennsylvania  119280  13002700

The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name, for example data['area'].
Equivalently, we can use attribute-style access with column names that are strings.

# In[16]
data.area

# Out[16]
California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64

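Keep in mind that this shorthand does not work in all cases.
For example, if the column names are not strings, or if they conflict with methods of the DataFrame, attribute-style access is not possible.
In particular, you should avoid the temptation to try column assignment via attributes: use data['pop'] = z rather than data.pop = z. The DataFrame has a pop method, so data.pop refers to that method rather than to the 'pop' column.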

# In[17]
data.pop is data['pop']

# Out[17]
False
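The pitfalls of attribute-style access can be reproduced on a small frame. This is a sketch under the assumption of a two-row DataFrame named `frame` (an illustrative name); the warning filter is only there to keep the output quiet.

```python
import warnings

import pandas as pd

frame = pd.DataFrame({'area': [423967, 695662],
                      'pop': [39538223, 29145505]},
                     index=['California', 'Texas'])

# 'area' does not clash with any DataFrame attribute, so both forms agree.
print(frame.area.equals(frame['area']))  # True

# 'pop' clashes with the DataFrame.pop method, so the attribute is the method.
print(callable(frame.pop))  # True

# Assigning to a new attribute name sets a plain Python attribute,
# not a column. pandas emits a UserWarning, silenced here.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    frame.density = frame['pop'] / frame['area']
print('density' in frame.columns)  # False

# Dictionary-style assignment is the reliable way to add the column.
frame['density'] = frame['pop'] / frame['area']
print('density' in frame.columns)  # True
```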

As with the Series objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case by adding a new column.

# In[18]
data['density'] = data['pop'] / data['area']
data

# Out[18]
                area       pop     density
California    423967  39538223   93.257784
Texas         695662  29145505   41.896072
Florida       170312  21538187  126.463121
New York      141297  20201249  142.970120
Pennsylvania  119280  13002700  109.009893

DataFrame as two-dimensional array

We can examine the raw underlying data array using the values attribute.

# In[19]
data.values

# Out[19]
array([[4.23967000e+05, 3.95382230e+07, 9.32577842e+01],
       [6.95662000e+05, 2.91455050e+07, 4.18960717e+01],
       [1.70312000e+05, 2.15381870e+07, 1.26463121e+02],
       [1.41297000e+05, 2.02012490e+07, 1.42970120e+02],
       [1.19280000e+05, 1.30027000e+07, 1.09009893e+02]])

Many familiar array-like operations can be done on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns.

# In[20]
data.T

# Out[20]
           California         Texas       Florida      New York  Pennsylvania
area     4.239670e+05  6.956620e+05  1.703120e+05  1.412970e+05  1.192800e+05
pop      3.953822e+07  2.914550e+07  2.153819e+07  2.020125e+07  1.300270e+07
density  9.325778e+01  4.189607e+01  1.264631e+02  1.429701e+02  1.090099e+02

When it comes to indexing of a DataFrame object, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array.
In particular, passing a single index to an array accesses a row.

# In[21]
data.values[0]

# Out[21]
array([4.23967000e+05, 3.95382230e+07, 9.32577842e+01])

Passing a single index to a DataFrame, by contrast, accesses a column.

# In[22]
data['area']

# Out[22]
California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64

For array-style indexing, we therefore need another convention, and this is where the loc and iloc indexers come in.
Using the iloc indexer, we can index the underlying array as if it were a simple NumPy array, but the DataFrame index and column labels are maintained in the result.

# In[23]
data.iloc[:3, :2]

# Out[23]
              area       pop
California  423967  39538223
Texas       695662  29145505
Florida     170312  21538187

Similarly, using the loc indexer we can index the underlying data in an array-like style but using the explicit index and column names.

# In[24]
data.loc[:'Florida', :'pop']

# Out[24]
              area       pop
California  423967  39538223
Texas       695662  29145505
Florida     170312  21538187

Any of the familiar NumPy-style data access patterns can be used within these indexers. For example, we can combine masking and fancy indexing.

# In[25]
data.loc[data.density > 120, ['pop', 'density']]

# Out[25]
               pop     density
Florida   21538187  126.463121
New York  20201249  142.970120
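To confirm that the position-based and label-based selections above pick out the same block, we can compare them directly. The sketch below rebuilds the example table as `frame` (an illustrative name) so that it runs on its own.

```python
import pandas as pd

frame = pd.DataFrame(
    {'area': [423967, 695662, 170312, 141297, 119280],
     'pop': [39538223, 29145505, 21538187, 20201249, 13002700]},
    index=['California', 'Texas', 'Florida', 'New York', 'Pennsylvania'])
frame['density'] = frame['pop'] / frame['area']

# Position-based: rows 0 to 2 and columns 0 to 1 (stop values excluded).
by_position = frame.iloc[:3, :2]

# Label-based: up to and including 'Florida' and 'pop'.
by_label = frame.loc[:'Florida', :'pop']

print(by_position.equals(by_label))  # True

# A boolean mask selects rows; a list of labels selects columns.
dense = frame.loc[frame['density'] > 120, ['pop', 'density']]
print(dense.index.tolist())    # ['Florida', 'New York']
print(dense.columns.tolist())  # ['pop', 'density']
```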

Any of these indexing conventions may also be used to set or modify values.

# In[26]
data.iloc[0, 2] = 90
data

# Out[26]
                area       pop     density
California    423967  39538223   90.000000
Texas         695662  29145505   41.896072
Florida       170312  21538187  126.463121
New York      141297  20201249  142.970120
Pennsylvania  119280  13002700  109.009893
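Assignment works the same way through loc, including with a boolean mask. The following is a sketch on a three-row version of the table; the name `frame` and the replacement values 40.0 and 100.0 are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    {'area': [423967, 695662, 170312],
     'pop': [39538223, 29145505, 21538187]},
    index=['California', 'Texas', 'Florida'])
frame['density'] = frame['pop'] / frame['area']

# Position-based assignment: row 0, column 2.
frame.iloc[0, 2] = 90

# Label-based assignment to a single cell.
frame.loc['Texas', 'density'] = 40.0

# Mask-based assignment: cap every density above 100 at 100.
frame.loc[frame['density'] > 100, 'density'] = 100.0

print(frame['density'].tolist())  # [90.0, 40.0, 100.0]
```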

Additional Indexing Conventions

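A couple of extra indexing conventions may seem at odds with the preceding discussion but are useful in practice. First, while indexing refers to columns, slicing refers to rows.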

# In[27]
data['Florida':'New York']

# Out[27]
            area       pop     density
Florida   170312  21538187  126.463121
New York  141297  20201249  142.970120

Such slices can also refer to rows by number rather than by index label.

# In[28]
data[1:3]

# Out[28]
           area       pop     density
Texas    695662  29145505   41.896072
Florida  170312  21538187  126.463121

Similarly, direct masking operations are interpreted row-wise rather than column-wise.

# In[29]
data[data.density > 120]

# Out[29]
            area       pop     density
Florida   170312  21538187  126.463121
New York  141297  20201249  142.970120
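These conventions for plain square-bracket access can be summarized in one short sketch. It assumes a three-row frame named `frame` (an illustrative name) and shows what each kind of key selects, including the case that fails.

```python
import pandas as pd

frame = pd.DataFrame(
    {'area': [423967, 695662, 170312],
     'pop': [39538223, 29145505, 21538187]},
    index=['California', 'Texas', 'Florida'])

# A single label selects a column and returns a Series.
column = frame['area']
print(type(column).__name__)  # Series

# A slice selects rows and returns a DataFrame.
rows_by_label = frame['Texas':'Florida']  # label slice, end included
rows_by_position = frame[1:3]             # positional slice, end excluded
print(rows_by_label.equals(rows_by_position))  # True

# A boolean mask also selects rows.
large = frame[frame['area'] > 400000]
print(large.index.tolist())  # ['California', 'Texas']

# A single row label is not a column name, so plain indexing fails.
try:
    frame['Texas']
except KeyError:
    print("KeyError: use frame.loc['Texas'] to select a row by label")
```

Because the meaning of the key depends on its type, the explicit loc and iloc forms are often easier to read in longer code.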
