Data Selection in Series
- As we saw, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.
Series as Dictionary
- Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values.
# In[1]
import pandas as pd
data=pd.Series([0.25,0.5,0.75,1.0],index=['a','b','c','d'])
data
# Out[1]
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
- We can also use dictionary-like Python expressions and methods to examine the keys/indices and values.
# In[2]
'a' in data
# Out[2]
True
# In[3]
data.keys()
# Out[3]
Index(['a', 'b', 'c', 'd'], dtype='object')
# In[4]
list(data.items())
# Out[4]
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
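- A Series object can also be modified with dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value.
# In[5]
data['e']=1.25
data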
# Out[5]
a 0.25
b 0.50
c 0.75
d 1.00
e 1.25
dtype: float64
Series as one-dimensional array
# In[6]
# slicing by explicit index
data['a':'c']
# Out[6]
a 0.25
b 0.50
c 0.75
dtype: float64
# In[7]
# slicing by implicit integer index
data[0:2]
# Out[7]
a 0.25
b 0.50
dtype: float64
- Notice that when slicing with an explicit index (data['a':'c']), the final index is included in the slice, while when slicing with an implicit index (data[0:2]), the final index is excluded from the slice.
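- The difference between the two slicing conventions can be checked directly. The following is a minimal sketch that rebuilds the same Series as above; the names by_label and by_position are only illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Label-based slice: both endpoints, 'a' and 'c', are included.
by_label = data['a':'c']

# Position-based slice: the stop position 2 is excluded, as in plain Python.
by_position = data[0:2]

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```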
# In[8]
# masking
data[(data>0.3)&(data<0.8)]
# Out[8]
b 0.50
c 0.75
dtype: float64
# In[9]
# fancy indexing
data[['a','e']]
# Out[9]
a 0.25
e 1.25
dtype: float64
Indexers: loc and iloc
# In[10]
data=pd.Series(['a','b','c'],index=[1,3,5])
data
# Out[10]
1 a
3 b
5 c
dtype: object
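- With an integer index such as this one, plain indexing can be confusing: an indexing operation like data[1] uses the explicit index, while a slicing operation like data[1:3] uses the implicit Python-style index.
- Because of this potential confusion, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes.
- These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.
- The loc attribute allows indexing and slicing that always references the explicit index.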
# In[11]
data.loc[1]
# Out[11]
'a'
# In[12]
data.loc[1:3]
# Out[12]
1 a
3 b
dtype: object
- The iloc attribute allows indexing and slicing that always references the implicit Python-style index.
# In[13]
data.iloc[1]
# Out[13]
'b'
# In[14]
data.iloc[1:3]
# Out[14]
3 b
5 c
dtype: object
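- To see why the explicit indexers help, it can be useful to compare plain indexing with loc and iloc on the same integer-indexed Series. This is a small sketch, not part of the original cells; the variable names label_lookup and position_slice are made up for illustration.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# Plain indexing with a single integer uses the explicit labels...
label_lookup = data[1]        # label 1

# ...but plain slicing with integers uses the implicit positions.
position_slice = data[1:3]    # positions 1 and 2

# loc always works with labels (and includes the end label),
# iloc always works with positions (and excludes the stop position).
print(label_lookup)                  # a
print(list(position_slice.index))    # [3, 5]
print(list(data.loc[1:3].index))     # [1, 3]
print(list(data.iloc[1:3].index))    # [3, 5]
```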
Data Selection in DataFrames
- As we saw, a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index.
DataFrame as Dictionary
# In[15]
area=pd.Series({'California':423967,'Texas':695662,'Florida':170312,'New York':141297,'Pennsylvania':119280})
pop=pd.Series({'California':39538223,'Texas':29145505,'Florida':21538187,'New York':20201249,'Pennsylvania':13002700})
data=pd.DataFrame({'area':area,'pop':pop})
data
# Out[15]
area pop
California 423967 39538223
Texas 695662 29145505
Florida 170312 21538187
New York 141297 20201249
Pennsylvania 119280 13002700
- The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name (like data['area']).
- Equivalently, we can use attribute-style access with column names that are strings.
# In[16]
data.area
# Out[16]
California 423967
Texas 695662
Florida 170312
New York 141297
Pennsylvania 119280
Name: area, dtype: int64
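- Keep in mind that attribute-style access does not work in all cases.
- For example, if the column names are not strings, or if the column names conflict with methods of the DataFrame, this attribute-style access is not possible.
- You should also avoid the temptation to try column assignment via attributes; use data['pop']=z rather than data.pop=z.
- The DataFrame has a pop() method, so data.pop points to that method rather than to the 'pop' column.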
# In[17]
data.pop is data['pop']
# Out[17]
False
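- The pitfalls of attribute-style access can be sketched as follows. This example uses a shortened two-row version of the table, and the column name ratio is a made-up name used only for illustration. Assigning a Series to a new attribute name typically makes pandas emit a UserWarning, which is silenced here.

```python
import warnings

import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [39538223, 29145505]},
                    index=['California', 'Texas'])

# 'pop' is a column, but data.pop resolves to the DataFrame.pop method.
print(callable(data.pop))      # True
print(data['pop'].tolist())    # [39538223, 29145505]

# Assigning to a new attribute name sets a plain Python attribute;
# it does not create a column.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    data.ratio = data['pop'] / data['area']
created_by_attribute = 'ratio' in data.columns
print(created_by_attribute)    # False

# Dictionary-style assignment is the reliable way to add a column.
data['ratio'] = data['pop'] / data['area']
created_by_key = 'ratio' in data.columns
print(created_by_key)          # True
```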
- You can add a new column.
# In[18]
data['density']=data['pop']/data['area']
data
# Out[18]
area pop density
California 423967 39538223 93.257784
Texas 695662 29145505 41.896072
Florida 170312 21538187 126.463121
New York 141297 20201249 142.970120
Pennsylvania 119280 13002700 109.009893
DataFrame as two-dimensional array
- We can examine the raw underlying data array using the values attribute.
# In[19]
data.values
# Out[19]
array([[4.23967000e+05, 3.95382230e+07, 9.32577842e+01],
[6.95662000e+05, 2.91455050e+07, 4.18960717e+01],
[1.70312000e+05, 2.15381870e+07, 1.26463121e+02],
[1.41297000e+05, 2.02012490e+07, 1.42970120e+02],
[1.19280000e+05, 1.30027000e+07, 1.09009893e+02]])
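- The integer columns appear as floats in this array because a NumPy array holds a single dtype, so mixed integer and float columns are converted to a common type. A short sketch on a two-row version of the table (the variable name raw is illustrative) shows this:

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [39538223, 29145505]},
                    index=['California', 'Texas'])
data['density'] = data['pop'] / data['area']

# Each column keeps its own dtype inside the DataFrame.
print(data.dtypes)

# The combined 2D array needs one dtype for all columns.
raw = data.to_numpy()    # same data as data.values
print(raw.dtype)         # float64
print(raw.shape)         # (2, 3)
```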
- Many familiar array-like operations can be done on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns.
# In[20]
data.T
# Out[20]
California Texas Florida New York Pennsylvania
area 4.239670e+05 6.956620e+05 1.703120e+05 1.412970e+05 1.192800e+05
pop 3.953822e+07 2.914550e+07 2.153819e+07 2.020125e+07 1.300270e+07
density 9.325778e+01 4.189607e+01 1.264631e+02 1.429701e+02 1.090099e+02
- When it comes to indexing of a DataFrame object, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array.
- In particular, passing a single index to an array accesses a row.
# In[21]
data.values[0]
# Out[21]
array([4.23967000e+05, 3.95382230e+07, 9.32577842e+01])
- And passing a single "index" to a DataFrame accesses a column.
# In[22]
data['area']
# Out[22]
California 423967
Texas 695662
Florida 170312
New York 141297
Pennsylvania 119280
Name: area, dtype: int64
- So, for array-style indexing, we need another convention: the loc and iloc indexers.
- Using the iloc indexer, we can index the underlying array as if it were a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result.
# In[23]
data.iloc[:3,:2]
# Out[23]
area pop
California 423967 39538223
Texas 695662 29145505
Florida 170312 21538187
- Similarly, using the loc indexer we can index the underlying data in an array-like style but using the explicit index and column names.
# In[24]
data.loc[:'Florida',:'pop']
# Out[24]
area pop
California 423967 39538223
Texas 695662 29145505
Florida 170312 21538187
- Any of the familiar NumPy-style data access patterns can be used within these indexers. For example, we can combine masking and fancy indexing.
# In[25]
data.loc[data.density>120,['pop','density']]
# Out[25]
pop density
Florida 21538187 126.463121
New York 20201249 142.970120
- Any of these indexing conventions may also be used to set or modify values.
# In[26]
data.iloc[0,2]=90
data
# Out[26]
area pop density
California 423967 39538223 90.000000
Texas 695662 29145505 41.896072
Florida 170312 21538187 126.463121
New York 141297 20201249 142.970120
Pennsylvania 119280 13002700 109.009893
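- The same kind of assignment also works with label-based and mask-based selection through loc. The following is a minimal sketch on a three-row version of the table; the replacement values 42.0 and 100.0 are arbitrary numbers chosen for illustration.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662, 170312],
                     'pop': [39538223, 29145505, 21538187]},
                    index=['California', 'Texas', 'Florida'])
data['density'] = data['pop'] / data['area']

# Set a single cell by row label and column label.
data.loc['Texas', 'density'] = 42.0

# Set every cell in one column that is selected by a boolean mask.
data.loc[data['density'] > 100, 'density'] = 100.0

print(data)
```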
Additional Indexing Conventions
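- There are a couple of extra indexing conventions that can be useful in practice. First, while plain indexing (data['area']) refers to columns, plain slicing refers to rows.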
# In[27]
data['Florida':'New York']
# Out[27]
area pop density
Florida 170312 21538187 126.463121
New York 141297 20201249 142.970120
- Such slices can also refer to rows by number rather than by index label.
# In[28]
data[1:3]
# Out[28]
area pop density
Texas 695662 29145505 41.896072
Florida 170312 21538187 126.463121
- Similarly, direct masking operations are interpreted row-wise rather than column-wise.
# In[29]
data[data.density>120]
# Out[29]
area pop density
Florida 170312 21538187 126.463121
New York 141297 20201249 142.970120
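- Each of these shorthand forms appears to have an explicit loc or iloc counterpart, which may be easier to read in longer code. The sketch below rebuilds the table and compares the two spellings with DataFrame.equals; the variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 170312, 141297, 119280],
     'pop': [39538223, 29145505, 21538187, 20201249, 13002700]},
    index=['California', 'Texas', 'Florida', 'New York', 'Pennsylvania'])
data['density'] = data['pop'] / data['area']

# Label slice on rows: same rows as loc with the same labels.
same_label_slice = data['Florida':'New York'].equals(
    data.loc['Florida':'New York'])

# Integer slice on rows: same rows as iloc with the same positions.
same_position_slice = data[1:3].equals(data.iloc[1:3])

# Boolean mask: selects rows, same as passing the mask to loc.
mask = data['density'] > 120
same_mask = data[mask].equals(data.loc[mask])

print(same_label_slice, same_position_slice, same_mask)
```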