Introducing Pandas Objects

노정훈 · July 18, 2023

  • There are three fundamental Pandas data structures: Series, DataFrame, and Index.
# In[1]
import numpy as np 
import pandas as pd

Pandas Series Object

  • A Pandas Series is a one-dimensional array of indexed data.
# In[2]
data=pd.Series([0.25,0.5,0.75,1.0])
data
# Out[2]
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
  • A Series combines a sequence of values with an explicit sequence of indices, which we can access through the values and index attributes.
# In[3]
print(data.values)
print(data.index)
# Out[3]
[0.25 0.5  0.75 1.  ]
RangeIndex(start=0, stop=4, step=1)
  • As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation.
# In[4]
print(data[1])
print(data[1:3])
# Out[4]
0.5
1    0.50
2    0.75
dtype: float64
  • The Pandas Series is much more general and flexible than the one-dimensional NumPy array that it emulates.

Series as Generalized NumPy Array

  • A NumPy array has an implicitly defined integer index used to access the values.
  • A Pandas Series has an explicitly defined index associated with the values.
  • This explicit index definition gives the Series object additional capabilities: for example, the index need not be an integer and can consist of values of any desired type, such as strings.
# In[5]
data=pd.Series([0.25,0.5,0.75,1.0],index=['b','a','d','c'])
data
# Out[5]
b    0.25
a    0.50
d    0.75
c    1.00
dtype: float64

# In[6]
data['b']
# Out[6]
0.25
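  • The explicit index also does not have to be contiguous or sequential. A minimal sketch, assuming illustrative index values (2, 5, 3, 7) that are not part of the original notebook:

```python
import pandas as pd

# Illustrative, non-sequential integer labels (an assumption for this sketch)
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# With an integer index, square brackets look up the label 5, not position 5
value = s[5]
print(value)
```

  • Here the lookup returns the value stored under the label 5, which is the second element of the Series.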

Series as Specialized Dictionary

  • A dictionary is a structure that maps arbitrary keys to a set of arbitrary values.
  • A Series is a structure that maps typed keys to a set of typed values.
  • Just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it more efficient than a Python dictionary for certain operations.
# In[7]
population_dict={'California':39538223,'Texas':29145505,'Florida':21538187,'New York':20201249,'Pennsylvania':13002700}
population=pd.Series(population_dict)
population
# Out[7]
California      39538223
Texas           29145505
Florida         21538187
New York        20201249
Pennsylvania    13002700
dtype: int64

# In[8]
population['California']
# Out[8]
39538223
  • Unlike a dictionary, though, the Series also supports array-style operations such as slicing.
# In[9]
population['California':'Florida']
# Out[9]
California    39538223
Texas         29145505
Florida       21538187
dtype: int64
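  • One detail worth making explicit: slicing by label includes the end label, whereas slicing by integer position excludes the stop position. The sketch below assumes the .loc and .iloc accessors, which are part of pandas but are not introduced in this post, and rebuilds population so that it runs on its own.

```python
import pandas as pd

population = pd.Series({'California': 39538223, 'Texas': 29145505,
                        'Florida': 21538187, 'New York': 20201249,
                        'Pennsylvania': 13002700})

# Label-based slice: the end label 'Florida' is included
by_label = population.loc['California':'Florida']

# Position-based slice: the stop position 2 is excluded
by_position = population.iloc[0:2]

print(list(by_label.index))
print(list(by_position.index))
```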

Constructing Series Objects

  • A Pandas Series is constructed following the pattern pd.Series(data, index=index).
  • index is an optional argument, and data can be one of many entities.
  • data can be a list or NumPy array, in which case index defaults to an integer sequence:
# In[10]
pd.Series([2,4,6])
# Out[10]
0    2
1    4
2    6
dtype: int64

  • data can be a scalar, which is repeated to fill the specified index.
# In[11]
pd.Series(5,index=[100,200,300])
# Out[11]
100    5
200    5
300    5
dtype: int64
  • Or it can be a dictionary, in which case index defaults to the dictionary keys
# In[12]
pd.Series({2:'a',1:'b',3:'c'})
# Out[12]
2    a
1    b
3    c
dtype: object
  • The index can be explicitly set to control the order or the subset of keys used.
# In[13]
pd.Series({2:'a',1:'b',3:'c'},index=[1,2])
# Out[13]
1    b
2    a
dtype: object
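  • A related case is an index label that does not appear among the dictionary keys. As a hedged sketch (the label 4 is made up for illustration), pandas keeps the label and marks its value as missing:

```python
import pandas as pd

# Label 3 exists in the dictionary; label 4 does not (illustrative assumption)
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 4])

print(s.loc[3])            # value taken from the dictionary
print(pd.isna(s.loc[4]))   # the unmatched label holds a missing value
```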

Pandas DataFrame Object

DataFrame as Generalized NumPy Array

  • If a Series is an analog of a one-dimensional array with explicit indices, a DataFrame is an analog of a two-dimensional array with explicit row and column indices.
# In[14]
area_dict={'California':423967,'Texas':695662,'Florida':170312,'New York':141297,'Pennsylvania':119280}
area=pd.Series(area_dict)
area
# Out[14]
California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
dtype: int64

# In[15]
states=pd.DataFrame({'population':population,'area':area})
states
# Out[15]
              population    area
California      39538223  423967
Texas           29145505  695662
Florida         21538187  170312
New York        20201249  141297
Pennsylvania    13002700  119280
  • Like the Series object, the DataFrame has an index attribute that gives access to the index labels.
# In[16]
states.index
# Out[16]
Index(['California', 'Texas', 'Florida', 'New York', 'Pennsylvania'], dtype='object')
  • Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels.
# In[17]
states.columns
# Out[17]
Index(['population', 'area'], dtype='object')

DataFrame as Specialized Dictionary

  • Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data.
# In[18]
states['area']
# Out[18]
California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64
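  • This is a possible point of confusion when coming from NumPy: indexing a two-dimensional array with a single key returns a row, while indexing a DataFrame with a column name returns a column. A small sketch of the contrast, using a two-state subset of the data above and assuming the DataFrame.to_numpy() method for the array view:

```python
import pandas as pd

population = pd.Series({'California': 39538223, 'Texas': 29145505})
area = pd.Series({'California': 423967, 'Texas': 695662})
states = pd.DataFrame({'population': population, 'area': area})

column = states['area']            # dictionary-style: one column as a Series
first_row = states.to_numpy()[0]   # array-style: the first row of values

print(column.tolist())
print(first_row.tolist())
```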

Constructing DataFrame Objects

From a single Series object

  • A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series.
# In[19]
pd.DataFrame(population,columns=['population'])
# Out[19]
              population
California      39538223
Texas           29145505
Florida         21538187
New York        20201249
Pennsylvania    13002700
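  • As an alternative not used in the original notebook, the Series.to_frame method should give an equivalent single-column result. A minimal sketch with a two-state subset of the data:

```python
import pandas as pd

population = pd.Series({'California': 39538223, 'Texas': 29145505})

# to_frame converts the Series into a DataFrame with one named column
frame = population.to_frame(name='population')

print(frame.columns.tolist())
print(frame.shape)
```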

From a list of dicts

# In[20]
data=[{'a':i,'b':2*i} for i in range(3)]
pd.DataFrame(data)
# Out[20]
   a  b
0  0  0
1  1  2
2  2  4
  • If some keys are missing from one of the dictionaries, Pandas fills the corresponding entries with NaN (Not a Number) values.
# In[21]
pd.DataFrame([{'a':1,'b':2},{'b':3,'c':4}])
# Out[21]
     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
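  • NaN is a floating-point value, which is why columns a and c display 1.0 and 4.0 while b keeps plain integers. A quick check of the column dtypes, as a sketch:

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Columns containing NaN are stored as floats; the complete column stays integer
print(df.dtypes)
```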

From a dictionary of Series objects

  • A DataFrame can also be constructed from a dictionary of Series objects.
  • We saw this earlier; see In[15].
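  • When the Series in the dictionary do not share exactly the same index, pandas aligns them by label and fills the gaps with NaN. The sketch below deliberately leaves Florida out of the area data to show this; the omission is an assumption made for illustration only.

```python
import pandas as pd

population = pd.Series({'California': 39538223, 'Texas': 29145505,
                        'Florida': 21538187})
# 'Florida' is intentionally missing here (illustrative assumption)
area = pd.Series({'California': 423967, 'Texas': 695662})

states = pd.DataFrame({'population': population, 'area': area})

# The row labels are the union of both indexes; the missing area is NaN
print(states)
```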

From a two-dimensional NumPy array

  • Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names.
  • If omitted, an integer index will be used for each.
# In[22]
pd.DataFrame(np.random.rand(3,2),columns=['foo','bar'],index=['a','b','c'])
# Out[22]
        foo       bar
a  0.466496  0.888614
b  0.228347  0.613272
c  0.912784  0.961023
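  • To illustrate the default case, here is a sketch that omits both columns and index. The seeded generator is an assumption added only to make the example repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seed chosen arbitrarily for repeatability
df = pd.DataFrame(rng.random((3, 2)))

# With no labels supplied, rows and columns both get integer labels from 0
print(df.index.tolist())
print(df.columns.tolist())
```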

From a NumPy structured array

  • A Pandas DataFrame operates much like a structured array, and can be created directly from one.
# In[23]
A=np.zeros(3,dtype=[('A','i8'),('B','f8')])
A
# Out[23]
array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

# In[24]
pd.DataFrame(A)
# Out[24]
   A    B
0  0  0.0
1  0  0.0
2  0  0.0

Pandas Index Object

  • The Series and DataFrame objects both contain an explicit index that lets you reference and modify data.
  • The Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set.
# In[25]
ind=pd.Index([2,3,5,7,11])
ind
# Out[25]
Int64Index([2, 3, 5, 7, 11], dtype='int64')

Index as Immutable Array

  • The Index in many ways operates like an array.
# In[26]
print(ind[1])
print(ind[::2])
print(ind.size, ind.shape, ind.ndim, ind.dtype)
# Out[26]
3
Int64Index([2, 5, 11], dtype='int64')
5 (5,) 1 int64
  • One difference between Index objects and NumPy arrays is that the indices are immutable.
  • That is, they cannot be modified via the normal means.
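  • A short sketch of what this looks like in practice: item assignment on an Index raises a TypeError, so the labels stay unchanged. The modified flag is a helper introduced only for this example.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0          # attempt an in-place change
    modified = True
except TypeError as err:
    modified = False    # pandas refuses the assignment
    print(err)

print(ind.tolist())
```

  • This immutability makes it safer to share an index between several DataFrames and arrays, without side effects from an inadvertent modification.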

Index as Ordered Set

  • The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way.
# In[27]
indA=pd.Index([1,3,5,7,9])
indB=pd.Index([2,3,5,7,11])

# In[28]
print(indA.intersection(indB))
print(indA.union(indB))
print(indA.symmetric_difference(indB))
# Out[28]
Int64Index([3, 5, 7], dtype='int64')
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
Int64Index([1, 2, 9, 11], dtype='int64')
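  • The one-sided difference and membership tests follow the same pattern. A minimal sketch using the Index.difference method and the in operator, neither of which appears in the notebook cells above:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Labels that are in indA but not in indB
only_in_a = indA.difference(indB)
print(only_in_a.tolist())

# Membership test, as with a Python set
print(3 in indA, 4 in indA)
```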