Introducing Pandas Objects

노정훈 · July 18, 2023

There are three fundamental Pandas data structures: Series, DataFrame, and Index.

# In[1]
import numpy as np 
import pandas as pd

Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data.

# In[2]
data=pd.Series([0.25,0.5,0.75,1.0])
data

# Out[2]
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

A Series combines a sequence of values with an explicit sequence of indices, which we can access with the values and index attributes.

# In[3]
print(data.values)
print(data.index)

# Out[3]
[0.25 0.5  0.75 1.  ]
RangeIndex(start=0, stop=4, step=1)
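
The two attributes return different kinds of objects. A minimal sketch, assuming the same data as above (the variable names values_is_ndarray and index_is_index are only illustrative):

```python
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

# .values exposes the underlying data as a NumPy array
values_is_ndarray = isinstance(data.values, np.ndarray)

# .index is an array-like Pandas Index (RangeIndex is a subclass of pd.Index)
index_is_index = isinstance(data.index, pd.Index)

print(values_is_ndarray, index_is_index)  # True True
```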

As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation.

# In[4]
print(data[1])
print(data[1:3])

# Out[4]
0.5
1    0.50
2    0.75
dtype: float64

The Pandas Series is much more general and flexible than the one-dimensional NumPy array that it emulates.

Series as Generalized NumPy Array

A NumPy array has an implicitly defined integer index used to access the values.
A Pandas Series has an explicitly defined index associated with the values.
This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer; it can consist of values of any desired type, such as strings.

# In[5]
data=pd.Series([0.25,0.5,0.75,1.0],index=['b','a','d','c'])
data

# Out[5]
b    0.25
a    0.50
d    0.75
c    1.00
dtype: float64

# In[6]
data['b']

# Out[6]
0.25
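
The explicit index can also hold integers that are neither sorted nor contiguous. The sketch below uses a made-up integer index (not from the examples above) to contrast label-based access with position-based access:

```python
import pandas as pd

# Hypothetical example: integer labels that are neither sorted nor contiguous
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = data.loc[5]      # explicit label 5 -> 0.5
by_position = data.iloc[0]  # implicit position 0 -> 0.25

print(by_label, by_position)  # 0.5 0.25
```

Using loc and iloc here makes it explicit whether the label or the position is meant.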

Series as Specialized Dictionary

A dictionary is a structure that maps arbitrary keys to a set of arbitrary values.
A Series is a structure that maps typed keys to a set of typed values.
Just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it more efficient than a Python dictionary for certain operations.

# In[7]
population_dict={'California':39538223,'Texas':29145505,'Florida':21538187,'New York':20201249,'Pennsylvania':13002700}
population=pd.Series(population_dict)
population

# Out[7]
California      39538223
Texas           29145505
Florida         21538187
New York        20201249
Pennsylvania    13002700
dtype: int64

# In[8]
population['California']

# Out[8]
39538223

Unlike a dictionary, though, the Series also supports array-style operations such as slicing.

# In[9]
population['California':'Florida']

# Out[9]
California    39538223
Texas         29145505
Florida       21538187
dtype: int64
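
Note that Florida, the end label of the slice, is included in the result. A small sketch comparing a label slice with a positional slice, written with loc and iloc to make the difference explicit:

```python
import pandas as pd

population = pd.Series({'California': 39538223, 'Texas': 29145505,
                        'Florida': 21538187, 'New York': 20201249,
                        'Pennsylvania': 13002700})

# Slicing by label includes the end label ('Florida')
by_label = population.loc['California':'Florida']

# Slicing by position excludes the end position (2), as with Python lists
by_position = population.iloc[0:2]

print(len(by_label), len(by_position))  # 3 2
```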

Constructing Series Objects

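The ways of constructing a Pandas Series seen so far all follow the form pd.Series(data, index=index).
Here index is an optional argument, and data can be one of many entities.
For example, data can be a list or NumPy array, in which case index defaults to an integer sequence: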

# In[10]
pd.Series([2,4,6])

# Out[10]
0    2
1    4
2    6
dtype: int64

data can also be a scalar, which is repeated to fill the specified index:

# In[11]
pd.Series(5,index=[100,200,300])

# Out[11]
100    5
200    5
300    5
dtype: int64

Or it can be a dictionary, in which case index defaults to the dictionary keys:

# In[12]
pd.Series({2:'a',1:'b',3:'c'})

# Out[12]
2    a
1    b
3    c
dtype: object

The index can be explicitly set to control the order or the subset of keys used.

# In[13]
pd.Series({2:'a',1:'b',3:'c'},index=[1,2])

# Out[13]
1    b
2    a
dtype: object
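
If the requested index contains a label that is not a key of the dictionary, that entry is filled with a missing value. A minimal sketch, where the label 4 is a made-up example:

```python
import pandas as pd

# Label 4 is not a key of the dictionary, so its value is missing (NaN)
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 4])

print(s.index.tolist())   # [3, 2, 4]
print(s.loc[3])           # c
print(pd.isna(s.loc[4]))  # True
```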

Pandas DataFrame Object

DataFrame as Generalized NumPy Array

If a Series is an analog of a one-dimensional array with explicit indices, a DataFrame is an analog of a two-dimensional array with explicit row and column indices.

# In[14]
area_dict={'California':423967,'Texas':695662,'Florida':170312,'New York':141297,'Pennsylvania':119280}
area=pd.Series(area_dict)
area

# Out[14]
California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
dtype: int64

# In[15]
states=pd.DataFrame({'population':population,'area':area})
states

# Out[15]
              population    area
California      39538223  423967
Texas           29145505  695662
Florida         21538187  170312
New York        20201249  141297
Pennsylvania    13002700  119280

Like the Series object, the DataFrame has an index attribute that gives access to the index labels.

# In[16]
states.index

# Out[16]
Index(['California', 'Texas', 'Florida', 'New York', 'Pennsylvania'], dtype='object')

Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels.

# In[17]
states.columns

# Out[17]
Index(['population', 'area'], dtype='object')

DataFrame as Specialized Dictionary

Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data.

# In[18]
states['area']

# Out[18]
California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64
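
This is a point where the array view and the dictionary view differ: indexing a two-dimensional NumPy array with [0] returns the first row, while indexing a DataFrame with a column name returns a column. A short sketch using a three-state subset of the data above (the subset is chosen only to keep the example short):

```python
import pandas as pd

population = pd.Series({'California': 39538223, 'Texas': 29145505,
                        'Florida': 21538187})
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'Florida': 170312})
states = pd.DataFrame({'population': population, 'area': area})

column = states['area']              # dictionary-style: a column as a Series
row = states.loc['Texas']            # a row, selected by its index label
first_row = states.values[0]         # array-style: first row of the NumPy array

print(column['Texas'])               # 695662
print(row['population'])             # 29145505
print(first_row.tolist())            # [39538223, 423967]
```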

Constructing DataFrame Objects

From a single Series object

A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series.

# In[19]
pd.DataFrame(population,columns=['population'])

# Out[19]
              population
California      39538223
Texas           29145505
Florida         21538187
New York        20201249
Pennsylvania    13002700

From a list of dicts

# In[20]
data=[{'a':i,'b':2*i} for i in range(3)]
pd.DataFrame(data)

# Out[20]
   a  b
0  0  0
1  1  2
2  2  4

Even if some keys in the dictionaries are missing, Pandas will fill them in with NaN ("Not a Number") values.

# In[21]
pd.DataFrame([{'a':1,'b':2},{'b':3,'c':4}])

# Out[21]
     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
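
In the output above, columns a and c are shown as floats while b stays an integer. Here is a small sketch for checking the column dtypes, on the understanding that NaN is a floating-point value and so forces the affected columns to float:

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Columns with a missing entry are stored as floats; 'b' has no gaps
print(df.dtypes)

# Total number of missing entries in the frame
n_missing = int(df.isna().sum().sum())
print(n_missing)  # 2
```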

From a dictionary of Series objects

A DataFrame can also be constructed from a dictionary of Series objects.
We saw this earlier in In[15].
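
When the Series in the dictionary do not share the same index, Pandas aligns them on the union of their labels and fills the gaps with NaN. A minimal sketch with deliberately mismatched, made-up subsets of the state data:

```python
import pandas as pd

# Hypothetical subsets: the two Series only share the label 'Texas'
pop = pd.Series({'California': 39538223, 'Texas': 29145505})
area = pd.Series({'Texas': 695662, 'Florida': 170312})

df = pd.DataFrame({'population': pop, 'area': area})
print(df)

# Labels present in only one Series get NaN in the other column
print(pd.isna(df.loc['Florida', 'population']))  # True
print(pd.isna(df.loc['California', 'area']))     # True
```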

From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names.
If omitted, an integer index will be used for each.

# In[22]
pd.DataFrame(np.random.rand(3,2),columns=['foo','bar'],index=['a','b','c'])

# Out[22]
        foo       bar
a  0.466496  0.888614
b  0.228347  0.613272
c  0.912784  0.961023
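
For comparison, here is a sketch of the case where columns and index are both omitted. It uses np.arange instead of random numbers so that the result is reproducible:

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)

# No index or columns given: both default to integer labels
df = pd.DataFrame(arr)

print(df.index.tolist(), df.columns.tolist())  # [0, 1, 2] [0, 1]
```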

From a NumPy structured array

A Pandas DataFrame operates much like a structured array, and can be created directly from one.

# In[23]
A=np.zeros(3,dtype=[('A','i8'),('B','f8')])
A

# Out[23]
array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

# In[24]
pd.DataFrame(A)

# Out[24]
   A    B
0  0  0.0
1  0  0.0
2  0  0.0

Pandas Index Object

The Series and DataFrame objects both contain an explicit index that lets you reference and modify data.
The Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set.

# In[25]
ind=pd.Index([2,3,5,7,11])
ind

# Out[25]
Int64Index([2, 3, 5, 7, 11], dtype='int64')

Index as Immutable Array

The Index in many ways operates like an array.

# In[26]
print(ind[1])
print(ind[::2])
print(ind.size, ind.shape, ind.ndim, ind.dtype)

# Out[26]
3
Int64Index([2, 5, 11], dtype='int64')
5 (5,) 1 int64

One difference between Index objects and NumPy arrays is that Index objects are immutable.
That is, they cannot be modified via the normal means, such as item assignment.
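
A short sketch of what happens when item assignment is attempted on an Index. The try/except is only there so that the example runs to the end:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

raised = False
try:
    ind[1] = 0  # item assignment is not supported on an Index
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# The Index is unchanged after the failed assignment
print(ind.tolist())  # [2, 3, 5, 7, 11]
```

This immutability makes it safer to share an index between multiple Series and DataFrames, without side effects from accidental modification.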

Index as Ordered Set

The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way.

# In[27]
indA=pd.Index([1,3,5,7,9])
indB=pd.Index([2,3,5,7,11])

# In[28]
print(indA.intersection(indB))
print(indA.union(indB))
print(indA.symmetric_difference(indB))

# Out[28]
Int64Index([3, 5, 7], dtype='int64')
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
Int64Index([1, 2, 9, 11], dtype='int64')
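
The same pattern covers other set-like operations. Below is a minimal sketch of a one-sided difference and a membership test on the same two indices, with results converted to lists so the printed form does not depend on the Pandas version:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

only_in_A = indA.difference(indB)  # labels in indA but not in indB
has_three = 3 in indA              # membership test, as with a set

print(only_in_A.tolist(), has_three)  # [1, 9] True
```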
