[Machine Learning] 02. Pandas (1)

Nina · January 27, 2021

Notes compiled after studying 권철민 (2020), 파이썬 머신러닝 완벽가이드 (revised edition), 위키북스.

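1. Pandas

(1) Pandas

Pandas is a Python library for data processing. It provides functionality for efficiently manipulating two-dimensional data made up of rows and columns. Much of pandas is built on NumPy, but it allows far more flexible data handling than NumPy does.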

(2) DataFrame

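DataFrame, the core object of pandas, is a structure that holds two-dimensional data made up of multiple rows and columns. DataFrame and Series both use an Index as their key, but a Series has a single column while a DataFrame has several. In other words, a DataFrame can be viewed as a collection of Series.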
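
To make the relationship between DataFrame and Series concrete, here is a minimal sketch. The toy data and the names `df` and `score` are made up for illustration and are not part of the Spotify file used later.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [10, 20, 30]})

# Selecting a single column yields a Series
score = df["score"]
print(type(score))

# The Series carries the same Index as the DataFrame it came from
print(score.index.equals(df.index))
```

Each column can be pulled out this way, which is why a DataFrame can be thought of as several Series sharing one Index.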

2. Loading a CSV file into a DataFrame

(1) read_csv()

>>> import pandas as pd
>>> spotify_df = pd.read_csv(r'~/Downloads/top50.csv', encoding="ISO-8859-1")
>>> spotify_df.head(3)
   Unnamed: 0                     Track.Name    Artist.Name           Genre  ...  Length.  Acousticness..  Speechiness.  Popularity
0           1                       Señorita   Shawn Mendes    canadian pop  ...      191               4             3          79
1           2                          China       Anuel AA  reggaeton flow  ...      302               8             9          92
2           3  boyfriend (with Social House)  Ariana Grande       dance pop  ...      186              12            46          85

[3 rows x 14 columns]
>>> type(spotify_df)
<class 'pandas.core.frame.DataFrame'>
>>> spotify_df.shape
(50, 14)

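read_csv() converts a file in a delimiter-separated format into a DataFrame; the default delimiter is a comma. Unless told otherwise through its parameters, it treats the first row of the file as the column names and generates a pandas index, which is displayed in the leftmost position.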
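
The defaults described above can be checked without a file on disk. The sketch below feeds read_csv() a small in-memory CSV through io.StringIO; the text content and variable names are hypothetical stand-ins, and `header` and `sep` are the read_csv() parameters that control the first-row and delimiter behavior.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk
csv_text = "track,artist,popularity\nSong A,Artist X,79\nSong B,Artist Y,92\n"

# Default: the first row becomes the column names and a RangeIndex is generated
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.columns.tolist())
print(df_default.index)

# header=None treats the first row as data, so the columns are numbered instead
df_no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(df_no_header.shape)

# sep selects the field delimiter, here a tab
tsv_text = csv_text.replace(",", "\t")
df_tab = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(df_tab.equals(df_default))
```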

(2) info()

>>> spotify_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Unnamed: 0        50 non-null     int64
 1   Track.Name        50 non-null     object
 2   Artist.Name       50 non-null     object
 3   Genre             50 non-null     object
 4   Beats.Per.Minute  50 non-null     int64
 5   Energy            50 non-null     int64
 6   Danceability      50 non-null     int64
 7   Loudness..dB..    50 non-null     int64
 8   Liveness          50 non-null     int64
 9   Valence.          50 non-null     int64
 10  Length.           50 non-null     int64
 11  Acousticness..    50 non-null     int64
 12  Speechiness.      50 non-null     int64
 13  Popularity        50 non-null     int64
dtypes: int64(11), object(3)
memory usage: 5.6+ KB

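The info() method shows the total number of rows, the data type of each column, and the non-null count per column, from which the number of nulls can be worked out.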

(3) describe()

>>> spotify_df.describe()
       Unnamed: 0  Beats.Per.Minute     Energy  Danceability  ...     Length.  Acousticness..  Speechiness.  Popularity
count    50.00000         50.000000  50.000000      50.00000  ...   50.000000       50.000000     50.000000   50.000000
mean     25.50000        120.060000  64.060000      71.38000  ...  200.960000       22.160000     12.480000   87.500000
std      14.57738         30.898392  14.231913      11.92988  ...   39.143879       18.995553     11.161596    4.491489
min       1.00000         85.000000  32.000000      29.00000  ...  115.000000        1.000000      3.000000   70.000000
25%      13.25000         96.000000  55.250000      67.00000  ...  176.750000        8.250000      5.000000   86.000000
50%      25.50000        104.500000  66.500000      73.50000  ...  198.000000       15.000000      7.000000   88.000000
75%      37.75000        137.500000  74.750000      79.75000  ...  217.500000       33.750000     15.000000   90.750000
max      50.00000        190.000000  88.000000      90.00000  ...  309.000000       75.000000     46.000000   95.000000

[8 rows x 11 columns]

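The describe() method returns, for each numeric column, the count, mean, standard deviation, minimum, maximum, and percentile values (25%, 50%, and 75% by default). It cannot show the exact distribution on its own, but it is useful for getting a rough picture of it. Knowing how the data is distributed is an important factor in improving the performance of machine learning algorithms.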
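
As a small illustration of what describe() covers, the sketch below uses a made-up DataFrame with one text column and one numeric column. The `percentiles` argument is a describe() parameter for requesting percentiles other than the defaults; the data itself is hypothetical.

```python
import pandas as pd

# Hypothetical data: one object (text) column and one numeric column
df = pd.DataFrame({
    "genre": ["pop", "rap", "pop", "rock"],
    "popularity": [70, 80, 90, 100],
})

# With mixed column types, only the numeric column is summarized by default
summary = df.describe()
print(summary)

# Request the 10th and 90th percentiles instead of the default 25%/75%
custom = df.describe(percentiles=[0.1, 0.9])
print(custom.index.tolist())
```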

(4) value_counts()

>>> spotify_df['Popularity'].head(3)
0    79
1    92
2    85
Name: Popularity, dtype: int64
>>> value_counts = spotify_df['Popularity'].value_counts()
>>> value_counts
88    8
89    8
91    7
87    4
90    3
84    3
92    3
82    2
83    2
86    2
94    1
93    1
70    1
78    1
85    1
80    1
79    1
95    1
Name: Popularity, dtype: int64

Putting a column name in square brackets after a DataFrame returns that column's data (index and values) as a Series. Calling value_counts() on the returned Series shows each distinct value in the column and how many times it occurs.
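
One detail worth knowing is how value_counts() treats missing values. The sketch below uses a made-up Series containing a None; by default missing values are left out of the counts, and `dropna=False` includes them.

```python
import pandas as pd

# Hypothetical Series with one missing value
genres = pd.Series(["pop", "rap", "pop", None, "pop", "rap"])

# Counts are sorted in descending order; missing values are excluded by default
counts = genres.value_counts()
print(counts)

# dropna=False adds a row for the missing values
counts_with_nan = genres.value_counts(dropna=False)
print(counts_with_nan)
```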

2. Managing DataFrame column data sets

(1) Creating

>>> spotify_df['rating_0']=0
>>> spotify_df.head(3)
   Unnamed: 0                     Track.Name    Artist.Name           Genre  ...  Acousticness..  Speechiness.  Popularity  rating_0
0           1                       Señorita   Shawn Mendes    canadian pop  ...               4             3          79         0
1           2                          China       Anuel AA  reggaeton flow  ...               8             9          92         0
2           3  boyfriend (with Social House)  Ariana Grande       dance pop  ...              12            46          85         0

[3 rows x 15 columns]

As with a dictionary, a column can be created with "DataFrame['column name'] = value". A scalar value such as 0 is assigned to every row.

(2) Modifying

>>> spotify_df['rating_0']=spotify_df['Popularity']*10
>>> spotify_df.head(3)
   Unnamed: 0                     Track.Name    Artist.Name           Genre  ...  Acousticness..  Speechiness.  Popularity  rating_0
0           1                       Señorita   Shawn Mendes    canadian pop  ...               4             3          79       790
1           2                          China       Anuel AA  reggaeton flow  ...               8             9          92       920
2           3  boyfriend (with Social House)  Ariana Grande       dance pop  ...              12            46          85       850

[3 rows x 15 columns]

Modifying a column works much like creating one: assigning to an existing column name overwrites its values.
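
Creation and modification can be seen side by side in a minimal sketch on made-up data. The `is_hit` column and its threshold of 85 are hypothetical and only show that a new column can be derived from an existing one.

```python
import pandas as pd

# Hypothetical data reusing the first three Popularity values shown above
df = pd.DataFrame({"popularity": [79, 92, 85]})

# Create: a scalar is assigned to every row
df["rating_0"] = 0

# Modify: assigning to an existing column name overwrites it
df["rating_0"] = df["popularity"] * 10

# Create a column derived from another one (hypothetical threshold)
df["is_hit"] = df["popularity"] >= 85

print(df)
```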

(3) Deleting

Data is deleted from a DataFrame with the drop() method, whose signature is as follows.

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Here, axis is the axis to delete along (0: rows, 1: columns). With axis=1, labels takes column names; with axis=0, it takes index labels.

>>> drop_result = spotify_df.drop(labels=['rating_0'], axis=1)
>>> drop_result.head(3)
   Unnamed: 0                     Track.Name    Artist.Name           Genre  ...  Length.  Acousticness..  Speechiness.  Popularity
0           1                       Señorita   Shawn Mendes    canadian pop  ...      191               4             3          79
1           2                          China       Anuel AA  reggaeton flow  ...      302               8             9          92
2           3  boyfriend (with Social House)  Ariana Grande       dance pop  ...      186              12            46          85

[3 rows x 14 columns]
>>> spotify_df.head(3)
   Unnamed: 0                     Track.Name    Artist.Name           Genre  ...  Acousticness..  Speechiness.  Popularity  rating_0
0           1                       Señorita   Shawn Mendes    canadian pop  ...               4             3          79       790
1           2                          China       Anuel AA  reggaeton flow  ...               8             9          92       920
2           3  boyfriend (with Social House)  Ariana Grande       dance pop  ...              12            46          85       850

[3 rows x 15 columns]

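The default value of inplace is False. With inplace=False, the deletion is not applied to the original DataFrame; drop() returns a new DataFrame with the column removed, while spotify_df still has 15 columns.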

>>> drop_result = spotify_df.drop(labels=['rating_0'], axis=1, inplace=True)
>>> spotify_df.head(3)
   Unnamed: 0                     Track.Name    Artist.Name           Genre  ...  Length.  Acousticness..  Speechiness.  Popularity
0           1                       Señorita   Shawn Mendes    canadian pop  ...      191               4             3          79
1           2                          China       Anuel AA  reggaeton flow  ...      302               8             9          92
2           3  boyfriend (with Social House)  Ariana Grande       dance pop  ...      186              12            46          85

[3 rows x 14 columns]
>>> drop_result

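With inplace=True, however, the column is removed from the original DataFrame itself. In this case drop() returns None, so the result must not be assigned back to the variable holding the original DataFrame; doing so would replace the DataFrame with None.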

>>> drop_result = spotify_df.drop(labels=[0,2], axis=0, inplace=True)
>>> spotify_df.head(3)
   Unnamed: 0                       Track.Name  Artist.Name           Genre  ...  Length.  Acousticness..  Speechiness.  Popularity
1           2                            China     Anuel AA  reggaeton flow  ...      302               8             9          92
3           4  Beautiful People (feat. Khalid)   Ed Sheeran             pop  ...      198              12            19          86
4           5      Goodbyes (Feat. Young Thug)  Post Malone         dfw rap  ...      175              45             7          94

[3 rows x 14 columns]

Rows can be deleted the same way, by passing index labels with axis=0.
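
The behaviors of drop() covered in this section can be condensed into one sketch on a made-up DataFrame. The `columns=` keyword comes from the drop() signature shown earlier and is an alternative to `labels=` with `axis=1`; the data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical three-column DataFrame
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# inplace=False (the default): a new DataFrame is returned, df is unchanged
without_c = df.drop(labels=["c"], axis=1)

# Equivalent call using the columns keyword instead of labels + axis
same_thing = df.drop(columns=["c"])

# inplace=True: df itself is modified and the return value is None
result = df.drop(labels=[0], axis=0, inplace=True)

print(result)
print(df)
```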

3. Index object

(1) .index

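>>> # Reload the file: rows 0 and 2 were dropped in place in the previous section
>>> spotify_df = pd.read_csv(r'~/Downloads/top50.csv', encoding="ISO-8859-1")
>>> indexes = spotify_df.index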
>>> indexes
RangeIndex(start=0, stop=50, step=1)
>>> indexes.values
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])

An Index object supports indexing and slicing, but its values cannot be modified.

>>> indexes[1]
1
>>> indexes[:4].values
array([0, 1, 2, 3])
>>> indexes[0]=10
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nina/miniconda3/envs/machine-learning/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 4277, in __setitem__
    raise TypeError("Index does not support mutable operations")
TypeError: Index does not support mutable operations
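
The same immutability can be reproduced on a small made-up DataFrame. This sketch catches the TypeError instead of letting it stop the program, then shows one way to change labels, which is to assign a whole new index rather than edit individual entries. The labels "x", "y", "z" are hypothetical.

```python
import pandas as pd

# Hypothetical small DataFrame with the default RangeIndex
df = pd.DataFrame({"a": [10, 20, 30]})
idx = df.index

# Item assignment on an Index raises TypeError
error_message = None
try:
    idx[0] = 10
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Replacing the index as a whole is allowed
df.index = ["x", "y", "z"]
print(df)
```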

(2) reset_index()

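The reset_index() method assigns a new index of consecutive integers. The previous index is kept as a new column named 'index'; if a column with that name already exists, as in the last call below, the new column is named 'level_0' instead.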

>>> spotify_reset_df = spotify_df.reset_index(inplace=False)
>>> spotify_reset_df.head(3)
   index  Unnamed: 0                     Track.Name    Artist.Name  ... Length.  Acousticness..  Speechiness.  Popularity
0      0           1                       Señorita   Shawn Mendes  ...     191               4             3          79
1      1           2                          China       Anuel AA  ...     302               8             9          92
2      2           3  boyfriend (with Social House)  Ariana Grande  ...     186              12            46          85

[3 rows x 15 columns]
>>> spotify_reset_drop_df = spotify_reset_df.drop(labels=[1,2],axis=0)
>>> spotify_reset_drop_df.head(3)
   index  Unnamed: 0                       Track.Name   Artist.Name  ... Length.  Acousticness..  Speechiness.  Popularity
0      0           1                         Señorita  Shawn Mendes  ...     191               4             3          79
3      3           4  Beautiful People (feat. Khalid)    Ed Sheeran  ...     198              12            19          86
4      4           5      Goodbyes (Feat. Young Thug)   Post Malone  ...     175              45             7          94

[3 rows x 15 columns]
>>> spotify_reset_drop_df.reset_index(inplace=True)
>>> spotify_reset_drop_df.head(3)
   level_0  index  Unnamed: 0                       Track.Name  ... Length. Acousticness..  Speechiness.  Popularity
0        0      0           1                         Señorita  ...     191              4             3          79
1        3      3           4  Beautiful People (feat. Khalid)  ...     198             12            19          86
2        4      4           5      Goodbyes (Feat. Young Thug)  ...     175             45             7          94

[3 rows x 16 columns]
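
When the old index is not needed, the extra 'index' and 'level_0' columns seen above can be avoided with the `drop=True` parameter of reset_index(). The sketch below compares both forms on made-up data; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame; dropping rows leaves gaps in the index (0, 3)
df = pd.DataFrame({"a": [10, 20, 30, 40]})
dropped = df.drop(labels=[1, 2], axis=0)

# Default: the old index is preserved as a column named 'index'
kept = dropped.reset_index()

# drop=True: the old index is discarded and no column is added
clean = dropped.reset_index(drop=True)

print(kept)
print(clean)
```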

4. Converting between DataFrame and list, dictionary, ndarray

Many scikit-learn APIs accept a DataFrame as an argument, but most of them work with ndarrays. Conversions between DataFrame and ndarray are therefore very frequent.

(1) list, dictionary, ndarray ➜ DataFrame

>>> import numpy as np
>>> column1 = ['col1']
>>> list1 = [1,2,3]
>>> array1 = np.array(list1)
>>> df_list1 = pd.DataFrame(list1, columns=column1)
>>> df_array1 = pd.DataFrame(array1, columns=column1)
>>> df_list1
   col1
0     1
1     2
2     3
>>> df_array1
   col1
0     1
1     2
2     3

Here a one-dimensional list and a one-dimensional ndarray were each used to create a DataFrame with a single column.

>>> column2 = ['col1','col2','col3']
>>> list2 = [[1,2,3],[10,20,30]]
>>> array2 = np.array(list2)
>>> df_list2 = pd.DataFrame(list2, columns=column2)
>>> df_array2 = pd.DataFrame(array2, columns=column2)
>>> df_list2
   col1  col2  col3
0     1     2     3
1    10    20    30
>>> df_array2
   col1  col2  col3
0     1     2     3
1    10    20    30

A two-dimensional list or ndarray can likewise be used to create a DataFrame with multiple columns. Each inner list becomes one row.

>>> data_dict = {'col1':[1,10],'col2':[2,20],'col3':[3,30]}  # avoid shadowing the built-in dict
>>> df_dict = pd.DataFrame(data_dict)
>>> df_dict
   col1  col2  col3
0     1     2     3
1    10    20    30

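A DataFrame can also be created from a dictionary, with the dictionary keys becoming the column names and each value list becoming that column's data.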

(2) DataFrame ➜ list, dictionary, ndarray

>>> array3 = df_dict.values
>>> type(array3)
<class 'numpy.ndarray'>
>>> array3
array([[ 1,  2,  3],
       [10, 20, 30]])
>>> list3 = df_dict.values.tolist()
>>> type(list3)
<class 'list'>
>>> list3
[[1, 2, 3], [10, 20, 30]]

The .values attribute converts a DataFrame into an ndarray. Calling tolist() on that ndarray returns a list.
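
Besides .values, pandas also offers the to_numpy() method, which the pandas documentation recommends for this conversion. The sketch below rebuilds the same small DataFrame and checks that both give the same array; the variable names are hypothetical.

```python
import pandas as pd

# Same small DataFrame as in the example above
df = pd.DataFrame({"col1": [1, 10], "col2": [2, 20], "col3": [3, 30]})

# Attribute access, as used above
array_a = df.values

# Method call that performs the same conversion
array_b = df.to_numpy()

# ndarray -> nested Python list, one inner list per row
nested = df.to_numpy().tolist()

print((array_a == array_b).all())
print(nested)
```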

>>> dict3 = df_dict.to_dict('list')
>>> dict3
{'col1': [1, 10], 'col2': [2, 20], 'col3': [3, 30]}
>>> dict4 = df_dict.to_dict()
>>> dict4
{'col1': {0: 1, 1: 10}, 'col2': {0: 2, 1: 20}, 'col3': {0: 3, 1: 30}}

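Calling to_dict() on a DataFrame converts it into a dictionary. By default each column maps to a nested dictionary keyed by index; passing 'list' makes each column map to a list of values instead.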
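
to_dict() accepts other orientations as well. As a hedged sketch, the example below compares 'list' with 'records', which returns one dictionary per row, and confirms that the 'list' form can be passed straight back to pd.DataFrame(). The variable names are hypothetical.

```python
import pandas as pd

# Same small DataFrame as in the example above
df = pd.DataFrame({"col1": [1, 10], "col2": [2, 20], "col3": [3, 30]})

# 'list': column name -> list of values
by_column = df.to_dict("list")

# 'records': one dictionary per row
by_row = df.to_dict("records")

# The column-oriented dictionary round-trips back into an equal DataFrame
restored = pd.DataFrame(by_column)

print(by_column)
print(by_row)
print(restored.equals(df))
```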

Nina

https://dev.to/ninahwang
