๐Ÿ–ฅ๏ธ[Python] 9. Pandas ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ

thisk336 · October 1, 2023

Python


pandas

  • pandas (the name derives from "panel data" and doubles as a play on "Python data analysis") is a library optimized for analyzing structured, tabular data.
  • Development began in 2008, and it was released as open source in 2009.
  • Its central structure is the DataFrame, which is designed to represent structured data efficiently.
  • It provides a wide range of data manipulation features.
  • It is optimized for vectorized operations because it is built on top of numpy.
import numpy as np
import pandas as pd

# Build a 5x3 DataFrame of standard-normal random values and preview it
df = pd.DataFrame(np.random.randn(5, 3))
df.head()
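
To make the point about vectorized operations concrete, here is a minimal sketch. The column names, the fixed seed, and the `* 10 + 1` transformation are assumptions chosen for illustration only. It compares one vectorized expression with an equivalent element-by-element loop.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the fixed seed only keeps the run reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 3)), columns=["a", "b", "c"])

# Vectorized: a single expression is applied to every element at once.
scaled = df * 10 + 1

# The same computation written as an explicit loop, for comparison only.
looped = df.copy()
for col in looped.columns:
    for i in range(len(looped)):
        looped.loc[i, col] = looped.loc[i, col] * 10 + 1

# Both approaches should agree up to floating-point tolerance.
print(np.allclose(scaled.to_numpy(), looped.to_numpy()))
```

The vectorized form is shorter and pushes the arithmetic down to numpy, which is where the speed advantage typically comes from.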

Why use pandas

  • Most cleaned data comes in table form, and pandas is well suited to analyzing exactly this kind of tabular data.
  • It is optimized for operations on structured data and performs well on them.
  • It lets you manage varied structured data in one place: json, html, csv, xlsx, hdf5, sql, and other sources can all be loaded into a DataFrame.
  • It offers nearly all of the calculation features that Excel provides.
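
As a rough illustration of loading different formats into the same structure, the sketch below reads a small CSV string and a small JSON string. The sample data and the names `csv_text` and `json_text` are made up for this example; in practice you would usually pass file paths instead.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for real files.
csv_text = "name,score\nkim,90\nlee,85\n"
json_text = '[{"name": "kim", "score": 90}, {"name": "lee", "score": 85}]'

# Both readers return a DataFrame, regardless of the source format.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

Other readers such as `pd.read_excel`, `pd.read_html`, `pd.read_hdf`, and `pd.read_sql` follow the same pattern, though some of them need optional dependencies to be installed.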

DataFrame

  • The DataFrame is the basic data structure of the pandas library: a two-dimensional table.
  • Pulling a one-dimensional piece (a single row or column) out of a DataFrame gives a Series.
  • Every element is identified by its row and column.
  • It exposes the attributes index, columns, and values.
  • Its table layout maps naturally onto relational database tables, and pandas can read from and write to SQL databases.
  • Within a single column, all elements share the same data type.
  • A DataFrame extends the idea of a numpy array, so numpy universal functions (ufuncs) can be applied to it.
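
The properties listed above can be checked with a short sketch. The column names `x` and `label` are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# A small DataFrame with one numeric and one text column (made-up data).
df = pd.DataFrame({"x": [1, 4, 9], "label": ["a", "b", "c"]})

# Taking a single column out of a DataFrame yields a Series.
col = df["x"]
print(type(col).__name__)

# The three basic attributes: row labels, column labels, underlying values.
print(df.index)
print(df.columns)
print(df.values)

# Each column has one dtype; different columns may have different dtypes.
print(df.dtypes)

# A numpy universal function applied to a column works element-wise
# and returns a Series with the same index.
roots = np.sqrt(df["x"])
print(roots)
```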

Creating a DataFrame

import pandas as pd
import numpy as np

# Create a pandas.Series with the elements 1, 3, 5, np.nan, 6, 8
pd.Series([1, 3, 5, np.nan, 6, 8])

# Create a 12x4 DataFrame holding the numbers 1 through 48, with an index starting at 0 and columns named X1, X2, X3, X4 in order
df = pd.DataFrame(data=np.arange(1, 49).reshape(12, 4),
                  index=np.arange(12),
                  columns=["X1", "X2", "X3", "X4"])

Accessing basic attributes

# dataframe index
df.index

# dataframe columns
df.columns

# dataframe values
df.values

# Select a single column
df["X2"] # dictionary-like access

# Add 2 to the X1 column; this returns a new Series and leaves df unchanged
# To store the result, assign it back: df["X1"] = df["X1"] + 2
df["X1"] + 2
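
The difference between computing a new column and storing it can be seen in a minimal sketch. It rebuilds the same 12x4 DataFrame as above; the names `shifted`, `before`, and `after` are introduced only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 49).reshape(12, 4),
                  columns=["X1", "X2", "X3", "X4"])

# The expression builds a new Series; df itself is not modified.
shifted = df["X1"] + 2
before = df["X1"].iloc[0]
print(before)  # 1

# Assigning the result back overwrites the column in df.
df["X1"] = df["X1"] + 2
after = df["X1"].iloc[0]
print(after)  # 3
```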

Basic DataFrame functions

# head() shows the first rows of the DataFrame (5 by default)
df.head()

# First 10 rows
df.head(10)

# tail() shows the last rows of the DataFrame (5 by default)
df.tail()

# Print a summary of the DataFrame (index, column dtypes, non-null counts, memory usage)
df.info()

# Show summary statistics for the numeric columns
df.describe()

# Sort by the X2 column in descending order
df.sort_values(by="X2", ascending=False)
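
One detail worth seeing in action: `sort_values` returns a sorted copy rather than reordering the DataFrame in place, and it can take several sort keys. The sketch below uses made-up `group` and `value` columns to show this.

```python
import pandas as pd

# Hypothetical data with a grouping column and a numeric column.
df = pd.DataFrame({"group": ["b", "a", "b", "a"], "value": [3, 1, 2, 4]})

# Sort by group ascending, then by value descending within each group.
ordered = df.sort_values(by=["group", "value"], ascending=[True, False])
print(ordered)

# The sorted copy keeps the original row labels; reset_index renumbers them.
renumbered = ordered.reset_index(drop=True)
print(renumbered)

# The original DataFrame keeps its original order.
print(df)
```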

Fancy Indexing

  • iloc selects by integer position (0-based row and column numbers).
  • loc selects by label, using index values and column names directly.
# Select the X1 column by attribute access; same as df["X1"]
df.X1

# Slice the first 3 rows
df[:3]

# Select a row by its index label; here the label 0 is df.index[0]
df.loc[0]

# Two-dimensional indexing with loc
df.loc[[0, 1, 4, 6, 10], ["X1", "X3"]]

# Filter rows with a boolean condition
df[df["X3"] % 3 == 0]

# Integer-location based indexing
df.iloc[5]

# Two-dimensional indexing with iloc
df.iloc[[3, 4], [0, 1]]
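
With the default 0-based index used above, loc and iloc often return the same rows, which can hide the difference between them. The sketch below uses a deliberately shuffled index (an assumption for illustration) so that labels and positions disagree, and then shows how slice endpoints are handled.

```python
import pandas as pd

# Row labels are intentionally out of order: label 0 sits at position 2.
shuffled = pd.DataFrame({"X1": [10, 20, 30]}, index=[1, 2, 0])

by_label = shuffled.loc[0, "X1"]     # row whose label is 0
by_position = shuffled.iloc[0, 0]    # first row, first column
print(by_label)      # 30
print(by_position)   # 10

# Slices differ too: loc includes the end label, iloc excludes the end position.
plain = pd.DataFrame({"X1": [10, 20, 30, 40, 50]})
loc_rows = plain.loc[0:2]     # labels 0, 1, 2
iloc_rows = plain.iloc[0:2]   # positions 0, 1
print(len(loc_rows), len(iloc_rows))  # 3 2
```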

Combining multiple DataFrames

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[0, 1, 2, 3])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[0, 1, 2, 3])

๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์ด ์ฃผ์–ด์กŒ์„๋•Œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ํ•ฉ์น˜๊ณ  ์‹ถ์œผ๋ฉด merge ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

# SQL๊ณผ ๊ฐ™์ด join operation์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค
pd.merge(df1, df2, on="A", how="outer")

# left join
pd.merge(df1, df2, on="A", how="left")

# ๊ทธ๋ƒฅ ํ•ฉ์น˜๊ธฐ (concatenation)
pd.concat([df1, df2, df3], axis=0).reset_index(drop=True)
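
Because df1 and df2 above share exactly the same keys in column A, the outer and left joins return the same rows. The sketch below uses two smaller, made-up DataFrames (`left` and `right`) whose keys only partly overlap, so the join types can be told apart. The `suffixes` and `indicator` arguments are optional extras of `pd.merge`, not something the examples above use.

```python
import pandas as pd

# Hypothetical inputs: keys A1 and A2 appear in both, A0 and A3 in only one.
left = pd.DataFrame({"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]})
right = pd.DataFrame({"A": ["A1", "A2", "A3"], "B": ["B5", "B6", "B7"]})

# Outer join keeps every key from both sides; the _merge column added by
# indicator=True records where each row came from.
outer = pd.merge(left, right, on="A", how="outer",
                 suffixes=("_left", "_right"), indicator=True)
print(outer)

# Left join keeps only the keys of the left DataFrame; unmatched cells are NaN.
left_join = pd.merge(left, right, on="A", how="left",
                     suffixes=("_left", "_right"))
print(left_join)

# concat stacks rows without matching keys; ignore_index renumbers the rows.
stacked = pd.concat([left, right], axis=0, ignore_index=True)
print(stacked)
```

Passing `ignore_index=True` has the same effect on the row labels as calling `.reset_index(drop=True)` after the concatenation.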

0๊ฐœ์˜ ๋Œ“๊ธ€