๐Ÿ’ฆ ์™œ ๋‹ค๋“ค ๋‚ด ์ฝ”๋“œ๋ฅผ ๋ณด๊ณ  ํ™”์žฅ์‹ค์„ ๊ฐ€๋Š”์ง€ ๋ชจ๋ฅด๊ฒ ์–ด

4past12ยท2023๋…„ 9์›” 18์ผ
0

ํ‚น๊ฐ“ datamanim ๋•์— ๊ณ ์งˆ์ ์ธ ์ „์ฒ˜๋ฆฌ์žฅ์• ๋ฅผ ์–ด๋А์ •๋„ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๊ฒฝ๋ฐฐ.
https://www.datamanim.com/dataset/99_pandas/pandasMain.html

  1. Column ์ด๋ฆ„์— ํ•œ๊ธ€์ด ํฌํ•จ๋˜์–ด ์žˆ์„ ๊ฒฝ์šฐ ์–ด๋–ป๊ฒŒ ๋ถˆ๋Ÿฌ์˜ค๋‹ˆ?
# cp949 / euc-kr ์ด์šฉํ•˜์—ฌ UnicodeDecodeError ํ•ด๊ฒฐ
import pandas as pd
df = pd.read_csv("web์ฃผ์†Œ or ํŒŒ์ผ๊ฒฝ๋กœ",encoding='cp949' or 'euc-kr')
  1. ์ˆ˜์น˜ํ˜• ๋ณ€์ˆ˜๋ฅผ ๊ฐ€์ง„ column์„ ์ถœ๋ ฅํ•ด๋ณด๊ฑฐ๋ผ
# df.info()๋กœ ๊ฐ ํ–‰์˜ type์ด ๋ฌด์—‡์ธ์ง€ ์šฐ์„  ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.
# ๋ฒ”์ฃผํ˜•์€ type = object
df_num = df.select_dtypes(include=['int','float'])
df_num.columns
  1. ๊ฐ column์˜ ๊ฒฐ์ธก์น˜ ์ˆซ์ž๋ฅผ ํŒŒ์•…ํ•˜๋ผ
df.isnull().sum(axis=0)
# axis=1์ด๋ผ๋ฉด ํ–‰ ๊ธฐ์ค€ sum. ๊ธฐ๋ณธ๊ฐ’์€ axis=0์œผ๋กœ ์„ค์ •๋˜์–ด ์žˆ๋‹ค.
  1. ๊ฐ ์ˆ˜์น˜ํ˜• ๋ณ€์ˆ˜์˜ ๋ถ„ํฌ(์‚ฌ๋ถ„์œ„, ํ‰๊ท , ํ‘œ์ค€ํŽธ์ฐจ, ์ตœ๋Œ€ , ์ตœ์†Œ)๋ฅผ ํ™•์ธํ•˜๋ผ
# ์ผ์ผ์ด ์ˆ˜์น˜ํ˜• ๋ณ€์ˆ˜ column ์‚ด๋ ค์ฃผ์ง€ ์•Š์•„๋„ ํ•œ๋ฒˆ์— ๊ฐ€๋Šฅ
df.describe()
  1. ํ‰๊ท  ์†๋„์˜ 4๋ถ„์œ„์ˆ˜(IQR)๋ฅผ ๊ตฌํ•˜์—ฌ๋ผ
df["ํ‰๊ท  ์†๋„"].quantile(0.75) - df["ํ‰๊ท  ์†๋„"].quantile(0.25)
  1. ์๋ฉด๋™๋ช… Column์˜ ์œ ์ผ๊ฐ’ ๊ฐฏ์ˆ˜๋ฅผ ์ถœ๋ ฅํ•˜๋ผ
len(df["์๋ฉด๋™๋ช…"].unique())
# ์œ„์ฒ˜๋Ÿผ ๋ณต์žกํ•˜๊ธฐ ํ•  ์ˆ˜๋„ ์žˆ์ง€๋งŒ, nunique๋ฅผ ์ด์šฉํ•˜๋ฉด ๋” ๋น ๋ฅด๋‹ค
df["์๋ฉด๋™๋ช…"].nunique() 

Group 2. Filtering & Sorting

  1. quantity column ๊ฐ’์ด 3์ธ ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”์ถœํ•˜์—ฌ ์ฒซ 5ํ–‰์„ ์ถœ๋ ฅํ•˜๋ผ
# df[์กฐ๊ฑด] ์ด๋ผ๊ณ  ์“ฐ๋ฉด ์กฐ๊ฑด์„ ๋งŒ์กฑํ•˜๋Š” ์• ๋“ค๋งŒ ๋”ฐ๋กœ ๋ฝ‘์•„์„œ DataFrame ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค. 
df[df["quantity"] == 3).head()
  1. quantity, item_name ๋‘๊ฐœ์˜ column์œผ๋กœ ๊ตฌ์„ฑ๋œ ์ƒˆ๋กœ์šด DataFrame ์ •์˜ํ•˜๋ผ
df.loc[:,["quantity","item_name"]]
  1. item_price ์ปฌ๋Ÿผ์˜ ๋‹ฌ๋Ÿฌํ‘œ์‹œ ๋ฌธ์ž๋ฅผ ์ œ๊ฑฐํ•˜๊ณ  float ํƒ€์ž…์œผ๋กœ ์ €์žฅํ•˜์—ฌ new_price ์ปฌ๋Ÿผ์— ์ €์žฅํ•˜๋ผ
df["new_price"] = df["item_price"].str[1:].astype('float')
  1. new_price๊ฐ’์ด 9 ์ดํ•˜์ด๊ณ  item_name ๊ฐ’์ด Chicken Salad Bowl ์ธ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์„ ์ถ”์ถœํ•˜๋ผ
df_c = df.loc[(df["item_name"]== 'Chicken Salad Bowl') & (df["new_price"]<=9)]
  1. df์˜ item_name ์ปฌ๋Ÿผ ๊ฐ’์ค‘ Chips ํฌํ•จํ•˜๋Š” ๊ฒฝ์šฐ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ถœ๋ ฅํ•˜๋ผ
# .str.contains ๋ฉ”์„œ๋“œ๋กœ ํŠน์ • ๋ฌธ๊ตฌ ํฌํ•จํ•˜๋Š” ๋ฐ์ดํ„ฐ ์ถœ๋ ฅํ•จ
df_chips = df.loc[df["item_name"].str.contains('Chips')]

Group 4. Apply, Map

  1. Income_Category์˜ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ map ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•˜์—ฌ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ณ€๊ฒฝํ•˜์—ฌ newIncome ์ปฌ๋Ÿผ์— ๋งคํ•‘ํ•˜๋ผ Unknown : N
    Less than $40K : a
    $40K - $60K : b
    $60K - $80K : c
    $80K - $120K : d
    $120K +โ€™ : e
incomedic = { 
    'Unknown' : 'N',
    'Less than $40K' : 'a',
    '$40K - $60K' : 'b',
    '$60K - $80K' : 'c',
    '$80K - $120K' : 'd',
    '$120K +' : 'e'}
df["new_income"] = df.Income_Category.map(lambda x : incomedic[x])
df.head()
  1. Income_Category์˜ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ apply ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•˜์—ฌ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ณ€๊ฒฝํ•˜์—ฌ newIncome ์ปฌ๋Ÿผ์— ๋งคํ•‘ํ•˜๋ผ
def changeCategory(x):
    if x =='Unknown':
        return 'N'
    elif x =='Less than $40K':
        return 'a'
    elif x =='$40K - $60K':   
        return 'b'
    elif x =='$60K - $80K':    
        return 'c'
    elif x =='$80K - $120K':   
        return 'd'
    elif x =='$120K +' :     
        return 'e'

df['newIncome']  =df.Income_Category.apply(changeCategory)
  1. Customer_Age์˜ ๊ฐ’์„ ์ด์šฉํ•˜์—ฌ ๋‚˜์ด ๊ตฌ๊ฐ„์„ AgeState ์ปฌ๋Ÿผ์œผ๋กœ ์ •์˜ํ•˜๋ผ. (0~9 : 0 , 10~19 :10 , 20~29 :20 โ€ฆ ๊ฐ ๊ตฌ๊ฐ„์˜ ๋นˆ๋„์ˆ˜๋ฅผ ์ถœ๋ ฅํ•˜๋ผ
df["Age_State"] = df.Customer_Age.map(lambda x : (x//10) * 10)
df.Age_State.value_counts().sort_index()
  1. Education_Level์˜ ๊ฐ’์ค‘ Graduate๋‹จ์–ด๊ฐ€ ํฌํ•จ๋˜๋Š” ๊ฐ’์€ 1 ๊ทธ๋ ‡์ง€ ์•Š์€ ๊ฒฝ์šฐ์—๋Š” 0์œผ๋กœ ๋ณ€๊ฒฝํ•˜์—ฌ newEduLevel ์ปฌ๋Ÿผ์„ ์ •์˜ํ•˜๊ณ  ๋นˆ๋„์ˆ˜๋ฅผ ์ถœ๋ ฅํ•˜๋ผ
df['newEduLevel'] = df.Education_Level.map(lambda x : 1 if 'Graduate' in x else 0)
Ans = df['newEduLevel'].value_counts()
  1. Marital_Status ์ปฌ๋Ÿผ๊ฐ’์ด Married ์ด๊ณ  Card_Category ์ปฌ๋Ÿผ์˜ ๊ฐ’์ด Platinum์ธ ๊ฒฝ์šฐ 1 ๊ทธ์™ธ์˜ ๊ฒฝ์šฐ์—๋Š” ๋ชจ๋‘ 0์œผ๋กœ ํ•˜๋Š” newState์ปฌ๋Ÿผ์„ ์ •์˜ํ•˜๋ผ. newState์˜ ๊ฐ ๊ฐ’๋“ค์˜ ๋นˆ๋„์ˆ˜๋ฅผ ์ถœ๋ ฅํ•˜๋ผ
# ์—ฌ๊ธฐ์„œ๋Š” x๊ฐ€ data frame์œผ๋กœ ์ธ์‹๋˜๋„๋ก ํ•จ. apply ํ•  ๋•Œ๋„ df๋ ˆ๋ฒจ์—์„œ ์ ์šฉํ•œ๋‹ค.
def check(x):
    if x.Marital_Status =='Married' and x.Card_Category =='Platinum':
        return 1
    else:
        return 0
df['newState'] = df.apply(check,axis=1)
# column ๋ฐฉํ–ฅ ๊ณ„์‚ฐ์ด๋ฏ€๋กœ axis = 1
df['newState'].value_counts()
  1. Gender ์ปฌ๋Ÿผ๊ฐ’ M์ธ ๊ฒฝ์šฐ male F์ธ ๊ฒฝ์šฐ female๋กœ ๊ฐ’์„ ๋ณ€๊ฒฝํ•˜์—ฌ Gender ์ปฌ๋Ÿผ์— ์ƒˆ๋กญ๊ฒŒ ์ •์˜ํ•˜๋ผ. ๊ฐ value์˜ ๋นˆ๋„๋ฅผ ์ถœ๋ ฅํ•˜๋ผ
def change(x):
    if x["Gender"] == 'M':
        return 'Male'
    elif x["Gender"] == 'F':
        return 'Female'
df["Gender"] = df.apply(change,axis=1)
df["Gender"].value_counts()

Group 5. Time Series

65-0. ๊ฐ Column์˜ ๋ฐ์ดํ„ฐ ํƒ€์ž… ํŒŒ์•…ํ•˜๊ธฐ

df=pd.read_csv("ํŒŒ์ผ ๊ฒฝ๋กœ ํ˜น์€ Url.csv")
df.info()
  1. Yr_Mo_Dy์„ ํŒ๋‹ค์Šค์—์„œ ์ธ์‹ํ•  ์ˆ˜ ์žˆ๋Š” datetime64ํƒ€์ž…์œผ๋กœ ๋ณ€๊ฒฝํ•˜๋ผ
df.Yr_Mo_Dy = pd.to_datetime(df.Yr_Mo_Dy)
ํ˜น์€, 
df.Yr_Mo_Dy = df.Yr_Mo_Dy.astype('datetime64')
  1. Yr_Mo_Dy์— ์กด์žฌํ•˜๋Š” ๋…„๋„์˜ ์œ ์ผ๊ฐ’์„ ๋ชจ๋‘ ์ถœ๋ ฅํ•˜๋ผ
# datetime ํ˜•์‹์œผ๋กœ ๋งž์ถ˜ ์ •๋ณด๋“ค์€ dt.year dt.month ๋”ฐ์œ„๋กœ ํ˜ธ์ถœ์ด ๊ฐ€๋Šฅํ•˜๋‹ค!
df.Yr_Mo_Dy.dt.year.unique()

Group 6. Pivot

  1. Indicator์„ ์‚ญ์ œํ•˜๊ณ  First Tooltip ์ปฌ๋Ÿผ์—์„œ ์‹ ๋ขฐ๊ตฌ๊ฐ„์— ํ•ด๋‹นํ•˜๋Š” ํ‘œํ˜„์„ ์ง€์›Œ๋ผ
df2.drop('Indicator',axis = 1)
df2["First Tooltip"] = df2["First Tooltip"].map(lambda x : float(x.split("[")[0]))
  1. 84๋ฒˆ ๋ฌธ์ œ์—์„œ ์ถ”์ถœํ•œ ๋ฐ์ดํ„ฐ๋กœ ์•„๋ž˜์™€ ๊ฐ™์ด ๋‚˜๋ผ์— ๋”ฐ๋ฅธ ๋…„๋„๋ณ„ ์‚ฌ๋ง๋ฅ ์„ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ํ™” ํ•˜๋ผ

Group 7. Merge & Concat

91-0. DataFrame ๋ถˆ๋Ÿฌ์˜ฌ ๋•Œ index column์ด ์ž๋™์ ์œผ๋กœ ์ƒ๊ธฐ๊ฒŒ ๋˜๋Š”๋ฐ(0,1,2...), index_col = 0 ์œผ๋กœ ๋ณด์ • ๊ฐ€๋Šฅ

df=pd.read_csv("ํŒŒ์ผ ๊ฒฝ๋กœ ํ˜น์€ Url.csv",index_col=0) 
  1. df1๊ณผ df2 ๋ฐ์ดํ„ฐ๋ฅผ ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์œผ๋กœ ํ•ฉ์ณ๋ผ
# column์˜ ๊ฐœ์ˆ˜ ๊ฐ™์„ ๊ฒฝ์šฐ ๋’ค์— ์ด์–ด์„œ ๋ถ™์ด๋Š” ์ž‘์—…
pd.concat([df1,df2])

92-3. df3๊ณผ df4 ๋ฐ์ดํ„ฐ๋ฅผ ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์œผ๋กœ ํ•ฉ์ณ๋ผ. concat()

case 1. join = 'inner'
# ๋‘ DataFrame์˜ row์™€ column์ค‘ ๊ฒน์น˜๋Š” column์— ๋Œ€ํ•ด์„œ๋งŒ ํ•ฉ์นจ
pd.concat([df3,df4],join='inner')

case 2. join = 'outer'
# ๋‘ DataFrame์˜ row์™€ column ๋ชจ๋‘ ์‚ด๋ ค์„œ ๊ฒฐ์ธก์น˜๋Š” NaN์œผ๋กœ ์ฑ„์›€
pd.concat([df3,df4],join='outer').fillna(0)

94-95. df5์™€ df6 ๋ฐ์ดํ„ฐ๋ฅผ ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์œผ๋กœ ํ•ฉ์ณ๋ผ. merge()

# concat๊ณผ์˜ ์ฐจ์ด๋Š” 'ํŠน์ • ๊ณตํ†ต์—ด' ๊ธฐ์ค€์œผ๋กœ ํ•ฉ์น˜๋ ค๊ณ  ํ•  ๋•Œ
Algeria์ปฌ๋Ÿผ์„ key๋กœ ํ•˜๊ณ  ๋‘ ๋ฐ์ดํ„ฐ ๋ชจ๋‘ ํฌํ•จํ•˜๋Š” ๋ฐ์ดํ„ฐ๋งŒ ์ถœ๋ ฅํ•˜๋ผ
pd.merge(df5,df6,on='Algeria',how='inner')
profile
์ง๋ฌด์ „ํ™˜ ๋ฌด์ƒˆ. ๋”ด์ƒ๊ฐ ์˜ค์ง€๊ฒŒ ํ•จ. ๊ทผ๋ฐ 1๋…„์ฐจ ์ง์žฅ์ธใ…‹.

0๊ฐœ์˜ ๋Œ“๊ธ€