😄 [Day 7] : FUNDAMENTAL

백건 · January 21, 2022

Data at a Glance! Visualization



  • Learning objectives

- Learn how to draw graphs with the Python libraries Pandas, Matplotlib, and Seaborn.
- Visualize real datasets hands-on, carry out the exploratory data analysis (EDA) that data analysis requires, and draw insights from it.



  • Contents

1. What does it mean to draw a graph in Python?
2. Drawing simple graphs
3. The big four graphs: bar graph, line graph, scatter plot, histogram
4. Visualizing time series data
5. Heatmap

What does it mean to draw a graph in Python?

Prerequisites


  • Pandas, Matplotlib, Seaborn, and similar libraries provide the visualization tools.

Check with pip that Matplotlib and Seaborn are installed; if either is missing, install it with pip install.

pip list | grep matplotlib
pip list | grep seaborn
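
The same check can be done from inside Python. This is a minimal sketch using the standard library's importlib.metadata; the helper name installed_versions is hypothetical and not part of any of the libraries.

```python
from importlib import metadata


def installed_versions(packages=("pandas", "matplotlib", "seaborn")):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


for name, version in installed_versions().items():
    status = version if version is not None else "not installed"
    print(f"{name}: {status}")
```

A package reported as not installed can then be added with pip install followed by its name.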

Overview of drawing a graph


Spread out the drawing paper (the figure), draw the axes, and draw the data inside them.

  • The code below draws a bar graph.
import matplotlib.pyplot as plt
%matplotlib inline

# Graph data
subject = ['English', 'Math', 'Korean', 'Science', 'Computer']
points = [40, 90, 50, 60, 100]

# Draw the axes
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)

# Draw the graph
ax1.bar(subject, points)

# Add labels and a title
plt.xlabel('Subject')
plt.ylabel('Points')
plt.title("Yuna's Test Result")

# Show
plt.savefig('./barplot.png')  # save the graph as an image file
plt.show()                    # display the graph on screen

Drawing a bar graph

Defining the data


Importing

import matplotlib.pyplot as plt   # import the plotting module
%matplotlib inline                
# IPython magic command: render plots inline in the notebook
# Graph data
subject = ['English', 'Math', 'Korean', 'Science', 'Computer']
points = [40, 90, 50, 60, 100]

Drawing the axes

# Draw the axes
fig = plt.figure()           # create the figure (drawing paper) object
ax1 = fig.add_subplot(1,1,1) # add axes to the figure with its add_subplot method


fig = plt.figure()
<Figure size 432x288 with 0 Axes>
fig = plt.figure(figsize=(5,2)) # the figsize argument sets the size of the figure
ax1 = fig.add_subplot(1,1,1) # (nrows, ncols, index)


fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,4)
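
The index in add_subplot(nrows, ncols, index) counts grid cells from 1, left to right and then top to bottom. The sketch below repeats the 2x2 layout above and writes each cell's row and column into its title. The divmod bookkeeping is only there for illustration and is not something matplotlib requires.

```python
import matplotlib.pyplot as plt

nrows, ncols = 2, 2
fig = plt.figure()
positions = {}
for index in (1, 2, 4):                      # same cells as above; cell 3 stays empty
    ax = fig.add_subplot(nrows, ncols, index)
    row, col = divmod(index - 1, ncols)      # 0-based row and column of this cell
    ax.set_title(f"index={index} (row {row}, col {col})")
    positions[index] = (row, col)
fig.tight_layout()
```

Index 4 lands in the bottom-right cell, which is why the layout above leaves the bottom-left cell blank.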

Drawing the graph

# Graph data
subject = ['English', 'Math', 'Korean', 'Science', 'Computer']
points = [40, 90, 50, 60, 100]

# Draw the axes
fig = plt.figure() 
ax1 = fig.add_subplot(1,1,1)

# Draw the graph
ax1.bar(subject,points)
<BarContainer object of 5 artists>


$ Plot test $

fig = plt.figure()
ax1 = fig.add_subplot(3,3,1)
ax2 = fig.add_subplot(3,3,2)
ax3 = fig.add_subplot(3,3,3)
ax4 = fig.add_subplot(3,3,5)


Adding graph elements


label, title


To add an x label, a y label, and a title, use the xlabel(), ylabel(), and title() methods.

plt.xlabel('Subject')  
plt.ylabel('Points')  
plt.title("Yuna's Test Result") 
Text(0.5, 1.0, "Yuna's Test Result")


# Graph data
subject = ['English', 'Math', 'Korean', 'Science', 'Computer']
points = [40, 90, 50, 60, 100]

# Draw the axes
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)

# Draw the graph
ax1.bar(subject, points)

# Add labels and a title
plt.xlabel('Subject')
plt.ylabel('Points')
plt.title("Yuna's Test Result")
Text(0.5, 1.0, "Yuna's Test Result")
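
plt.xlabel(), plt.ylabel(), and plt.title() act on the current subplot. When a figure holds several subplots it is usually clearer to label each one through its own Axes object. This sketch reuses the test-score data with two subplots; the horizontal-bar panel and its title are added only for illustration.

```python
import matplotlib.pyplot as plt

subject = ['English', 'Math', 'Korean', 'Science', 'Computer']
points = [40, 90, 50, 60, 100]

fig, (ax_bar, ax_barh) = plt.subplots(1, 2, figsize=(10, 4))

# Vertical bars, labelled through the Axes object instead of plt
ax_bar.bar(subject, points)
ax_bar.set_xlabel('Subject')
ax_bar.set_ylabel('Points')
ax_bar.set_title("Yuna's Test Result")

# Horizontal bars: the roles of the two axes are swapped
ax_barh.barh(subject, points)
ax_barh.set_xlabel('Points')
ax_barh.set_ylabel('Subject')
ax_barh.set_title('Same data, horizontal bars')

fig.tight_layout()
```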


Drawing a line graph

Defining the data


Historical Amazon stock price data
AMZN

from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
import os

# Graph data
csv_path = os.getenv("HOME") + "/aiffel/data_visualization/data/AMZN.csv"
data = pd.read_csv(csv_path ,index_col=0, parse_dates=True)
price = data['Close']

# Draw the axes and set the axis ranges
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
price.plot(ax=ax, style='black')
plt.ylim([1600,2200])
plt.xlim(['2019-05-01','2020-03-01'])

# Add annotations
important_data = [(datetime(2019, 6, 3), "Low Price"),(datetime(2020, 2, 19), "Peak Price")]
for d, label in important_data:
    ax.annotate(label, xy=(d, price.asof(d)+10), # point (x, y) to annotate
                xytext=(d,price.asof(d)+100), # position (x, y) of the annotation text
                arrowprops=dict(facecolor='red')) # add an arrow and set its color

# Add a grid and a title
plt.grid()
ax.set_title('StockPrice')

# Show
plt.show()


Using Pandas Series data

A Pandas Series is well suited to drawing line graphs.
price = data['Close'] is exactly such a Pandas Series.

price.plot(ax=ax, style='black') uses the plot method of Pandas
while drawing into ax, the subplot defined with matplotlib.

Setting the axis ranges

plt.xlim() and plt.ylim() set a suitable range for the x and y axes.

Annotations

To draw extra annotations such as text or arrows inside the graph, use the annotate() method.

Grid

The grid() method adds a grid to the graph.
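
To try these pieces without the AMZN.csv file, the sketch below builds a synthetic price series and applies the same steps: a Series plotted into an existing subplot, axis ranges, annotations, and a grid. The random-walk data is entirely made up (an assumption for illustration), and the annotated days are simply the minimum and maximum of that series.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for a closing-price series: a daily random walk
rng = np.random.default_rng(0)
dates = pd.date_range('2019-05-01', '2020-03-01', freq='D')
price = pd.Series(1900 + rng.normal(0, 10, len(dates)).cumsum(),
                  index=dates, name='Close')

fig, ax = plt.subplots()
price.plot(ax=ax, style='k-')                       # pandas draws into the matplotlib subplot

# Axis ranges
ax.set_xlim(pd.Timestamp('2019-05-01'), pd.Timestamp('2020-03-01'))
ax.set_ylim(price.min() - 100, price.max() + 200)

# Annotate the lowest and highest points of the synthetic series
low_day, peak_day = price.idxmin(), price.idxmax()
for day, label in [(low_day, 'Low Price'), (peak_day, 'Peak Price')]:
    y = price.asof(day)                             # value at (or just before) that day
    ax.annotate(label, xy=(day, y + 10), xytext=(day, y + 100),
                arrowprops=dict(facecolor='red'))

ax.grid()
ax.set_title('StockPrice (synthetic)')
```

Here the Axes methods set_xlim, set_ylim, and grid are used instead of the plt functions; both act on the same subplot.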

plot usage in detail

Drawing graphs with plt.plot()


The basic flow is to create a figure() object, create a subplot with add_subplot(), and then draw the plot.

  • When you draw with the plt.plot() command,
    matplotlib draws on the most recent figure object and its subplot.
    If there is no subplot yet, it creates one.
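
This behaviour can be checked with plt.gcf() and plt.gca(), which return the current figure and the current subplot. A minimal sketch, starting from a state with no open figures:

```python
import numpy as np
import matplotlib.pyplot as plt

plt.close('all')                        # start with no open figures
x = np.linspace(0, 10, 100)

lines = plt.plot(x, np.sin(x), 'o')     # no figure yet: a figure and one subplot are created
fig = plt.gcf()                         # the current (most recent) figure
ax = plt.gca()                          # the current subplot of that figure

plt.plot(x, np.cos(x), '--', color='black')   # drawn on the same current subplot

# number of figures, subplots in the figure, and lines in the subplot
print(len(plt.get_fignums()), len(fig.axes), len(ax.lines))  # → 1 1 2
```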

plt.plot() takes the x data, the y data, a marker option, a color, and so on as arguments.

import numpy as np
x = np.linspace(0, 10, 100) # 100 evenly spaced numbers from 0 to 10
plt.plot(x, np.sin(x),'o')
plt.plot(x, np.cos(x),'--', color='black') 
plt.show()


Subplots can also be added with plt.subplot.

x = np.linspace(0, 10, 100) 

plt.subplot(2,1,1)
plt.plot(x, np.sin(x), 'o', color='orange')

plt.subplot(2,1,2)
plt.plot(x, np.cos(x), 'orange') 
plt.show()


linestyle and marker options

x = np.linspace(0, 10, 100) 

plt.plot(x, x + 0, linestyle='solid') 
plt.plot(x, x + 1, linestyle='dashed') 
plt.plot(x, x + 2, linestyle='dashdot') 
plt.plot(x, x + 3, linestyle='dotted')
plt.plot(x, x + 0, '-g') # solid green 
plt.plot(x, x + 1, '--c') # dashed cyan 
plt.plot(x, x + 2, '-.k') # dashdot black 
plt.plot(x, x + 3, ':r'); # dotted red
plt.plot(x, x + 4, linestyle='-') # solid 
plt.plot(x, x + 5, linestyle='--') # dashed 
plt.plot(x, x + 6, linestyle='-.') # dashdot 
plt.plot(x, x + 7, linestyle=':'); # dotted


Drawing graphs with Pandas


  • Use the plot() method

Arguments of the Pandas plot method

  • label: the legend name for the graph.
  • ax: the matplotlib subplot object to draw on.
  • style: a style string such as 'ko--' that is passed to matplotlib
  • alpha: transparency, from 0 (fully transparent) to 1 (fully opaque)
  • kind: the type of graph: line, bar, barh, kde
  • logy: logarithmic scale on the y axis
  • use_index: whether to use the object's index as the tick labels
  • rot: rotation of the tick labels (0 to 360)
  • xticks, yticks: values to use for the x-axis and y-axis ticks
  • xlim, ylim: limits of the x axis and y axis
  • grid: whether to show the axis grid

plot method arguments when the pandas data is a DataFrame

  • subplots: draw each DataFrame column in its own subplot.
  • sharex: if subplots=True, share the same x axis and link the ticks and limits.
  • sharey: if subplots=True, share the same y axis.
  • figsize: size of the graph, given as a tuple
  • title: title of the graph, given as a string
  • sort_columns: draw the columns in alphabetical order (no longer available in recent pandas releases).
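
Several of these arguments can be combined in one call. The sketch below uses random data (an assumption for illustration) and draws each column in its own subplot with a shared x axis.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.random((6, 4)).cumsum(axis=0), columns=list('ABCD'))

axes = df_demo.plot(
    kind='line',
    subplots=True,        # one subplot per column
    sharex=True,          # the subplots share the x axis
    figsize=(6, 8),
    title='Cumulative sums per column',
    style='o--',          # circle markers joined by dashed lines
    alpha=0.7,
    grid=True,
    rot=45,               # rotate the tick labels
)
```

With subplots=True the call returns an array of subplot objects, one per column, so each can still be adjusted individually afterwards.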

Bar graph: kind -> bar

fig, axes = plt.subplots(2, 1)
data = pd.Series(np.random.rand(5), index=list('abcde'))
data.plot(kind='bar', ax=axes[0], color='blue', alpha=1)
data.plot(kind='barh', ax=axes[1], color='red', alpha=0.3)
<AxesSubplot:>


How to draw a line graph

df = pd.DataFrame(np.random.rand(6,4), columns=pd.Index(['A','B','C','D']))
df.plot(kind='line')
<AxesSubplot:>


Wrapping up

Summary


  1. fig = plt.figure(): declare a figure object to "spread out the drawing paper".
  2. ax1 = fig.add_subplot(1,1,1): draw the axes.
  3. ax1.bar(x, y): choose the method for the kind of graph to draw inside the axes, then pass the data as arguments.
  4. Add the graph title, axis labels, and so on with plt methods such as title, grid, xlabel, and ylabel.
  5. Save the result with the plt.savefig method.

Graphs

Preparing the data


Loading the data

Use seaborn's load_dataset() method.
Running it automatically downloads the dataset and stores it in a seaborn-data folder in the home directory.

import pandas as pd
import seaborn as sns

tips = sns.load_dataset("tips")

Let's load the Tips data
tips.csv

Exploring the data (EDA)

Check the structure of the data with a Pandas DataFrame.

df = pd.DataFrame(tips)
df.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
df.shape
(244, 7)
df.describe()
total_bill tip size
count 244.000000 244.000000 244.000000
mean 19.785943 2.998279 2.569672
std 8.902412 1.383638 0.951100
min 3.070000 1.000000 1.000000
25% 13.347500 2.000000 2.000000
50% 17.795000 2.900000 2.000000
75% 24.127500 3.562500 3.000000
max 50.810000 10.000000 6.000000
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB

There are no missing values, so no missing-value handling is needed.

Among the variables, sex, smoker, day, and time are categorical.

tip, total_bill, and size are numeric; size (the number of people at the table) can also be treated as categorical.
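
The split into numeric and categorical columns can also be read off the dtypes in code. This sketch rebuilds only the five rows shown by df.head() above so it runs without downloading anything. The size_cat column name is hypothetical, and treating size as a category is a choice made for the analysis, not something the dataset dictates.

```python
import pandas as pd

# First five rows of tips, copied from the df.head() output above
sample = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'tip': [1.01, 1.66, 3.50, 3.31, 3.61],
    'sex': ['Female', 'Male', 'Male', 'Male', 'Female'],
    'smoker': ['No'] * 5,
    'day': ['Sun'] * 5,
    'time': ['Dinner'] * 5,
    'size': [2, 3, 3, 2, 4],
})
for col in ['sex', 'smoker', 'day', 'time']:
    sample[col] = sample[col].astype('category')   # same dtypes as df.info() reports

numeric_cols = sample.select_dtypes(include='number').columns.tolist()
categorical_cols = sample.select_dtypes(include='category').columns.tolist()

# Party size as a category, kept in a separate (hypothetical) column
sample['size_cat'] = sample['size'].astype('category')

print(numeric_cols)
print(categorical_cols)
print(sample['size_cat'].cat.categories.tolist())
```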

print(df['sex'].value_counts())
print("===========================")


print(df['time'].value_counts())
print("===========================")


print(df['smoker'].value_counts())
print("===========================")


print(df['day'].value_counts())
print("===========================")


print(df['size'].value_counts())
print("===========================")
Male      157
Female     87
Name: sex, dtype: int64
===========================
Dinner    176
Lunch      68
Name: time, dtype: int64
===========================
No     151
Yes     93
Name: smoker, dtype: int64
===========================
Sat     87
Sun     76
Thur    62
Fri     19
Name: day, dtype: int64
===========================
2    156
3     38
4     37
5      5
1      4
6      4
Name: size, dtype: int64
===========================

Categorical data


Bar graph

Using Pandas and Matplotlib

To pass the data to matplotlib as arguments,
the pandas DataFrame is not handed over as it is;
split the data into a Series or list for x
and a list for y.

# Check the first 5 rows of df.
df.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
# Express the tip column as an average per sex.

# Use the pandas groupby method.

grouped = df['tip'].groupby(df['sex']) # group the tip column by sex

# -> the grouped object holds information about each sex group (sum, mean, count, etc.)
# Check the mean and the count
grouped.mean() # average tip per sex
sex
Male      3.089618
Female    2.833448
Name: tip, dtype: float64
grouped.size() # number of records (tips) per sex
sex
Male      157
Female     87
Name: tip, dtype: int64
# Draw the average tip amount per sex as a bar graph
import numpy as np
sex = dict(grouped.mean()) # convert the means to a dictionary
sex
{'Male': 3.0896178343949043, 'Female': 2.833448275862069}
x = list(sex.keys())  # x axis as a list
x
['Male', 'Female']
y = list(sex.values()) # y axis as a list
y
[3.0896178343949043, 2.833448275862069]
import matplotlib.pyplot as plt # import matplotlib to draw the bar graph

plt.bar(x = x, height = y) # x positions from x, bar heights from y
plt.ylabel('tip[$]')       # y-axis label: tip
plt.title('Tip by Sex')    # title: Tip by Sex
Text(0.5, 1.0, 'Tip by Sex')
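
The dictionary step is not strictly necessary: the index and values of the grouped means can be passed to matplotlib directly, or the Series can draw itself with plot(kind='bar'). A sketch on the five rows shown by df.head(), so the averages differ from the full-dataset values above:

```python
import pandas as pd
import matplotlib.pyplot as plt

# tip and sex for the first five rows of tips (from the df.head() output)
sample = pd.DataFrame({
    'tip': [1.01, 1.66, 3.50, 3.31, 3.61],
    'sex': ['Female', 'Male', 'Male', 'Male', 'Female'],
})
means = sample.groupby('sex')['tip'].mean()

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))
ax_left.bar(means.index, means.values)        # index -> x labels, values -> bar heights
means.plot(kind='bar', ax=ax_right, rot=0)    # pandas equivalent on the second subplot
for ax in (ax_left, ax_right):
    ax.set_ylabel('tip[$]')
    ax.set_title('Tip by Sex')
```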


A simpler way with Seaborn and Matplotlib

Pass df to sns.barplot and name the columns to plot.

sns.barplot(data=df, x='sex', y='tip')
<AxesSubplot:xlabel='sex', ylabel='tip'>


# Used together with Matplotlib: options such as figsize and title can be applied to the graph
plt.figure(figsize=(10,6)) # set the size of the figure
sns.barplot(data=df, x='sex', y='tip')
plt.ylim(0, 4) # set the range of y values
plt.title('Tip by sex') # set the graph title
Text(0.5, 1.0, 'Tip by sex')


# Tips by day of the week
plt.figure(figsize=(10,6))
sns.barplot(data=df, x='day', y='tip')
plt.ylim(0, 4)
plt.title('Tip by day')
Text(0.5, 1.0, 'Tip by day')


# Using subplots; the violin plot is another good way to show categorical data
# palette option: sets the colors

fig = plt.figure(figsize=(10,7))

ax1 = fig.add_subplot(2,2,1)
sns.barplot(data=df, x='day', y='tip',palette="ch:.25")

ax2 = fig.add_subplot(2,2,2)
sns.barplot(data=df, x='sex', y='tip')

ax3 = fig.add_subplot(2,2,4)
sns.violinplot(data=df, x='sex', y='tip')

ax4 = fig.add_subplot(2,2,3)
sns.violinplot(data=df, x='day', y='tip',palette="ch:.25")
<AxesSubplot:xlabel='day', ylabel='tip'>

# Using catplot
sns.catplot(x="day", y="tip", jitter=False, data=tips)
<seaborn.axisgrid.FacetGrid at 0x7fa90ef12190>


Numeric data

Scatter plots and line graphs are good choices.
Visualize the tip data against the total bill, total_bill.

Scatter plot

# Scatter plot of tip against total_bill (without hue, the palette argument has no visible effect)
sns.scatterplot(data=df , x='total_bill', y='tip', palette="ch:r=-.2,d=.3_r")
<AxesSubplot:xlabel='total_bill', ylabel='tip'>


# Pass 'day' to the hue argument to visualize the relationship between tip and total_bill by day of the week
sns.scatterplot(data=df , x='total_bill', y='tip', hue='day')
<AxesSubplot:xlabel='total_bill', ylabel='tip'>


Line graph

  • Basics of plot
  • Generate the data with numpy
# np.random.randn draws random numbers from the standard normal distribution.
# cumsum() computes the cumulative sum.
plt.plot(np.random.randn(50).cumsum())
[<matplotlib.lines.Line2D at 0x7fa90c127130>]


x = np.linspace(0, 10, 100) 
plt.plot(x, np.sin(x), 'o')
plt.plot(x, np.cos(x)) 
plt.show()


# Using Seaborn
sns.lineplot(x=x, y=np.sin(x))
sns.lineplot(x=x, y=np.cos(x))
<AxesSubplot:>


Histogram

Histogram concepts

↔ Horizontal axis
Class: an interval of the variable, called a bin (or bucket)

↕ Vertical axis
Frequency: the number of values that fall in each class
Total count: n
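
These three terms map directly onto what np.histogram returns. A minimal sketch with randomly generated values (mean 100 and standard deviation 15 are chosen here only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
values = 100 + 15 * rng.standard_normal(10000)    # n = 10000 values

counts, edges = np.histogram(values, bins=50)     # frequencies and bin boundaries
print(len(counts), len(edges), counts.sum())      # → 50 51 10000

# With density=True the bar areas, not the bar heights, add up to 1
density, _ = np.histogram(values, bins=50, density=True)
area = (density * np.diff(edges)).sum()
```

Fifty bins need fifty-one edges, and the frequencies add up to the total count n.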

# x1 follows a normal distribution with mean 100 and standard deviation 15.
# x2 follows a normal distribution with mean 130 and standard deviation 15.
# The frequencies are shown in 50 bins, as counts rather than as probability density.

# Graph data
mu1, mu2, sigma = 100, 130, 15
x1 = mu1 + sigma*np.random.randn(10000)
x2 = mu2 + sigma*np.random.randn(10000)

# Draw the axes
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)

# Draw the graph
patches = ax1.hist(x1, bins=50, density=False) # bins=50 splits the x values into 50 intervals
patches = ax1.hist(x2, bins=50, density=False, alpha=0.5)
ax1.xaxis.set_ticks_position('bottom') # show the x-axis ticks at the bottom
ax1.yaxis.set_ticks_position('left') # show the y-axis ticks on the left

# Add labels and a title
plt.xlabel('Bins')
plt.ylabel('Number of Values in Bin')
ax1.set_title('Two Frequency Distributions')

# Show
plt.show()


Checking the tips data

# Histograms of total_bill and tip in tips
sns.histplot(df['total_bill'], label = "total_bill")
sns.histplot(df['tip'], label = "tip").legend()# legend() displays the labels
<matplotlib.legend.Legend at 0x7fa90be91700>


# Ratio of the tip to the bill amount
df['tip_pct'] = df['tip'] / df['total_bill']
df['tip_pct'].hist(bins=50)
<AxesSubplot:>


# Probability density graph with kind='kde'
df['tip_pct'].plot(kind='kde')
<AxesSubplot:ylabel='Density'>


Visualizing time series data

Loading the data

flight.csv

csv_path = os.getenv("HOME") + "/aiffel/data_visualization/data/flights.csv"
data = pd.read_csv(csv_path)
flights = pd.DataFrame(data)
flights
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121
... ... ... ...
139 1960 August 606
140 1960 September 508
141 1960 October 461
142 1960 November 390
143 1960 December 432

144 rows × 3 columns

Drawing the graphs

sns.barplot(data=flights, x='year', y='passengers')
<AxesSubplot:xlabel='year', ylabel='passengers'>


sns.pointplot(data=flights, x='year', y='passengers')
<AxesSubplot:xlabel='year', ylabel='passengers'>

sns.lineplot(data=flights, x='year', y='passengers')
<AxesSubplot:xlabel='year', ylabel='passengers'>


# Assign 'month' to the hue argument to split the lines by month
sns.lineplot(data=flights, x='year', y='passengers', hue='month', palette='ch:.50')
plt.legend(bbox_to_anchor=(1.03, 1), loc=2) # place the legend outside the graph
<matplotlib.legend.Legend at 0x7fa90bcc53d0>


Histogram

sns.histplot(flights['passengers'])
<AxesSubplot:xlabel='passengers', ylabel='Count'>


Heatmap

  • A heat map shows large amounts of data and phenomena as colors according to their values, usually in two dimensions.
  • The data sometimes needs to be pivoted first.
  • Use the pivot() method of the pandas DataFrame.
# Pivot the passenger counts in flights (DataFrame) by year and month
pivot = flights.pivot(index='year', columns='month', values='passengers')
pivot
month April August December February January July June March May November October September
year
1949 129 148 118 118 112 148 135 132 121 104 119 136
1950 135 170 140 126 115 170 149 141 125 114 133 158
1951 163 199 166 150 145 199 178 178 172 146 162 184
1952 181 242 194 180 171 230 218 193 183 172 191 209
1953 235 272 201 196 196 264 243 236 229 180 211 237
1954 227 293 229 188 204 302 264 235 234 203 229 259
1955 269 347 278 233 242 364 315 267 270 237 274 312
1956 313 405 306 277 284 413 374 317 318 271 306 355
1957 348 467 336 301 315 465 422 356 355 305 347 404
1958 348 505 337 318 340 491 435 362 363 310 359 404
1959 396 559 405 342 360 548 472 406 420 362 407 463
1960 461 606 432 391 417 622 535 419 472 390 461 508
sns.heatmap(pivot)
<AxesSubplot:xlabel='month', ylabel='year'>
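
Because month is read from the CSV as plain text, the pivoted columns come out in alphabetical order (April, August, ...), and the heat map inherits that order. One possible fix is to reindex the columns into calendar order before plotting. The sketch uses ten values copied from the table above; the names flights_small and calendar_order are hypothetical.

```python
import pandas as pd

# A few rows in the same long format as flights (values copied from the table above)
flights_small = pd.DataFrame({
    'year': [1949] * 5 + [1960] * 5,
    'month': ['January', 'February', 'March', 'April', 'May'] * 2,
    'passengers': [112, 118, 132, 129, 121, 417, 391, 419, 461, 472],
})

pivot_small = flights_small.pivot(index='year', columns='month', values='passengers')
print(list(pivot_small.columns))      # alphabetical order

calendar_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
                  'August', 'September', 'October', 'November', 'December']
present = [m for m in calendar_order if m in pivot_small.columns]
pivot_ordered = pivot_small.reindex(columns=present)
print(list(pivot_ordered.columns))    # calendar order
```

The reordered table can be passed to sns.heatmap in the same way as pivot.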


# Add options here
sns.heatmap(pivot, linewidths=.2, annot=True, fmt="d")
<AxesSubplot:xlabel='month', ylabel='year'>


sns.heatmap(pivot, cmap="YlGnBu")
<AxesSubplot:xlabel='month', ylabel='year'>

