Ch4 Model Evaluation 27-35 (Machine Learning 8-9)

김민지 · May 21, 2023

Part 09. Machine Learning

The concept of model evaluation

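A regression model can be evaluated directly from the error between its predictions and the actual values.
A classification model has a wider range of evaluation metrics (accuracy, confusion matrix, precision, recall, ...).
For a binary classifier (one that predicts 0 or 1), every prediction falls into one of four outcomes (TP, FN, TN, FP).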

TP (True Positive) : an actual Positive correctly predicted as Positive
FN (False Negative) : an actual Positive incorrectly predicted as Negative
TN (True Negative) : an actual Negative correctly predicted as Negative
FP (False Positive) : an actual Negative incorrectly predicted as Positive
-> Positive = 1, Negative = 0
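
As a minimal sketch of the four outcomes, the counts can be tallied by hand from two label arrays. The arrays `y_true` and `y_hat` below are made-up illustration data, not the wine dataset used later in this post.

```python
import numpy as np

# Hypothetical labels for illustration only (1 = Positive, 0 = Negative)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_hat = np.array([1, 0, 1, 0, 1, 0, 0, 1])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # actual 1, predicted 1
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # actual 1, predicted 0
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # actual 0, predicted 0
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # actual 0, predicted 1

print('TP, FN, TN, FP =', tp, fn, tn, fp)
```

Every sample lands in exactly one of the four cells, so the four counts always add up to the number of samples.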

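<Model evaluation metrics>
1) Accuracy : of all samples, the fraction predicted correctly
-> (TP + TN) / (TP + TN + FP + FN)
2) Precision : of the samples predicted positive (1), the fraction that are actually positive (1)
-> TP / (TP + FP)
3) Recall : of the samples that are actually positive (1), the fraction predicted positive (1)
-> TP / (TP + FN)
4) Fall-Out : of the samples that are actually negative (0), the fraction incorrectly predicted positive (1)
-> FP / (FP + TN)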
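
The four formulas can be checked with plain arithmetic. The counts below are hypothetical numbers chosen only to make the fractions easy to read; they do not come from any model in this post.

```python
# Hypothetical confusion-matrix counts, for illustration only
tp, fn, tn, fp = 30, 10, 50, 10

accuracy = (tp + tn) / (tp + fn + tn + fp)  # correct predictions / all samples
precision = tp / (tp + fp)                  # correct among predicted positives
recall = tp / (tp + fn)                     # found among actual positives
fall_out = fp / (fp + tn)                   # false alarms among actual negatives

print('Accuracy :', accuracy)
print('Precision:', precision)
print('Recall   :', recall)
print('Fall-Out :', round(fall_out, 4))
```

Note that Precision and Recall share the numerator TP and differ only in the denominator: predicted positives for Precision, actual positives for Recall.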

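Recall matters when missing an actual positive is costly, for example diagnosing a cancer patient (1) as not having cancer (0).
Precision matters when a false alarm is costly, for example classifying an important email as spam.
A classification model returns the proportion (probability) with which a sample belongs to each class.
-> So far the threshold on that probability has been 0.5: above it the prediction is 1, below it 0.
Try changing the threshold and watching how the evaluation metrics respond.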
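
A minimal sketch of that experiment, using NumPy only. The labels `y_true`, the probabilities `proba`, and the helper `precision_recall_at` are all hypothetical names and values made up for this illustration; with a real model, `proba` would come from something like `predict_proba(X_test)[:, 1]`.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of class 1
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
proba = np.array([0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.2, 0.1])

def precision_recall_at(threshold):
    """Binarize the probabilities at `threshold` and return (precision, recall)."""
    y_hat = (proba > threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_hat == 1))
    fp = np.sum((y_true == 0) & (y_hat == 1))
    fn = np.sum((y_true == 1) & (y_hat == 0))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn)
    return float(precision), float(recall)

for t in (0.3, 0.5, 0.8):
    p, r = precision_recall_at(t)
    print(f'threshold={t}: precision={p:.3f}, recall={r:.3f}')
```

On this toy data, lowering the threshold to 0.3 catches every positive (recall 1.0) at the cost of more false alarms, while raising it to 0.8 removes the false alarms (precision 1.0) but misses most positives.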

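<Summary>
1) Recall is the fraction of actually positive samples that were predicted positive.
2) Precision is the fraction of samples predicted positive that are actually positive.
3) When an actual positive must not be judged negative, Recall is what matters, and the threshold should be lowered, for example to 0.3 or 0.4 (ex: cancer screening).
4) When an actual negative must not be judged positive, Precision is what matters, and the threshold should be raised, for example to 0.8 or 0.9 (ex: spam mail).
-> However, Recall and Precision trade off against each other, so the threshold should not be pushed to an extreme just to maximize one of them.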

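F1-Score : a metric that combines Recall and Precision (their harmonic mean)
-> 2 * Precision * Recall / (Precision + Recall)
-> It is high only when Recall and Precision are both high, not when one is high at the expense of the other.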
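
To see why F1 rewards balance, here is a small sketch that computes it as the harmonic mean of Precision and Recall. The function name `f1` and the input values are illustrative assumptions, not taken from a model in this post.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1(0.7, 0.7)  # both moderately high
lopsided = f1(1.0, 0.4)  # same arithmetic mean (0.7), but unbalanced

print('balanced:', round(balanced, 4))
print('lopsided:', round(lopsided, 4))
```

Both pairs average to 0.7 arithmetically, but the unbalanced pair gets a noticeably lower F1, which is the behavior described above.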

ROC and AUC

1) ROC curve

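A plot of how TPR (True Positive Rate) changes as FPR (False Positive Rate) changes
FPR goes on the x-axis and TPR on the y-axis
TPR is Recall, and FPR is Fall-out
The closer the curve is to the diagonal line from (0, 0) to (1, 1), the worse the model is judged to be
The more the curve hugs the y-axis and the top of the plot (the upper-left corner), the better the model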

2) AUC

The area under the ROC curve
In general, the closer to 1 the better
The area under the straight line of slope 1 is 0.5 -> a useful model should have an AUC above 0.5
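
As a rough sketch of what "area under the curve" means, the area can be approximated with the trapezoidal rule over the ROC points. The `fpr` and `tpr` arrays below are hypothetical points invented for illustration; in practice they would come from `roc_curve`.

```python
import numpy as np

# Hypothetical ROC points, sorted by FPR (illustration only)
fpr = np.array([0.0, 0.1, 0.4, 1.0])
tpr = np.array([0.0, 0.6, 0.9, 1.0])

# Trapezoidal rule: for each segment, width * average height
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# The diagonal from (0, 0) to (1, 1), i.e. the slope-1 reference line
diag = np.array([0.0, 1.0])
auc_diag = float(np.sum(np.diff(diag) * (diag[1:] + diag[:-1]) / 2))

print('AUC of the example curve:', round(auc, 3))
print('AUC of the diagonal     :', auc_diag)
```

The example curve bows toward the upper-left corner, so its area is well above the 0.5 of the diagonal.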

3) Drawing an ROC curve

import pandas as pd

red_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial' + \
                                        '/master/dataset/winequality-red.csv'
white_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial' + \
                                        '/master/dataset/winequality-white.csv'

red_wine = pd.read_csv(red_url, sep=';')
white_wine = pd.read_csv(white_url, sep=';')

red_wine['color'] = 1.
white_wine['color'] = 0.

wine = pd.concat([red_wine, white_wine])
wine['taste'] = [1. if grade>5 else 0. for grade in wine['quality']]

X = wine.drop(['taste', 'quality'], axis=1)
y = wine['taste']

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

wine_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
wine_tree.fit(X_train, y_train)

y_pred_tr = wine_tree.predict(X_train)
y_pred_test = wine_tree.predict(X_test)

print('Train Acc : ', accuracy_score(y_train, y_pred_tr))
print('Test Acc : ', accuracy_score(y_test, y_pred_test))

Checking the evaluation metrics

from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score
from sklearn.metrics import roc_auc_score, roc_curve

print('Accuracy : ', accuracy_score(y_test, y_pred_test))
print('Recall : ', recall_score(y_test, y_pred_test))
print('Precision : ', precision_score(y_test, y_pred_test))
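# AUC is computed from the predicted probability of class 1 rather than from hard 0/1 labels
print('AUC Score : ', roc_auc_score(y_test, wine_tree.predict_proba(X_test)[:, 1]))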
print('F1 Score : ', f1_score(y_test, y_pred_test))

Plotting the ROC curve

import matplotlib.pyplot as plt
%matplotlib inline

pred_proba = wine_tree.predict_proba(X_test)[:, 1] # column index 1 of every row (the probability of class 1)
fpr, tpr, thresholds = roc_curve(y_test, pred_proba)

plt.figure(figsize=(10,8))
plt.plot([0,1], [0,1])  # straight line from (0,0) to (1,1), used as a reference line
plt.plot(fpr, tpr)
plt.grid()
plt.show()

Math basics

Polynomial functions

import numpy as np
import matplotlib.pyplot as plt


x = np.linspace(-3, 2, 100)  # 100 evenly spaced values from -3 to 2
y = 3 * x**2 + 2

plt.figure(figsize=(12,8))
plt.plot(x, y)
plt.grid()
plt.xlabel('$x$')  # text between $ signs is rendered as math (italic, etc.)
plt.ylabel('$3x^2 +2$')
plt.show()

x = np.linspace(-5, 5, 100)
y1 = 3*x**2 + 2
y2 = 3*(x+1)**2 + 2

plt.figure(figsize=(12,8))
plt.plot(x, y1, lw=2, ls='dashed', label='$y=3x^2 +2$')
plt.plot(x, y2, label='$y=3(x+1)^2 +2$')
plt.legend(fontsize=15)

plt.xlabel('$x$', fontsize=25)
plt.ylabel('$y$', fontsize=25)
plt.show()

-> Replacing x with x+1 moves the graph by -1 along the x-axis (a horizontal translation)
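
The shift can also be confirmed numerically by locating the minimum of each parabola on a grid. This is only a sketch; the grid size and the helper name `f` are choices made for this example.

```python
import numpy as np

def f(x):
    return 3 * x**2 + 2

# Step of 0.01, so both x = 0 and x = -1 lie on the grid
x = np.linspace(-5, 5, 1001)

x_min_original = x[np.argmin(f(x))]      # vertex of y = 3x^2 + 2
x_min_shifted = x[np.argmin(f(x + 1))]   # vertex of y = 3(x+1)^2 + 2

print('original vertex at x =', round(float(x_min_original), 2))
print('shifted vertex at x  =', round(float(x_min_shifted), 2))
```

The vertex moves from x = 0 to x = -1, while the minimum value of 2 stays the same.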

Exponential functions

x = np.linspace(-2, 2, 100)

a11, a12, a13 = 2, 3, 4
y11, y12, y13 = a11**x, a12**x, a13**x

a21, a22, a23 = 1/2, 1/3, 1/4
y21, y22, y23 = a21**x, a22**x, a23**x

fig, ax = plt.subplots(1, 2, figsize=(12, 6))

ax[0].plot(x, y11, color='k', label='$2^x$')
ax[0].plot(x, y12, '--', color='k', label='$3^x$')
ax[0].plot(x, y13, ':', color='k', label='$4^x$')
ax[0].legend(fontsize=20)


ax[1].plot(x, y21, color='k', label='$(1/2)^x$')
ax[1].plot(x, y22, '--', color='k', label='$(1/3)^x$')
ax[1].plot(x, y23, ':', color='k', label='$(1/4)^x$')
ax[1].legend(fontsize=20)

plt.show()

-> Replacing x with -x mirrors the graph left to right (a reflection across the y-axis), since (1/a)^x = a^(-x)
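
A quick numerical check of that symmetry, sketched with the same grid as the plot. The variable names `same_function` and `mirrored` are just labels for this example.

```python
import numpy as np

x = np.linspace(-2, 2, 100)

# (1/2)^x is the same function as 2^(-x)
same_function = bool(np.allclose((1 / 2) ** x, 2.0 ** (-x)))

# Reversing the values of 2^x on a symmetric grid reproduces (1/2)^x
mirrored = bool(np.allclose((2.0 ** x)[::-1], (1 / 2) ** x))

print(same_function, mirrored)
```

Both checks pass because the grid is symmetric around 0, so reversing it is the same as negating x.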

Exponential growth

x = np.linspace(0, 10)

plt.figure(figsize=(6,6))
plt.plot(x, x**2, '--', color='k', label='$x^2$')
plt.plot(x, 2**x, color='k', label='$2^x$')
plt.legend()  # show the labels defined above
plt.show()
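
To put numbers on the gap between the two curves, here is a small sketch that tabulates x^2 and 2^x at a few integer points. The sample points in `xs` are arbitrary choices for illustration.

```python
xs = [1, 2, 4, 10, 20, 30]

poly = [x**2 for x in xs]   # polynomial growth
expo = [2**x for x in xs]   # exponential growth
ratio = [e / p for e, p in zip(expo, poly)]

for x, p, e, r in zip(xs, poly, expo, ratio):
    print(f'x={x:>2}  x^2={p:>4}  2^x={e:>10}  ratio={r:.2f}')
```

The two functions are equal at x = 2 and x = 4, but beyond that the exponential pulls away and the ratio keeps growing.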

Logarithmic functions

def log(x, base):
    # change of base: log_base(x) = ln(x) / ln(base)
    return np.log(x) / np.log(base)

x1 = np.linspace(0.0001, 5, 1000)
x2 = np.linspace(0.01, 5, 100)

y11, y12 = log(x1, 10), log(x2, np.e)
y21, y22 = log(x1, 1/10), log(x2, 1/np.e)

fig, ax = plt.subplots(1, 2, figsize=(12, 6))

ax[0].plot(x1, y11, label=r"$\log_{10} x$", color='k')
ax[0].plot(x2, y12, '--', label=r"$\log_{e} x$", color='k')

ax[0].set_xlabel('$x$', fontsize=25)
ax[0].set_ylabel('$y$', fontsize=25)
ax[0].legend(fontsize=20, loc='lower right')

ax[1].plot(x1, y21, label=r"$\log_{1/10} x$", color='k')
ax[1].plot(x2, y22, '--', label=r"$\log_{1/e} x$", color='k')

ax[1].set_xlabel('$x$', fontsize=25)
ax[1].set_ylabel('$y$', fontsize=25)
ax[1].legend(fontsize=20, loc='upper right')

plt.show()
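
The change-of-base helper used above can be sanity-checked on values with known answers. This sketch redefines the same `log` function so that it runs on its own; the test values are chosen for illustration.

```python
import numpy as np

def log(x, base):
    # change of base: log_base(x) = ln(x) / ln(base)
    return np.log(x) / np.log(base)

v1 = log(8, 2)       # 2^3 = 8
v2 = log(1000, 10)   # 10^3 = 1000

# A base of 1/a flips the sign: log_{1/a}(x) = -log_a(x)
x = np.linspace(0.01, 5, 100)
flipped = bool(np.allclose(log(x, 1 / 10), -log(x, 10)))

print(round(float(v1), 6), round(float(v2), 6), flipped)
```

The sign flip is why the right-hand panel looks like the left-hand panel mirrored across the x-axis.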

Sigmoid

z = np.linspace(-10, 10, 100)
sigma = 1/(1+np.exp(-z))

plt.figure(figsize=(12,8))
plt.plot(z, sigma)
plt.xlabel('$z$', fontsize=25)
plt.ylabel(r'$\sigma(z)$', fontsize=25)
plt.show()

-> The output always lies between 0 and 1
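
A short numerical check of these properties, as a sketch over the same range as the plot. The function name `sigmoid` is introduced here for convenience.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 101)
s = sigmoid(z)

print('min:', float(s.min()), 'max:', float(s.max()))
print('sigmoid(0):', float(sigmoid(0.0)))

# Symmetry around (0, 0.5): sigmoid(-z) = 1 - sigmoid(z)
symmetric = bool(np.allclose(sigmoid(-z), 1 - s))
# Monotonically increasing on this grid
increasing = bool(np.all(np.diff(s) > 0))
print(symmetric, increasing)
```

Because the output stays between 0 and 1 and increases with z, it is convenient to read it as a probability, which ties back to the threshold discussion earlier in this post.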

Multivariable vector functions

u = np.linspace(0, 1, 30)
v = np.linspace(0, 1, 30) 
U, V = np.meshgrid(u, v)
X = U
Y = V
Z = (1+U**2) + (V/(1+V**2))

fig = plt.figure(figsize=(7, 7))
ax = plt.axes(projection='3d')

ax.xaxis.set_tick_params(labelsize=15)
ax.yaxis.set_tick_params(labelsize=15)
ax.zaxis.set_tick_params(labelsize=15)

ax.set_xlabel(r'$x$', fontsize=20)
ax.set_ylabel(r'$y$', fontsize=20)
ax.set_zlabel(r'$z$', fontsize=20)

ax.scatter3D(X, Y, Z, marker='.', color='gray')

plt.show()

Composite functions

x = np.linspace(-4, 4, 100)
y = x**3 - 15*x + 30  # f(x)
z = np.log(y)         # g(y)

Looking at each function on its own

fig, ax = plt.subplots(1, 2, figsize=(12, 6))

ax[0].plot(x, y, label=r'$x^3 - 15x + 30$', color='k')
ax[0].legend(fontsize=18)

ax[1].plot(y, z, label=r'$\log(y)$', color='k')
ax[1].legend(fontsize=18)

plt.show()

Looking at the composite function

fig, ax = plt.subplots(1, 2, figsize=(12, 6))

ax[0].plot(x, z, '--', label=r'$\log(f(x))$', color='k')
ax[0].legend(fontsize=18)

ax[1].plot(x, y, label=r'$x^3 - 15x + 30$', color='k')
ax[1].legend(fontsize=18)

ax_tmp = ax[1].twinx()
ax_tmp.plot(x, z, '--', label=r'$\log(f(x))$', color='k')

plt.show()
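
The composition can also be written as an explicit function. This is a minimal sketch; the names `f`, `g`, `compose`, and `h` are introduced here for illustration. It also checks that f(x) stays positive on [-4, 4], which is what makes taking the logarithm valid on this range.

```python
import numpy as np

def f(x):
    return x**3 - 15 * x + 30

def g(y):
    return np.log(y)

def compose(outer, inner):
    """Return the function x -> outer(inner(x))."""
    return lambda x: outer(inner(x))

h = compose(g, f)  # h(x) = log(f(x))

x = np.linspace(-4, 4, 100)
positive = bool(np.all(f(x) > 0))  # log is only defined for positive inputs
matches = bool(np.allclose(h(x), np.log(x**3 - 15 * x + 30)))

print(positive, matches)
```

On a wider range such as [-6, 6], f(x) becomes negative and the logarithm would return NaN, so the choice of domain matters.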

boxplot

samples = [1, 7, 9, 16, 36, 39, 45, 45, 46, 48, 51, 100, 101]
tmp_y = [1] * len(samples)

plt.figure(figsize=(12, 4))
plt.scatter(samples, tmp_y)
plt.grid()
plt.show()

np.median(samples)

np.percentile(samples, 25)   # 25th percentile

np.percentile(samples, 75)

iqr = np.percentile(samples, 75) - np.percentile(samples, 25)
iqr * 1.5

Plotting it

q1 = np.percentile(samples, 25)
q2 = np.median(samples)
q3 = np.percentile(samples, 75)

iqr = q3 - q1
upper_fence = q3 + iqr*1.5
lower_fence = q1 - iqr*1.5

plt.figure(figsize=(12, 4))
plt.scatter(samples, tmp_y)

plt.axvline(x=q1, color='black')
plt.axvline(x=q2, color='red')
plt.axvline(x=q3, color='black')

plt.axvline(x=upper_fence, color='black', ls='dashed')
plt.axvline(x=lower_fence, color='black', ls='dashed')

plt.grid()
plt.show()
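
The fences drawn above are what a box plot uses to flag outliers. As a small sketch with the same `samples`, the points that fall outside the fences can be listed directly; the variable name `outliers` is introduced here.

```python
import numpy as np

samples = [1, 7, 9, 16, 36, 39, 45, 45, 46, 48, 51, 100, 101]

q1 = np.percentile(samples, 25)
q3 = np.percentile(samples, 75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is treated as an outlier
outliers = [s for s in samples if s < lower_fence or s > upper_fence]

print(q1, q3, iqr)               # 16.0 48.0 32.0
print(lower_fence, upper_fence)  # -32.0 96.0
print(outliers)                  # [100, 101]
```

Only 100 and 101 lie beyond the upper fence, and nothing falls below the lower fence.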

import seaborn as sns

plt.figure(figsize=(12, 4))
sns.boxplot(x=samples)  # pass the data as x so the box is drawn horizontally
plt.grid()
plt.show()

<Zerobase Data Job School (제로베이스 데이터 취업 스쿨)>
