😢 Study Note (Machine Learning 10)

zoe · May 24, 2023

CREDIT CARD FRAUD DETECTION

๋ฐ์ดํ„ฐ ๊ฐœ์š”
์‹ ์šฉ์นด๋“œ ์‚ฌ๊ธฐ ๊ฒ€์ถœ ๋ถ„๋ฅ˜ ์‹ค์Šต์šฉ ๋ฐ์ดํ„ฐ
๋ฐ์ดํ„ฐ์— class๋ผ๋Š” ์ด๋ฆ„์˜ ์ปฌ๋Ÿผ์ด ์‚ฌ๊ธฐ ์œ ๋ฌด๋ฅผ ์˜๋ฏธ
class ์ปฌ๋Ÿผ์˜ ๋ถˆ๊ท ํ˜•์ด ๊ทน์‹ฌํ•ด์„œ ์ „์ฒด ๋ฐ์ดํ„ฐ์˜ ์•ฝ 0.172%๊ฐ€ 1(์‚ฌ๊ธฐ Fraud)์„ ๊ฐ€์ง

๋ฐ์ดํ„ฐ ํŠน์„ฑ
๊ธˆ์œต ๋ฐ์ดํ„ฐ์ด๊ณ  ๊ธฐ์—…์˜ ๊ธฐ๋ฐ€ ๋ณดํ˜ธ๋ฅผ ์œ„ํ•ด ๋Œ€๋‹ค์ˆ˜ ํŠน์„ฑ์˜ ์ด๋ฆ„์€ ์‚ญ์ œ๋˜์–ด ์žˆ์Œ.

  • Amount : ๊ฑฐ๋ž˜๊ธˆ์•ก
  • Class : Fraud ์—ฌ์œ  (1์ด๋ฉด Fraud)
# Read the data

import pandas as pd

data_path = './creditcard.csv'

raw_data = pd.read_csv(data_path)
raw_data.head()
# Features
# The feature names are hidden.

raw_data.columns
# The labels are highly imbalanced

raw_data['Class'].value_counts()
# fraud rate : 0.17%

fraud_rate = round(raw_data['Class'].value_counts()[1]/len(raw_data) * 100, 2)
print('Frauds', fraud_rate, '% of the dataset')
# Hard to show in a plot because the imbalance is so extreme

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='Class', data=raw_data)
plt.title('Class Distributions \n (0 : No Fraud || 1 : Fraud)', fontsize = 14)
plt.show()
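
A small aside (not in the original note): a log-scaled y-axis makes the tiny fraud class actually visible in the count plot.

# Hedged sketch: log-scale the y-axis so the minority class shows up.
sns.countplot(x='Class', data=raw_data)
plt.yscale('log')
plt.title('Class Distributions, log scale \n (0 : No Fraud || 1 : Fraud)', fontsize=14)
plt.show()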
# For now, select the data as X and y

X = raw_data.iloc[:, 1:-1]
y = raw_data.iloc[:, -1]

X.shape, y.shape
# Split the data
# stratify : split while keeping the class ratio
# https://wikidocs.net/43332
# https://yeko90.tistory.com/entry/what-is-stratify-in-traintestsplit

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3, 
                                                    random_state=13, 
                                                    stratify=y)
# ๋‚˜๋ˆˆ ๋ฐ์ดํ„ฐ์˜ ๋ถˆ๊ท ํ˜• ์ •๋„ ํ™•์ธ

import numpy as np

tmp = np.unique(y_train, return_counts=True)[1]  # counts per class in the train split
print(tmp[1] / len(y_train) * 100, '%')
print(np.unique(y_test, return_counts=True)[1][1] / len(y_test) *100 , '%')




1st Trial

# Write a function that returns the classifier's metrics

from sklearn.metrics import (accuracy_score, precision_score, 
                             recall_score, f1_score, roc_auc_score)

def get_clf_eval(y_test, pred): 
    acc = accuracy_score(y_test, pred)
    pre = precision_score(y_test, pred)
    re = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    auc = roc_auc_score(y_test, pred)  # AUC from hard labels (see the note below)
    
    return acc, pre, re, f1, auc
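
One caveat on this helper: roc_auc_score is given the hard 0/1 predictions, so the reported AUC reflects a single operating point. A probability-based variant is sketched below (get_proba_auc is my name; it assumes the model exposes predict_proba, which all four classifiers used here do).

# Hedged sketch: AUC computed from P(Class == 1) instead of hard labels.
def get_proba_auc(model, X_test, y_test):
    proba = model.predict_proba(X_test)[:, 1]  # probability of the fraud class
    return roc_auc_score(y_test, proba)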
    
# Write a function that prints the metrics

from sklearn.metrics import confusion_matrix

def print_clf_eval(y_test, pred):
    confusion = confusion_matrix(y_test, pred)
    acc, pre, re, f1, auc = get_clf_eval(y_test, pred)
    
    print('=> confusion matrix')
    print(confusion)
    print('====================')
    
    print('Accuracy : {0:.4f}, Precision : {1:.4f}'.format(acc, pre))
    print('Recall : {0:.4f}, F1 : {1:.4f}, AUC : {2:.4f}'.format(re, f1, auc))
# Logistic Regression
# recall is below 60%

from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(random_state=13, solver='liblinear')
lr_clf.fit(X_train, y_train)
lr_pred = lr_clf.predict(X_test)

print_clf_eval(y_test, lr_pred)
# Decision Tree
# recall is 71%

from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(random_state=13, max_depth=4)
dt_clf.fit(X_train, y_train)
dt_pred = dt_clf.predict(X_test)

print_clf_eval(y_test, dt_pred)
# Random Forest
# recall is 74%

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(random_state=13, n_jobs=-1, n_estimators=100)
rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)

print_clf_eval(y_test, rf_pred)
# LightGBM
# recall is 77%

from lightgbm import LGBMClassifier

lgbm_clf = LGBMClassifier(n_estimators=1000, num_leaves=64, n_jobs=-1,
                          boost_from_average=False)  # False is recommended for extreme label imbalance

lgbm_clf.fit(X_train, y_train)
lgbm_pred = lgbm_clf.predict(X_test)

print_clf_eval(y_test, lgbm_pred)
  • What Recall and Precision mean here:
    - From the bank's point of view, high Recall is what matters (catch every fraud).
    - From the user's point of view, high Precision can matter more (don't flag legitimate transactions).
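
That trade-off can also be dialed directly by moving the decision threshold instead of retraining. A minimal sketch (assuming the lr_clf fitted above): lowering the threshold raises recall, raising it favors precision.

# Hedged sketch: sweep the decision threshold to trade precision vs. recall.
proba = lr_clf.predict_proba(X_test)[:, 1]

for threshold in [0.3, 0.5, 0.7]:
    custom_pred = (proba >= threshold).astype(int)
    print('--- threshold :', threshold)
    print_clf_eval(y_test, custom_pred)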




One Step Forward

# ๋ชจ๋ธ๊ณผ ๋ฐ์ดํ„ฐ๋ฅผ ์ฃผ๋ฉด ์„ฑ๋Šฅ์„ ์ถœ๋ ฅํ•˜๋Š” ํ•จ์ˆ˜ ์ƒ์„ฑ

def get_result(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    
    return get_clf_eval(y_test, pred)
# Write a function that collects the metrics of several models into a DataFrame

def get_result_pd(models, model_names, X_train, y_train, X_test, y_test):
    col_names = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
    tmp = []
    
    for model in models:
        tmp.append(get_result(model, X_train, y_train, X_test, y_test))
    
    return pd.DataFrame(tmp, columns=col_names, index=model_names)
# Summarize the four classifiers in a single table
# The ensemble models perform best (RandomForest, LightGBM)

import time

models = [lr_clf, dt_clf, rf_clf, lgbm_clf]
model_names = ['LinearReg', 'DecisionTree', 'RandomForest', 'LightGBM']

start_time = time.time()
results = get_result_pd(models, model_names, X_train, y_train, X_test, y_test)

print('Fit time : ', time.time() - start_time)
results




2nd Trial

# Inspect the Amount column of raw_data
# Amount is the credit card transaction amount
# The distribution is heavily concentrated in a narrow band

import seaborn as sns

plt.figure(figsize=(10, 5))
sns.histplot(raw_data['Amount'], kde=True, color='r')  # distplot is deprecated in recent seaborn
plt.show()
# Apply StandardScaler to the Amount column

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
amount_n = scaler.fit_transform(raw_data['Amount'].values.reshape(-1, 1))

raw_data_copy = raw_data.iloc[:, 1:-2]  # V1..V28 (drop Time, Amount, Class)
raw_data_copy['Amount_Scaled'] = amount_n
raw_data_copy.head()
raw_data['Amount'].values.reshape(-1, 1)  # peek at the reshaped array fed to the scaler
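
Note that the scaler above is fit on the whole dataset before splitting, which leaks test-set statistics into training. A leakage-free variant is sketched below (hypothetical names X_raw, X_tr, X_te, scaler_lf; same split parameters as the rest of this note): split first, then fit the scaler on the training rows only.

# Hedged sketch: split first, then fit StandardScaler on the train rows only.
X_raw = raw_data.iloc[:, 1:-2].copy()      # V1..V28
X_raw['Amount'] = raw_data['Amount']

X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y, test_size=0.3,
                                          random_state=13, stratify=y)

scaler_lf = StandardScaler().fit(X_tr[['Amount']])              # train statistics only
X_tr['Amount'] = scaler_lf.transform(X_tr[['Amount']]).ravel()
X_te['Amount'] = scaler_lf.transform(X_te[['Amount']]).ravel()  # reuse train statistics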
# Split the data

X_train, X_test, y_train, y_test = train_test_split(raw_data_copy, y, test_size=0.3, 
                                                    random_state=13, stratify=y)
# ๋ชจ๋ธ์— ๋‹ค์‹œ ํ‰๊ฐ€

models = [lr_clf, dt_clf, rf_clf, lgbm_clf]
model_names = ['LinearReg', 'DecisionTree', 'RandomForest', 'LightGBM']

start_time = time.time()
results = get_result_pd(models, model_names, X_train, y_train, X_test, y_test)

print('Fit time : ', time.time() - start_time)
results
# ๋ชจ๋ธ๋ณ„ ROC ์ปค๋ธŒ

from sklearn.metrics import roc_curve

def draw_roc_curve(models, model_names, X_test, y_test):
    plt.figure(figsize=(10, 10))
    
    for model, name in zip(models, model_names):
        pred = model.predict_proba(X_test)[:, 1]
        fpr, tpr, thresholds = roc_curve(y_test, pred)
        plt.plot(fpr, tpr, label=name)
        
    plt.plot([0, 1], [0, 1], 'k--', label='random guess')
    plt.title('ROC')
    plt.legend()
    plt.grid()
    plt.show()

draw_roc_curve(models, model_names, X_test, y_test)
# Another attempt : log scale

amount_log = np.log1p(raw_data['Amount'])  # log1p keeps zero amounts finite (log1p(0) = 0)

raw_data_copy['Amount_Scaled'] = amount_log
raw_data_copy.head()
# The distribution changes

plt.figure(figsize=(10, 5))
sns.histplot(raw_data_copy['Amount_Scaled'], kde=True, color='r')

plt.show()
# Check performance again
# Small differences appear, but no clear improvement is observed.

X_train, X_test, y_train, y_test = train_test_split(raw_data_copy, y, test_size=0.3, 
                                                    random_state=13, stratify=y)
start_time = time.time()
results = get_result_pd(models, model_names, X_train, y_train, X_test, y_test)

print('Fit time : ', time.time() - start_time)
results
# ROC curve results

draw_roc_curve(models, model_names, X_test, y_test)




3rd Trial

# Unusual data (outliers)

import seaborn as sns

plt.figure(figsize=(10, 7))
sns.boxenplot(data=raw_data[['V13', 'V14', 'V15']])
# Code that finds the indices of outliers so they can be removed

def get_outlier(df=None, column=None, weight=1.5):
    fraud = df[df['Class'] == 1][column]  # look for outliers only among the fraud rows
    quantile_25 = np.percentile(fraud.values, 25)
    quantile_75 = np.percentile(fraud.values, 75)
    
    iqr = quantile_75 - quantile_25
    iqr_weight = iqr * weight
    lowest_val = quantile_25 - iqr_weight
    highest_val = quantile_75 + iqr_weight
    
    outlier_index = fraud[(fraud < lowest_val) | (fraud > highest_val)].index
    
    return outlier_index
    
    
# Find the outliers

get_outlier(df=raw_data, column='V14', weight=1.5)
# Remove the outliers

raw_data_copy.shape
outlier_index = get_outlier(df=raw_data, column='V14', weight=1.5)
raw_data_copy.drop(outlier_index, axis=0, inplace=True)
raw_data_copy.shape
# Split the data after removing the outliers

X = raw_data_copy

raw_data.drop(outlier_index, axis=0, inplace=True)  # to build y with the same outliers removed
y = raw_data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=13, stratify=y)
# Check performance again
# recall improves further, up to about 80%

models = [lr_clf, dt_clf, rf_clf, lgbm_clf]
model_names = ['LinearReg', 'DecisionTree', 'RandomForest', 'LightGBM']

start_time = time.time()
results = get_result_pd(models, model_names, X_train, y_train, X_test, y_test)

print('Fit time : ', time.time() - start_time)
results
# ROC curves

draw_roc_curve(models, model_names, X_test, y_test)




4th Trial - Oversampling

  • When the classes are severely imbalanced, we can force the two class distributions to match.
  • Undersampling : shrink the majority class down to the size of the minority class.
  • Oversampling
    - Augment the minority class by slightly perturbing the feature values of the original samples.
    - The representative method is SMOTE (Synthetic Minority Over-sampling Technique).
    - For each sample in the minority class, SMOTE finds its k-nearest neighbors and synthesizes new samples between them (see the numpy sketch after this list).
    - The Python package imbalanced-learn implements it.
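
A numpy sketch of that synthesis step (an illustration of the idea only, not imbalanced-learn's actual implementation; smote_one_point is my name):

# Hedged sketch: create one SMOTE-style synthetic sample.
from sklearn.neighbors import NearestNeighbors

def smote_one_point(X_minority, k=5, seed=13):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)  # +1: nearest hit is the point itself
    i = rng.integers(len(X_minority))                         # pick a random minority sample
    neighbors = nn.kneighbors(X_minority[i:i + 1], return_distance=False)[0][1:]
    j = rng.choice(neighbors)                                 # one of its k nearest neighbors
    gap = rng.random()                                        # random spot on the segment
    return X_minority[i] + gap * (X_minority[j] - X_minority[i])

# e.g. smote_one_point(X_train[y_train == 1].values)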

※ Reference, written by someone else : https://velog.io/@jaylnne/ML-%EC%8B%A0%EC%9A%A9%EC%B9%B4%EB%93%9C-%EC%82%AC%EA%B8%B0-%ED%83%90%EC%A7%80-%EB%AA%A8%EB%8D%B8-%EB%A7%8C%EB%93%A4%EC%96%B4%EB%B3%B4%EA%B8%B0

# Install imbalanced-learn

#!pip install imbalanced-learn
# Apply SMOTE

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=13)

X_train_over, y_train_over = smote.fit_resample(X_train, y_train)
# fit_sample has been renamed to fit_resample
# Oversample the train data only; resampling must never be applied to the test set (it would distort the evaluation) ★★★ 
# (a scaler, by contrast, may be fit on train and then applied to test)
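
To make the "resample train only" rule hard to violate, imbalanced-learn also provides its own Pipeline, which applies the sampler during fit only and never when predicting. A sketch (smote_lr is my name):

# Hedged sketch: SMOTE inside an imblearn Pipeline runs only at fit time.
from imblearn.pipeline import Pipeline as ImbPipeline

smote_lr = ImbPipeline([
    ('smote', SMOTE(random_state=13)),
    ('clf', LogisticRegression(random_state=13, solver='liblinear')),
])
smote_lr.fit(X_train, y_train)        # oversampling happens here, on train only
pipe_pred = smote_lr.predict(X_test)  # the test set is never resampled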
# Effect of the data augmentation

X_train.shape, y_train.shape
X_train_over.shape, y_train_over.shape
# Class counts before and after

print(np.unique(y_train, return_counts=True))
print(np.unique(y_train_over, return_counts=True))
# Check performance again
# recall clearly improves (precision for LogisticRegression and DecisionTree drops sharply)

# Error : Found input variables with inconsistent numbers of samples
# Fix : https://lovelydiary.tistory.com/425

models = [lr_clf, dt_clf, rf_clf, lgbm_clf]
model_names = ['LinearReg', 'DecisionTree', 'RandomForest', 'LightGBM']

start_time = time.time()
results = get_result_pd(models, model_names, X_train_over, y_train_over, X_test, y_test)

print('Fit time : ', time.time() - start_time)
results
# ROC curves

draw_roc_curve(models, model_names, X_test, y_test)

Tough... while oversampling I kept getting an error saying the row counts of X_train_over and y_train_over didn't match, but after rerunning everything from the top it worked...? (Most likely stale notebook state: X_train_over was built from an earlier split while y_train and y_test had been redefined by a later one, so the lengths no longer matched.)

💻 Source : Zerobase Data Job School (제로베이스 데이터 취업 스쿨)
