😢 Study Notes (Machine Learning 8)

zoe · May 22, 2023

Logistic Regression - Ensemble Techniques

  • Ensemble learning : a technique that builds several classifiers and combines their predictions to get a more accurate final prediction. Combining the outputs of diverse classifiers gives predictions that are more reliable than those of a single classifier. (Ensemble methods perform especially well for classifiers on structured, tabular data.)
  • voting

  • bagging : sample the data with replacement (duplicates allowed), apply the same algorithm to each sample, and decide the result by voting. Drawing a separate resampled dataset for each classifier in this way is called bootstrapping.

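  • Hard voting for the final decision : each classifier votes for a class label and the majority label wins

  • Soft voting for the final decision : the classifiers' predicted class probabilities are averaged and the class with the highest average wins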

  • Random Forest : the representative bagging method, built from many instances of the same algorithm.
    - Among ensemble methods it is relatively fast and performs well across many domains. → Bootstrapping creates several smaller datasets, allowing samples to overlap.
    - Random forest is based on decision trees → a decision tree is trained on each bootstrapped sample, and the trees' predictions are combined by soft voting to reach the final prediction
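
To make the terms above concrete, here is a minimal plain-Python sketch of bootstrap sampling, hard voting, and soft voting. The three "classifiers" and their probabilities are made-up numbers chosen for illustration, and the helper names (`bootstrap_sample`, `hard_vote`, `soft_vote`) are hypothetical, not part of scikit-learn.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement (duplicates allowed)."""
    return [rng.choice(rows) for _ in rows]


def hard_vote(labels):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probas):
    """Average the per-class probabilities; return (winning class, averages)."""
    n_classes = len(probas[0])
    mean = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: mean[c]), mean


rng = random.Random(13)
data = list(range(10))
sample = bootstrap_sample(data, rng)  # same size as data, may repeat items

# Made-up probabilities from three classifiers for one sample (classes 0 and 1).
probas = [[0.45, 0.55], [0.40, 0.60], [0.90, 0.10]]
labels = [max(range(2), key=lambda c: p[c]) for p in probas]  # each model's own pick

hard = hard_vote(labels)
soft, mean_proba = soft_vote(probas)
print("bootstrap sample:", sample)
print("hard vote:", hard, "| soft vote:", soft, "| mean probabilities:", mean_proba)
```

In this toy case the two rules disagree: two of the three models lean slightly toward class 1, so hard voting picks class 1, while the third model is very confident about class 0 and pulls the averaged probability toward class 0.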



Ensemble Techniques - HAR, Human Activity Recognition

  • An experiment that recognizes human activity using IMU sensors

  • Uses the accelerometer and gyroscope sensors in a phone

  • Official data source : https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones

  • Data characteristics
    - Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration
    - Triaxial angular velocity from the gyroscope
    - A 561-feature vector with time- and frequency-domain variables
    - Activity label
    - An identifier of the subject who carried out the experiment

  • Data classes : Walking Upstairs, Standing, Walking Downstairs, Sitting, Laying, Walking

  • Using the raw time-domain data directly is difficult.
    - To apply machine learning, the time-domain data is converted into several statistical summaries
    - The dataset holds values such as the time-domain mean, variance, peak, and median, and the frequency-domain mean and variance

  • Sensor signal → feature extraction → model training → activity inference
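
The feature-extraction step can be sketched roughly as follows. This is only an illustration of turning one window of sensor samples into a few statistics. The window here is a synthetic sine wave, the function names are hypothetical, and the actual HAR dataset was built with its own, more elaborate preprocessing.

```python
import cmath
import math
import statistics


def time_features(window):
    """Summary statistics of the raw (time-domain) samples."""
    return {
        "mean": statistics.fmean(window),
        "variance": statistics.pvariance(window),
        "peak": max(abs(v) for v in window),
        "median": statistics.median(window),
    }


def magnitude_spectrum(window):
    """Magnitudes of a naive DFT for bins 0..N/2 (slow, but fine for short windows)."""
    n = len(window)
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(window)))
        for k in range(n // 2 + 1)
    ]


def freq_features(spectrum):
    """Summary statistics of the frequency-domain magnitudes."""
    return {
        "freq_mean": statistics.fmean(spectrum),
        "freq_variance": statistics.pvariance(spectrum),
    }


# Synthetic 32-sample window: a sine wave that completes 4 cycles.
window = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
spectrum = magnitude_spectrum(window)
features = {**time_features(window), **freq_features(spectrum)}
dominant_bin = spectrum.index(max(spectrum))

print(features)
print("dominant frequency bin:", dominant_bin)
```

Each window becomes one fixed-length row of numbers, which is the shape a classifier such as a decision tree expects.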

# Read the data

import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/HAR_dataset/features.txt'

feature_name_df = pd.read_csv(url, sep=r'\s+', header=None,
                              names=['column_index', 'column_name'])
# sep=r'\s+' : when the delimiter is whitespace of variable length, use the regular expression \s+
# Reference : https://datascienceschool.net/01%20python/04.02%20%EB%8D%B0%EC%9D%B4%ED%84%B0%20%EC%9E%85%EC%B6%9C%EB%A0%A5.html
# names= : sets the column names

feature_name_df.head()
# There are 561 features

len(feature_name_df)
feature_name = feature_name_df.iloc[:, 1].values.tolist()
feature_name[:10]
# Read the X data

X_train_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/HAR_dataset/train/X_train.txt'
X_test_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/HAR_dataset/test/X_test.txt'

X_train = pd.read_csv(X_train_url, sep=r'\s+', header=None)  # load the training file, not the test file
X_test = pd.read_csv(X_test_url, sep=r'\s+', header=None)
X_train.columns = feature_name
X_test.columns = feature_name
X_train.head()
X_train.info()
X_test.head()
X_test.info()
# Read the y data

y_train_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/HAR_dataset/train/y_train.txt'
y_test_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/HAR_dataset/test/y_test.txt'

y_train = pd.read_csv(y_train_url, sep=r'\s+', header=None, names=['action'])  # training labels, not test labels
y_test = pd.read_csv(y_test_url, sep=r'\s+', header=None, names=['action'])
X_train.shape, X_test.shape, y_train.shape, y_test.shape
# Number of samples per action
# 1. Walking
# 2. WalkingUpstairs
# 3. WalkingDownstairs
# 4. Sitting
# 5. Standing
# 6. Laying

y_train['action'].value_counts()
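
The counts are easier to read with activity names instead of codes. Below is a small sketch using the code-to-name list from the comments above. The `actions` list is made-up sample data, not the real labels, and `ACTION_NAMES` is a name introduced here for illustration.

```python
from collections import Counter

# Code-to-name mapping taken from the comment list in the notes.
ACTION_NAMES = {
    1: "Walking",
    2: "WalkingUpstairs",
    3: "WalkingDownstairs",
    4: "Sitting",
    5: "Standing",
    6: "Laying",
}

# Made-up label codes standing in for y_train['action'].
actions = [5, 4, 6, 1, 1, 3, 2, 5, 6, 6]

counts = Counter(ACTION_NAMES[a] for a in actions)
for name, n in counts.most_common():
    print(name, n)
```

With the real DataFrame, the same idea should work as `y_train['action'].map(ACTION_NAMES).value_counts()`.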




Logistic Regression - Ensemble Techniques - HAR Data - Applying a Decision Tree

# Decision tree

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt_clf = DecisionTreeClassifier(random_state=13, max_depth=4)
dt_clf.fit(X_train, y_train)
pred = dt_clf.predict(X_test)

accuracy_score(y_test, pred)
# Use GridSearchCV to try a range of max_depth values

from sklearn.model_selection import GridSearchCV

params = {'max_depth' : [6, 8, 10, 12, 16, 20, 24]}

grid_cv = GridSearchCV(dt_clf, param_grid=params, scoring='accuracy',
                       cv=5, return_train_score=True) # cv=5 : split the training data into 5 folds for cross-validation

grid_cv.fit(X_train, y_train)

# max_depth=16 comes out as best here (?), which differs from the course material

grid_cv.best_score_
grid_cv.best_params_
# Summarize performance per max_depth in a table (the test data is not used yet; these scores come from the cross-validation splits)

cv_result_df = pd.DataFrame(grid_cv.cv_results_)
cv_result_df[['param_max_depth', 'mean_test_score', 'mean_train_score']]
# Results on the actual test data

max_depths = [6, 8, 10, 12, 16, 20, 24]

for depth in max_depths:
    dt_clf = DecisionTreeClassifier(max_depth=depth, random_state=156)
    dt_clf.fit(X_train, y_train)
    pred = dt_clf.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    print('Max_Depth = ', depth, ', Accuracy = ', accuracy)
# Result of the best model

best_dt_clf = grid_cv.best_estimator_
pred1 = best_dt_clf.predict(X_test)

accuracy_score(y_test, pred1)
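
As a note on what `cv=5` means in the grid search in this section: the training data is split into 5 folds, and each fold takes one turn as the validation set while the other 4 are used for training. Here is a minimal plain-Python sketch of that splitting. The `k_fold_indices` helper is hypothetical and does not shuffle or stratify, whereas scikit-learn uses a stratified variant for classifiers when `cv` is an integer.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) for k contiguous folds, without shuffling."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


folds = list(k_fold_indices(n_samples=12, k=5))
for i, (train, valid) in enumerate(folds):
    print(f"fold {i}: validate on {valid}, train on {len(train)} samples")
```

`mean_test_score` in `cv_results_` is the average of the validation scores over these folds, so it is computed only from the training data and never touches `X_test`.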




Logistic Regression - Ensemble Techniques - HAR Data - Applying Random Forest

# Apply random forest

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

params = {
    'max_depth' : [6, 8, 10],
    'n_estimators' : [50, 100, 200], # n_estimators : number of trees to use
    'min_samples_leaf' : [8, 12], # min_samples_leaf : minimum number of samples required in a leaf node
    'min_samples_split' : [8, 12] # min_samples_split : minimum number of samples required to split a node
}

rf_clf = RandomForestClassifier(random_state=13, n_jobs=-1) # n_jobs : number of CPU cores to use (-1 uses all of them)
grid_cv = GridSearchCV(rf_clf, param_grid=params, cv = 2, n_jobs=-1)
grid_cv.fit(X_train, y_train)
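
To get a feel for how much work this grid search does, the number of parameter combinations can be counted by hand. The sketch below reuses the `params` grid and `cv=2` from the cell above. The variable names are introduced here for illustration.

```python
from itertools import product

params = {
    'max_depth': [6, 8, 10],
    'n_estimators': [50, 100, 200],
    'min_samples_leaf': [8, 12],
    'min_samples_split': [8, 12],
}

# Every combination of one value per parameter is one candidate model.
combos = [dict(zip(params, values)) for values in product(*params.values())]
n_candidates = len(combos)   # 3 * 3 * 2 * 2
cv = 2
n_fits = n_candidates * cv   # each candidate is trained once per fold

print(n_candidates, "candidates,", n_fits, "cross-validation fits")
print("first candidate:", combos[0])
```

With the default `refit=True`, GridSearchCV also refits the best candidate once on the whole training set, so `best_estimator_` is already trained when the search finishes.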
# Prepare the results for review

cv_results_df = pd.DataFrame(grid_cv.cv_results_)
cv_results_df.columns
cv_results_df.head()
# Performance is good

target_col = ['rank_test_score', 'mean_test_score', 'param_n_estimators', 'param_max_depth']
cv_results_df[target_col].sort_values('rank_test_score').head()
# Best model

grid_cv.best_params_
grid_cv.best_score_
# Apply to the test data

rf_clf_best = grid_cv.best_estimator_
rf_clf_best.fit(X_train, y_train)
pred1 = rf_clf_best.predict(X_test)
pred1
accuracy_score(y_test, pred1)
# Check the important features
# Take only the 20 most influential features

best_cols_values = rf_clf_best.feature_importances_
best_cols = pd.Series(best_cols_values, index=X_train.columns)

best_cols_values, best_cols
# No single feature has a high importance on its own.

top20_cols = best_cols.sort_values(ascending=False)[:20]
top20_cols
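
One reason the individual importances look small: a random forest's `feature_importances_` are normalized to sum to 1, so with 561 features an evenly spread importance would be only about 0.0018 each. The sketch below shows that arithmetic plus a top-k selection. The importance values and feature names are made up, not results from this model.

```python
n_features = 561
uniform_share = 1 / n_features  # importance per feature if all were equal

# Made-up importances for a few features; the rest of the total of 1.0
# would be spread over the features not listed here.
importances = {"feat_a": 0.04, "feat_b": 0.03, "feat_c": 0.02, "feat_d": 0.01}


def top_k(importances, k):
    """Return the k (name, importance) pairs with the largest importance."""
    return sorted(importances.items(), key=lambda kv: kv[1], reverse=True)[:k]


top2 = top_k(importances, 2)
top2_share = sum(value for _, value in top2)

print(f"uniform share: {uniform_share:.4f}")
print("top 2:", top2, "| combined share:", round(top2_share, 2))
```

A value of 0.04 looks low in isolation, but in this made-up example it is more than 20 times the uniform share.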
# Inspect the main features

import seaborn as sns

plt.figure(figsize=(8, 8))
sns.barplot(x=top20_cols, y=top20_cols.index)
plt.show()
# The top 20 features

top20_cols.index
# Check performance again using only the 20 features

X_train_re = X_train[top20_cols.index]
X_test_re = X_test[top20_cols.index]
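# Using only 20 features instead of 561 should make computation much faster,
# even if accuracy drops slightly

from sklearn.base import clone

# clone() makes a fresh, unfitted copy so rf_clf_best (fitted on all 561 features) is not overwritten
rf_clf_best_re = clone(grid_cv.best_estimator_)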
rf_clf_best_re.fit(X_train_re, y_train.values.reshape(-1,))

pred1_re = rf_clf_best_re.predict(X_test_re)

accuracy_score(y_test, pred1_re)

This was really hard... T_T

💻 Source : Zerobase Data Employment School (제로베이스 데이터 취업 스쿨)
