[TIL_Carrotww] 27 - 22/10/11

μœ ν˜•μ„Β·2022λ…„ 10μ›” 11일

πŸ“Carrotww의 μ½”λ”© 기둝μž₯

🧲 Trying Out Kaggle Data

πŸ” Salary - 연차에 λ”°λ₯Έ 연봉 변화도
μ—°μ°¨κ°€ μž…λ ₯으둜 λ“€μ–΄κ°€κ³  연봉이 좜λ ₯으둜 λ‚˜μ˜€λŠ” 단일 μ„ ν˜• ν˜•νƒœλ‘œ μ‰¬μš΄νŽΈμ— μ†ν•œλ‹€.
λ‚œ μ²˜μŒν•΄λ΄μ„œ μ–΄λ ΅λ‹€...


import os
os.environ['KAGGLE_USERNAME'] = 'carrotww' # username
os.environ['KAGGLE_KEY'] = '****************' # placeholder - your own key goes here
# for KAGGLE_KEY, always paste in and use your own Kaggle API key

!kaggle datasets download -d rsadiq/salary
# download the dataset, then
!unzip salary.zip
# unzip it

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam, SGD
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.model_selection import train_test_split

df = pd.read_csv('Salary.csv')
# read the unzipped file with pandas' read_csv
# and store it in df

df.tail(5)
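
A quick sanity check on what got loaded (a minimal sketch; the column names are assumed from how they are used just below):

print(df.columns.tolist()) # expect ['YearsExperience', 'Salary']
print(df.shape)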

x_data = np.array(df['YearsExperience'], dtype=np.float32)
# x data takes the years of experience, converted to a numpy array;
# machine learning training mostly uses float dtypes
y_data = np.array(df['Salary'], dtype=np.float32)

x_data = x_data.reshape((-1, 1))
y_data = y_data.reshape((-1, 1))
# there is only one feature per sample, so we pass 1

print(x_data.shape)
print(y_data.shape)
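
As a small aside on that reshape (my own illustration with hypothetical values): reshape((-1, 1)) turns a flat array of N values into an (N, 1) column, which is the shape Dense expects.

example = np.array([1.1, 2.0, 3.3], dtype=np.float32)
print(example.shape)                  # (3,)
print(example.reshape((-1, 1)).shape) # (3, 1)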

x_train, x_val, y_train, y_val = train_test_split(x_data, y_data, test_size=0.2, random_state=2021)
# split off 20% of the data as the validation set

print(x_train.shape, x_val.shape)
print(y_train.shape, y_val.shape)

model = Sequential([
  Dense(1)
])
# this sets the model up as a linear regression

model.compile(loss='mean_squared_error', optimizer=SGD(learning_rate=0.01))
# the optimizer is SGD with a learning rate of 0.01
# the optimizer can also be swapped for Adam
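
For reference, that Adam swap is a one-line change (the learning rate here is my assumption):

# model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.01))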

model.fit(
    x_train,
    y_train,
    validation_data=(x_val, y_val), # with validation data passed in, Keras validates automatically at the end of every epoch
    epochs=100 # note that epochs is plural
)
# fit runs the training
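
Since Dense(1) is exactly H(x) = Wx + b, the learned weight and bias can be read back after training (a minimal sketch using Keras' get_weights):

W, b = model.layers[0].get_weights()
print(W, b) # one weight (salary increase per year of experience) and one bias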

y_pred = model.predict(x_val)
# predict on the x_val data and store the result in y_pred
plt.scatter(x_val, y_val)
plt.scatter(x_val, y_pred, color='r')
# draw the predicted values in red
plt.show()
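
To put a number on how far the red dots sit from the blue ones, the validation loss can also be read off directly (a minimal sketch):

val_mse = model.evaluate(x_val, y_val)
print(val_mse) # mean squared error on the validation set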

🧲 Logistic Regression

πŸ” 논리 νšŒκ·€λž€ ?
μ‰½κ²Œ 말해 μ„ ν˜• νšŒκ·€λ‘œ ν’€ 수 μ—†λŠ” 문제λ₯Ό ν’€κΈ° μœ„ν•΄ logistic(sigmoid) function 을 μ‚¬μš©ν•œλ‹€.

πŸ” λ§Œμ•½ 좜λ ₯ μˆ˜μΉ˜κ°€ 정해진 값이 κΆκΈˆν•˜λ‹€λ©΄?
κ°€λ Ή 톡과, μ‹€νŒ¨ 같은 True False 와 같은 값은 μ„ ν˜•νšŒκ·€ κ·Έλž˜ν”„λ‘œ ν‘œν˜„ν•˜κΈ° μ–΄λ ΅λ‹€.
예λ₯Ό λ“€μ–΄ 학점이 μ•„λ‹Œ pass/fail κ³Όλͺ©μ΄ μžˆμ„ λ•Œ λͺ‡ μ‹œκ°„μ„ 곡뢀해야 passκ°€ λ‚˜μ˜€λŠ”μ§€ κΆκΈˆν•˜λ‹€. 10μ‹œκ°„μ„ 곡뢀해야 passκ°€ λ‚˜μ˜¨λ‹€λ©΄ 5, 6, 7μ‹œκ°„μ„ κ³΅λΆ€ν–ˆμ„λ•ŒλŠ” 무쑰건 fail인가?

If so, the graph would be a hard step: fail everywhere below the cutoff, pass at and beyond it. Using the logistic (sigmoid) function instead produces a smooth S-shaped curve.

On that curve, the pass/fail decision was set at a probability of 50%.

πŸ” μ„ ν˜• νšŒκ·€μ—μ„œμ˜ 가섀은 H(x) = Wx + b 식이며 논리 νšŒκ·€μ—μ„œλŠ” μ‹œκ·Έλͺ¨μ΄λ“œ ν•¨μˆ˜μ— μ„ ν˜•νšŒκ·€ μˆ˜μ‹μ„ 넣은 것.
μ‰½κ²Œ 결과값이 0 ~ 1이 λ‚˜μ˜€κ²Œ ν•˜κ³  μ‹Άμ–΄μ„œ λ‚˜μ˜¨ 것이 논리 νšŒκ·€ 이닀. μˆ˜ν•™μ μΈ λΆ€λΆ„ λ³΄λ‹€λŠ” 이해가 μ€‘μš”ν•˜κΈ° λ•Œλ¬Έμ— μˆ˜μ‹μ€ pass..γ…Žγ…Ž
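
Still, a minimal Keras sketch of the idea, reusing the imports above (the layer setup is my assumption, not from the lesson): adding a sigmoid activation to the same Dense(1) squashes Wx + b into the 0 ~ 1 range.

model = Sequential([
    Dense(1, activation='sigmoid') # sigmoid(Wx + b) always lands between 0 and 1
])
model.compile(loss='binary_crossentropy', optimizer=SGD(learning_rate=0.01))
# predictions above 0.5 read as pass, below 0.5 as fail (the 50% cutoff above)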

🧲 What Is Crossentropy?

πŸ” 논리 νšŒκ·€ κ·Έλž˜ν”„λ₯Ό μ˜ˆμΈ‘ν•œ κ·Έλž˜ν”„λ‘œ λ§Œλ“€κΈ° μœ„ν•΄μ„œ λ„μ™€μ£ΌλŠ” 손싀 ν•¨μˆ˜ 이닀.
