[TIL_Carrotww] 33 - 22/10/19

Yu Hyeong-seok · October 19, 2022

📝 Carrotww's Coding Notebook

🧲 Training an AI on the Titanic Dataset

🔍 Trying out model training with sklearn

I'm working on a Django project that applies machine learning. On a screen like the one above, the user uploads a face photo and enters the information below. For the photo, we bring in a separate image-recognition model that detects gender and age; the user's information is then fed to a model that tells them whether they would have survived.

import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split # training and testing data split
from sklearn import metrics # accuracy measure

plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')

# 데이터λ₯Ό 뢈러였고 보여쀀닀.
train_data=pd.read_csv('/titanic/train.csv')
test_data=pd.read_csv('/titanic/test.csv')

# data.head()λŠ” μ•žμ˜ 5κ°œλ§Œμ„ 보여쀀닀.
# print(train_data.head())

for col in train_data.columns :
    msg = 'ν•­λͺ© {:>10}\t λΉ„μ–΄μžˆλŠ” 자료의 λΉ„μœ¨ : {:.2f}%'.format(col, 100 * (train_data[col].isnull().sum() / train_data[col].shape[0]))
    # print(msg)

for col in test_data.columns :
    msg = 'ν•­λͺ© {:>10}\t λΉ„μ–΄μžˆλŠ” 자료의 λΉ„μœ¨ : {:.2f}%'.format(col, 100 * (test_data[col].isnull().sum() / test_data[col].shape[0]))
    # print(msg)

train_data.isnull().sum()
# print(train_data.isnull().sum())
# train_data μ—΄ λΆ€λΆ„μ˜ λΉ„μ–΄μžˆλŠ” 데이터 λͺ¨λ‘ sum() ν•˜μ—¬ λ³΄μ—¬μ€Œ

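# Extract the title (the word right before a period) from each name.
# A raw string avoids the invalid-escape warning for '\.'; expand=False returns a Series.
train_data['Initial']= train_data.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
test_data['Initial']= test_data.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
# print(test_data.Name.str.extract(r'([A-Za-z]+)\.'))
# print(train_data.Name.str.extract(r'([A-Za-z]+)\.'))

# Fold rare titles into five groups. Assigning the result back avoids relying on
# inplace=True on a column selection, which newer pandas versions may not apply.
train_data['Initial']=train_data['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'])
test_data['Initial']=test_data['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'])
# Show the mean age for each title group.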

train_data.groupby('Initial')['Age'].mean()
# print(train_data.groupby('Initial')['Age'].mean())

train_data.loc[(train_data.Age.isnull())&(train_data.Initial=='Mr'),'Age']=33
train_data.loc[(train_data.Age.isnull())&(train_data.Initial=='Mrs'),'Age']=36
train_data.loc[(train_data.Age.isnull())&(train_data.Initial=='Master'),'Age']=5
train_data.loc[(train_data.Age.isnull())&(train_data.Initial=='Miss'),'Age']=22
train_data.loc[(train_data.Age.isnull())&(train_data.Initial=='Other'),'Age']=46

test_data.loc[(test_data.Age.isnull())&(test_data.Initial=='Mr'),'Age'] = 33
test_data.loc[(test_data.Age.isnull())&(test_data.Initial=='Mrs'),'Age'] = 36
test_data.loc[(test_data.Age.isnull())&(test_data.Initial=='Master'),'Age'] = 5
test_data.loc[(test_data.Age.isnull())&(test_data.Initial=='Miss'),'Age'] = 22
test_data.loc[(test_data.Age.isnull())&(test_data.Initial=='Other'),'Age'] = 46

train_data.Age.isnull().any()
test_data.Age.isnull().any()
train_data['Embarked']=train_data['Embarked'].fillna('S')

train_data['Age_band']=0
train_data.loc[train_data['Age']<=16,'Age_band']=0
train_data.loc[(train_data['Age']>16)&(train_data['Age']<=32),'Age_band']=1
train_data.loc[(train_data['Age']>32)&(train_data['Age']<=48),'Age_band']=2
train_data.loc[(train_data['Age']>48)&(train_data['Age']<=64),'Age_band']=3
train_data.loc[train_data['Age']>64,'Age_band']=4
train_data.head()

#family size max=4
train_data['Family_Size']=0
train_data['Family_Size']=train_data['Parch']+train_data['SibSp']

#Alone
train_data['Alone']=0
train_data.loc[train_data.Family_Size==0,'Alone']=1

# Encode the categorical columns as integers (assigned back instead of inplace=True).
train_data['Sex']=train_data['Sex'].replace(['male','female'],[0,1])
train_data['Embarked']=train_data['Embarked'].replace(['S','C','Q'],[0,1,2])
train_data['Initial']=train_data['Initial'].replace(['Mr','Mrs','Miss','Master','Other'],[0,1,2,3,4])

train_data.drop(['Name','Age','Ticket','Cabin','PassengerId','SibSp','Parch','Initial'],axis=1,inplace=True)

train,test=train_test_split(train_data,test_size=0.3,random_state=0,stratify=train_data['Survived'])
train_X=train[train.columns[1:]]
train_Y=train['Survived']  # 1-D target; a one-column DataFrame makes fit() warn about its shape
test_X=test[test.columns[1:]]
test_Y=test['Survived']
X=train_data[train_data.columns[1:]]
Y=train_data['Survived']

model = LogisticRegression()
model.fit(train_X,train_Y)
prediction3=model.predict(test_X)

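# Feature order: Pclass, Sex, Fare, Embarked, Age_band, Family_Size, Alone
test_x = [[1, 0, 10.0000, 1, 1, 1, 1]]
# accuracy_score expects (y_true, y_pred)
print('The accuracy of the Logistic Regression is',metrics.accuracy_score(test_Y,prediction3))

# Predict survival for the single hand-written passenger above.
test_test = model.predict(test_x)
print(test_test)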

https://welcome-to-dewy-world.tistory.com/4?category=913368
I wrote this with reference to the code at the link above. What each line produces and why it is there can be understood intuitively. Personally, I found image-processing code very hard to follow because it works with 2D and even 3D arrays and is not intuitive, but training a model on tabular data rather than images is fairly easy to understand as long as the method is simple.
There is a Kaggle approach with about 10,000 upvotes that I want to try on the weekend if I have time.
My accuracy differs from the referenced code by about 4 percent, so once the project is finished I should go back and pick the code apart line by line.
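To make the title-grouping step above easier to follow, here is a minimal standalone sketch of what the `str.extract` and `replace` calls do, written with only the standard library. The function name `extract_initial` and the constant names are my own, and the two "Example" names are made up for illustration; the regex and the mapping are copied from the script.

```python
import re

# Same pattern the script passes to str.extract: letters followed by a period.
TITLE_PATTERN = re.compile(r'([A-Za-z]+)\.')

# Rare titles folded into the groups used in the script (mapping copied from it).
TITLE_GROUPS = {
    'Mlle': 'Miss', 'Mme': 'Miss', 'Ms': 'Miss',
    'Dr': 'Mr', 'Major': 'Mr', 'Capt': 'Mr', 'Sir': 'Mr', 'Don': 'Mr',
    'Lady': 'Mrs', 'Countess': 'Mrs',
    'Jonkheer': 'Other', 'Col': 'Other', 'Rev': 'Other',
}

def extract_initial(name):
    """Return the grouped title for a passenger name, or None if no title is found.

    Hypothetical helper, not part of the original script.
    """
    match = TITLE_PATTERN.search(name)  # first match, like str.extract
    if match is None:
        return None
    title = match.group(1)
    # Titles that are not in the mapping (Mr, Mrs, Miss, Master, ...) stay as they are.
    return TITLE_GROUPS.get(title, title)

samples = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Example, Mlle. Jeanne',   # made-up name
    'Example, Rev. John',      # made-up name
]
for name in samples:
    print(name, '->', extract_initial(name))
```

Titles outside the mapping pass through unchanged, which matches how `replace` behaves in the script.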

Personally, at first it felt as overwhelming as when I started studying C++, but it was amazing to train a model on data and actually see results come out.

The data in test_x is the format the project will use to pass data to the model. The project will receive an array like this as input and hand it to the model as is.
The port of embarkation has a big influence on what the model learns, but since this is a test done for fun, I arbitrarily fixed it to a Korean port for convenience.

🧲 Algorithm

🔍 The problems just aren't getting solved, and I didn't have enough time to work on them anyway. I don't think I'll get through this week's quota... The project runs until Friday morning, so I'll have to solve a lot on Friday and over the weekend...

Carrot_hyeong