Machine Learning_2

YJ · May 16, 2023

▷ Today's study plan: machine learning lectures (3-4)

📖 Chapter 02

Machine learning workflow

1) Exploratory data analysis

Survival overview

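import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# assumes `titanic` is an already-loaded pandas DataFrame
f,ax = plt.subplots(1,2,figsize=(18,8))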
titanic['survived'].value_counts().plot.pie(explode=[0,0.1], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('Pie plot - Survived')
ax[0].set_ylabel('')
sns.countplot(x='survived', data=titanic, ax=ax[1])
ax[1].set_title('Count plot - Survived')
plt.show()
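
The percentages printed on the pie chart are the share of each value in `survived`, formatted by `autopct='%1.1f%%'`. A minimal plain-Python sketch of that arithmetic, using made-up labels rather than the real Titanic column (the `value_shares` helper is hypothetical):

```python
from collections import Counter

# Hypothetical labels for illustration only (0 = did not survive, 1 = survived).
survived = [0, 0, 0, 1, 1]

def value_shares(values):
    """Return {value: percentage string}, formatted like autopct='%1.1f%%'."""
    counts = Counter(values)
    total = len(values)
    return {value: '%1.1f%%' % (100 * count / total)
            for value, count in counts.items()}

shares = value_shares(survived)
print(shares)
```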

Survival by sex

f,ax = plt.subplots(1,2,figsize=(18,8))
sns.countplot(x='sex', data=titanic, ax=ax[0])
ax[0].set_title('Count of Passengers of Sex')
ax[0].set_ylabel('')
sns.countplot(x='sex', hue='survived', data=titanic, ax=ax[1])
ax[1].set_title('Sex: Survived and Unsurvived')
plt.show()

Survival rate by economic status (passenger class)

pd.crosstab(titanic['pclass'], titanic['survived'], margins=True)
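
To see what the cross-tabulation counts, here is a rough plain-Python sketch of the same idea: count every (class, survived) combination and add `All` totals, which is what `margins=True` contributes. The rows are invented for illustration and `crosstab_with_margins` is a hypothetical helper, not a pandas function.

```python
from collections import Counter

# Hypothetical (pclass, survived) pairs for illustration only.
rows = [(1, 1), (1, 1), (1, 0), (2, 1), (2, 0), (3, 0), (3, 0), (3, 1)]

def crosstab_with_margins(pairs):
    """Count each (row, column) combination and add 'All' totals."""
    table = {}
    for (row_key, col_key), count in Counter(pairs).items():
        row = table.setdefault(row_key, Counter())
        row[col_key] += count
        row['All'] += count          # right-hand margin: row total
    totals = Counter()
    for row in table.values():
        totals.update(row)           # bottom margin: column totals
    table['All'] = totals
    return table

table = crosstab_with_margins(rows)
for key, row in table.items():
    print(key, dict(row))

# Survival rate per class = survivors / passengers in that class.
rates = {k: table[k][1] / table[k]['All'] for k in (1, 2, 3)}
print(rates)
```

Dividing the survivor count by the row total turns the counts into the survival rate that this section's heading refers to.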

Sex distribution by cabin class

grid = sns.FacetGrid(titanic, row='pclass', col='sex', height=4, aspect=2)
grid.map(plt.hist, 'age', alpha=0.8, bins=20)
grid.add_legend()

Passengers by age

import plotly.express as px
fig=px.histogram(titanic, x='age')
fig.show()

Survival rate by cabin class (age)

Split age into five categories

titanic['age_cat'] = pd.cut(titanic['age'], bins=[0,7,15,30,60,100], include_lowest=True, labels=['baby', 'teen', 'young', 'adult', 'old'])
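
The bins are right-closed: an age of exactly 7 falls in 'baby' and 15 in 'teen', while `include_lowest=True` lets age 0 into the first bin. A small plain-Python sketch that mimics this rule with the same edges and labels (`age_category` is a hypothetical helper written for this note, not part of pandas):

```python
from bisect import bisect_left

# Same edges and labels as the pd.cut call above.
EDGES = [0, 7, 15, 30, 60, 100]
LABELS = ['baby', 'teen', 'young', 'adult', 'old']

def age_category(age):
    """Mimic right-closed bins (0, 7], (7, 15], ... with the lowest edge included."""
    if age is None or age != age:        # None or NaN: no category
        return None
    if age < EDGES[0] or age > EDGES[-1]:
        return None                      # outside every bin
    # bisect_left finds the first edge >= age; the bin index is one less.
    index = max(bisect_left(EDGES, age) - 1, 0)
    return LABELS[index]

for age in (0, 7, 7.5, 18, 30, 61):
    print(age, age_category(age))
```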

Survival rate by age group, sex, and class

plt.figure(figsize=(14,6))
#1
plt.subplot(131)  # first panel in a 1-row, 3-column layout
sns.barplot(x='pclass', y='survived', data=titanic)
#2
plt.subplot(132)
sns.barplot(x='age_cat', y='survived', data=titanic)
#3
plt.subplot(133)
sns.barplot(x='sex', y='survived', data=titanic)
plt.subplots_adjust(top=1, bottom=0.1, left=0.1, right=1, hspace=0.5, wspace=0.5)

Survival by age for men and women

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,6))
# women (sns.distplot is deprecated; sns.histplot draws the same histogram)
women=titanic[titanic['sex']=='female']
ax=sns.histplot(women[women['survived']==1]['age'], bins=20, label='survived', ax=axes[0])
ax=sns.histplot(women[women['survived']==0]['age'], bins=40, label='not_survived', ax=axes[0])
ax.legend()
ax.set_title('Female')
# men
men=titanic[titanic['sex']=='male']
ax=sns.histplot(men[men['survived']==1]['age'], bins=18, label='survived', ax=axes[1])
ax=sns.histplot(men[men['survived']==0]['age'], bins=40, label='not_survived', ax=axes[1])
ax.legend()
ax.set_title('Male')

Social status (title)

import re
title = []
for idx, dataset in titanic.iterrows():
    tmp = dataset['name']
    title.append(re.search(r",\s\w+(\s\w+)?\.", tmp).group()[2:-1])
titanic['title'] = title

titanic[['title', 'survived']].groupby(['title'], as_index=False).mean()
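
The pattern matches ", " followed by one or two words and a period, and the `[2:-1]` slice strips the leading ", " and the trailing ".". The sketch below tries it on a few invented names in the "Surname, Title. Given name" style and then averages `survived` per title, which is the figure the `groupby(...).mean()` call reports. The names, the `extract_title` helper, and the `None` fallback for names without a match are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Same pattern as above, written as a raw string.
TITLE_PATTERN = re.compile(r",\s\w+(\s\w+)?\.")

def extract_title(name):
    """Return the text between ', ' and '.', or None when nothing matches."""
    match = TITLE_PATTERN.search(name)
    return match.group()[2:-1] if match else None

# Hypothetical (name, survived) records, not rows from the real dataset.
records = [
    ("Doe, Miss. Jane", 1),
    ("Roe, Master. Tom", 1),
    ("Smith, Mr. John", 0),
    ("Brown, Mr. Peter", 1),
    ("Noble, the Countess. of", 1),
]

# Mean of `survived` per title.
grouped = defaultdict(list)
for name, survived in records:
    grouped[extract_title(name)].append(survived)
mean_by_title = {title: sum(v) / len(v) for title, v in grouped.items()}
print(mean_by_title)
```

Without the `None` guard, a name that lacks the pattern would make `re.search` return `None` and `.group()` raise an `AttributeError`.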

2) Predicting survivors with machine learning

Convert string columns to numbers with LabelEncoder

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(titanic['sex'])
titanic['gender'] = le.transform(titanic['sex'])
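
`LabelEncoder` assigns integer codes to the distinct labels in sorted order, so 'female' becomes 0 and 'male' becomes 1. A plain-Python sketch of that mapping, with a hypothetical helper name and sample values:

```python
def fit_label_mapping(values):
    """Map each distinct label to an integer, assigning codes in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

sex = ['male', 'female', 'female', 'male']   # hypothetical sample values
mapping = fit_label_mapping(sex)
gender = [mapping[value] for value in sex]
print(mapping)
print(gender)
```

Knowing the direction of the mapping matters later, when a `gender` value has to be typed in by hand for a new passenger.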

Remove missing values

titanic = titanic[titanic['age'].notnull()]
titanic = titanic[titanic['fare'].notnull()]

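Check correlations

# numeric_only=True skips string columns such as 'name', which recent pandas no longer drops silently
correlation_matrix = titanic.corr(numeric_only=True).round(1)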
sns.heatmap(data=correlation_matrix, annot=True, cmap ='bwr')
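
Each cell of the heatmap is a Pearson correlation coefficient, the default method of `DataFrame.corr`. As a rough illustration of what one cell holds, the sketch below computes it by hand for two short invented columns; the `pearson` helper and the numbers are assumptions, not values from the dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values: fare falls as the class number rises, so r is negative.
pclass = [1, 1, 2, 2, 3, 3]
fare = [80.0, 70.0, 30.0, 20.0, 10.0, 8.0]
r = pearson(pclass, fare)
print(round(r, 1))
```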

Select features and split the data

from sklearn.model_selection import train_test_split
X=titanic[['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender']]
y=titanic['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)
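
`test_size=0.2` holds out 20% of the rows for evaluation, and a fixed `random_state` makes the shuffle repeatable. The sketch below shows the idea in plain Python. It is not scikit-learn's implementation: the helper name is hypothetical, and the shuffle order and rounding of the test count will differ from `train_test_split`.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=13):
    """Shuffle row indices with a fixed seed, then hold out a test_size share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx = indices[:n_test]      # first n_test shuffled rows become the test set
    train_idx = indices[n_test:]     # the remainder is used for training
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))               # stand-in for ten passengers
train, test = simple_train_test_split(rows)
print(len(train), len(test))
```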

DecisionTree

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
dt = DecisionTreeClassifier(max_depth=4, random_state=13)
dt.fit(X_train, y_train)
# accuracy
pred = dt.predict(X_test)
print(accuracy_score(y_test, pred))  # 0.76
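
Accuracy is the fraction of test rows whose predicted label equals the true label. A minimal sketch of that calculation on invented labels (the `accuracy` helper is written for this note and only mirrors what `accuracy_score` computes in this simple case):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = [1, 0, 1, 1]    # hypothetical true labels
y_pred = [1, 0, 0, 1]    # hypothetical predictions: one mistake out of four
score = accuracy(y_true, y_pred)
print(score)
```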

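DiCaprio's survival probability

import pandas as pd
# columns: pclass=3, age=18, sibsp=0, parch=0, fare=5, gender=1
# (LabelEncoder assigns codes in sorted order, so 'female' -> 0 and 'male' -> 1)
dicaprio = pd.DataFrame([[3, 18, 0, 0, 5, 1]], columns=X.columns)
print('dicaprio: ', dt.predict_proba(dicaprio)[0,1])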
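
For a decision tree, `predict_proba` returns the fraction of training samples of each class in the leaf that the input lands in, and `[0,1]` picks the probability of class 1 (survived) for the first row. A small sketch with invented leaf counts; `leaf_probabilities` is a hypothetical helper:

```python
def leaf_probabilities(class_counts):
    """Turn per-class sample counts in a leaf into class probabilities."""
    total = sum(class_counts.values())
    return {label: count / total for label, count in class_counts.items()}

# Hypothetical leaf: 30 training passengers who died, 10 who survived.
leaf = {0: 30, 1: 10}
proba = leaf_probabilities(leaf)
print(proba[1])
```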

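Markdown

Markdown: a tool for making Jupyter notebooks both attractive and practical
Two * on each side of the text → bold
One * on each side of the text → italic
Three * on each side of the text → bold and italic

Math formulas

Data science

Purpose: to test assumptions (or perceptions)

Python

A language you can use as far as you know it; many people can read it

Open Source

Rapid progress and mutual verification, user feedback, and insight into market size and trends

▷ Tomorrow's study plan: machine learning lectures (5-7)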

[This post was written from excerpts of the lecture materials of the Zerobase Data Job School.]
