[Day 9] FUNDAMENTAL 12. Machine Learning with scikit-learn

백건 · January 21, 2022


[AIFFEL] AIFFEL AI Expert Course


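Learning objectives

Introduce a range of machine learning algorithms.
Learn how to use the scikit-learn library.
Understand how scikit-learn represents data, and how to split data into a training set and a test set.

Machine learning algorithms
Machine learning algorithms as guided by scikit-learn
Hello Scikit-learn
Main scikit-learn modules
- 4.1. Data representation
- 4.2. Regression model practice
- 4.3. The datasets module
- 4.4. Classification practice with a scikit-learn dataset
- 4.5. Estimator
Splitting training and test data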

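Machine learning algorithms

Types of algorithms

The approaches below are sometimes combined.
For example, a project may start with supervised learning and switch to unsupervised learning when the number of dimensions and features becomes large.

Supervised Learning

A supervised learning algorithm makes predictions based on a set of examples.
The training data is labeled: it pairs input variables with the desired output variable.
The algorithm analyzes the training data to find a function that maps inputs to outputs.
By generalizing from the training data, this function maps new, previously unknown examples
and so predicts results in unseen situations.
Suited to cases where the number of dimensions and features is small.
Used in AlphaGo's initial training.
Drawback
- Labeling the data takes a lot of time and money.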

Classification

Used when the data is used to predict a categorical variable,
for example assigning a label or indicator such as "dog" or "cat" to an image.
With two labels, the task is called binary classification.
With more than two categories, it is called multi-class classification.

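Regression

Predicting a continuous value is a regression problem.

Forecasting

The process of predicting the future from past and present data.
Most commonly used to analyze trends.
Example: estimating next year's sales from this year's and last year's sales.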

Semi-Supervised Learning (or Weakly Supervised Learning)

Summary
- Used when labeled data is limited.
- Unlabeled examples are used to improve supervised learning.
- Because the machine is not fully supervised, it is said to be "semi-supervised".
- A small amount of labeled data is used together with the unlabeled examples to improve learning accuracy.

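Unsupervised Learning

Suited to cases where the number of dimensions and features is large.
The machine is given only unlabeled data
and is asked to discover the intrinsic patterns underlying the data, such as:
- a clustering structure
- a low-dimensional manifold
- a sparse tree and graph

Clustering

Groups similar data examples into one set according to some criterion.
Often used to segment the whole dataset into several groups.
Analysis can then be performed within each group to find its intrinsic patterns.

Dimension Reduction

Reduces the number of variables under consideration.
Lowering the dimensionality makes it easier to uncover the true latent relationships.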

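Reinforcement Learning

The learning method used to optimize AlphaGo.
Reinforcement learning analyzes and optimizes the behavior of an agent based on feedback from the environment.
Rather than being told which action to take, the machine tries different scenarios to discover the actions that yield the greatest reward.
Characteristics
- Trial-and-error
- Delayed reward
Terms
- Agent: the learner (also called actor or controller)
- Environment: the setting, situation, and conditions the agent is placed in
- Action: what the agent decides to do based on the information it receives from the environment
- Reward: the payoff for an action, designed by the machine learning engineer
References
- Reinforcement Learning KR
- aikorea/awesome-rl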

Monte Carlo methods

Q-Learning

Policy Gradient methods

Reference
Cheat sheet for picking the best algorithm

Algorithm cheat sheet

The only sure way to find the best algorithm is to try them all. The cheat sheet narrows the search; read each path through it as a rule:
1. If [path label] then use [algorithm]
2. If you want to perform dimension reduction then use principal component analysis.
3. If you need a numeric prediction quickly, use decision trees or logistic regression.
4. If you need a hierarchical result, use hierarchical clustering.

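Considerations when choosing an algorithm

Consider accuracy, training time, and ease of use.
First priority: work out how to obtain results at all, whatever those results turn out to look like.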

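When to use specific algorithms

Linear regression and Logistic regression

Linear regression
- An approach to modeling the relationship between a continuous dependent variable y and one or more predictors x
Logistic regression
- Used when the dependent variable is categorical rather than continuous
- Linear regression is turned into logistic regression by applying a logit link function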
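The contrast between the two models can be sketched on synthetic data. The data, the true slope of 3.0, and the threshold of 5 are assumptions made up for illustration; only `LinearRegression` and `LogisticRegression` come from scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))          # one feature, 200 samples

# Continuous target: linear regression
y_cont = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=200)
lin = LinearRegression().fit(X, y_cont)

# Categorical (binary) target: logistic regression
y_cat = (X[:, 0] > 5).astype(int)
log = LogisticRegression().fit(X, y_cat)

print("estimated slope:", lin.coef_[0])
print("class probabilities at x=2 and x=8:")
print(log.predict_proba([[2.0], [8.0]]))
```

The linear model outputs a number on a continuous scale, while the logistic model outputs a probability per class that `predict()` turns into a label.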

Linear SVM and Kernel SVM

The kernel trick maps data that is not linearly separable into a higher-dimensional space where it becomes linearly separable.
The support vector machine (SVM) training algorithm
- finds a classifier represented by the normal vector 'w' and bias 'b' of a hyperplane
- the hyperplane (boundary) separates the classes with the largest possible margin
→ the problem becomes a constrained optimization problem
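A minimal sketch of the difference, assuming a synthetic two-ring dataset from `make_circles` (not part of the original notes) that no straight line can separate:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

# Training accuracy only, to show what each decision boundary can express
print("linear kernel:", linear_acc)
print("rbf kernel   :", rbf_acc)
```

The RBF kernel implicitly works in a higher-dimensional space, so it can draw the circular boundary that the linear kernel cannot.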

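Trees and ensemble trees

Decision trees, random forest, and gradient boosting are all based on decision trees.
They split the feature space into regions that mostly share the same label.
A decision tree is easy to understand and implement, but it tends to overfit when the branches are grown exhaustively and the tree becomes very deep.
Random forest and gradient boosting are two common ways of combining trees to reach high accuracy while reducing overfitting.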
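The overfitting tendency of a fully grown tree can be observed by comparing training and test accuracy. This is a sketch on the wine dataset; the 70/30 split and `random_state=0` are arbitrary choices for illustration, and the exact scores depend on them.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# A tree with no depth limit keeps splitting until it fits the training set
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# A forest averages many randomized trees
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

tree_train = tree.score(X_train, y_train)
tree_test = tree.score(X_test, y_test)
forest_test = forest.score(X_test, y_test)
print("tree   train/test:", tree_train, tree_test)
print("forest test      :", forest_test)
```

A gap between the tree's training and test scores is the symptom of overfitting that ensembles are meant to reduce.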

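Neural networks and deep learning

→ Convolutional neural network architecture (image source: Wikipedia, Creative Commons)

Structure
- Input layer
- Hidden layers
- Output layer
Training samples
- define the input and output layers
- When the output layer is a categorical variable, the network solves a classification problem
- When the output layer is a continuous variable, the network is used for regression
- When the output layer is the same as the input layer, the network is used to extract intrinsic features
- The number of hidden layers determines model complexity and modeling capacity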

K-means/K-modes, Gaussian mixture model (GMM) clustering

Goal
- Partition n observations into k clusters
K-means uses hard assignment: each sample belongs to exactly one cluster.
GMM uses soft assignment: each sample gets a probability of belonging to each cluster, so it is not tied to a single one.
Both are fast and simple when the number of clusters k is given.
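Hard and soft assignment show up directly in the shape of what each model returns. A small sketch, assuming synthetic blobs from `make_blobs` as stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Hard assignment: one cluster index per sample
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
hard = kmeans.labels_                      # shape (300,)

# Soft assignment: one probability per sample per component
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft = gmm.predict_proba(X)                # shape (300, 3), each row sums to 1

print(hard[:5])
print(np.round(soft[:5], 3))
```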

DBSCAN

When the number of clusters k is not given, samples can be connected through density diffusion
→ use DBSCAN (density-based spatial clustering of applications with noise)
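A minimal sketch of clustering without specifying k. The two-moons data and the values `eps=0.2` and `min_samples=5` are assumptions chosen for this toy example; real data usually needs these tuned.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# No n_clusters argument: clusters grow by linking densely packed neighbours
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                        # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("clusters found:", n_clusters, "noise points:", n_noise)
```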

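Hierarchical clustering

A hierarchical partition can be visualized with a tree structure called a dendrogram.
It does not need the number of clusters as input: the partition can be viewed at different levels of granularity, refining or coarsening the clusters by using different values of K.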
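The idea of building the hierarchy once and cutting it at several granularities can be sketched with SciPy. The blob data and Ward linkage are assumptions for illustration; the same linkage matrix `Z` could be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=4, random_state=0)

# Build the full merge tree once; no cluster count is needed here
Z = linkage(X, method="ward")

# Cut the same tree at two different levels of granularity
labels_2 = fcluster(Z, t=2, criterion="maxclust")
labels_4 = fcluster(Z, t=4, criterion="maxclust")
print(len(set(labels_2)), len(set(labels_4)))
```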

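PCA, SVD, LDA

Feeding a large number of features directly into a machine learning algorithm is generally not preferred
→ some features may be irrelevant, or the intrinsic dimensionality may be smaller than the number of features
- Principal component analysis (PCA)
- Singular value decomposition (SVD)
- Latent Dirichlet allocation (LDA)
→ these perform dimension reduction
PCA
- PCA is an unsupervised dimension-reduction method that maps the original data space into a lower-dimensional space
  while preserving as much information as possible
- PCA finds the subspace that preserves the most variance in the data
- The subspace is defined by the dominant eigenvectors of the data's covariance matrix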
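The link between PCA and the covariance matrix can be checked numerically. This sketch uses the wine data as an arbitrary example and compares scikit-learn's `PCA` with a direct eigendecomposition in NumPy; eigenvectors are only defined up to sign, so the comparison uses absolute dot products.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA

X = load_wine().data
Xc = X - X.mean(axis=0)                       # center each feature

# Eigen-decomposition of the covariance matrix (eigh returns ascending order)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
top_vals = eigvals[::-1][:2]                  # two largest eigenvalues
top_vecs = eigvecs[:, ::-1][:, :2]            # two dominant eigenvectors

pca = PCA(n_components=2).fit(X)

# |dot product| of 1 means the directions coincide
match = np.abs(np.sum(pca.components_ * top_vecs.T, axis=1))
print(match)
print(pca.explained_variance_, top_vals)
```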

SVD
- Related to PCA: the SVD of the centered data matrix (features vs. samples)
  provides the dominant left singular vectors, which define
  the same subspace that PCA finds
- SVD is a more versatile technique because it can do things PCA cannot
- Example: the SVD of a user-versus-movie matrix extracts
  user profiles and movie profiles that a recommender system can use
- In natural language processing (NLP) it is widely used
  as a topic modeling tool,
  known as latent semantic analysis
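The SVD and PCA relationship can be verified on a small random matrix. The data here is made up; the matrix is arranged as features vs. samples so that the left singular vectors are the ones to compare.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # 100 samples, 5 features
Xc = X - X.mean(axis=0)                                   # centered data

# SVD of the centered matrix laid out as (features, samples)
U, s, Vt = np.linalg.svd(Xc.T, full_matrices=False)

pca = PCA(n_components=2).fit(X)

# Dominant left singular vectors vs. PCA components (equal up to sign)
alignment = np.abs(np.sum(U[:, :2] * pca.components_.T, axis=0))
print(alignment)
```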
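Latent Dirichlet allocation (LDA)
- A technique related to natural language processing (NLP)
- A probabilistic topic model that
  decomposes documents into topics, much as a Gaussian mixture model (GMM)
  decomposes continuous data into Gaussian densities
- Unlike GMM, it models discrete data (the words in documents)
- It constrains the topics to be
  a priori distributed according to a Dirichlet distribution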

Summary

Define the problem. What problem do you want to solve?
Start simple.
Be familiar with the data and the baseline results.
Then try more complicated things.

Machine learning algorithms as guided by scikit-learn

How does scikit-learn categorize algorithms?

Choosing the right estimator

Four algorithm tasks in scikit-learn

Classification (7)

SGD Classifier
KNeighborsClassifier
LinearSVC
NaiveBayes
SVC
Kernel approximation
EnsembleClassifiers

Regression (7)

SGD Regressor
Lasso
ElasticNet
RidgeRegression
SVR(kernel='linear')
SVR(kernel='rbf')
EnsembleRegressors

Clustering (6)

Spectral
GMM
KMeans
MiniBatch KMeans
MeanShift
VBGMM

Dimensionality Reduction (5)

Randomized PCA
Isomap
Spectral Embedding
kernel approximation
LLE

Criteria scikit-learn uses to choose an algorithm

Amount of data
Presence of labels (whether there is a ground truth)
Type of data: numeric (quantity) or categorical (category)

Hello Scikit-learn

Installation

pip install scikit-learn

import sklearn
print(sklearn.__version__)

1.0.2

A look at scikit-learn

scikit-learn introduction video

The function scikit-learn provides for splitting data into training and test sets is
train_test_split

How to use scikit-learn

In scikit-learn:

Objects that perform ETL (Extract, Transform, Load)
- transformers, through their transform() method
Classes that represent a model
- Estimator
  - methods
    - fit()
    - predict()
The scikit-learn API that combines transformers and an Estimator
- Pipeline
- meta-estimator

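Summary

The transform() method of transformers and the fit() and predict() methods of Estimator objects are the key pieces. The train_test_split() function in model_selection shuffles the data randomly and splits it into training and test sets.

scikit-learn is a Python machine learning library whose data representation and math functions are similar to those of SciPy and NumPy. A typical machine learning workflow preprocesses the data (ETL), then trains a model and makes predictions. In scikit-learn, transformers handle the ETL step, and Estimator objects handle training and prediction through their fit() (train) and predict() (predict) methods. A Pipeline chains the transformers and the Estimator into a single object, which can then be trained and validated as a whole.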
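A minimal sketch of the transformer, Estimator, and Pipeline roles. `StandardScaler` and `LogisticRegression` are illustrative choices that do not appear in the original notes; any transformer and estimator pair would follow the same pattern.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Transformer on its own: fit() learns the statistics, transform() applies them
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Pipeline: the same preprocessing chained with an estimator in one object
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)                 # scales, then trains
pred = pipe.predict(X[:5])     # scales with the learned statistics, then predicts
print(pred)
```

Because the pipeline exposes `fit()` and `predict()` itself, it can be used anywhere a single estimator is expected.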

Module: data representation

Datasets

NumPy ndarray
Pandas DataFrame
SciPy Sparse Matrix
When working with models (training, prediction, and so on)
- use the following methods, known as the core API
  - fit()
  - transform()
  - predict()

Frequently used API

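Data representation

Feature Matrix
- The input data.
- Feature
  - An individual measurable property of the data, expressed as a numeric, discrete, or boolean value
  - Corresponds to a column of the feature matrix
- Sample: one input record; corresponds to a row of the feature matrix
- n_samples: number of rows (number of samples)
- n_features: number of columns (number of features)
- X: the feature matrix is conventionally named X.
- It is a 2-D array of shape [n_samples, n_features], i.e. [rows, columns]
- Stored as a NumPy ndarray, Pandas DataFrame, or SciPy Sparse Matrix
Target Vector
- The labels (ground truth) of the input data.
- Target
  - Also called label or target value
  - What we want to predict from the feature matrix
- n_samples: length of the vector (number of labels)
- The target vector has no n_features
- y: conventionally named y
- The target vector is usually a 1-D vector
- Stored as a NumPy ndarray or Pandas Series
- In some cases the target is not 1-D (for example, multi-output problems)

❗️ n_samples of the feature matrix X and n_samples of the target vector y must be equal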
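The shape conventions can be made concrete with a tiny hand-written dataset. The numbers and column names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Feature matrix X: 3 samples (rows) x 2 features (columns)
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 3.4]])
# Target vector y: one label per sample
y = np.array([0, 0, 1])

n_samples, n_features = X.shape
print(n_samples, n_features)   # 3 2
print(y.shape)                 # (3,)

# The same data as a DataFrame and a Series
X_df = pd.DataFrame(X, columns=["feature_a", "feature_b"])
y_s = pd.Series(y, name="label")
print(X_df.shape, y_s.shape)   # (3, 2) (3,)
```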

Module: regression model practice

# A model that predicts data using regression

import numpy as np
import matplotlib.pyplot as plt
r = np.random.RandomState(10)
x = 10 * r.rand(100)
y = 2 * x - 3 * r.rand(100)
plt.scatter(x,y)

<matplotlib.collections.PathCollection at 0x11bde3110>

# Shape of the input data x
x.shape

(100,)

# Shape of the target data y
y.shape

(100,)

Both x and y have shape (100,), i.e. they are 1-D vectors

Create the model object

The model to use is LinearRegression,
found in sklearn.linear_model

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model

LinearRegression()

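Train the model

The training method is fit()

fit() takes the feature matrix and the target vector as arguments:

the input data as a 2-D matrix and the labels as a 1-D vector

Passing x as it is raises an error, because x is 1-D

x is a NumPy ndarray, so it can be converted with reshape()

# ! This raises an error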
model.fit(x, y)

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

/var/folders/59/gjb3x8rx30s2cxwfl3zh2m040000gn/T/ipykernel_725/3325953541.py in <module>
      1 # ! 에러 발생
----> 2 model.fit(x, y)


~/opt/anaconda3/envs/dev/lib/python3.7/site-packages/sklearn/linear_model/_base.py in fit(self, X, y, sample_weight)
    661 
    662         X, y = self._validate_data(
--> 663             X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
    664         )
    665 


~/opt/anaconda3/envs/dev/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    579                 y = check_array(y, **check_y_params)
    580             else:
--> 581                 X, y = check_X_y(X, y, **check_params)
    582             out = X, y
    583 


~/opt/anaconda3/envs/dev/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    974         ensure_min_samples=ensure_min_samples,
    975         ensure_min_features=ensure_min_features,
--> 976         estimator=estimator,
    977     )
    978 


~/opt/anaconda3/envs/dev/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    771                     "Reshape your data either using array.reshape(-1, 1) if "
    772                     "your data has a single feature or array.reshape(1, -1) "
--> 773                     "if it contains a single sample.".format(array)
    774                 )
    775 


ValueError: Expected 2D array, got 1D array instead:
array=[7.71320643 0.20751949 6.33648235 7.48803883 4.98507012 2.24796646
 1.98062865 7.60530712 1.69110837 0.88339814 6.85359818 9.53393346
 0.03948266 5.12192263 8.12620962 6.12526067 7.21755317 2.91876068
 9.17774123 7.14575783 5.42544368 1.42170048 3.7334076  6.74133615
 4.41833174 4.34013993 6.17766978 5.13138243 6.50397182 6.01038953
 8.05223197 5.21647152 9.08648881 3.19236089 0.90459349 3.00700057
 1.13984362 8.28681326 0.46896319 6.26287148 5.47586156 8.19286996
 1.9894754  8.56850302 3.51652639 7.54647692 2.95961707 8.8393648
 3.25511638 1.65015898 3.92529244 0.93460375 8.21105658 1.5115202
 3.84114449 9.44260712 9.87625475 4.56304547 8.26122844 2.51374134
 5.97371648 9.0283176  5.34557949 5.90201363 0.39281767 3.57181759
 0.7961309  3.05459918 3.30719312 7.73830296 0.39959209 4.29492178
 3.14926872 6.36491143 3.4634715  0.43097356 8.79915175 7.63240587
 8.78096643 4.17509144 6.05577564 5.13466627 5.97836648 2.62215661
 3.00871309 0.25399782 3.03062561 2.42075875 5.57578189 5.6550702
 4.75132247 2.92797976 0.64251061 9.78819146 3.39707844 4.95048631
 9.77080726 4.40773825 3.18272805 5.19796986].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

# Store the feature matrix in X
X = x.reshape(100,1)

# Pass X to fit()
model.fit(X,y)

LinearRegression()

→ Training on the input data and its labels is complete

Predict on new data

Generate new data with np.linspace()
Predict with predict()
The argument to predict() must also be a 2-D matrix

x_new = np.linspace(-1, 11, 100)
X_new = x_new.reshape(100,1)
y_new = model.predict(X_new)

Passing -1 for one dimension in reshape() makes NumPy compute that dimension from the remaining ones.
x_new has 100 elements, so it can be reshaped to (100, 1), (2, 50), and so on.
Passing (2, -1) reshapes it to (2, 50) automatically.

X_ = x_new.reshape(-1,1)
X_.shape

(100, 1)
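A quick check of how -1 is resolved, using a plain 100-element array as a stand-in for x_new:

```python
import numpy as np

a = np.arange(100)

col = a.reshape(-1, 1)    # single feature: 100 rows, 1 column
two = a.reshape(2, -1)    # 2 rows, remaining dimension inferred as 50
row = a.reshape(1, -1)    # single sample: 1 row, 100 columns

print(col.shape)   # (100, 1)
print(two.shape)   # (2, 50)
print(row.shape)   # (1, 100)
```

The `(-1, 1)` form is the one scikit-learn's error message recommends when the data has a single feature.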

Performance evaluation: how well does the trained regression model predict?

Evaluation functions live in the sklearn.metrics module
For regression models, RMSE (Root Mean Square Error) is a common metric

Scikit-learn: Mean Squared Error

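# RMSE = square root of mean_squared_error
from sklearn.metrics import mean_squared_error

# Compare y with predictions for the same inputs X.
# (y_new was predicted on the separate x_new grid, so it cannot be compared with y.)
y_pred = model.predict(X)
error = np.sqrt(mean_squared_error(y, y_pred))

print(error)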
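RMSE is the square root of the mean squared difference between true values and predictions made for the same samples. A small sketch with made-up numbers, comparing a manual computation with `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions for the same four samples
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Manual RMSE: sqrt(mean((y_true - y_hat)^2))
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))
# Same value through scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

Both arrays must refer to the same samples in the same order; otherwise the value has no meaning as an error measure.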

Putting it together

# 1. Create the model object
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model

# 2. Train the model

# Store the feature matrix in X
X = x.reshape(100,1)
# Pass X to fit()
model.fit(X,y)

# 3. Predict on new data
x_new = np.linspace(-1, 11, 100) # generate new data with np.linspace()
X_new = x_new.reshape(100,1)
y_new = model.predict(X_new)

# 4. Evaluate the model
# RMSE = square root of mean_squared_error, computed on the same inputs X as y
from sklearn.metrics import mean_squared_error
error = np.sqrt(mean_squared_error(y, model.predict(X)))
print(error)

# 5. Plot the result
plt.scatter(x, y, label='input data')
plt.plot(X_new, y_new, color='red', label='regression line')





[<matplotlib.lines.Line2D at 0x11bdc3dd0>]

The points in the plot and the regression line nearly coincide

Module: the datasets module

The sklearn.datasets module
is divided into

dataset loaders
- provide Toy datasets (small datasets bundled with the library)
dataset fetchers
- provide Real World datasets (larger datasets downloaded on first use)

Examples of Toy datasets

datasets.load_boston(): regression, Boston house prices (deprecated in 1.0 and removed in version 1.2)
datasets.load_breast_cancer(): classification, breast cancer diagnosis
datasets.load_digits(): classification, digits 0 to 9
datasets.load_iris(): classification, iris species
datasets.load_wine(): classification, wine cultivars

datasets.load_wine()

Load the wine classification data and assign it to a variable named data

from sklearn.datasets import load_wine
data = load_wine()

Check the type

type(data)

sklearn.utils.Bunch

The type is sklearn.utils.Bunch
→ Bunch is a data type similar to a Python dictionary

print(data)

{'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
        1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
        1.050e+03],
       [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
        1.185e+03],
       ...,
       [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
        8.350e+02],
       [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
        8.400e+02],
       [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
        5.600e+02]]), 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2]), 'frame': None, 'target_names': array(['class_0', 'class_1', 'class_2'], dtype='<U7'), 'DESCR': '.. _wine_dataset:\n\nWine recognition dataset\n------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 178 (50 in each of three classes)\n    :Number of Attributes: 13 numeric, predictive attributes and the class\n    :Attribute Information:\n \t\t- Alcohol\n \t\t- Malic acid\n \t\t- Ash\n\t\t- Alcalinity of ash  \n \t\t- Magnesium\n\t\t- Total phenols\n \t\t- Flavanoids\n \t\t- Nonflavanoid phenols\n \t\t- Proanthocyanins\n\t\t- Color intensity\n \t\t- Hue\n \t\t- OD280/OD315 of diluted wines\n \t\t- Proline\n\n    - class:\n            - class_0\n            - class_1\n            - class_2\n\t\t\n    :Summary Statistics:\n    \n    ============================= ==== ===== ======= =====\n                                   Min   Max   Mean     SD\n    ============================= ==== ===== ======= =====\n    Alcohol:                      11.0  14.8    13.0   0.8\n    Malic Acid:                   0.74  5.80    2.34  1.12\n    Ash:                          1.36  3.23    2.36  0.27\n    Alcalinity of Ash:            10.6  30.0    19.5   3.3\n    Magnesium:                    70.0 162.0    99.7  14.3\n    Total Phenols:                0.98  3.88    2.29  0.63\n    Flavanoids:                   0.34  5.08    2.03  1.00\n    Nonflavanoid Phenols:         0.13  0.66    0.36  0.12\n    Proanthocyanins:              0.41  3.58    1.59  0.57\n    Colour Intensity:              1.3  13.0     5.1   2.3\n    Hue:                          0.48  1.71    0.96  0.23\n    OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71\n    Proline:                       278  1680     746   315\n    ============================= ==== ===== ======= =====\n\n    :Missing Attribute Values: None\n    :Class Distribution: class_0 (59), class_1 (71), class_2 (48)\n    :Creator: R.A. 
Fisher\n    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n    :Date: July, 1988\n\nThis is a copy of UCI ML Wine recognition datasets.\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data\n\nThe data is the results of a chemical analysis of wines grown in the same\nregion in Italy by three different cultivators. There are thirteen different\nmeasurements taken for different constituents found in the three types of\nwine.\n\nOriginal Owners: \n\nForina, M. et al, PARVUS - \nAn Extendible Package for Data Exploration, Classification and Correlation. \nInstitute of Pharmaceutical and Food Analysis and Technologies,\nVia Brigata Salerno, 16147 Genoa, Italy.\n\nCitation:\n\nLichman, M. (2013). UCI Machine Learning Repository\n[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,\nSchool of Information and Computer Science. \n\n.. topic:: References\n\n  (1) S. Aeberhard, D. Coomans and O. de Vel, \n  Comparison of Classifiers in High Dimensional Settings, \n  Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of  \n  Mathematics and Statistics, James Cook University of North Queensland. \n  (Also submitted to Technometrics). \n\n  The data was used with many others for comparing various \n  classifiers. The classes are separable, though only RDA \n  has achieved 100% correct classification. \n  (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) \n  (All results using the leave-one-out technique) \n\n  (2) S. Aeberhard, D. Coomans and O. de Vel, \n  "THE CLASSIFICATION PERFORMANCE OF RDA" \n  Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of \n  Mathematics and Statistics, James Cook University of North Queensland. 
\n  (Also submitted to Journal of Chemometrics).\n', 'feature_names': ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']}

Printing data shows that

the contents are enclosed in braces {}
and pairs are separated with a colon :
→ keys and values
- The dictionary method keys() also works on a Bunch

data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])
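Since a Bunch behaves like a dictionary, each entry can be read either with a key or as an attribute. A short sketch:

```python
from sklearn.datasets import load_wine

data = load_wine()

# Dictionary-style and attribute-style access return the same object
same = data["data"] is data.data
print(same)

# Iterate over keys like a normal dict
for key in data.keys():
    print(key, type(data[key]).__name__)
```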

Meaning of each key

data

The feature matrix
Keys can be accessed as attributes with .

data.data

array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
        1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
        1.050e+03],
       [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
        1.185e+03],
       ...,
       [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
        8.350e+02],
       [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
        8.400e+02],
       [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
        5.600e+02]])

The feature matrix is 2-D
Rows are the samples (n_samples)
Columns are the features (n_features)

data.data.shape

(178, 13)

→ A feature matrix with 13 features and 178 samples

Check the number of dimensions with ndim

data.data.ndim

target

The target vector
The target vector is 1-D

data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

The length of the target vector must match the number of samples in the feature matrix

data.target.shape

(178,)

It matches the number of samples in the feature matrix

feature_names

The data key showed how many features there are;
the feature_names key stores their names

data.feature_names

['alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280/od315_of_diluted_wines',
 'proline']

Check the number of features
→ use the built-in len()

len(data.feature_names)

The number of feature_names matches n_features (columns) of the feature matrix

target_names

target_names holds the names of the classes to predict

data.target_names

array(['class_0', 'class_1', 'class_2'], dtype='<U7')

Each sample is classified as class_0, class_1, or class_2

DESCR

DESCR is short for description: a description of the dataset

print(data.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
    ============================= ==== ===== ======= =====
                                   Min   Max   Mean     SD
    ============================= ==== ===== ======= =====
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0.98  3.88    2.29  0.63
    Flavanoids:                   0.34  5.08    2.03  1.00
    Nonflavanoid Phenols:         0.13  0.66    0.36  0.12
    Proanthocyanins:              0.41  3.58    1.59  0.57
    Colour Intensity:              1.3  13.0     5.1   2.3
    Hue:                          0.48  1.71    0.96  0.23
    OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71
    Proline:                       278  1680     746   315
    ============================= ==== ===== ======= =====

    :Missing Attribute Values: None
    :Class Distribution: class_0 (59), class_1 (71), class_2 (48)
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML Wine recognition datasets.
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The data is the results of a chemical analysis of wines grown in the same
region in Italy by three different cultivators. There are thirteen different
measurements taken for different constituents found in the three types of
wine.

Original Owners: 

Forina, M. et al, PARVUS - 
An Extendible Package for Data Exploration, Classification and Correlation. 
Institute of Pharmaceutical and Food Analysis and Technologies,
Via Brigata Salerno, 16147 Genoa, Italy.

Citation:

Lichman, M. (2013). UCI Machine Learning Repository
[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
School of Information and Computer Science. 

.. topic:: References

  (1) S. Aeberhard, D. Coomans and O. de Vel, 
  Comparison of Classifiers in High Dimensional Settings, 
  Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of  
  Mathematics and Statistics, James Cook University of North Queensland. 
  (Also submitted to Technometrics). 

  The data was used with many others for comparing various 
  classifiers. The classes are separable, though only RDA 
  has achieved 100% correct classification. 
  (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) 
  (All results using the leave-one-out technique) 

  (2) S. Aeberhard, D. Coomans and O. de Vel, 
  "THE CLASSIFICATION PERFORMANCE OF RDA" 
  Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of 
  Mathematics and Statistics, James Cook University of North Queensland. 
  (Also submitted to Journal of Chemometrics).

Module: classification practice with a scikit-learn dataset

Displaying as a DataFrame

The feature matrix can be displayed as a Pandas DataFrame

import pandas as pd

pd.DataFrame(data.data, columns=data.feature_names)

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	color_intensity	hue	od280/od315_of_diluted_wines	proline
0	14.23	1.71	2.43	15.6	127.0	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065.0
1	13.20	1.78	2.14	11.2	100.0	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050.0
2	13.16	2.36	2.67	18.6	101.0	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185.0
3	14.37	1.95	2.50	16.8	113.0	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480.0
4	13.24	2.59	2.87	21.0	118.0	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...
173	13.71	5.65	2.45	20.5	95.0	1.68	0.61	0.52	1.06	7.70	0.64	1.74	740.0
174	13.40	3.91	2.48	23.0	102.0	1.80	0.75	0.43	1.41	7.30	0.70	1.56	750.0
175	13.27	4.28	2.26	20.0	120.0	1.59	0.69	0.43	1.35	10.20	0.59	1.56	835.0
176	13.17	2.59	2.37	20.0	120.0	1.65	0.68	0.53	1.46	9.30	0.60	1.62	840.0
177	14.13	4.10	2.74	24.5	96.0	2.05	0.76	0.56	1.35	9.20	0.61	1.60	560.0

178 rows × 13 columns

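The data is much easier to read as a DataFrame
This makes EDA (Exploratory Data Analysis) much more convenient

Machine learning

Create the feature matrix

By convention the feature matrix is stored in X and the target vector in y

X = data.data
y = data.target

Create the model

This is a classification problem, so use RandomForestClassifier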

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()

Train

model.fit(X, y)

RandomForestClassifier()

Predict

y_pred = model.predict(X)

Evaluate performance

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Pass the target vector (labels) y and the predictions y_pred as arguments.
print(classification_report(y, y_pred))
# Print the accuracy.
print("accuracy = ", accuracy_score(y, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        59
           1       1.00      1.00      1.00        71
           2       1.00      1.00      1.00        48

    accuracy                           1.00       178
   macro avg       1.00      1.00      1.00       178
weighted avg       1.00      1.00      1.00       178

accuracy =  1.0

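Module: Estimator

Estimator objects

An object that estimates the parameters of a machine learning model from a dataset is called an Estimator
Every machine learning model in scikit-learn is implemented as an Estimator Python class
Estimation, i.e. training, is done with the Estimator's fit() method
Prediction is done with the predict() method

LinearRegression() and RandomForestClassifier() are both Estimator objects

Figure: solving the wine classification problem

Figure: solving the linear regression problem

What if there is no target vector?

In unsupervised learning the data has no labels, so no Target Vector is passed to fit()
With scikit-learn's Estimator objects, training and prediction work the same way for supervised and unsupervised learning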
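A sketch of the unsupervised case, where `fit()` receives only the feature matrix. `KMeans` with three clusters and the scaling step are assumptions chosen for illustration on the wine features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X = load_wine().data
X_scaled = StandardScaler().fit_transform(X)   # put features on a common scale

km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X_scaled)                  # no y argument: there are no labels to learn from
labels = km.predict(X_scaled)     # cluster index for each sample

print(labels[:10])
```

The cluster indices are arbitrary identifiers and are not guaranteed to line up with the dataset's class numbers.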

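Splitting training and test data

An Estimator's fit() and predict() methods should receive different data

As the figure below shows, the data used for training and the data used for prediction must not be the same

Splitting training and test data by hand

Set the ratio of training data to test data to 8:2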

from sklearn.datasets import load_wine
data = load_wine()
print(data.data.shape)
print(data.target.shape)

(178, 13)
(178,)

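There are 178 samples in total.

Split the feature matrix and the target vector 8 to 2
A sample count must be an integer
80% of 178 is 142.4,
so use 142 samples for training
and the remaining 36 for testing

The feature matrix and target vector are ndarrays, so use NumPy slicing

Split the feature matrix

# The feature matrix and target vector are ndarrays, so use NumPy slicing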

X_train = data.data[:142]
X_test = data.data[142:]
print(X_train.shape, X_test.shape)

(142, 13) (36, 13)

Split the target vector

y_train = data.target[:142]
y_test = data.target[142:]
print(y_train.shape, y_test.shape)

(142,) (36,)

Train

# The training/test split is done. Now train and predict again
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)

RandomForestClassifier()

Predict

y_pred = model.predict(X_test)

Evaluate accuracy

from sklearn.metrics import accuracy_score

print("accuracy =", accuracy_score(y_test, y_pred))

accuracy = 0.9444444444444444
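One caveat of slicing in order: the wine targets are stored sorted by class, so the last 36 samples all belong to a single class. Counting labels in each part makes this visible:

```python
import numpy as np
from sklearn.datasets import load_wine

data = load_wine()
y_train = data.target[:142]
y_test = data.target[142:]

# Class labels and their counts in each part of the sequential split
train_classes, train_counts = np.unique(y_train, return_counts=True)
test_classes, test_counts = np.unique(y_test, return_counts=True)

print("train:", dict(zip(train_classes.tolist(), train_counts.tolist())))
print("test :", dict(zip(test_classes.tolist(), test_counts.tolist())))
```

The test set only measures how well the model recognizes that one class, which is a reason to shuffle before splitting.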

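Splitting with train_test_split()

Splitting training and test data is essential: if the model is evaluated on the same data it was trained on, the accuracy comes out misleadingly high (often 100%, as in the wine example above). scikit-learn provides this as an API: the train_test_split() function in model_selection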

from sklearn.model_selection import train_test_split

result = train_test_split(X, y, test_size=0.2, random_state=42)

Pass the feature matrix X and the target vector y, and set the test fraction with the test_size keyword argument; here it is 20%. Earlier the data was split sequentially from index 0, whereas train_test_split() shuffles the data randomly before splitting. The random_state argument takes a seed: any number works, and the same seed always produces the same split.

train_test_split() returns a list of four elements. (*Each element is an array.)

print(type(result))
print(len(result))

<class 'list'>
4

Check the shape of each

result[0].shape

(142, 13)

result[1].shape

(36, 13)

result[2].shape

(142,)

result[3].shape

(36,)

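The shapes make the order clear: starting from element 0, the feature matrix for training, the feature matrix for testing, the target vector for training, and the target vector for testing.

The usual way to call the function is to unpack the result like this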

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
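As an optional refinement that the original notes do not use, `train_test_split()` also accepts a `stratify` argument that keeps the class proportions similar in both parts. A sketch on the wine data:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# stratify=y keeps each class represented in roughly the same ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(np.unique(y_test, return_counts=True))
```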

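Exercise

Write the complete code that splits the wine classification data into training and test sets, trains a model, and predicts

# Load the dataset
# [[your code]

# Split into training and test sets
# [[your code]

# Train
# [[your code]

# Predict
# [[your code]

# Print the accuracy
# [[your code]

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the dataset
data = load_wine()
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2)
# Train
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Print the accuracy
print("accuracy =", accuracy_score(y_test, y_pred))

accuracy = 0.9166666666666666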

