DA Practice

김태준 · October 8, 2023

✅ DataFrame functions

This post goes over the basic functions for working with a DataFrame.

# Load data
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings(action='ignore')

df = pd.read_csv('~~/data/train.csv')
# Per-column info (dtype, non-null count)
df.info()
# Summary statistics for numeric columns
df.describe()
# Summary statistics for object columns
df.describe(include=['O'])
# Rows with at least one missing value
df[df.isna().any(axis=1)]
# Rows where a specific column is missing
df[df['column'].isna()]
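
To make the missing-value checks concrete, here is a minimal sketch on a small made-up DataFrame. The `toy` frame and its column names are assumptions for illustration only; they stand in for the real `train.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "city": ["Seoul", "Busan", None, "Seoul"],
    "fare": [7.25, 71.3, 8.05, 53.1],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Rows that contain at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

Counting missing values per column first is often a quick way to decide which columns are worth inspecting row by row.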

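The overall data analysis process is as follows.
1. Choose an analysis topic where improvement is possible
2. Collect data that reflects the current problem
3. Find meaningful statistics/patterns in the collected data and derive new features from them
4. For a simple model, object columns must be converted to numeric form (taking category order and the number of categories into account)
5. After the train-test split, train the model (ensemble or ML/DL)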

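🎈 Encoding methods

There are two ways to convert string columns into numeric ones.
1. Label Encoding
# Encode string columns as integers.
from sklearn import preprocessing
def encode_features(df):
    features = ['string column to encode']
    for feature in features:
        le = preprocessing.LabelEncoder()
        df[feature] = le.fit_transform(df[feature])
    return df
This method imposes an order on the categories, so relationships that did not exist among the original strings can appear.
A model may also mistake the integer codes for magnitudes or weights, which distorts the result.
-> Usually used with tree-based ML algorithms, which split on thresholds and are less sensitive to this artificial ordering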
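
The artificial ordering can be seen in a short sketch. The `sizes` list is a made-up example; `LabelEncoder` sorts the classes and numbers them in that sorted order, so the codes follow the alphabet rather than any real-world ranking.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values; the natural order would be small < medium < large
sizes = ["small", "large", "medium", "small"]

le = LabelEncoder()
codes = le.fit_transform(sizes)

# Classes are sorted alphabetically, so 'large' gets 0 and 'small' gets 2
print(list(le.classes_))  # ['large', 'medium', 'small']
print(codes.tolist())     # [2, 0, 1, 2]
```

Here `large` ends up below `medium`, which is exactly the kind of spurious relationship a linear model could pick up on.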

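2. One-Hot Encoding
Variables stored in long form (one column of category values) are spread into wide form, one column per category, which increases the number of dimensions. Each new column is binary: 1 if the row belongs to that category, 0 otherwise. Older scikit-learn versions could not one-hot encode strings directly, so the values are first converted to integers with LabelEncoder; since version 0.20, OneHotEncoder also accepts string categories as-is.
pandas can do the whole conversion in one call.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

col1 = ['melon', 'apple', 'grape', 'banana', 'strawberry']

# Convert the strings to integer labels
encoder = LabelEncoder()
encoder.fit(col1)
labels = encoder.transform(col1)

# Reshape to 2-D (a single column)
labels = labels.reshape(-1, 1)
# Apply one-hot encoding (the result is a sparse matrix)
oh_encoder = OneHotEncoder()
oh_encoder.fit(labels)
oh_labels = oh_encoder.transform(labels)

# pandas: one-hot encode every object column at once
df = pd.get_dummies(df)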
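
As a small illustration of the pandas route, the sketch below encodes a made-up `fruit` column. The data and column name are assumptions; `drop_first=True` is shown as one common option for dropping a redundant category column, not as something the original code uses.

```python
import pandas as pd

# Hypothetical long-form column of categories
fruits = pd.DataFrame({"fruit": ["melon", "apple", "grape", "apple"]})

# Wide form: one 0/1 column per category
wide = pd.get_dummies(fruits, columns=["fruit"], dtype=int)

# drop_first=True drops the first category column; the dropped category
# is still recoverable as "all remaining columns are 0"
reduced = pd.get_dummies(fruits, columns=["fruit"], drop_first=True, dtype=int)

print(wide)
print(reduced)
```

In the full encoding every row sums to 1 across the dummy columns, so any one column is determined by the others.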
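One-hot encoding can introduce multicollinearity, a strong correlation among independent variables. Typical signs are:
1. R² is high, so the regression appears to explain the data well, but the p-values of the individual independent variables are large (the variables are highly correlated with each other).
2. After checking the correlations among the independent variables (x), compute the VIF; a VIF above 10 is commonly taken to indicate multicollinearity.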
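
The VIF check can be sketched without extra dependencies by using its definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing column j on the remaining columns. The helper `vif` and the synthetic variables below are assumptions for illustration, not part of the original post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Return VIF_j = 1 / (1 - R^2_j) for each column j of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        # R^2 of predicting column j from all other columns
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)                  # unrelated to x1 and x2

vif_scores = vif(np.column_stack([x1, x2, x3]))
print(vif_scores)
```

With this setup the two nearly identical variables should land well above the threshold of 10, while the independent one stays close to 1.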

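It can be addressed in the following ways.

Remove one or some of the highly correlated variables

Create derived variables, or collect and process new observations

Examine the conditions under which the data was collected to understand why the variables are correlated

Use PCA to transform the variables into uncorrelated components (a diagonal covariance matrix), which removes the collinearity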
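
The last option can be checked numerically. The sketch below builds two strongly correlated synthetic variables (an assumption for illustration) and shows that the covariance matrix of the PCA components is diagonal, meaning the components are uncorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=300)  # strongly correlated with x1

X_demo = StandardScaler().fit_transform(np.column_stack([x1, x2]))

# Correlation between the two original (standardized) variables
corr_before = np.corrcoef(X_demo, rowvar=False)[0, 1]

# After PCA, the off-diagonal covariance is zero up to floating-point error
components = PCA(n_components=2).fit_transform(X_demo)
cov_after = np.cov(components, rowvar=False)

print(corr_before)
print(cov_after)
```

The trade-off is interpretability: each component is a mix of the original variables, so coefficients no longer map to a single input.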

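🎈 PCA

Data scaling must be done before running PCA.
The amount of variance a component can explain depends on the scale of each variable, so without scaling the result is distorted toward the large-scale variables. Standardizing each variable is therefore required.
from sklearn.preprocessing import StandardScaler
feature = ['independent variable columns']
X = df[feature]  # assumes the independent variables are columns of df
# Standardize each column to mean 0 and variance 1
X = StandardScaler().fit_transform(X)
pd.DataFrame(X, columns=feature)

from sklearn.decomposition import PCA
# Set how many principal components to keep
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X)
principaldf = pd.DataFrame(data=principalComponents, columns=['princol1', 'princol2'])
# Check the cumulative explained variance ratio
sum(pca.explained_variance_ratio_)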
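
Why scaling matters can be seen by running PCA on the same data with and without standardization. The two variables below (`height_m`, `income`) are synthetic and chosen only because their scales differ by several orders of magnitude.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
height_m = rng.normal(1.7, 0.1, size=200)      # small-scale variable
income = rng.normal(50_000, 10_000, size=200)  # large-scale variable, independent of height
raw = np.column_stack([height_m, income])

# Without scaling, the first component is almost entirely the income axis
ratio_raw = PCA(n_components=2).fit(raw).explained_variance_ratio_

# With scaling, both variables contribute on equal footing
scaled = StandardScaler().fit_transform(raw)
ratio_scaled = PCA(n_components=2).fit(scaled).explained_variance_ratio_

print(ratio_raw)
print(ratio_scaled)
```

On the raw data the first component explains nearly all the variance simply because income has the larger numbers, not because it carries more information.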

# Fill missing values (median); assign the result back instead of using inplace on a column
df['column'] = df['column'].fillna(df['column'].median())

# train - test split
from sklearn.model_selection import train_test_split
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 42)

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.

🎈 scikit-learn library examples

# Scaling, encoding
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler
# The code that follows takes this general form
# -> coder = EncoderOrScalerName()
# -> column = coder.fit_transform(column)
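
One detail the pattern above leaves open is what to do with held-out data. A common convention, sketched here on synthetic arrays (all names are assumptions), is to call `fit_transform` on the training split only and then reuse the fitted object with `transform` on the test split, so no test statistics leak into preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(100, 2))  # synthetic features
y_demo = rng.integers(0, 2, size=100)                    # synthetic binary target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from the training data only
X_te_scaled = scaler.transform(X_te)      # apply the same mean/std to the test data

print(X_tr_scaled.mean(axis=0), X_te_scaled.mean(axis=0))
```

The training columns come out with mean 0 exactly, while the test columns are only approximately centered, which is expected.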

# Split into train/test data
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state = 42)
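
`KFold`, `cross_val_score`, and `GridSearchCV` are imported above but not used. A minimal sketch of how they usually fit together follows; it uses scikit-learn's bundled iris dataset and a small `max_depth` grid, both of which are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# 5-fold cross-validation with shuffling for a single model
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X_iris, y_iris, cv=cv
)

# Grid search: cross-validate every candidate and keep the best one
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 3, 4]},
    cv=cv,
)
grid.fit(X_iris, y_iris)

print(scores.mean())
print(grid.best_params_)
```

Cross-validation gives a more stable estimate than a single train/test split, at the cost of fitting the model once per fold.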

# (Classification) logistic regression, decision tree, random forest
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
from sklearn.metrics import accuracy_score
print("Accuracy: {0:.4f}".format(accuracy_score(y_test, y_pred)))

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

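Regression metrics such as MAE, RMSE, and MAPE all measure error, so smaller values are better.
sklearn.metrics provides MAE and MSE as mean_absolute_error and mean_squared_error; RMSE can be obtained by taking the square root of MSE.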
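
A worked example with three made-up predictions shows how the regression metrics relate. MAPE is computed by hand here from its definition so the arithmetic is visible; the arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100.0, 200.0, 400.0])
y_hat = np.array([110.0, 180.0, 400.0])  # errors: +10, -20, 0

mae = mean_absolute_error(y_true, y_hat)   # (10 + 20 + 0) / 3 = 10
mse = mean_squared_error(y_true, y_hat)    # (100 + 400 + 0) / 3
rmse = np.sqrt(mse)                        # square root of MSE
mape = np.mean(np.abs((y_true - y_hat) / y_true)) * 100  # (10% + 10% + 0%) / 3

print(mae, mse, rmse, mape)
```

RMSE penalizes the larger error more heavily than MAE does, which is why it comes out above 10 here.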

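For classification models there are accuracy and the confusion-matrix-based metrics: precision, recall, F1 score, and ROC AUC.

Accuracy is the share of all predictions that are correct: (TP + TN) / (TP + TN + FP + FN).
Precision is the share of positive predictions that are actually positive: TP / (TP + FP).
Recall (= sensitivity) is the share of actual positives that are predicted correctly: TP / (TP + FN).
F1 score combines precision and recall as their harmonic mean: 2 · precision · recall / (precision + recall).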
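
These definitions can be verified on a tiny made-up example with 3 true positives, 1 false negative, 1 false positive, and 5 true negatives. The label lists are assumptions for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels [0, 1], ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (3 + 5) / 10 = 0.8
precision = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75
f1 = f1_score(y_true, y_pred)                # harmonic mean of 0.75 and 0.75 = 0.75

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1)
```

Accuracy is higher than precision and recall here because the five true negatives count toward it but not toward the other two.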
