Python AI Fundamentals

hello brown · April 6, 2023

DROP OneHotEncoding dummy_fields folium interpolate()sklearn scaling testaice

AI Python

Much of this was typed by hand, so it may contain typos.

Loading and Analyzing Data

Loading Data

import

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Loading

csv : pd.read_csv("filename.csv")
txt : pd.read_csv("filename.txt", sep="delimiter")
xlsx : pd.read_excel('filename.xlsx')
	  # ex) pd.read_excel(filename, engine='openpyxl') # pip install openpyxl
pickle : pd.read_pickle("filename.pkl")
# Saving
csv : df.to_csv('filename.csv', index=False)

Combining Data

Appending DataFrames that share the same columns (row-wise)

pd.concat()

df = pd.read_csv("onenavi_train.csv",sep="|")
df_eval = pd.read_csv("onenavi_evaluation.csv",sep="|")
df_total=pd.concat([df,df_eval],ignore_index=True)

Combining Data

pd.merge()

df_total = pd.merge(df_total, df_signal, on="RID")
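To make the difference between the two calls concrete, here is a minimal sketch on made-up frames. The RID values and the columns ET and signal_count are assumptions for illustration, not the real onenavi data.

```python
import pandas as pd

# Toy stand-ins for the train / evaluation files (values are made up)
df = pd.DataFrame({"RID": [1, 2], "ET": [300, 420]})
df_eval = pd.DataFrame({"RID": [3], "ET": [510]})

# concat: stack the rows of frames that share the same columns
df_total = pd.concat([df, df_eval], ignore_index=True)

# Toy lookup table keyed by RID (hypothetical column name)
df_signal = pd.DataFrame({"RID": [1, 2, 3], "signal_count": [4, 7, 2]})

# merge: add columns by matching rows on the RID key (inner join by default)
df_total = pd.merge(df_total, df_signal, on="RID")
print(df_total)
```

concat grows the row count, while merge keeps the rows and widens the frame with the columns of the second table.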

Analysis

Visualization Basics

Basic plotting sequence
1. Set the figure size
  
  plt.figure()
2. Create the plot
  
  plt.plot()
3. Display it
  
  plt.show()

Using a Korean font

import matplotlib.font_manager as fm
fm.findSystemFonts(fontpaths=None, fontext='ttf')
# Set one of the fonts found above as the default font. Here, NanumGothicCoding
plt.rc('font', family='NanumGothicCoding')
plt.rc('axes', unicode_minus=False) # keeps the minus sign from rendering as a broken glyph

Seaborn

import

!pip install seaborn

import seaborn as sns
import matplotlib.pyplot as plt

Reference links
- Seaborn(https://seaborn.pydata.org/api.html)
- Seaborn.Countplot(https://seaborn.pydata.org/generated/seaborn.countplot.html)
- Seaborn.Distplot(https://seaborn.pydata.org/generated/seaborn.distplot.html?highlight=distplot#seaborn.distplot) : histogram + kernel density (deprecated in newer seaborn in favor of histplot/displot)
- Seaborn.Boxplot(https://seaborn.pydata.org/generated/seaborn.boxplot.html#seaborn.boxplot)
- Seaborn.Heatmap(https://seaborn.pydata.org/generated/seaborn.heatmap.html?highlight=heatmap#seaborn.heatmap)
- Seaborn.Pairplot(https://seaborn.pydata.org/generated/seaborn.pairplot.html?highlight=pairplot#seaborn.pairplot) : histograms + scatter plots for each pair of columns

Setup before using sns

# Build the list of installed fonts
import matplotlib.font_manager as fm

# fm.get_fontconfig_fonts()  # deprecated in newer matplotlib; not needed for the list below

font_list = [font.name for font in fm.fontManager.ttflist]
# font_list
sns.set(font="NanumGothicCoding", 
        rc={"axes.unicode_minus":False}, # fixes the broken minus sign
        style='darkgrid')

Drawing charts
1. Scatter plot : sns.scatterplot()
2. Distribution of values by category : sns.catplot()
3. Scatter plot with a regression line : sns.lmplot()
4. Count per category : sns.countplot()
  1. Colors can be changed with palette='spring'
  2. Example : sns.countplot(data=df, x='MultipleLines', hue='Churn')
5. Scatter plot and per-variable distributions in one figure, to check the relationship : sns.jointplot()
6. Correlation check; the data must be continuous -> sns.heatmap()
  1. plt.rc('axes', unicode_minus=False)
7. Plot of numeric data showing summary statistics (five-number summary) : sns.boxplot()
8. Shows where values are concentrated (density) : sns.violinplot()
9. Histogram : sns.histplot()
  1. Example : sns.histplot(data=df, x='tenure', hue='Churn')
  2. Alternative : sns.kdeplot(data=df, x='tenure', hue='Churn')
    1. In histplot, bars in front can hide the bars behind them.
10. Overall distributions : sns.pairplot(df_total)

matplotlib

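Theory

plt.plot(data)
	- Line plot (trend)
	- data has x-axis and y-axis meaning (a single sequence is used as y, with its index as x)
plt.scatter(x,y)
	- Scatter plot
plt.hist(x)
	- Good for plotting frequency, frequency density, and probability distributions
plt.boxplot(x)
	- Represents numeric data
	- Summarizes five values: minimum, first quartile, second quartile (median), third quartile, maximum
		- Reminiscent of a stock (candlestick) chart
plt.bar(x,height)
	- Summarizes the numeric values of categorical data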

Examples

plt.scatter(y=df["avg_bill"], x=df["age"])

plt.hist(df["A_bill"], bins=20)

x = [5,3,7,10,9,5,3.5,8,6]
plt.boxplot(x=x)

df.boxplot(by="by_age", column="avg_bill", figsize=(16,7))

y=[5, 3, 7, 10, 9, 5, 3.5, 8]
x=list(range(len(y)))
plt.bar(x, y)

df2[['A_bill', 'B_bill']].plot(kind='bar', stacked=True)
                 
df['Churn'].value_counts().plot(kind='bar')

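Correlation (pandas, seaborn)

Correlation coefficients with pandas

df_total.corr()  # pandas 2.0+: use df_total.corr(numeric_only=True) if there are non-numeric columns

Heatmap visualization with seaborn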

sns.heatmap(df_total.corr(), annot=True, cmap="RdBu")
plt.show()

Additional Visualization Libraries

Folium

import
- import folium
Drawing a basic map
- map = folium.Map(location=[f_lat,f_lon], zoom_start=14)
Drawing a map with a different tile style
- map_ST = folium.Map(location=[f_lat,f_lon], zoom_start=14, tiles='Stamen Terrain')

Heatmap on a map

from folium.plugins import HeatMap
heat_data=np.array([ansan_map_1['lat'],ansan_map_1['lon']])
heat_data=heat_data.transpose()

map = folium.Map(location=[f_lat,f_lon], zoom_start=13)
HeatMap(heat_data,min_opacity=0.2,max_val=1,max_zoom=25,radius=25).add_to(map)

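Preprocessing

Exploration

df.info()
1. Shows data types, non-null counts, and so on
df.describe()
1. Shows summary statistics
df = df.rename(columns={'old_name': 'new_name'})
1. Tidies up column names (pass a dictionary)
df = df.astype({'column_name':float})
1. Changes the type
2. A column containing NaN cannot use the plain int type; pandas handles it as float
df.select_dtypes('O')
1. Extracts the items whose dtype is object (filtered by column)
2. list(df_total.select_dtypes('O').columns)
df['TotalCharges'].str.contains("[^0-9.]")
1. A way to check whether a string holds only a number: True means it contains a character other than digits and '.'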
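As a small illustration of the last check, the sketch below runs it on a made-up TotalCharges column. The follow-up with pd.to_numeric is one possible way to convert the column afterwards, an assumption rather than a step from these notes.

```python
import pandas as pd

# Toy column: one blank string and one value with a thousands separator
df = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", "1,889.5"]})

# True where the string contains anything other than digits or '.'
has_other_chars = df["TotalCharges"].str.contains("[^0-9.]")
print(df[has_other_chars])

# Possible follow-up: strings that cannot be parsed become NaN, and the column becomes float
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
print(df.dtypes)
```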

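Outliers / Missing Values

Missing values

Finding missing values
1. df_total.isnull().sum()
Dropping missing values
1. df.dropna()
2. df = df.dropna(axis=0)
Filling missing values
1. df.fillna(0)
2. df.fillna(method='ffill')
  1. backfill/bfill: fill with the value right after
  2. pad/ffill: fill with the value right before
  3. Newer pandas deprecates the method argument; df.ffill() and df.bfill() do the same
3. df['age'] = df['age'].replace(np.nan, df['age'].median())
  1. Uses replace
    1. The example above fills with the median
4. df.interpolate()
  1. Interpolates missing values, filling them in like a gradient
  2. For 1-NaN-NaN-7, interpolation gives 1-3-5-7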
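Outliers

Before handling outliers, check how the values relate to one another.
Checking the relationships
- There is no fixed method. The following is recommended as a baseline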
- ```
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df_total)
plt.show()
```
  - After running this code, delete the abnormal values
- ```
# Check the data distribution
df_total.describe()
```
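  - Review min/max to spot abnormal values

Delete the rows you want to remove once the meaning of the values is taken into account

Example

# Create an average-speed variable: speed is distance divided by time
df_total['PerHour']=(df_total['A_DISTANCE']/1000)/(df_total['ET']/3600)

After computing the speed, remove the rows whose speed is excessively high

# Keep only the data left after removing outliers
df_total=df_total[df_total['PerHour']<=130]
df_total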
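The same filter can be tried on a few made-up trips. The sketch assumes A_DISTANCE is in metres and ET in seconds, which is what the /1000 and /3600 conversions suggest; the values themselves are invented.

```python
import pandas as pd

# Toy trips: distance in metres, elapsed time in seconds (made-up values)
df_total = pd.DataFrame({
    "A_DISTANCE": [10000, 5000, 50000],
    "ET": [600, 900, 300],
})

# Kilometres divided by hours gives km/h
df_total["PerHour"] = (df_total["A_DISTANCE"] / 1000) / (df_total["ET"] / 3600)
print(df_total["PerHour"].tolist())

# Keep only the rows at or below the 130 km/h threshold used above
df_total = df_total[df_total["PerHour"] <= 130]
print(len(df_total))
```

The third trip works out to roughly 600 km/h, so it is the only row dropped.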

Removing Duplicates

Remove duplicates
- df.drop_duplicates()

Feature Engineering

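The process of shaping features from the raw data to produce the model's input data

Binning

cut
1. Categorize by value range
2. pd.cut()
qcut
1. Categorize by count (quantiles, so each bin holds about the same number of rows)
2. pd.qcut()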
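A minimal sketch of the difference, using made-up ages. The bin edges and labels are arbitrary choices for illustration.

```python
import pandas as pd

ages = pd.Series([3, 5, 8, 17, 25, 31, 46, 90])

# cut: bins defined by value ranges, so the bin sizes follow the data
by_range = pd.cut(ages, bins=[0, 19, 39, 59, 100],
                  labels=["0-19", "20-39", "40-59", "60+"])
print(by_range.value_counts().sort_index())

# qcut: bin edges chosen from quantiles, so each bin holds about the same number of rows
by_count = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])
print(by_count.value_counts().sort_index())
```

With this data, cut puts four of the eight ages in the first bin, while qcut places two in each bin.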

Scaling

Standardization

Use this when you want to transform normally distributed data to the standard normal distribution with mean 0 and variance 1
You can check the result with describe()
Standardization_df = (cust_data_num - cust_data_num.mean())/cust_data_num.std()

Normalization

Uses MinMaxScaler

# Scaling
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
feature = pd.DataFrame(scaler.fit_transform(train_data))
feature.columns=columnNames
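The two scalings can be checked on a small made-up frame. This is a sketch; the column names age and bill are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy numeric data (made-up values)
train_data = pd.DataFrame({"age": [20, 30, 40, 50],
                           "bill": [100.0, 300.0, 200.0, 500.0]})

# Standardization: subtract the mean, divide by the (sample) standard deviation
standardized = (train_data - train_data.mean()) / train_data.std()
print(standardized.describe().loc[["mean", "std"]])

# Normalization: rescale each column to the [0, 1] range
scaler = MinMaxScaler(feature_range=(0, 1))
feature = pd.DataFrame(scaler.fit_transform(train_data),
                       columns=train_data.columns)
print(feature.describe().loc[["min", "max"]])
```

After standardization every column has mean 0 and standard deviation 1; after min-max scaling every column runs from 0 to 1.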

Encoding

One-Hot Encoding(dummies)

A technique that turns each category into its own True/False column

dummy_fields = ['WEEKDAY','HOUR','level1_pnu','level2_pnu']

for dummy in dummy_fields:
    dummies = pd.get_dummies(df_total[dummy], prefix=dummy, drop_first=False)
    df_total = pd.concat([df_total, dummies], axis=1)
    
df_total = df_total.drop(dummy_fields,axis=1)

Extract the values with pd.get_dummies(), attach them with pd.concat(), then delete the original columns with drop

from sklearn.preprocessing import OneHotEncoder

Label Encoding

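Difference
- Label encoding assigns 0, 1, 2, 3, 4 to the values of a subject column: Korean, English, math, science, social studies
- The result of label encoding is numeric, but it must not be used to compute a mean or median, so it is converted with one-hot encoding
- into a column for subject 0, a column for subject 1, a column for subject 2, and so on
How to use
- Label encoding alone can cause the misreading described above
- Linear algorithms require one-hot encoding; tree-based algorithms can work with label encoding alone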

from sklearn.preprocessing import LabelEncoder
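The two encodings side by side, as a minimal sketch on a made-up subject column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"subject": ["math", "english", "math", "science"]})

# Label encoding: one integer per category (classes are sorted alphabetically)
encoder = LabelEncoder()
df["subject_label"] = encoder.fit_transform(df["subject"])
print(list(encoder.classes_), df["subject_label"].tolist())

# One-hot encoding: one indicator column per category, with no implied order
dummies = pd.get_dummies(df["subject"], prefix="subject")
print(dummies.columns.tolist())
```

The integers 0, 1, 2 only identify the categories; the one-hot columns remove any suggestion that science is "greater than" english.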

Machine Learning

ML Examples by Type - Regression

Linear Regression

from sklearn.linear_model import LinearRegression
model = LinearRegression()
# The lines below are reused for every ML model
model.fit(X_train, y_train)
pred = model.predict(X_test)
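For a version that runs end to end, here is a sketch on synthetic data. The relationship y = 3x + 2 and the noise level are assumptions made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data following y = 3x + 2 plus a little noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)

print(model.coef_, model.intercept_)  # should land close to 3 and 2
print(model.score(X_test, y_test))    # R^2 on the held-out data
```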

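ML Examples by Type - Classification

Logistic Regression

Binary classification has only two classes, 0 and 1, so a linear regression model is hard to use for it directly
Applying the logistic function turns the line into a logistic regression curve, which makes binary classification possible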

from sklearn.linear_model import LogisticRegression 
model = LogisticRegression()
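The role of the logistic function can be seen by applying it by hand to the model's linear output. This is a sketch on made-up one-dimensional data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D data: small x belongs to class 0, large x to class 1 (made up)
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# Linear part of the model: w * x + b
z = model.decision_function(X)
# The logistic (sigmoid) function squashes z into a probability between 0 and 1
sigmoid = 1 / (1 + np.exp(-z))

# The hand-computed probabilities match predict_proba for class 1
print(np.allclose(sigmoid, model.predict_proba(X)[:, 1]))
print(model.predict(X))
```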

ML Examples by Type - Regression/Classification

Decision Tree

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=2)

from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()

K-Nearest Neighbor

from sklearn.neighbors import KNeighborsClassifier 
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
pred=knn.predict(X_test)

from sklearn.neighbors import KNeighborsRegressor
regressor = KNeighborsRegressor(n_neighbors = 3, weights = "distance")

ML Examples by Type - Ensemble

Random Forest

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=50)

XGboost

Installation
- !pip install xgboost

from xgboost import XGBClassifier
xgb = XGBClassifier(n_estimators=3, random_state=42)  # takes about 10 seconds
xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)

Stacking

A final estimator is trained on the predictions of the individual models to compute the output

from sklearn.ensemble import StackingRegressor, StackingClassifier
stack_models = [
    ('LogisticRegression', lg), 
    ('KNN', knn), 
    ('DecisionTree', dt),
]
stacking = StackingClassifier(stack_models, final_estimator=rfc, n_jobs=-1)
stacking.fit(X_train, y_train)   # takes about 1 minute 20 seconds
stacking_pred = stacking.predict(X_test)
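The snippet above relies on lg, knn, dt, and rfc being defined beforehand. A self-contained sketch might look like the following; the synthetic dataset and every hyperparameter are assumptions chosen only so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the real dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Base models; the hyperparameters are arbitrary
lg = LogisticRegression(max_iter=1000)
knn = KNeighborsClassifier(n_neighbors=3)
dt = DecisionTreeClassifier(max_depth=2, random_state=42)
# Final estimator that learns from the base models' predictions
rfc = RandomForestClassifier(n_estimators=50, random_state=42)

stack_models = [("LogisticRegression", lg), ("KNN", knn), ("DecisionTree", dt)]
stacking = StackingClassifier(stack_models, final_estimator=rfc)
stacking.fit(X_train, y_train)
stacking_pred = stacking.predict(X_test)
print(stacking.score(X_test, y_test))
```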

Weighted Blending

Multiply each model's predictions by a weight to compute the final output

# There is no dedicated function; collect the predict results as below and compute the output from them

final_outputs = {
    'DecisionTree': dt_pred, 
    'randomforest': rfc_pred, 
    'xgb': xgb_pred, 
    'lgbm': lgbm_pred,
    'stacking': stacking_pred,
}

final_prediction = (
    final_outputs['DecisionTree'] * 0.1
    + final_outputs['randomforest'] * 0.2
    + final_outputs['xgb'] * 0.25
    + final_outputs['lgbm'] * 0.15
    + final_outputs['stacking'] * 0.3
)

final_prediction = np.where(final_prediction > 0.5, 1, 0)
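The arithmetic is easy to follow on a tiny made-up case. The three model names, their predictions, and the weights below are all hypothetical.

```python
import numpy as np

# Made-up 0/1 predictions from three hypothetical models for four samples
preds = {
    "model_a": np.array([1, 0, 1, 0]),
    "model_b": np.array([1, 0, 0, 0]),
    "model_c": np.array([0, 1, 1, 1]),
}
# Arbitrary weights; they sum to 1 so the blended value stays within [0, 1]
weights = {"model_a": 0.5, "model_b": 0.3, "model_c": 0.2}

# Weighted sum per sample, then a 0.5 threshold to get the final class
blended = sum(preds[name] * weights[name] for name in preds)
final_prediction = np.where(blended > 0.5, 1, 0)
print(blended, final_prediction)
```

For the second and fourth samples only model_c votes 1, and its weight of 0.2 is not enough to cross the threshold. The same blending can also be applied to predicted probabilities instead of hard 0/1 labels.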

Additional Work Before/After ML

Splitting Training/Test Data

from sklearn.model_selection import train_test_split
X = df1.drop('termination_Y', axis=1).values
y = df1['termination_Y'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3, 
                                                    stratify=y,
                                                    random_state=42)
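What stratify=y does is easiest to see on imbalanced labels. A minimal sketch with made-up data (80 zeros, 20 ones):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80 zeros and 20 ones
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# stratify=y keeps the share of class 1 at 20% in both parts
print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```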

Evaluating Classifier Performance (score)

score
- model.score(X_test, y_test)

Confusion matrix

pred = model.predict(X_test)
from sklearn.metrics import confusion_matrix 
confusion_matrix(y_test, pred)

Other metrics

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy_score(y_test, pred)  
precision_score(y_test, pred) 
recall_score(y_test, pred)  
f1_score(y_test, pred)

from sklearn.metrics import classification_report
print(classification_report(y_test, pred))
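To see how these numbers relate, the sketch below evaluates eight made-up labels by hand. With 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, every metric comes out to 0.75.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up true labels and predictions
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_test, pred))
print(accuracy_score(y_test, pred))   # (TP + TN) / all samples
print(precision_score(y_test, pred))  # TP / (TP + FP)
print(recall_score(y_test, pred))     # TP / (TP + FN)
print(f1_score(y_test, pred))         # harmonic mean of precision and recall
```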

Saving the Model

import joblib
model.fit(train_x, train_y)
# i is an identifier defined elsewhere (for example a loop index), used in the file name
joblib.dump(model, '{}_model.pkl'.format(i))

Deep Learning

Basic Guide

Tips

Choosing the activation
- If the label for the final output layer is a single column with two values (binary classification), use sigmoid
- If the label has two or more columns (multiclass classification), use softmax
Choosing the loss
- Output layer activation is sigmoid: binary_crossentropy
- Output layer activation is softmax:
  - One-hot encoded (O): categorical_crossentropy
  - Not one-hot encoded (X): sparse_categorical_crossentropy
Setting metrics to acc or accuracy lets you monitor accuracy during training.

Examples

Model compile: multiclass classification model (Y is one-hot encoded)

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
Model compile: multiclass classification model (Y is not one-hot encoded)

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
Model compile: regression (prediction) model

model.compile(optimizer='adam', loss='mse')
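The two multiclass losses measure the same quantity and differ only in the label format they expect. The NumPy sketch below computes both by hand on made-up softmax outputs; it illustrates the formulas and is not the Keras implementation.

```python
import numpy as np

# Softmax outputs for 3 samples over 4 classes (made-up probabilities)
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
])

y_sparse = np.array([0, 1, 3])  # integer class indices
y_onehot = np.eye(4)[y_sparse]  # the same labels, one-hot encoded

# categorical cross-entropy: -sum(one_hot * log(p)) per sample, then the mean
cce = -np.sum(y_onehot * np.log(probs), axis=1).mean()
# sparse categorical cross-entropy: -log(p[true class]) per sample, then the mean
scce = -np.log(probs[np.arange(len(y_sparse)), y_sparse]).mean()

print(cce, scce)
```

Both values agree, so the choice between the two losses depends only on whether Y was one-hot encoded.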

Source by Deep Learning Type

DNN

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(4, input_shape=(3,), activation='relu'))
# hidden layer
model.add(Dropout(0.2))
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# with a single output value, sigmoid is the usual choice
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
# .... stands for the remaining fit arguments (epochs, batch_size, and so on)
history = model.fit(X_train, y_train, validation_data=(X_test, y_test),....)

CNN

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# input: 120x60 single-channel images; output: 4 classes
model = Sequential()
model.add(Conv2D(12, kernel_size=(5,5), activation='relu', input_shape=(120, 60,1)))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Conv2D(12, kernel_size=(5,5), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Conv2D(12, kernel_size=(4,4), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(4, activation='softmax'))

LSTM

tf.keras.layers.LSTM(64)
tf.keras.layers.LSTM(64, return_sequences=True)

Usage Examples

callback

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
# Monitor val_loss and stop early if it has not improved for 5 epochs
early_stop = EarlyStopping(monitor='val_loss', mode='min', 
                           verbose=1, patience=5)
# Save the model each time val_loss reaches a new minimum
check_point = ModelCheckpoint('best_model.h5', verbose=1, 
                              monitor='val_loss', mode='min', save_best_only=True)       
history = model.fit(x=X_train_over, y=y_train_over, 
          epochs=50 , batch_size=32,
          validation_data=(X_test, y_test), verbose=1,
          callbacks=[early_stop, check_point])

val_loss should keep dropping; if it has not dropped further for 5 epochs, training stops early.
Example 1
- Epoch 00002: val_loss improved from 0.43748 to 0.43266, saving model to best_model.h5
  - val_loss dropped at epoch 2, so the model was saved
Example 2
- ```
Epoch 00037: val_loss did not improve from 0.42811
Epoch 00037: early stopping
```
  - Training stopped at epoch 37 because val_loss was no longer improving
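The patience rule can be sketched in plain Python. This is a simplified illustration of the idea, not the Keras implementation; the function name early_stopping_epoch and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (epoch where training stops, best val_loss seen so far)."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New minimum: this is also where a checkpoint would be saved
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best
    return len(val_losses), best


# Made-up validation losses: the best value (0.43) is reached at epoch 3
val_losses = [0.50, 0.45, 0.43, 0.44, 0.45, 0.43, 0.46, 0.44, 0.47, 0.48]
print(early_stopping_epoch(val_losses, patience=5))
```

Here the loss never drops below 0.43 after epoch 3, so after five more epochs without improvement the loop stops at epoch 8.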

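Visualizing performance

# If the model was compiled with metrics=['acc'], the keys are 'acc' and 'val_acc' instead
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])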
plt.title('Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(['acc', 'val_acc'])
plt.show()

Feature Engineering

Improving performance
Balancing the imbalanced Churn data
- OverSampling
- UnderSampling
  
  Train after increasing/decreasing the number of samples
  
  Forcibly increasing the data can lead to overfitting
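The idea of adding or removing samples can be sketched with plain NumPy. Note that this is simple random over/undersampling on made-up data, not SMOTE, which instead synthesizes new minority points by interpolating between neighbors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy data: 8 samples of class 0, 2 samples of class 1
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

# Oversampling: draw minority rows with replacement until the classes match
extra = rng.choice(minority_idx,
                   size=len(majority_idx) - len(minority_idx), replace=True)
X_over = np.vstack([X, X[extra]])
y_over = np.concatenate([y, y[extra]])

# Undersampling: keep only as many majority rows as there are minority rows
keep = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([keep, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

print(np.bincount(y_over), np.bincount(y_under))
```

Random oversampling repeats the same minority rows, which is one reason forcibly enlarged data can overfit.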

Example

!pip install -U imbalanced-learn

from imblearn.over_sampling import SMOTE
# Create the SMOTE object and run oversampling

smote = SMOTE(random_state=0)
X_train_over, y_train_over = smote.fit_resample(X_train, y_train)
print('Training feature/label shapes before SMOTE: ', X_train.shape, y_train.shape)
print('Training feature/label shapes after SMOTE: ', X_train_over.shape, y_train_over.shape)
# > After running, the dataset grew to almost 1.5 times its size
# Label distribution after SMOTE: equal numbers of 0 and 1
pd.Series(y_train_over).value_counts()

# MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_over = scaler.transform(X_train_over)
X_test = scaler.transform(X_test)

Rescaling

After re-running the DNN, accuracy improved to over 79%

Other Tips

object.drop
- data.drop(columns=['dev'])
- data.drop('dev', axis='columns')
  - Both lines do the same thing

dataframe.loc

import datetime

first_data_dropped =  data_dropped.loc[data_dropped.index <= datetime.datetime(2019, 10,31)]
last_data_dropped = data_dropped.loc[data_dropped.index > datetime.datetime(2019, 10,31)]

train = data_dropped.loc[:datetime.datetime(2019,10,31),:]
test = data_dropped.loc[datetime.datetime(2019,11,1):,:]

Both approaches give the same result (for a sorted DatetimeIndex with date-only timestamps)
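The equivalence can be checked on a small frame with a daily DatetimeIndex. This is a sketch; the dates and the value column are made up.

```python
import datetime
import pandas as pd

# Toy daily data around the split date
index = pd.date_range("2019-10-29", "2019-11-03", freq="D")
data_dropped = pd.DataFrame({"value": range(len(index))}, index=index)

split = datetime.datetime(2019, 10, 31)

# Boolean mask on the index
first_data_dropped = data_dropped.loc[data_dropped.index <= split]
last_data_dropped = data_dropped.loc[data_dropped.index > split]

# Label slicing; both ends of a label slice are inclusive
train = data_dropped.loc[:split, :]
test = data_dropped.loc[datetime.datetime(2019, 11, 1):, :]

print(first_data_dropped.equals(train), last_data_dropped.equals(test))
```

If the index held times within a day (for example 2019-10-31 09:00), those rows would fall into last_data_dropped but into neither train nor test, so the two approaches would no longer match.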
