Model Selection

seongyong · April 15, 2021

grid search model selection randomized search target encoder

What I learned

RandomizedSearchCV

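You specify a range (or distribution) for each parameter, random combinations are drawn from those ranges, and the best-scoring combination among the ones tried is selected.
Use it when you do not have a good idea of where the optimal parameters lie. In practice it is the more commonly used of the two search methods.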

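# Find the best parameters

import pandas as pd
from category_encoders import TargetEncoder
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

pipe_temp = make_pipeline(
    TargetEncoder(smoothing = 1000), # does not handle missing values (NaN)
    SimpleImputer(strategy = 'mean'), # fill missing values with the column mean
    PolynomialFeatures(2),
    SelectKBest(),
    RandomForestClassifier()
)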

param_grid = {
    'selectkbest__k' : range(50, 101),
    'randomforestclassifier__n_estimators' : randint(50,500),
    'randomforestclassifier__max_features' : uniform(0,1),
    'randomforestclassifier__max_depth' : [5, 10, 20, 30, None]
}

clf = RandomizedSearchCV(
    pipe_temp,
    param_distributions = param_grid,
    n_iter=30,
    cv=3,
    n_jobs = -1,
    verbose = 1
)

clf.fit(X_train_val, y_train_val)

pipe = clf.best_estimator_

print('Best parameters : \n', clf.best_params_)
print('\nBest score : ', clf.best_score_)
pd.DataFrame(clf.cv_results_)
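
To make "random combinations" a bit more concrete, here is a minimal sketch that only draws candidates from the same search space, without fitting anything. It uses scikit-learn's ParameterSampler; the n_iter=5 and random_state=0 values are assumptions picked for the illustration and are not part of the search above.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# Same search space as above: plain sequences are sampled uniformly,
# scipy distributions are sampled through their rvs() method.
param_distributions = {
    'selectkbest__k': range(50, 101),
    'randomforestclassifier__n_estimators': randint(50, 500),  # integers 50..499 (upper bound excluded)
    'randomforestclassifier__max_features': uniform(0, 1),     # floats between 0 and 1
    'randomforestclassifier__max_depth': [5, 10, 20, 30, None],
}

# random_state is an assumption added here so the draws are repeatable.
candidates = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))

for params in candidates:
    print(params)
```

Each printed dict is one candidate that the search would fit and score. With n_iter=30 and cv=3 as in the search above, that comes to 90 fits, plus one final refit of the best candidate on all of the data.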

GridSearchCV

Tries every possible combination of the parameter values you specify.
A good fit when you have enough domain knowledge to know roughly where the optimal parameters lie.

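# Find the best parameters

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

pipe_temp = make_pipeline(
    # Encoding and imputation are disabled here; this assumes X_train_val is already numeric and free of NaN
    # TargetEncoder(smoothing = 1000), # does not handle missing values (NaN)
    # SimpleImputer(strategy = 'mean'), # fill missing values with the column mean
    PolynomialFeatures(2),
    SelectKBest(),
    RandomForestClassifier(n_estimators = 200, n_jobs=-1, max_features = 0.25, max_depth = 10, random_state = 10)
)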

param_grid = {
    # 'randomforestclassifier__max_features' : ['sqrt', 0.25, 0.5],
    # 'randomforestclassifier__max_depth' : [10, 20, 30, None]
    'selectkbest__k' : [50,60,70,80,90]
}

clf = GridSearchCV(
    pipe_temp,
    param_grid = param_grid,
    cv=5,
    n_jobs = -1,
    verbose = 1
)

clf.fit(X_train_val, y_train_val)
pipe = clf.best_estimator_

print('Best parameters : \n', clf.best_params_)
print('\nBest score : ', clf.best_score_)
pd.DataFrame(clf.cv_results_)
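
Because a grid search is exhaustive, its cost grows multiplicatively with every parameter you add. The sketch below counts the candidates for a hypothetical grid that combines the selectkbest__k values with the two options commented out in the param_grid above. It uses itertools only to keep the example dependency-free; scikit-learn's ParameterGrid enumerates the same combinations.

```python
from itertools import product

# Hypothetical grid: the selectkbest__k values used above plus the two
# options that are commented out in the original param_grid.
param_grid = {
    'selectkbest__k': [50, 60, 70, 80, 90],
    'randomforestclassifier__max_features': ['sqrt', 0.25, 0.5],
    'randomforestclassifier__max_depth': [10, 20, 30, None],
}
cv = 5

# Cartesian product of all value lists = every candidate a grid search would try.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)  # 5 * 3 * 4
n_fits = n_candidates * cv      # one fit per candidate per fold (final refit not counted)

print(n_candidates, n_fits)
```

Searching only selectkbest__k, as in the code above, means 5 candidates and 25 fits. Enabling the other two parameters would raise that to 60 candidates and 300 fits, which is why a grid search works best once the ranges have already been narrowed down.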

Target encoder

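Encodes each category of a feature as the mean of the target values within that category.
smoothing controls how strongly each category's mean is pulled toward the overall target mean: the larger the value, the less a category's encoding deviates from the overall mean, which matters most for categories with few samples.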

from category_encoders import TargetEncoder

encoder = TargetEncoder(smoothing = 1000)
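
To see the effect of smoothing on actual numbers, here is a small pure-Python sketch. It assumes a simple additive formula, (category sum + m * global mean) / (category count + m), which is not the exact formula category_encoders uses internally; the function name, the m parameter, and the toy data are all made up for illustration.

```python
from collections import defaultdict

def smoothed_target_means(categories, targets, m):
    """Return {category: target mean shrunk toward the global mean with weight m}."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    # m acts like m extra "pseudo-samples" that all sit at the global mean.
    return {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}

# Toy data: category A has 4 rows, category B has only 1 row.
categories = ["A", "A", "A", "A", "B"]
targets = [1, 1, 1, 0, 0]

no_smoothing = smoothed_target_means(categories, targets, m=0)
heavy_smoothing = smoothed_target_means(categories, targets, m=1000)

print(no_smoothing)     # plain per-category means
print(heavy_smoothing)  # both categories end up close to the global mean
```

With m=0 the encoding is just the raw category mean, so the single-row category B is encoded purely from that one sample. With a large m, both categories land close to the global mean of 0.6, which is the regularizing behavior a large smoothing value is meant to provide.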
