Handling Missing Values in scikit-learn: SimpleImputer

반디 · February 24, 2023

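univariate imputation algorithm	multivariate imputation algorithm
Fills a feature's missing values using only the non-missing values of that same feature	Fills missing values using the whole dataset
Does not use other features	Can use other features
e.g. SimpleImputer	e.g. IterativeImputer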
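
To make the difference in the table concrete, here is a minimal sketch on made-up toy data in which the second column is exactly twice the first. The array and the variable names are assumptions for illustration only. Note that IterativeImputer is still experimental in scikit-learn, so it has to be enabled with an explicit import first.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (unlocks IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

# Toy data (made up for illustration): column 1 is twice column 0.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [5.0, 10.0],
])

# Univariate: the gap is filled from column 1 alone (mean of 2, 4, 6, 10).
X_simple = SimpleImputer(strategy="mean").fit_transform(X)

# Multivariate: the gap is estimated with the help of column 0 as well.
X_iter = IterativeImputer(random_state=0).fit_transform(X)

print(X_simple[3, 1])  # 5.5
print(X_iter[3, 1])    # an estimate that also depends on column 0
```

SimpleImputer ignores the relationship between the two columns, while IterativeImputer is allowed to use it.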

SimpleImputer is a univariate imputation algorithm: it fills the missing values in each column with that column's mean, median, most frequent value, or a constant.
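
As a rough sketch of the column-wise behavior described above, the following fits a mean imputer and a median imputer on a small made-up array. The data and variable names are assumptions for illustration. The fitted per-column fill values are stored in the statistics_ attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration); np.nan marks the missing entries.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [8.0, 60.0],
])

mean_imputer = SimpleImputer(strategy="mean")
X_mean = mean_imputer.fit_transform(X)

median_imputer = SimpleImputer(strategy="median")
X_median = median_imputer.fit_transform(X)

# Each column is handled independently of the other.
print(mean_imputer.statistics_)    # column means: 4 and 30
print(median_imputer.statistics_)  # column medians: 3 and 20
```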

numeric data	string (categorical) data
mean, median, most_frequent, constant	most_frequent, constant

- With most_frequent, if there are two or more most frequent values, the smallest one is returned
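
The table and the tie-breaking rule can be checked with a short sketch like the one below. The arrays are made up for illustration, and the string column is stored as a NumPy object array with np.nan as the missing marker, which is one possible way to represent such data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numeric column where 1.0 and 2.0 both appear twice (a tie).
X_num = np.array([[2.0], [1.0], [2.0], [1.0], [np.nan]])
num_filled = SimpleImputer(strategy="most_frequent").fit_transform(X_num)
print(num_filled[-1, 0])  # 1.0, the smaller of the two tied values

# String column: most_frequent (or constant) is allowed.
X_str = np.array([["red"], ["blue"], ["red"], [np.nan]], dtype=object)
str_filled = SimpleImputer(strategy="most_frequent").fit_transform(X_str)
print(str_filled[-1, 0])  # red

# mean and median are only for numeric data, so this raises ValueError.
mean_rejected = False
try:
    SimpleImputer(strategy="mean").fit(X_str)
except ValueError as exc:
    mean_rejected = True
    print("mean is rejected for string data:", exc)
```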

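fill_value: value used to replace missing values when strategy = 'constant' (option: str or numerical value, default=None)
- None: numerical data is filled with 0, and string or object data with 'missing_value'
copy: whether to impute on a copy of X rather than in place (option: bool, default = True)
- A copy is always made, even with copy = False, when X is not an array of floating-point values, when X is encoded as a CSR matrix, or when add_indicator = True
add_indicator: whether to append columns that flag which values were missing (option: bool, default = False)
keep_empty_features: whether to keep features that are entirely missing, filled with 0, in the output (option: bool, default=False)
- With strategy = 'constant', they are filled with fill_value instead
- Because the default is False, a column with no observed values may be dropped from the output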

fit(X[, y]): Fit the imputer on X.
fit_transform(X[, y]): Fit to data, then transform it.
get_feature_names_out([input_features]): Get output feature names for transformation.
get_params([deep]): Get parameters for this estimator.
inverse_transform(X): Convert the data back to the original representation (requires add_indicator = True).
set_output(*[, transform]): Set the output container returned by transform and fit_transform.
set_params(**params): Set the parameters of this estimator.
transform(X): Impute all missing values in X.

References
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
