[Dacon] Credit Card Customer Segment Classification

Gongsam · March 15, 2025

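1. Problems encountered

[Colab] Your session crashed after using all available RAM.

Runtime -> Manage sessions -> Active sessions -> Terminate session
Change runtime type -> Hardware accelerator -> select T4 GPU or v2-8 TPU
- The CPU runtime only provides about 12 GB of RAM, which is why the session crashed
- After switching to the v2-8 TPU runtime, loading the data alone used more than 28.5 GB of RAM, so the crash was no surprise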
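
Besides switching to a runtime with more RAM, the loaded data itself can often be made smaller. The sketch below is not from the original notebook: it uses a small made-up DataFrame and a hypothetical helper, `downcast_numeric`, to show how `memory_usage` and `pd.to_numeric(..., downcast=...)` can measure and reduce a DataFrame's footprint.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy whose numeric columns use smaller dtypes."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Made-up data standing in for the competition tables.
demo_df = pd.DataFrame({
    "count": np.arange(100_000, dtype="int64") % 100,
    "amount": np.linspace(0.0, 1.0, 100_000),
})

before = demo_df.memory_usage(deep=True).sum()
small_df = downcast_numeric(demo_df)
after = small_df.memory_usage(deep=True).sum()
print(f"{before / 1e6:.2f} MB -> {after / 1e6:.2f} MB")
```

Downcasting floats trades precision for memory, so whether it is acceptable depends on the features involved.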

Problems while using CatBoostClassifier

Label-encoded the target, then never used it

from sklearn.preprocessing import LabelEncoder

feature_cols = [col for col in train_df.columns if col not in ["ID", "Segment"]]

X = train_df[feature_cols].copy()
y = train_df["Segment"].copy()

# Label-encode the target
le_target = LabelEncoder()
y_encoded = le_target.fit_transform(y)

I created y_encoded, and then

X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

never actually passed it to train_test_split (the raw y went in instead)..^^
CatBoostClassifier handles the label encoding of the target internally, so training itself was not affected

y_test_pred_labels = le_target.inverse_transform(y_test_pred)

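Because the model was trained on the original labels, its predictions did not need decoding, so instead of the above

test_data["pred_label"] = y_test_pred.flatten()

I did this.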
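
For comparison, here is a minimal sketch of the flow as it was originally intended, with the encoded target actually passed to train_test_split and the predictions decoded afterwards. Everything in it is illustrative: `toy_df` is made-up data, and RandomForestClassifier is used only as a lightweight stand-in for the real model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Made-up stand-in for train_df; the real data is far larger.
rng = np.random.default_rng(42)
toy_df = pd.DataFrame({
    "f1": rng.normal(size=200),
    "f2": rng.normal(size=200),
    "Segment": rng.choice(["A", "B", "C"], size=200),
})

X = toy_df[["f1", "f2"]]
y = toy_df["Segment"]

le_target = LabelEncoder()
y_encoded = le_target.fit_transform(y)  # strings -> integer class ids

# Pass y_encoded (not y) so the model trains on the encoded target.
X_train, X_val, y_train, y_val = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=20, random_state=42)  # stand-in model
model.fit(X_train, y_train)

y_val_pred = model.predict(X_val)                       # integer class ids
y_val_labels = le_target.inverse_transform(y_val_pred)  # back to the original strings
```

With this flow the inverse_transform call is required, because the model only ever sees integer ids.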

Fixing the error when storing a multi-dimensional array as a column

test_data["pred_label"] = y_test_pred.flatten()

Without flatten(), this raised a ValueError, because y_test_pred is a multi-dimensional numpy.ndarray. There are two ways to fix it.

Use flatten()
- The array becomes 1-D, so the error goes away
  - y_test_pred.shape is (600000, 1)
  - y_test_pred.flatten().shape is (600000,)
Convert to a DataFrame and use concat

y_test_df = pd.DataFrame(y_test_pred, columns=['pred_label'])
test_data2 = test_df.copy() # keep the original intact
test_data2_concat = pd.concat([test_data2, y_test_df], axis=1)
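
The two options can be compared side by side on a tiny example. This is only a sketch with made-up values: a three-row array stands in for the real (600000, 1) predictions, and `ravel()` is shown as a related alternative that the original notebook did not use.

```python
import numpy as np
import pandas as pd

y_test_pred = np.array([["A"], ["B"], ["A"]])  # shape (3, 1), a column vector
test_df = pd.DataFrame({"ID": ["t0", "t1", "t2"]})

flat = y_test_pred.flatten()   # always returns a 1-D copy
raveled = y_test_pred.ravel()  # also 1-D, but avoids copying when it can
print(y_test_pred.shape, flat.shape, raveled.shape)

# Option 1: assign the 1-D array as a new column.
option1 = test_df.copy()
option1["pred_label"] = flat

# Option 2: build a one-column DataFrame and concatenate side by side.
# concat(axis=1) aligns rows by index, so the indexes are made to match here.
y_test_df = pd.DataFrame(y_test_pred, columns=["pred_label"], index=test_df.index)
option2 = pd.concat([test_df, y_test_df], axis=1)
```

If test_df does not have a default 0..n-1 index (for example after filtering), option 2 without a matching index would misalign rows, which is why the index is set explicitly in this sketch.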

TypeError

TypeError: Encoders require their input argument must be uniformly strings or numbers. Got ['float', 'str']

I hit the error above, asked an AI for help, and got the following answer.

Hypothesis
The error message "Encoders require their input argument must be uniformly strings or numbers. Got ['float', 'str']" indicates that the LabelEncoder encountered a column containing mixed data types – specifically, strings and floats. This typically happens when a column intended to hold categorical values has some entries that are numbers (either as actual numeric types or strings that can be interpreted as numbers) and other entries that are strings representing categories.

In your case, this issue arises in the loop where you iterate through categorical_features and apply LabelEncoder. One or more of the columns in categorical_features likely contains both strings and numbers, preventing the LabelEncoder from processing it correctly.

Suggested fix

categorical_features = X.select_dtypes(include=['object']).columns.tolist()

X_test = test_df.copy()

encoders = {}  # store the fitted encoder for each column

for col in categorical_features:
    # Convert the column to string type before applying LabelEncoder
    X[col] = X[col].astype(str)
    X_test[col] = X_test[col].astype(str)
    
    le_train = LabelEncoder()
    X[col] = le_train.fit_transform(X[col])
    encoders[col] = le_train
    unseen_labels_val = set(X_test[col]) - set(le_train.classes_)
    if unseen_labels_val:
        le_train.classes_ = np.append(le_train.classes_, list(unseen_labels_val))
    X_test[col] = le_train.transform(X_test[col])

Explanation of changes:

Convert to String: Before applying the LabelEncoder, we explicitly convert the column to string type using astype(str). This ensures that all values in the column are treated as strings, resolving the type mismatch that caused the error.
Apply to both X and X_test: The conversion is applied to both your training data (X) and your test data (X_test) to maintain consistency and prevent similar issues later on.
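
The failure and the astype(str) fix can be reproduced on a tiny column. This is a minimal sketch with made-up values (the Series `mixed` is hypothetical, not a column from the competition data), assuming a reasonably recent scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A float and strings in the same object column, as in the error message.
mixed = pd.Series([1.5, "a", "b", "a"], dtype=object)

try:
    LabelEncoder().fit_transform(mixed)
    raised = False
except TypeError as err:
    raised = True
    print(err)

# Converting everything to str gives the encoder a uniform type to sort.
fixed = mixed.astype(str)
encoded = LabelEncoder().fit_transform(fixed)
print(list(encoded))
```

One caveat: after the conversion numbers are encoded as text, so "10" sorts before "2". Missing values may also need explicit handling (for example with fillna) before converting.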

Connecting Google Colab to GitHub

With File -> Save a copy in GitHub, private repositories did not show up in the list
- Fix: Tools -> Settings -> GitHub -> check "Access private repositories and organizations"

2. Models used

1) CatBoostClassifier

from catboost import CatBoostClassifier
cat_model = CatBoostClassifier(iterations=1000,  # maximum number of boosting iterations
                           learning_rate=0.05,
                           depth=6,
                           early_stopping_rounds=50,  # stop early after 50 rounds without improvement
                           verbose=25)

Training took 1 hour..
Score with label encoding only and no other preprocessing: 0.4995921294

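from catboost import CatBoostClassifier
cat_model = CatBoostClassifier(iterations=1200,  # maximum number of boosting iterations
                           learning_rate=0.05,
                           depth=6,
                           early_stopping_rounds=50,  # stop early after 50 rounds without improvement
                           verbose=50)

Training took 2 hours
Increased the maximum number of iterations => result: 0.50315
The score seems to go up as iterations increases (early stopping did not kick in). Note that early_stopping_rounds only has an effect when a validation set is passed to fit through eval_set.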

2) RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
)
rf_model.fit(X_train, y_train)