feature_cols = [col for col in train_df.columns if col not in ["ID", "Segment"]]
X = train_df[feature_cols].copy()
y = train_df["Segment"].copy()
# Label-encode the target
le_target = LabelEncoder()
y_encoded = le_target.fit_transform(y)
X_train, X_val, y_train, y_val = train_test_split(
    X,
    y_encoded,  # split the encoded target so le_target.inverse_transform applies to predictions
    test_size=0.2,
    random_state=42,
)
y_test_pred_labels = le_target.inverse_transform(y_test_pred)
test_data["pred_label"] = y_test_pred.flatten()
# y_test_pred.shape is (600000, 1)
# y_test_pred.flatten().shape is (600000,)
y_test_df = pd.DataFrame(y_test_pred, columns=['pred_label'])
test_data2 = test_df.copy()  # keep the original intact
test_data2_concat = pd.concat([test_data2, y_test_df], axis=1)
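The shape difference noted above matters when attaching predictions to a DataFrame. The following is a minimal sketch with a small stand-in array; the sizes, values, and the non-default index are made up for illustration and are not taken from the real data.

```python
import numpy as np
import pandas as pd

# Stand-in for model output: a column vector of shape (n, 1).
y_test_pred = np.array([[2], [0], [1]])
print(y_test_pred.shape)  # (3, 1)

# flatten() collapses the column vector to one dimension.
flat = y_test_pred.flatten()
print(flat.shape)  # (3,)

# Hypothetical test frame with a non-default index.
test_df = pd.DataFrame({"f1": [10, 20, 30]}, index=[5, 6, 7])

# Assigning a 1-D array is positional, so the index labels do not matter.
by_assign = test_df.copy()
by_assign["pred_label"] = flat

# pd.concat(axis=1) aligns on index labels, so the prediction frame
# is given the same index before concatenating.
y_test_df = pd.DataFrame(y_test_pred, columns=["pred_label"], index=test_df.index)
by_concat = pd.concat([test_df, y_test_df], axis=1)
```

If the prediction frame kept its default 0..n-1 index while the test frame did not, the concatenation would produce extra rows filled with NaN instead of lining the predictions up row by row.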
TypeError: Encoders require their input argument must be uniformly strings or numbers. Got ['float', 'str']
Hypothesis
The error message "Encoders require their input argument must be uniformly strings or numbers. Got ['float', 'str']" means that LabelEncoder was given a column containing both float and str values. LabelEncoder sorts the unique values of a column, and Python cannot compare a float with a str. This typically happens when a column meant to hold categorical strings also contains numeric entries, or (depending on the scikit-learn version) missing values, which pandas stores as float NaN in object columns.
In your case, the error arises in the loop that iterates over categorical_features and applies LabelEncoder. At least one of the columns in categorical_features likely contains both strings and floats, so LabelEncoder cannot process it.
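To find which columns are affected, the Python types present in each object column can be listed. This is a minimal sketch on a hypothetical two-column frame (the column names and values are invented for illustration); it reproduces the TypeError and then shows that casting to str avoids it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame: "grade" mixes strings with a float value.
df = pd.DataFrame(
    {
        "grade": ["A", "B", 3.0, "A"],
        "city": ["Seoul", "Busan", "Seoul", "Daegu"],
    },
    dtype=object,
)

# Collect the Python type names found in each object column.
mixed = {}
for col in df.select_dtypes(include=["object"]).columns:
    type_names = sorted({type(v).__name__ for v in df[col]})
    if len(type_names) > 1:
        mixed[col] = type_names
print(mixed)

# Encoding the mixed column as-is raises the TypeError.
try:
    LabelEncoder().fit_transform(df["grade"])
    raised = False
except TypeError as exc:
    raised = True
    print(exc)

# After casting to str, every value has the same type and encoding succeeds.
codes = LabelEncoder().fit_transform(df["grade"].astype(str))
```

Note that the cast turns the float into the string "3.0", which then becomes a category of its own.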
import numpy as np

categorical_features = X.select_dtypes(include=['object']).columns.tolist()
X_test = test_df.copy()
encoders = {}  # store the fitted encoder for each column
for col in categorical_features:
    # Convert the column to string type before applying LabelEncoder
    X[col] = X[col].astype(str)
    X_test[col] = X_test[col].astype(str)
    le_train = LabelEncoder()
    X[col] = le_train.fit_transform(X[col])
    encoders[col] = le_train
    # Labels that appear only in the test data get new codes appended after the training classes
    unseen_labels_val = set(X_test[col]) - set(le_train.classes_)
    if unseen_labels_val:
        # sorted() keeps the appended codes the same from run to run
        le_train.classes_ = np.append(le_train.classes_, sorted(unseen_labels_val))
    X_test[col] = le_train.transform(X_test[col])
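Explanation of changes: each categorical column in X and X_test is cast to str with astype(str) before encoding, so every value in a column has the same type and LabelEncoder can sort and encode it.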
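The loop handles test-only labels by appending them to classes_ by hand. As an alternative sketch, scikit-learn's OrdinalEncoder can map unknown categories to a sentinel value at transform time. This assumes scikit-learn 0.24 or later; the frames, the "grade" column, and the choice of -1 as the sentinel are hypothetical and not part of the original code.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train/test frames; "C" appears only in the test data.
X = pd.DataFrame({"grade": ["A", "B", "A"]}, dtype=object)
X_test = pd.DataFrame({"grade": ["B", "C"]}, dtype=object)
categorical_features = ["grade"]

# Categories unseen during fit are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_enc = encoder.fit_transform(X[categorical_features])
X_test_enc = encoder.transform(X_test[categorical_features])

print(X_enc.ravel())
print(X_test_enc.ravel())
```

With this approach all unseen categories share one code, whereas the loop gives each unseen label its own code that the model never saw during training.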
from catboost import CatBoostClassifier

cat_model = CatBoostClassifier(
    iterations=1000,           # maximum number of boosting iterations
    learning_rate=0.05,
    depth=6,
    early_stopping_rounds=50,  # stop early after 50 rounds without improvement (needs an eval_set in fit)
    verbose=25,
)
from catboost import CatBoostClassifier

cat_model = CatBoostClassifier(
    iterations=1200,           # maximum number of boosting iterations
    learning_rate=0.05,
    depth=6,
    early_stopping_rounds=50,  # stop early after 50 rounds without improvement (needs an eval_set in fit)
    verbose=50,
)
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
)
rf_model.fit(X_train, y_train)
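Once the model is fitted, the held-out split can be used to check it. The following is a minimal sketch on synthetic data from make_classification, which stands in for the real features; random_state=42 on the model and the use of accuracy as the metric are assumptions added here for reproducibility and illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; the real features come from train_df).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)

# Score the model on rows it did not see during training.
val_pred = rf_model.predict(X_val)
val_accuracy = accuracy_score(y_val, val_pred)
print(f"validation accuracy: {val_accuracy:.3f}")
```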