ML

탁가이버 · March 3, 2025

I've completed Kaggle's Intermediate Machine Learning course.

Using housing data, I built a model to predict housing prices. The model is deployed on an ongoing basis: it predicts the price of each new house as its listing is added to the website.

Here are four features that could be used as predictors.

Size of the house (in square meters)
Average sales price of homes in the same neighborhood
Latitude and longitude of the house
Whether the house has a basement
I have historic data to train and validate the model.

Which of the features is most likely to be a source of leakage?

The most likely culprit is the average sales price of homes in the same neighborhood: if that average is computed over all recorded sales, including the target house itself or sales that close after the prediction is made, the feature carries target information that will not be available at prediction time. The simulation below demonstrates the problem and a leak-free alternative.
https://www.kaggle.com/competitions/home-data-for-ml-course

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Simulate housing data

np.random.seed(42)
n_houses = 1000
data = {
    'size': np.random.uniform(50, 300, n_houses),  # Square meters
    'neighborhood': np.random.choice(['A', 'B', 'C'], n_houses),
    'lat': np.random.uniform(40, 41, n_houses),
    'lon': np.random.uniform(-74, -73, n_houses),
    'basement': np.random.binomial(1, 0.4, n_houses),
}
df = pd.DataFrame(data)

# Simulate price (target)

df['price'] = 1000 * df['size'] + 50000 * df['basement'] + np.random.normal(0, 20000, n_houses)

# Problematic feature: average price per neighborhood (leakage: computed from the target, including the current house)

df['neighborhood_avg_price'] = df.groupby('neighborhood')['price'].transform('mean')

# Fixed feature: historical average (simplified; assumes prior sales data)

df = df.sort_values(by='size') # Proxy for time

# Expanding average of earlier sales in the same neighborhood, excluding the current house

df['cum_count'] = df.groupby('neighborhood').cumcount()                       # Number of earlier sales in the neighborhood
df['cum_price'] = df.groupby('neighborhood')['price'].cumsum() - df['price']  # Sum of earlier prices (excludes current house)
df['prior_avg_price'] = (df['cum_price'] / df['cum_count'].replace(0, np.nan)).fillna(0)  # First sale in each neighborhood gets 0
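# Equivalent construction with pandas' expanding window (a sketch; assumes df is
# already sorted by the time proxy, as above). shift(1) drops the current house
# from its own average, so this should match prior_avg_price.
alt_prior_avg = (df.groupby('neighborhood')['price']
                 .transform(lambda s: s.expanding().mean().shift(1))
                 .fillna(0))
assert np.allclose(alt_prior_avg, df['prior_avg_price'])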

# Features

X_leaky = df[['size', 'neighborhood_avg_price', 'lat', 'lon', 'basement']]
X_fixed = df[['size', 'prior_avg_price', 'lat', 'lon', 'basement']]
y = df['price']

# Train/test split

X_train_leaky, X_test_leaky, y_train, y_test = train_test_split(X_leaky, y, test_size=0.2, random_state=42)
X_train_fixed, X_test_fixed, y_train_fixed, y_test_fixed = train_test_split(X_fixed, y, test_size=0.2, random_state=42)

# Model with leakage

model_leaky = RandomForestRegressor(random_state=42)
model_leaky.fit(X_train_leaky, y_train)
preds_leaky = model_leaky.predict(X_test_leaky)
rmse_leaky = np.sqrt(mean_squared_error(y_test, preds_leaky))

# Model without leakage

model_fixed = RandomForestRegressor(random_state=42)
model_fixed.fit(X_train_fixed, y_train_fixed) # Use y_train_fixed to match X_train_fixed
preds_fixed = model_fixed.predict(X_test_fixed)
rmse_fixed = np.sqrt(mean_squared_error(y_test_fixed, preds_fixed)) # Use y_test_fixed

print(f"RMSE with leakage (neighborhood_avg_price): {rmse_leaky:.2f}")
print(f"RMSE without leakage (prior_avg_price): {rmse_fixed:.2f}")

Verification

The script runs without errors and the RMSE values are plausible. In the run reported here, the two models printed nearly identical errors:

RMSE with leakage (neighborhood_avg_price): 23086.09
RMSE without leakage (prior_avg_price): 23080.87

The near-tie is an artifact of the simulation: the simulated price depends only on size and basement, not on neighborhood, so the leaky neighborhood average carries almost no extra signal. On real data, where neighborhood strongly predicts price, the leaky feature would inflate validation scores that then collapse in production, while the prior-average feature would score worse in validation but hold up at deployment, which is the point of avoiding leakage.
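One quick, informal check for leakage is to see how heavily the model leans on the suspect feature; a minimal sketch using the objects above (in this simulation the effect is muted, for the reason just given):

importances = pd.Series(model_leaky.feature_importances_, index=X_leaky.columns)
print(importances.sort_values(ascending=False))  # A single dominant engineered feature warrants suspicion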

Setup
The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

# Set up code checking

import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex6 import *
print("Setup Complete")
Setup Complete
You will work with the Housing Prices Competition for Kaggle Learn Users dataset from the previous exercise.

[Image: Ames Housing dataset]

Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data

X = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors

X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

# Break off validation set from training data

X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

"Cardinality" means the number of unique values in a column

Select categorical columns with relatively low cardinality (convenient but arbitrary)

low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and
                        X_train_full[cname].dtype == "object"]

# Select numeric columns

numeric_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only

my_cols = low_cardinality_cols + numeric_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

# One-hot encode the data (to shorten the code, we use pandas)

X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)
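Note that after the left aligns above, any dummy column that one-hot encoding produced only in X_train becomes an all-NaN column in X_valid and X_test. XGBoost handles missing values natively, so this works, but it is worth knowing such columns exist; a quick sketch to list them:

print([c for c in X_valid.columns if X_valid[c].isna().all()])  # Dummy columns seen only in the training split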
Step 1: Build model
Part A
In this step, you'll build and train your first model with gradient boosting.

Begin by setting my_model_1 to an XGBoost model. Use the XGBRegressor class, and set the random seed to 0 (random_state=0). Leave all other parameters as default.
Then, fit the model to the training data in X_train and y_train.
from xgboost import XGBRegressor

# Define the model

my_model_1 = XGBRegressor(random_state=0)

# Fit the model

my_model_1.fit(X_train, y_train)

# Check your answer

step_1.a.check()
Correct

# Lines below will give you a hint or solution code

step_1.a.hint()
step_1.a.solution()
Hint: Begin by defining the model with my_model_1 = XGBRegressor(random_state=0). Then, you can fit the model with the fit() method.

Solution:

# Define the model

my_model_1 = XGBRegressor(random_state=0)

# Fit the model

my_model_1.fit(X_train, y_train)
Part B
Set predictions_1 to the model's predictions for the validation data. Recall that the validation features are stored in X_valid.

from sklearn.metrics import mean_absolute_error

# Get predictions

predictions_1 = my_model_1.predict(X_valid) # Your code here

# Check your answer

step_1.b.check()
Correct

# Lines below will give you a hint or solution code

step_1.b.hint()
step_1.b.solution()
Hint: Use the predict() method to generate validation predictions.

Solution:

# Get predictions

predictions_1 = my_model_1.predict(X_valid)
Part C
Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) corresponding to the predictions for the validation set. Recall that the labels for the validation data are stored in y_valid.

# Calculate MAE

mae_1 = mean_absolute_error(predictions_1, y_valid)

print("Mean Absolute Error:" , mae_1)

# Check your answer

step_1.c.check()
Mean Absolute Error: 18161.82412510702
Correct

# Lines below will give you a hint or solution code

step_1.c.hint()
step_1.c.solution()
Hint: The mean_absolute_error function should take the predictions in predictions_1 and the validation target in y_valid as arguments.

Solution:

# Calculate MAE

mae_1 = mean_absolute_error(predictions_1, y_valid)
print("Mean Absolute Error:" , mae_1)
Step 2: Improve the model
Now that you've trained a default model as a baseline, it's time to tinker with the parameters to see if you can get better performance!

Begin by setting my_model_2 to an XGBoost model, using the XGBRegressor class. Use what you learned in the previous tutorial to figure out how to change the default parameters (like n_estimators and learning_rate) to get better results.
Then, fit the model to the training data in X_train and y_train.
Set predictions_2 to the model's predictions for the validation data. Recall that the validation features are stored in X_valid.
Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) corresponding to the predictions on the validation set. Recall that the labels for the validation data are stored in y_valid.
In order for this step to be marked correct, your model in my_model_2 must attain lower MAE than the model in my_model_1.

# Define the model

my_model_2 = XGBRegressor(n_estimators=1000, learning_rate=0.05)

# Fit the model

my_model_2.fit(X_train, y_train)

# Get predictions

predictions_2 = my_model_2.predict(X_valid)

# Calculate MAE

mae_2 = mean_absolute_error(predictions_2, y_valid)
print("Mean Absolute Error:" , mae_2)

# Check your answer

step_2.check()
Mean Absolute Error: 17224.27947078339
Correct

# Lines below will give you a hint or solution code

step_2.hint()
step_2.solution()
Hint: In the official solution to this problem, we chose to increase the number of trees in the model (with the n_estimators parameter) and decrease the learning rate (with the learning_rate parameter).

Solution:

# Define the model

my_model_2 = XGBRegressor(n_estimators=1000, learning_rate=0.05)

# Fit the model

my_model_2.fit(X_train, y_train)

# Get predictions

predictions_2 = my_model_2.predict(X_valid)

# Calculate MAE

mae_2 = mean_absolute_error(predictions_2, y_valid)
print("Mean Absolute Error:" , mae_2)
Step 3: Break the model
In this step, you will create a model that performs worse than the original model in Step 1. This will help you to develop your intuition for how to set parameters. You might even find that you accidentally get better performance, which is ultimately a nice problem to have and a valuable learning experience!

Begin by setting my_model_3 to an XGBoost model, using the XGBRegressor class. Use what you learned in the previous tutorial to figure out how to change the default parameters (like n_estimators and learning_rate) to design a model to get high MAE.
Then, fit the model to the training data in X_train and y_train.
Set predictions_3 to the model's predictions for the validation data. Recall that the validation features are stored in X_valid.
Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) corresponding to the predictions on the validation set. Recall that the labels for the validation data are stored in y_valid.
In order for this step to be marked correct, your model in my_model_3 must attain higher MAE than the model in my_model_1.

# Define the model

my_model_3 = XGBRegressor(n_estimators=1)

# Fit the model

my_model_3.fit(X_train, y_train)

# Get predictions

predictions_3 = my_model_3.predict(X_valid)

# Calculate MAE

mae_3 = mean_absolute_error(predictions_3, y_valid)
print("Mean Absolute Error:" , mae_3)

# Check your answer

step_3.check()
Mean Absolute Error: 42678.815550085616
Correct

# Lines below will give you a hint or solution code

step_3.hint()
step_3.solution()
Hint: In the official solution for this problem, we chose to greatly decrease the number of trees in the model by tinkering with the n_estimators parameter.

Solution:

# Define the model

my_model_3 = XGBRegressor(n_estimators=1)

# Fit the model

my_model_3.fit(X_train, y_train)

# Get predictions

predictions_3 = my_model_3.predict(X_valid)

# Calculate MAE

mae_3 = mean_absolute_error(predictions_3, y_valid)
print("Mean Absolute Error:" , mae_3)
Keep going
Continue to learn about data leakage. This is an important issue for a data scientist to understand, and it has the potential to ruin your models in subtle and dangerous ways!

