I tried preprocessing the Titanic data.
The 'Age' column contains NaN values.
Usually people just drop the 'Age' column.
Instead, I thought about building a model that predicts the 'Age' values,
because the Titanic dataset does not have much data to spare.
So I tried building a model (LinearRegression) to predict 'Age'
and substituting the model's predictions for the NaN values.
But I was somewhat worried about how to apply the predicted values to the NaN rows.
train_data.isna().sum()
[result]
Survived 0
Sex 0
Age 177
Fare 0
Pclass_2 0
Pclass_3 0
Embarked_Q 0
Embarked_S 0
dtype: int64
train_temp = train_data.loc[train_data['Age'].notnull()]
temp_x = train_temp.drop(columns='Age')
temp_y = train_temp['Age']
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
model = LinearRegression()
model.fit(temp_x, temp_y)
pred = model.predict(temp_x)
print(r2_score(temp_y, pred))
[result]
0.20847354768995863 😅
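The score above is measured on the same rows the model was fit on, so it may look better than it would on unseen rows. A minimal sketch of a cross-validated check is below. It uses synthetic data as a stand-in for temp_x and temp_y (the names X, y and the generated values are assumptions for illustration, not the Titanic data), so the number it prints says nothing about the real model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for temp_x / temp_y (hypothetical data, not Titanic).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 30 + 5 * X[:, 0] + rng.normal(scale=10, size=200)

# 5-fold cross-validation: each fold is scored on rows the model did not see.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean())
```

With the real data, temp_x and temp_y could be passed in place of X and y.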
Predict the NaN values
xx = train_data.loc[train_data['Age'].isna()]
xxx = xx.drop(columns='Age')
save_index = xxx.reset_index()['index']
pred = model.predict(xxx)
temp_df = pd.DataFrame({'index':save_index, 'Age':pred})
temp_df
[result]
index Age
0 5 28.772607
1 17 27.621829
2 19 18.663342
3 26 25.665526
4 28 21.794739
... ... ...
172 859 25.665349
173 863 23.592961
174 868 27.083844
175 878 27.151203
176 888 25.528662
177 rows × 2 columns
for i in temp_df['index'].values:
    # take the single predicted age for row i as a scalar
    train_data.loc[i, 'Age'] = temp_df.loc[temp_df['index'] == i, 'Age'].values[0]
I'm worried this code will take a long time to run.
I hope to end up with a better-performing model.
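On the speed worry: one possible way to avoid the row-by-row loop is to assign all predictions in a single .loc call with a boolean mask. This is only a sketch on a toy DataFrame. The values, the missing mask name, and the hard-coded pred array (standing in for model.predict on the NaN rows) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data (hypothetical values, not the real Titanic rows).
train_data = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 8.05, 53.1, 13.0],
})

# Boolean mask of the rows whose Age is missing.
missing = train_data["Age"].isna()

# Stand-in for model.predict(...): one predicted age per missing row.
pred = np.array([28.5, 31.0])

# Assign all predictions at once; the mask selects the NaN rows in order,
# so pred[k] lands on the k-th missing row.
train_data.loc[missing, "Age"] = pred

print(train_data["Age"].tolist())  # [22.0, 28.5, 35.0, 31.0]
```

This relies on the predictions being in the same row order as the masked rows, which holds when the model is given train_data.loc[missing] with 'Age' dropped.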
2022.10.21. first commit