[trouble #4] replacing NaN with other values

kamchur · October 21, 2022

I was preprocessing the Titanic data.
The 'Age' column contains NaN values.
Usually people just drop the 'Age' column,
but I wanted to build a model that predicts the 'Age' values,
because the Titanic dataset is missing some of them.

So I tried building a model (LinearRegression) to predict 'Age'
and substituting the model's predictions for the NaN values.

But I struggled with how to apply the predicted values to the NaN cells.

😁START

check NaN values

train_data.isna().sum()

[result]
Survived        0
Sex             0
Age           177
Fare            0
Pclass_2        0
Pclass_3        0
Embarked_Q      0
Embarked_S      0
dtype: int64

Create a model to predict 'Age'

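# keep only the rows where 'Age' is known
train_temp = train_data.loc[train_data['Age'].notnull()]

# features: every column except 'Age'; target: 'Age'
temp_x = train_temp.drop(columns='Age')
temp_y = train_temp['Age']

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

model = LinearRegression()
model.fit(temp_x, temp_y)
pred = model.predict(temp_x)
# note: this R^2 is computed on the same rows the model was fitted on
print(r2_score(temp_y, pred))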

[result]
0.20847354768995863 😅
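
The score above is measured on the same rows used for fitting, so it may look better than the model really is. A minimal sketch of checking the score on held-out rows instead, assuming synthetic stand-in data (the `temp_x` and `temp_y` here are randomly generated, not the real Titanic columns) and a 20% validation split chosen only for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for temp_x / temp_y (NOT the real Titanic data).
rng = np.random.default_rng(0)
temp_x = rng.normal(size=(200, 3))
temp_y = 30 + 5 * temp_x[:, 0] + rng.normal(scale=10, size=200)

# Hold out 20% of the rows that have a known target.
x_fit, x_val, y_fit, y_val = train_test_split(
    temp_x, temp_y, test_size=0.2, random_state=0
)

model = LinearRegression()
model.fit(x_fit, y_fit)

# R^2 on rows the model did not see during fitting.
val_score = r2_score(y_val, model.predict(x_val))
print(val_score)
```

If the held-out score is much lower than the training score, the predicted ages are probably less reliable than the first number suggests.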

Predict values for the NaN rows

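import pandas as pd

# rows where 'Age' is missing, without the 'Age' column itself
xx = train_data.loc[train_data['Age'].isna()]
xxx = xx.drop(columns='Age')
# remember the original row labels so the predictions can be put back
save_index = xxx.reset_index()['index']

pred = model.predict(xxx)

temp_df = pd.DataFrame({'index':save_index, 'Age':pred})
temp_df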

[result]
index	Age
0	5	28.772607
1	17	27.621829
2	19	18.663342
3	26	25.665526
4	28	21.794739
...	...	...
172	859	25.665349
173	863	23.592961
174	868	27.083844
175	878	27.151203
176	888	25.528662
177 rows × 2 columns

Main code to apply (merge) the predictions

for i in temp_df['index'].values:
    # take the single matching prediction as a scalar, not a length-1 array
    train_data.loc[i, 'Age'] = temp_df.loc[temp_df['index'] == i, 'Age'].values[0]
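
The loop above can probably be replaced by one assignment through a boolean mask, which skips the helper DataFrame entirely. A minimal sketch, assuming a toy `train_data` with made-up values and a hard-coded `pred` array standing in for the model's output:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data (values are made up for illustration).
train_data = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0, np.nan, 35.0],
    'Fare': [7.25, 8.46, 71.28, 13.00, 8.05],
})

# Boolean mask of the rows whose 'Age' is missing.
missing = train_data['Age'].isna()

# Stand-in for model.predict(train_data.loc[missing].drop(columns='Age')).
# predict returns one value per input row, in the same row order.
pred = np.array([28.8, 27.6])

# Assign all predictions at once; no temp_df or for loop needed.
train_data.loc[missing, 'Age'] = pred

print(train_data['Age'].tolist())
# → [22.0, 28.8, 38.0, 27.6, 35.0]
```

This works because the mask selects the NaN rows in their original order, which is the same order the predictions come back in.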

😂END

I struggled with this code for a long time.
I hope to build a better-performing model next time.

2022.10.21. first commit
