[kaggle/python] House Price exploration

Jia Kang · July 31, 2022

📌 Topic: House Price exploration

📖 Reference solution

Comprehensive data exploration with Python (by Pedro Marcelino)


✔️ Understand the problem

⚡ Exploring the variables and the dataset

✏️ Import the required libraries

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
# Suppress warning messages with the warnings module
import warnings
warnings.filterwarnings(action='ignore')

✏️ Load the dataset

# Load the dataset
train_data = 'C:\\Users\\USER\\Desktop\\Data Analysis\\data\\train2.csv'
test_data = 'C:\\Users\\USER\\Desktop\\Data Analysis\\data\\test2.csv'
df_train = pd.read_csv(train_data)
df_test = pd.read_csv(test_data)

✏️ Check the columns (variables) of the train set

print(df_train.columns.values)

✏️ Inspect the data

df_train.head()

✏️ Check the summary information of the train and test sets

df_train.info()
print('\n')
df_test.info()

🔹 Question

  1. Is this variable something we need to consider when buying a house?
  2. If so, how important is this variable?
  3. Is the information in this variable already described by another variable?
  • Working through these questions leads to the conclusion that 'OverallQual', 'YearBuilt', 'TotalBsmtSF', and 'GrLivArea' can play an important role in this problem.
    → Two variables related to 'building': OverallQual, YearBuilt
    → Two variables related to 'space': TotalBsmtSF, GrLivArea

✔️ analysing 'SalePrice'

✏️ Check the summary statistics

df_train['SalePrice'].describe()

✏️ Draw a histogram

sns.histplot(df_train['SalePrice'], kde=True)  # distplot is deprecated in recent seaborn; histplot with kde=True draws a similar plot

  • Deviates from the normal distribution.

  • Has appreciable positive skewness.

  • Shows peakedness.
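The three observations above can also be read off as numbers. The sketch below is only an illustration: it uses a synthetic log-normal sample as a stand-in for df_train['SalePrice'] (an assumption, not the Kaggle data), and reports skewness and excess kurtosis with pandas.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_train['SalePrice'] (NOT the Kaggle data):
# a log-normal sample is right-skewed, as sale prices usually are.
rng = np.random.default_rng(0)
sale_price = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=2000)),
                       name="SalePrice")

skewness = sale_price.skew()   # > 0: the right tail is longer than the left
kurtosis = sale_price.kurt()   # excess kurtosis; > 0: sharper peak / heavier tails than a normal curve

print(f"Skewness: {skewness:.3f}")
print(f"Kurtosis: {kurtosis:.3f}")
```

On the real data, the same two calls on df_train['SalePrice'] would give the corresponding figures.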


⚡ Relationship with numerical variables

✏️ scatter plot (GrLivArea, SalePrice)

var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)  # axis=1: concatenate column-wise
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000))

  • SalePrice and GrLivArea appear to have a positive linear relationship.

✏️ scatter plot (TotalBsmtSF, SalePrice)

var = 'TotalBsmtSF'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000))

  • SalePrice and TotalBsmtSF appear to have a positive linear relationship.
  • Many observations have a TotalBsmtSF value of 0.

※ pd.concat: combining DataFrames

pd.concat(objs,            # list of DataFrames or Series to combine
          axis=0,          # axis: direction to concatenate along (0: rows, 1: columns)
          keys=None,       # labels identifying each original object
          levels=None,
          names=None)
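As a small illustration of the axis and keys arguments described above, the sketch below combines two toy Series (the values are made up for the example, not read from the dataset).

```python
import pandas as pd

# Toy values for illustration only.
price = pd.Series([208500, 181500, 223500], name="SalePrice")
area = pd.Series([1710, 1262, 1786], name="GrLivArea")

# axis=1 places the Series side by side as columns, aligned on the index.
side_by_side = pd.concat([price, area], axis=1)

# axis=0 stacks them end to end; keys adds an outer index level
# that records which original object each value came from.
stacked = pd.concat([price, area], axis=0, keys=["price", "area"])

print(side_by_side)
print(stacked)
```

With axis=1 the result is a 3x2 DataFrame, which is the form used for the scatter plots in this post.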

⚡ Relationship with categorical variables

✏️ box plot (OverallQual, SalePrice)

var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8,6))
fig = sns.boxplot(x=var, y='SalePrice', data=data)
fig.axis(ymin=0, ymax=800000)

  • SalePrice tends to increase as OverallQual increases.

✏️ box plot (YearBuilt, SalePrice)

var = 'YearBuilt'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16,8))
fig = sns.boxplot(x=var, y='SalePrice', data=data)
fig.axis(ymin=0, ymax=800000)
plt.xticks(rotation=90)         # rotate the x-axis tick labels by 90 degrees

  • SalePrice tends to increase as YearBuilt increases (that is, newer houses tend to sell for more).
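The trend read from the box plots can be cross-checked with group medians, since the line inside each box is the median of that category. This is a minimal sketch on made-up toy values (an assumption for illustration, not the Kaggle data).

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "OverallQual": [5, 5, 6, 6, 7, 7, 8, 8],
    "SalePrice": [120000, 135000, 150000, 165000, 200000, 215000, 270000, 290000],
})

# Median SalePrice per quality level (the center line of each box).
median_by_quality = toy.groupby("OverallQual")["SalePrice"].median()

# True when every median is at least as large as the previous one.
is_increasing = median_by_quality.is_monotonic_increasing

print(median_by_quality)
print("Medians increase with quality:", is_increasing)
```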

✔️ Correlation, Scatter plot

⚡ correlation

✏️ correlation matrix

corrmat = df_train.corr(numeric_only=True)  # numeric columns only; pandas 2.x raises an error on string columns otherwise
f, ax = plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, vmax=.8, square=True)

✏️ SalePrice correlation matrix

k = 10      # number of variables shown in the heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f',
                 annot_kws={'size': 10},
                 yticklabels=cols.values, xticklabels=cols.values)
# cbar: whether to draw the colorbar, annot: whether to write the value in each cell
# fmt: format of the annotated values -> fmt='.2f': two decimal places
# yticklabels=cols.values: show the column names on the y-axis
plt.show()

  • OverallQual, GrLivArea, TotalBsmtSF → strongly correlated with SalePrice
  • GarageCars, GarageArea → correlations with SalePrice of 0.64 and 0.62, respectively
  • TotalBsmtSF, 1stFlrSF (first floor) → both have a correlation of 0.61 with SalePrice
  • YearBuilt → also correlated with SalePrice
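The ranking behind these bullets comes from nlargest on the 'SalePrice' column of the correlation matrix. The sketch below reproduces that step on synthetic data (the column names mirror the dataset, but the values and the 'Noise' column are assumptions made up for the example).

```python
import numpy as np
import pandas as pd

# Synthetic data: SalePrice depends on two features; 'Noise' is unrelated.
rng = np.random.default_rng(42)
n = 500
quality = rng.normal(size=n)
area = 0.6 * quality + 0.8 * rng.normal(size=n)
toy = pd.DataFrame({
    "SalePrice": 2.0 * quality + 1.0 * area + 0.5 * rng.normal(size=n),
    "OverallQual": quality,
    "GrLivArea": area,
    "Noise": rng.normal(size=n),
})

corrmat = toy.corr(numeric_only=True)

# Keep the k variables with the largest correlation to SalePrice.
# SalePrice itself is always first, with a correlation of 1.
k = 3
top = corrmat.nlargest(k, "SalePrice")["SalePrice"]

print(top)
```

Note that nlargest ranks by signed value, so a strongly negative correlation would not be selected by this approach.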

※ Basic heatmap syntax (reference)

sns.heatmap(df,                 # data
            vmin=100,           # minimum value of the color scale
            vmax=700,           # maximum value of the color scale
            cbar=True,          # whether to draw the colorbar
            center=400,         # value at the center of the colormap
            linewidths=0.5,     # draw lines between cells
            annot=True,         # whether to write the value in each cell
            fmt='d',            # format of the values shown in the cells
            cmap='Blues')       # color of the heatmap

⚡ scatter plot

✏️ scatter plot matrix (pair plot)

sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], height=2.5)  # examine pairwise relationships; 'size' was renamed to 'height' in seaborn 0.9
plt.show()

✔️ Data cleaning

⚡ missing data

✏️ Count the missing data

total = df_train.isnull().sum().sort_values(ascending=False)
# isnull result -> True (1): missing value, False (0): valid value
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
              Total   Percent
PoolQC         1453  0.995205
MiscFeature    1406  0.963014
Alley          1369  0.937671
Fence          1179  0.807534
FireplaceQu     690  0.472603
LotFrontage     259  0.177397
GarageYrBlt      81  0.055479
GarageCond       81  0.055479
GarageType       81  0.055479
…                 …         …

  • PoolQC, MiscFeature, Alley, Fence, FireplaceQu, LotFrontage: these have a very large share of missing values and do not seem to be important factors when buying a house → consider removing these variables

  • GarageYrBlt, GarageCond, GarageType, GarageFinish, GarageQual: all have the same number of missing values. Among the garage-related variables, 'GarageCars' has the highest correlation with SalePrice, so these variables are removed

  • BsmtFinType2, BsmtExposure, BsmtQual, BsmtCond, BsmtFinType1: the same logic applies, so these variables are removed

  • MasVnrArea, MasVnrType: strongly correlated with YearBuilt and OverallQual, which are already being considered, so these variables are removed

  • Electrical: only one value is missing, so only that observation is removed


✏️ dealing with missing data

df_train = df_train.drop(missing_data[missing_data['Total'] > 1].index, axis=1)  # drop columns with more than one missing value
df_train = df_train.drop(df_train.loc[df_train['Electrical'].isnull()].index)   # drop the single row where Electrical is missing
df_train.isnull().sum().max()
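The same two-step rule (drop columns with more than one missing value, then drop the single incomplete row) can be traced on a toy frame. The column names below mirror the dataset, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd", np.nan],            # mostly missing
    "Electrical": ["SBrkr", np.nan, "SBrkr", "FuseA", "SBrkr"],  # one missing
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

total = toy.isnull().sum().sort_values(ascending=False)
percent = toy.isnull().mean().sort_values(ascending=False)  # mean of True/False = share missing
missing = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])

# Step 1: drop every column with more than one missing value.
cleaned = toy.drop(columns=missing[missing["Total"] > 1].index)
# Step 2: drop the rows where Electrical is missing.
cleaned = cleaned.dropna(subset=["Electrical"])

print(missing)
print(cleaned)
print("Remaining missing values:", cleaned.isnull().sum().max())
```

Here isnull().mean() gives the same ratio as isnull().sum() / isnull().count() used above.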

⚡ outlier

✏️ Standardize the data

# standardizing data
saleprice_scaled = StandardScaler().fit_transform(df_train[['SalePrice']])  # 2-D input; Series[:, np.newaxis] fails in recent pandas
low_range = saleprice_scaled[saleprice_scaled[:,0].argsort()][:10]
high_range= saleprice_scaled[saleprice_scaled[:,0].argsort()][-10:]
print('outer range (low) of the distribution:')
print(low_range)
print('\nouter range (high) of the distribution:')
print(high_range)

  • The data are standardized in order to judge which values are outliers.
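To make the standardization step concrete, the sketch below computes the same z-scores by hand with NumPy on a few made-up prices (an assumption for illustration). StandardScaler divides by the population standard deviation, which is also the NumPy default (ddof=0), so the two approaches should agree.

```python
import numpy as np

# Made-up prices for illustration; the last one is deliberately extreme.
prices = np.array([100000, 150000, 160000, 170000,
                   180000, 190000, 200000, 755000], dtype=float)

# z-score: distance from the mean in units of standard deviation.
z = (prices - prices.mean()) / prices.std()

order = np.argsort(z)
low_range = z[order][:3]     # the three smallest standardized values
high_range = z[order][-3:]   # the three largest standardized values

print("outer range (low):", low_range)
print("outer range (high):", high_range)
```

Values far from 0 at either end of the sorted array are the candidates to inspect as outliers.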


✏️ scatter plot (GrLivArea, SalePrice)

  • The two points at the bottom right of the plot are judged to be outliers and removed.

  • The two points at the top right of the plot follow the trend, so they are kept.


✏️ Deleting points

df_train.sort_values(by='GrLivArea', ascending=False)[:2]
# Sort by GrLivArea in descending order and show only the two rows with the largest GrLivArea

# Delete the rows (outliers) with Id 1299 and 524
df_train = df_train.drop(df_train[df_train['Id'] == 1299].index)
df_train = df_train.drop(df_train[df_train['Id'] == 524].index)

  • The two points judged to be outliers are removed.
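An alternative to hard-coding the Ids is to express "bottom right of the plot" as a condition. The sketch below does this on toy rows; the thresholds (GrLivArea above 4000 with SalePrice below 300000) are hypothetical values chosen for the example, not something stated in the reference solution.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "GrLivArea": [1500, 2200, 4676, 5642, 4316],
    "SalePrice": [180000, 250000, 184750, 160000, 755000],
})

# Hypothetical thresholds: very large living area but a low price.
is_outlier = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 300000)

outlier_ids = toy.loc[is_outlier, "Id"].tolist()
cleaned = toy[~is_outlier].reset_index(drop=True)

print("Removed Ids:", outlier_ids)
print(cleaned)
```

The large but expensive house (Id 5 in the toy data) is kept because it follows the trend.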


✏️ scatter plot (SalePrice, TotalBsmtSF)

  • No observation stands out clearly enough to be treated as an outlier.

⚡ checking assumption

1) Normality

✏️ normality (SalePrice)

# histogram and normal probability plot(Q-Q Plot)
sns.distplot(df_train['SalePrice'], fit=norm)
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)


  • SalePrice does not appear to follow a normal distribution.

Log transformation

df_train['SalePrice'] = np.log(df_train['SalePrice'])
# transformed histogram and normal probability plot
sns.distplot(df_train['SalePrice'], fit=norm)
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
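The effect of the log transformation can also be checked numerically rather than only by eye. The sketch below uses a synthetic log-normal sample as a stand-in for SalePrice (an assumption, not the Kaggle data) and compares the skewness and the probability-plot correlation before and after the transform.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic right-skewed stand-in for SalePrice (NOT the Kaggle data).
rng = np.random.default_rng(0)
sale_price = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=2000)))
log_price = np.log(sale_price)

skew_before = sale_price.skew()
skew_after = log_price.skew()

# Without plot=..., probplot only returns the numbers:
# ((theoretical quantiles, ordered values), (slope, intercept, r)).
# r close to 1 means the points lie close to the straight reference line.
_, (_, _, r_before) = stats.probplot(sale_price)
_, (_, _, r_after) = stats.probplot(log_price)

print(f"skewness: {skew_before:.3f} -> {skew_after:.3f}")
print(f"probplot r: {r_before:.4f} -> {r_after:.4f}")
```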

✏️ normality (GrLivArea)

# histogram and normal probability plot
sns.distplot(df_train['GrLivArea'], fit=norm)
fig = plt.figure()
res = stats.probplot(df_train['GrLivArea'], plot=plt)

Log transformation

df_train['GrLivArea'] = np.log(df_train['GrLivArea'])
# transformed histogram and normal probability plot
sns.distplot(df_train['GrLivArea'], fit=norm)
fig = plt.figure()
res = stats.probplot(df_train['GrLivArea'], plot=plt)

✏️ normality (TotalBsmtSF)

# histogram and normal probability plot
sns.distplot(df_train['TotalBsmtSF'], fit=norm)
fig = plt.figure()
res = stats.probplot(df_train['TotalBsmtSF'], plot=plt)


  • Many observations have a value of 0 (houses with no basement) → the log transformation cannot be applied directly, since log(0) is undefined
    => Create a variable that takes the value 0 or 1 depending on whether a basement exists, and apply the log transformation only to the non-zero observations


✏️ Create a new variable

# Create a new variable (encode the presence of a basement as 0 or 1)
df_train['HasBsmt'] = 0
df_train.loc[df_train['TotalBsmtSF'] > 0, 'HasBsmt'] = 1    # HasBsmt is 1 where TotalBsmtSF is greater than 0
# transform data
has_bsmt = df_train['HasBsmt'] == 1
df_train['TotalBsmtSF'] = df_train['TotalBsmtSF'].astype(float)   # avoid writing floats into an integer column
df_train.loc[has_bsmt, 'TotalBsmtSF'] = np.log(df_train.loc[has_bsmt, 'TotalBsmtSF'])   # log-transform only where a basement exists
# histogram and normal probability plot
sns.distplot(df_train[df_train['TotalBsmtSF'] > 0]['TotalBsmtSF'], fit=norm)
fig = plt.figure()
res = stats.probplot(df_train[df_train['TotalBsmtSF'] > 0]['TotalBsmtSF'], plot=plt)
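The conditional transform is easier to follow on a handful of rows. The sketch below applies the same idea to toy values (made up for illustration): rows without a basement keep 0, and only the rows flagged by HasBsmt are log-transformed.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; 0 means "no basement".
toy = pd.DataFrame({"TotalBsmtSF": [0.0, 856.0, 0.0, 1262.0, 920.0]})

# Binary flag: 1 if the house has a basement, else 0.
toy["HasBsmt"] = (toy["TotalBsmtSF"] > 0).astype(int)

# Take the log only on the flagged rows, so log(0) is never evaluated.
has_bsmt = toy["HasBsmt"] == 1
toy.loc[has_bsmt, "TotalBsmtSF"] = np.log(toy.loc[has_bsmt, "TotalBsmtSF"])

print(toy)
```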

2) homoscedasticity

✏️ scatter plot (SalePrice, GrLivArea)

plt.scatter(df_train['GrLivArea'], df_train['SalePrice'])

  • Homoscedasticity appears to hold.

✏️ scatter plot (SalePrice, TotalBsmtSF)

plt.scatter(df_train[df_train['TotalBsmtSF'] > 0]['TotalBsmtSF'], df_train[df_train['TotalBsmtSF'] > 0]['SalePrice'])

  • Homoscedasticity appears to hold.
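Judging homoscedasticity from a scatter plot is subjective, so a rough numeric companion can help. The sketch below is not a formal test: it fits a straight line and compares the residual variance in the upper and lower halves of x. The data are synthetic, and variance_ratio is a helper written for this example, not a library function.

```python
import numpy as np

def variance_ratio(x, y):
    """Residual variance for x above the median divided by that below it."""
    slope, intercept = np.polyfit(x, y, deg=1)   # coefficients, highest degree first
    residuals = y - (slope * x + intercept)
    upper = residuals[x > np.median(x)]
    lower = residuals[x <= np.median(x)]
    return upper.var() / lower.var()

rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(6.0, 8.5, size=n)   # stand-in for a log-transformed area

# Constant noise: the spread does not depend on x (homoscedastic).
y_even = 5.0 + 0.9 * x + rng.normal(scale=0.2, size=n)
# Noise that grows with x: a cone-shaped scatter (heteroscedastic).
y_cone = 5.0 + 0.9 * x + rng.normal(scale=0.1 * (x - 5.0), size=n)

ratio_even = variance_ratio(x, y_even)
ratio_cone = variance_ratio(x, y_cone)

print(f"constant noise: ratio = {ratio_even:.2f}")
print(f"growing noise:  ratio = {ratio_cone:.2f}")
```

A ratio near 1 is consistent with equal spread, while a ratio well above 1 corresponds to the cone shape that the log transformation is meant to reduce.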

⚡ dummy variables

✏️ Convert categorical variables to dummy variables

df_train = pd.get_dummies(df_train)
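What get_dummies does to the frame is clearest on a tiny example. In the sketch below (toy values, for illustration only) the numeric column is passed through unchanged and the string column is expanded into one indicator column per category.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

# Numeric columns are kept; each string column becomes
# one indicator column per category, named <column>_<category>.
encoded = pd.get_dummies(toy)

print(encoded)
```

Depending on the pandas version the indicator columns are boolean or 0/1 integers; either way they can be used as numeric features.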

❗️ References ❗️

House Prices - Advanced Regression Techniques
Comprehensive data exploration with Python (by Pedro Marcelino)
Basic heatmap syntax
