You’re the average of the five people you spend the most time with.
-Jim Rohn

How does it work?

A KNN model simply stores the data passed in through the fit() method.

⇒ When given a new input to predict, the model looks up the k stored data points nearest to that input (k = 5 by default) and predicts from them.
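The two steps above (store the data, then look up the k nearest points) can be sketched from scratch. This is only an illustration, not how scikit-learn implements it: the function name `knn_predict` and the toy data are made up for the example, and Euclidean distance with a simple majority vote is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(train_data, train_target, new_point, k=5):
    """Predict the class of new_point by majority vote of its k nearest neighbors."""
    train_data = np.asarray(train_data, dtype=float)
    new_point = np.asarray(new_point, dtype=float)
    # Euclidean distance from the new point to every stored sample
    distances = np.sqrt(((train_data - new_point) ** 2).sum(axis=1))
    # indexes of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # most common label among those neighbors
    votes = Counter(train_target[i] for i in nearest)
    return votes.most_common(1)[0][0]


# hypothetical toy data: two features per sample, labels 0 and 1
toy_data = [[1, 1], [1, 2], [2, 1], [2, 2], [8, 8], [8, 9], [9, 8], [9, 9]]
toy_target = [0, 0, 0, 0, 1, 1, 1, 1]

prediction = knn_predict(toy_data, toy_target, [3, 3], k=3)
```

Here the three points nearest to [3, 3] all carry label 0, so that is the predicted class.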


What is there to know?

📌 Distance-based algorithms (including KNN) are sensitive to the scale of each feature, because features with larger numeric ranges dominate the distance between data points
(= appropriate preprocessing of the data, i.e. scaling, is required).
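A small sketch of why scaling matters, using made-up samples (the lengths and weights below are hypothetical). Before scaling, the feature with the larger numbers decides which sample looks "near"; after z-score scaling both features count on a comparable scale.

```python
import numpy as np

# hypothetical samples: [length in cm, weight in g]
data = np.array([
    [25.0, 150.0],   # sample A
    [26.0, 1000.0],  # sample B: similar length, very different weight
    [40.0, 160.0],   # sample C: very different length, similar weight
])


def euclidean(p, q):
    return float(np.sqrt(((p - q) ** 2).sum()))


# raw distances from A: the weight column dominates because its numbers are larger
raw_ab = euclidean(data[0], data[1])
raw_ac = euclidean(data[0], data[2])

# after z-score scaling, both features contribute on a comparable scale
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])
```

On the raw values B looks far more distant from A than C does, purely because weight is measured in larger numbers; on the scaled values the two distances come out roughly equal.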

📌 Over/underfitting can be addressed by adjusting the k value
(= increasing/decreasing the complexity of the model: a smaller k gives a more complex model, a larger k a simpler one).
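The two extremes of k can be seen in a hand-rolled sketch. The helper `knn_regress` and the data are hypothetical, and the feature is one-dimensional to keep the example short.

```python
import numpy as np


def knn_regress(train_x, train_y, query_x, k):
    """Predict each query as the mean target of its k nearest training points (1-D feature)."""
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y, dtype=float)
    predictions = []
    for q in np.asarray(query_x, dtype=float):
        nearest = np.argsort(np.abs(train_x - q))[:k]
        predictions.append(train_y[nearest].mean())
    return np.array(predictions)


# hypothetical noisy training data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.4, 3.8, 5.3, 5.9])

# k = 1: each training point is its own nearest neighbor, so the noise is memorised
pred_k1 = knn_regress(x, y, x, k=1)

# k = number of samples: every prediction collapses to the overall mean
pred_kn = knn_regress(x, y, x, k=len(x))
```

With k = 1 the model reproduces the training targets exactly (a sign of overfitting), while with k equal to the sample count it ignores the input altogether (underfitting). A useful k sits somewhere in between.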

📌 A KNN model is sensitive to outliers and missing values
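For the missing-value half of this point, a short sketch with made-up numbers: a single NaN turns the whole distance into NaN, so the row cannot be ranked as a neighbor. Dropping incomplete rows, as done here, is only one possible fix; imputing the missing values is another.

```python
import numpy as np

# hypothetical feature matrix with one missing value (NaN)
features = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 4.0],
])
query = np.array([2.0, 3.0])

# a single NaN makes the whole distance NaN for that row
distances = np.sqrt(((features - query) ** 2).sum(axis=1))

# simplest option: keep only the rows without missing values
complete_rows = ~np.isnan(features).any(axis=1)
clean_features = features[complete_rows]
```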


Step 1: Preprocessing - Scaling

Standard Score (z-Score)

How many standard deviations a feature value lies from the mean. This lets the model compare features on the same footing regardless of the size of the raw feature values.

import numpy as np

# compute the scaling statistics from the training set only
mean = np.mean(train_data, axis=0)
std = np.std(train_data, axis=0)

# scale the training set
train_scaled = (train_data - mean) / std

# fit the model on the scaled training set
kn.fit(train_scaled, train_target)

# scale the test set with the training-set statistics
test_scaled = (test_data - mean) / std

# check accuracy
kn.score(test_scaled, test_target)

# scale a new sample the same way
sample = (np.array([sample_x, sample_y]) - mean) / std

# predict
kn.predict([sample])
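The same scaling can also be done with scikit-learn's `StandardScaler`, which stores the training mean and standard deviation and reuses them on new data. The arrays below are hypothetical stand-ins for train_data and test_data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical data standing in for train_data / test_data
train_data = np.array([[25.0, 150.0], [26.0, 300.0], [30.0, 450.0], [40.0, 900.0]])
test_data = np.array([[28.0, 400.0]])

scaler = StandardScaler()
# fit_transform() learns the mean and standard deviation from the training set only
train_scaled = scaler.fit_transform(train_data)
# transform() reuses those training statistics on new data
test_scaled = scaler.transform(test_data)

# the manual z-score formula gives the same values
manual_test_scaled = (test_data - np.mean(train_data, axis=0)) / np.std(train_data, axis=0)
```

Either way, the key point is that test data and new samples are scaled with the statistics of the training set, never their own.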

Step 2: Utilising KNN

KNN Classification

A classification model is evaluated by its prediction accuracy.

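  • sklearn.neighbors : KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
kn = KNeighborsClassifier() # parameter: n_neighbors (default 5)
kn.fit(train_data, train_target)
kn.score(data, target) # accuracy
kn.predict([[x, y]]) # predicted class
kn.predict_proba(train_data) # class probabilities for each sample
kn.kneighbors([[x, y]]) # distances & indexes of the nearest neighbors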
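A runnable version of the calls above on a small made-up dataset (two well-separated groups; the values are hypothetical and chosen only so the result is easy to check by eye).

```python
from sklearn.neighbors import KNeighborsClassifier

# hypothetical toy data: two well-separated groups
train_data = [[1, 1], [1, 2], [2, 1], [2, 2], [8, 8], [8, 9], [9, 8], [9, 9]]
train_target = [0, 0, 0, 0, 1, 1, 1, 1]

kn = KNeighborsClassifier(n_neighbors=3)
kn.fit(train_data, train_target)

accuracy = kn.score(train_data, train_target)  # fraction of correct predictions
label = kn.predict([[3, 3]])                   # predicted class per sample
proba = kn.predict_proba([[3, 3]])             # share of neighbors in each class
distances, indexes = kn.kneighbors([[3, 3]])   # the 3 nearest training samples
```

`kneighbors` is handy for checking a surprising prediction: it shows exactly which stored samples the vote was based on.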

KNN Regression

= predicting a specific numeric value
A regression model is evaluated by its coefficient of determination.

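  • sklearn.neighbors : KNeighborsRegressor
from sklearn.neighbors import KNeighborsRegressor
knr = KNeighborsRegressor() # parameter: n_neighbors (default 5)
knr.fit(train_data, train_target)
knr.score(data, target) # coefficient of determination
knr.predict([[x, y]]) # predicted value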
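A runnable sketch of the regressor on made-up one-feature data. With the default uniform weights, the prediction is the mean target of the k nearest neighbors.

```python
from sklearn.neighbors import KNeighborsRegressor

# hypothetical toy data: one feature, target equal to the feature
train_data = [[1], [2], [3], [10], [11], [12]]
train_target = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

knr = KNeighborsRegressor(n_neighbors=3)
knr.fit(train_data, train_target)

# the 3 nearest neighbors of 2 are the samples 1, 2 and 3, so the prediction is their mean
prediction = knr.predict([[2]])
r_squared = knr.score(train_data, train_target)
```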

Coefficient of Determination (R Squared)

$$R^2 = 1 - \frac{\sum (\text{target} - \text{predict})^2}{\sum (\text{target} - \text{avg}(\text{target}))^2}$$
  • the “goodness of fit” is usually reported as a value between 0.0 and 1.0 (a model that fits worse than always predicting the mean scores below 0.0)
    ⇒ how good a given value is depends on the context of the analysis (the field)
  • R squared indicates the proportion of variance in the dependent variable that is explained by the regression model, i.e. by the predictor (independent) variables
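The formula can be checked by hand with a few lines of NumPy. The function name `r_squared` and the numbers are made up for illustration.

```python
import numpy as np


def r_squared(target, predict):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    target = np.asarray(target, dtype=float)
    predict = np.asarray(predict, dtype=float)
    residual = ((target - predict) ** 2).sum()
    total = ((target - target.mean()) ** 2).sum()
    return 1.0 - residual / total


target = [3.0, 5.0, 7.0, 9.0]

good = r_squared(target, [2.5, 5.5, 7.0, 9.5])      # close predictions
baseline = r_squared(target, [6.0, 6.0, 6.0, 6.0])  # always predicting the mean
poor = r_squared(target, [9.0, 7.0, 5.0, 3.0])      # worse than predicting the mean
```

Close predictions score near 1, always predicting the mean scores exactly 0, and predictions worse than the mean score below 0.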

Because a least-squares regression model is fit to minimise the squared error on the training data, its R squared on that data is never below zero, even when the predictor and outcome variables bear no relationship to one another.
⇒ R squared never decreases when a new predictor variable is added to the model, even if the new predictor is not associated with the outcome.

⇒ Adjusted R squared incorporates the same information as R squared but also penalises the number of predictor variables included in the model.
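The penalty can be sketched with the commonly used formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of predictors. The function name and the numbers are hypothetical.

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Adjusted R squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_predictors - 1 <= 0:
        raise ValueError("n_samples must be greater than n_predictors + 1")
    return 1.0 - (1.0 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)


# same R squared, more predictors: the adjusted value drops
few_predictors = adjusted_r_squared(0.90, n_samples=50, n_predictors=2)
many_predictors = adjusted_r_squared(0.90, n_samples=50, n_predictors=10)
```

For the same raw R squared, the model with more predictors gets the lower adjusted score, so an extra predictor only pays off if it raises R squared by enough to offset the penalty.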

데이터 나라의 앨리슨 👩🏼‍💻 Alyson in Dataland
