This dataset records a performance grade together with age, body measurements, and several exercise performance measurements.
Header | Description |
---|---|
age | 20–64 (years) |
gender | F, M |
height_cm | height (cm) |
weight_kg | weight (kg) |
body fat_% | body fat percentage |
diastolic | diastolic blood pressure (min) |
systolic | systolic blood pressure (min) |
gripForce | grip strength |
sit and bend forward_cm | sit-and-bend-forward distance (cm) |
sit-ups counts | number of sit-ups |
broad jump_cm | broad jump distance (cm) |
class | A, B, C, D (A: best) / stratified |
> df <- read.csv(file = 'bodyPerformance.csv', stringsAsFactors = TRUE)
> head(df)
age gender height_cm weight_kg body.fat_. diastolic systolic gripForce sit.and.bend.forward_cm sit.ups.counts
1 27 M 172 75.2 21.3 80 130 54.9 18.4 60
2 25 M 165 55.8 15.7 77 126 36.4 16.3 53
3 31 M 180 78.0 20.1 92 152 44.8 12.0 49
4 32 M 174 71.1 18.4 76 147 41.4 15.2 53
5 28 M 174 67.7 17.1 70 127 43.5 27.1 45
6 36 F 165 55.4 22.0 64 119 23.8 21.0 27
broad.jump_cm class
1 217 C
2 229 A
3 181 C
4 219 B
5 217 B
6 153 B
> str(df)
'data.frame': 13393 obs. of 12 variables:
$ age : num 27 25 31 32 28 36 42 33 54 28 ...
$ gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 1 1 2 2 2 ...
$ height_cm : num 172 165 180 174 174 ...
$ weight_kg : num 75.2 55.8 78 71.1 67.7 ...
$ body.fat_. : num 21.3 15.7 20.1 18.4 17.1 22 32.2 36.9 27.6 14.4 ...
$ diastolic : num 80 77 92 76 70 64 72 84 85 81 ...
$ systolic : num 130 126 152 147 127 119 135 137 165 156 ...
$ gripForce : num 54.9 36.4 44.8 41.4 43.5 23.8 22.7 45.9 40.4 57.9 ...
$ sit.and.bend.forward_cm: num 18.4 16.3 12 15.2 27.1 21 0.8 12.3 18.6 12.1 ...
$ sit.ups.counts : num 60 53 49 53 45 27 18 42 34 55 ...
$ broad.jump_cm : num 217 229 181 219 217 153 146 234 148 213 ...
$ class : Factor w/ 4 levels "A","B","C","D": 3 1 3 2 2 2 4 2 3 2 ...
> dim(df)
[1] 13393 12
> summary(df)
age gender height_cm weight_kg body.fat_. diastolic systolic gripForce
Min. :21.0 F:4926 Min. :125 Min. : 26.3 Min. : 3.0 Min. : 0.0 Min. : 0 Min. : 0.0
1st Qu.:25.0 M:8467 1st Qu.:162 1st Qu.: 58.2 1st Qu.:18.0 1st Qu.: 71.0 1st Qu.:120 1st Qu.:27.5
Median :32.0 Median :169 Median : 67.4 Median :22.8 Median : 79.0 Median :130 Median :37.9
Mean :36.8 Mean :169 Mean : 67.4 Mean :23.2 Mean : 78.8 Mean :130 Mean :37.0
3rd Qu.:48.0 3rd Qu.:175 3rd Qu.: 75.3 3rd Qu.:28.0 3rd Qu.: 86.0 3rd Qu.:141 3rd Qu.:45.2
Max. :64.0 Max. :194 Max. :138.1 Max. :78.4 Max. :156.2 Max. :201 Max. :70.5
sit.and.bend.forward_cm sit.ups.counts broad.jump_cm class
Min. :-25.0 Min. : 0.0 Min. : 0 A:3348
1st Qu.: 10.9 1st Qu.:30.0 1st Qu.:162 B:3347
Median : 16.2 Median :41.0 Median :193 C:3349
Mean : 15.2 Mean :39.8 Mean :190 D:3349
3rd Qu.: 20.7 3rd Qu.:50.0 3rd Qu.:221
Max. :213.0 Max. :80.0 Max. :303
There are no missing values, and judging from summary(df), the sit.and.bend.forward_cm and broad.jump_cm variables appear to contain outliers.
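To double-check the missing-value claim, a quick count can be run (a minimal sketch of my own, not part of the original session):
> sum(is.na(df))       # total number of NA cells; 0 means no missing values
> colSums(is.na(df))   # NA count per column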
> library(pastecs)
> library(psych)
> describeBy(df[, -c(2, 12)], df$class, mat = FALSE)
Descriptive statistics by group
group: A
vars n mean sd median trimmed mad min max range skew kurtosis se
age 1 3348 35.3 13.00 30.0 33.8 10.38 21.0 64.0 43.0 0.86 -0.56 0.22
height_cm 2 3348 167.9 7.84 168.0 167.9 8.90 140.5 191.8 51.3 -0.03 -0.53 0.14
weight_kg 3 3348 64.4 10.55 64.5 64.2 12.25 34.5 101.4 66.9 0.20 -0.56 0.18
body.fat_. 4 3348 20.5 6.44 20.2 20.4 6.82 3.0 78.4 75.4 0.42 1.33 0.11
diastolic 5 3348 77.9 10.60 78.0 78.0 11.12 41.0 126.0 85.0 -0.05 -0.19 0.18
systolic 6 3348 129.3 14.77 128.0 129.1 14.83 14.0 201.0 187.0 0.04 0.88 0.26
gripForce 7 3348 38.6 10.89 39.0 38.3 14.31 2.1 70.5 68.4 0.13 -1.06 0.19
sit.and.bend.forward_cm 8 3348 21.4 5.12 21.1 21.2 4.37 11.8 185.0 173.2 9.90 308.84 0.09
sit.ups.counts 9 3348 47.9 10.82 49.0 48.2 11.86 17.0 80.0 63.0 -0.31 -0.25 0.19
broad.jump_cm 10 3348 202.7 35.89 202.0 203.3 44.48 40.0 299.0 259.0 -0.13 -0.73 0.62
---------------------------------------------------------------------------------------
group: B
vars n mean sd median trimmed mad min max range skew kurtosis se
age 1 3347 37.1 13.70 32.0 36.0 13.34 21.0 64.0 43.0 0.59 -1.05 0.24
height_cm 2 3347 168.6 8.13 169.2 168.8 8.45 125.0 191.8 66.8 -0.23 -0.22 0.14
weight_kg 3 3347 66.6 10.88 67.1 66.5 11.71 31.9 111.8 79.9 0.11 -0.37 0.19
body.fat_. 4 3347 22.0 6.65 21.7 21.9 6.97 4.7 44.5 39.8 0.23 -0.43 0.11
diastolic 5 3347 78.7 10.66 79.0 78.8 10.38 6.0 112.0 106.0 -0.28 0.74 0.18
systolic 6 3347 130.6 14.48 130.0 130.6 14.83 43.9 193.0 149.1 -0.02 -0.02 0.25
gripForce 7 3347 37.9 10.39 39.6 37.9 12.31 0.0 69.9 69.9 -0.05 -0.85 0.18
sit.and.bend.forward_cm 8 3347 17.5 5.91 17.0 17.2 4.89 7.1 213.0 205.9 11.08 357.09 0.10
sit.ups.counts 9 3347 42.6 11.74 44.0 43.0 13.34 12.0 78.0 66.0 -0.28 -0.50 0.20
broad.jump_cm 10 3347 195.3 36.31 200.0 196.6 41.51 71.0 295.0 224.0 -0.27 -0.64 0.63
---------------------------------------------------------------------------------------
group: C
vars n mean sd median trimmed mad min max range skew kurtosis se
age 1 3349 36.7 13.78 32.0 35.5 13.34 21.0 64.0 43.0 0.58 -1.05 0.24
height_cm 2 3349 169.2 8.52 170.0 169.4 8.75 141.0 193.8 52.8 -0.28 -0.43 0.15
weight_kg 3 3349 66.8 10.86 67.2 66.7 11.71 38.1 103.6 65.5 0.06 -0.52 0.19
body.fat_. 4 3349 22.6 6.27 22.2 22.5 5.78 3.5 43.0 39.5 0.26 -0.20 0.11
diastolic 5 3349 78.5 10.64 79.0 78.7 10.38 40.0 156.2 116.2 -0.03 0.52 0.18
systolic 6 3349 129.9 14.52 130.0 129.9 14.83 91.0 195.0 104.0 0.05 -0.38 0.25
gripForce 7 3349 36.6 10.22 38.0 36.5 12.01 0.0 65.2 65.2 -0.05 -0.89 0.18
sit.and.bend.forward_cm 8 3349 14.4 5.88 14.0 14.1 6.52 2.3 37.0 34.7 0.43 -0.24 0.10
sit.ups.counts 9 3349 38.7 12.73 40.0 39.1 13.34 7.0 71.0 64.0 -0.25 -0.60 0.22
broad.jump_cm 10 3349 188.6 39.36 194.0 190.4 43.00 0.0 303.0 303.0 -0.44 -0.07 0.68
---------------------------------------------------------------------------------------
group: D
vars n mean sd median trimmed mad min max range skew kurtosis se
age 1 3349 38.06 13.86 36.0 37.18 17.79 21.0 64.0 43.0 0.39 -1.23 0.24
height_cm 2 3349 168.63 9.11 169.7 168.84 9.93 139.5 192.0 52.5 -0.22 -0.56 0.16
weight_kg 3 3349 72.00 13.88 71.8 71.63 14.68 26.3 138.1 111.8 0.30 0.09 0.24
body.fat_. 4 3349 27.74 7.51 27.4 27.67 7.56 3.5 54.9 51.4 0.10 -0.11 0.13
diastolic 5 3349 80.08 10.96 80.0 80.33 11.86 0.0 120.0 120.0 -0.31 0.48 0.19
systolic 6 3349 131.08 15.02 131.0 131.28 16.31 0.0 181.0 181.0 -0.26 0.97 0.26
gripForce 7 3349 34.75 10.58 35.6 34.69 12.60 0.0 70.4 70.4 0.00 -0.70 0.18
sit.and.bend.forward_cm 8 3349 7.59 9.39 7.7 7.89 9.64 -25.0 35.2 60.2 -0.28 -0.19 0.16
sit.ups.counts 9 3349 29.88 15.03 30.0 30.19 16.31 0.0 78.0 78.0 -0.14 -0.62 0.26
broad.jump_cm 10 3349 173.82 41.82 180.0 175.76 44.48 0.0 275.0 275.0 -0.48 0.01 0.72
Comparing the group means, weight differs noticeably between groups A and D, at about 64 kg and 72 kg respectively. Most variables have skewness and kurtosis close to 0, suggesting roughly normal-looking distributions, but sit.and.bend.forward_cm in group A, with skewness 9.9 and kurtosis 308.84, deviates strongly from normality.
> df_numeric <- df[, -c(2, 12)]
> cor_df <- cor(df_numeric)
> cor_df
age height_cm weight_kg body.fat_. diastolic systolic gripForce sit.and.bend.forward_cm
age 1.000 -0.294 -0.1000 0.2423 0.1585 0.2112 -0.180 -0.0700
height_cm -0.294 1.000 0.7349 -0.5154 0.1459 0.2102 0.735 -0.2220
weight_kg -0.100 0.735 1.0000 -0.0841 0.2623 0.3389 0.700 -0.2962
body.fat_. 0.242 -0.515 -0.0841 1.0000 0.0481 -0.0304 -0.542 -0.0712
diastolic 0.159 0.146 0.2623 0.0481 1.0000 0.6763 0.202 -0.0721
systolic 0.211 0.210 0.3389 -0.0304 0.6763 1.0000 0.286 -0.0824
gripForce -0.180 0.735 0.7001 -0.5418 0.2021 0.2860 1.000 -0.1126
sit.and.bend.forward_cm -0.070 -0.222 -0.2962 -0.0712 -0.0721 -0.0824 -0.113 1.0000
sit.ups.counts -0.545 0.500 0.2949 -0.6089 0.0165 0.0563 0.577 0.1772
broad.jump_cm -0.435 0.675 0.4796 -0.6733 0.0972 0.1529 0.747 0.0265
sit.ups.counts broad.jump_cm
age -0.5446 -0.4352
height_cm 0.5004 0.6746
weight_kg 0.2949 0.4796
body.fat_. -0.6089 -0.6733
diastolic 0.0165 0.0972
systolic 0.0563 0.1529
gripForce 0.5767 0.7469
sit.and.bend.forward_cm 0.1772 0.0265
sit.ups.counts 1.0000 0.7483
broad.jump_cm 0.7483 1.0000
> library(PerformanceAnalytics)
> chart.Correlation(df_numeric, histogram = TRUE, pch = 19, method = 'pearson')
No pair of variables is correlated above 0.8, but let's remove the variables involved in correlations above 0.74.
> library(caret)
> findCorrelation(cor_df, cutoff = 0.74)
[1] 7 10
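The returned indices can be mapped back to column names (a quick check, reusing the objects defined above):
> names(df_numeric)[findCorrelation(cor_df, cutoff = 0.74)]
As the str() output below confirms, columns 7 and 10 are gripForce and broad.jump_cm.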
> str(df_numeric)
'data.frame': 13393 obs. of 10 variables:
$ age : num 27 25 31 32 28 36 42 33 54 28 ...
$ height_cm : num 172 165 180 174 174 ...
$ weight_kg : num 75.2 55.8 78 71.1 67.7 ...
$ body.fat_. : num 21.3 15.7 20.1 18.4 17.1 22 32.2 36.9 27.6 14.4 ...
$ diastolic : num 80 77 92 76 70 64 72 84 85 81 ...
$ systolic : num 130 126 152 147 127 119 135 137 165 156 ...
$ gripForce : num 54.9 36.4 44.8 41.4 43.5 23.8 22.7 45.9 40.4 57.9 ...
$ sit.and.bend.forward_cm: num 18.4 16.3 12 15.2 27.1 21 0.8 12.3 18.6 12.1 ...
$ sit.ups.counts : num 60 53 49 53 45 27 18 42 34 55 ...
$ broad.jump_cm : num 217 229 181 219 217 153 146 234 148 213 ...
> df_numeric_2 <- df_numeric[, -c(7, 10)]
> str(df_numeric_2)
'data.frame': 13393 obs. of 8 variables:
$ age : num 27 25 31 32 28 36 42 33 54 28 ...
$ height_cm : num 172 165 180 174 174 ...
$ weight_kg : num 75.2 55.8 78 71.1 67.7 ...
$ body.fat_. : num 21.3 15.7 20.1 18.4 17.1 22 32.2 36.9 27.6 14.4 ...
$ diastolic : num 80 77 92 76 70 64 72 84 85 81 ...
$ systolic : num 130 126 152 147 127 119 135 137 165 156 ...
$ sit.and.bend.forward_cm: num 18.4 16.3 12 15.2 27.1 21 0.8 12.3 18.6 12.1 ...
$ sit.ups.counts : num 60 53 49 53 45 27 18 42 34 55 ...
> nearZeroVar(df_numeric_2, saveMetrics = TRUE)
freqRatio percentUnique zeroVar nzv
age 1.221800 0.3285298 FALSE FALSE
height_cm 1.125000 3.4868961 FALSE FALSE
weight_kg 1.039216 10.4382887 FALSE FALSE
body.fat_. 1.034483 3.9348914 FALSE FALSE
diastolic 1.390041 0.6645262 FALSE FALSE
systolic 1.237981 0.7615919 FALSE FALSE
sit.and.bend.forward_cm 1.166667 3.9423579 FALSE FALSE
sit.ups.counts 1.055838 0.6047935 FALSE FALSE
No variable has near-zero variance! (A near-zero-variance variable would show TRUE in the nzv column above; by caret's defaults that requires both a frequency ratio above 95/5 and fewer than 10% unique values.)
> df_numeric_2$gender <- df$gender
> df_numeric_2$class <- df$class
> str(df_numeric_2)
'data.frame': 13393 obs. of 10 variables:
$ age : num 27 25 31 32 28 36 42 33 54 28 ...
$ height_cm : num 172 165 180 174 174 ...
$ weight_kg : num 75.2 55.8 78 71.1 67.7 ...
$ body.fat_. : num 21.3 15.7 20.1 18.4 17.1 22 32.2 36.9 27.6 14.4 ...
$ diastolic : num 80 77 92 76 70 64 72 84 85 81 ...
$ systolic : num 130 126 152 147 127 119 135 137 165 156 ...
$ sit.and.bend.forward_cm: num 18.4 16.3 12 15.2 27.1 21 0.8 12.3 18.6 12.1 ...
$ sit.ups.counts : num 60 53 49 53 45 27 18 42 34 55 ...
$ gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 1 1 2 2 2 ...
$ class : Factor w/ 4 levels "A","B","C","D": 3 1 3 2 2 2 4 2 3 2 ...
Excluding the outcome variable class, we have nine explanatory variables, which seems reasonable.
> library(GGally)
> ggpairs(df_numeric_2, aes(colour = class))
*(figure: ggpairs plot of df_numeric_2, coloured by class)*
The plot colours each panel by class. Looking at the class column on the far right, outliers beyond the boxplot whiskers are visible. As the skewness/kurtosis analysis above suggested, the histogram of sit.and.bend.forward_cm is concentrated on the left and sharply peaked.
One more remark: the plot takes quite a long time to render. With many variables, it is better not to attempt ggpairs.
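Since the cost of ggpairs grows quickly with the number of rows and columns, one workaround (my addition, not in the original post) is to draw the plot on a random subsample:
> set.seed(1)
> ggpairs(df_numeric_2[sample(nrow(df_numeric_2), 2000), ], aes(colour = class))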
> library(reshape2)
> model_scale <- preProcess(df_numeric_2, method = c('center', 'scale'))
> scaled_df <- predict(model_scale, df_numeric_2)
> scaled_df_2 <- scaled_df[, -9]
> melt_df <- melt(scaled_df_2, id.vars = 'class')
> head(melt_df)
class variable value
1 C age -0.71740533
2 A age -0.86418743
3 C age -0.42384113
4 B age -0.35045008
5 B age -0.64401428
6 B age -0.05688587
> library(ggplot2)
> library(gridExtra)
> library(gapminder)
> library(dplyr)
> p1 <- ggplot(melt_df, aes(x = variable, y = value, fill = class)) +
+ geom_boxplot()
> p1
*(figure: boxplots by class)*
Every variable except age has outliers; sit.and.bend.forward_cm in particular shows extreme ones.
Looking at the boxplots by class, we can also guess that the variables that best distinguish the groups are weight_kg, body.fat_., sit.and.bend.forward_cm, and sit.ups.counts.
> outHigh <- function(x) {
+   # replace values above the 95th percentile with the 75th percentile
+   x[x > quantile(x, 0.95)] <- quantile(x, 0.75)
+   x
+ }
> outLow <- function(x) {
+   # replace values below the 5th percentile with the 25th percentile
+   x[x < quantile(x, 0.05)] <- quantile(x, 0.25)
+   x
+ }
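As a toy illustration of what these functions do (my addition; the numbers come from R's default type-7 quantiles):
> outHigh(1:100)   # values above the 95th percentile (95.05), i.e. 96-100, all become 75.25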
> str(df_numeric_2)
'data.frame': 13393 obs. of 10 variables:
$ age : num 27 25 31 32 28 36 42 33 54 28 ...
$ height_cm : num 172 165 180 174 174 ...
$ weight_kg : num 75.2 55.8 78 71.1 67.7 ...
$ body.fat_. : num 21.3 15.7 20.1 18.4 17.1 22 32.2 36.9 27.6 14.4 ...
$ diastolic : num 80 77 92 76 70 64 72 84 85 81 ...
$ systolic : num 130 126 152 147 127 119 135 137 165 156 ...
$ sit.and.bend.forward_cm: num 18.4 16.3 12 15.2 27.1 21 0.8 12.3 18.6 12.1 ...
$ sit.ups.counts : num 60 53 49 53 45 27 18 42 34 55 ...
$ gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 1 1 2 2 2 ...
$ class : Factor w/ 4 levels "A","B","C","D": 3 1 3 2 2 2 4 2 3 2 ...
> df <- data.frame(lapply(df_numeric_2[, -c(9, 10)], outHigh))
> df <- data.frame(lapply(df, outLow))
> df$gender <- df_numeric_2$gender
> df$class <- df_numeric_2$class
> str(df)
'data.frame': 13393 obs. of 10 variables:
$ age : num 27 25 31 32 28 36 42 33 54 28 ...
$ height_cm : num 172 165 180 174 174 ...
$ weight_kg : num 75.2 55.8 78 71.1 67.7 ...
$ body.fat_. : num 21.3 15.7 20.1 18.4 17.1 22 32.2 28 27.6 14.4 ...
$ diastolic : num 80 77 92 76 70 64 72 84 85 81 ...
$ systolic : num 130 126 152 147 127 119 135 137 141 141 ...
$ sit.and.bend.forward_cm: num 18.4 16.3 12 15.2 20.7 21 0.8 12.3 18.6 12.1 ...
$ sit.ups.counts : num 60 53 49 53 45 27 18 42 34 55 ...
$ gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 1 1 2 2 2 ...
$ class : Factor w/ 4 levels "A","B","C","D": 3 1 3 2 2 2 4 2 3 2 ...
> library(reshape2)
> scale <- preProcess(df, method = c('center', 'scale'))
> scaled <- predict(scale, df)
> scaled_2 <- scaled[, -9]
> melt <- melt(scaled_2, id.vars = 'class')
> head(melt)
class variable value
1 C age -0.71905005
2 A age -0.87434140
3 C age -0.40846737
4 B age -0.33082169
5 B age -0.64140438
6 B age -0.02023901
> p2 <- ggplot(melt, aes(x = variable, y = value, fill = class)) + geom_boxplot()
> p2
*(figure: boxplots after outlier replacement)*
The outliers appear to have been replaced successfully. The reason outlier points still show up even after replacement is that, to keep the data realistic, only the extreme 5% in each tail was replaced, with the 75th and 25th percentiles respectively.
> library(caret)
> idx <- createDataPartition(df$class, p = 0.7)
> train <- df[idx$Resample1, ]
> test <- df[-idx$Resample1, ]
> table(train$class)
A B C D
2344 2343 2345 2345
Since the counts of the outcome classes A, B, C, and D are nearly equal, no upsampling or SMOTE is performed.
> model_train <- preProcess(train, method = c('center', 'scale'))
> model_test <- preProcess(test, method = c('center', 'scale'))
> scaled_train <- predict(model_train, train)
> scaled_test <- predict(model_test, test)
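One caveat (my addition, not part of the original workflow): fitting a separate preProcess model on the test set lets test-set statistics leak into the preprocessing. The usual pattern is to reuse the centers and scales estimated on the training set:
> scaled_test <- predict(model_train, test)   # apply training-set centering/scaling to the test set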
> library(class)
> library(kknn)
> library(e1071)
> library(caret)
> library(MASS)
> library(reshape2)
> library(ggplot2)
> library(kernlab)
> library(corrplot)
> grid1 <- expand.grid(.k = seq(2, 30, by = 1))
> control <- trainControl(method = 'cv')
> set.seed(502)
> knn.train <- train(class ~., data = scaled_train,
+ method = 'knn',
+ trControl = control,
+ tuneGrid = grid1)
> knn.train
k-Nearest Neighbors
9377 samples
9 predictor
4 classes: 'A', 'B', 'C', 'D'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 8441, 8438, 8439, 8438, 8438, 8440, ...
Resampling results across tuning parameters:
k Accuracy Kappa
2 0.5005833 0.3341175
3 0.5406863 0.3875905
4 0.5488974 0.3985361
5 0.5579629 0.4106229
6 0.5626612 0.4168893
.
.
.
24 0.5928352 0.4571245
25 0.5892094 0.4522887
26 0.5860078 0.4480183
27 0.5867555 0.4490160
28 0.5869688 0.4493018
29 0.5877152 0.4502974
30 0.5896380 0.4528611
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 24.
The tuning selected k = 24; with k = 24 the kappa statistic is 0.457.
Kappa | Strength of agreement |
---|---|
< 0.20 | Poor |
0.21 – 0.40 | Fair |
0.41 – 0.60 | Moderate |
0.61 – 0.80 | Good |
0.81 – 1.00 | Very good |
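For reference, Cohen's kappa is defined as (standard definition, not from the original post):
κ = (p_o − p_e) / (1 − p_e)
where p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance alone.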
> knn.test <- knn(scaled_train[, -10], scaled_test[, -10], scaled_train[, 10], k = 24)
Error in knn(scaled_train[, -10], scaled_test[, -10], scaled_train[, 10], :
NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(scaled_train[, -10], scaled_test[, -10], scaled_train[, 10], :
NAs introduced by coercion
2: In knn(scaled_train[, -10], scaled_test[, -10], scaled_train[, 10], :
NAs introduced by coercion
An error occurred... what went wrong...? The predictor gender being a factor looks like the problem: class::knn() works only on numeric matrices, so the factor column is coerced and produces NAs.
> dummies <- dummyVars(class ~., data = scaled_train)
> dummies
Dummy Variable Object
Formula: class ~ .
10 variables, 2 factors
Variables and levels will be separated by '.'
A less than full rank encoding is used
> scaled_train_knn <- as.data.frame(predict(dummies, newdata = scaled_train))
Warning message:
In model.frame.default(Terms, newdata, na.action = na.action, xlev = object$lvls) :
variable 'class' is not a factor
> names(scaled_train_knn)
[1] "age" "height_cm" "weight_kg" "body.fat_."
[5] "diastolic" "systolic" "sit.and.bend.forward_cm" "sit.ups.counts"
[9] "gender.F" "gender.M"
> str(scaled_train_knn)
'data.frame': 9377 obs. of 10 variables:
$ age : num -0.876 -0.41 -0.332 0.444 1.609 ...
$ height_cm : num -0.519 1.603 0.862 -0.592 -0.897 ...
$ weight_kg : num -1.176 1.156 0.431 -0.346 -0.924 ...
$ body.fat_. : num -1.272 -0.512 -0.806 1.577 -0.374 ...
$ diastolic : num -0.21 1.5 -0.324 -0.779 -1.121 ...
$ systolic : num -0.355 1.761 1.354 0.377 -1.983 ...
$ sit.and.bend.forward_cm: num 0.1143 -0.6071 -0.0702 -2.4863 0.8526 ...
$ sit.ups.counts : num 1.118 0.769 1.118 -1.939 -0.891 ...
$ gender.F : num 0 0 0 1 1 1 0 0 1 0 ...
$ gender.M : num 1 1 1 0 0 0 1 1 0 1 ...
> scaled_train_knn$class <- scaled_train$class
> dummies2 <- dummyVars(class ~., data = scaled_test)
> dummies2
Dummy Variable Object
Formula: class ~ .
10 variables, 2 factors
Variables and levels will be separated by '.'
A less than full rank encoding is used
> scaled_test_knn <- as.data.frame(predict(dummies2, newdata = scaled_test))
Warning message:
In model.frame.default(Terms, newdata, na.action = na.action, xlev = object$lvls) :
variable 'class' is not a factor
> str(scaled_test_knn)
'data.frame': 4016 obs. of 10 variables:
$ age : num -0.7149 -0.6374 -0.0168 -0.2495 1.3795 ...
$ height_cm : num 0.507 0.723 -0.487 0.881 -0.285 ...
$ weight_kg : num 0.8247 0.0372 -1.2474 1.0294 0.0163 ...
$ body.fat_. : num -0.309 -1.046 -0.186 0.866 0.796 ...
$ diastolic : num 0.139 -1 -1.683 0.594 0.708 ...
$ systolic : num -0.0239 -0.2707 -0.9286 0.5518 0.8808 ...
$ sit.and.bend.forward_cm: num 0.47 0.86 0.911 -0.565 0.504 ...
$ sit.ups.counts : num 1.732 0.407 -1.183 0.142 -0.565 ...
$ gender.F : num 0 0 1 0 0 0 0 1 0 1 ...
$ gender.M : num 1 1 0 1 1 1 1 0 1 0 ...
> scaled_test_knn$class <- scaled_test$class
> knn.test <- knn(scaled_train_knn[, -11], scaled_test_knn[, -11], scaled_train_knn[, 11], k = 24)
> confusionMatrix(knn.test, scaled_test_knn$class, positive = 'A')
Confusion Matrix and Statistics
Reference
Prediction A B C D
A 799 348 159 41
B 171 441 224 98
C 22 168 500 249
D 12 47 121 616
Overall Statistics
Accuracy : 0.5867
95% CI : (0.5712, 0.6019)
No Information Rate : 0.25
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4489
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: A Class: B Class: C Class: D
Sensitivity 0.7958 0.4392 0.4980 0.6135
Specificity 0.8181 0.8363 0.8542 0.9402
Pos Pred Value 0.5932 0.4722 0.5325 0.7739
Neg Pred Value 0.9232 0.8173 0.8362 0.8795
Prevalence 0.2500 0.2500 0.2500 0.2500
Detection Rate 0.1990 0.1098 0.1245 0.1534
Detection Prevalence 0.3354 0.2326 0.2338 0.1982
Balanced Accuracy 0.8069 0.6378 0.6761 0.7769
Accuracy came out to about 58.7%, with a kappa statistic of 0.4489.
> kknn.train <- train.kknn(class ~., data = scaled_train_knn, kmax = 30, distance = 2,
+ kernel = c('rectangular', 'triangular', 'epanechnikov'))
> plot(kknn.train)
Here distance = 2 selects Euclidean distance (it is the Minkowski distance parameter); setting it to 1 would select Manhattan (absolute-difference) distance.
The plot above puts k on the x-axis and the proportion of observations misclassified by each kernel on the y-axis. At a glance, the error looks lowest with the epanechnikov kernel at k = 29.
> kknn.train
Call:
train.kknn(formula = class ~ ., data = scaled_train_knn, kmax = 30, distance = 2, kernel = c("rectangular", "triangular", "epanechnikov"))
Type of response variable: nominal
Minimal misclassification: 0.400128
Best kernel: epanechnikov
Best k: 29
> kknn.pred <- predict(kknn.train, newdata = scaled_test_knn)
> confusionMatrix(kknn.pred, scaled_test_knn$class)
Confusion Matrix and Statistics
Reference
Prediction A B C D
A 789 346 146 33
B 178 441 227 109
C 23 168 511 232
D 14 49 120 630
Overall Statistics
Accuracy : 0.5904
95% CI : (0.575, 0.6057)
No Information Rate : 0.25
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4539
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: A Class: B Class: C Class: D
Sensitivity 0.7859 0.4392 0.5090 0.6275
Specificity 0.8257 0.8293 0.8596 0.9392
Pos Pred Value 0.6005 0.4618 0.5471 0.7749
Neg Pred Value 0.9204 0.8161 0.8400 0.8832
Prevalence 0.2500 0.2500 0.2500 0.2500
Detection Rate 0.1965 0.1098 0.1272 0.1569
Detection Prevalence 0.3272 0.2378 0.2326 0.2024
Balanced Accuracy 0.8058 0.6343 0.6843 0.7834
Without kernel weighting: accuracy 58.7%, kappa 0.4489. With kernel weighting: accuracy 59.0%, kappa 0.4539.
The kernel-weighted model performs just slightly better than the unweighted one.
> library(class)
> library(kknn)
> library(e1071)
> library(caret)
> library(MASS)
> library(kernlab)
> library(corrplot)
> linear.tune <- tune.svm(class ~., data = scaled_train,
+ kernel = 'linear',
+ cost = c(0.001, 0.01, 0.1, 0.5, 1, 2, 5, 10))
> summary(linear.tune)
Parameter tuning of ‘svm’:
- sampling method: 10-fold cross validation
- best parameters:
cost
0.1
- best performance: 0.4456618
- Detailed performance results:
cost error dispersion
1 1e-03 0.4645370 0.01667856
2 1e-02 0.4470478 0.01569832
3 1e-01 0.4456618 0.01619672
4 5e-01 0.4463020 0.01493139
5 1e+00 0.4460888 0.01475279
6 2e+00 0.4466221 0.01448448
7 5e+00 0.4466219 0.01446311
8 1e+01 0.4463021 0.01458349
The optimal cost parameter is 0.1.
> best.linear <- linear.tune$best.model
> tune.test <- predict(best.linear, newdata = scaled_test)
> confusionMatrix(tune.test, scaled_test$class)
Confusion Matrix and Statistics
Reference
Prediction A B C D
A 733 301 141 44
B 252 405 233 118
C 13 226 410 168
D 6 72 220 674
Overall Statistics
Accuracy : 0.5533
95% CI : (0.5378, 0.5687)
No Information Rate : 0.25
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4044
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: A Class: B Class: C Class: D
Sensitivity 0.7301 0.4034 0.4084 0.6713
Specificity 0.8386 0.7998 0.8649 0.9011
Pos Pred Value 0.6013 0.4018 0.5018 0.6934
Neg Pred Value 0.9031 0.8009 0.8143 0.8916
Prevalence 0.2500 0.2500 0.2500 0.2500
Detection Rate 0.1825 0.1008 0.1021 0.1678
Detection Prevalence 0.3035 0.2510 0.2034 0.2420
Balanced Accuracy 0.7844 0.6016 0.6366 0.7862
Accuracy is about 55.3% and the kappa statistic 0.4044, worse than the KNN classifier.
> set.seed(123)
> poly.tune <- tune.svm(class ~., data = scaled_train,
+ kernel = 'polynomial', degree = c(3, 4, 5),
+ coef0 = c(0.1, 0.5, 1, 2, 3, 4))
> summary(poly.tune)
Parameter tuning of ‘svm’:
- sampling method: 10-fold cross validation
- best parameters:
degree coef0
3 3
- best performance: 0.3559744
- Detailed performance results:
degree coef0 error dispersion
1 3 0.1 0.3757039 0.02016595
2 4 0.1 0.3829551 0.01954127
3 5 0.1 0.3956462 0.01639961
4 3 0.5 0.3657837 0.02027498
5 4 0.5 0.3621583 0.02020181
6 5 0.5 0.3646096 0.01985574
7 3 1.0 0.3610916 0.01877135
8 4 1.0 0.3595992 0.01953197
9 5 1.0 0.3683441 0.01915916
10 3 2.0 0.3568267 0.01675985
11 4 2.0 0.3584260 0.01808252
12 5 2.0 0.3664251 0.01715993
13 3 3.0 0.3559744 0.01528650
14 4 3.0 0.3589597 0.01817861
15 5 3.0 0.3664249 0.01730917
16 3 4.0 0.3562944 0.01675393
17 4 4.0 0.3598136 0.01573694
18 5 4.0 0.3669589 0.01575001
The optimal degree is 3, and the optimal coef0 is 3.
> best.poly <- poly.tune$best.model
> poly.test <- predict(best.poly, newdata = scaled_test)
> confusionMatrix(poly.test, scaled_test$class)
Confusion Matrix and Statistics
Reference
Prediction A B C D
A 823 311 139 43
B 164 515 202 92
C 9 145 557 199
D 8 33 106 670
Overall Statistics
Accuracy : 0.6387
95% CI : (0.6236, 0.6536)
No Information Rate : 0.25
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5183
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: A Class: B Class: C Class: D
Sensitivity 0.8197 0.5129 0.5548 0.6673
Specificity 0.8363 0.8479 0.8828 0.9512
Pos Pred Value 0.6254 0.5293 0.6121 0.8201
Neg Pred Value 0.9330 0.8393 0.8561 0.8956
Prevalence 0.2500 0.2500 0.2500 0.2500
Detection Rate 0.2049 0.1282 0.1387 0.1668
Detection Prevalence 0.3277 0.2423 0.2266 0.2034
Balanced Accuracy 0.8280 0.6804 0.7188 0.8093
Accuracy is about 63.87% and the kappa statistic 0.5183, a much better performance than the linear SVM model (accuracy 55.3%, kappa 0.4044).
> set.seed(123)
> sigmoid.tune <- tune.svm(class ~., data = scaled_train,
+ kernel = 'sigmoid',
+ gamma = c(0.1, 0.5, 1, 2, 3, 4),
+ coef0 = c(0.1, 0.5, 1, 2, 3, 4))
> summary(sigmoid.tune)
Parameter tuning of ‘svm’:
- sampling method: 10-fold cross validation
- best parameters:
gamma coef0
0.1 4
- best performance: 0.5549702
- Detailed performance results:
gamma coef0 error dispersion
1 0.1 0.1 0.6098942 0.017237266
2 0.5 0.1 0.6647129 0.019556511
3 1.0 0.1 0.6778299 0.013018919
4 2.0 0.1 0.6796458 0.013721557
5 3.0 0.1 0.6816705 0.008677423
6 4.0 0.1 0.6800726 0.014748441
7 0.1 0.5 0.6410379 0.016690596
8 0.5 0.5 0.6865779 0.019343938
9 1.0 0.5 0.6830530 0.013403084
10 2.0 0.5 0.6786878 0.018862369
11 3.0 0.5 0.6784731 0.016285652
12 4.0 0.5 0.6816695 0.011816541
13 0.1 1.0 0.6450930 0.017299432
14 0.5 1.0 0.7033180 0.020566970
15 1.0 1.0 0.7010799 0.012793388
16 2.0 1.0 0.6857219 0.010920674
17 3.0 1.0 0.6810307 0.013951721
18 4.0 1.0 0.6827354 0.012767381
19 0.1 2.0 0.6403987 0.018140656
20 0.5 2.0 0.7070483 0.014031467
21 1.0 2.0 0.7103602 0.024367990
22 2.0 2.0 0.6890285 0.014172960
23 3.0 2.0 0.6838045 0.015632352
24 4.0 2.0 0.6862564 0.012334060
25 0.1 3.0 0.6132047 0.016577372
26 0.5 3.0 0.7262468 0.013513776
27 1.0 3.0 0.7068387 0.016881267
28 2.0 3.0 0.6974507 0.014741367
29 3.0 3.0 0.6877478 0.012779387
30 4.0 3.0 0.6875358 0.015403522
31 0.1 4.0 0.5549702 0.024714337
32 0.5 4.0 0.7200590 0.022518266
33 1.0 4.0 0.7127002 0.018309779
34 2.0 4.0 0.7092896 0.010917369
35 3.0 4.0 0.6956382 0.013894373
36 4.0 4.0 0.6910530 0.013037909
The selected gamma value is 0.1, and the selected coef0 value is 4.
> best.sigmoid <- sigmoid.tune$best.model
> sigmoid.test <- predict(best.sigmoid, newdata = scaled_test)
> confusionMatrix(sigmoid.test, scaled_test$class)
Confusion Matrix and Statistics
Reference
Prediction A B C D
A 943 708 457 154
B 0 0 0 0
C 21 68 121 59
D 40 228 426 791
Overall Statistics
Accuracy : 0.4619
95% CI : (0.4464, 0.4775)
No Information Rate : 0.25
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.2825
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: A Class: B Class: C Class: D
Sensitivity 0.9392 0.00 0.12052 0.7878
Specificity 0.5621 1.00 0.95086 0.7696
Pos Pred Value 0.4169 NaN 0.44981 0.5327
Neg Pred Value 0.9652 0.75 0.76434 0.9158
Prevalence 0.2500 0.25 0.25000 0.2500
Detection Rate 0.2348 0.00 0.03013 0.1970
Detection Prevalence 0.5632 0.00 0.06698 0.3698
Balanced Accuracy 0.7507 0.50 0.53569 0.7787
Accuracy is about 46.19% and the kappa statistic about 0.2825, the worst performance so far.
Up to this point, the SVM with the polynomial kernel performs best, with 63.87% accuracy and a kappa statistic of 0.5183.
> library(rpart)
> library(partykit)
> library(MASS)
> library(genridge)
> library(randomForest)
> library(xgboost)
> library(caret)
> library(Boruta)
> rfmodel <- randomForest(class ~., data = train)
> rfmodel
Call:
randomForest(formula = class ~ ., data = train)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 30.89%
Confusion matrix:
A B C D class.error
A 1948 326 52 18 0.1689420
B 613 1286 308 136 0.4511310
C 295 457 1401 192 0.4025586
D 50 166 284 1845 0.2132196
The run gives an OOB (out-of-bag) error estimate of 30.89%.
> plot(rfmodel)
> which.min(rfmodel$err.rate[, 1])
[1] 401
which.min on the OOB error curve says 401 trees are enough to optimize model accuracy.
> set.seed(123)
> rfmodel2 <- randomForest(class ~., data = train, ntree = 401)
> rfmodel2
Call:
randomForest(formula = class ~ ., data = train, ntree = 401)
Type of random forest: classification
Number of trees: 401
No. of variables tried at each split: 3
OOB estimate of error rate: 30.88%
Confusion matrix:
A B C D class.error
A 1946 326 54 18 0.1697952
B 614 1277 318 134 0.4549723
C 289 452 1407 197 0.4000000
D 53 160 281 1851 0.2106610
The OOB error barely moves, from 30.89% to 30.88%.
> rftest <- predict(rfmodel2, newdata = test, type = 'response')
> confusionMatrix(rftest, test$class)
Confusion Matrix and Statistics
Reference
Prediction A B C D
A 851 262 104 24
B 127 546 204 59
C 18 139 586 111
D 8 57 110 810
Overall Statistics
Accuracy : 0.6955
95% CI : (0.681, 0.7097)
No Information Rate : 0.25
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.594
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: A Class: B Class: C Class: D
Sensitivity 0.8476 0.5438 0.5837 0.8068
Specificity 0.8705 0.8705 0.9110 0.9419
Pos Pred Value 0.6857 0.5833 0.6862 0.8223
Neg Pred Value 0.9449 0.8513 0.8678 0.9360
Prevalence 0.2500 0.2500 0.2500 0.2500
Detection Rate 0.2119 0.1360 0.1459 0.2017
Detection Prevalence 0.3090 0.2331 0.2126 0.2453
Balanced Accuracy 0.8591 0.7072 0.7473 0.8743
With 69.55% accuracy and a kappa statistic of 0.594, this outperforms the polynomial-kernel SVM (accuracy 63.87%, kappa 0.5183).
> dummies <- dummyVars(class ~., data = train)
> dummies
Dummy Variable Object
Formula: class ~ .
10 variables, 2 factors
Variables and levels will be separated by '.'
A less than full rank encoding is used
> train_dummies <- as.data.frame(predict(dummies, newdata = train))
Warning message:
In model.frame.default(Terms, newdata, na.action = na.action, xlev = object$lvls) :
variable 'class' is not a factor
> str(train_dummies)
'data.frame': 9377 obs. of 10 variables:
$ age : num 25 31 32 42 57 27 22 24 25 26 ...
$ height_cm : num 165 180 174 164 162 ...
$ weight_kg : num 55.8 78 71.1 63.7 58.2 53.9 67.9 84.4 66.6 71.5 ...
$ body.fat_. : num 15.7 20.1 18.4 32.2 20.9 35.5 18 20.4 30.2 18 ...
$ diastolic : num 77 92 76 72 69 69 71 80 82 64 ...
$ systolic : num 126 152 147 135 106 116 120 120 120 135 ...
$ sit.and.bend.forward_cm: num 16.3 12 15.2 0.8 20.7 13.1 19.2 7.2 22.9 17.8 ...
$ sit.ups.counts : num 53 49 53 18 30 28 55 54 39 50 ...
$ gender.F : num 0 0 0 1 1 1 0 0 1 0 ...
$ gender.M : num 1 1 1 0 0 0 1 1 0 1 ...
> train_dummies$class <- train$class
> dummies2 <- dummyVars(class ~., data = test)
> dummies2
Dummy Variable Object
Formula: class ~ .
10 variables, 2 factors
Variables and levels will be separated by '.'
A less than full rank encoding is used
> test_dummies <- as.data.frame(predict(dummies2, newdata = test))
Warning message:
In model.frame.default(Terms, newdata, na.action = na.action, xlev = object$lvls) :
variable 'class' is not a factor
> str(test_dummies)
'data.frame': 4016 obs. of 10 variables:
$ age : num 27 28 36 33 54 28 42 45 38 23 ...
$ height_cm : num 172 174 165 175 167 ...
$ weight_kg : num 75.2 67.7 55.4 77.2 67.5 ...
$ body.fat_. : num 21.3 17.1 22 28 27.6 14.4 19.3 30.9 23.2 29.6 ...
$ diastolic : num 80 70 64 84 85 81 63 93 70 91 ...
$ systolic : num 130 127 119 137 141 141 110 144 111 126 ...
$ sit.and.bend.forward_cm: num 18.4 20.7 21 12.3 18.6 12.1 16 19 19.7 20.7 ...
$ sit.ups.counts : num 60 45 27 42 34 55 50 30 50 32 ...
$ gender.F : num 0 0 1 0 0 0 0 1 0 1 ...
$ gender.M : num 1 1 0 1 1 1 1 0 1 0 ...
> test_dummies$class <- test$class
Parameter | Description |
---|---|
nrounds | maximum number of iterations (the number of trees in the final model) |
colsample_bytree | fraction of features sampled when building each tree, default 1 |
min_child_weight | minimum sum of instance weights required in a child node of a boosted tree, default 1 |
eta | learning rate, i.e. each tree's contribution to the solution, default 0.3 |
gamma | minimum loss reduction required to make a further leaf split |
subsample | fraction of observations sampled per tree, default 1 |
max_depth | maximum depth of an individual tree |
> grid <- expand.grid(nrounds = c(350, 500),
+ colsample_bytree = 1,
+ min_child_weight = 1,
+ eta = c(0.01, 0.1, 0.3),
+ gamma = c(0.5, 0.25),
+ subsample = 0.5,
+ max_depth = c(2, 3)
+ )
> cntrl <- trainControl(method = 'cv',
+ number = 5,
+ verboseIter = TRUE,
+ returnData = FALSE,
+ returnResamp = 'final')
> train.xgb <- train(x = train_dummies[, 1:10],
+ y = train_dummies[, 11],
+ trControl = cntrl,
+ tuneGrid = grid,
+ method = 'xgbTree')
+ Fold1: eta=0.01, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=500
[16:31:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
- Fold1: eta=0.01, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=500
.
.
.
- Fold5: eta=0.30, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=500
Aggregating results
Selecting tuning parameters
Fitting nrounds = 350, max_depth = 3, eta = 0.1, gamma = 0.25, colsample_bytree = 1, min_child_weight = 1, subsample = 0.5 on full training set
The output is quite long, so the middle part is omitted.
The selected combination of optimal parameters is nrounds = 350, max_depth = 3, eta = 0.1, gamma = 0.25, colsample_bytree = 1, min_child_weight = 1, subsample = 0.5.
> param <- list(booster = "gbtree",
+ objective = "multi:softprob",
+ num_class = 4,
+ eval_metric = 'error',
+ eta = 0.1,
+ max_depth = 3,
+ subsample = 0.5,
+ colsample_bytree = 1,
+ gamma = 0.25)
> x <- as.matrix(train_dummies[, 1:10])
> train.mat <- xgb.DMatrix(data = x, label = as.numeric(train_dummies$class) - 1)
> set.seed(321)
> xgb.fit <- xgb.train(params = param, data = train.mat, nrounds = 350)
> impMatrix <- xgb.importance(feature_names = dimnames(x)[[2]], model = xgb.fit)
> impMatrix
> xgb.plot.importance(impMatrix, main = 'Gain by Feature')
> library(InformationValue)
> pred <- predict(xgb.fit, x)
> optimalCutoff(y, pred)
[1] 0.4399996
> testMat <- as.matrix(test_dummies[, 1:10])
> xgb.test <- predict(xgb.fit, testMat)
> y.test <- as.numeric(test_dummies$class) - 1
> confusionMatrix(y.test, xgb.test, threshold = 0.4399)
Error in table(predicted_dir, actual_dir) :
all arguments must have the same length
> 1 - misClassError(y.test, xgb.test, threshold = 0.4399)
[1] -2
> plotROC(y.test, xgb.test)
I thought the maximum possible AUC was 1... yet it printed 2.0096. What went wrong?? In hindsight, the problem is that InformationValue's optimalCutoff/misClassError/plotROC are designed for binary classification with a single probability vector, while predict() on a multi:softprob model returns four class probabilities per observation (a vector of length 4 × the number of rows), so these binary metrics give nonsense here.
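A sketch of how the test predictions could be evaluated instead (my addition, reusing xgb.fit, testMat, and test_dummies from above; multi:softprob returns each observation's class probabilities consecutively):
> prob <- predict(xgb.fit, testMat)                  # length 4 * nrow(testMat)
> prob_mat <- matrix(prob, ncol = 4, byrow = TRUE)   # one row per observation, columns A-D
> pred <- factor(levels(test_dummies$class)[max.col(prob_mat)],
+                levels = levels(test_dummies$class))
> caret::confusionMatrix(pred, test_dummies$class)   # caret:: prefix avoids InformationValue's masking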
> library(caret)
> library(MASS)
> library(caretEnsemble)
> library(caTools)
> library(mlr)
> library(ggplot2)
> library(HDclassif)
> library(reshape2)
> library(corrplot)
> library(gbm)
> library(mboost)
> control <- trainControl(method = 'cv', number = 5,
+ savePredictions = 'final',
+ classProbs = TRUE,
+ summaryFunction = defaultSummary)
> set.seed(1234)
> models <- caretList(class ~., data = train,
+ trControl = control,
+ metric = 'ROC',
+ methodList = c('rpart', 'earth', 'knn'))
There were 28 warnings (use warnings() to see them)
> models
$rpart
CART
9377 samples
9 predictor
4 classes: 'A', 'B', 'C', 'D'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 7503, 7501, 7501, 7502, 7501
Resampling results across tuning parameters:
cp Accuracy Kappa
0.02531286 0.4917378 0.3223165
0.06214448 0.4499429 0.2665902
0.20221843 0.4134534 0.2179368
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.02531286.
$earth
Multivariate Adaptive Regression Spline
9377 samples
9 predictor
4 classes: 'A', 'B', 'C', 'D'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 7503, 7501, 7501, 7502, 7501
Resampling results across tuning parameters:
nprune Accuracy Kappa
2 0.4363864 0.2484982
8 0.5503914 0.4005246
14 0.5748124 0.4330834
Tuning parameter 'degree' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nprune = 14 and degree = 1.
$knn
k-Nearest Neighbors
9377 samples
9 predictor
4 classes: 'A', 'B', 'C', 'D'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 7503, 7501, 7501, 7502, 7501
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.5325800 0.3767785
7 0.5415352 0.3887178
9 0.5506006 0.4008056
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
attr(,"class")
[1] "caretList"
> modelCor(resamples(models))
rpart earth knn
rpart 1.0000000 0.8561974 -0.7627144
earth 0.8561974 1.0000000 -0.6274591
knn -0.7627144 -0.6274591 1.0000000
For an effective ensemble, the base models should not be highly correlated. The correlations between rpart and earth, and between knn and rpart, look rather high.
> set.seed(1234)
> models <- caretList(class ~., data = train,
+ trControl = control,
+ metric = 'ROC',
+ methodList = c('treebag', 'earth', 'knn'))
There were 28 warnings (use warnings() to see them)
> models
$treebag
Bagged CART
9377 samples
9 predictor
4 classes: 'A', 'B', 'C', 'D'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 7503, 7501, 7501, 7502, 7501
Resampling results:
Accuracy Kappa
0.6641765 0.5522371
$earth
Multivariate Adaptive Regression Spline
9377 samples
9 predictor
4 classes: 'A', 'B', 'C', 'D'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 7503, 7501, 7501, 7502, 7501
Resampling results across tuning parameters:
nprune Accuracy Kappa
2 0.4363864 0.2484982
8 0.5503914 0.4005246
14 0.5748124 0.4330834
Tuning parameter 'degree' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nprune = 14 and degree = 1.
$knn
k-Nearest Neighbors
9377 samples
9 predictor
4 classes: 'A', 'B', 'C', 'D'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 7503, 7501, 7501, 7502, 7501
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.5303412 0.3737935
7 0.5402554 0.3870114
9 0.5516674 0.4022275
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
attr(,"class")
[1] "caretList"
> modelCor(resamples(models))
treebag earth knn
treebag 1.0000000 0.0385026 0.1967112
earth 0.0385026 1.0000000 -0.5078148
knn 0.1967112 -0.5078148 1.0000000
The correlations look low.
> model_preds <- lapply(models, predict, newdata = test, type = 'prob')
> model_preds <- lapply(model_preds, function(x) x[, 'A'])
> model_preds <- data.frame(model_preds)
> stack <- caretStack(models, method = 'glm', metric = 'ROC',
+ trControl = trainControl(
+ method = 'boot', number = 5,
+ savePredictions = 'final',
+ classProbs = TRUE,
+ summaryFunction = defaultSummary
+ ))
Error in check_caretList_model_types(list_of_models) :
Not yet implemented for multiclass problems
It seems caretStack does not support multiclass models... unbelievable.
Multiclass classification is so hard... I couldn't even reach a proper conclusion with extreme gradient boosting or the ensemble analysis.
Many packages simply don't support multiclass problems, and getting an accurate AUC was difficult...
Then what if we recast the multiclass problem as a binary one?
Given the classes A, B, C, and D, relabel A as Good and B, C, D as Bad; or alternatively reassign A and B to Good and C, D to Bad, and redo the analysis (see the sketch below).
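A sketch of that relabeling (my addition; the column name class2 is hypothetical):
> df$class2 <- factor(ifelse(df$class == 'A', 'Good', 'Bad'))             # A vs. the rest
> df$class2 <- factor(ifelse(df$class %in% c('A', 'B'), 'Good', 'Bad'))   # or A, B vs. C, D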
I'll come back with _Classification analysis using the 'Body Performance Data'_ ver2... ver2_click!
Packages used: pastecs, psych, PerformanceAnalytics, caret, GGally, reshape2, ggplot2, gridExtra, gapminder, dplyr, class, kknn, e1071, MASS, kernlab, corrplot, rpart, partykit, genridge, randomForest, xgboost, Boruta, InformationValue, caretEnsemble, caTools, mlr, HDclassif, gbm, mboost