데이터 출처
The widely used Statlog German credit data, as of November 2019, suffers from severe errors in the coding information and does not come with any background information. The 'South German Credit' data provide a correction and some background information, based on the Open Data LMU (2010) representation of the same data and several other German language resources.
Header | Description |
---|---|
status | status of the debtor's checking account with the bank (categorical) |
duration | credit duration in months (quantitative) |
credit_history | history of compliance with previous or concurrent credit contracts (categorical) |
purpose | purpose for which the credit is needed (categorical) |
amount | credit amount in DM (quantitative) |
savings | debtor's savings (categorical) |
employment_duration | duration of debtor's employment with current employer |
installment_rate | credit installments as a percentage of debtor's disposable income |
personal_status_sex | combined information on sex and marital status (categorical) |
other_debtors | Is there another debtor or a guarantor for the credit? (categorical) |
present_residence | length of time (in years) the debtor lives in the present residence |
property | the debtor's most valuable property |
age | age in years (quantitative) |
other_installment_plans | installment plans from providers other than the credit-giving bank (categorical) |
housing | type of housing the debtor lives in (categorical) |
number_credits | number of credits including the current one the debtor has (or had) at this bank |
job | quality of debtor's job (ordinal) |
people_liable | number of persons who financially depend on the debtor |
telephone | Is there a telephone landline registered on the debtor's name? (binary; remember that the data are from the 1970s) |
foreign_worker | Is the debtor a foreign worker? (binary) |
credit_risk | Has the credit contract been complied with (good) or not (bad) ? (binary) |
dat <- read.table("SouthGermanCredit.asc", header=TRUE)
nam_fahrmeirbook <- colnames(dat)
nam_evtree <- c("status", "duration", "credit_history", "purpose", "amount",
"savings", "employment_duration", "installment_rate",
"personal_status_sex", "other_debtors",
"present_residence", "property",
"age", "other_installment_plans",
"housing", "number_credits",
"job", "people_liable", "telephone", "foreign_worker",
"credit_risk")
names(dat) <- nam_evtree
for (i in setdiff(1:21, c(2,4,5,13)))
dat[[i]] <- factor(dat[[i]])
dat[[4]] <- factor(dat[[4]], levels=as.character(0:10))
levels(dat$credit_risk) <- c("bad", "good")
levels(dat$status) = c("no checking account",
"... < 0 DM",
"0<= ... < 200 DM",
"... >= 200 DM / salary for at least 1 year")
levels(dat$credit_history) <- c(
"delay in paying off in the past",
"critical account/other credits elsewhere",
"no credits taken/all credits paid back duly",
"existing credits paid back duly till now",
"all credits at this bank paid back duly")
levels(dat$purpose) <- c(
"others",
"car (new)",
"car (used)",
"furniture/equipment",
"radio/television",
"domestic appliances",
"repairs",
"education",
"vacation",
"retraining",
"business")
levels(dat$savings) <- c("unknown/no savings account",
"... < 100 DM",
"100 <= ... < 500 DM",
"500 <= ... < 1000 DM",
"... >= 1000 DM")
levels(dat$employment_duration) <-
c( "unemployed",
"< 1 yr",
"1 <= ... < 4 yrs",
"4 <= ... < 7 yrs",
">= 7 yrs")
dat$installment_rate <- ordered(dat$installment_rate)
levels(dat$installment_rate) <- c(">= 35",
"25 <= ... < 35",
"20 <= ... < 25",
"< 20")
levels(dat$other_debtors) <- c(
"none",
"co-applicant",
"guarantor"
)
levels(dat$personal_status_sex) <- c(
"male : divorced/separated",
"female : non-single or male : single",
"male : married/widowed",
"female : single")
dat$present_residence <- ordered(dat$present_residence)
levels(dat$present_residence) <- c("< 1 yr",
"1 <= ... < 4 yrs",
"4 <= ... < 7 yrs",
">= 7 yrs")
levels(dat$property) <- c(
"unknown / no property",
"car or other",
"building soc. savings agr./life insurance",
"real estate"
)
levels(dat$other_installment_plans) <- c(
"bank",
"stores",
"none"
)
levels(dat$housing) <- c("for free", "rent", "own")
dat$number_credits <- ordered(dat$number_credits)
levels(dat$number_credits) <- c("1", "2-3", "4-5", ">= 6")
levels(dat$job) <- c(
"unemployed/unskilled - non-resident",
"unskilled - resident",
"skilled employee/official",
"manager/self-empl./highly qualif. employee"
)
levels(dat$people_liable) <- c("3 or more", "0 to 2")
levels(dat$telephone) <- c("no", "yes (under customer name)")
levels(dat$foreign_worker) <- c("yes", "no")
df <- as.data.frame(dat)
dim(df)
## [1] 1000 21
str(df)
## 'data.frame': 1000 obs. of 21 variables:
## $ status : Factor w/ 4 levels "no checking account",..: 1 1 2 1 1 1 1 1 4 2 ...
## $ duration : int 18 9 12 12 12 10 8 6 18 24 ...
## $ credit_history : Factor w/ 5 levels "delay in paying off in the past",..: 5 5 3 5 5 5 5 5 5 3 ...
## $ purpose : Factor w/ 11 levels "others","car (new)",..: 3 1 10 1 1 1 1 1 4 4 ...
## $ amount : int 1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
## $ savings : Factor w/ 5 levels "unknown/no savings account",..: 1 1 2 1 1 1 1 1 1 3 ...
## $ employment_duration : Factor w/ 5 levels "unemployed","< 1 yr",..: 2 3 4 3 3 2 4 2 1 1 ...
## $ installment_rate : Ord.factor w/ 4 levels ">= 35"<"25 <= ... < 35"<..: 4 2 2 3 4 1 1 2 4 1 ...
## $ personal_status_sex : Factor w/ 4 levels "male : divorced/separated",..: 2 3 2 3 3 3 3 3 2 2 ...
## $ other_debtors : Factor w/ 3 levels "none","co-applicant",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ present_residence : Ord.factor w/ 4 levels "< 1 yr"<"1 <= ... < 4 yrs"<..: 4 2 4 2 4 3 4 4 4 4 ...
## $ property : Factor w/ 4 levels "unknown / no property",..: 2 1 1 1 2 1 1 1 3 4 ...
## $ age : int 21 36 23 39 38 48 39 40 65 23 ...
## $ other_installment_plans: Factor w/ 3 levels "bank","stores",..: 3 3 3 3 1 3 3 3 3 3 ...
## $ housing : Factor w/ 3 levels "for free","rent",..: 1 1 1 1 2 1 2 2 2 1 ...
## $ number_credits : Ord.factor w/ 4 levels "1"<"2-3"<"4-5"<..: 1 2 1 2 2 2 2 1 2 1 ...
## $ job : Factor w/ 4 levels "unemployed/unskilled - non-resident",..: 3 3 2 2 2 2 2 2 1 1 ...
## $ people_liable : Factor w/ 2 levels "3 or more","0 to 2": 2 1 2 1 2 1 2 1 2 2 ...
## $ telephone : Factor w/ 2 levels "no","yes (under customer name)": 1 1 1 1 1 1 1 1 1 1 ...
## $ foreign_worker : Factor w/ 2 levels "yes","no": 2 2 2 1 1 1 1 1 2 2 ...
## $ credit_risk : Factor w/ 2 levels "bad","good": 2 2 2 2 2 2 2 2 2 2 ...
summary(df)
## status duration
## no checking account :274 Min. : 4.0
## ... < 0 DM :269 1st Qu.:12.0
## 0<= ... < 200 DM : 63 Median :18.0
## ... >= 200 DM / salary for at least 1 year:394 Mean :20.9
## 3rd Qu.:24.0
## Max. :72.0
##
## credit_history purpose
## delay in paying off in the past : 40 furniture/equipment:280
## critical account/other credits elsewhere : 49 others :234
## no credits taken/all credits paid back duly:530 car (used) :181
## existing credits paid back duly till now : 88 car (new) :103
## all credits at this bank paid back duly :293 retraining : 97
## repairs : 50
## (Other) : 55
## amount savings employment_duration
## Min. : 250 unknown/no savings account:603 unemployed : 62
## 1st Qu.: 1366 ... < 100 DM :103 < 1 yr :172
## Median : 2320 100 <= ... < 500 DM : 63 1 <= ... < 4 yrs:339
## Mean : 3271 500 <= ... < 1000 DM : 48 4 <= ... < 7 yrs:174
## 3rd Qu.: 3972 ... >= 1000 DM :183 >= 7 yrs :253
## Max. :18424
##
## installment_rate personal_status_sex
## >= 35 :136 male : divorced/separated : 50
## 25 <= ... < 35:231 female : non-single or male : single:310
## 20 <= ... < 25:157 male : married/widowed :548
## < 20 :476 female : single : 92
##
##
##
## other_debtors present_residence
## none :907 < 1 yr :130
## co-applicant: 41 1 <= ... < 4 yrs:308
## guarantor : 52 4 <= ... < 7 yrs:149
## >= 7 yrs :413
##
##
##
## property age
## unknown / no property :282 Min. :19.00
## car or other :232 1st Qu.:27.00
## building soc. savings agr./life insurance:332 Median :33.00
## real estate :154 Mean :35.54
## 3rd Qu.:42.00
## Max. :75.00
##
## other_installment_plans housing number_credits
## bank :139 for free:179 1 :633
## stores: 47 rent :714 2-3 :333
## none :814 own :107 4-5 : 28
## >= 6: 6
##
##
##
## job people_liable
## unemployed/unskilled - non-resident : 22 3 or more:155
## unskilled - resident :200 0 to 2 :845
## skilled employee/official :630
## manager/self-empl./highly qualif. employee:148
##
##
##
## telephone foreign_worker credit_risk
## no :596 yes: 37 bad :300
## yes (under customer name):404 no :963 good:700
##
##
##
##
##
sum(is.na(df))
## [1] 0
21개의 피처중 3개만 연속형 피처이고 나머지는 factor형이다. 결측치는 없다.
library(DMwR2)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(pastecs)
library(psych)
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Loading required package: lattice
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(smotefamily)
library(naniar)
library(reshape2)
library(gridExtra)
library(gapminder)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
## The following objects are masked from 'package:pastecs':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
library(FSelector)
library(Boruta)
library(ROSE)
## Loaded ROSE 0.0-4
describeBy(df[, c(2, 5, 13)], df$credit_risk, mat = FALSE)
##
## Descriptive statistics by group
## group: bad
## vars n mean sd median trimmed mad min max range skew
## duration 1 300 24.86 13.28 24.0 23.59 17.79 6 72 66 0.83
## amount 2 300 3938.13 3535.82 2574.5 3291.18 2092.69 433 18424 17991 1.57
## age 3 300 33.96 11.23 31.0 32.37 9.64 19 74 55 1.14
## kurtosis se
## duration 0.03 0.77
## amount 2.05 204.14
## age 0.73 0.65
## ------------------------------------------------------------
## group: good
## vars n mean sd median trimmed mad min max range skew
## duration 1 700 19.21 11.08 18 17.88 8.90 4 60 56 1.18
## amount 2 700 2985.44 2401.50 2244 2564.21 1485.57 250 15857 15607 1.94
## age 3 700 36.22 11.35 34 34.92 10.38 19 75 56 0.98
## kurtosis se
## duration 1.38 0.42
## amount 4.62 90.77
## age 0.59 0.43
결과변수인 ’credit_risk’에서 bad그룹의 duration의 평균보다 good그룹의 평균이 약 5개월 정도 낮은것으로 나타났고, amount의 평균은 bad그룹이 good그룹보다 약 1000DM많은 것으로 나타났다. (즉 신용 계약을 준수한 그룹(good)이 계약을 준수하지 않은 그룹(bad)보다 더 작은 돈을 빌렸다.)
또한 나이의 평균은 bad그룹이 33.9, good그룹이 36.2세로 두 그룹의 차가 2.3으로 큰 차이는 아닌 것으로 보인다.
’age’변수의 왜도와 첨도는 bad그룹과 good그룹이 거의 차이나지 않지만, ’duration’과 ’amount’변수의 왜도와 첨도는 bad그룹보다 good그룹이 조금씩 더 크다.
따라서 bad그룹보다 good그룹의 ‘duration’과 ’amount’ 변수의 데이터가 조금 더 왼쪽에 쏠려있고 약간 더 뾰족한 분포를 보일것으로 예측된다.
ggpairs(df, columns = c(2, 5, 13), aes(colour = credit_risk, alpha = 0.4))
위에서 첨도와 왜도분석에서 보았듯이 ‘duration’과 ’amount’ 변수는 bad그룹보다 good그룹이 더 왼쪽으로 쏠리고 더 뾰족한 분포를 보인다.
반대로 ’age’변수는 good그룹보다 bad그룹이 더 왼쪽으로 쏠리고 살짝 더 뾰족한 분포를 보이고 있다.
또한 ‘duration’과 ’amount’ 변수간의 상관관계가 0.625로 살짝 높은편이다.
ggpairs(df, columns = c(1, 3, 4, 6, 7, 8, 9, 10), aes(colour = credit_risk, alpha = 0.4))
왼쪽 위부터 오른쪽 아래로 가는 대각선에 위치하는 그래프를 보면 된다.
’status’변수는 채무자의 은행 당좌 예금 상태를 나타내는데 200DM이상이 가장 많았고 0이상 200DM미만이 가장 작다.
여기서 ’other_debtors’가 가장 분포의 차이가 두드러지는데 none이 907개, co-applicant가 41개, guarantor가 52개의 데이터로 카테고리별 차이가 10배 이상으로 크다.
ggpairs(df, columns = c(11, 12, 14, 15, 16, 17, 18, 19, 20), aes(colour = credit_risk, alpha = 0.4))
여기서 특이한 변수는 ’other_installment_plans’로 보이는데, 신용 대출 은행 이외의 제공자로부터 받은 할부금이 대부분 없는것(none)으로 나타났다.
또한 ’foreign_worker’변수를 보면 은행에서 돈을 빌린 대부분의 사람이 외국인이 아닌 내국인인 것으로 나타났다.
df_numeric <- df[, c(2, 5, 13, 21)]
model_scale <- preProcess(df_numeric, method = c('center', 'scale'))
scaled_df <- predict(model_scale, df_numeric)
melt_df <- melt(scaled_df, id.vars = 'credit_risk')
head(melt_df)
## credit_risk variable value
## 1 good duration -0.2407368
## 2 good duration -0.9870788
## 3 good duration -0.7382981
## 4 good duration -0.7382981
## 5 good duration -0.7382981
## 6 good duration -0.9041519
p1 <- ggplot(melt_df, aes(x = variable, y = value, fill = credit_risk)) + geom_boxplot()
p1
수치형 변수인 duration과 amount, age에 모두 이상치가 있는 것으로 나타났다.
하지만 여기서 이상치로 찍힌 점들은 제 1사분위수 - (1.5IQR)보다 작거나 제3사분위수 + (1.5IQR)보다 큰 수인데, 모든 상황에서 이 수치보다 크거나 작다고 이상치라고 할 수 없다.
현재 age데이터의 최솟값은 19, 최댓값은 75인데 이를 이상치라고 볼 수 없다.
duraion과 amount의 최댓값은 각각 72와 18424인데 이를 명확하게 이상치라고 볼 근거가 없으므로 사분위수 99%범위 밖의 이상치만 75% 사분위수로 바꿔주겠다.
dura_99 <- quantile(df$duration, 0.99)
dura_75 <- quantile(df$duration, 0.75)
amount_99 <- quantile(df$amount, 0.99)
amount_75 <- quantile(df$amount, 0.75)
df$duration <- ifelse(df$duration > dura_99, dura_75, df$duration)
df$amount <- ifelse(df$amount > amount_99, amount_75, df$amount)
df_numeric2 <- df[, c(2, 5, 13, 21)]
model_scale2 <- preProcess(df_numeric2, method = c('center', 'scale'))
scaled_df2 <- predict(model_scale2, df_numeric2)
melt_df2 <- melt(scaled_df2, id.vars = 'credit_risk')
head(melt_df2)
## credit_risk variable value
## 1 good duration -0.2389069
## 2 good duration -0.9920284
## 3 good duration -0.7409879
## 4 good duration -0.7409879
## 5 good duration -0.7409879
## 6 good duration -0.9083482
p1 <- ggplot(melt_df2, aes(x = variable, y = value, fill = credit_risk)) + geom_boxplot()
p1
잘 변경된 것으로 보인다.
nearZeroVar(df, saveMetrics = TRUE)
## freqRatio percentUnique zeroVar nzv
## status 1.437956 0.4 FALSE FALSE
## duration 1.033520 3.2 FALSE FALSE
## credit_history 1.808874 0.5 FALSE FALSE
## purpose 1.196581 1.0 FALSE FALSE
## amount 3.333333 91.4 FALSE FALSE
## savings 3.295082 0.5 FALSE FALSE
## employment_duration 1.339921 0.5 FALSE FALSE
## installment_rate 2.060606 0.4 FALSE FALSE
## personal_status_sex 1.767742 0.4 FALSE FALSE
## other_debtors 17.442308 0.3 FALSE FALSE
## present_residence 1.340909 0.4 FALSE FALSE
## property 1.177305 0.4 FALSE FALSE
## age 1.020408 5.3 FALSE FALSE
## other_installment_plans 5.856115 0.3 FALSE FALSE
## housing 3.988827 0.3 FALSE FALSE
## number_credits 1.900901 0.4 FALSE FALSE
## job 3.150000 0.4 FALSE FALSE
## people_liable 5.451613 0.2 FALSE FALSE
## telephone 1.475248 0.2 FALSE FALSE
## foreign_worker 26.027027 0.2 FALSE TRUE
## credit_risk 2.333333 0.2 FALSE FALSE
table(df$foreign_worker)
##
## yes no
## 37 963
’foreign_worker’변수의 분산이 0에 가까운것으로 나타났다. 위의 그래프에서 보았다싶이 카테고리에서 외국인의 수는 37, 내국인의 수는 963으로 차이가 많이나 모델링에 유용하지 않다.
df2 <- df[, -nearZeroVar(df)]
str(df2)
## 'data.frame': 1000 obs. of 20 variables:
## $ status : Factor w/ 4 levels "no checking account",..: 1 1 2 1 1 1 1 1 4 2 ...
## $ duration : num 18 9 12 12 12 10 8 6 18 24 ...
## $ credit_history : Factor w/ 5 levels "delay in paying off in the past",..: 5 5 3 5 5 5 5 5 5 3 ...
## $ purpose : Factor w/ 11 levels "others","car (new)",..: 3 1 10 1 1 1 1 1 4 4 ...
## $ amount : num 1049 2799 841 2122 2171 ...
## $ savings : Factor w/ 5 levels "unknown/no savings account",..: 1 1 2 1 1 1 1 1 1 3 ...
## $ employment_duration : Factor w/ 5 levels "unemployed","< 1 yr",..: 2 3 4 3 3 2 4 2 1 1 ...
## $ installment_rate : Ord.factor w/ 4 levels ">= 35"<"25 <= ... < 35"<..: 4 2 2 3 4 1 1 2 4 1 ...
## $ personal_status_sex : Factor w/ 4 levels "male : divorced/separated",..: 2 3 2 3 3 3 3 3 2 2 ...
## $ other_debtors : Factor w/ 3 levels "none","co-applicant",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ present_residence : Ord.factor w/ 4 levels "< 1 yr"<"1 <= ... < 4 yrs"<..: 4 2 4 2 4 3 4 4 4 4 ...
## $ property : Factor w/ 4 levels "unknown / no property",..: 2 1 1 1 2 1 1 1 3 4 ...
## $ age : int 21 36 23 39 38 48 39 40 65 23 ...
## $ other_installment_plans: Factor w/ 3 levels "bank","stores",..: 3 3 3 3 1 3 3 3 3 3 ...
## $ housing : Factor w/ 3 levels "for free","rent",..: 1 1 1 1 2 1 2 2 2 1 ...
## $ number_credits : Ord.factor w/ 4 levels "1"<"2-3"<"4-5"<..: 1 2 1 2 2 2 2 1 2 1 ...
## $ job : Factor w/ 4 levels "unemployed/unskilled - non-resident",..: 3 3 2 2 2 2 2 2 1 1 ...
## $ people_liable : Factor w/ 2 levels "3 or more","0 to 2": 2 1 2 1 2 1 2 1 2 2 ...
## $ telephone : Factor w/ 2 levels "no","yes (under customer name)": 1 1 1 1 1 1 1 1 1 1 ...
## $ credit_risk : Factor w/ 2 levels "bad","good": 2 2 2 2 2 2 2 2 2 2 ...
set.seed(123)
feature.selection <- Boruta(credit_risk ~., data = df2, doTrace = 1)
## After 11 iterations, +2.2 secs:
## confirmed 9 attributes: age, amount, credit_history, duration, employment_duration and 4 more;
## still have 10 attributes left.
## After 19 iterations, +3.7 secs:
## rejected 1 attribute: people_liable;
## still have 9 attributes left.
## After 22 iterations, +4.3 secs:
## rejected 1 attribute: telephone;
## still have 8 attributes left.
## After 28 iterations, +5.3 secs:
## confirmed 1 attribute: purpose;
## rejected 1 attribute: present_residence;
## still have 6 attributes left.
## After 40 iterations, +7.3 secs:
## confirmed 1 attribute: other_installment_plans;
## still have 5 attributes left.
## After 45 iterations, +8.1 secs:
## rejected 1 attribute: personal_status_sex;
## still have 4 attributes left.
## After 76 iterations, +13 secs:
## confirmed 1 attribute: installment_rate;
## still have 3 attributes left.
## After 85 iterations, +15 secs:
## confirmed 1 attribute: housing;
## still have 2 attributes left.
table(feature.selection$finalDecision)
##
## Tentative Confirmed Rejected
## 2 13 4
fNames <- getSelectedAttributes(feature.selection, withTentative = TRUE)
fNames
## [1] "status" "duration"
## [3] "credit_history" "purpose"
## [5] "amount" "savings"
## [7] "employment_duration" "installment_rate"
## [9] "other_debtors" "property"
## [11] "age" "other_installment_plans"
## [13] "housing" "number_credits"
## [15] "job"
new_df <- df2[, fNames]
new_df$credit_risk <- df2$credit_risk
str(new_df)
## 'data.frame': 1000 obs. of 16 variables:
## $ status : Factor w/ 4 levels "no checking account",..: 1 1 2 1 1 1 1 1 4 2 ...
## $ duration : num 18 9 12 12 12 10 8 6 18 24 ...
## $ credit_history : Factor w/ 5 levels "delay in paying off in the past",..: 5 5 3 5 5 5 5 5 5 3 ...
## $ purpose : Factor w/ 11 levels "others","car (new)",..: 3 1 10 1 1 1 1 1 4 4 ...
## $ amount : num 1049 2799 841 2122 2171 ...
## $ savings : Factor w/ 5 levels "unknown/no savings account",..: 1 1 2 1 1 1 1 1 1 3 ...
## $ employment_duration : Factor w/ 5 levels "unemployed","< 1 yr",..: 2 3 4 3 3 2 4 2 1 1 ...
## $ installment_rate : Ord.factor w/ 4 levels ">= 35"<"25 <= ... < 35"<..: 4 2 2 3 4 1 1 2 4 1 ...
## $ other_debtors : Factor w/ 3 levels "none","co-applicant",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ property : Factor w/ 4 levels "unknown / no property",..: 2 1 1 1 2 1 1 1 3 4 ...
## $ age : int 21 36 23 39 38 48 39 40 65 23 ...
## $ other_installment_plans: Factor w/ 3 levels "bank","stores",..: 3 3 3 3 1 3 3 3 3 3 ...
## $ housing : Factor w/ 3 levels "for free","rent",..: 1 1 1 1 2 1 2 2 2 1 ...
## $ number_credits : Ord.factor w/ 4 levels "1"<"2-3"<"4-5"<..: 1 2 1 2 2 2 2 1 2 1 ...
## $ job : Factor w/ 4 levels "unemployed/unskilled - non-resident",..: 3 3 2 2 2 2 2 2 1 1 ...
## $ credit_risk : Factor w/ 2 levels "bad","good": 2 2 2 2 2 2 2 2 2 2 ...
idx <- createDataPartition(new_df$credit_risk, p = 0.7)
train <- new_df[idx$Resample1, ]
test <- new_df[-idx$Resample1, ]
str(train)
## 'data.frame': 700 obs. of 16 variables:
## $ status : Factor w/ 4 levels "no checking account",..: 1 2 1 1 1 1 4 1 1 1 ...
## $ duration : num 18 12 12 10 8 6 18 11 30 6 ...
## $ credit_history : Factor w/ 5 levels "delay in paying off in the past",..: 5 3 5 5 5 5 5 5 5 5 ...
## $ purpose : Factor w/ 11 levels "others","car (new)",..: 3 10 1 1 1 1 4 1 2 4 ...
## $ amount : num 1049 841 2171 2241 3398 ...
## $ savings : Factor w/ 5 levels "unknown/no savings account",..: 1 2 1 1 1 1 1 1 2 1 ...
## $ employment_duration : Factor w/ 5 levels "unemployed","< 1 yr",..: 2 4 3 2 4 2 1 3 4 4 ...
## $ installment_rate : Ord.factor w/ 4 levels ">= 35"<"25 <= ... < 35"<..: 4 2 4 1 1 2 4 2 1 1 ...
## $ other_debtors : Factor w/ 3 levels "none","co-applicant",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ property : Factor w/ 4 levels "unknown / no property",..: 2 1 2 1 1 1 3 1 3 3 ...
## $ age : int 21 23 38 48 39 40 65 36 24 31 ...
## $ other_installment_plans: Factor w/ 3 levels "bank","stores",..: 3 3 1 3 3 3 3 3 3 3 ...
## $ housing : Factor w/ 3 levels "for free","rent",..: 1 1 2 1 2 2 2 1 1 2 ...
## $ number_credits : Ord.factor w/ 4 levels "1"<"2-3"<"4-5"<..: 1 1 2 2 2 1 2 2 2 1 ...
## $ job : Factor w/ 4 levels "unemployed/unskilled - non-resident",..: 3 2 2 2 2 2 1 3 3 3 ...
## $ credit_risk : Factor w/ 2 levels "bad","good": 2 2 2 2 2 2 2 2 2 2 ...
model_train <- preProcess(train, method = 'range')
model_test <- preProcess(test, method = 'range')
scaled_train <- predict(model_train, train)
scaled_test <- predict(model_test, test)
dummy <- dummyVars(credit_risk ~., data = train)
train_dummy <- as.data.frame(predict(dummy, newdata = train))
train_dummy$credit_risk <- train$credit_risk
dummy2 <- dummyVars(credit_risk ~., data = test)
test_dummy <- as.data.frame(predict(dummy2, newdata = test))
test_dummy$credit_risk <- test$credit_risk
dummy3 <- dummyVars(credit_risk ~., data = scaled_train)
scaled_train_dummy <- as.data.frame(predict(dummy3, newdata = scaled_train))
scaled_train_dummy$credit_risk <- scaled_train$credit_risk
dummy4 <- dummyVars(credit_risk ~., data = scaled_test)
scaled_test_dummy <- as.data.frame(predict(dummy4, newdata = scaled_test))
scaled_test_dummy$credit_risk <- scaled_test$credit_risk
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:psych':
##
## rescale
## Loading required package: TeachingDemos
set.seed(123)
earth.fit <- earth(credit_risk ~., data = train, pmethod = 'cv', nfold = 5,
ncross = 3, degree = 1, minspan = -1,
glm = list(family = binomial))
summary(earth.fit)
## Call: earth(formula=credit_risk~., data=train, pmethod="cv",
## glm=list(family=binomial), degree=1, nfold=5, ncross=3, minspan=-1)
##
## GLM coefficients
## good
## (Intercept) -1.6375354
## status... < 0 DM 0.3056341
## status0<= ... < 200 DM 1.0189310
## status... >= 200 DM / salary for at least 1 year 1.5810760
## credit_historyno credits taken/all credits paid back duly 0.9489170
## credit_historyexisting credits paid back duly till now 1.2806841
## credit_historyall credits at this bank paid back duly 1.6916157
## purposecar (new) 1.9311140
## purposecar (used) 0.5959477
## purposefurniture/equipment 0.7173273
## purposeradio/television 1.1716889
## purposevacation 14.6595748
## purposeretraining 0.8339675
## savings100 <= ... < 500 DM 0.7121511
## savings... >= 1000 DM 0.7113833
## employment_duration4 <= ... < 7 yrs 0.8365365
## employment_duration>= 7 yrs 0.2497641
## installment_rate.L -0.4105132
## other_debtorsguarantor 1.3757302
## housingrent 0.5917223
## h(18-duration) 0.0914916
## h(duration-18) -0.0194139
## h(2348-amount) -0.0004785
## h(amount-2348) -0.0001629
## h(33-age) -0.0507991
##
## GLM (family binomial, link logit):
## nulldev df dev df devratio AIC iters converged
## 855.21 699 637.564 675 0.254 687.6 14 1
##
## Earth selected 25 of 32 terms, and 22 of 46 predictors (pmethod="cv")
## Termination condition: RSq changed by less than 0.001 at 32 terms
## Importance: status... >= 200 DM / salary for at least 1 year, duration, ...
## Number of terms at each degree of interaction: 1 24 (additive model)
## Earth GRSq 0.1595951 RSq 0.2710526 mean.oof.RSq 0.1576443 (sd 0.0787)
##
## pmethod="backward" would have selected:
## 14 terms 12 preds, GRSq 0.1761084 RSq 0.2362594 mean.oof.RSq 0.1452849
위의 코드에서 pmethod = ‘cv’, nfold = 5를 통해 5겹 교차검증의 모형 선택을 명시하고, ncross = 3으로 반복은 3회, degree = 1로 상호작용항 없이 가산모형을 쓰고, mispan = -1로 힌지는 입력 피처 하나당 1개만 사용했다.
plotmo(earth.fit)
## plotmo grid: status duration
## ... >= 200 DM / salary for at least 1 year 18
## credit_history purpose amount
## no credits taken/all credits paid back duly furniture/equipment 2347
## savings employment_duration installment_rate other_debtors
## unknown/no savings account 1 <= ... < 4 yrs < 20 none
## property age other_installment_plans housing
## building soc. savings agr./life insurance 33 none rent
## number_credits job
## 1 skilled employee/official
’status’변수를 보면 클래스별로 결과변수에 차이가 있는 것을 볼 수 있다. 반면, ’other_installment_plans’변수의 클래스별 차이는 거의 없는것으로 보인다.
plotd(earth.fit)
evimp(earth.fit)
## nsubsets gcv rss
## status... >= 200 DM / salary for at least 1 year 24 100.0 100.0
## duration 23 63.9 79.5
## housingrent 22 54.1 73.9
## purposecar (new) 21 47.1 69.8
## credit_historyall credits at this bank paid back duly 20 36.1 64.5
## other_debtorsguarantor 19 28.0 60.7
## status0<= ... < 200 DM 17 19.1 55.6
## credit_historyno credits taken/all credits paid back du 17 9.0 54.1
## credit_historyexisting credits paid back duly till now 17 9.0 54.1
## amount 15 -21.2 47.8
## savings... >= 1000 DM 13 -28.8 41.6
## employment_duration4 <= ... < 7 yrs 12 -30.6 38.8
## age 11 -32.2 35.8
## purposefurniture/equipment 10 -32.2 33.4
## purposeretraining 9 -30.3 31.8
## purposecar (used) 8 -28.4 30.1
## savings100 <= ... < 500 DM 7 -28.4 27.2
## purposevacation 5 -26.2 21.7
## status... < 0 DM 4 -24.9 18.4
## purposeradio/television 3 -22.3 15.5
## installment_rate.L 2 -18.1 12.8
## employment_duration>= 7 yrs 1 -12.6 9.2
‘status’ 변수의 클래스 중 ’… >= 200 DM / salary for at least 1 year’클래스가 가장 중요한 변수로 나타났고, 그 다음으로 중요한 변수는 ’duration’으로 나타났다.
test.earth.fit <- predict(earth.fit, newdata = test, type = 'class')
test.earth.fit <- as.factor(test.earth.fit)
earth.confusion <- confusionMatrix(test.earth.fit, test$credit_risk, positive = 'good')
earth.confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 39 27
## good 51 183
##
## Accuracy : 0.74
## 95% CI : (0.6865, 0.7887)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.072279
##
## Kappa : 0.3299
##
## Mcnemar's Test P-Value : 0.009208
##
## Sensitivity : 0.8714
## Specificity : 0.4333
## Pos Pred Value : 0.7821
## Neg Pred Value : 0.5909
## Prevalence : 0.7000
## Detection Rate : 0.6100
## Detection Prevalence : 0.7800
## Balanced Accuracy : 0.6524
##
## 'Positive' Class : good
##
earth.confusion$byClass
## Sensitivity Specificity Pos Pred Value
## 0.8714286 0.4333333 0.7820513
## Neg Pred Value Precision Recall
## 0.5909091 0.7820513 0.8714286
## F1 Prevalence Detection Rate
## 0.8243243 0.7000000 0.6100000
## Detection Prevalence Balanced Accuracy
## 0.7800000 0.6523810
MARS모델의 Accuracy는 0.74, Kappa는 0.330, F1-score는 0.824인 것으로 나타났다.
library(ROCR)
test.earth.pred <- predict(earth.fit, newdata = test, type = 'response')
head(test.earth.pred)
## good
## [1,] 0.7100890
## [2,] 0.5994287
## [3,] 0.6148317
## [4,] 0.7237764
## [5,] 0.8388008
## [6,] 0.2662603
pred.earth <- prediction(test.earth.pred, test_dummy$credit_risk)
perf.earth <- performance(pred.earth, 'tpr', 'fpr')
plot(perf.earth, col = 1)
legend(0.6, 0.6, c('MARS'), 1)
performance(pred.earth, 'auc')@y.values
## [[1]]
## [1] 0.7833333
MARS모델에서 최종 성능은 다음과 같다. (Accuracy : 0.740, Kappa : 0.330, F1-score : 0.824, AUC : 0.783)
library(class)
library(kknn)
##
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
##
## contr.dummy
library(e1071)
##
## Attaching package: 'e1071'
## The following objects are masked from 'package:PerformanceAnalytics':
##
## kurtosis, skewness
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
## The following object is masked from 'package:psych':
##
## alpha
library(corrplot)
## corrplot 0.92 loaded
grid1 <- expand.grid(.k = seq(2, 30, by = 1))
control <- trainControl(method = 'cv')
set.seed(123)
knn.train <- train(credit_risk ~., data = scaled_train_dummy,
method = 'knn',
trControl = control,
tuneGrid = grid1)
knn.train
## k-Nearest Neighbors
##
## 700 samples
## 56 predictor
## 2 classes: 'bad', 'good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 630, 630, 630, 630, 630, 630, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 2 0.6442857 0.1572757
## 3 0.7014286 0.2363887
## 4 0.7100000 0.2528279
## 5 0.7157143 0.2497132
## 6 0.7014286 0.2078147
## 7 0.7171429 0.2446431
## 8 0.7142857 0.2384749
## 9 0.7100000 0.2070216
## 10 0.7185714 0.2263456
## 11 0.7157143 0.2074417
## 12 0.7128571 0.2040102
## 13 0.7157143 0.1915348
## 14 0.7085714 0.1706280
## 15 0.7142857 0.1858419
## 16 0.7114286 0.1768013
## 17 0.7028571 0.1441561
## 18 0.7085714 0.1701390
## 19 0.7085714 0.1585200
## 20 0.7128571 0.1760434
## 21 0.7142857 0.1581206
## 22 0.7128571 0.1420028
## 23 0.7057143 0.1172391
## 24 0.7085714 0.1274051
## 25 0.7100000 0.1269770
## 26 0.7085714 0.1237757
## 27 0.7114286 0.1344843
## 28 0.7057143 0.1039521
## 29 0.7114286 0.1171391
## 30 0.7114286 0.1205338
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 10.
k매개변수 값으로 10이 나왔다.
knn.model <- knn(scaled_train_dummy[, -57], scaled_test_dummy[, -57], scaled_train_dummy[, 57], k = 10)
summary(knn.model)
## bad good
## 34 266
• scaled_test_dummy를 이용해 모형이 얼마나 잘 작동하는지 보자.
knn.confusion <- confusionMatrix(knn.model, scaled_test_dummy$credit_risk, positive = 'good')
knn.confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 22 12
## good 68 198
##
## Accuracy : 0.7333
## 95% CI : (0.6795, 0.7825)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.1149
##
## Kappa : 0.2278
##
## Mcnemar's Test P-Value : 7.788e-10
##
## Sensitivity : 0.9429
## Specificity : 0.2444
## Pos Pred Value : 0.7444
## Neg Pred Value : 0.6471
## Prevalence : 0.7000
## Detection Rate : 0.6600
## Detection Prevalence : 0.8867
## Balanced Accuracy : 0.5937
##
## 'Positive' Class : good
##
knn.confusion$byClass
## Sensitivity Specificity Pos Pred Value
## 0.9428571 0.2444444 0.7443609
## Neg Pred Value Precision Recall
## 0.6470588 0.7443609 0.9428571
## F1 Prevalence Detection Rate
## 0.8319328 0.7000000 0.6600000
## Detection Prevalence Balanced Accuracy
## 0.8866667 0.5936508
Accuracy : 0.733, Kappa : 0.228, F1-score : 0.832의 성능이 나왔다.
kknn.train <- train.kknn(credit_risk ~., data = scaled_train_dummy, kmax = 30, distance = 2,
kernel = c('rectangular', 'triangular', 'epanechnikov', 'cosine', 'gaussian', 'optimal'))
plot(kknn.train)
kknn.train
##
## Call:
## train.kknn(formula = credit_risk ~ ., data = scaled_train_dummy, kmax = 30, distance = 2, kernel = c("rectangular", "triangular", "epanechnikov", "cosine", "gaussian", "optimal"))
##
## Type of response variable: nominal
## Minimal misclassification: 0.27
## Best kernel: rectangular
## Best k: 24
k = 24, rectangular커널을 사용했을 때 27%의 오류가 나온 것으로 나타났다.
kknn.train2 <- train.kknn(credit_risk ~., data = scaled_train_dummy, kmax = 30, distance = 1,
kernel = c('rectangular', 'triangular', 'epanechnikov', 'cosine', 'gaussian', 'optimal'))
plot(kknn.train2)
kknn.train2
##
## Call:
## train.kknn(formula = credit_risk ~ ., data = scaled_train_dummy, kmax = 30, distance = 1, kernel = c("rectangular", "triangular", "epanechnikov", "cosine", "gaussian", "optimal"))
##
## Type of response variable: nominal
## Minimal misclassification: 0.2642857
## Best kernel: rectangular
## Best k: 12
오류율이 더 떨어졌다. 최종 knn 하이퍼파라미터로 유클리드 거리를 사용하고, k가 12, rectangular커널을 사용하자.
kknn.fit <- predict(kknn.train2, newdata = scaled_test_dummy)
kknn.fit <- as.factor(kknn.fit)
kknn.confusion <- confusionMatrix(kknn.fit, scaled_test_dummy$credit_risk, positive = 'good')
kknn.confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 24 12
## good 66 198
##
## Accuracy : 0.74
## 95% CI : (0.6865, 0.7887)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.07228
##
## Kappa : 0.2529
##
## Mcnemar's Test P-Value : 1.96e-09
##
## Sensitivity : 0.9429
## Specificity : 0.2667
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.6667
## Prevalence : 0.7000
## Detection Rate : 0.6600
## Detection Prevalence : 0.8800
## Balanced Accuracy : 0.6048
##
## 'Positive' Class : good
##
kknn.confusion$byClass
## Sensitivity Specificity Pos Pred Value
## 0.9428571 0.2666667 0.7500000
## Neg Pred Value Precision Recall
## 0.6666667 0.7500000 0.9428571
## F1 Prevalence Detection Rate
## 0.8354430 0.7000000 0.6600000
## Detection Prevalence Balanced Accuracy
## 0.8800000 0.6047619
Accuracy : 0.74, Kappa : 0.253, F1-score : 0.835의 성능이 나왔다.
plot(perf.earth, col = 1)
knn.pred <- predict(kknn.train2, newdata = scaled_test_dummy, type = 'prob')
head(knn.pred)
## bad good
## [1,] 0.08333333 0.9166667
## [2,] 0.00000000 1.0000000
## [3,] 0.41666667 0.5833333
## [4,] 0.16666667 0.8333333
## [5,] 0.25000000 0.7500000
## [6,] 0.58333333 0.4166667
knn.pred.good <- knn.pred[, 2]
pred.knn <- prediction(knn.pred.good, scaled_test_dummy$credit_risk)
perf.knn <- performance(pred.knn, 'tpr', 'fpr')
plot(perf.knn, col = 2, add = TRUE)
legend(0.6, 0.6, c('MARS', 'KNN'), 1:2)
performance(pred.earth, 'auc')@y.values
## [[1]]
## [1] 0.7833333
performance(pred.knn, 'auc')@y.values
## [[1]]
## [1] 0.7093122
AUC를 모델 성능 기준으로 삼는다면 MARS모형보다 KNN모형의 성능이 더 떨어지는 것으로 나왔다.
set.seed(321)
linear.tune <- tune.svm(credit_risk ~., data = scaled_train, kernel = 'linear',
cost = c(0.001, 0.01, 0.1, 0.5, 1, 2, 5, 10))
summary(linear.tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.5
##
## - best performance: 0.2528571
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.3000000 0.03809524
## 2 1e-02 0.3000000 0.03809524
## 3 1e-01 0.2628571 0.06030911
## 4 5e-01 0.2528571 0.05872801
## 5 1e+00 0.2628571 0.07535254
## 6 2e+00 0.2628571 0.06774636
## 7 5e+00 0.2628571 0.06324555
## 8 1e+01 0.2628571 0.05642405
이 문제에서 최적의 cost함수는 0.5로 나왔고, 분류 오류 비율은 대략 25.28%이다.
best.linear <- linear.tune$best.model
tune.fit <- predict(best.linear, newdata = scaled_test, type = 'class')
tune.fit <- as.factor(tune.fit)
linear.svm.confusion <- confusionMatrix(tune.fit, scaled_test$credit_risk, positive = 'good')
linear.svm.confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 36 19
## good 54 191
##
## Accuracy : 0.7567
## 95% CI : (0.704, 0.8041)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.01741
##
## Kappa : 0.3482
##
## Mcnemar's Test P-Value : 6.909e-05
##
## Sensitivity : 0.9095
## Specificity : 0.4000
## Pos Pred Value : 0.7796
## Neg Pred Value : 0.6545
## Prevalence : 0.7000
## Detection Rate : 0.6367
## Detection Prevalence : 0.8167
## Balanced Accuracy : 0.6548
##
## 'Positive' Class : good
##
linear.svm.confusion$byClass
## Sensitivity Specificity Pos Pred Value
## 0.9095238 0.4000000 0.7795918
## Neg Pred Value Precision Recall
## 0.6545455 0.7795918 0.9095238
## F1 Prevalence Detection Rate
## 0.8395604 0.7000000 0.6366667
## Detection Prevalence Balanced Accuracy
## 0.8166667 0.6547619
Accuracy : 0.757, Kappa : 0.348, F1-score : 0.840의 성능이 나왔다.
set.seed(321)
poly.tune <- tune.svm(credit_risk ~., data = scaled_train,
kernel = 'polynomial',
degree = c(2, 3, 4, 5),
coef0 = c(0.1, 0.5, 1, 2, 3, 4))
summary(poly.tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## degree coef0
## 4 3
##
## - best performance: 0.2457143
##
## - Detailed performance results:
## degree coef0 error dispersion
## 1 2 0.1 0.3028571 0.03676240
## 2 3 0.1 0.3000000 0.03809524
## 3 4 0.1 0.3000000 0.03809524
## 4 5 0.1 0.3000000 0.03809524
## 5 2 0.5 0.2914286 0.04216370
## 6 3 0.5 0.2842857 0.04877176
## 7 4 0.5 0.2842857 0.04439056
## 8 5 0.5 0.2885714 0.03353679
## 9 2 1.0 0.2785714 0.06181312
## 10 3 1.0 0.2700000 0.06002645
## 11 4 1.0 0.2614286 0.04668124
## 12 5 1.0 0.2557143 0.06040303
## 13 2 2.0 0.2585714 0.05888226
## 14 3 2.0 0.2528571 0.06062786
## 15 4 2.0 0.2528571 0.06354958
## 16 5 2.0 0.2485714 0.04577377
## 17 2 3.0 0.2585714 0.05771539
## 18 3 3.0 0.2557143 0.07077479
## 19 4 3.0 0.2457143 0.05379056
## 20 5 3.0 0.2871429 0.05489632
## 21 2 4.0 0.2600000 0.05294073
## 22 3 4.0 0.2528571 0.07378648
## 23 4 4.0 0.2514286 0.05561448
## 24 5 4.0 0.2885714 0.04507489
이 모형은 다항식의 차수 degree의 값으로 4, 커널계수는 3을 선택했다.
best.poly <- poly.tune$best.model
poly.test <- predict(best.poly, newdata = scaled_test, type = 'class')
poly.test <- as.factor(poly.test)
poly.svm.confusion <- confusionMatrix(poly.test, scaled_test$credit_risk, positive = 'good')
poly.svm.confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 37 32
## good 53 178
##
## Accuracy : 0.7167
## 95% CI : (0.662, 0.767)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.28736
##
## Kappa : 0.2772
##
## Mcnemar's Test P-Value : 0.03006
##
## Sensitivity : 0.8476
## Specificity : 0.4111
## Pos Pred Value : 0.7706
## Neg Pred Value : 0.5362
## Prevalence : 0.7000
## Detection Rate : 0.5933
## Detection Prevalence : 0.7700
## Balanced Accuracy : 0.6294
##
## 'Positive' Class : good
##
poly.svm.confusion$byClass
## Sensitivity Specificity Pos Pred Value
## 0.8476190 0.4111111 0.7705628
## Neg Pred Value Precision Recall
## 0.5362319 0.7705628 0.8476190
## F1 Prevalence Detection Rate
## 0.8072562 0.7000000 0.5933333
## Detection Prevalence Balanced Accuracy
## 0.7700000 0.6293651
Accuracy : 0.717, Kappa : 0.277, F1-score : 0.807의 성능이 나왔다. 선형 SVM모델보다 성능이 떨어진다.
set.seed(321)
rbf.tune <- tune.svm(credit_risk ~., data = scaled_train,
kernel = 'radial',
gamma = c(0.01, 0.05, 0.1, 0.5, 1, 2, 3, 4))
summary(rbf.tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma
## 0.1
##
## - best performance: 0.2528571
##
## - Detailed performance results:
## gamma error dispersion
## 1 0.01 0.3000000 0.03809524
## 2 0.05 0.2685714 0.05163978
## 3 0.10 0.2528571 0.04366955
## 4 0.50 0.2928571 0.04377328
## 5 1.00 0.3014286 0.03592004
## 6 2.00 0.3014286 0.03592004
## 7 3.00 0.3014286 0.03592004
## 8 4.00 0.3014286 0.03592004
최적의 gamma값은 0.1이다.
best.rbf <- rbf.tune$best.model
rbf.test <- predict(best.rbf, newdata = scaled_test, type = 'class')
rbf.test <- as.factor(rbf.test)
rbf.svm.confusion <- confusionMatrix(rbf.test, scaled_test$credit_risk, positive = 'good')
rbf.svm.confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 33 13
## good 57 197
##
## Accuracy : 0.7667
## 95% CI : (0.7146, 0.8134)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.006111
##
## Kappa : 0.3542
##
## Mcnemar's Test P-Value : 2.755e-07
##
## Sensitivity : 0.9381
## Specificity : 0.3667
## Pos Pred Value : 0.7756
## Neg Pred Value : 0.7174
## Prevalence : 0.7000
## Detection Rate : 0.6567
## Detection Prevalence : 0.8467
## Balanced Accuracy : 0.6524
##
## 'Positive' Class : good
##
rbf.svm.confusion$byClass
## Sensitivity Specificity Pos Pred Value
## 0.9380952 0.3666667 0.7755906
## Neg Pred Value Precision Recall
## 0.7173913 0.7755906 0.9380952
## F1 Prevalence Detection Rate
## 0.8491379 0.7000000 0.6566667
## Detection Prevalence Balanced Accuracy
## 0.8466667 0.6523810
Accuracy : 0.767, Kappa : 0.354, F1-score : 0.848의 성능이 나왔다. F1-score를 성능 평가의 기준으로 본다면 선형 SVM모델보다 성능이 좋다.
set.seed(321)
sigmoid.tune <- tune.svm(credit_risk ~., data = scaled_train,
kernel = 'sigmoid',
gamma = c(0.1, 0.5, 1, 2, 3, 4),
coef0 = c(0.1, 0.5, 1, 2, 3, 4, 5))
summary(sigmoid.tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma coef0
## 0.5 4
##
## - best performance: 0.2957143
##
## - Detailed performance results:
## gamma coef0 error dispersion
## 1 0.1 0.1 0.3014286 0.05279061
## 2 0.5 0.1 0.3957143 0.05872801
## 3 1.0 0.1 0.4100000 0.03566663
## 4 2.0 0.1 0.3814286 0.06496118
## 5 3.0 0.1 0.3885714 0.05824337
## 6 4.0 0.1 0.3942857 0.06142673
## 7 0.1 0.5 0.3157143 0.05489632
## 8 0.5 0.5 0.3814286 0.03871519
## 9 1.0 0.5 0.3985714 0.05652443
## 10 2.0 0.5 0.3985714 0.07013107
## 11 3.0 0.5 0.4042857 0.07316927
## 12 4.0 0.5 0.3957143 0.06565560
## 13 0.1 1.0 0.3028571 0.04893422
## 14 0.5 1.0 0.4071429 0.04479735
## 15 1.0 1.0 0.3928571 0.06217888
## 16 2.0 1.0 0.3957143 0.06869367
## 17 3.0 1.0 0.3871429 0.06681955
## 18 4.0 1.0 0.3957143 0.06634275
## 19 0.1 2.0 0.3000000 0.03809524
## 20 0.5 2.0 0.3442857 0.03045386
## 21 1.0 2.0 0.3657143 0.05722217
## 22 2.0 2.0 0.3842857 0.05489632
## 23 3.0 2.0 0.3900000 0.06319175
## 24 4.0 2.0 0.3928571 0.06397633
## 25 0.1 3.0 0.3000000 0.03809524
## 26 0.5 3.0 0.3042857 0.04261838
## 27 1.0 3.0 0.3628571 0.05479295
## 28 2.0 3.0 0.3857143 0.05387480
## 29 3.0 3.0 0.3842857 0.05732115
## 30 4.0 3.0 0.3857143 0.07063046
## 31 0.1 4.0 0.3000000 0.03809524
## 32 0.5 4.0 0.2957143 0.04261838
## 33 1.0 4.0 0.3371429 0.04675405
## 34 2.0 4.0 0.3728571 0.05448168
## 35 3.0 4.0 0.3914286 0.05395892
## 36 4.0 4.0 0.3871429 0.05771539
## 37 0.1 5.0 0.3000000 0.03809524
## 38 0.5 5.0 0.3000000 0.03809524
## 39 1.0 5.0 0.3114286 0.04507489
## 40 2.0 5.0 0.3657143 0.05093235
## 41 3.0 5.0 0.3757143 0.05261851
## 42 4.0 5.0 0.3942857 0.05437753
최적의 gamma값은 0.5이고, coef0값은 4이다.
best.sigmoid <- sigmoid.tune$best.model
sigmoid.test <- predict(best.sigmoid, newdata = scaled_test, type = 'class')
sigmoid.test <- as.factor(sigmoid.test)
sigmoid.svm.confusion <- confusionMatrix(sigmoid.test, scaled_test$credit_risk, positive = 'good')
sigmoid.svm.confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 7 11
## good 83 199
##
## Accuracy : 0.6867
## 95% CI : (0.6309, 0.7387)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.7165
##
## Kappa : 0.0329
##
## Mcnemar's Test P-Value : 2.423e-13
##
## Sensitivity : 0.94762
## Specificity : 0.07778
## Pos Pred Value : 0.70567
## Neg Pred Value : 0.38889
## Prevalence : 0.70000
## Detection Rate : 0.66333
## Detection Prevalence : 0.94000
## Balanced Accuracy : 0.51270
##
## 'Positive' Class : good
##
sigmoid.svm.confusion$byClass
## Sensitivity Specificity Pos Pred Value
## 0.94761905 0.07777778 0.70567376
## Neg Pred Value Precision Recall
## 0.38888889 0.70567376 0.94761905
## F1 Prevalence Detection Rate
## 0.80894309 0.70000000 0.66333333
## Detection Prevalence Balanced Accuracy
## 0.94000000 0.51269841
Accuracy : 0.687, Kappa : 0.0329, F1-score : 0.809의 성능이 나왔다. SVM모델 중 sigmoid커널 함수를 사용한 모델이 가장 성능이 나쁘게 나왔다.
plot(perf.earth, col = 1)
plot(perf.knn, col = 2, add = TRUE)
set.seed(321)
rbf.model <- svm(credit_risk ~., data = scaled_train, kernel = 'radial', gamma = 0.1, scale = FALSE, probability = TRUE)
pred_svm <- predict(rbf.model, scaled_test, probability = TRUE)
prob <- attr(pred_svm, 'probabilities')
head(prob)
## good bad
## 2 0.7481733 0.2518267
## 4 0.6750461 0.3249539
## 10 0.5945038 0.4054962
## 15 0.6479892 0.3520108
## 22 0.7126924 0.2873076
## 23 0.2592553 0.7407447
pred.svm <- prediction(prob[, 1], scaled_test$credit_risk)
perf.svm <- performance(pred.svm, 'tpr', 'fpr')
plot(perf.svm, col = 3, add = TRUE)
legend(0.6, 0.6, c('MARS', 'KNN', 'SVM'), 1:3)
performance(pred.earth, 'auc')@y.values
## [[1]]
## [1] 0.7833333
performance(pred.knn, 'auc')@y.values
## [[1]]
## [1] 0.7093122
performance(pred.svm, 'auc')@y.values
## [[1]]
## [1] 0.7931217
AUC를 성능 평가의 기준으로 본다면 방사 기저 함수를 사용한 SVM모델이 가장 좋은 성능을 보이고 있다(다만 MARS모형과 SVM모형의 AUC가 거의 비슷하다).
library(rpart)
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(genridge)
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:psych':
##
## logit
##
## Attaching package: 'genridge'
## The following object is masked from 'package:caret':
##
## precision
library(randomForest)
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:psych':
##
## outlier
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
set.seed(123)
rf.model <- randomForest(credit_risk ~., data = train)
rf.model
##
## Call:
## randomForest(formula = credit_risk ~ ., data = train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 24.57%
## Confusion matrix:
## bad good class.error
## bad 76 134 0.63809524
## good 38 452 0.07755102
수행 결과 OOB오차율이 24.57%로 나왔다.
plot(rf.model)
which.min(rf.model$err.rate[, 1])
## [1] 440
모형 정확도를 최적화 하기에 필요한 트리수가 440개면 된다는 결과를 얻었다.
set.seed(123)
rf.model2 <- randomForest(credit_risk ~., data = train, ntree = 440)
rf.model2
##
## Call:
## randomForest(formula = credit_risk ~ ., data = train, ntree = 440)
## Type of random forest: classification
## Number of trees: 440
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 23.43%
## Confusion matrix:
## bad good class.error
## bad 80 130 0.61904762
## good 34 456 0.06938776
OOB오차율이 24.57%에서 23.43%로 약간 떨어졌다.
rf.fit <- predict(rf.model2, newdata = test, type = 'class')
rf.fit <- as.factor(rf.fit)
rf.confusion <- confusionMatrix(rf.fit, test$credit_risk, positive = 'good')
rf.confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 28 13
## good 62 197
##
## Accuracy : 0.75
## 95% CI : (0.697, 0.798)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.03224
##
## Kappa : 0.2951
##
## Mcnemar's Test P-Value : 2.981e-08
##
## Sensitivity : 0.9381
## Specificity : 0.3111
## Pos Pred Value : 0.7606
## Neg Pred Value : 0.6829
## Prevalence : 0.7000
## Detection Rate : 0.6567
## Detection Prevalence : 0.8633
## Balanced Accuracy : 0.6246
##
## 'Positive' Class : good
##
rf.confusion$byClass
## Sensitivity Specificity Pos Pred Value
## 0.9380952 0.3111111 0.7606178
## Neg Pred Value Precision Recall
## 0.6829268 0.7606178 0.9380952
## F1 Prevalence Detection Rate
## 0.8400853 0.7000000 0.6566667
## Detection Prevalence Balanced Accuracy
## 0.8633333 0.6246032
Accuracy : 0.750, Kappa : 0.295, F1-score : 0.840의 성능이 나왔다.
rf_mtry <- seq(4, ncol(train)*0.8, by = 2)
rf_nodesize <- seq(3, 8, by = 1)
rf_sample_size <- nrow(train) * c(0.7, 0.8)
rf_hyper_grid <- expand.grid(mtry = rf_mtry,
nodesize = rf_nodesize,
samplesize = rf_sample_size)
rf_oob_err <- c()
for(i in 1:nrow(rf_hyper_grid)) {
model <- randomForest(credit_risk ~., data = train,
mtry = rf_hyper_grid$mtry[i],
nodesize = rf_hyper_grid$nodesize[i],
samplesize = rf_hyper_grid$samplesize[i])
rf_oob_err[i] <- model$err.rate[nrow(model$err.rate), 'OOB']
}
rf_hyper_grid[which.min(rf_oob_err), ]
## mtry nodesize samplesize
## 17 6 6 490
rf_best_model <- randomForest(credit_risk ~., data = train,
mtry = 6,
nodesize = 6,
samplesize = 490,
proximity = TRUE,
importance = TRUE)
varImpPlot(rf_best_model, type = 1)
varImpPlot(rf_best_model, type = 2)
첫번째 plot을 보자. 변수의 값을 랜덤하게 섞었다면, 모델의 정확도가 감소하는 정도를 측정한다.(type = 1)
변수를 랜덤하게 섞는다는 것은 해당 변수가 예측에 미치는 모든 영향력을 제거하는 것을 의미한다. 정확도는 OOB데이터로부터 얻는다. 이는 결국 교차 타당성과 같은 효과를 얻는다.
두번째 plot을 보자. 특정 변수를 기준으로 분할이 일어난 모든 노드에서 불순도 점수의 평균 감소량을 측정한다.(type = 2)
이 지표는 해당 변수가 노드의 불순도를 개선하는데 얼마나 기여했는지를 나타낸다.
그러나 이 지표는 학습데이터를 기반으로 측정되기 때문에 OOB데이터를 가지고 계산한 것에 비해 믿을만하지 못하다.
우리의 plot에서 첫번째와 두번째 plot의 변수 중요도 순서가 다소 다른 것을 볼 수 있는데, OOB데이터로부터 정확도 감소량을 측정한 첫번째 plot이 더 믿을만하다.
rf.predict <- predict(rf_best_model, newdata = test, type = 'class')
rf.predict <- as.factor(rf.predict)
rf.best.confusion <- confusionMatrix(rf.predict, test$credit_risk, positive = 'good')
rf.best.confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 32 19
## good 58 191
##
## Accuracy : 0.7433
## 95% CI : (0.69, 0.7918)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.05608
##
## Kappa : 0.3025
##
## Mcnemar's Test P-Value : 1.488e-05
##
## Sensitivity : 0.9095
## Specificity : 0.3556
## Pos Pred Value : 0.7671
## Neg Pred Value : 0.6275
## Prevalence : 0.7000
## Detection Rate : 0.6367
## Detection Prevalence : 0.8300
## Balanced Accuracy : 0.6325
##
## 'Positive' Class : good
##
rf.best.confusion$byClass
## Sensitivity Specificity Pos Pred Value
## 0.9095238 0.3555556 0.7670683
## Neg Pred Value Precision Recall
## 0.6274510 0.7670683 0.9095238
## F1 Prevalence Detection Rate
## 0.8322440 0.7000000 0.6366667
## Detection Prevalence Balanced Accuracy
## 0.8300000 0.6325397
Accuracy : 0.743, Kappa : 0.303, F1-score : 0.832의 성능이 나왔다. F1-score를 성능 평가의 기준으로 본다면 기본 랜덤 포레스트 모델보다 성능이 떨어졌다.
rf_mtry <- seq(1, 12, by = 2)
rf_nodesize <- seq(2, 8, by = 1)
rf_sample_size <- seq(350, 700, by = 25)
rf_hyper_grid <- expand.grid(mtry = rf_mtry,
nodesize = rf_nodesize,
samplesize = rf_sample_size)
rf_oob_err <- c()
for(i in 1:nrow(rf_hyper_grid)) {
model2 <- randomForest(credit_risk ~., data = scaled_train,
mtry = rf_hyper_grid$mtry[i],
nodesize = rf_hyper_grid$nodesize[i],
samplesize = rf_hyper_grid$samplesize[i])
rf_oob_err[i] <- model2$err.rate[nrow(model2$err.rate), 'OOB']
}
rf_hyper_grid[which.min(rf_oob_err), ]
## mtry nodesize samplesize
## 176 3 3 450
rf_best_model2 <- randomForest(credit_risk ~., data = scaled_train,
mtry = 3,
nodesize = 6,
samplesize = 525,
proximity = TRUE,
importance = TRUE)
rf.predict2 <- predict(rf_best_model2, newdata = scaled_test, type = 'class')
rf.predict2 <- as.factor(rf.predict2)
rf.best.confusion2 <- confusionMatrix(rf.predict2, scaled_test$credit_risk, positive = 'good')
rf.best.confusion2
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 29 17
## good 61 193
##
## Accuracy : 0.74
## 95% CI : (0.6865, 0.7887)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.07228
##
## Kappa : 0.2804
##
## Mcnemar's Test P-Value : 1.123e-06
##
## Sensitivity : 0.9190
## Specificity : 0.3222
## Pos Pred Value : 0.7598
## Neg Pred Value : 0.6304
## Prevalence : 0.7000
## Detection Rate : 0.6433
## Detection Prevalence : 0.8467
## Balanced Accuracy : 0.6206
##
## 'Positive' Class : good
##
rf.best.confusion2$byClass
## Sensitivity Specificity Pos Pred Value
## 0.9190476 0.3222222 0.7598425
## Neg Pred Value Precision Recall
## 0.6304348 0.7598425 0.9190476
## F1 Prevalence Detection Rate
## 0.8318966 0.7000000 0.6433333
## Detection Prevalence Balanced Accuracy
## 0.8466667 0.6206349
Accuracy : 0.740, Kappa : 0.280, F1-score : 0.832의 성능이 나왔다. 성능이 더 좋아지지 않았다.
plot(perf.earth, col = 1)
plot(perf.knn, col = 2, add = TRUE)
plot(perf.svm, col = 3, add = TRUE)
rf.pred <- predict(rf.model2, newdata = test, type = 'prob')
head(rf.pred)
## bad good
## 2 0.09772727 0.9022727
## 4 0.22045455 0.7795455
## 10 0.31818182 0.6818182
## 15 0.26363636 0.7363636
## 22 0.32272727 0.6772727
## 23 0.57045455 0.4295455
rf.pred.good <- rf.pred[, 2]
pred.rf <- prediction(rf.pred.good, test$credit_risk)
perf.rf <- performance(pred.rf, 'tpr', 'fpr')
plot(perf.rf, col = 4, add = TRUE)
legend(0.6, 0.6, c('MARS', 'KNN', 'SVM', 'RF'), 1:4)
performance(pred.earth, 'auc')@y.values
## [[1]]
## [1] 0.7833333
performance(pred.knn, 'auc')@y.values
## [[1]]
## [1] 0.7093122
performance(pred.svm, 'auc')@y.values
## [[1]]
## [1] 0.7931217
performance(pred.rf, 'auc')@y.values
## [[1]]
## [1] 0.7965608
AUC를 성능 평가 지표의 기준으로 삼는다면 train, test데이터를 사용하고 하이퍼파라미터를 설정하지 않은 랜덤포레스트 모형이 가장 좋은 성능을 보인다.
grid <- expand.grid(nrounds = seq(50, 200, by = 25),
colsample_bytree = 1,
min_child_weight = 1,
eta = c(0.01, 0.1, 0.3),
gamma = c(0.5, 0.25, 0.1),
subsample = 0.5,
max_depth = c(1, 2, 3))
cntrl <- trainControl(method = 'cv',
number = 5,
verboseIter = TRUE,
returnData = FALSE,
returnResamp = 'final')
set.seed(123)
train.xgb <- train(x = scaled_train_dummy[, 1:56],
y = scaled_train_dummy[, 57],
trControl = cntrl,
tuneGrid = grid,
method = 'xgbTree')
## + Fold1: eta=0.01, max_depth=1, gamma=0.10, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=200 =1, min_child_weight=1, subsample=0.5, nrounds=200
...(생략)
## - Fold5: eta=0.30, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=200
## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 125, max_depth = 2, eta = 0.1, gamma = 0.1, colsample_bytree = 1, min_child_weight = 1, subsample = 0.5 on full training set
param <- list(objective = 'binary:logistic',
booster = 'gbtree',
eval_metric = 'error',
eta = 0.1,
max_depth = 2,
subsample = 0.5,
colsample_bytree = 1,
gamma = 0.1)
x <- as.matrix(scaled_train_dummy[, 1:56])
y <- ifelse(scaled_train_dummy$credit_risk == 'good', 1, 0)
train.mat <- xgb.DMatrix(data = x, label = y)
set.seed(123)
xgb.fit <- xgb.train(params = param, data = train.mat, nrounds = 125)
impMatrix <- xgb.importance(feature_names = dimnames(x)[[2]], model = xgb.fit)
impMatrix
## Feature Gain
## 1: amount 0.157398006
## 2: age 0.111427759
## 3: duration 0.104828102
## 4: status.... >= 200 DM / salary for at least 1 year 0.098779805
## 5: status.no checking account 0.045890670
## 6: credit_history.all credits at this bank paid back duly 0.043147072
## 7: purpose.others 0.041527741
## 8: employment_duration.4 <= ... < 7 yrs 0.027726437
## 9: purpose.car (new) 0.025369274
## 10: housing.rent 0.024689123
## 11: installment_rate.L 0.024615982
## 12: property.unknown / no property 0.024431661
## 13: credit_history.critical account/other credits elsewhere 0.022071490
## 14: savings.unknown/no savings account 0.018883356
## 15: savings.... >= 1000 DM 0.017799799
## 16: credit_history.delay in paying off in the past 0.014739744
## 17: purpose.repairs 0.012970326
## 18: other_debtors.guarantor 0.012083905
## 19: credit_history.existing credits paid back duly till now 0.011969521
## 20: housing.for free 0.009955575
## 21: property.real estate 0.009706637
## 22: employment_duration.1 <= ... < 4 yrs 0.009638779
## 23: job.skilled employee/official 0.009508778
## 24: other_installment_plans.none 0.009203559
## 25: installment_rate.C 0.008579032
## 26: employment_duration.< 1 yr 0.008388248
## 27: number_credits.C 0.008293946
## 28: job.unskilled - resident 0.006975516
## 29: status.... < 0 DM 0.006723607
## 30: status.0<= ... < 200 DM 0.006703870
## 31: other_debtors.co-applicant 0.005636576
## 32: credit_history.no credits taken/all credits paid back duly 0.005447164
## 33: employment_duration.>= 7 yrs 0.005332335
## 34: installment_rate.Q 0.005039385
## 35: housing.own 0.005016942
## 36: purpose.furniture/equipment 0.004801123
## 37: savings.... < 100 DM 0.004398813
## 38: other_debtors.none 0.003846762
## 39: purpose.car (used) 0.003631957
## 40: savings.100 <= ... < 500 DM 0.003357928
## 41: job.manager/self-empl./highly qualif. employee 0.003339993
## 42: property.car or other 0.003102056
## 43: other_installment_plans.stores 0.002855790
## 44: purpose.retraining 0.002484107
## 45: other_installment_plans.bank 0.002402886
## 46: employment_duration.unemployed 0.002156045
## 47: number_credits.L 0.001982209
## 48: property.building soc. savings agr./life insurance 0.001140608
## Feature Gain
## Cover Frequency
## 1: 0.1736933229 0.184065934
## 2: 0.1056685043 0.123626374
## 3: 0.1090901321 0.096153846
## 4: 0.0577161464 0.041208791
## 5: 0.0305699703 0.021978022
## 6: 0.0393700708 0.032967033
## 7: 0.0300374259 0.032967033
## 8: 0.0287408217 0.027472527
## 9: 0.0375002598 0.030219780
## 10: 0.0194150689 0.024725275
## 11: 0.0172320423 0.027472527
## 12: 0.0225202902 0.021978022
## 13: 0.0270201868 0.021978022
## 14: 0.0212742631 0.021978022
## 15: 0.0217733109 0.019230769
## 16: 0.0202348286 0.016483516
## 17: 0.0129862878 0.010989011
## 18: 0.0169555023 0.013736264
## 19: 0.0116250093 0.013736264
## 20: 0.0135117563 0.010989011
## 21: 0.0082403339 0.008241758
## 22: 0.0047013343 0.010989011
## 23: 0.0092542970 0.010989011
## 24: 0.0118169056 0.008241758
## 25: 0.0120544813 0.016483516
## 26: 0.0088895999 0.008241758
## 27: 0.0088333155 0.013736264
## 28: 0.0085558879 0.010989011
## 29: 0.0084992173 0.010989011
## 30: 0.0072959114 0.005494505
## 31: 0.0105030702 0.008241758
## 32: 0.0080729294 0.008241758
## 33: 0.0066758786 0.005494505
## 34: 0.0070290268 0.008241758
## 35: 0.0047924115 0.008241758
## 36: 0.0071471322 0.005494505
## 37: 0.0061498666 0.005494505
## 38: 0.0064975358 0.005494505
## 39: 0.0070939874 0.005494505
## 40: 0.0067544992 0.005494505
## 41: 0.0073590498 0.005494505
## 42: 0.0019204633 0.005494505
## 43: 0.0028404522 0.002747253
## 44: 0.0036840250 0.005494505
## 45: 0.0035308398 0.002747253
## 46: 0.0029040537 0.002747253
## 47: 0.0013373337 0.008241758
## 48: 0.0006309599 0.002747253
## Cover Frequency
xgb.plot.importance(impMatrix, main = 'Gain by Feature')
library(InformationValue)
##
## Attaching package: 'InformationValue'
## The following object is masked from 'package:genridge':
##
## precision
## The following objects are masked from 'package:caret':
##
## confusionMatrix, precision, sensitivity, specificity
pred <- predict(xgb.fit, x)
optimalCutoff(y, pred)
## [1] 0.5027472
testMat <- as.matrix(scaled_test_dummy[, 1:56])
y.test <- ifelse(scaled_test_dummy$credit_risk == 'good', 1, 0)
xgb.test <- predict(xgb.fit, testMat)
confusionMatrix(y.test, xgb.test, threshold = 0.5027472)
## 0 1
## 0 41 22
## 1 49 188
1 - misClassError(y.test, xgb.test, threshold = 0.5027472)
## [1] 0.7633
약 76.91%의 정확도를 보였다.
Precision <- 188 / (22 + 188)
Recall <- 188 / (49 + 188)
f1 <- (2*Precision*Recall) / (Precision + Recall)
f1
## [1] 0.8411633
F1-score는 0.841이다.
plotROC(y.test, xgb.test)
AUC는 0.791로 SVM모형과 랜덤포레스트 모형과 거의 비슷하다.
library(caretEnsemble)
##
## Attaching package: 'caretEnsemble'
## The following object is masked from 'package:ggplot2':
##
## autoplot
library(caTools)
library(mlr)
## Loading required package: ParamHelpers
## Warning message: 'mlr' is in 'maintenance-only' mode since July 2019.
## Future development will only happen in 'mlr3'
## (<https://mlr3.mlr-org.com>). Due to the focus on 'mlr3' there might be
## uncaught bugs meanwhile in {mlr} - please consider switching.
##
## Attaching package: 'mlr'
## The following object is masked _by_ '.GlobalEnv':
##
## f1
## The following object is masked from 'package:e1071':
##
## impute
## The following object is masked from 'package:ROCR':
##
## performance
## The following object is masked from 'package:caret':
##
## train
library(HDclassif)
library(gbm)
## Loaded gbm 2.1.8
library(mboost)
## Loading required package: parallel
## Loading required package: stabs
##
## Attaching package: 'stabs'
## The following object is masked from 'package:mlr':
##
## subsample
##
## Attaching package: 'mboost'
## The following object is masked from 'package:partykit':
##
## varimp
## The following object is masked from 'package:ggplot2':
##
## %+%
## The following objects are masked from 'package:psych':
##
## %+%, AUC
## The following object is masked from 'package:pastecs':
##
## extract
control <- trainControl(method = 'cv', number = 5,
savePredictions = 'final',
classProbs = TRUE,
summaryFunction = twoClassSummary)
set.seed(123)
Models <- caretList(credit_risk ~., data = scaled_train,
trControl = control,
metric = 'ROC',
methodList = c('rpart', 'earth', 'knn'))
Models
## $rpart
## CART
##
## 700 samples
## 15 predictor
## 2 classes: 'bad', 'good'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 560, 560, 560, 560, 560
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.02222222 0.7201895 0.32857143 0.9020408
## 0.02857143 0.5852527 0.11428571 0.9571429
## 0.03928571 0.5738338 0.07142857 0.9734694
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02222222.
##
## $earth
## Multivariate Adaptive Regression Spline
##
## 700 samples
## 15 predictor
## 2 classes: 'bad', 'good'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 560, 560, 560, 560, 560
## Resampling results across tuning parameters:
##
## nprune ROC Sens Spec
## 2 0.6676871 0.0000000 1.0000000
## 18 0.7451895 0.3809524 0.8693878
## 35 0.7498542 0.3904762 0.8714286
##
## Tuning parameter 'degree' was held constant at a value of 1
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nprune = 35 and degree = 1.
##
## $knn
## k-Nearest Neighbors
##
## 700 samples
## 15 predictor
## 2 classes: 'bad', 'good'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 560, 560, 560, 560, 560
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 5 0.6803450 0.3761905 0.8591837
## 7 0.6905005 0.3238095 0.8551020
## 9 0.6962342 0.3047619 0.8857143
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
##
## attr(,"class")
## [1] "caretList"
model_preds <- lapply(Models, predict, newdata = scaled_test, type = 'prob')
model_preds <- lapply(model_preds, function(x) x[, 'good'])
model_preds <- data.frame(model_preds)
stack <- caretStack(Models, method = 'glm', metric = 'ROC',
trControl = trainControl(
method = 'boot', number = 5,
savePredictions = 'final',
classProbs = TRUE,
summaryFunction = twoClassSummary
))
summary(stack)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2296 -0.8923 0.5065 0.7556 1.9016
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.6837 0.2185 12.282 < 2e-16 ***
## rpart -1.7680 0.4872 -3.629 0.000284 ***
## earth -2.1098 0.4928 -4.282 1.85e-05 ***
## knn -1.6765 0.5391 -3.110 0.001873 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 855.21 on 699 degrees of freedom
## Residual deviance: 722.61 on 696 degrees of freedom
## AIC: 730.61
##
## Number of Fisher Scoring iterations: 4
모형의 계수가 모두 유의하다.
prob <- 1 - predict(stack, newdata = scaled_test, type = 'prob')
model_preds$ensemble <- prob
colAUC(model_preds, scaled_test$credit_risk)
## rpart earth knn ensemble
## bad vs. good 0.7144709 0.7701058 0.6968519 0.7653968
각 모형의 AUC를 봤을 때 앙상블 모형이 오히려 단독 MARS모형보다 성능이 떨어진다.
class <- predict(stack, newdata = scaled_test, type = 'raw')
head(class)
## [1] good good good good good bad
## Levels: bad good
detach("package:InformationValue", unload = TRUE)
ensemble.confusion <- confusionMatrix(class, scaled_test$credit_risk, positive = 'good')
ensemble.confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 32 23
## good 58 187
##
## Accuracy : 0.73
## 95% CI : (0.676, 0.7794)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.1417892
##
## Kappa : 0.2768
##
## Mcnemar's Test P-Value : 0.0001582
##
## Sensitivity : 0.8905
## Specificity : 0.3556
## Pos Pred Value : 0.7633
## Neg Pred Value : 0.5818
## Prevalence : 0.7000
## Detection Rate : 0.6233
## Detection Prevalence : 0.8167
## Balanced Accuracy : 0.6230
##
## 'Positive' Class : good
##
ensemble.confusion$byClass
## Sensitivity Specificity Pos Pred Value
## 0.8904762 0.3555556 0.7632653
## Neg Pred Value Precision Recall
## 0.5818182 0.7632653 0.8904762
## F1 Prevalence Detection Rate
## 0.8219780 0.7000000 0.6233333
## Detection Prevalence Balanced Accuracy
## 0.8166667 0.6230159
앙상블 모형의 성능은 AUC : 0.765, Accuracy : 0.730, Kappa : 0.277, F1-score : 0.822이다.
- | MARS | KNN | SVM | RandomForest | XGBoost | Ensemble |
---|---|---|---|---|---|---|
Accuracy | 0.740 | 0.740 | 0.767 | 0.750 | 0.769 | 0.730 |
F1-score | 0.824 | 0.835 | 0.848 | 0.840 | 0.841 | 0.822 |
Kappa | 0.330 | 0.253 | 0.354 | 0.295 | - | 0.277 |
AUC | 0.783 | 0.709 | 0.793 | 0.797 | 0.791 | 0.765 |
6개 모형 중 어떤 모형을 선택해야 할까?
우선 Accuracy는 전체 데이터에서 올바르게 분류한 데이터의 비율인데 단순히 이 지표만 보고 모형을 선택해서는 안된다. 정확도의 역설이라고 불리는 bias에 관한 단점을 가지기 때문이다.
이런 단점을 보완한 평가 지표가 recall과 precision인데 이 둘은 서로 반비례하는 경향이 있으므로 둘의 조화평균을 사용하는 F1-score를 보자.
F1-score를 기준으로 방사 기저 함수 커널을 사용한 SVM모형이 가장 좋은 성능을 보이고 있다.
SVM모형은 데이터가 크면 훈련 속도가 급격히 느려진다는 단점이 있지만 우리의 경우 데이터가 그렇게 크지 않으므로 훈련에 긴 시간이 걸리지 않았다.
또한 AUC를 성능 평가의 기준으로 삼아도 SVM이 랜덤포레스트 모형과 거의 비슷하게 좋은 성능을 보이고 있으므로 최종 모형으로 SVM를 선택하겠다.
caret, pastecs, psych, PerformanceAnalytics, Fselector, GGally, gridExtra, gapminder, dplyr, class, kknn, e1071, MASS, reshape2, ggplot2, kernlab, corrplot, rpart, partykit, genridge, randomForest, xgboost, Boruta, caretEnsemble, caTools, mlr, HDclassif, earth, Formula, plotmo, plotrix, ROCR, InformationValue, gbm, mboost