Data Mining Theory and Practice: Example 7
Data Mining: Theory and Practice, Final Exam Practice Problem 6
SeongJae-Yoo
2022-06-08
1. Initial Model
Universal Bank case
tidyverse: core tools for tidy-format data
ggplot2, purrr, tibble,
dplyr, tidyr, stringr, readr, forcats:
eight packages bundled into one
tidymodels: tidy model-analysis tools (data mining, statistics)
broom, recipes, dials, rsample, infer,
tune, modeldata, workflows, parsnip,
yardstick
tidytext: text-mining tools
rlang, tibble, dplyr, stringr, hunspell,
generics, lifecycle, Matrix, tokenizers,
janeaustenr, purrr
install.packages("tidyverse")
install.packages("tidymodels")
install.packages("tidytext")
install.packages("skimr")
install.packages("vip")
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 0.2.0 --
## v broom 0.8.0 v rsample 0.1.1
## v dials 0.1.1 v tune 0.2.0
## v infer 1.0.0 v workflows 0.2.6
## v modeldata 0.1.1 v workflowsets 0.2.1
## v parsnip 0.2.1 v yardstick 0.0.9
## v recipes 0.2.0
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## * Search for functions across packages at https://www.tidymodels.org/find/
library(tidytext)
library(skimr) # data summary (EDA)
library(vip)   # identify important variables
##
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
##
## vi
01. Load the Data
churn_tb <- read_csv('data6.csv',
                     col_names = TRUE,
                     locale = locale('ko', encoding = 'euc-kr'),
                     na = ".") %>%           # read the CSV data
  mutate_if(is.character, as.factor)         # convert character columns to factors
## Rows: 100 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (5): phy, psy, cmmt, exp, churn
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(churn_tb)
## tibble [100 x 5] (S3: tbl_df/tbl/data.frame)
## $ phy : num [1:100] 43 54 60 57 60 42 48 46 57 59 ...
## $ psy : num [1:100] 18 27 30 17 30 27 21 18 30 24 ...
## $ cmmt : num [1:100] 28 28 26 23 29 26 23 28 23 26 ...
## $ exp : num [1:100] 1 1 0 0 1 1 0 0 0 0 ...
## $ churn: num [1:100] 0 0 0 0 0 1 0 0 0 0 ...
churn_tb
## # A tibble: 100 x 5
## phy psy cmmt exp churn
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 43 18 28 1 0
## 2 54 27 28 1 0
## 3 60 30 26 0 0
## 4 57 17 23 0 0
## 5 60 30 29 1 0
## 6 42 27 26 1 1
## 7 48 21 23 0 0
## 8 46 18 28 0 0
## 9 57 30 23 0 0
## 10 59 24 26 0 0
## # ... with 90 more rows
02. Data Preprocessing
Convert categorical variables so they are recognized as factors.
Set the level of interest in the outcome variable (class) as level 1.
churn_tb <- churn_tb %>%
  mutate(churn = factor(churn,
                        levels = c(1, 0),                # level of interest (Yes) first
                        labels = c("이직", "근무"))) %>%  # "이직" = left the company, "근무" = still working
  mutate(exp = factor(exp,
                      levels = c(1, 0),
                      labels = c("있음", "없음")))        # "있음" = has experience, "없음" = none
str(churn_tb)
## tibble [100 x 5] (S3: tbl_df/tbl/data.frame)
## $ phy : num [1:100] 43 54 60 57 60 42 48 46 57 59 ...
## $ psy : num [1:100] 18 27 30 17 30 27 21 18 30 24 ...
## $ cmmt : num [1:100] 28 28 26 23 29 26 23 28 23 26 ...
## $ exp : Factor w/ 2 levels "있음","없음": 1 1 2 2 1 1 2 2 2 2 ...
## $ churn: Factor w/ 2 levels "이직","근무": 2 2 2 2 2 1 2 2 2 2 ...
head(churn_tb)
## # A tibble: 6 x 5
## phy psy cmmt exp churn
## <dbl> <dbl> <dbl> <fct> <fct>
## 1 43 18 28 있음 근무
## 2 54 27 28 있음 근무
## 3 60 30 26 없음 근무
## 4 57 17 23 없음 근무
## 5 60 30 29 있음 근무
## 6 42 27 26 있음 이직
Remove unneeded variables such as ID or ZIP code (none are present in this dataset).
This can also be done in the recipe; a sketch follows.
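A minimal sketch, assuming hypothetical ID and zip columns (this dataset has neither):
# Hypothetical: drop identifier columns before modeling
churn_tb %>% select(-ID, -zip)
# or keep them in the data but exclude them from the predictors inside a recipe
recipe(churn ~ ., data = churn_tb) %>%
  update_role(ID, new_role = "id variable")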
03. Exploratory Data Analysis (EDA)
Explore the data, treating categorical and continuous variables separately.
skimr::skim(): the package name is written in front.
When several loaded packages export functions with the same name,
the [package::] prefix makes clear which one is called.
churn_tb %>%
skimr::skim()
Name | Piped data |
Number of rows | 100 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
factor | 2 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
exp | 0 | 1 | FALSE | 2 | 있음: 52, 없음: 48 |
churn | 0 | 1 | FALSE | 2 | 근무: 72, 이직: 28 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
phy | 0 | 1 | 44.86 | 7.45 | 26 | 41 | 45 | 48 | 60 | ▁▂▇▃▃ |
psy | 0 | 1 | 21.78 | 4.88 | 9 | 18 | 22 | 24 | 30 | ▁▅▇▇▆ |
cmmt | 0 | 1 | 22.95 | 4.21 | 12 | 21 | 24 | 26 | 32 | ▂▃▇▇▁ |
churn_tb %>%
group_by(churn) %>%
skimr::skim()
Name | Piped data |
Number of rows | 100 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 3 |
________________________ | |
Group variables | churn |
Variable type: factor
skim_variable | churn | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|---|
exp | 이직 | 0 | 1 | FALSE | 2 | 있음: 18, 없음: 10 |
exp | 근무 | 0 | 1 | FALSE | 2 | 없음: 38, 있음: 34 |
Variable type: numeric
skim_variable | churn | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
phy | 이직 | 0 | 1 | 41.18 | 7.90 | 26 | 35.25 | 43.0 | 46.0 | 56 | ▃▃▆▇▂ |
phy | 근무 | 0 | 1 | 46.29 | 6.80 | 29 | 42.75 | 46.0 | 49.0 | 60 | ▁▃▇▂▃ |
psy | 이직 | 0 | 1 | 19.75 | 4.93 | 11 | 16.00 | 19.0 | 24.0 | 30 | ▃▇▅▆▂ |
psy | 근무 | 0 | 1 | 22.57 | 4.66 | 9 | 19.00 | 23.0 | 24.5 | 30 | ▁▂▇▇▆ |
cmmt | 이직 | 0 | 1 | 18.39 | 3.51 | 12 | 15.75 | 17.5 | 21.0 | 26 | ▂▇▃▆▂ |
cmmt | 근무 | 0 | 1 | 24.72 | 2.94 | 15 | 23.00 | 25.0 | 27.0 | 32 | ▁▂▇▆▁ |
Base accuracy
For the positive class ("이직"), the base rate is 0.28, as computed below.
churn_tb %>%
count(churn) %>%
mutate(prop = n/sum(n))
## # A tibble: 2 x 3
## churn n prop
## <fct> <int> <dbl>
## 1 이직 28 0.28
## 2 근무 72 0.72
04. Split into Training and Test Sets: partition
Data partition
set.seed(123) # fix the random seed for reproducibility
churn_split <- churn_tb %>%
  initial_split(strata = churn) # stratified so the outcome ratio is preserved
churn_split
## <Analysis/Assess/Total>
## <75/25/100>
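For reference, initial_split() defaults to a training proportion of 3/4, which is where the 75/25 split above comes from; the equivalent explicit call would be:
churn_split <- churn_tb %>%
  initial_split(prop = 3/4, strata = churn) # same split with the proportion stated explicitly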
Separate into training and test sets
train_data <- training(churn_split)
test_data <- testing(churn_split)
str(train_data)
## tibble [75 x 5] (S3: tbl_df/tbl/data.frame)
## $ phy : num [1:75] 43 54 60 48 46 57 59 47 43 52 ...
## $ psy : num [1:75] 18 27 30 21 18 30 24 23 24 20 ...
## $ cmmt : num [1:75] 28 28 29 23 28 23 26 27 26 23 ...
## $ exp : Factor w/ 2 levels "있음","없음": 1 1 1 2 2 2 2 1 2 1 ...
## $ churn: Factor w/ 2 levels "이직","근무": 2 2 2 2 2 2 2 2 2 2 ...
str(test_data)
## tibble [25 x 5] (S3: tbl_df/tbl/data.frame)
## $ phy : num [1:25] 60 57 42 36 58 47 37 34 37 45 ...
## $ psy : num [1:25] 30 17 27 18 30 19 20 15 11 20 ...
## $ cmmt : num [1:25] 26 23 26 27 28 20 25 23 17 23 ...
## $ exp : Factor w/ 2 levels "있음","없음": 2 2 1 1 2 2 1 2 2 1 ...
## $ churn: Factor w/ 2 levels "이직","근무": 2 2 1 2 2 2 2 2 1 2 ...
05. Specify the Model
Build the model
Check the model's arguments:
args(decision_tree)
## function (mode = "unknown", engine = "rpart", cost_complexity = NULL,
## tree_depth = NULL, min_n = NULL)
## NULL
tree_model <-
decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
Build the recipe
Commonly used steps (only step_dummy() is applied below; see the sketch after this list):
step_dummy(all_nominal(), -all_outcomes()): dummy (one-hot) encoding of nominal predictors
step_log(Gr_Liv_Area, base = 10): log-transform a variable
step_other(Neighborhood, threshold = 0.01): collapse rare levels into an "other" category
step_upsample(churn): balance the outcome classes
step_zv(all_predictors()): remove zero-variance variables (e.g., all 0s);
especially important for penalized models (logistic regression, SVM, etc.)
step_normalize(all_numeric()): normalize numeric data
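A hedged sketch combining several of the steps above on this dataset; it assumes the themis package (which provides step_upsample()) is installed, and none of these extra steps are used in the model that follows:
library(themis) # assumed installed: provides step_upsample()
demo_recipe <-
  recipe(churn ~ ., data = train_data) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>% # dummy-encode exp
  step_upsample(churn) %>%                       # oversample the minority class (이직)
  step_zv(all_predictors()) %>%                  # drop constant columns
  step_normalize(all_numeric())                  # center and scale numeric predictors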
tree_recipe <-
train_data %>%
recipe(churn ~ .) %>%
step_dummy(all_nominal(), -all_outcomes())
summary(tree_recipe)
## # A tibble: 5 x 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 phy numeric predictor original
## 2 psy numeric predictor original
## 3 cmmt numeric predictor original
## 4 exp nominal predictor original
## 5 churn nominal outcome original
06. Build the Workflow
tree_workflow <-
workflow() %>%
add_model(tree_model) %>%
add_recipe(tree_recipe)
tree_workflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: decision_tree()
##
## -- Preprocessor ----------------------------------------------------------------
## 1 Recipe Step
##
## * step_dummy()
##
## -- Model -----------------------------------------------------------------------
## Decision Tree Model Specification (classification)
##
## Computational engine: rpart
07. Train the Model
Fit the model on the training data
tree_train_fit <-
tree_workflow %>%
fit(data = train_data)
Check the training result
tree_train_fit %>%
extract_fit_parsnip()
## parsnip model object
##
## n= 75
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 75 21 근무 (0.28000000 0.72000000)
## 2) cmmt< 18.5 13 1 이직 (0.92307692 0.07692308) *
## 3) cmmt>=18.5 62 9 근무 (0.14516129 0.85483871) *
The fitted tree splits only on cmmt: the 13 cases with cmmt < 18.5 are classified as 이직 (1 misclassified) and the 62 cases with cmmt >= 18.5 as 근무 (9 misclassified).
08. Evaluate the Trained Model
Build the prediction table
tree_train_pred <-
predict(tree_train_fit,
train_data,
type = "prob") %>%
bind_cols(predict(tree_train_fit,
train_data)) %>%
bind_cols(train_data %>%
select(churn)) %>%
print()
## # A tibble: 75 x 4
## .pred_이직 .pred_근무 .pred_class churn
## <dbl> <dbl> <fct> <fct>
## 1 0.145 0.855 근무 근무
## 2 0.145 0.855 근무 근무
## 3 0.145 0.855 근무 근무
## 4 0.145 0.855 근무 근무
## 5 0.145 0.855 근무 근무
## 6 0.145 0.855 근무 근무
## 7 0.145 0.855 근무 근무
## 8 0.145 0.855 근무 근무
## 9 0.145 0.855 근무 근무
## 10 0.145 0.855 근무 근무
## # ... with 65 more rows
Build the confusion matrix
tree_train_conf <-
tree_train_pred %>%
conf_mat(truth = churn,
estimate = .pred_class)
tree_train_conf
## Truth
## Prediction 이직 근무
## 이직 12 1
## 근무 9 53
autoplot(tree_train_conf, type = "heatmap")
autoplot(tree_train_conf, type = "mosaic")
summary(tree_train_conf)
## # A tibble: 13 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.867
## 2 kap binary 0.626
## 3 sens binary 0.571
## 4 spec binary 0.981
## 5 ppv binary 0.923
## 6 npv binary 0.855
## 7 mcc binary 0.656
## 8 j_index binary 0.553
## 9 bal_accuracy binary 0.776
## 10 detection_prevalence binary 0.173
## 11 precision binary 0.923
## 12 recall binary 0.571
## 13 f_meas binary 0.706
F1: combines recall (↑) and precision (↑)
Recall: of the actual positive cases, the fraction predicted correctly (= TPR = sensitivity)
Precision: of the predicted positive cases, the fraction actually positive
Accuracy: the fraction of both classes (0 and 1) classified correctly
AUC (area under the curve): area under the ROC curve
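A quick hand-check of these metrics against the train confusion matrix above (TP = 12, FN = 9, FP = 1, TN = 53), reproducing the summary() values:
TP <- 12; FN <- 9; FP <- 1; TN <- 53
recall    <- TP / (TP + FN)                                 # 12/21 = 0.571 (sensitivity)
precision <- TP / (TP + FP)                                 # 12/13 = 0.923 (ppv)
accuracy  <- (TP + TN) / (TP + TN + FP + FN)                # 65/75 = 0.867
f1        <- 2 * precision * recall / (precision + recall)  # 0.706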
tree_train_pred %>%
roc_auc(truth = churn,
estimate =.pred_이직)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.776
ROC curve
train_auc <-
tree_train_pred %>%
roc_curve(truth = churn,
estimate = .pred_이직) %>%
mutate(model = "train_auc")
autoplot(train_auc)
Lift curve
tree_train_pred %>%
lift_curve(truth = churn,
estimate = .pred_이직) %>%
autoplot()
Gain curve
tree_train_pred %>%
gain_curve(truth = churn,
estimate = .pred_이직) %>%
autoplot()
Check variable importance
tree_train_fit %>%
pull_workflow_fit() %>%
vip()
## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## Please use `extract_fit_parsnip()` instead.
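As the warning says, extract_fit_parsnip() is the current replacement:
tree_train_fit %>%
  extract_fit_parsnip() %>% # non-deprecated equivalent of pull_workflow_fit()
  vip()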
09. Evaluate on the Test Data
Evaluate the fitted model on the test data
using last_fit()
data: uses churn_split
tree_test_fit <-
tree_workflow %>%
last_fit(churn_split)
tree_test_fit
## # Resampling results
## # Manual resampling
## # A tibble: 1 x 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [75/25]> train/test split <tibble> <tibble> <tibble> <workflow>
Predictions are generated automatically: collect_predictions()
tree_test_pred <-
tree_test_fit %>%
collect_predictions()
tree_test_pred
## # A tibble: 25 x 7
## id .pred_이직 .pred_근무 .row .pred_class churn .config
## <chr> <dbl> <dbl> <int> <fct> <fct> <chr>
## 1 train/test split 0.145 0.855 3 근무 근무 Preprocessor1~
## 2 train/test split 0.145 0.855 4 근무 근무 Preprocessor1~
## 3 train/test split 0.145 0.855 6 근무 이직 Preprocessor1~
## 4 train/test split 0.145 0.855 12 근무 근무 Preprocessor1~
## 5 train/test split 0.145 0.855 22 근무 근무 Preprocessor1~
## 6 train/test split 0.145 0.855 26 근무 근무 Preprocessor1~
## 7 train/test split 0.145 0.855 28 근무 근무 Preprocessor1~
## 8 train/test split 0.145 0.855 31 근무 근무 Preprocessor1~
## 9 train/test split 0.923 0.0769 40 이직 이직 Preprocessor1~
## 10 train/test split 0.145 0.855 43 근무 근무 Preprocessor1~
## # ... with 15 more rows
Build the confusion matrix
tree_test_conf <-
tree_test_pred %>%
conf_mat(truth = churn,
estimate = .pred_class)
tree_test_conf
## Truth
## Prediction 이직 근무
## 이직 4 0
## 근무 3 18
autoplot(tree_test_conf, type = "heatmap")
autoplot(tree_test_conf, type = "mosaic")
summary(tree_test_conf)
## # A tibble: 13 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.88
## 2 kap binary 0.658
## 3 sens binary 0.571
## 4 spec binary 1
## 5 ppv binary 1
## 6 npv binary 0.857
## 7 mcc binary 0.700
## 8 j_index binary 0.571
## 9 bal_accuracy binary 0.786
## 10 detection_prevalence binary 0.16
## 11 precision binary 1
## 12 recall binary 0.571
## 13 f_meas binary 0.727
F1: combines recall (↑) and precision (↑)
Recall: of the actual positive cases, the fraction predicted correctly (= TPR = sensitivity)
Precision: of the predicted positive cases, the fraction actually positive
Accuracy: the fraction of both classes (0 and 1) classified correctly
AUC (area under the curve): area under the ROC curve
tree_test_pred %>%
roc_auc(truth = churn,
estimate =.pred_이직)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.786
ROC curve
test_auc <-
tree_test_pred %>%
roc_curve(truth = churn,
estimate = .pred_이직) %>%
mutate(model = "test_auc")
autoplot(test_auc)
Lift curve
tree_test_pred %>%
lift_curve(truth = churn,
estimate = .pred_이직) %>%
autoplot()
Gain curve
tree_test_pred %>%
gain_curve(truth = churn,
estimate = .pred_이직) %>%
autoplot()
Check variable importance
tree_test_fit %>%
pluck(".workflow", 1) %>%
pull_workflow_fit() %>%
vip(num_features = 20)
## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## Please use `extract_fit_parsnip()` instead.
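Again following the warning, the same plot with the current API:
tree_test_fit %>%
  pluck(".workflow", 1) %>%  # extract the fitted workflow from the last_fit() result
  extract_fit_parsnip() %>%  # non-deprecated equivalent of pull_workflow_fit()
  vip(num_features = 20)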
10. Compare Train and Test Results
Compare the confusion matrices
tree_train_conf
## Truth
## Prediction 이직 근무
## 이직 12 1
## 근무 9 53
tree_test_conf
## Truth
## Prediction 이직 근무
## 이직 4 0
## 근무 3 18
autoplot(tree_train_conf, type = "mosaic") # mosaic
autoplot(tree_test_conf, type = "mosaic")
Compare the evaluation metrics
summary(tree_train_conf)
## # A tibble: 13 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.867
## 2 kap binary 0.626
## 3 sens binary 0.571
## 4 spec binary 0.981
## 5 ppv binary 0.923
## 6 npv binary 0.855
## 7 mcc binary 0.656
## 8 j_index binary 0.553
## 9 bal_accuracy binary 0.776
## 10 detection_prevalence binary 0.173
## 11 precision binary 0.923
## 12 recall binary 0.571
## 13 f_meas binary 0.706
summary(tree_test_conf)
## # A tibble: 13 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.88
## 2 kap binary 0.658
## 3 sens binary 0.571
## 4 spec binary 1
## 5 ppv binary 1
## 6 npv binary 0.857
## 7 mcc binary 0.700
## 8 j_index binary 0.571
## 9 bal_accuracy binary 0.786
## 10 detection_prevalence binary 0.16
## 11 precision binary 1
## 12 recall binary 0.571
## 13 f_meas binary 0.727
Compare the ROC curves
bind_rows(train_auc, test_auc) %>%
ggplot(mapping = aes(x = 1 - specificity,
y = sensitivity,
color = model)) +
geom_path(lwd = 1.5) +
geom_abline(lty = 5) +
coord_equal()
11. Draw the Decision Tree
install.packages("rpart.plot")
library(rpart.plot)
Method 1
rpart_fit <-
tree_train_fit %>%
pull_workflow_fit()
## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## Please use `extract_fit_parsnip()` instead.
# Method 2: the non-deprecated equivalent
rpart_fit <-
tree_train_fit %>%
extract_fit_parsnip()
Plot 1
rpart.plot(x = rpart_fit$fit,
yesno = 2,
type = 2,
extra = 1,
split.font = 1,
varlen = -10)
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
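Following the warning's own suggestion, passing roundint = FALSE silences it:
rpart.plot(x = rpart_fit$fit,
           yesno = 2,
           type = 2,
           extra = 1,
           split.font = 1,
           varlen = -10,
           roundint = FALSE) # suppress the roundint warning shown above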
Plot 2
prp(x = rpart_fit$fit,
type = 1,
extra = 1,
under = TRUE,
split.font = 1,
varlen = -10,
box.col=ifelse(rpart_fit$fit$frame$var == "<leaf>", 'gray', 'white'))
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
## Call prp with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
# help(prp)
# type = style of tree diagram
# extra = extra information to display; 1 = show node counts
# under = show counts under the box (default FALSE)
# split.font = font for split labels (default 2 = bold)
# varlen = variable-name length (default -8; 0 = full names)
# box.col = box color