Table Of Contents

데이터마이닝의 이론과 실제 기말고사 연습문제6

1.초기 모델(Initial Model)

유니버설 은행 사례

tidyverse: tidy형식 기본도구

ggplot2, purrr, tibble 3.0.3,

dplyr, tidyr, stringr, readr, forcats

등 8개 패키지를 한 곳에 묶은 형태

tidymodels:tidy 모델 분석도구(데이터마이닝, 통계)

broom, recipes, dials, rsample, infer,

tune, modeldata, workflows, parsnip,

yardstick

tidytext: 텍스트마이닝 도구

rlang, tibble, dplyr, stringr, hunspell,

generics,lifecycle, Matrix, tokenizers,

janeaustenr, purrr

install.packages(“tidyverse”)

install.packages(“tidymodels”)

install.packages(“tidytext”)

install.packages(“skimr”)

install.packages(“vip”)

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 0.2.0 --
## v broom        0.8.0     v rsample      0.1.1
## v dials        0.1.1     v tune         0.2.0
## v infer        1.0.0     v workflows    0.2.6
## v modeldata    0.1.1     v workflowsets 0.2.1
## v parsnip      0.2.1     v yardstick    0.0.9
## v recipes      0.2.0
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## * Search for functions across packages at https://www.tidymodels.org/find/
library(tidytext)
library(skimr)           # 데이터 요약(EDA)
library(vip)             # 중요한 변수 찾기
## 
## 다음의 패키지를 부착합니다: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi

01.데이터 불러오기

churn_tb <- read_csv('data6.csv', 
                    col_names = TRUE,
                    locale=locale('ko', encoding='euc-kr'),
                    na=".") %>% # csv 데이터 읽어오기
  mutate_if(is.character, as.factor)
## Rows: 100 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (5): phy, psy, cmmt, exp, churn
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(churn_tb)
## tibble [100 x 5] (S3: tbl_df/tbl/data.frame)
##  $ phy  : num [1:100] 43 54 60 57 60 42 48 46 57 59 ...
##  $ psy  : num [1:100] 18 27 30 17 30 27 21 18 30 24 ...
##  $ cmmt : num [1:100] 28 28 26 23 29 26 23 28 23 26 ...
##  $ exp  : num [1:100] 1 1 0 0 1 1 0 0 0 0 ...
##  $ churn: num [1:100] 0 0 0 0 0 1 0 0 0 0 ...
churn_tb
## # A tibble: 100 x 5
##      phy   psy  cmmt   exp churn
##    <dbl> <dbl> <dbl> <dbl> <dbl>
##  1    43    18    28     1     0
##  2    54    27    28     1     0
##  3    60    30    26     0     0
##  4    57    17    23     0     0
##  5    60    30    29     1     0
##  6    42    27    26     1     1
##  7    48    21    23     0     0
##  8    46    18    28     0     0
##  9    57    30    23     0     0
## 10    59    24    26     0     0
## # ... with 90 more rows

02.data 전처리

범주형 변수(factor)로 인식하게 변환

결과변수(class)에서 관심있는 변수를 1번으로 세팅

churn_tb <- churn_tb %>%
  mutate(churn = factor(churn, 
                        levels = c(1, 0),              #관심변수=Yes           
                        labels = c("이직", "근무"))) %>%
  mutate(exp  = factor(exp, 
                       levels = c(1,0),
                       labels = c("있음", "없음")))
str(churn_tb)
## tibble [100 x 5] (S3: tbl_df/tbl/data.frame)
##  $ phy  : num [1:100] 43 54 60 57 60 42 48 46 57 59 ...
##  $ psy  : num [1:100] 18 27 30 17 30 27 21 18 30 24 ...
##  $ cmmt : num [1:100] 28 28 26 23 29 26 23 28 23 26 ...
##  $ exp  : Factor w/ 2 levels "있음","없음": 1 1 2 2 1 1 2 2 2 2 ...
##  $ churn: Factor w/ 2 levels "이직","근무": 2 2 2 2 2 1 2 2 2 2 ...
head(churn_tb)
## # A tibble: 6 x 5
##     phy   psy  cmmt exp   churn
##   <dbl> <dbl> <dbl> <fct> <fct>
## 1    43    18    28 있음  근무 
## 2    54    27    28 있음  근무 
## 3    60    30    26 없음  근무 
## 4    57    17    23 없음  근무 
## 5    60    30    29 있음  근무 
## 6    42    27    26 있음  이직

필요없는 변수제거: ID, 우편번호 제거

recipe에서 제거할 수도 있음

03.데이터 탐색(EDA)

데이터 탐색: 범주형, 연속형 구분

skimr::skim() - package명을 앞에 써서 구분

패키지를 여러개 사용할 경우에 이름이 같은 경우도 있어서

구분이 필요할 경우에 [패키지명::]을 사용

churn_tb %>%
  skimr::skim() 
Data summary
Name Piped data
Number of rows 100
Number of columns 5
_______________________
Column type frequency:
factor 2
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
exp 0 1 FALSE 2 있음: 52, 없음: 48
churn 0 1 FALSE 2 근무: 72, 이직: 28

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
phy 0 1 44.86 7.45 26 41 45 48 60 ▁▂▇▃▃
psy 0 1 21.78 4.88 9 18 22 24 30 ▁▅▇▇▆
cmmt 0 1 22.95 4.21 12 21 24 26 32 ▂▃▇▇▁
churn_tb %>%
  group_by(churn) %>%
  skimr::skim() 
Data summary
Name Piped data
Number of rows 100
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 3
________________________
Group variables churn

Variable type: factor

skim_variable churn n_missing complete_rate ordered n_unique top_counts
exp 이직 0 1 FALSE 2 있음: 18, 없음: 10
exp 근무 0 1 FALSE 2 없음: 38, 있음: 34

Variable type: numeric

skim_variable churn n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
phy 이직 0 1 41.18 7.90 26 35.25 43.0 46.0 56 ▃▃▆▇▂
phy 근무 0 1 46.29 6.80 29 42.75 46.0 49.0 60 ▁▃▇▂▃
psy 이직 0 1 19.75 4.93 11 16.00 19.0 24.0 30 ▃▇▅▆▂
psy 근무 0 1 22.57 4.66 9 19.00 23.0 24.5 30 ▁▂▇▇▆
cmmt 이직 0 1 18.39 3.51 12 15.75 17.5 21.0 26 ▂▇▃▆▂
cmmt 근무 0 1 24.72 2.94 15 23.00 25.0 27.0 32 ▁▂▇▆▁

base accuracy

yes 기준으로 0.096

churn_tb %>% 
  count(churn) %>% 
  mutate(prop = n/sum(n))
## # A tibble: 2 x 3
##   churn     n  prop
##   <fct> <int> <dbl>
## 1 이직     28  0.28
## 2 근무     72  0.72

04.훈련용, 테스트용 데이터 분할: partition

데이터 partition

set.seed(123) # 시드 고정 
churn_split <- churn_tb %>%
  initial_split(strata = churn) # 결과변수 비율반영

churn_split
## <Analysis/Assess/Total>
## <75/25/100>

training, test용 분리

train_data <- training(churn_split)
test_data  <- testing(churn_split)
str(train_data)
## tibble [75 x 5] (S3: tbl_df/tbl/data.frame)
##  $ phy  : num [1:75] 43 54 60 48 46 57 59 47 43 52 ...
##  $ psy  : num [1:75] 18 27 30 21 18 30 24 23 24 20 ...
##  $ cmmt : num [1:75] 28 28 29 23 28 23 26 27 26 23 ...
##  $ exp  : Factor w/ 2 levels "있음","없음": 1 1 1 2 2 2 2 1 2 1 ...
##  $ churn: Factor w/ 2 levels "이직","근무": 2 2 2 2 2 2 2 2 2 2 ...
str(test_data)
## tibble [25 x 5] (S3: tbl_df/tbl/data.frame)
##  $ phy  : num [1:25] 60 57 42 36 58 47 37 34 37 45 ...
##  $ psy  : num [1:25] 30 17 27 18 30 19 20 15 11 20 ...
##  $ cmmt : num [1:25] 26 23 26 27 28 20 25 23 17 23 ...
##  $ exp  : Factor w/ 2 levels "있음","없음": 2 2 1 1 2 2 1 2 2 1 ...
##  $ churn: Factor w/ 2 levels "이직","근무": 2 2 1 2 2 2 2 2 1 2 ...

05.Model 만들기(모델 설정)

Model 만들기

모델 인자(argument) 확인

args(decision_tree) 
## function (mode = "unknown", engine = "rpart", cost_complexity = NULL, 
##     tree_depth = NULL, min_n = NULL) 
## NULL
tree_model <- 
  decision_tree() %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

recipe 만들기

step_dummy(all_nominal(), -all_outcomes()) : one-hot-ecoding

step_log(Gr_Liv_Area, base = 10) : 로그함수로 변환

step_other(Neighborhood, threshold = 0.01) : 값이 적은 항목을 기타로 변환

step_upsample(churn) # 데이터 균형화

step_zv(all_predictors()) : 단일 고유 값 (예 : 모두 0) 변수 제거.

특히, penalty 사용하는 모델에서 중요(logistic, SVM 등)

step_normalize(all_numeric()) : 데이터 정규화

tree_recipe <- 
  train_data %>%
  recipe(churn ~ .) %>%
  step_dummy(all_nominal(), -all_outcomes())
  
summary(tree_recipe)
## # A tibble: 5 x 4
##   variable type    role      source  
##   <chr>    <chr>   <chr>     <chr>   
## 1 phy      numeric predictor original
## 2 psy      numeric predictor original
## 3 cmmt     numeric predictor original
## 4 exp      nominal predictor original
## 5 churn    nominal outcome   original

06.workflow 만들기

tree_workflow <- 
  workflow() %>% 
  add_model(tree_model) %>% 
  add_recipe(tree_recipe)

tree_workflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: decision_tree()
## 
## -- Preprocessor ----------------------------------------------------------------
## 1 Recipe Step
## 
## * step_dummy()
## 
## -- Model -----------------------------------------------------------------------
## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart

07.Model 훈련(모델 학습)

훈련데이터로 모델 훈련하기

tree_train_fit <- 
  tree_workflow %>%
  fit(data = train_data)

모델 훈련 결과 확인

tree_train_fit %>%
  extract_fit_parsnip()
## parsnip model object
## 
## n= 75 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 75 21 근무 (0.28000000 0.72000000)  
##   2) cmmt< 18.5 13  1 이직 (0.92307692 0.07692308) *
##   3) cmmt>=18.5 62  9 근무 (0.14516129 0.85483871) *

08.훈련모델 검정

예측결과표 생성

tree_train_pred <- 
  predict(tree_train_fit, 
          train_data, 
          type = "prob") %>%
  bind_cols(predict(tree_train_fit, 
                    train_data)) %>% 
  bind_cols(train_data %>% 
              select(churn)) %>%
  print()
## # A tibble: 75 x 4
##    .pred_이직 .pred_근무 .pred_class churn
##         <dbl>      <dbl> <fct>       <fct>
##  1      0.145      0.855 근무        근무 
##  2      0.145      0.855 근무        근무 
##  3      0.145      0.855 근무        근무 
##  4      0.145      0.855 근무        근무 
##  5      0.145      0.855 근무        근무 
##  6      0.145      0.855 근무        근무 
##  7      0.145      0.855 근무        근무 
##  8      0.145      0.855 근무        근무 
##  9      0.145      0.855 근무        근무 
## 10      0.145      0.855 근무        근무 
## # ... with 65 more rows

정오분류표(confusion matrix) 만들기

tree_train_conf <-
  tree_train_pred  %>%
  conf_mat(truth = churn, 
           estimate = .pred_class)

tree_train_conf
##           Truth
## Prediction 이직 근무
##       이직   12    1
##       근무    9   53
autoplot(tree_train_conf, type = "heatmap") # mosaic

autoplot(tree_train_conf, type = "mosaic")

summary(tree_train_conf)
## # A tibble: 13 x 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.867
##  2 kap                  binary         0.626
##  3 sens                 binary         0.571
##  4 spec                 binary         0.981
##  5 ppv                  binary         0.923
##  6 npv                  binary         0.855
##  7 mcc                  binary         0.656
##  8 j_index              binary         0.553
##  9 bal_accuracy         binary         0.776
## 10 detection_prevalence binary         0.173
## 11 precision            binary         0.923
## 12 recall               binary         0.571
## 13 f_meas               binary         0.706

f1: 재현율(Recall)(↑)과 정밀도(Precision)(↑)

재현율(Recall): 실제 Class 중에 잘 맞춘 것(=TPR=민감도)

정밀도(Precision): 예측 Class 중에 잘 맞춘 것

정확도 (Accuracy) : 클래스 0과 1 모두를 정확하게 분류

ACU(area under the curve): ROC 정확도

tree_train_pred %>%
  roc_auc(truth = churn, 
          estimate =.pred_이직)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.776

ROC 커브

train_auc <-
  tree_train_pred %>%
  roc_curve(truth = churn, 
            estimate = .pred_이직) %>% 
  mutate(model = "train_auc")
autoplot(train_auc)

lift 커브

tree_train_pred %>%
  lift_curve(truth = churn, 
             estimate = .pred_이직) %>%
  autoplot()

gain 커브

tree_train_pred %>%
  gain_curve(truth = churn, 
             estimate = .pred_이직) %>%
  autoplot()

중요변수 확인

tree_train_fit %>% 
  pull_workflow_fit() %>% 
  vip()
## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## Please use `extract_fit_parsnip()` instead.

09.테스트 데이터 검정

구축된 모델에 test data로 검정

last_fit 사용

data: churn_split 사용

tree_test_fit <- 
  tree_workflow %>%
  last_fit(churn_split) 
tree_test_fit
## # Resampling results
## # Manual resampling 
## # A tibble: 1 x 6
##   splits          id               .metrics .notes   .predictions .workflow 
##   <list>          <chr>            <list>   <list>   <list>       <list>    
## 1 <split [75/25]> train/test split <tibble> <tibble> <tibble>     <workflow>

예측결과 자동생성: collect_predictions()

tree_test_pred <- 
  tree_test_fit %>%
  collect_predictions()
tree_test_pred
## # A tibble: 25 x 7
##    id               .pred_이직 .pred_근무  .row .pred_class churn .config       
##    <chr>                 <dbl>      <dbl> <int> <fct>       <fct> <chr>         
##  1 train/test split      0.145     0.855      3 근무        근무  Preprocessor1~
##  2 train/test split      0.145     0.855      4 근무        근무  Preprocessor1~
##  3 train/test split      0.145     0.855      6 근무        이직  Preprocessor1~
##  4 train/test split      0.145     0.855     12 근무        근무  Preprocessor1~
##  5 train/test split      0.145     0.855     22 근무        근무  Preprocessor1~
##  6 train/test split      0.145     0.855     26 근무        근무  Preprocessor1~
##  7 train/test split      0.145     0.855     28 근무        근무  Preprocessor1~
##  8 train/test split      0.145     0.855     31 근무        근무  Preprocessor1~
##  9 train/test split      0.923     0.0769    40 이직        이직  Preprocessor1~
## 10 train/test split      0.145     0.855     43 근무        근무  Preprocessor1~
## # ... with 15 more rows

정오분류표(confusion matrix) 만들기

tree_test_conf <-
  tree_test_pred  %>%
  conf_mat(truth = churn, 
           estimate = .pred_class)
tree_test_conf
##           Truth
## Prediction 이직 근무
##       이직    4    0
##       근무    3   18
autoplot(tree_test_conf, type = "heatmap") # mosaic

autoplot(tree_test_conf, type = "mosaic")

summary(tree_test_conf)
## # A tibble: 13 x 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.88 
##  2 kap                  binary         0.658
##  3 sens                 binary         0.571
##  4 spec                 binary         1    
##  5 ppv                  binary         1    
##  6 npv                  binary         0.857
##  7 mcc                  binary         0.700
##  8 j_index              binary         0.571
##  9 bal_accuracy         binary         0.786
## 10 detection_prevalence binary         0.16 
## 11 precision            binary         1    
## 12 recall               binary         0.571
## 13 f_meas               binary         0.727

f1: 재현율(Recall)(↑)과 정밀도(Precision)(↑)

재현율(Recall): 실제 Class 중에 잘 맞춘 것(=TPR=민감도)

정밀도(Precision): 예측 Class 중에 잘 맞춘 것

정확도 (Accuracy) : 클래스 0과 1 모두를 정확하게 분류

ACU(area under the curve): ROC 정확도

tree_test_pred %>%
  roc_auc(truth = churn, 
          estimate =.pred_이직)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.786

ROC 커브

test_auc <-
  tree_test_pred %>%
  roc_curve(truth = churn, 
            estimate = .pred_이직) %>% 
  mutate(model = "test_auc")

autoplot(test_auc)

lift 커브

tree_test_pred %>%
  lift_curve(truth = churn, 
             estimate = .pred_이직) %>%
  autoplot()

gain 커브

tree_test_pred %>%
  gain_curve(truth = churn, 
             estimate = .pred_이직) %>%
  autoplot()

중요변수 확인

tree_test_fit %>%
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  vip(num_features = 20)
## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## Please use `extract_fit_parsnip()` instead.

10.train, test 검정결과 비교

정오분류표(confusion matrix) 비교

tree_train_conf
##           Truth
## Prediction 이직 근무
##       이직   12    1
##       근무    9   53
tree_test_conf
##           Truth
## Prediction 이직 근무
##       이직    4    0
##       근무    3   18
autoplot(tree_train_conf, type = "mosaic") # mosaic

autoplot(tree_test_conf, type = "mosaic")

검정결과 비교

summary(tree_train_conf)
## # A tibble: 13 x 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.867
##  2 kap                  binary         0.626
##  3 sens                 binary         0.571
##  4 spec                 binary         0.981
##  5 ppv                  binary         0.923
##  6 npv                  binary         0.855
##  7 mcc                  binary         0.656
##  8 j_index              binary         0.553
##  9 bal_accuracy         binary         0.776
## 10 detection_prevalence binary         0.173
## 11 precision            binary         0.923
## 12 recall               binary         0.571
## 13 f_meas               binary         0.706
summary(tree_test_conf)
## # A tibble: 13 x 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.88 
##  2 kap                  binary         0.658
##  3 sens                 binary         0.571
##  4 spec                 binary         1    
##  5 ppv                  binary         1    
##  6 npv                  binary         0.857
##  7 mcc                  binary         0.700
##  8 j_index              binary         0.571
##  9 bal_accuracy         binary         0.786
## 10 detection_prevalence binary         0.16 
## 11 precision            binary         1    
## 12 recall               binary         0.571
## 13 f_meas               binary         0.727

ROC 커브 비교

bind_rows(train_auc, test_auc) %>% 
  ggplot(mapping = aes(x = 1 - specificity, 
                       y = sensitivity, 
                       color = model)) + 
  geom_path(lwd = 1.5) +
  geom_abline(lty = 5) + 
  coord_equal()

11.decision tree 만들기

install.packages(“rpart.plot”)

library(rpart.plot)

1번째 방법

rpart_fit <- 
  tree_train_fit %>% 
  pull_workflow_fit()
## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## Please use `extract_fit_parsnip()` instead.

#2번째 방법

rpart_fit <- 
      tree_train_fit %>% 
      extract_fit_parsnip()

모형 1

rpart.plot(x = rpart_fit$fit,
           yesno = 2,
           type = 2, 
           extra = 1, 
           split.font = 1, 
           varlen = -10)
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

모형 2

prp(x = rpart_fit$fit, 
    type = 1, 
    extra = 1, 
    under = TRUE, 
    split.font = 1, 
    varlen = -10,
    box.col=ifelse(rpart_fit$fit$frame$var == "<leaf>", 'gray', 'white'))
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call prp with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

# help(prp) # type = 나무그래프 표현 종류 # extra= 추가 정보 표시, 1=노드의 관측수 표시 # under= box 아래 관측값 표시, default=False # split.font = 글자 font, default=2(bold) # varlen = 변수이름 길이, default=-8, 0=full name # box.col=box 색깔

Leave a comment