Share on

데이터마이닝의 이론과 실제 기말고사 연습문제6

1.초기 모델(Initial Model)

https://www.tidymodels.org/start/models/

유니버설 은행 사례

tidyverse: tidy형식 기본도구

ggplot2, purrr, tibble 3.0.3,

dplyr, tidyr, stringr, readr, forcats

등 8개 패키지를 한 곳에 묶은 형태

tidymodels:tidy 모델 분석도구(데이터마이닝, 통계)

broom, recipes, dials, rsample, infer,

tune, modeldata, workflows, parsnip,

yardstick

tidytext: 텍스트마이닝 도구

rlang, tibble, dplyr, stringr, hunspell,

generics,lifecycle, Matrix, tokenizers,

janeaustenr, purrr

install.packages(“tidyverse”)

install.packages(“tidymodels”)

install.packages(“tidytext”)

install.packages(“skimr”)

install.packages(“vip”)

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(tidymodels)

## -- Attaching packages -------------------------------------- tidymodels 0.2.0 --

## v broom        0.8.0     v rsample      0.1.1
## v dials        0.1.1     v tune         0.2.0
## v infer        1.0.0     v workflows    0.2.6
## v modeldata    0.1.1     v workflowsets 0.2.1
## v parsnip      0.2.1     v yardstick    0.0.9
## v recipes      0.2.0

## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## * Search for functions across packages at https://www.tidymodels.org/find/

library(tidytext)
library(skimr)           # 데이터 요약(EDA)
library(vip)             # 중요한 변수 찾기

## 
## 다음의 패키지를 부착합니다: 'vip'

## The following object is masked from 'package:utils':
## 
##     vi

01.데이터 불러오기

churn_tb <- read_csv('data6.csv', 
                    col_names = TRUE,
                    locale=locale('ko', encoding='euc-kr'),
                    na=".") %>% # csv 데이터 읽어오기
  mutate_if(is.character, as.factor)

## Rows: 100 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (5): phy, psy, cmmt, exp, churn
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(churn_tb)

## tibble [100 x 5] (S3: tbl_df/tbl/data.frame)
##  $ phy  : num [1:100] 43 54 60 57 60 42 48 46 57 59 ...
##  $ psy  : num [1:100] 18 27 30 17 30 27 21 18 30 24 ...
##  $ cmmt : num [1:100] 28 28 26 23 29 26 23 28 23 26 ...
##  $ exp  : num [1:100] 1 1 0 0 1 1 0 0 0 0 ...
##  $ churn: num [1:100] 0 0 0 0 0 1 0 0 0 0 ...

churn_tb

## # A tibble: 100 x 5
##      phy   psy  cmmt   exp churn
##    <dbl> <dbl> <dbl> <dbl> <dbl>
##  1    43    18    28     1     0
##  2    54    27    28     1     0
##  3    60    30    26     0     0
##  4    57    17    23     0     0
##  5    60    30    29     1     0
##  6    42    27    26     1     1
##  7    48    21    23     0     0
##  8    46    18    28     0     0
##  9    57    30    23     0     0
## 10    59    24    26     0     0
## # ... with 90 more rows

02.data 전처리

범주형 변수(factor)로 인식하게 변환

결과변수(class)에서 관심있는 변수를 1번으로 세팅

churn_tb <- churn_tb %>%
  mutate(churn = factor(churn, 
                        levels = c(1, 0),              #관심변수=Yes           
                        labels = c("이직", "근무"))) %>%
  mutate(exp  = factor(exp, 
                       levels = c(1,0),
                       labels = c("있음", "없음")))

str(churn_tb)

## tibble [100 x 5] (S3: tbl_df/tbl/data.frame)
##  $ phy  : num [1:100] 43 54 60 57 60 42 48 46 57 59 ...
##  $ psy  : num [1:100] 18 27 30 17 30 27 21 18 30 24 ...
##  $ cmmt : num [1:100] 28 28 26 23 29 26 23 28 23 26 ...
##  $ exp  : Factor w/ 2 levels "있음","없음": 1 1 2 2 1 1 2 2 2 2 ...
##  $ churn: Factor w/ 2 levels "이직","근무": 2 2 2 2 2 1 2 2 2 2 ...

head(churn_tb)

## # A tibble: 6 x 5
##     phy   psy  cmmt exp   churn
##   <dbl> <dbl> <dbl> <fct> <fct>
## 1    43    18    28 있음  근무 
## 2    54    27    28 있음  근무 
## 3    60    30    26 없음  근무 
## 4    57    17    23 없음  근무 
## 5    60    30    29 있음  근무 
## 6    42    27    26 있음  이직

필요없는 변수제거: ID, 우편번호 제거

recipe에서 제거할 수도 있음

03.데이터 탐색(EDA)

데이터 탐색: 범주형, 연속형 구분

skimr::skim() - package명을 앞에 써서 구분

패키지를 여러개 사용할 경우에 이름이 같은 경우도 있어서

구분이 필요할 경우에 [패키지명::]을 사용

churn_tb %>%
  skimr::skim()

Data summary
Name	Piped data
Number of rows	100
Number of columns	5
_______________________
Column type frequency:
factor	2
numeric	3
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
exp	0	1	FALSE	2	있음: 52, 없음: 48
churn	0	1	FALSE	2	근무: 72, 이직: 28

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
phy	1	44.86	7.45	26	41	45	48	60	▁▂▇▃▃
psy	1	21.78	4.88	9	18	22	24	30	▁▅▇▇▆
cmmt	1	22.95	4.21	12	21	24	26	32	▂▃▇▇▁

churn_tb %>%
  group_by(churn) %>%
  skimr::skim()

Data summary
Name	Piped data
Number of rows	100
Number of columns	5
_______________________
Column type frequency:
factor	1
numeric	3
________________________
Group variables	churn

Variable type: factor

skim_variable	churn	n_missing	complete_rate	ordered	n_unique	top_counts
exp	이직	0	1	FALSE	2	있음: 18, 없음: 10
exp	근무	0	1	FALSE	2	없음: 38, 있음: 34

Variable type: numeric

skim_variable	churn	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
phy	이직	1	41.18	7.90	26	35.25	43.0	46.0	56	▃▃▆▇▂
phy	근무	1	46.29	6.80	29	42.75	46.0	49.0	60	▁▃▇▂▃
psy	이직	1	19.75	4.93	11	16.00	19.0	24.0	30	▃▇▅▆▂
psy	근무	1	22.57	4.66	9	19.00	23.0	24.5	30	▁▂▇▇▆
cmmt	이직	1	18.39	3.51	12	15.75	17.5	21.0	26	▂▇▃▆▂
cmmt	근무	1	24.72	2.94	15	23.00	25.0	27.0	32	▁▂▇▆▁

base accuracy

yes 기준으로 0.096

churn_tb %>% 
  count(churn) %>% 
  mutate(prop = n/sum(n))

## # A tibble: 2 x 3
##   churn     n  prop
##   <fct> <int> <dbl>
## 1 이직     28  0.28
## 2 근무     72  0.72

04.훈련용, 테스트용 데이터 분할: partition

데이터 partition

set.seed(123) # 시드 고정

churn_split <- churn_tb %>%
  initial_split(strata = churn) # 결과변수 비율반영

churn_split

## <Analysis/Assess/Total>
## <75/25/100>

training, test용 분리

train_data <- training(churn_split)
test_data  <- testing(churn_split)

str(train_data)

## tibble [75 x 5] (S3: tbl_df/tbl/data.frame)
##  $ phy  : num [1:75] 43 54 60 48 46 57 59 47 43 52 ...
##  $ psy  : num [1:75] 18 27 30 21 18 30 24 23 24 20 ...
##  $ cmmt : num [1:75] 28 28 29 23 28 23 26 27 26 23 ...
##  $ exp  : Factor w/ 2 levels "있음","없음": 1 1 1 2 2 2 2 1 2 1 ...
##  $ churn: Factor w/ 2 levels "이직","근무": 2 2 2 2 2 2 2 2 2 2 ...

str(test_data)

## tibble [25 x 5] (S3: tbl_df/tbl/data.frame)
##  $ phy  : num [1:25] 60 57 42 36 58 47 37 34 37 45 ...
##  $ psy  : num [1:25] 30 17 27 18 30 19 20 15 11 20 ...
##  $ cmmt : num [1:25] 26 23 26 27 28 20 25 23 17 23 ...
##  $ exp  : Factor w/ 2 levels "있음","없음": 2 2 1 1 2 2 1 2 2 1 ...
##  $ churn: Factor w/ 2 levels "이직","근무": 2 2 1 2 2 2 2 2 1 2 ...

05.Model 만들기(모델 설정)

Model 만들기

모델 인자(argument) 확인

args(decision_tree)

## function (mode = "unknown", engine = "rpart", cost_complexity = NULL, 
##     tree_depth = NULL, min_n = NULL) 
## NULL

tree_model <- 
  decision_tree() %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

recipe 만들기

step_dummy(all_nominal(), -all_outcomes()) : one-hot-ecoding

step_log(Gr_Liv_Area, base = 10) : 로그함수로 변환

step_other(Neighborhood, threshold = 0.01) : 값이 적은 항목을 기타로 변환

step_upsample(churn) # 데이터 균형화

step_zv(all_predictors()) : 단일 고유 값 (예 : 모두 0) 변수 제거.

특히, penalty 사용하는 모델에서 중요(logistic, SVM 등)

step_normalize(all_numeric()) : 데이터 정규화

tree_recipe <- 
  train_data %>%
  recipe(churn ~ .) %>%
  step_dummy(all_nominal(), -all_outcomes())
  
summary(tree_recipe)

## # A tibble: 5 x 4
##   variable type    role      source  
##   <chr>    <chr>   <chr>     <chr>   
## 1 phy      numeric predictor original
## 2 psy      numeric predictor original
## 3 cmmt     numeric predictor original
## 4 exp      nominal predictor original
## 5 churn    nominal outcome   original

06.workflow 만들기

tree_workflow <- 
  workflow() %>% 
  add_model(tree_model) %>% 
  add_recipe(tree_recipe)

tree_workflow

## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: decision_tree()
## 
## -- Preprocessor ----------------------------------------------------------------
## 1 Recipe Step
## 
## * step_dummy()
## 
## -- Model -----------------------------------------------------------------------
## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart

07.Model 훈련(모델 학습)

훈련데이터로 모델 훈련하기

tree_train_fit <- 
  tree_workflow %>%
  fit(data = train_data)

모델 훈련 결과 확인

tree_train_fit %>%
  extract_fit_parsnip()

## parsnip model object
## 
## n= 75 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 75 21 근무 (0.28000000 0.72000000)  
##   2) cmmt< 18.5 13  1 이직 (0.92307692 0.07692308) *
##   3) cmmt>=18.5 62  9 근무 (0.14516129 0.85483871) *

08.훈련모델 검정

예측결과표 생성

tree_train_pred <- 
  predict(tree_train_fit, 
          train_data, 
          type = "prob") %>%
  bind_cols(predict(tree_train_fit, 
                    train_data)) %>% 
  bind_cols(train_data %>% 
              select(churn)) %>%
  print()

## # A tibble: 75 x 4
##    .pred_이직 .pred_근무 .pred_class churn
##         <dbl>      <dbl> <fct>       <fct>
##  1      0.145      0.855 근무        근무 
##  2      0.145      0.855 근무        근무 
##  3      0.145      0.855 근무        근무 
##  4      0.145      0.855 근무        근무 
##  5      0.145      0.855 근무        근무 
##  6      0.145      0.855 근무        근무 
##  7      0.145      0.855 근무        근무 
##  8      0.145      0.855 근무        근무 
##  9      0.145      0.855 근무        근무 
## 10      0.145      0.855 근무        근무 
## # ... with 65 more rows

정오분류표(confusion matrix) 만들기

tree_train_conf <-
  tree_train_pred  %>%
  conf_mat(truth = churn, 
           estimate = .pred_class)

tree_train_conf

##           Truth
## Prediction 이직 근무
##       이직   12    1
##       근무    9   53

autoplot(tree_train_conf, type = "heatmap") # mosaic

autoplot(tree_train_conf, type = "mosaic")

summary(tree_train_conf)

## # A tibble: 13 x 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.867
##  2 kap                  binary         0.626
##  3 sens                 binary         0.571
##  4 spec                 binary         0.981
##  5 ppv                  binary         0.923
##  6 npv                  binary         0.855
##  7 mcc                  binary         0.656
##  8 j_index              binary         0.553
##  9 bal_accuracy         binary         0.776
## 10 detection_prevalence binary         0.173
## 11 precision            binary         0.923
## 12 recall               binary         0.571
## 13 f_meas               binary         0.706

f1: 재현율(Recall)(↑)과 정밀도(Precision)(↑)

재현율(Recall): 실제 Class 중에 잘 맞춘 것(=TPR=민감도)

정밀도(Precision): 예측 Class 중에 잘 맞춘 것

정확도 (Accuracy) : 클래스 0과 1 모두를 정확하게 분류

ACU(area under the curve): ROC 정확도

tree_train_pred %>%
  roc_auc(truth = churn, 
          estimate =.pred_이직)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.776

ROC 커브

train_auc <-
  tree_train_pred %>%
  roc_curve(truth = churn, 
            estimate = .pred_이직) %>% 
  mutate(model = "train_auc")

autoplot(train_auc)

lift 커브

tree_train_pred %>%
  lift_curve(truth = churn, 
             estimate = .pred_이직) %>%
  autoplot()

gain 커브

tree_train_pred %>%
  gain_curve(truth = churn, 
             estimate = .pred_이직) %>%
  autoplot()

중요변수 확인

tree_train_fit %>% 
  pull_workflow_fit() %>% 
  vip()

## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## Please use `extract_fit_parsnip()` instead.

09.테스트 데이터 검정

구축된 모델에 test data로 검정

last_fit 사용

data: churn_split 사용

tree_test_fit <- 
  tree_workflow %>%
  last_fit(churn_split)

tree_test_fit

## # Resampling results
## # Manual resampling 
## # A tibble: 1 x 6
##   splits          id               .metrics .notes   .predictions .workflow 
##   <list>          <chr>            <list>   <list>   <list>       <list>    
## 1 <split [75/25]> train/test split <tibble> <tibble> <tibble>     <workflow>

예측결과 자동생성: collect_predictions()

tree_test_pred <- 
  tree_test_fit %>%
  collect_predictions()

tree_test_pred

## # A tibble: 25 x 7
##    id               .pred_이직 .pred_근무  .row .pred_class churn .config       
##    <chr>                 <dbl>      <dbl> <int> <fct>       <fct> <chr>         
##  1 train/test split      0.145     0.855      3 근무        근무  Preprocessor1~
##  2 train/test split      0.145     0.855      4 근무        근무  Preprocessor1~
##  3 train/test split      0.145     0.855      6 근무        이직  Preprocessor1~
##  4 train/test split      0.145     0.855     12 근무        근무  Preprocessor1~
##  5 train/test split      0.145     0.855     22 근무        근무  Preprocessor1~
##  6 train/test split      0.145     0.855     26 근무        근무  Preprocessor1~
##  7 train/test split      0.145     0.855     28 근무        근무  Preprocessor1~
##  8 train/test split      0.145     0.855     31 근무        근무  Preprocessor1~
##  9 train/test split      0.923     0.0769    40 이직        이직  Preprocessor1~
## 10 train/test split      0.145     0.855     43 근무        근무  Preprocessor1~
## # ... with 15 more rows

정오분류표(confusion matrix) 만들기

tree_test_conf <-
  tree_test_pred  %>%
  conf_mat(truth = churn, 
           estimate = .pred_class)

tree_test_conf

##           Truth
## Prediction 이직 근무
##       이직    4    0
##       근무    3   18

autoplot(tree_test_conf, type = "heatmap") # mosaic

autoplot(tree_test_conf, type = "mosaic")

summary(tree_test_conf)

## # A tibble: 13 x 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.88 
##  2 kap                  binary         0.658
##  3 sens                 binary         0.571
##  4 spec                 binary         1    
##  5 ppv                  binary         1    
##  6 npv                  binary         0.857
##  7 mcc                  binary         0.700
##  8 j_index              binary         0.571
##  9 bal_accuracy         binary         0.786
## 10 detection_prevalence binary         0.16 
## 11 precision            binary         1    
## 12 recall               binary         0.571
## 13 f_meas               binary         0.727

f1: 재현율(Recall)(↑)과 정밀도(Precision)(↑)

재현율(Recall): 실제 Class 중에 잘 맞춘 것(=TPR=민감도)

정밀도(Precision): 예측 Class 중에 잘 맞춘 것

정확도 (Accuracy) : 클래스 0과 1 모두를 정확하게 분류

ACU(area under the curve): ROC 정확도

tree_test_pred %>%
  roc_auc(truth = churn, 
          estimate =.pred_이직)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.786

ROC 커브

test_auc <-
  tree_test_pred %>%
  roc_curve(truth = churn, 
            estimate = .pred_이직) %>% 
  mutate(model = "test_auc")

autoplot(test_auc)

lift 커브

tree_test_pred %>%
  lift_curve(truth = churn, 
             estimate = .pred_이직) %>%
  autoplot()

gain 커브

tree_test_pred %>%
  gain_curve(truth = churn, 
             estimate = .pred_이직) %>%
  autoplot()

중요변수 확인

tree_test_fit %>%
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  vip(num_features = 20)

## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## Please use `extract_fit_parsnip()` instead.

10.train, test 검정결과 비교

정오분류표(confusion matrix) 비교

tree_train_conf

##           Truth
## Prediction 이직 근무
##       이직   12    1
##       근무    9   53

tree_test_conf

##           Truth
## Prediction 이직 근무
##       이직    4    0
##       근무    3   18

autoplot(tree_train_conf, type = "mosaic") # mosaic

autoplot(tree_test_conf, type = "mosaic")

검정결과 비교

summary(tree_train_conf)

## # A tibble: 13 x 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.867
##  2 kap                  binary         0.626
##  3 sens                 binary         0.571
##  4 spec                 binary         0.981
##  5 ppv                  binary         0.923
##  6 npv                  binary         0.855
##  7 mcc                  binary         0.656
##  8 j_index              binary         0.553
##  9 bal_accuracy         binary         0.776
## 10 detection_prevalence binary         0.173
## 11 precision            binary         0.923
## 12 recall               binary         0.571
## 13 f_meas               binary         0.706

summary(tree_test_conf)

## # A tibble: 13 x 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.88 
##  2 kap                  binary         0.658
##  3 sens                 binary         0.571
##  4 spec                 binary         1    
##  5 ppv                  binary         1    
##  6 npv                  binary         0.857
##  7 mcc                  binary         0.700
##  8 j_index              binary         0.571
##  9 bal_accuracy         binary         0.786
## 10 detection_prevalence binary         0.16 
## 11 precision            binary         1    
## 12 recall               binary         0.571
## 13 f_meas               binary         0.727

ROC 커브 비교

bind_rows(train_auc, test_auc) %>% 
  ggplot(mapping = aes(x = 1 - specificity, 
                       y = sensitivity, 
                       color = model)) + 
  geom_path(lwd = 1.5) +
  geom_abline(lty = 5) + 
  coord_equal()

11.decision tree 만들기

install.packages(“rpart.plot”)

library(rpart.plot)

1번째 방법

rpart_fit <- 
  tree_train_fit %>% 
  pull_workflow_fit()

## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## Please use `extract_fit_parsnip()` instead.

#2번째 방법

rpart_fit <- 
      tree_train_fit %>% 
      extract_fit_parsnip()

모형 1

rpart.plot(x = rpart_fit$fit,
           yesno = 2,
           type = 2, 
           extra = 1, 
           split.font = 1, 
           varlen = -10)

## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

모형 2

prp(x = rpart_fit$fit, 
    type = 1, 
    extra = 1, 
    under = TRUE, 
    split.font = 1, 
    varlen = -10,
    box.col=ifelse(rpart_fit$fit$frame$var == "<leaf>", 'gray', 'white'))

## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call prp with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

# help(prp) # type = 나무그래프 표현 종류 # extra= 추가 정보 표시, 1=노드의 관측수 표시 # under= box 아래 관측값 표시, default=False # split.font = 글자 font, default=2(bold) # varlen = 변수이름 길이, default=-8, 0=full name # box.col=box 색깔

Share on

Copy URL

Kakao Facebook Twitter LinkedIn

SeongJae Yoo

Share on

데이터마이닝의 이론과 실제 기말고사 연습문제6

SeongJae-Yoo

2022 6 8

1.초기 모델(Initial Model)

https://www.tidymodels.org/start/models/

유니버설 은행 사례

tidyverse: tidy형식 기본도구

ggplot2, purrr, tibble 3.0.3,

dplyr, tidyr, stringr, readr, forcats

등 8개 패키지를 한 곳에 묶은 형태

tidymodels:tidy 모델 분석도구(데이터마이닝, 통계)

broom, recipes, dials, rsample, infer,

tune, modeldata, workflows, parsnip,

yardstick

tidytext: 텍스트마이닝 도구

rlang, tibble, dplyr, stringr, hunspell,

generics,lifecycle, Matrix, tokenizers,

janeaustenr, purrr

install.packages(“tidyverse”)

install.packages(“tidymodels”)

install.packages(“tidytext”)

install.packages(“skimr”)

install.packages(“vip”)

01.데이터 불러오기

02.data 전처리

범주형 변수(factor)로 인식하게 변환

결과변수(class)에서 관심있는 변수를 1번으로 세팅

필요없는 변수제거: ID, 우편번호 제거

recipe에서 제거할 수도 있음

03.데이터 탐색(EDA)

데이터 탐색: 범주형, 연속형 구분

skimr::skim() - package명을 앞에 써서 구분

패키지를 여러개 사용할 경우에 이름이 같은 경우도 있어서

구분이 필요할 경우에 [패키지명::]을 사용

base accuracy

yes 기준으로 0.096

04.훈련용, 테스트용 데이터 분할: partition

데이터 partition

training, test용 분리

05.Model 만들기(모델 설정)

Model 만들기

모델 인자(argument) 확인

recipe 만들기

step_dummy(all_nominal(), -all_outcomes()) : one-hot-ecoding

step_log(Gr_Liv_Area, base = 10) : 로그함수로 변환

step_other(Neighborhood, threshold = 0.01) : 값이 적은 항목을 기타로 변환

step_upsample(churn) # 데이터 균형화

step_zv(all_predictors()) : 단일 고유 값 (예 : 모두 0) 변수 제거.

특히, penalty 사용하는 모델에서 중요(logistic, SVM 등)

step_normalize(all_numeric()) : 데이터 정규화

06.workflow 만들기

07.Model 훈련(모델 학습)

훈련데이터로 모델 훈련하기

모델 훈련 결과 확인

08.훈련모델 검정

예측결과표 생성

정오분류표(confusion matrix) 만들기

f1: 재현율(Recall)(↑)과 정밀도(Precision)(↑)

재현율(Recall): 실제 Class 중에 잘 맞춘 것(=TPR=민감도)

정밀도(Precision): 예측 Class 중에 잘 맞춘 것

정확도 (Accuracy) : 클래스 0과 1 모두를 정확하게 분류

ACU(area under the curve): ROC 정확도

ROC 커브

lift 커브

gain 커브

중요변수 확인

09.테스트 데이터 검정

구축된 모델에 test data로 검정

last_fit 사용

data: churn_split 사용

예측결과 자동생성: collect_predictions()

정오분류표(confusion matrix) 만들기

f1: 재현율(Recall)(↑)과 정밀도(Precision)(↑)

재현율(Recall): 실제 Class 중에 잘 맞춘 것(=TPR=민감도)

정밀도(Precision): 예측 Class 중에 잘 맞춘 것

정확도 (Accuracy) : 클래스 0과 1 모두를 정확하게 분류

ACU(area under the curve): ROC 정확도

ROC 커브