Share on

데이터마이닝의 이론과 실제 중간고사 연습문제

연습문제1

K대학에서는 재학생을 대상으로 교육과정에 대한 현황조사를 실시하였다.

학년에 따라 비교과운영 점수가 차이가 있는 가?

R.version

##                _                           
## platform       i386-w64-mingw32            
## arch           i386                        
## os             mingw32                     
## system         i386, mingw32               
## status                                     
## major          4                           
## minor          1.3                         
## year           2022                        
## month          03                          
## day            10                          
## svn rev        81868                       
## language       R                           
## version.string R version 4.1.3 (2022-03-10)
## nickname       One Push-Up

1.기본 package 설정

1.1 library 로드

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(tidymodels)

## -- Attaching packages -------------------------------------- tidymodels 0.2.0 --

## v broom        0.8.0     v rsample      0.1.1
## v dials        0.1.1     v tune         0.2.0
## v infer        1.0.0     v workflows    0.2.6
## v modeldata    0.1.1     v workflowsets 0.2.1
## v parsnip      0.2.1     v yardstick    0.0.9
## v recipes      0.2.0

## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## * Use suppressPackageStartupMessages() to eliminate package startup messages

library(rstatix)

## 
## 다음의 패키지를 부착합니다: 'rstatix'

## The following objects are masked from 'package:infer':
## 
##     chisq_test, prop_test, t_test

## The following object is masked from 'package:dials':
## 
##     get_n

## The following object is masked from 'package:stats':
## 
##     filter

library(skimr)

2.데이터 불러오기

edu_tb <- read_csv('data1.csv', 
                   col_names = TRUE,
                   locale=locale('ko', encoding='euc-kr'), # 한글
                   na=".") %>%
  round(2) %>%                 # 소수점 2자리로 반올림
  mutate_if(is.character, as.factor) %>%
  mutate(학년 = factor(학년,
                        levels=c(1:4),
                        labels=c("1학년","2학년","3학년","4학년")))

## Rows: 286 Columns: 6
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (6): 학부, 학년, 교양강의운영, 전공강의운영, 비교과운영, 종합점수
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(edu_tb)

## tibble [286 x 6] (S3: tbl_df/tbl/data.frame)
##  $ 학부        : num [1:286] 1 1 1 1 1 1 1 1 2 2 ...
##  $ 학년        : Factor w/ 4 levels "1학년","2학년",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ 교양강의운영: num [1:286] 47.6 33.3 50 35.7 54.7 39.3 46.4 42.9 42.9 32.1 ...
##  $ 전공강의운영: num [1:286] 40.5 35.7 52.4 28.5 92.8 53.6 46.4 67.9 50 28.6 ...
##  $ 비교과운영  : num [1:286] 40 33.3 50 40 43.3 55 45 35 25 30 ...
##  $ 종합점수    : num [1:286] 46.7 33.6 50.4 36.1 56.2 48.9 46.9 53.2 37.4 31.5 ...

edu_tb

## # A tibble: 286 x 6
##     학부 학년  교양강의운영 전공강의운영 비교과운영 종합점수
##    <dbl> <fct>        <dbl>        <dbl>      <dbl>    <dbl>
##  1     1 1학년         47.6         40.5       40       46.7
##  2     1 1학년         33.3         35.7       33.3     33.6
##  3     1 1학년         50           52.4       50       50.4
##  4     1 1학년         35.7         28.5       40       36.1
##  5     1 1학년         54.7         92.8       43.3     56.2
##  6     1 1학년         39.3         53.6       55       48.9
##  7     1 1학년         46.4         46.4       45       46.9
##  8     1 1학년         42.9         67.9       35       53.2
##  9     2 1학년         42.9         50         25       37.4
## 10     2 1학년         32.1         28.6       30       31.5
## # ... with 276 more rows

tail(edu_tb)

## # A tibble: 6 x 6
##    학부 학년  교양강의운영 전공강의운영 비교과운영 종합점수
##   <dbl> <fct>        <dbl>        <dbl>      <dbl>    <dbl>
## 1     4 4학년         61.9         83.3       50       68.3
## 2     4 4학년         73.8         83.3       53.3     47.6
## 3     4 4학년         76.1         80.9       80       75.1
## 4     4 4학년         52.4         50         36.6     53.6
## 5     4 4학년         69           80.9       73.3     72.8
## 6     4 4학년         47.6         54.7       56.6     48.3

3.기술통계분석

skim(edu_tb)

Data summary
Name	edu_tb
Number of rows	286
Number of columns	6
_______________________
Column type frequency:
factor	1
numeric	5
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
학년	0	1	FALSE	4	2학년: 97, 4학년: 72, 3학년: 67, 1학년: 50

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
학부	1	2.85	1.02	1	2.00	3.0	4.00	4	▃▆▁▇▇
교양강의운영	1	50.33	19.58	0	40.40	50.0	58.90	100	▁▃▇▂▁
전공강의운영	1	55.84	20.76	0	45.50	50.0	68.72	100	▁▂▇▃▂
비교과운영	1	46.65	19.99	0	35.00	50.0	56.60	100	▂▅▇▃▁
종합점수	1	51.21	16.34	0	41.73	49.7	60.00	100	▁▂▇▃▁

edu_tb %>% 
  group_by(학년) %>%
  get_summary_stats(비교과운영)

## # A tibble: 4 x 14
##   학년  variable       n   min   max median    q1    q3   iqr   mad  mean    sd
##   <fct> <chr>      <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1학년 비교과운영    50     0 100     41.6  30.8  50    19.2  12.4  40.4  17.6
## 2 2학년 비교과운영    97     0  93.3   46.6  35    56.6  21.6  17.2  45.7  21.3
## 3 3학년 비교과운영    67     0 100     50    43.3  63.3  20    14.8  52.5  19.7
## 4 4학년 비교과운영    72     0  93.3   50    36.6  56.6  20    14.8  46.8  18.7
## # ... with 2 more variables: se <dbl>, ci <dbl>

4.그래프 그리기(박스그래프,히스토그램)

edu_tb %>% 
  ggplot(mapping = aes(x = 학년,
                       y = 비교과운영)) +
  geom_boxplot() +
  labs(y = "비교과운영")

edu_tb %>% 
  ggplot(mapping = aes(x = 비교과운영)) +
  geom_histogram(bins = 10,               # binwidth=1,000: 값차이
                 color = "white", 
                 fill = "steelblue") +
  facet_wrap(~학년)

  #facet_wrap(학부~ 학년)                   # 그룹별 분리

5.이상치 제거

이상치 확인

edu_tb %>%
  select(학년, 비교과운영) %>%
  group_by(학년) %>%
  identify_outliers(비교과운영) %>%
  print(n=50)

## # A tibble: 23 x 4
##    학년  비교과운영 is.outlier is.extreme
##    <fct>      <dbl> <lgl>      <lgl>     
##  1 1학년        0   TRUE       FALSE     
##  2 1학년      100   TRUE       FALSE     
##  3 1학년        0   TRUE       FALSE     
##  4 2학년       90   TRUE       FALSE     
##  5 2학년        0   TRUE       FALSE     
##  6 2학년       93.3 TRUE       FALSE     
##  7 2학년        0   TRUE       FALSE     
##  8 2학년        0   TRUE       FALSE     
##  9 2학년        0   TRUE       FALSE     
## 10 2학년        0   TRUE       FALSE     
## 11 2학년       90   TRUE       FALSE     
## 12 2학년       93.3 TRUE       FALSE     
## 13 3학년      100   TRUE       FALSE     
## 14 3학년      100   TRUE       FALSE     
## 15 3학년      100   TRUE       FALSE     
## 16 3학년      100   TRUE       FALSE     
## 17 3학년        0   TRUE       FALSE     
## 18 3학년        6.7 TRUE       FALSE     
## 19 4학년        0   TRUE       FALSE     
## 20 4학년       93.3 TRUE       FALSE     
## 21 4학년        6.6 TRUE       FALSE     
## 22 4학년        6.6 TRUE       FALSE     
## 23 4학년        0   TRUE       FALSE

edu_tb %>% 
  ggplot(mapping = aes(x = 학년,
                       y = 비교과운영)) +
  geom_boxplot() +
  labs(y = "비교과운영")

!은 제외한 나머지 모두를 가져올때 사용함

edu_tb <- edu_tb %>%
  filter(!(비교과운영 <= 10 | 비교과운영 >= 100)) %>%
  filter(!(학년 == "1학년" & 비교과운영 >= 76.6 )) %>%
  filter(!(학년 == "2학년" & 비교과운영 >= 83.3)) %>%
  filter(!(학년 == "4학년" & 비교과운영 >= 93.3))

edu_tb %>% 
  ggplot(mapping = aes(x = 학년,
                       y = 비교과운영)) +
  geom_boxplot() +
  labs(y = "비교과운영")

# 6.정규분포 검정 ,모든학년이 p값이 0.05 보다 크므로 정규분포를 따른다.

edu_tb %>%
  group_by(학년) %>%
  shapiro_test(비교과운영)

## # A tibble: 4 x 4
##   학년  variable   statistic      p
##   <fct> <chr>          <dbl>  <dbl>
## 1 1학년 비교과운영     0.950 0.0502
## 2 2학년 비교과운영     0.975 0.0999
## 3 3학년 비교과운영     0.979 0.375 
## 4 4학년 비교과운영     0.975 0.204

if(0)" null hypothesis(H0; 귀무가설) 는 데이터가 정규분포를 따른다 이고, Alternative hypothesis(H1; 대립가설) 은 데이터가 정규분포를 따르지 않는다 이다. 이렇게 획득한 shapiro-wilk test 결과, p-value 가 0.05 보다 작을 경우 귀무가설을 reject하고 대립가설을 채택; p-value 가 0.05보다 클 경우 대립가설을 reject하고 귀무가설을 채택 "

7.등분산 검정 (대상 집단의 분산이 같은지 다른지를 통계적으로 검정하는 방법)

이분산일때는 하단의 Welch’s ANOVA test 참조

edu_tb %>%
  levene_test(비교과운영 ~ 학년)  # levene_test p값이 0.05이하이면 이분산 Welch's ANOVA test 사용함

## # A tibble: 1 x 4
##     df1   df2 statistic     p
##   <int> <int>     <dbl> <dbl>
## 1     3   253      1.04 0.375

귀무가설(H0): 모든 집단의 분산은 차이가 없다. 대립가설(H1): 적어도 하나 이상의 집단의 분산에 차이가 있다. 등분산 검정의 결과 유의값(p-value)이 .05 미만인 경우에는 대립가설을 지지하여 ’분산에 차이가 있다.’고 말할 수 있다. 반대로 유의값(p-value)이 .05 이상인 경우에는 귀무가설을 지지하여 ’분산에 차이가 없다.’고 말할 수 있다. R에서 등분산 검정을 수행하는 함수는 levene.test와 bartlett.test가 있다.

8.등분산 ANOVA

등분산일때 ANOVA분석 (종속변수 ~ 그룹변수)

p값이 0.05 보다 작으면 대립가설 성립(모든 그룹의 평균이 같지 않다,)

owa_result <- aov(비교과운영 ~ 학년, 
                     data=edu_tb)

tidy(owa_result)

## # A tibble: 2 x 6
##   term         df  sumsq meansq statistic  p.value
##   <chr>     <dbl>  <dbl>  <dbl>     <dbl>    <dbl>
## 1 학년          3  3185.  1062.      5.26  0.00154
## 2 Residuals   253 51036.   202.     NA    NA

summary(owa_result)

##              Df Sum Sq Mean Sq F value  Pr(>F)   
## 학년          3   3185  1061.7   5.263 0.00154 **
## Residuals   253  51036   201.7                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

edu_tb %>%
  anova_test(비교과운영 ~ 학년)

## Coefficient covariances computed by hccm()

## ANOVA Table (type II tests)
## 
##   Effect DFn DFd     F     p p<.05   ges
## 1   학년   3 253 5.263 0.002     * 0.059

Cohen’s d(effect size) 두 집단의 평균 차이를 일정한 기준으로 표현한 것

0.2 (small effect), 0.5 (moderate effect) and 0.8 (large effect)

edu_tb %>% 
  cohens_d(formula = 비교과운영 ~ 학년, 
           var.equal = TRUE)

## # A tibble: 6 x 7
##   .y.        group1 group2 effsize    n1    n2 magnitude 
## * <chr>      <chr>  <chr>    <dbl> <int> <int> <ord>     
## 1 비교과운영 1학년  2학년   -0.351    45    85 small     
## 2 비교과운영 1학년  3학년   -0.788    45    61 moderate  
## 3 비교과운영 1학년  4학년   -0.642    45    66 moderate  
## 4 비교과운영 2학년  3학년   -0.349    85    61 small     
## 5 비교과운영 2학년  4학년   -0.234    85    66 small     
## 6 비교과운영 3학년  4학년    0.118    61    66 negligible

등분산(2개의 모집단(Population) 에서 추출된 각 sample 간의 분산이 같음)이 성립되므로 Fisher LSD 실행시킴

9.사후검정(Multicamparison test ),(post hoc)

Fisher LSD(Least Significant Difference = “최소한의 유의미한 차이”)

edu_tb %>% 
  pairwise_t_test(비교과운영 ~ 학년, 
                     p.adj="non")

## # A tibble: 6 x 9
##   .y.        group1 group2    n1    n2        p p.signif    p.adj p.adj.signif
## * <chr>      <chr>  <chr>  <int> <int>    <dbl> <chr>       <dbl> <chr>       
## 1 비교과운영 1학년  2학년     45    85 0.0566   ns       0.0566   ns          
## 2 비교과운영 1학년  3학년     45    61 0.000311 ***      0.000311 ***         
## 3 비교과운영 2학년  3학년     85    61 0.0303   *        0.0303   *           
## 4 비교과운영 1학년  4학년     45    66 0.0021   **       0.0021   **          
## 5 비교과운영 2학년  4학년     85    66 0.133    ns       0.133    ns          
## 6 비교과운영 3학년  4학년     61    66 0.508    ns       0.508    ns

Bonferroni

edu_tb %>% 
  pairwise_t_test(비교과운영 ~ 학년, 
                     p.adj="bonferroni")

## # A tibble: 6 x 9
##   .y.        group1 group2    n1    n2        p p.signif   p.adj p.adj.signif
## * <chr>      <chr>  <chr>  <int> <int>    <dbl> <chr>      <dbl> <chr>       
## 1 비교과운영 1학년  2학년     45    85 0.0566   ns       0.34    ns          
## 2 비교과운영 1학년  3학년     45    61 0.000311 ***      0.00187 **          
## 3 비교과운영 2학년  3학년     85    61 0.0303   *        0.182   ns          
## 4 비교과운영 1학년  4학년     45    66 0.0021   **       0.0126  *           
## 5 비교과운영 2학년  4학년     85    66 0.133    ns       0.795   ns          
## 6 비교과운영 3학년  4학년     61    66 0.508    ns       1       ns

Tukey HSD

owa_result %>% 
  TukeyHSD() %>% 
  tidy()

## # A tibble: 6 x 7
##   term  contrast    null.value estimate conf.low conf.high adj.p.value
##   <chr> <chr>            <dbl>    <dbl>    <dbl>     <dbl>       <dbl>
## 1 학년  2학년-1학년          0     5.01   -1.76      11.8      0.224  
## 2 학년  3학년-1학년          0    10.2     2.99      17.4      0.00176
## 3 학년  4학년-1학년          0     8.53    1.43      15.6      0.0112 
## 4 학년  3학년-2학년          0     5.19   -0.973     11.4      0.132  
## 5 학년  4학년-2학년          0     3.52   -2.51       9.54     0.434  
## 6 학년  4학년-3학년          0    -1.67   -8.20       4.85     0.911

학년으로 표현

install.packages(“agricolae”)

library(agricolae)

console=TRUE: 결과를 화면에 표시

학년=TRUE: 그룹으로 묶어서 표시, FALSE: 1:1로 비교

LSD.test, duncan.test, scheffe.test

#본페로니 검정

owa_result %>%
  LSD.test("학년",
           group=FALSE, 
           console = TRUE)

## 
## Study: . ~ "학년"
## 
## LSD t Test for 비교과운영 
## 
## Mean Square Error:  201.7226 
## 
## 학년,  means and individual ( 95 %) CI
## 
##       비교과운영      std  r      LCL      UCL  Min  Max
## 1학년   40.81111 11.54500 45 36.64144 44.98078 15.0 70.0
## 2학년   45.82588 15.50789 85 42.79200 48.85976 13.3 80.0
## 3학년   51.01639 13.90427 61 47.43508 54.59771 20.0 83.3
## 4학년   49.34242 14.34192 66 45.89943 52.78542 16.6 80.0
## 
## Alpha: 0.05 ; DF Error: 253
## Critical Value of t: 1.969385 
## 
## Comparison between treatments means
## 
##               difference pvalue signif.        LCL        UCL
## 1학년 - 2학년  -5.014771 0.0566       . -10.171375  0.1418329
## 1학년 - 3학년 -10.205282 0.0003     *** -15.701825 -4.7087398
## 1학년 - 4학년  -8.531313 0.0021      ** -13.938746 -3.1238807
## 2학년 - 3학년  -5.190511 0.0303       *  -9.884152 -0.4968699
## 2학년 - 4학년  -3.516542 0.1325          -8.105508  1.0724245
## 3학년 - 4학년   1.673969 0.5076          -3.293930  6.6418688

owa_result %>%
  LSD.test("학년",
           group=TRUE, 
           console = TRUE)

## 
## Study: . ~ "학년"
## 
## LSD t Test for 비교과운영 
## 
## Mean Square Error:  201.7226 
## 
## 학년,  means and individual ( 95 %) CI
## 
##       비교과운영      std  r      LCL      UCL  Min  Max
## 1학년   40.81111 11.54500 45 36.64144 44.98078 15.0 70.0
## 2학년   45.82588 15.50789 85 42.79200 48.85976 13.3 80.0
## 3학년   51.01639 13.90427 61 47.43508 54.59771 20.0 83.3
## 4학년   49.34242 14.34192 66 45.89943 52.78542 16.6 80.0
## 
## Alpha: 0.05 ; DF Error: 253
## Critical Value of t: 1.969385 
## 
## Groups according to probability of means differences and alpha level( 0.05 )
## 
## Treatments with the same letter are not significantly different.
## 
##       비교과운영 groups
## 3학년   51.01639      a
## 4학년   49.34242     ab
## 2학년   45.82588     bc
## 1학년   40.81111      c

owa_result %>%
  duncan.test("학년",
              group=TRUE, 
              console = TRUE)

## 
## Study: . ~ "학년"
## 
## Duncan's new multiple range test
## for 비교과운영 
## 
## Mean Square Error:  201.7226 
## 
## 학년,  means
## 
##       비교과운영      std  r  Min  Max
## 1학년   40.81111 11.54500 45 15.0 70.0
## 2학년   45.82588 15.50789 85 13.3 80.0
## 3학년   51.01639 13.90427 61 20.0 83.3
## 4학년   49.34242 14.34192 66 16.6 80.0
## 
## Groups according to probability of means differences and alpha level( 0.05 )
## 
## Means with the same letter are not significantly different.
## 
##       비교과운영 groups
## 3학년   51.01639      a
## 4학년   49.34242      a
## 2학년   45.82588     ab
## 1학년   40.81111      b

owa_result %>%
  duncan.test("학년",
              group=FALSE, 
              console = TRUE)

## 
## Study: . ~ "학년"
## 
## Duncan's new multiple range test
## for 비교과운영 
## 
## Mean Square Error:  201.7226 
## 
## 학년,  means
## 
##       비교과운영      std  r  Min  Max
## 1학년   40.81111 11.54500 45 15.0 70.0
## 2학년   45.82588 15.50789 85 13.3 80.0
## 3학년   51.01639 13.90427 61 20.0 83.3
## 4학년   49.34242 14.34192 66 16.6 80.0
## 
## Comparison between treatments means
## 
##               difference pvalue signif.        LCL         UCL
## 1학년 - 2학년  -5.014771 0.0522       . -10.077902  0.04835986
## 1학년 - 3학년 -10.205282 0.0002     *** -15.713309 -4.69725579
## 1학년 - 4학년  -8.531313 0.0015      ** -13.861222 -3.20140464
## 2학년 - 3학년  -5.190511 0.0565       . -10.520420  0.13939740
## 2학년 - 4학년  -3.516542 0.1726          -8.579673  1.54658922
## 3학년 - 4학년   1.673969 0.5156          -3.389162  6.73710031

owa_result %>%
  scheffe.test("학년",
               group=TRUE, 
               console = TRUE)

## 
## Study: . ~ "학년"
## 
## Scheffe Test for 비교과운영 
## 
## Mean Square Error  : 201.7226 
## 
## 학년,  means
## 
##       비교과운영      std  r  Min  Max
## 1학년   40.81111 11.54500 45 15.0 70.0
## 2학년   45.82588 15.50789 85 13.3 80.0
## 3학년   51.01639 13.90427 61 20.0 83.3
## 4학년   49.34242 14.34192 66 16.6 80.0
## 
## Alpha: 0.05 ; DF Error: 253 
## Critical Value of F: 2.640281 
## 
## Groups according to probability of means differences and alpha level( 0.05 )
## 
## Means with the same letter are not significantly different.
## 
##       비교과운영 groups
## 3학년   51.01639      a
## 4학년   49.34242      a
## 2학년   45.82588     ab
## 1학년   40.81111      b

다중비교 그래프

owa_result %>% 
  TukeyHSD() %>% 
  plot()

library(infer)

t-test 두 집단의 평균차이를 검정

H0(귀무가설): 두 모집단은 평균간의 차이가 없다. H1(대립가설): 두 모집단은 평균간의 차이가 있다.

iris 데이터으로 ANOVA 분석 (Species 별로 Sepal.Length의 모평균이 차이가 있는 지 검정)

https://m.blog.naver.com/pmw9440/221745257123

#ANOVA 일 경우 F검정 사용 , 독립변수 3개이상일 경우 사용 # T 검정은 독립변수가 2개 일때 사용 (성별 : 여자, 남자 의 차이 )

10.추론(infer)을 이용한 가설검정 및 그래프

f_cal 계산

f_cal <- edu_tb %>%
  specify(formula = 비교과운영 ~ 학년) %>% # hypothesize 없음
  calculate(stat = "F") %>%  #독립변수가 1,2,3,4 학년 4개있으므로 F 검정 사용 
  print()

## Response: 비교과운영 (numeric)
## Explanatory: 학년 (factor)
## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1  5.26

Bootstrapping을 이용한 귀무가설 분포 생성

set.seed(123) 
null_dist_f <- edu_tb %>%
  specify(formula = 비교과운영 ~ 학년) %>% # 가설검정 공식을 specify 함수에 명세한다.
  hypothesize(null = "independence") %>% # 귀무가설을 hypothesize 함수에서 적시한다.독립성 검정(test of independence), 귀무가설 H0 : 두 변수 X와 Y는 서로 독립이다 (관련성이 없다), 대립가설 H1 : 두 변수 X와 Y는 서로 독립이 아니다 (관련성이 있다)
  generate(reps = 1000, type = "permute") %>% # 컴퓨터에서 모의실험 난수를 generate에서 생성시킨다.randomly reassigned => permute
  calculate(stat = "F") %>%  # 검정 통계량을 calculate 함수에 명시한다.
  print()

## Response: 비교과운영 (numeric)
## Explanatory: 학년 (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 x 2
##    replicate   stat
##        <int>  <dbl>
##  1         1 0.759 
##  2         2 0.880 
##  3         3 1.51  
##  4         4 1.21  
##  5         5 0.559 
##  6         6 0.417 
##  7         7 0.389 
##  8         8 0.360 
##  9         9 0.0417
## 10        10 2.75  
## # ... with 990 more rows

# stat = "F" ->  3개 이상의 집단을 대상으로 할 때 F-검정을 사용한다.

null_dist_f %>%
  visualize(method = "both") +           #method = "both": 이론분포+boot분포
  shade_p_value(obs_stat = f_cal,
                direction = "greater")

## Warning: Check to make sure the conditions have been met for the theoretical
## method. {infer} currently does not check these for you.

p_value

null_dist_f %>%
  get_p_value(obs_stat = f_cal,           
              direction = "greater") %>%
  print()

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.003

alternative는 two.sided, greater, less 3개의 값이 있다. default 값은 two.sided 이며, 이는 주어진 평균과 샘플이 단지 다르다는 것을 대립가설(H1)으로 두고자 할 때 사용되며 greater는 오로지 샘플이 주어진 평균보다 크다는 것을 대립가설(H1)을 두고자 할때, less는 작다는 것을 대립가설로 두고자 할 때 사용된다.

부록: 이분산(levene_test < 0.05) 일때 Welch’s ANOVA test

edu_tb %>% 
  welch_anova_test(비교과운영 ~ 학년)

## # A tibble: 1 x 7
##   .y.            n statistic   DFn   DFd       p method     
## * <chr>      <int>     <dbl> <dbl> <dbl>   <dbl> <chr>      
## 1 비교과운영   257      6.63     3  133. 0.00033 Welch ANOVA

edu_tb %>% 
  pairwise_t_test(비교과운영 ~ 학년, 
                     pool.sd = FALSE,      # 등분산 가정 안함
                     p.adj="bonferroni")

## # A tibble: 6 x 10
##   .y.     group1 group2    n1    n2 statistic    df       p   p.adj p.adj.signif
## * <chr>   <chr>  <chr>  <int> <int>     <dbl> <dbl>   <dbl>   <dbl> <chr>       
## 1 비교과~ 1학년  2학년     45    85    -2.08   114. 3.9 e-2 2.36e-1 ns          
## 2 비교과~ 1학년  3학년     45    61    -4.12   102. 7.65e-5 4.59e-4 ***         
## 3 비교과~ 1학년  4학년     45    66    -3.46   106. 7.79e-4 5   e-3 **          
## 4 비교과~ 2학년  3학년     85    61    -2.12   137. 3.6 e-2 2.15e-1 ns          
## 5 비교과~ 2학년  4학년     85    66    -1.44   144. 1.51e-1 9.06e-1 ns          
## 6 비교과~ 3학년  4학년     61    66     0.668  125. 5.06e-1 1   e+0 ns

부록: 정규분포 가정 문제가 있으면 shapiro_test < 0.05

비모수통계분석 Kruskal Wallis H test

edu_tb %>%
  kruskal_test(비교과운영 ~ 학년)

## # A tibble: 1 x 6
##   .y.            n statistic    df       p method        
## * <chr>      <int>     <dbl> <int>   <dbl> <chr>         
## 1 비교과운영   257      16.0     3 0.00113 Kruskal-Wallis

비모수 다중비교 Dunn test

edu_tb %>% 
  dunn_test(비교과운영 ~ 학년)

## # A tibble: 6 x 9
##   .y.        group1 group2    n1    n2 statistic        p   p.adj p.adj.signif
## * <chr>      <chr>  <chr>  <int> <int>     <dbl>    <dbl>   <dbl> <chr>       
## 1 비교과운영 1학년  2학년     45    85     1.88  0.0598   0.179   ns          
## 2 비교과운영 1학년  3학년     45    61     3.58  0.000337 0.00202 **          
## 3 비교과운영 1학년  4학년     45    66     3.24  0.00120  0.00599 **          
## 4 비교과운영 2학년  3학년     85    61     2.13  0.0332   0.133   ns          
## 5 비교과운영 2학년  4학년     85    66     1.70  0.0888   0.179   ns          
## 6 비교과운영 3학년  4학년     61    66    -0.441 0.659    0.659   ns

연습문제2 K대학에서는 재학생(1)과 교원(2)을 대상으로 교육과정에 대한 현황조사를 실시하였다. 종합점수가 재학생과 교원이 차이가 있는가?

1.기본 package 설정

1.1 library 로드

library(tidyverse)
library(tidymodels)
library(rstatix)
library(skimr)

2.데이터 불러오기

edu_tb <- read_csv('data2.csv', 
                   col_names = TRUE,
                   locale=locale('ko', encoding='euc-kr'), # 한글
                   na=".") %>%
  round(2) %>%                 # 소수점 2자리로 반올림
  mutate_if(is.character, as.factor) %>%
  mutate(구분 = factor(구분,
                       levels=c(1,2),
                       labels=c("재학생","교원")))

## Rows: 100 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (3): 번호, 구분, 종합
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(edu_tb)

## tibble [100 x 3] (S3: tbl_df/tbl/data.frame)
##  $ 번호: num [1:100] 42 10 23 31 20 29 44 45 35 18 ...
##  $ 구분: Factor w/ 2 levels "재학생","교원": 1 1 1 1 1 1 1 1 1 1 ...
##  $ 종합: num [1:100] 0 5 70.8 71.6 71.7 71.7 73.3 73.3 78.3 79.1 ...

edu_tb

## # A tibble: 100 x 3
##     번호 구분    종합
##    <dbl> <fct>  <dbl>
##  1    42 재학생   0  
##  2    10 재학생   5  
##  3    23 재학생  70.8
##  4    31 재학생  71.6
##  5    20 재학생  71.7
##  6    29 재학생  71.7
##  7    44 재학생  73.3
##  8    45 재학생  73.3
##  9    35 재학생  78.3
## 10    18 재학생  79.1
## # ... with 90 more rows

3.기술통계분석

skim(edu_tb)

Data summary
Name	edu_tb
Number of rows	100
Number of columns	3
_______________________
Column type frequency:
factor	1
numeric	2
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
구분	0	1	FALSE	2	재학생: 50, 교원: 50

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
번호	0	1	50.50	29.01	1	25.75	50.5	75.25	100.0	▇▇▇▇▇
종합	0	1	73.54	16.05	0	66.60	73.3	84.32	98.3	▁▁▂▇▇

edu_tb %>% 
  group_by(구분) %>%
  get_summary_stats(종합)

## # A tibble: 2 x 14
##   구분   variable     n   min   max median    q1    q3   iqr   mad  mean    sd
##   <fct>  <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 재학생 종합        50   0    95.8     70  62.0  78.9  16.9  12.5  68.6  17.6
## 2 교원   종합        50  39.1  98.3     80  70.2  87.5  17.3  13.6  78.5  12.7
## # ... with 2 more variables: se <dbl>, ci <dbl>

4.그래프 그리기(박스그래프,히스토그램)

edu_tb %>% 
  ggplot(mapping = aes(x = 구분,
                       y = 종합)) +
  geom_boxplot() +
  labs(y = "종합")

edu_tb %>% 
  ggplot(mapping = aes(x = 종합)) +
  geom_histogram(bins = 100,               # binwidth=1,000: 값차이
                 color = "white", 
                 fill = "steelblue") +
  facet_wrap(~ 구분)                   # 그룹별 분리

5.이상치 제거

이상치 확인

edu_tb %>%
  group_by(구분) %>%
  identify_outliers(종합)

## # A tibble: 3 x 5
##   구분    번호  종합 is.outlier is.extreme
##   <fct>  <dbl> <dbl> <lgl>      <lgl>     
## 1 재학생    42   0   TRUE       TRUE      
## 2 재학생    10   5   TRUE       TRUE      
## 3 교원      70  39.1 TRUE       FALSE

이상치 제거

edu_tb <- edu_tb %>%
  filter(!(구분 == "재학생" & 종합 <= 5 )) %>%
  filter(!(구분 == "교원" & 종합 <= 39.1 ))

edu_tb %>% 
  ggplot(mapping = aes(x = 구분,
                       y = 종합)) +
  geom_boxplot() +
  labs(y = "종합")

6.정규분포 검정, p값이 0.05 보다 크므로 정규분포를 따른다.

edu_tb %>%
  group_by(구분) %>%
  shapiro_test(종합)

## # A tibble: 2 x 4
##   구분   variable statistic     p
##   <fct>  <chr>        <dbl> <dbl>
## 1 재학생 종합         0.969 0.227
## 2 교원   종합         0.967 0.192

7.등분산 검정

edu_tb %>%
  levene_test(종합 ~ 구분)

## # A tibble: 1 x 4
##     df1   df2 statistic     p
##   <int> <int>     <dbl> <dbl>
## 1     1    95     0.231 0.632

# 0.05이므로 등분산 성립

8.등분산 t-검정

two-sided test: alternative = c(“two.sided”)

right-sided test: alternative = c(“greater”)

left-sided test: alternative = c(“less”)

등분산이면 var.equal=TRUE, 이분산이면 var.equal=FALSE

edu_tb %>% 
  t_test(formula = 종합 ~ 구분,
         comparisons = c("재학생", "교원"),
         paired = FALSE,
         var.equal = TRUE,
         alternative = "two.sided", # 양측 검정 (재학생,교원->두 집단의 차이만 검정 ) <-> 단측 검정 ( 모수의 평균으로 비교 )https://m.blog.naver.com/mykepzzang/220886418140 
         detailed = TRUE)

## # A tibble: 1 x 15
##   estimate estimate1 estimate2 .y.   group1 group2    n1    n2 statistic       p
## *    <dbl>     <dbl>     <dbl> <chr> <chr>  <chr>  <int> <int>     <dbl>   <dbl>
## 1    -8.01      71.3      79.3 종합  재학생 교원      48    49     -3.46 8.04e-4
## # ... with 5 more variables: df <dbl>, conf.low <dbl>, conf.high <dbl>,
## #   method <chr>, alternative <chr>

Cohen’s d(effect size)

0.2 (small effect), 0.5 (moderate effect) and 0.8 (large effect)

edu_tb %>% 
  cohens_d(formula = 종합 ~ 구분, 
           var.equal = TRUE)

## # A tibble: 1 x 7
##   .y.   group1 group2 effsize    n1    n2 magnitude
## * <chr> <chr>  <chr>    <dbl> <int> <int> <ord>    
## 1 종합  재학생 교원    -0.703    48    49 moderate

9.추론(infer)을 이용한 가설검정 및 그래프

9.1 표본평균(x)을 이용한 검정그래프

표본평균(x) 계산

x_bar <- edu_tb %>%
  specify(formula = 종합 ~ 구분) %>%   # hypothesize 없음
  calculate(stat = "diff in means",         # diff in means 변경, 재학생,교원의 차이
            order= c("재학생", "교원")) %>%
  print()

## Response: 종합 (numeric)
## Explanatory: 구분 (factor)
## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1 -8.01

Bootstrapping을 이용한 귀무가설 분포 생성

set.seed(123) 
null_dist_x <- edu_tb %>%
  specify(formula = 종합 ~ 구분) %>%
  hypothesize(null = "independence") %>% # 차이가 있는 가 확인 할때 independence
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", 
            order= c("재학생", "교원")) %>%
  print()

## Response: 종합 (numeric)
## Explanatory: 구분 (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 x 2
##    replicate   stat
##        <int>  <dbl>
##  1         1 -0.836
##  2         2  1.61 
##  3         3 -0.914
##  4         4  2.03 
##  5         5 -0.432
##  6         6 -2.34 
##  7         7 -2.25 
##  8         8 -3.36 
##  9         9  0.125
## 10        10  4.79 
## # ... with 990 more rows

신뢰구간 생성

null_dist_ci <- null_dist_x %>%
  get_ci(level = 0.95, 
         type = "percentile") %>%
  print()

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    -4.82     4.79

그래프 그리기

null_dist_x %>%
  visualize() +                                       # method 없음
  shade_p_value(obs_stat = x_bar,
                direction = "two-sided") +            # x_bar
  shade_confidence_interval(endpoints = null_dist_ci) # CI

p_value : 0.05 보다 작으므로 교원과 재학생들의 종합점수가 차이가 있다.

null_dist_x %>%
  get_p_value(obs_stat = x_bar,           
              direction = "two-sided") %>%
  print()

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.002

9.2 t값을 이용한 검정그래프

t_cal 계산

t_cal <- edu_tb %>%
  specify(formula = 종합 ~ 구분) %>% # hypothesize 없음
  calculate(stat = "t",  # 두 집단은 t 검정 사용
            order= c("재학생", "교원")) %>%
  print()

## Response: 종합 (numeric)
## Explanatory: 구분 (factor)
## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1 -3.46

Bootstrapping을 이용한 귀무가설 분포 생성

set.seed(123) 
null_dist_t <- edu_tb %>%
  specify(formula = 종합 ~ 구분) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t", 
            order= c("재학생", "교원")) %>%
  print()

## Response: 종합 (numeric)
## Explanatory: 구분 (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 x 2
##    replicate    stat
##        <int>   <dbl>
##  1         1 -0.341 
##  2         2  0.657 
##  3         3 -0.373 
##  4         4  0.829 
##  5         5 -0.176 
##  6         6 -0.959 
##  7         7 -0.924 
##  8         8 -1.38  
##  9         9  0.0508
## 10        10  1.99  
## # ... with 990 more rows

신뢰구간 생성

null_dist_ci <- null_dist_t %>%
  get_ci(level = 0.95, 
         type = "percentile") %>%
  print()

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    -2.00     1.99

null_dist_t %>%
  visualize(method = "both") +           #method = "both": 이론분포+boot분포
  shade_p_value(obs_stat = t_cal,
                direction = "two-sided") +
  shade_confidence_interval(endpoints = null_dist_ci)

## Warning: Check to make sure the conditions have been met for the theoretical
## method. {infer} currently does not check these for you.

부록. 비모수통계분석

정규분포검정 p < 0.05일 때 비모수 wilcox.test로 분석

edu_tb %>% 
  wilcox_test(formula = 종합 ~ 구분,
              alternative = "two.sided")

## # A tibble: 1 x 7
##   .y.   group1 group2    n1    n2 statistic        p
## * <chr> <chr>  <chr>  <int> <int>     <dbl>    <dbl>
## 1 종합  재학생 교원      48    49       718 0.000958

연습문제3

K대학에서는 재학생 만족도 조사를 실시하였다.

재학생 만족도가 50점이라고 할 수 있는가?

1.기본 package 설정

1.1 library 로드

library(tidyverse)
library(tidymodels)
library(rstatix)
library(skimr)

2.데이터 불러오기

sat_tb <- read_csv('data3.csv', 
                   col_names = TRUE,
                   locale=locale('ko', encoding='euc-kr'), # 한글
                   na=".") %>%
  round(2) %>%                  # 소수점 2자리로 반올림
  mutate_if(is.character, as.factor)

## Rows: 200 Columns: 2
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (2): no, satis
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(sat_tb)

## tibble [200 x 2] (S3: tbl_df/tbl/data.frame)
##  $ no   : num [1:200] 1 2 3 4 5 6 8 9 10 11 ...
##  $ satis: num [1:200] 33.8 39.3 41.8 59.3 41.8 31.3 18.7 43.6 43.1 50.4 ...

glimpse(sat_tb)

## Rows: 200
## Columns: 2
## $ no    <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 13, 14, 16, 17, 18, 19, 20, 21, ~
## $ satis <dbl> 33.8, 39.3, 41.8, 59.3, 41.8, 31.3, 18.7, 43.6, 43.1, 50.4, 46.7~

sat_tb

## # A tibble: 200 x 2
##       no satis
##    <dbl> <dbl>
##  1     1  33.8
##  2     2  39.3
##  3     3  41.8
##  4     4  59.3
##  5     5  41.8
##  6     6  31.3
##  7     8  18.7
##  8     9  43.6
##  9    10  43.1
## 10    11  50.4
## # ... with 190 more rows

3.기본통계치 확인

skim(sat_tb)

Data summary
Name	sat_tb
Number of rows	200
Number of columns	2
_______________________
Column type frequency:
numeric	2
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
no	0	1	112.82	64.58	1.0	56.75	113.5	168.25	222.0	▇▇▇▇▇
satis	0	1	49.98	17.17	16.2	39.45	48.4	59.32	96.3	▃▇▇▂▂

sat_tb %>%
  get_summary_stats(satis)

## # A tibble: 1 x 13
##   variable     n   min   max median    q1    q3   iqr   mad  mean    sd    se
##   <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 satis      200  16.2  96.3   48.4  39.4  59.3  19.9  15.0  50.0  17.2  1.21
## # ... with 1 more variable: ci <dbl>

4.그래프 그리기(박스그래프,히스토그램)

sat_tb %>% 
  ggplot(aes(y = satis)) +
  geom_boxplot() +
  labs(y = "만족도")

sat_tb %>% 
  ggplot(mapping = aes(x = satis)) +
  geom_histogram(binwidth = 10) +
  coord_cartesian()

#coord_cartesian(xlim = c(0, 167), ylim = c(0, 114))

5.이상치 제거

이상치 확인

sat_tb %>% 
  identify_outliers(satis)

## # A tibble: 7 x 4
##      no satis is.outlier is.extreme
##   <dbl> <dbl> <lgl>      <lgl>     
## 1    27  96.3 TRUE       FALSE     
## 2    88  94.1 TRUE       FALSE     
## 3   112  89.5 TRUE       FALSE     
## 4   119  90   TRUE       FALSE     
## 5   122  94.4 TRUE       FALSE     
## 6   209  92.1 TRUE       FALSE     
## 7   219  94   TRUE       FALSE

이상치 제거

sat_tb <- sat_tb %>%
  filter(!(satis >= 86))

sat_tb %>% 
  ggplot(aes(y = satis)) +
  geom_boxplot() +
  labs(y = "만족도")

6.정규분포 검정 ,0.05보다 크므로 정규분포 만족

sat_tb %>%
  shapiro_test(satis)

## # A tibble: 1 x 3
##   variable statistic     p
##   <chr>        <dbl> <dbl>
## 1 satis        0.990 0.177

7.통계분석

two-sided test: alternative = c(“two.sided”)

right-sided test: alternative = c(“greater”)

left-sided test: alternative = c(“less”)

sat_tb %>% 
  t_test(formula = satis ~ 1,
         alternative = "two.sided",  # two.sided 이며, 이는 주어진 평균(50)과 샘플이 단지 다르다는 것을 대립가설(H1)으로 두고자 할 때 사용되며
         mu = 50.0, 
         conf.level = 0.95,
         detailed = TRUE)

## # A tibble: 1 x 12
##   estimate .y.   group1 group2       n statistic      p    df conf.low conf.high
## *    <dbl> <chr> <chr>  <chr>    <int>     <dbl>  <dbl> <dbl>    <dbl>     <dbl>
## 1     48.0 satis 1      null mo~   191     -1.83 0.0683   190     45.9      50.1
## # ... with 2 more variables: method <chr>, alternative <chr>

Cohen’s d(effect size)

0.2 (small effect), 0.5 (moderate effect) and 0.8 (large effect)

sat_tb %>% 
  cohens_d(formula = satis ~ 1, 
           mu = 50.0)

## # A tibble: 1 x 6
##   .y.   group1 group2     effsize     n magnitude 
## * <chr> <chr>  <chr>        <dbl> <int> <ord>     
## 1 satis 1      null model  -0.133   191 negligible

negligible: 무시할수 있을만큼 차이가 없다.

8.추론(infer)을 이용한 가설검정 및 그래프

p.21 설명

8.1 표본평균(x)을 이용한 검정그래프

표본평균(x) 계산

x_bar <- sat_tb %>%
  specify(response = satis) %>%    # hypothesize 없음
  calculate(stat = "mean") %>%      # stat = "mean"
  print()

## Response: satis (numeric)
## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1  48.0

Bootstrapping을 이용한 귀무가설 분포 생성

set.seed(123) 
null_dist_x <- sat_tb %>%
  specify(response = satis) %>%
  hypothesize(null = "point",  # 평균의 비교 : point
              mu = 50) %>%
  generate(reps = 1000, 
           type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  print()

## Response: satis (numeric)
## Null Hypothesis: point
## # A tibble: 1,000 x 2
##    replicate  stat
##        <int> <dbl>
##  1         1  49.4
##  2         2  50.3
##  3         3  50.0
##  4         4  49.5
##  5         5  49.3
##  6         6  50.7
##  7         7  50.9
##  8         8  51.1
##  9         9  48.6
## 10        10  51.6
## # ... with 990 more rows

신뢰구간 생성

null_dist_ci <- null_dist_x %>%
  get_ci(level = 0.95, 
         type = "percentile") %>%
  print()

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     47.9     52.1

그래프 그리기

null_dist_x %>%
  visualize() +                                       # method 없음
  shade_p_value(obs_stat = x_bar,
                direction = "two-sided") +            # x_bar
  shade_confidence_interval(endpoints = null_dist_ci) # CI

p_value

null_dist_x %>%
  get_p_value(obs_stat = x_bar,           
              direction = "two-sided") %>%
  print()

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.066

0.05 보다 크므로 귀무가설 성립 -> 평균 50점과 차이가 없다.

8.2 t값을 이용한 검정그래프

t_cal 계산

t_cal <- sat_tb %>%
  specify(response = satis) %>%
  hypothesize(null = "point",         # hypothesize 필요
              mu = 50) %>%  
  calculate(stat = "t") %>%           # stat = "t"              
  print()

## Response: satis (numeric)
## Null Hypothesis: point
## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1 -1.83

Bootstrapping을 이용한 귀무가설 분포 생성

set.seed(123) 
null_dist_t <- sat_tb %>%
  specify(response = satis) %>%
  hypothesize(null = "point", 
              mu = 50) %>%
  generate(reps = 1000, 
           type = "bootstrap") %>%
  calculate(stat = "t") %>%
  print()

## Response: satis (numeric)
## Null Hypothesis: point
## # A tibble: 1,000 x 2
##    replicate    stat
##        <int>   <dbl>
##  1         1 -0.540 
##  2         2  0.299 
##  3         3  0.0284
##  4         4 -0.543 
##  5         5 -0.675 
##  6         6  0.653 
##  7         7  0.783 
##  8         8  1.08  
##  9         9 -1.42  
## 10        10  1.49  
## # ... with 990 more rows

신뢰구간 생성

null_dist_ci <- null_dist_t %>%
  get_ci(level = 0.95, 
         type = "percentile") %>%
  print()

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    -1.99     1.94

null_dist_t %>%
  visualize(method = "both") +           #method = "both": 이론분포+boot분포
  shade_p_value(obs_stat = t_cal,
                direction = "two-sided") +
  shade_confidence_interval(endpoints = null_dist_ci)

## Warning: Check to make sure the conditions have been met for the theoretical
## method. {infer} currently does not check these for you.

연습문제4

다음은 호흡과 뇌파와의 관계를 연구한 자료이다.

총 4개 채널이 있는데, 채널별로 알파파(al)와 베타파(be) 간에는 각각 차이가 있는가?

Ch1의 알파타와 베타파의 차이를 검증하고, 나머지 변수도 해보세요.

1.기본 package 설정

library(tidyverse)
library(tidymodels)
library(rstatix)
library(skimr)

2.데이터 불러오기

brain_tb <- read_csv('data4.csv', 
                     col_names = TRUE,
                     locale=locale('ko', encoding='euc-kr'), # 한글
                     na=".") %>%
  round(2) %>%                 # 소수점 2자리로 반올림
  mutate_if(is.character, as.factor)

## Rows: 144 Columns: 8
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (8): ch1al, ch2al, ch3al, ch4al, ch1be, ch2be, ch3be, ch4be
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(brain_tb)

## tibble [144 x 8] (S3: tbl_df/tbl/data.frame)
##  $ ch1al: num [1:144] 0.03 0.05 0.05 0.01 0.04 0.03 0.02 0.04 0.03 0.03 ...
##  $ ch2al: num [1:144] 0.03 0.04 0.02 0.01 0.04 0.03 0.02 0.05 0.03 0.05 ...
##  $ ch3al: num [1:144] 0.02 0.07 0.06 0.02 0.05 0.03 0.02 0.05 0.02 0.08 ...
##  $ ch4al: num [1:144] 0.02 0.07 0.06 0.02 0.06 0.02 0.02 0.03 0.03 0.08 ...
##  $ ch1be: num [1:144] 0.18 0.09 0.13 0.08 0.18 0.17 0.17 0.18 0.08 0.1 ...
##  $ ch2be: num [1:144] 0.14 0.14 0.15 0.09 0.13 0.28 0.17 0.16 0.1 0.1 ...
##  $ ch3be: num [1:144] 0.14 0.12 0.14 0.07 0.14 0.17 0.19 0.18 0.07 0.13 ...
##  $ ch4be: num [1:144] 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.1 0.1 ...

brain_tb

## # A tibble: 144 x 8
##    ch1al ch2al ch3al ch4al ch1be ch2be ch3be ch4be
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  0.03  0.03  0.02  0.02  0.18  0.14  0.14   0.1
##  2  0.05  0.04  0.07  0.07  0.09  0.14  0.12   0.1
##  3  0.05  0.02  0.06  0.06  0.13  0.15  0.14   0.1
##  4  0.01  0.01  0.02  0.02  0.08  0.09  0.07   0.1
##  5  0.04  0.04  0.05  0.06  0.18  0.13  0.14   0.1
##  6  0.03  0.03  0.03  0.02  0.17  0.28  0.17   0.2
##  7  0.02  0.02  0.02  0.02  0.17  0.17  0.19   0.2
##  8  0.04  0.05  0.05  0.03  0.18  0.16  0.18   0.2
##  9  0.03  0.03  0.02  0.03  0.08  0.1   0.07   0.1
## 10  0.03  0.05  0.08  0.08  0.1   0.1   0.13   0.1
## # ... with 134 more rows

long형으로 변형

brain_tb_long <- brain_tb %>%
  select(ch1al, ch1be) %>% # 필요한 변수만 선택
  pivot_longer(c("ch1al","ch1be"), #c("1999, 2000")에러남 
               names_to = "채널",
               values_to = "뇌파")

tail(brain_tb_long)

## # A tibble: 6 x 2
##   채널   뇌파
##   <chr> <dbl>
## 1 ch1al  0.07
## 2 ch1be  0.09
## 3 ch1al  0.07
## 4 ch1be  0.14
## 5 ch1al  0.02
## 6 ch1be  0.15

차이 추가

brain_tb <- brain_tb %>%
  mutate(차이 = ch1al-ch1be) # mutate -> 새로운 칼럼 생성

brain_tb

## # A tibble: 144 x 9
##    ch1al ch2al ch3al ch4al ch1be ch2be ch3be ch4be  차이
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  0.03  0.03  0.02  0.02  0.18  0.14  0.14   0.1 -0.15
##  2  0.05  0.04  0.07  0.07  0.09  0.14  0.12   0.1 -0.04
##  3  0.05  0.02  0.06  0.06  0.13  0.15  0.14   0.1 -0.08
##  4  0.01  0.01  0.02  0.02  0.08  0.09  0.07   0.1 -0.07
##  5  0.04  0.04  0.05  0.06  0.18  0.13  0.14   0.1 -0.14
##  6  0.03  0.03  0.03  0.02  0.17  0.28  0.17   0.2 -0.14
##  7  0.02  0.02  0.02  0.02  0.17  0.17  0.19   0.2 -0.15
##  8  0.04  0.05  0.05  0.03  0.18  0.16  0.18   0.2 -0.14
##  9  0.03  0.03  0.02  0.03  0.08  0.1   0.07   0.1 -0.05
## 10  0.03  0.05  0.08  0.08  0.1   0.1   0.13   0.1 -0.07
## # ... with 134 more rows

3.기술통계분석

skim(brain_tb)

Data summary
Name	brain_tb
Number of rows	144
Number of columns	9
_______________________
Column type frequency:
numeric	9
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ch1al	1	0.04	0.03	0.00	0.03	0.04	0.05	0.17	▇▇▂▁▁
ch2al	1	0.04	0.02	0.00	0.03	0.04	0.05	0.14	▂▇▂▁▁
ch3al	1	0.05	0.03	0.00	0.03	0.04	0.06	0.15	▅▇▂▁▁
ch4al	1	0.04	0.02	0.00	0.03	0.04	0.05	0.15	▆▇▁▁▁
ch1be	1	0.12	0.04	0.03	0.10	0.12	0.14	0.28	▃▇▅▁▁
ch2be	1	0.13	0.04	0.02	0.10	0.12	0.15	0.28	▂▇▆▂▁
ch3be	1	0.13	0.04	0.02	0.11	0.12	0.15	0.24	▁▂▇▂▁
ch4be	1	0.12	0.05	0.00	0.10	0.10	0.12	0.20	▁▁▇▁▃
차이	1	-0.08	0.05	-0.25	-0.11	-0.08	-0.05	0.12	▁▃▇▁▁

brain_tb %>% 
  get_summary_stats(ch1be, ch1al, 차이)

## # A tibble: 3 x 13
##   variable     n   min   max median    q1     q3   iqr   mad   mean    sd    se
##   <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 ch1al      144  0     0.17   0.04  0.03  0.053 0.023 0.015  0.044 0.025 0.002
## 2 ch1be      144  0.03  0.28   0.12  0.1   0.14  0.04  0.03   0.123 0.039 0.003
## 3 차이       144 -0.25  0.12  -0.08 -0.11 -0.05  0.06  0.044 -0.078 0.053 0.004
## # ... with 1 more variable: ci <dbl>

4.그래프 그리기(박스그래프,히스토그램)

brain_tb %>% 
  ggplot(aes(y = 차이)) +
  geom_boxplot() +
  labs(y = "뇌파")

brain_tb %>% 
  ggplot(aes(x = 차이)) +
  geom_histogram(binwidth = 0.05, 
                 color = "white", 
                 fill = "steelblue")

5.이상치 제거

이상치 확인

brain_tb %>%
  identify_outliers(차이)

## # A tibble: 6 x 11
##   ch1al ch2al ch3al ch4al ch1be ch2be ch3be ch4be  차이 is.outlier is.extreme
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>      <lgl>     
## 1  0.15  0.06  0.12  0.11  0.03  0.11  0.08   0.1  0.12 TRUE       FALSE     
## 2  0.17  0.14  0.15  0.15  0.06  0.08  0.06   0.1  0.11 TRUE       FALSE     
## 3  0.03  0.04  0.01  0.01  0.28  0.18  0.19   0.2 -0.25 TRUE       FALSE     
## 4  0.11  0.12  0.13  0.14  0.03  0.02  0.03   0    0.08 TRUE       FALSE     
## 5  0.11  0.11  0.12  0.07  0.06  0.06  0.07   0.2  0.05 TRUE       FALSE     
## 6  0     0     0     0     0.22  0.23  0.24   0.2 -0.22 TRUE       FALSE

이상치 제거

brain_tb <- brain_tb %>%
  filter(!(차이 >= 0.01 | 차이 <= -0.22))

차이는 AND 값으로 ‘|’ 설정

6.정규분포 검정 ,0.05보다 크므로 정규분포 만족

brain_tb %>%
  shapiro_test(차이)

## # A tibble: 1 x 3
##   variable statistic     p
##   <chr>        <dbl> <dbl>
## 1 차이         0.989 0.360

aired-test에서는 Pre에서 Post를 뺀 값을 가지고 정규성 검정을 해야 된다.

알파 - 베타 = 차이로 검정해야한다.

7.paired t-검정

two-sided test: alternative = c(“two.sided”)

right-sided test: alternative = c(“greater”)

left-sided test: alternative = c(“less”)

paired = TRUE, paired t-test 정규분포를 따를때 사용 (p값 0.05 클때)

실험전과 후의 차이 또는 알파파와 베타파의 차이는 paired t-test 검정

brain_tb_long %>% 
  t_test(formula = 뇌파 ~ 채널,
         ref.group = "ch1al",
         paired = TRUE,
         alternative = "two.sided",
         detailed = TRUE)

## # A tibble: 1 x 13
##   estimate .y.   group1 group2    n1    n2 statistic        p    df conf.low
## *    <dbl> <chr> <chr>  <chr>  <int> <int>     <dbl>    <dbl> <dbl>    <dbl>
## 1  -0.0785 뇌파  ch1al  ch1be    144   144     -17.9 2.98e-38   143  -0.0872
## # ... with 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>

Cohen’s d(effect size)

0.2 (small effect), 0.5 (moderate effect) and 0.8 (large effect)

brain_tb_long %>% 
  cohens_d(formula = 뇌파 ~ 채널, 
           paired = TRUE)

## # A tibble: 1 x 7
##   .y.   group1 group2 effsize    n1    n2 magnitude
## * <chr> <chr>  <chr>    <dbl> <int> <int> <ord>    
## 1 뇌파  ch1al  ch1be    -1.49   144   144 large

8.추론(infer)을 이용한 가설검정 및 그래프

8.1 표본평균(x)을 이용한 검정그래프

표본평균(x) 계산

x_bar <- brain_tb %>%
  specify(response = 차이) %>%    # hypothesize 없음
  calculate(stat = "mean") %>%      # stat = "mean"
  print()

## Response: 차이 (numeric)
## # A tibble: 1 x 1
##      stat
##     <dbl>
## 1 -0.0826

Bootstrapping을 이용한 귀무가설 분포 생성

set.seed(123) 
null_dist_x <- brain_tb %>%
  specify(response = 차이) %>%
  hypothesize(null = "point", 
              mu = 0) %>% # 차이가 소수점 0.033...등 이므로 0으로 설정 
  generate(reps = 1000, 
           type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  print()

## Response: 차이 (numeric)
## Null Hypothesis: point
## # A tibble: 1,000 x 2
##    replicate       stat
##        <int>      <dbl>
##  1         1 -0.00331  
##  2         2  0.00493  
##  3         3  0.00110  
##  4         4 -0.00272  
##  5         5  0.00118  
##  6         6  0.0000735
##  7         7  0.00287  
##  8         8  0.00434  
##  9         9  0.000368 
## 10        10 -0.00485  
## # ... with 990 more rows

신뢰구간 생성

null_dist_ci <- null_dist_x %>%
  get_ci(level = 0.95, 
         type = "percentile") %>%
  print()

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1 -0.00640  0.00647

그래프 그리기

null_dist_x %>%
  visualize() +                                       # method 없음
  shade_p_value(obs_stat = x_bar,
                direction = "two-sided") +            # x_bar
  shade_confidence_interval(endpoints = null_dist_ci) # CI

p_value

null_dist_x %>%
  get_p_value(obs_stat = x_bar,           
              direction = "two-sided") %>%
  print()

## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0

8.2 t값을 이용한 검정그래프

t_cal 계산

t_cal <- brain_tb %>%
  specify(response = 차이) %>%
  hypothesize(null = "point",         # hypothesize 필요
              mu = 0) %>%  
  calculate(stat = "t") %>%           # stat = "t"              
  print()

## Response: 차이 (numeric)
## Null Hypothesis: point
## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1 -24.4

Bootstrapping을 이용한 귀무가설 분포 생성

set.seed(123) 
null_dist_t <- brain_tb %>%
  specify(response = 차이) %>%
  hypothesize(null = "point", 
              mu = 0) %>% 
  generate(reps = 1000, 
           type = "bootstrap") %>%
  calculate(stat = "t") %>%
  print()

## Response: 차이 (numeric)
## Null Hypothesis: point
## # A tibble: 1,000 x 2
##    replicate    stat
##        <int>   <dbl>
##  1         1 -0.989 
##  2         2  1.33  
##  3         3  0.303 
##  4         4 -0.854 
##  5         5  0.369 
##  6         6  0.0218
##  7         7  0.840 
##  8         8  1.35  
##  9         9  0.105 
## 10        10 -1.51  
## # ... with 990 more rows

신뢰구간 생성

null_dist_ci <- null_dist_t %>%
  get_ci(level = 0.95, 
         type = "percentile") %>%
  print()

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    -1.88     1.92

null_dist_t %>%
  visualize(method = "both") +           #method = "both": 이론분포+boot분포
  shade_p_value(obs_stat = t_cal,
                direction = "two-sided") +
  shade_confidence_interval(endpoints = null_dist_ci)

## Warning: Check to make sure the conditions have been met for the theoretical
## method. {infer} currently does not check these for you.

연습문제5

K식품에서는 햄버거의 칼로리를 연구하고 있다.

햄버거의 칼로리가 500kcal 보다 작다고 말 할 수 있는가?

1.기본 package 설정

1.1 library 로드

library(tidyverse)
library(tidymodels)
library(rstatix)
library(skimr)

2.데이터 불러오기

cal_tb <- read_csv('data5.csv', 
                   col_names = TRUE,
                   locale=locale('ko', encoding='euc-kr'), # 한글
                   na=".") %>%
  round(2) %>%                  # 소수점 2자리로 반올림
  mutate_if(is.character, as.factor)

## Rows: 40 Columns: 1
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (1): cscore
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(cal_tb)

## tibble [40 x 1] (S3: tbl_df/tbl/data.frame)
##  $ cscore: num [1:40] 509 481 501 502 498 503 497 489 504 502 ...

glimpse(cal_tb)

## Rows: 40
## Columns: 1
## $ cscore <dbl> 509, 481, 501, 502, 498, 503, 497, 489, 504, 502, 499, 510, 504~

cal_tb

## # A tibble: 40 x 1
##    cscore
##     <dbl>
##  1    509
##  2    481
##  3    501
##  4    502
##  5    498
##  6    503
##  7    497
##  8    489
##  9    504
## 10    502
## # ... with 30 more rows

3.기본통계치 확인

skim(cal_tb) # mean -> 498

Data summary
Name	cal_tb
Number of rows	40
Number of columns	1
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
cscore	0	1	498.18	6.37	476	496.75	499.5	502	510	▁▁▂▇▁

cal_tb %>%
  get_summary_stats(cscore)

## # A tibble: 1 x 13
##   variable     n   min   max median    q1    q3   iqr   mad  mean    sd    se
##   <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 cscore      40   476   510   500.  497.   502  5.25  3.71  498.  6.37  1.01
## # ... with 1 more variable: ci <dbl>

4.그래프 그리기(박스그래프,히스토그램)

cal_tb %>% 
  ggplot(aes(y = cscore)) +
  geom_boxplot() +
  labs(y = "칼로리")

cal_tb %>% 
  ggplot(mapping = aes(x = cscore)) +
  geom_histogram(binwidth = 10) +
  coord_cartesian()

5.이상치 제거

이상치 확인

cal_tb %>% 
  identify_outliers(cscore)

## # A tibble: 4 x 3
##   cscore is.outlier is.extreme
##    <dbl> <lgl>      <lgl>     
## 1    481 TRUE       FALSE     
## 2    510 TRUE       FALSE     
## 3    476 TRUE       TRUE      
## 4    488 TRUE       FALSE

이상치 제거

cal_tb <- cal_tb %>%
  filter(!(cscore <= 489 | cscore >= 510))

cal_tb %>% 
  ggplot(aes(y = cscore)) +
  geom_boxplot() +
  labs(y = "칼로리")

6.정규분포 검정

cal_tb %>%
  shapiro_test(cscore)

## # A tibble: 1 x 3
##   variable statistic     p
##   <chr>        <dbl> <dbl>
## 1 cscore       0.967 0.376

7.통계분석

two-sided test: alternative = c(“two.sided”)

right-sided test: alternative = c(“greater”)

left-sided test: alternative = c(“less”)

cal_tb %>% 
  t_test(formula = cscore ~ 1,
         alternative = "less", # 가설이 '작다고 말 할 수 있는가?' 이므로 less 설정  
         mu = 500.0,  #  평균 칼로리 500칼로리
         conf.level = 0.95,
         detailed = TRUE)

## # A tibble: 1 x 12
##   estimate .y.    group1 group2       n statistic     p    df conf.low conf.high
## *    <dbl> <chr>  <chr>  <chr>    <int>     <dbl> <dbl> <dbl>    <dbl>     <dbl>
## 1     500. cscore 1      null mo~    35    -0.831 0.206    34     -Inf      501.
## # ... with 2 more variables: method <chr>, alternative <chr>

Cohen’s d(effect size)

0.2 (small effect), 0.5 (moderate effect) and 0.8 (large effect)

cal_tb %>% 
  cohens_d(formula = cscore ~ 1, 
           mu = 500.0)

## # A tibble: 1 x 6
##   .y.    group1 group2     effsize     n magnitude 
## * <chr>  <chr>  <chr>        <dbl> <int> <ord>     
## 1 cscore 1      null model  -0.140    35 negligible

negligible 차이가 없다.

8.추론(infer)을 이용한 가설검정 및 그래프

p.21 설명

8.1 표본평균(x)을 이용한 검정그래프

표본평균(x) 계산

x_bar <- cal_tb %>%
  specify(response = cscore) %>%    # hypothesize 없음
  calculate(stat = "mean") %>%      # stat = "mean"
  print()

## Response: cscore (numeric)
## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1  500.

Bootstrapping을 이용한 귀무가설 분포 생성

set.seed(123) 
null_dist_x <- cal_tb %>%
  specify(response = cscore) %>%
  hypothesize(null = "point", 
              mu = 500) %>%
  generate(reps = 1000, 
           type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  print()

## Response: cscore (numeric)
## Null Hypothesis: point
## # A tibble: 1,000 x 2
##    replicate  stat
##        <int> <dbl>
##  1         1  501.
##  2         2  499.
##  3         3  500.
##  4         4  500.
##  5         5  500.
##  6         6  499.
##  7         7  499.
##  8         8  499.
##  9         9  500.
## 10        10  500.
## # ... with 990 more rows

신뢰구간 생성

null_dist_ci <- null_dist_x %>%
  get_ci(level = 0.95, 
         type = "percentile") %>%
  print()

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     499.     501.

그래프 그리기

null_dist_x %>%
  visualize() +                                       # method 없음
  shade_p_value(obs_stat = x_bar,
                direction = "less") +            # x_bar
  shade_confidence_interval(endpoints = null_dist_ci) # CI

p_value

null_dist_x %>%
  get_p_value(obs_stat = x_bar,           
              direction = "less") %>%
  print()

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.207

8.2 t값을 이용한 검정그래프

t_cal 계산

t_cal <- cal_tb %>%
  specify(response = cscore) %>%
  hypothesize(null = "point",         # hypothesize 필요
              mu = 500) %>%  
  calculate(stat = "t") %>%           # stat = "t", 평균과 비교할때도 t검정         
  print()

## Response: cscore (numeric)
## Null Hypothesis: point
## # A tibble: 1 x 1
##     stat
##    <dbl>
## 1 -0.831

Bootstrapping을 이용한 귀무가설 분포 생성

set.seed(123) 
null_dist_t <- cal_tb %>%
  specify(response = cscore) %>%
  hypothesize(null = "point", 
              mu = 500) %>%
  generate(reps = 1000, 
           type = "bootstrap") %>%
  calculate(stat = "t") %>%
  print()

## Response: cscore (numeric)
## Null Hypothesis: point
## # A tibble: 1,000 x 2
##    replicate    stat
##        <int>   <dbl>
##  1         1  1.85  
##  2         2 -1.65  
##  3         3  0.0455
##  4         4 -0.544 
##  5         5 -0.354 
##  6         6 -1.34  
##  7         7 -1.06  
##  8         8 -1.08  
##  9         9 -0.982 
## 10        10 -0.799 
## # ... with 990 more rows

신뢰구간 생성

null_dist_ci <- null_dist_t %>%
  get_ci(level = 0.95, 
         type = "percentile") %>%
  print()

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    -2.07     1.94

null_dist_t %>%
  visualize(method = "both") +           #method = "both": 이론분포+boot분포
  shade_p_value(obs_stat = t_cal,
                direction = "less") +
  shade_confidence_interval(endpoints = null_dist_ci)

## Warning: Check to make sure the conditions have been met for the theoretical
## method. {infer} currently does not check these for you.

Share on

Copy URL

Kakao Facebook Twitter LinkedIn

자동주식투자프로그램 세팅방법

자동주식투자프로그램 프로젝트 환경설정

Data Mining Theory And Reality Midtestexample04

<!DOCTYPE html>

Data Mining Theory And Reality Midtestexample03

<!DOCTYPE html>

Data Mining Theory And Reality Midtestexample02

<!DOCTYPE html>

SeongJae Yoo

Share on

데이터마이닝의 이론과 실제 중간고사 연습문제

연습문제1

K대학에서는 재학생을 대상으로 교육과정에 대한 현황조사를 실시하였다.

학년에 따라 비교과운영 점수가 차이가 있는 가?

1.기본 package 설정

1.1 library 로드

2.데이터 불러오기

3.기술통계분석

4.그래프 그리기(박스그래프,히스토그램)

5.이상치 제거

이상치 확인

!은 제외한 나머지 모두를 가져올때 사용함

7.등분산 검정 (대상 집단의 분산이 같은지 다른지를 통계적으로 검정하는 방법)

이분산일때는 하단의 Welch’s ANOVA test 참조

8.등분산 ANOVA

등분산일때 ANOVA분석 (종속변수 ~ 그룹변수)

p값이 0.05 보다 작으면 대립가설 성립(모든 그룹의 평균이 같지 않다,)

Cohen’s d(effect size) 두 집단의 평균 차이를 일정한 기준으로 표현한 것

0.2 (small effect), 0.5 (moderate effect) and 0.8 (large effect)

등분산(2개의 모집단(Population) 에서 추출된 각 sample 간의 분산이 같음)이 성립되므로 Fisher LSD 실행시킴

9.사후검정(Multicamparison test ),(post hoc)

Fisher LSD(Least Significant Difference = “최소한의 유의미한 차이”)

Bonferroni

Tukey HSD

학년으로 표현

install.packages(“agricolae”)

console=TRUE: 결과를 화면에 표시

학년=TRUE: 그룹으로 묶어서 표시, FALSE: 1:1로 비교

LSD.test, duncan.test, scheffe.test

다중비교 그래프

t-test 두 집단의 평균차이를 검정

iris 데이터으로 ANOVA 분석 (Species 별로 Sepal.Length의 모평균이 차이가 있는 지 검정)

https://m.blog.naver.com/pmw9440/221745257123

10.추론(infer)을 이용한 가설검정 및 그래프

f_cal 계산

Bootstrapping을 이용한 귀무가설 분포 생성

p_value

부록: 이분산(levene_test < 0.05) 일때 Welch’s ANOVA test

부록: 정규분포 가정 문제가 있으면 shapiro_test < 0.05

비모수통계분석 Kruskal Wallis H test

비모수 다중비교 Dunn test

연습문제2 K대학에서는 재학생(1)과 교원(2)을 대상으로 교육과정에 대한 현황조사를 실시하였다. 종합점수가 재학생과 교원이 차이가 있는가?

1.기본 package 설정

1.1 library 로드

2.데이터 불러오기

3.기술통계분석

4.그래프 그리기(박스그래프,히스토그램)

5.이상치 제거

이상치 확인

이상치 제거

6.정규분포 검정, p값이 0.05 보다 크므로 정규분포를 따른다.

7.등분산 검정

8.등분산 t-검정

two-sided test: alternative = c(“two.sided”)

right-sided test: alternative = c(“greater”)

left-sided test: alternative = c(“less”)

등분산이면 var.equal=TRUE, 이분산이면 var.equal=FALSE

Cohen’s d(effect size)

0.2 (small effect), 0.5 (moderate effect) and 0.8 (large effect)

9.추론(infer)을 이용한 가설검정 및 그래프

9.1 표본평균(x)을 이용한 검정그래프

표본평균(x) 계산

Bootstrapping을 이용한 귀무가설 분포 생성

신뢰구간 생성

그래프 그리기

p_value : 0.05 보다 작으므로 교원과 재학생들의 종합점수가 차이가 있다.

9.2 t값을 이용한 검정그래프

t_cal 계산

Bootstrapping을 이용한 귀무가설 분포 생성

신뢰구간 생성

부록. 비모수통계분석

정규분포검정 p < 0.05일 때 비모수 wilcox.test로 분석

연습문제3

K대학에서는 재학생 만족도 조사를 실시하였다.

재학생 만족도가 50점이라고 할 수 있는가?

1.기본 package 설정

1.1 library 로드

2.데이터 불러오기