Share on

2022-06-10-Data-mining-theory-and-reality-midtestExample02

2 대학에는 국어학과, 수학과, 물리학과가 있다. 이교수는 각 학과별로 학생들의 만족도의 차이가 있는지를 검증하고자 한다. 과연 3개 학과간의 학생들의 만족도는 차이가 있는가?(data2.sav)

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(tidymodels)

## -- Attaching packages -------------------------------------- tidymodels 0.2.0 --

## v broom        0.8.0     v rsample      0.1.1
## v dials        0.1.1     v tune         0.2.0
## v infer        1.0.0     v workflows    0.2.6
## v modeldata    0.1.1     v workflowsets 0.2.1
## v parsnip      0.2.1     v yardstick    0.0.9
## v recipes      0.2.0

## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## * Learn how to get started at https://www.tidymodels.org/start/

library(rstatix)

## 
## 다음의 패키지를 부착합니다: 'rstatix'

## The following objects are masked from 'package:infer':
## 
##     chisq_test, prop_test, t_test

## The following object is masked from 'package:dials':
## 
##     get_n

## The following object is masked from 'package:stats':
## 
##     filter

library(skimr)

edu_tb <- read_csv('data2.csv', 
                   col_names = TRUE,
                   locale=locale('ko', encoding='euc-kr'), # 한글
                   na=".") %>%
  mutate_if(is.character, as.factor) %>%
  mutate(학과 = factor(학과,
                       levels=c(1:3),
                       labels=c("국어학과","수학과","물리학과")))

## Rows: 140 Columns: 2
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (2): 학과, 만족도
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(edu_tb)

## tibble [140 x 2] (S3: tbl_df/tbl/data.frame)
##  $ 학과  : Factor w/ 3 levels "국어학과","수학과",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ 만족도: num [1:140] 85 82 87 88 93 83 86 85 94 93 ...

edu_tb

## # A tibble: 140 x 2
##    학과     만족도
##    <fct>     <dbl>
##  1 국어학과     85
##  2 국어학과     82
##  3 국어학과     87
##  4 국어학과     88
##  5 국어학과     93
##  6 국어학과     83
##  7 국어학과     86
##  8 국어학과     85
##  9 국어학과     94
## 10 국어학과     93
## # ... with 130 more rows

기술통계분석

skim(edu_tb)

Data summary
Name	edu_tb
Number of rows	140
Number of columns	2
_______________________
Column type frequency:
factor	1
numeric	1
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
학과	0	1	FALSE	3	물리학: 50, 국어학: 45, 수학과: 45

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
만족도	0	1	86.48	8.44	10	83	87	91	99	▁▁▁▂▇

edu_tb %>% 
  group_by(학과) %>%
  get_summary_stats(만족도)

## # A tibble: 3 x 14
##   학과     variable     n   min   max median    q1    q3   iqr   mad  mean    sd
##   <fct>    <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 국어학과 만족도      45    10    99     87  84      92  8     5.93  86.3 12.8 
## 2 수학과   만족도      45    78    99     87  84      92  8     5.93  87.9  5.08
## 3 물리학과 만족도      50    76    96     86  81.2    89  7.75  5.93  85.4  5.36
## # ... with 2 more variables: se <dbl>, ci <dbl>

그래프 그리기(박스그래프,히스토그램)

edu_tb %>% 
  ggplot(mapping = aes(x = 학과,
                       y = 만족도)) +
  geom_boxplot() +
  labs(y = "만족도")

edu_tb %>% 
  ggplot(mapping = aes(x = 만족도)) +
  geom_histogram(bins = 10,              
                 color = "white", 
                 fill = "steelblue") +
  facet_wrap(~학과)

이상치 제거

이상치 확인

edu_tb %>%
  select(학과, 만족도) %>%
  group_by(학과) %>%
  identify_outliers(만족도) %>%
  print(n=50)

## # A tibble: 1 x 4
##   학과     만족도 is.outlier is.extreme
##   <fct>     <dbl> <lgl>      <lgl>     
## 1 국어학과     10 TRUE       TRUE

이상치 제거

edu_tb <- edu_tb %>%
  filter(!(학과 == "국어학과" & 만족도 <= 10 ))

edu_tb %>% 
  ggplot(mapping = aes(x = 학과,
                       y = 만족도)) +
  geom_boxplot() +
  labs(y = "만족도")

정규분포 확인

edu_tb %>%
  group_by(학과) %>%
  shapiro_test(만족도)

## # A tibble: 3 x 4
##   학과     variable statistic     p
##   <fct>    <chr>        <dbl> <dbl>
## 1 국어학과 만족도       0.974 0.412
## 2 수학과   만족도       0.973 0.385
## 3 물리학과 만족도       0.971 0.243

모두 0.05 보다 크므로 정규분포 만족함

등분산 검정

edu_tb %>%
  levene_test(만족도 ~ 학과)

## # A tibble: 1 x 4
##     df1   df2 statistic     p
##   <int> <int>     <dbl> <dbl>
## 1     2   136     0.209 0.812

0.05 보다 크므로 등분산 성립

등분산 ANOVA

owa_result <- aov(만족도 ~ 학과, 
                       data=edu_tb)

tidy(owa_result)

## # A tibble: 2 x 6
##   term         df sumsq meansq statistic p.value
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>   <dbl>
## 1 학과          2  213.  106.       3.80  0.0248
## 2 Residuals   136 3807.   28.0     NA    NA

edu_tb %>%
  anova_test(만족도 ~ 학과)

## Coefficient covariances computed by hccm()

## ANOVA Table (type II tests)
## 
##   Effect DFn DFd     F     p p<.05   ges
## 1   학과   2 136 3.799 0.025     * 0.053

edu_tb %>% 
  cohens_d(formula = 만족도 ~ 학과, 
           var.equal = TRUE)

## # A tibble: 3 x 7
##   .y.    group1   group2   effsize    n1    n2 magnitude 
## * <chr>  <chr>    <chr>      <dbl> <int> <int> <ord>     
## 1 만족도 국어학과 수학과    0.0255    44    45 negligible
## 2 만족도 국어학과 물리학과  0.490     44    50 small     
## 3 만족도 수학과   물리학과  0.480     45    50 small

사후검정

edu_tb %>% 
  pairwise_t_test(만족도 ~ 학과, 
                       p.adj="non")

## # A tibble: 3 x 9
##   .y.    group1   group2      n1    n2      p p.signif  p.adj p.adj.signif
## * <chr>  <chr>    <chr>    <int> <int>  <dbl> <chr>     <dbl> <chr>       
## 1 만족도 국어학과 수학과      44    45 0.905  ns       0.905  ns          
## 2 만족도 국어학과 물리학과    44    50 0.017  *        0.017  *           
## 3 만족도 수학과   물리학과    45    50 0.0225 *        0.0225 *

edu_tb %>% 
  pairwise_t_test(만족도 ~ 학과, 
                       p.adj="bonferroni")

## # A tibble: 3 x 9
##   .y.    group1   group2      n1    n2      p p.signif  p.adj p.adj.signif
## * <chr>  <chr>    <chr>    <int> <int>  <dbl> <chr>     <dbl> <chr>       
## 1 만족도 국어학과 수학과      44    45 0.905  ns       1      ns          
## 2 만족도 국어학과 물리학과    44    50 0.017  *        0.051  ns          
## 3 만족도 수학과   물리학과    45    50 0.0225 *        0.0676 ns

owa_result %>% 
  TukeyHSD() %>% 
  tidy()

## # A tibble: 3 x 7
##   term  contrast          null.value estimate conf.low conf.high adj.p.value
##   <chr> <chr>                  <dbl>    <dbl>    <dbl>     <dbl>       <dbl>
## 1 학과  수학과-국어학과            0   -0.134    -2.79    2.52        0.992 
## 2 학과  물리학과-국어학과          0   -2.64     -5.23   -0.0511      0.0445
## 3 학과  물리학과-수학과            0   -2.51     -5.09    0.0674      0.0581

library(agricolae)

owa_result %>%
  LSD.test("학과",
           group=TRUE, 
           console = TRUE)

## 
## Study: . ~ "학과"
## 
## LSD t Test for 만족도 
## 
## Mean Square Error:  27.99413 
## 
## 학과,  means and individual ( 95 %) CI
## 
##            만족도      std  r      LCL      UCL Min Max
## 국어학과 88.02273 5.428129 44 86.44534 89.60011  78  99
## 물리학과 85.38000 5.356248 50 83.90028 86.85972  76  96
## 수학과   87.88889 5.077679 45 86.32913 89.44865  78  99
## 
## Alpha: 0.05 ; DF Error: 136
## Critical Value of t: 1.977561 
## 
## Groups according to probability of means differences and alpha level( 0.05 )
## 
## Treatments with the same letter are not significantly different.
## 
##            만족도 groups
## 국어학과 88.02273      a
## 수학과   87.88889      a
## 물리학과 85.38000      b

owa_result %>%
  duncan.test("학과",
              group=TRUE, 
              console = TRUE)

## 
## Study: . ~ "학과"
## 
## Duncan's new multiple range test
## for 만족도 
## 
## Mean Square Error:  27.99413 
## 
## 학과,  means
## 
##            만족도      std  r Min Max
## 국어학과 88.02273 5.428129 44  78  99
## 물리학과 85.38000 5.356248 50  76  96
## 수학과   87.88889 5.077679 45  78  99
## 
## Groups according to probability of means differences and alpha level( 0.05 )
## 
## Means with the same letter are not significantly different.
## 
##            만족도 groups
## 국어학과 88.02273      a
## 수학과   87.88889      a
## 물리학과 85.38000      b

owa_result %>%
  scheffe.test("학과",
               group=TRUE, 
               console = TRUE)

## 
## Study: . ~ "학과"
## 
## Scheffe Test for 만족도 
## 
## Mean Square Error  : 27.99413 
## 
## 학과,  means
## 
##            만족도      std  r Min Max
## 국어학과 88.02273 5.428129 44  78  99
## 물리학과 85.38000 5.356248 50  76  96
## 수학과   87.88889 5.077679 45  78  99
## 
## Alpha: 0.05 ; DF Error: 136 
## Critical Value of F: 3.0627 
## 
## Groups according to probability of means differences and alpha level( 0.05 )
## 
## Means with the same letter are not significantly different.
## 
##            만족도 groups
## 국어학과 88.02273      a
## 수학과   87.88889      a
## 물리학과 85.38000      a

다중비교 그래프

owa_result %>% 
  TukeyHSD() %>% 
  plot()

library(infer)

추론(infer)을 이용한 가설검정 및 그래프

f_cal <- edu_tb %>%
  specify(formula = 만족도 ~ 학과) %>% # hypothesize 없음
  calculate(stat = "F") %>%   
  print()

## Response: 만족도 (numeric)
## Explanatory: 학과 (factor)
## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1  3.80

Bootstrapping을 이용한 귀무가설 분포 생성

set.seed(123) 
null_dist_f <- edu_tb %>%
  specify(formula = 만족도 ~ 학과) %>% # 가설검정 공식을 specify 함수에 명세한다.
  hypothesize(null = "independence") %>% # 귀무가설을 hypothesize 함수에서 적시한다.독립성 검정(test of independence), 귀무가설 H0 : 두 변수 X와 Y는 서로 독립이다 (관련성이 없다), 대립가설 H1 : 두 변수 X와 Y는 서로 독립이 아니다 (관련성이 있다)
  generate(reps = 1000, type = "permute") %>% # 컴퓨터에서 모의실험 난수를 generate에서 생성시킨다.randomly reassigned => permute
  calculate(stat = "F") %>%  # 검정 통계량을 calculate 함수에 명시한다.
  print()

## Response: 만족도 (numeric)
## Explanatory: 학과 (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 x 2
##    replicate   stat
##        <int>  <dbl>
##  1         1 1.60  
##  2         2 0.469 
##  3         3 1.57  
##  4         4 1.25  
##  5         5 1.92  
##  6         6 0.879 
##  7         7 2.53  
##  8         8 0.0482
##  9         9 0.506 
## 10        10 2.35  
## # ... with 990 more rows

stat = “F” -> 3개 이상의 집단을 대상으로 할 때 F-검정을 사용한다.

null_dist_f %>%
  visualize(method = "both") +           #method = "both": 이론분포+boot분포
  shade_p_value(obs_stat = f_cal,
                direction = "greater")

## Warning: Check to make sure the conditions have been met for the theoretical
## method. {infer} currently does not check these for you.

p_value

null_dist_f %>%
  get_p_value(obs_stat = f_cal,           
              direction = "greater") %>%
  print()

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.029

p값이 0.05보다 작으므로 3개 학과간의 학생들의 만족도는 차이가 있다.

Share on

Copy URL

Kakao Facebook Twitter LinkedIn

SeongJae Yoo

Data Mining Theory And Reality Midtestexample02

Share on

2022-06-10-Data-mining-theory-and-reality-midtestExample02

SeongJae-Yoo

2022 6 10

2 대학에는 국어학과, 수학과, 물리학과가 있다. 이교수는 각 학과별로 학생들의 만족도의 차이가 있는지를 검증하고자 한다. 과연 3개 학과간의 학생들의 만족도는 차이가 있는가?(data2.sav)

기술통계분석

그래프 그리기(박스그래프,히스토그램)

이상치 제거

이상치 확인

이상치 제거

정규분포 확인

모두 0.05 보다 크므로 정규분포 만족함

등분산 검정

0.05 보다 크므로 등분산 성립

등분산 ANOVA

사후검정

다중비교 그래프

추론(infer)을 이용한 가설검정 및 그래프

Bootstrapping을 이용한 귀무가설 분포 생성

stat = “F” -> 3개 이상의 집단을 대상으로 할 때 F-검정을 사용한다.

p_value

p값이 0.05보다 작으므로 3개 학과간의 학생들의 만족도는 차이가 있다.

Share on

Leave a comment

You may also enjoy

자동주식투자프로그램 세팅방법

Data Mining Theory And Reality Midtestexample04

Data Mining Theory And Reality Midtestexample03

Data Mining Theory And Reality Midtestexample01