Theory and Practice of Data Mining: Final Exam Practice Problem 5

Practice Problem 5

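Source: the 2006 report "Drug Use: NCAA Research Report on Drug Use among College Student-Athletes"

Is there a relationship between student-athletes' steroid use and their NCAA Division?

Test of independence: retrospective case-control design

1. Basic package setup

1.1 Loading libraries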

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 0.2.0 --
## v broom        0.8.0     v rsample      0.1.1
## v dials        0.1.1     v tune         0.2.0
## v infer        1.0.0     v workflows    0.2.6
## v modeldata    0.1.1     v workflowsets 0.2.1
## v parsnip      0.2.1     v yardstick    0.0.9
## v recipes      0.2.0
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## * Learn how to get started at https://www.tidymodels.org/start/
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following objects are masked from 'package:infer':
## 
##     chisq_test, prop_test, t_test
## The following object is masked from 'package:dials':
## 
##     get_n
## The following object is masked from 'package:stats':
## 
##     filter
library(skimr)

2. Loading the data

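ncaa_tb <- read_csv('data5.csv', 
                      col_names = TRUE,
                      locale=locale('ko', encoding='euc-kr'), # Korean locale and encoding
                      na=".") %>%                               # "." marks missing values
  round(2) %>%                 # round numeric columns to 2 decimal places
  mutate_if(is.character, as.factor) %>%   # convert any character columns to factors
  mutate(steroid = factor(steroid,         # 1 = used steroids, 2 = did not
                          levels=c(1:2),
                          labels=c("Yes", "NO"))) %>%
  mutate(division = factor(division,       # 1-3 = NCAA Division I-III
                           levels=c(1:3),
                           labels=c("D1","D2","D3")))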
## Rows: 6 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (3): steroid, division, count
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(ncaa_tb)
## tibble [6 x 3] (S3: tbl_df/tbl/data.frame)
##  $ steroid : Factor w/ 2 levels "Yes","NO": 1 1 1 2 2 2
##  $ division: Factor w/ 3 levels "D1","D2","D3": 1 2 3 1 2 3
##  $ count   : num [1:6] 103 52 65 8440 4289 ...
ncaa_tb
## # A tibble: 6 x 3
##   steroid division count
##   <fct>   <fct>    <dbl>
## 1 Yes     D1         103
## 2 Yes     D2          52
## 3 Yes     D3          65
## 4 NO      D1        8440
## 5 NO      D2        4289
## 6 NO      D3        6428
skim(ncaa_tb)
Data summary
  Name                 ncaa_tb
  Number of rows       6
  Number of columns    3
  Column type frequency:
    factor             2
    numeric            1
  Group variables      None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
steroid                0              1  FALSE           2  Yes: 3, NO: 3
division               0              1  FALSE           3  D1: 2, D2: 2, D3: 2

Variable type: numeric

skim_variable  n_missing  complete_rate    mean       sd  p0   p25   p50      p75  p100  hist
count                  0              1  3229.5  3698.32  52  74.5  2196  5893.25  8440  ▇▁▂▂▂

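The data are already aggregated into cell counts rather than one row per athlete, so the contingency table is built with xtabs(), using the form xtabs(count ~ row variable + column variable).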

ncaa_table <- xtabs(count ~ steroid + division  ,
                      data=ncaa_tb)
ncaa_table
##        division
## steroid   D1   D2   D3
##     Yes  103   52   65
##     NO  8440 4289 6428
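
As a quick check on the table above, the sketch below rebuilds it from the six counts typed in by hand (so it runs without data5.csv) and computes the share of steroid users within each division. The names ncaa_df, tab and col_prop are illustrative and do not appear in the original analysis.

```r
# Counts copied from the table printed above; no CSV file is needed.
ncaa_df <- data.frame(
  steroid  = factor(rep(c("Yes", "NO"), each = 3), levels = c("Yes", "NO")),
  division = factor(rep(c("D1", "D2", "D3"), times = 2)),
  count    = c(103, 52, 65, 8440, 4289, 6428)
)

# xtabs(count ~ rows + columns) sums the counts into a two-way table.
tab <- xtabs(count ~ steroid + division, data = ncaa_df)

# Add row and column totals for a sanity check against the raw data.
addmargins(tab)

# Proportion of users and non-users within each division (columns sum to 1).
col_prop <- prop.table(tab, margin = 2)
round(col_prop, 3)
```

If steroid use were unrelated to division, the "Yes" row of col_prop would be roughly constant across D1, D2 and D3, which is what the chi-square test examines formally.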

3. Plotting (mosaic plot)

mosaicplot(~ steroid + division, 
           data = ncaa_table, 
           color = TRUE, cex.axis = 1.2)   # cex.axis sets the axis label size

4. Chi-square analysis

install.packages("gmodels")   # run once if the package is not yet installed

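library(gmodels)
ncaa_fit <- CrossTable(ncaa_table,
                         expected=TRUE,    # print expected counts
                         chisq=TRUE,       # run Pearson's chi-square test
                         asresid=FALSE)    # do not print adjusted residuals in the table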
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |              Expected N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  19377 
## 
##  
##              | division 
##      steroid |        D1 |        D2 |        D3 | Row Total | 
## -------------|-----------|-----------|-----------|-----------|
##          Yes |       103 |        52 |        65 |       220 | 
##              |    96.994 |    49.286 |    73.719 |           | 
##              |     0.372 |     0.149 |     1.031 |           | 
##              |     0.468 |     0.236 |     0.295 |     0.011 | 
##              |     0.012 |     0.012 |     0.010 |           | 
##              |     0.005 |     0.003 |     0.003 |           | 
## -------------|-----------|-----------|-----------|-----------|
##           NO |      8440 |      4289 |      6428 |     19157 | 
##              |  8446.006 |  4291.714 |  6419.281 |           | 
##              |     0.004 |     0.002 |     0.012 |           | 
##              |     0.441 |     0.224 |     0.336 |     0.989 | 
##              |     0.988 |     0.988 |     0.990 |           | 
##              |     0.436 |     0.221 |     0.332 |           | 
## -------------|-----------|-----------|-----------|-----------|
## Column Total |      8543 |      4341 |      6493 |     19377 | 
##              |     0.441 |     0.224 |     0.335 |           | 
## -------------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  1.570407     d.f. =  2     p =  0.4560268 
## 
## 
## 
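
The statistic reported above can be reproduced by hand, which makes the test easier to read: each expected count is (row total × column total) / n, and the statistic sums (O − E)² / E over all cells. The following is a minimal sketch in base R with the counts typed in directly; the variable names are illustrative.

```r
# Observed counts copied from the CrossTable output above.
observed <- matrix(c(103,   52,   65,
                     8440, 4289, 6428),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(steroid  = c("Yes", "NO"),
                                   division = c("D1", "D2", "D3")))

n <- sum(observed)

# Expected counts under independence: row total * column total / n.
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-square statistic and its degrees of freedom.
chi_sq <- sum((observed - expected)^2 / expected)
dof    <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail p-value from the chi-square distribution.
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

c(chi_sq = chi_sq, dof = dof, p_value = p_value)
```

With p ≈ 0.456, well above 0.05, the null hypothesis of independence is not rejected: these data give no evidence that steroid use differs by division.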

5. Adjusted standardized residuals

ncaa_fit$chisq$stdres
##        division
## steroid         D1         D2         D3
##     Yes  0.8201878  0.4413273 -1.2525387
##     NO  -0.8201878 -0.4413273  1.2525387
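
The adjusted standardized residual of a cell is (O − E) / sqrt(E × (1 − row proportion) × (1 − column proportion)). Under independence it is approximately standard normal, so values beyond about ±1.96 flag cells that deviate at the 5% level. The sketch below recomputes the residuals in base R from hand-typed counts; the variable names are illustrative.

```r
# Observed counts copied from the contingency table above.
observed <- matrix(c(103,   52,   65,
                     8440, 4289, 6428),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(steroid  = c("Yes", "NO"),
                                   division = c("D1", "D2", "D3")))

n        <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n
row_prop <- rowSums(observed) / n   # marginal proportion of each row
col_prop <- colSums(observed) / n   # marginal proportion of each column

# Adjusted standardized residuals, cell by cell.
adj_resid <- (observed - expected) /
  sqrt(expected * outer(1 - row_prop, 1 - col_prop))

round(adj_resid, 4)

# TRUE if any cell deviates significantly at the 5% level.
any(abs(adj_resid) > 1.96)
```

Every residual here lies inside ±1.96, so no single cell stands out, which agrees with the non-significant chi-square test.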

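6. Odds ratio

The odds ratio measures the association between a risk factor and an outcome such as disease occurrence, with 1 as the reference value (an odds ratio of 1 means no association).

For example: how many times higher are the odds of lung cancer among smokers than among non-smokers?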

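# ncaa_fit$t holds the observed counts; row 1 = "Yes", row 2 = "NO"
d1_odds <- ncaa_fit$t[2,1]/ncaa_fit$t[1,1]   # odds of non-use (NO / Yes) in D1
d2_odds <- ncaa_fit$t[2,2]/ncaa_fit$t[1,2]   # odds of non-use (NO / Yes) in D2

# Odds ratio, D2 relative to D1; a value near 1 means almost no difference
d2_odds/d1_odds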
## [1] 1.006578
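
A point estimate alone does not show how precisely the odds ratio is known. The sketch below adds a 95% Wald confidence interval on the log scale, a standard large-sample approximation that is not part of the original analysis. The counts are typed in from the table, and all variable names are illustrative.

```r
# Counts for D1 and D2 copied from the contingency table.
yes_d1 <- 103;  no_d1 <- 8440
yes_d2 <- 52;   no_d2 <- 4289

# Odds of non-use in D2 relative to D1 (same definition as above).
odds_ratio <- (no_d2 / yes_d2) / (no_d1 / yes_d1)

# Standard error of log(OR): square root of the sum of reciprocal cell counts.
se_log_or <- sqrt(1 / yes_d1 + 1 / no_d1 + 1 / yes_d2 + 1 / no_d2)

# 95% Wald interval, computed on the log scale and transformed back.
ci <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)

odds_ratio
ci
```

The interval contains 1, so the D1 versus D2 comparison is consistent with no difference in the odds of steroid use, in line with the chi-square result.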
