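Exercise 5
2006 report "Drug Use: NCAA Study of Drug Use Among College Student-Athletes"
Is steroid use among athletes related to Division?
Test of independence: retrospective case-control design
1. Basic package setup
1.1 Loading libraries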
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 0.2.0 --
## v broom 0.8.0 v rsample 0.1.1
## v dials 0.1.1 v tune 0.2.0
## v infer 1.0.0 v workflows 0.2.6
## v modeldata 0.1.1 v workflowsets 0.2.1
## v parsnip 0.2.1 v yardstick 0.0.9
## v recipes 0.2.0
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## * Learn how to get started at https://www.tidymodels.org/start/
library(rstatix)
##
## Attaching package: 'rstatix'
## The following objects are masked from 'package:infer':
##
## chisq_test, prop_test, t_test
## The following object is masked from 'package:dials':
##
## get_n
## The following object is masked from 'package:stats':
##
## filter
library(skimr)
2. Loading the data
ncaa_tb <- read_csv('data5.csv',
                    col_names = TRUE,
                    locale=locale('ko', encoding='euc-kr'), # Korean locale, EUC-KR encoding
                    na=".") %>%
  round(2) %>% # round to 2 decimal places
  mutate_if(is.character, as.factor) %>%
  mutate(steroid = factor(steroid,
                          levels=c(1:2),
                          labels=c("Yes", "NO"))) %>%
  mutate(division = factor(division,
                           levels=c(1:3),
                           labels=c("D1","D2","D3")))
## Rows: 6 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (3): steroid, division, count
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(ncaa_tb)
## tibble [6 x 3] (S3: tbl_df/tbl/data.frame)
## $ steroid : Factor w/ 2 levels "Yes","NO": 1 1 1 2 2 2
## $ division: Factor w/ 3 levels "D1","D2","D3": 1 2 3 1 2 3
## $ count : num [1:6] 103 52 65 8440 4289 ...
ncaa_tb
## # A tibble: 6 x 3
## steroid division count
## <fct> <fct> <dbl>
## 1 Yes D1 103
## 2 Yes D2 52
## 3 Yes D3 65
## 4 NO D1 8440
## 5 NO D2 4289
## 6 NO D3 6428
skim(ncaa_tb)
Data summary
Name                   | ncaa_tb
Number of rows         | 6
Number of columns      | 3
Column type frequency: |
  factor               | 2
  numeric              | 1
Group variables        | None

Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts
steroid       | 0         | 1             | FALSE   | 2        | Yes: 3, NO: 3
division      | 0         | 1             | FALSE   | 3        | D1: 2, D2: 2, D3: 2

Variable type: numeric
skim_variable | n_missing | complete_rate | mean   | sd      | p0 | p25  | p50  | p75     | p100 | hist
count         | 0         | 1             | 3229.5 | 3698.32 | 52 | 74.5 | 2196 | 5893.25 | 8440 | ▇▁▂▂▂
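The data are secondary data that are already summarized as a frequency table, so the contingency table is built with xtabs(), in the form xtabs(count ~ row variable + column variable).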
ncaa_table <- xtabs(count ~ steroid + division ,
data=ncaa_tb)
ncaa_table
## division
## steroid D1 D2 D3
## Yes 103 52 65
## NO 8440 4289 6428
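As a rough sketch of how xtabs() turns pre-aggregated counts into a contingency table, the block below rebuilds the same 2 x 3 table from the six counts printed above and adds within-division proportions with prop.table(). The names ncaa_df, tab, and col_prop are placeholders introduced here for illustration; they are not part of the original analysis.

```r
# Hypothetical stand-in for data5.csv: the six rows printed above
ncaa_df <- data.frame(
  steroid  = factor(rep(c("Yes", "NO"), each = 3), levels = c("Yes", "NO")),
  division = factor(rep(c("D1", "D2", "D3"), times = 2)),
  count    = c(103, 52, 65, 8440, 4289, 6428)
)

# The count column goes on the left of ~; row variable + column variable on the right
tab <- xtabs(count ~ steroid + division, data = ncaa_df)

# Share of Yes/NO within each division (each column sums to 1)
col_prop <- prop.table(tab, margin = 2)
round(col_prop, 3)
```

Comparing the "Yes" row of col_prop across D1, D2, and D3 gives a quick informal look at whether steroid use differs by division before running any test.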
3. Plotting (mosaic plot)
mosaicplot(~ steroid + division,
           data = ncaa_table,
           color = TRUE, cex.axis = 1.2) # cex.axis sets the axis label size
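4. Chi-square analysis
install.packages("gmodels") # only needed once, if gmodels is not yet installed
library(gmodels)
ncaa_fit <- CrossTable(ncaa_table,
                       expected=TRUE, # show expected counts
                       chisq=TRUE,    # run Pearson's chi-squared test
                       asresid=FALSE)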
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Expected N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 19377
##
##
## | division
## steroid | D1 | D2 | D3 | Row Total |
## -------------|-----------|-----------|-----------|-----------|
## Yes | 103 | 52 | 65 | 220 |
## | 96.994 | 49.286 | 73.719 | |
## | 0.372 | 0.149 | 1.031 | |
## | 0.468 | 0.236 | 0.295 | 0.011 |
## | 0.012 | 0.012 | 0.010 | |
## | 0.005 | 0.003 | 0.003 | |
## -------------|-----------|-----------|-----------|-----------|
## NO | 8440 | 4289 | 6428 | 19157 |
## | 8446.006 | 4291.714 | 6419.281 | |
## | 0.004 | 0.002 | 0.012 | |
## | 0.441 | 0.224 | 0.336 | 0.989 |
## | 0.988 | 0.988 | 0.990 | |
## | 0.436 | 0.221 | 0.332 | |
## -------------|-----------|-----------|-----------|-----------|
## Column Total | 8543 | 4341 | 6493 | 19377 |
## | 0.441 | 0.224 | 0.335 | |
## -------------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 1.570407 d.f. = 2 p = 0.4560268
##
##
##
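The "Expected N" and "Chi-square contribution" rows in the table above follow the usual Pearson formulas. As a worked check on the Yes/D1 cell, using only the totals printed in the output (the symbols O, E, R, C, and N are notation introduced here for observed count, expected count, row total, column total, and table total):

```latex
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\mathrm{d.f.} = (r-1)(c-1) = (2-1)(3-1) = 2

E_{\text{Yes},\,D1} = \frac{220 \times 8543}{19377} \approx 96.994, \qquad
\frac{(103 - 96.994)^2}{96.994} \approx 0.372
```

Summing the six contributions shown in the table (0.372 + 0.149 + 1.031 + 0.004 + 0.002 + 0.012) gives about 1.570, in line with the reported Chi^2 = 1.570407. With p = 0.456, the test gives no evidence against independence of steroid use and division at the usual 0.05 level.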
5. Adjusted standardized residuals
ncaa_fit$chisq$stdres
## division
## steroid D1 D2 D3
## Yes 0.8201878 0.4413273 -1.2525387
## NO -0.8201878 -0.4413273 1.2525387
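To make the residuals above less of a black box, here is a minimal sketch that recomputes them by hand from the observed counts, assuming the standard definition of the adjusted standardized residual. The object names (obs, expected, adj_res, base_res) are introduced here for illustration only.

```r
# Observed counts copied from the contingency table above
obs <- matrix(c(103, 52, 65,
                8440, 4289, 6428),
              nrow = 2, byrow = TRUE,
              dimnames = list(steroid = c("Yes", "NO"),
                              division = c("D1", "D2", "D3")))

n        <- sum(obs)
row_p    <- rowSums(obs) / n        # row shares
col_p    <- colSums(obs) / n        # column shares
expected <- outer(row_p, col_p) * n # expected counts under independence

# Adjusted standardized residual:
# (O - E) / sqrt(E * (1 - row share) * (1 - column share))
adj_res <- (obs - expected) / sqrt(expected * outer(1 - row_p, 1 - col_p))
round(adj_res, 4)

# Cross-check against the stdres component of base R's chisq.test()
base_res <- chisq.test(obs)$stdres
all.equal(adj_res, base_res, check.attributes = FALSE)
```

Because the table has only two rows, the Yes and NO residuals in each column are mirror images. A common rule of thumb flags cells whose absolute residual exceeds about 1.96; none of the values above reach that threshold.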
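6. Odds ratio
A measure of the association between a risk factor and the occurrence of a disease, with 1 as the reference value (no association).
For example: how many times higher are the odds of lung cancer for smokers than for non-smokers?
d1_odds <- ncaa_fit$t[2,1]/ncaa_fit$t[1,1] # D1: odds of NO relative to Yes
d2_odds <- ncaa_fit$t[2,2]/ncaa_fit$t[1,2] # D2: odds of NO relative to Yes
d2_odds/d1_odds # odds ratio, D2 relative to D1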
## [1] 1.006578
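A point estimate alone does not show how precisely the odds ratio is known. The sketch below recomputes the same ratio directly from the counts and adds an approximate 95% interval using the Wald method on the log scale. The Wald interval is a choice made here for illustration, not something the original analysis uses, and the variable names are hypothetical.

```r
# Counts copied from the contingency table above (D1 and D2 columns only)
yes_d1 <- 103; no_d1 <- 8440
yes_d2 <- 52;  no_d2 <- 4289

# Same quantity as d2_odds / d1_odds: odds of "NO" in D2 relative to D1
or <- (no_d2 / yes_d2) / (no_d1 / yes_d1)

# Wald standard error of log(OR) and an approximate 95% interval
se_log_or <- sqrt(1 / yes_d1 + 1 / no_d1 + 1 / yes_d2 + 1 / no_d2)
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(c(odds_ratio = or, lower = ci[1], upper = ci[2]), 3)
```

An interval that contains 1 would be consistent with the non-significant chi-square result for the full table, although this comparison covers only D1 and D2.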