
Analyzing categorical and continuous data

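When both the independent and the dependent variable are categorical, use cross-tabulation analysis.

What is cross-tabulation analysis?
Reference: https://m.blog.naver.com/PostView.naver?isHttpsRedirect=true&blogId=aporia25&logNo=221156141366

Cross-tabulation analysis examines the association between categorical variables by building a contingency table (cross table) and inspecting it. It uses the cell frequencies to assess the association, with the chi-square ($\chi^2$) statistic as the test statistic. For this reason, cross-tabulation analysis is also called the chi-square ($\chi^2$) test.

Null hypothesis: there is no difference between the expected and the observed frequencies (the variables are independent).

Alternative hypothesis: there is a difference between the expected and the observed frequencies (the variables are associated).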
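To make "expected frequency" concrete: under independence, the expected count of a cell is usually computed as $E_{ij} = \frac{R_i \times C_j}{n}$, where $R_i$ is the row total, $C_j$ the column total and $n$ the grand total. The following is a minimal sketch on a made-up 2 x 2 table; the counts and the names `obs` and `expected` are illustrative assumptions and do not come from the cosmetics data.

```r
# Hypothetical 2 x 2 table of observed counts (illustrative values only)
obs <- matrix(c(30, 20,
                10, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"), answer = c("yes", "no")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected

# chisq.test() computes the same expected counts internally
all.equal(unname(expected), unname(chisq.test(obs)$expected))
```

The further the observed counts are from these expected counts, the stronger the evidence against independence.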


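When the independent variable is categorical and the dependent variable is continuous, use a t-test or analysis of variance (ANOVA).

Independent variable with 2 categories: t-test
Independent variable with 3 or more categories: ANOVA

When both the independent and the dependent variable are continuous, use correlation analysis or linear regression.
When the independent variable is continuous and the dependent variable is categorical, use logistic regression or discriminant analysis. (Cluster analysis is sometimes listed here as well, but it is an unsupervised method that groups observations without a dependent variable.)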
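The mapping from variable types to tests can be sketched with base R calls. Everything below runs on simulated data; the variable names (`g2`, `g3`, `x`, `y`, `z`) and the way the data are generated are assumptions made only for illustration.

```r
set.seed(1)                      # simulated data, for illustration only
n  <- 60
g2 <- factor(rep(c("a", "b"), each = n / 2))         # categorical, 2 levels
g3 <- factor(rep(c("a", "b", "c"), each = n / 3))    # categorical, 3 levels
x  <- rnorm(n)                                       # continuous predictor
y  <- 2 * x + rnorm(n)                               # continuous outcome
z  <- factor(ifelse(x + rnorm(n) > 0, "yes", "no"))  # categorical outcome

chisq.test(table(g2, z))                 # categorical -> categorical
t.test(y ~ g2)                           # 2 groups -> continuous
summary(aov(y ~ g3))                     # 3 or more groups -> continuous
cor.test(x, y)                           # continuous <-> continuous (correlation)
summary(lm(y ~ x))                       # continuous -> continuous (regression)
summary(glm(z ~ x, family = binomial))   # continuous -> categorical (logistic)
```

Each call prints its own test statistic and p-value, so the choice of test mostly comes down to identifying the types of the two variables first.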

# Hands-on example
a <- read.csv('cosmetics.csv', header = TRUE, sep = ',')
head(a)

gender	marriage	edu	job	mincome	aware	count	amount	decision	propensity	skin	promo	location	satisf_b	satisf_i	satisf_al	repurchase
1	1	4	1	2	2	1	11000	2	1	1	1	2	5	2	2	2
2	1	4	9	2	1	4	30000	1	1	3	2	3	2	3	3	4
2	2	4	4	3	1	6	100000	3	2	3	2	2	4	5	4	4
2	2	4	7	5	2	6	65000	3	2	5	2	3	3	4	4	4
1	2	6	6	5	2	2	50000	2	2	3	2	3	3	3	3	3
2	2	2	7	3	1	2	100000	2	1	4	2	3	3	4	4	3

str(a)

'data.frame':	247 obs. of  17 variables:
 $ gender    : int  1 2 2 2 1 2 2 1 2 2 ...
 $ marriage  : int  1 1 2 2 2 2 1 1 2 2 ...
 $ edu       : int  4 4 4 4 6 2 6 6 4 4 ...
 $ job       : int  1 9 4 7 6 7 4 4 5 5 ...
 $ mincome   : int  2 2 3 5 5 3 5 5 2 2 ...
 $ aware     : int  2 1 1 2 2 1 1 4 2 1 ...
 $ count     : int  1 4 6 6 2 2 5 10 2 2 ...
 $ amount    : int  11000 30000 100000 65000 50000 100000 100000 39000 40000 100000 ...
 $ decision  : int  2 1 3 3 2 2 3 3 3 3 ...
 $ propensity: int  1 1 2 2 2 1 2 2 2 3 ...
 $ skin      : int  1 3 3 5 3 4 5 2 3 3 ...
 $ promo     : int  1 2 2 2 2 2 2 1 2 1 ...
 $ location  : int  2 3 2 3 3 3 3 2 3 3 ...
 $ satisf_b  : int  5 2 4 3 3 3 2 4 3 2 ...
 $ satisf_i  : int  2 3 5 4 3 4 2 4 4 3 ...
 $ satisf_al : int  2 3 4 4 3 4 3 4 4 4 ...
 $ repurchase: int  2 4 4 4 3 3 4 4 4 4 ...

table(a$gender)

  1   2
132 115

table(a$marriage)

  1   2
 71 176

attach(a) # after attach(), columns can be used without the a$ prefix

table(gender)

gender
  1   2
132 115

table(job)

job
 1  2  3  4  5  6  7  8  9 10
13 23 39 89  8 19 27 14  6  9

detach(a)
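As an alternative to the attach()/detach() pair, `with()` exposes the columns only for the duration of one call, which avoids masking messages when a data frame is attached more than once. A minimal sketch on a toy data frame (`df` and its values are made up for illustration):

```r
# Toy data frame standing in for the real data (values are illustrative)
df <- data.frame(gender = c(1, 2, 2, 1, 2),
                 job    = c(4, 4, 7, 1, 4))

with(df, table(gender))        # columns are visible only inside this call
with(df, table(gender, job))   # two-way table without any $ prefixes
```

Because nothing is added to the search path, later changes to `df` are always picked up by the next `with()` call.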

a$gender<-factor(a$gender, levels=c(1,2), labels = c('male','female'))

str(a)

'data.frame':	247 obs. of  17 variables:
 $ gender    : Factor w/ 2 levels "male","female": 1 2 2 2 1 2 2 1 2 2 ...
 $ marriage  : int  1 1 2 2 2 2 1 1 2 2 ...
 $ edu       : int  4 4 4 4 6 2 6 6 4 4 ...
 $ job       : int  1 9 4 7 6 7 4 4 5 5 ...
 $ mincome   : int  2 2 3 5 5 3 5 5 2 2 ...
 $ aware     : int  2 1 1 2 2 1 1 4 2 1 ...
 $ count     : int  1 4 6 6 2 2 5 10 2 2 ...
 $ amount    : int  11000 30000 100000 65000 50000 100000 100000 39000 40000 100000 ...
 $ decision  : int  2 1 3 3 2 2 3 3 3 3 ...
 $ propensity: int  1 1 2 2 2 1 2 2 2 3 ...
 $ skin      : int  1 3 3 5 3 4 5 2 3 3 ...
 $ promo     : int  1 2 2 2 2 2 2 1 2 1 ...
 $ location  : int  2 3 2 3 3 3 3 2 3 3 ...
 $ satisf_b  : int  5 2 4 3 3 3 2 4 3 2 ...
 $ satisf_i  : int  2 3 5 4 3 4 2 4 4 3 ...
 $ satisf_al : int  2 3 4 4 3 4 3 4 4 4 ...
 $ repurchase: int  2 4 4 4 3 3 4 4 4 4 ...

table(a$gender)

  male female
   132    115

install.packages('descr')

package 'descr' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\MyCom\AppData\Local\Temp\Rtmp8OPV9d\downloaded_packages

library(descr)

Warning message:
"package 'descr' was built under R version 3.6.3"

freq(a$gender) # freq() requires the descr package to be installed and loaded

	Frequency	Percent
male	132	53.4413
female	115	46.5587
Total	247	100.0000

[Figure: bar plot of gender frequencies]

install.packages('ggplot2')
library(ggplot2)

  There is a binary version available but the source version is later:
        binary source needs_compilation
ggplot2  3.3.3  3.3.5             FALSE



installing the source package 'ggplot2'

Registered S3 methods overwritten by 'tibble':
  method     from
  format.tbl pillar
  print.tbl  pillar

ggplot(a,aes(x=gender)) + geom_bar(color='blue')

[Figure: ggplot2 bar chart of gender counts]

attach(a)

freq(edu)

	Frequency	Percent
2	30	12.1457490
3	9	3.6437247
4	136	55.0607287
5	2	0.8097166
6	29	11.7408907
7	15	6.0728745
8	26	10.5263158
Total	247	100.0000000

[Figure: bar plot of edu frequencies]

install.packages('car')
library(car)

also installing the dependency 'lme4'

  There are binary versions available but the source versions are later:
     binary   source needs_compilation
lme4 1.1-26 1.1-27.1              TRUE
car  3.0-10   3.0-12             FALSE

  Binaries will be installed
package 'lme4' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\MyCom\AppData\Local\Temp\Rtmp8OPV9d\downloaded_packages

installing the source package 'car'

Loading required package: carData

a$eduM<-recode(edu,"lo:2=1; 3:4=2; 5:hi=3; else=NA") # keep edu unchanged and create eduM: values up to 2 become 1, 3-4 become 2, 5 and above become 3

freq(a$eduM) # confirms that the values are now grouped into 3 categories

	Frequency	Percent
1	30	12.14575
2	145	58.70445
3	72	29.14980
Total	247	100.00000

[Figure: bar plot of eduM frequencies (three groups)]

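# Labels: 중졸이하 = middle school graduate or below, 고졸이하 = high school graduate or below, 대졸이상 = college graduate or above
a$eduM <- factor(a$eduM, levels=c(1,2,3), labels=c('중졸이하','고졸이하','대졸이상'))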

freq(a$eduM)

	Frequency	Percent
중졸이하	30	12.14575
고졸이하	145	58.70445
대졸이상	72	29.14980
Total	247	100.00000

[Figure: bar plot of eduM frequencies with labels]
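The same three-way grouping can also be sketched in base R with `cut()`, without the car package. The vector `edu` below simply lists the education codes that occur in this post's output, and the labels `group1` to `group3` are placeholder names chosen for illustration.

```r
edu <- c(2, 3, 4, 5, 6, 7, 8)   # the education codes that occur in the freq(edu) output

# Intervals: (-Inf, 2] -> group1, (2, 4] -> group2, (4, Inf) -> group3
edu_group <- cut(edu,
                 breaks = c(-Inf, 2, 4, Inf),
                 labels = c("group1", "group2", "group3"))

table(edu, edu_group)           # shows which code lands in which group
```

`cut()` returns a factor directly, so the separate `factor()` step is not needed with this approach.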

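Descriptive statistics

Central tendency

Mean, median, mode: where is the center of the data?

Dispersion

Variance: the mean of the squared deviations (R's var() divides by n - 1)
Standard deviation: the square root of the variance
Range
Interquartile range (IQR)

Shape of the distribution

Skewness: the degree of left-right asymmetry
Kurtosis: the degree of peakedness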
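The measures listed above can be sketched on a small made-up vector. The helper `stat_mode()` and the moment-based skewness and kurtosis formulas are one common choice, used here as an assumption; packages such as psych may use slightly different definitions.

```r
x <- c(2, 3, 3, 4, 5, 7, 18)   # made-up sample with one large value

mean(x); median(x)

# Base R has no built-in statistical mode; this helper returns the most frequent value
stat_mode <- function(v) {
  tab <- table(v)
  as.numeric(names(tab)[which.max(tab)])
}
stat_mode(x)

var(x)            # sample variance (divides by n - 1)
sd(x)             # square root of the variance
diff(range(x))    # range = max - min
IQR(x)            # interquartile range = Q3 - Q1

# Moment-based skewness and excess kurtosis (one common definition)
m <- mean(x)
s <- sqrt(mean((x - m)^2))
skew <- mean((x - m)^3) / s^3
kurt <- mean((x - m)^4) / s^4 - 3
c(skewness = skew, kurtosis = kurt)
```

A positive skewness, as in this sample, indicates a long right tail, which is also why the mean sits above the median.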

attach(a)

The following objects are masked from a (pos = 5):

    amount, aware, count, decision, edu, gender, job, location,
    marriage, mincome, promo, propensity, repurchase, satisf_al,
    satisf_b, satisf_i, skin

max(amount)

5000000

min(amount)

3000

sum(amount)

38023000

mean(amount)

153939.271255061

var(amount)

158463699549.06

sd(amount)

398074.992368348

install.packages('psych')

also installing the dependencies 'tmvnsim', 'mnormt'

  There is a binary version available but the source version is later:
      binary source needs_compilation
psych  2.1.3  2.1.9             FALSE

package 'tmvnsim' successfully unpacked and MD5 sums checked
package 'mnormt' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\MyCom\AppData\Local\Temp\Rtmp8OPV9d\downloaded_packages

installing the source package 'psych'

library(psych)

Attaching package: 'psych'

The following object is masked from 'package:car':

    logit

The following objects are masked from 'package:ggplot2':

    %+%, alpha

describe(a)

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
gender*	1	247	1.465587e+00	4.998272e-01	1	1.457286	0.0000	1	2	1	0.13714193	-1.9891964	3.180324e-02
marriage	2	247	1.712551e+00	4.534918e-01	2	1.763819	0.0000	1	2	1	-0.93360038	-1.1329280	2.885499e-02
edu	3	247	4.566802e+00	1.709191e+00	4	4.462312	0.0000	2	8	6	0.63815084	-0.3287273	1.087532e-01
job	4	247	4.578947e+00	2.199603e+00	4	4.422111	1.4826	1	10	9	0.68090959	-0.1651951	1.399574e-01
mincome	5	247	3.757085e+00	1.674079e+00	4	3.819095	1.4826	1	6	5	-0.10186401	-1.2266191	1.065191e-01
aware	6	247	3.319838e+00	5.575692e+00	2	1.924623	0.0000	1	31	30	3.98663626	15.7864331	3.547728e-01
count	7	247	4.327935e+00	4.422061e+00	3	3.492462	2.9652	1	36	35	3.08793674	13.5854742	2.813690e-01
amount	8	247	1.539393e+05	3.980750e+05	52000	83798.994975	47443.2000	3000	5000000	4997000	8.62153257	92.2401960	2.532891e+04
decision	9	247	2.388664e+00	7.615994e-01	3	2.482412	0.0000	1	3	2	-0.77786381	-0.8701841	4.845941e-02
propensity	10	247	1.975709e+00	6.803103e-01	2	1.969849	0.0000	1	3	2	0.02958183	-0.8487310	4.328711e-02
skin	11	247	2.761134e+00	1.488311e+00	3	2.703518	2.9652	1	5	4	0.15331957	-1.3373908	9.469894e-02
promo	12	247	2.016194e+00	8.212998e-01	2	1.919598	0.0000	1	4	3	0.84726856	0.5434697	5.225806e-02
location	13	247	2.465587e+00	1.073437e+00	3	2.371859	1.4826	1	5	4	0.55038656	0.1943319	6.830114e-02
satisf_b	14	247	2.890688e+00	7.809953e-01	3	2.869347	0.0000	1	5	4	0.14047539	0.3237416	4.969354e-02
satisf_i	15	247	3.404858e+00	8.301096e-01	3	3.482412	1.4826	1	5	4	-0.69559430	0.9204758	5.281861e-02
satisf_al	16	247	3.461538e+00	7.527311e-01	4	3.512563	1.4826	1	5	4	-0.98037384	2.1617488	4.789513e-02
repurchase	17	247	3.554656e+00	7.241820e-01	4	3.633166	0.0000	1	5	4	-1.27727971	2.5541785	4.607860e-02
eduM*	18	247	2.170040e+00	6.209693e-01	2	2.211055	0.0000	1	3	2	-0.12856235	-0.5355836	3.951133e-02

summary(amount)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3000   30000   52000  153939  100000 5000000

aggregate(amount~gender, a, mean)

gender	amount
male	127757.6
female	183991.3

tapply(amount,gender, mean)

            male           female
127757.575757576 183991.304347826

hist(amount)

[Figure: histogram of amount]

boxplot(amount) # check for outliers

[Figure: box plot of amount]

qplot(amount,geom='histogram')

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: histogram of amount, default 30 bins]

qplot(amount,geom='histogram',bins=50)

[Figure: histogram of amount, 50 bins]

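qplot(amount,geom='histogram',bandwidth=50) # 'bandwidth' is not a stat_bin() parameter and is ignored; the argument is named binwidth

Warning message:
"Ignoring unknown parameters: bandwidth"
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: histogram of amount, default 30 bins]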

qplot(amount,geom='histogram',main='Histogram for amount')

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: histogram titled 'Histogram for amount']

Removing outliers

Detecting and removing outliers with the IQR rule

An outlier is a value smaller than Q1 - 1.5 x IQR or larger than Q3 + 1.5 x IQR.
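The rule can be sketched on a small made-up vector before applying it to real data. The values in `x` and the names `lower`, `upper` and `is_outlier` are illustrative assumptions.

```r
x <- c(10, 12, 13, 15, 16, 18, 20, 95)  # made-up values; 95 is an obvious outlier

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1                          # same value as IQR(x)

lower <- q1 - 1.5 * iqr                 # lower fence
upper <- q3 + 1.5 * iqr                 # upper fence

is_outlier <- x < lower | x > upper
x[is_outlier]                           # values flagged as outliers

x_clean <- ifelse(is_outlier, NA, x)    # replace outliers with NA instead of dropping rows
mean(x_clean, na.rm = TRUE)             # summary statistics must then skip the NAs
```

Replacing outliers with NA keeps the other columns of a data frame intact, at the cost of having to pass `na.rm = TRUE` to later summaries.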

upQuan <- quantile(amount)[4]

loQuan <- quantile(amount)[2]

IQR=upQuan-loQuan

IQR

75%: 70000

a$amount<-ifelse(amount>upQuan+IQR*1.5 | amount <loQuan-IQR*1.5, NA, a$amount)

describe(a) # only amount now has n = 217, because the 30 outliers were set to NA
# after removing the outliers, the mean of amount drops to 71119.815668

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
gender*	1	247	1.465587	4.998272e-01	1	1.457286	0.0000	1	2	1	0.13714193	-1.9891964	3.180324e-02
marriage	2	247	1.712551	4.534918e-01	2	1.763819	0.0000	1	2	1	-0.93360038	-1.1329280	2.885499e-02
edu	3	247	4.566802	1.709191e+00	4	4.462312	0.0000	2	8	6	0.63815084	-0.3287273	1.087532e-01
job	4	247	4.578947	2.199603e+00	4	4.422111	1.4826	1	10	9	0.68090959	-0.1651951	1.399574e-01
mincome	5	247	3.757085	1.674079e+00	4	3.819095	1.4826	1	6	5	-0.10186401	-1.2266191	1.065191e-01
aware	6	247	3.319838	5.575692e+00	2	1.924623	0.0000	1	31	30	3.98663626	15.7864331	3.547728e-01
count	7	247	4.327935	4.422061e+00	3	3.492462	2.9652	1	36	35	3.08793674	13.5854742	2.813690e-01
amount	8	217	71119.815668	5.355038e+04	50000	63062.857143	44478.0000	3000	200000	197000	1.16454289	0.5555427	3.635237e+03
decision	9	247	2.388664	7.615994e-01	3	2.482412	0.0000	1	3	2	-0.77786381	-0.8701841	4.845941e-02
propensity	10	247	1.975709	6.803103e-01	2	1.969849	0.0000	1	3	2	0.02958183	-0.8487310	4.328711e-02
skin	11	247	2.761134	1.488311e+00	3	2.703518	2.9652	1	5	4	0.15331957	-1.3373908	9.469894e-02
promo	12	247	2.016194	8.212998e-01	2	1.919598	0.0000	1	4	3	0.84726856	0.5434697	5.225806e-02
location	13	247	2.465587	1.073437e+00	3	2.371859	1.4826	1	5	4	0.55038656	0.1943319	6.830114e-02
satisf_b	14	247	2.890688	7.809953e-01	3	2.869347	0.0000	1	5	4	0.14047539	0.3237416	4.969354e-02
satisf_i	15	247	3.404858	8.301096e-01	3	3.482412	1.4826	1	5	4	-0.69559430	0.9204758	5.281861e-02
satisf_al	16	247	3.461538	7.527311e-01	4	3.512563	1.4826	1	5	4	-0.98037384	2.1617488	4.789513e-02
repurchase	17	247	3.554656	7.241820e-01	4	3.633166	0.0000	1	5	4	-1.27727971	2.5541785	4.607860e-02
eduM*	18	247	2.170040	6.209693e-01	2	2.211055	0.0000	1	3	2	-0.12856235	-0.5355836	3.951133e-02

a$amount<-ifelse(amount>100000, NA, a$amount) # additionally set amounts above 100000 to NA

describe(a)

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
gender*	1	247	1.465587	4.998272e-01	1	1.457286	0.0000	1	2	1	0.13714193	-1.9891964	3.180324e-02
marriage	2	247	1.712551	4.534918e-01	2	1.763819	0.0000	1	2	1	-0.93360038	-1.1329280	2.885499e-02
edu	3	247	4.566802	1.709191e+00	4	4.462312	0.0000	2	8	6	0.63815084	-0.3287273	1.087532e-01
job	4	247	4.578947	2.199603e+00	4	4.422111	1.4826	1	10	9	0.68090959	-0.1651951	1.399574e-01
mincome	5	247	3.757085	1.674079e+00	4	3.819095	1.4826	1	6	5	-0.10186401	-1.2266191	1.065191e-01
aware	6	247	3.319838	5.575692e+00	2	1.924623	0.0000	1	31	30	3.98663626	15.7864331	3.547728e-01
count	7	247	4.327935	4.422061e+00	3	3.492462	2.9652	1	36	35	3.08793674	13.5854742	2.813690e-01
amount	8	186	53134.408602	3.040054e+04	50000	52640.000000	29652.0000	3000	100000	97000	0.39814918	-1.1001380	2.229076e+03
decision	9	247	2.388664	7.615994e-01	3	2.482412	0.0000	1	3	2	-0.77786381	-0.8701841	4.845941e-02
propensity	10	247	1.975709	6.803103e-01	2	1.969849	0.0000	1	3	2	0.02958183	-0.8487310	4.328711e-02
skin	11	247	2.761134	1.488311e+00	3	2.703518	2.9652	1	5	4	0.15331957	-1.3373908	9.469894e-02
promo	12	247	2.016194	8.212998e-01	2	1.919598	0.0000	1	4	3	0.84726856	0.5434697	5.225806e-02
location	13	247	2.465587	1.073437e+00	3	2.371859	1.4826	1	5	4	0.55038656	0.1943319	6.830114e-02
satisf_b	14	247	2.890688	7.809953e-01	3	2.869347	0.0000	1	5	4	0.14047539	0.3237416	4.969354e-02
satisf_i	15	247	3.404858	8.301096e-01	3	3.482412	1.4826	1	5	4	-0.69559430	0.9204758	5.281861e-02
satisf_al	16	247	3.461538	7.527311e-01	4	3.512563	1.4826	1	5	4	-0.98037384	2.1617488	4.789513e-02
repurchase	17	247	3.554656	7.241820e-01	4	3.633166	0.0000	1	5	4	-1.27727971	2.5541785	4.607860e-02
eduM*	18	247	2.170040	6.209693e-01	2	2.211055	0.0000	1	3	2	-0.12856235	-0.5355836	3.951133e-02

qplot(a$amount,geom='histogram',main='Histogram for amount')

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
"Removed 61 rows containing non-finite values (stat_bin)."

[Figure: histogram of amount after outlier removal, titled 'Histogram for amount']

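Standard error (S.E.)

$SE = s / \sqrt{n}$

s: homogeneity (a small standard deviation means the sample values are similar to one another, which suggests the population is homogeneous too, and it makes the standard error small)

$\sqrt{n}$: representativeness (the larger the sample, the better it reflects the population, so larger samples have smaller standard errors)

Example 1.

In a sample of 100 female college students, the mean number of dating partners is 4 and the standard deviation is 5. What is the standard error?

Standard error: $5 / \sqrt{100} = 0.5$
Interpretation: the sample mean typically differs from the population mean by about 0.5, so the population mean is likely to lie between roughly 3.5 and 4.5.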
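The arithmetic of Example 1 can be checked in R. The numbers come from the example itself; the helper `std_error()` is a hypothetical convenience function for raw samples, not something used elsewhere in this post.

```r
s    <- 5     # sample standard deviation from the example
n    <- 100   # sample size from the example
xbar <- 4     # sample mean from the example

se <- s / sqrt(n)
se                        # 0.5
c(xbar - se, xbar + se)   # 3.5 4.5: the mean plus/minus one standard error

# For a raw sample, the same quantity is sd(x) / sqrt(length(x))
std_error <- function(x) sd(x) / sqrt(length(x))
```

Quadrupling the sample size to 400 would halve the standard error to 0.25, which is the sense in which $\sqrt{n}$ captures representativeness.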

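Test statistic

The test statistic is a value that measures the difference between the null hypothesis and what was observed (the data).
Chi-square test

A small test statistic means that the difference between the null hypothesis and the collected data is small.
If the difference is small enough to be consistent with the null hypothesis at the chosen confidence level, the null hypothesis is not rejected.
If the p-value (significance probability) is below 0.05, the null hypothesis is rejected in favor of the alternative hypothesis.
Test statistic = difference / error
The larger the test statistic, the more likely it is that the alternative hypothesis is accepted.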
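For the chi-square test, "difference / error" takes the form of the standard Pearson statistic, $\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, with $O_{ij}$ the observed and $E_{ij}$ the expected count of each cell. The sketch below computes it by hand from the propensity-by-skin counts reported later in this post; the variable names are illustrative.

```r
# Observed counts: purchase propensity (rows) by skin type (columns),
# copied from the table(propensity, skin) output in this post
obs <- matrix(c(22,  1, 12, 11, 14,
                39,  8, 47, 16, 23,
                20, 11, 10,  4,  9),
              nrow = 3, byrow = TRUE)

expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)  # counts under independence
chi2 <- sum((obs - expected)^2 / expected)       # Pearson chi-square statistic
df   <- (nrow(obs) - 1) * (ncol(obs) - 1)        # degrees of freedom
p    <- pchisq(chi2, df, lower.tail = FALSE)     # upper-tail p-value

c(chi2 = chi2, df = df, p = p)
```

The result should agree with the chisq.test() output reported in this post (X-squared = 24.275, df = 8, p-value = 0.002061).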

Cross-tabulation analysis in practice

a

gender	marriage	edu	job	mincome	aware	count	amount	decision	propensity	skin	promo	location	satisf_b	satisf_i	satisf_al	repurchase	eduM
male	1	4	1	2	2	1	11000	2	1	1	1	2	5	2	2	2	고졸이하
female	1	4	9	2	1	4	30000	1	1	3	2	3	2	3	3	4	고졸이하
female	2	4	4	3	1	6	100000	3	2	3	2	2	4	5	4	4	고졸이하
female	2	4	7	5	2	6	65000	3	2	5	2	3	3	4	4	4	고졸이하
male	2	6	6	5	2	2	50000	2	2	3	2	3	3	3	3	3	대졸이상
female	2	2	7	3	1	2	100000	2	1	4	2	3	3	4	4	3	중졸이하
female	1	6	4	5	1	5	100000	3	2	5	2	3	2	2	3	4	대졸이상
male	1	6	4	5	4	10	39000	3	2	2	1	2	4	4	4	4	대졸이상
female	2	4	5	2	2	2	40000	3	2	3	2	3	3	4	4	4	고졸이하
female	2	4	5	2	1	2	100000	3	3	3	1	3	2	3	4	4	고졸이하
female	1	7	4	3	10	3	50000	1	3	1	2	3	3	3	4	4	대졸이상
male	1	2	5	3	2	1	30000	3	2	3	2	2	3	3	3	3	중졸이하
female	2	4	4	3	4	4	NA	2	3	3	3	2	4	4	4	4	고졸이하
female	2	4	4	2	3	2	NA	1	2	3	1	3	3	3	3	3	고졸이하
male	2	4	4	6	2	2	60000	3	2	1	2	5	3	3	3	4	고졸이하
female	1	4	5	2	2	3	50000	1	2	4	1	3	3	4	3	3	고졸이하
male	2	8	3	2	5	3	NA	1	3	1	2	2	3	3	3	3	대졸이상
female	1	3	8	5	1	6	NA	3	3	2	4	1	4	4	4	4	고졸이하
male	2	2	6	2	4	1	80000	2	3	1	2	3	3	3	4	4	중졸이하
male	1	4	4	3	8	3	30000	2	2	3	2	3	3	3	3	3	고졸이하
female	2	2	4	2	8	4	NA	3	2	3	2	2	3	4	4	4	중졸이하
female	2	4	7	6	1	4	NA	3	3	2	2	3	2	3	4	4	고졸이하
female	2	4	7	3	1	25	50000	2	2	1	2	3	3	4	4	4	고졸이하
female	2	2	9	1	1	1	20000	1	1	5	1	3	3	3	3	3	중졸이하
male	1	3	8	4	2	3	42000	1	2	3	1	3	3	3	4	4	고졸이하
male	1	4	8	4	2	3	42000	3	3	2	1	3	3	4	4	4	고졸이하
female	1	4	4	3	2	20	40000	3	1	5	2	3	2	4	4	4	고졸이하
female	2	4	4	6	1	6	70000	3	3	5	2	1	3	4	4	4	고졸이하
female	2	8	4	5	5	6	NA	2	1	4	1	1	4	4	4	3	대졸이상
male	2	4	2	6	2	1	NA	3	2	1	2	2	3	4	4	4	고졸이하
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
female	2	4	5	2	1	10	30000	2	2	2	4	3	3	4	4	4	고졸이하
male	2	4	6	6	7	5	50000	3	2	4	2	2	3	4	4	4	고졸이하
male	2	4	1	4	1	1	10000	1	3	1	1	1	5	1	3	1	고졸이하
male	1	4	4	3	2	1	10000	1	1	4	3	2	2	3	3	3	고졸이하
male	1	4	4	3	2	3	50000	3	2	5	2	3	2	4	4	4	고졸이하
male	2	2	5	4	2	1	60000	1	1	1	4	5	3	3	3	3	중졸이하
male	1	6	1	3	2	2	50000	3	2	5	1	2	4	4	3	4	대졸이상
female	2	6	4	6	1	3	NA	3	3	5	2	1	3	4	4	4	대졸이상
male	2	4	1	3	2	1	50000	3	2	1	2	3	2	3	3	3	고졸이하
female	2	6	3	4	1	1	100000	3	2	1	2	1	3	3	3	3	대졸이상
female	2	4	7	4	1	2	50000	3	1	5	2	2	3	2	3	4	고졸이하
female	2	4	4	2	2	12	20000	2	2	3	2	3	3	3	3	3	고졸이하
male	1	4	4	4	2	4	35000	2	2	1	3	2	3	3	4	4	고졸이하
male	1	4	4	4	2	4	30000	2	2	1	2	3	3	4	4	3	고졸이하
female	2	4	7	1	2	2	50000	1	1	3	2	5	3	4	4	4	고졸이하
female	2	4	7	1	2	3	50000	1	2	3	2	5	3	3	3	3	고졸이하
female	1	4	4	2	2	7	80000	3	2	3	2	2	3	3	3	3	고졸이하
male	1	4	1	3	2	6	20000	3	1	3	1	3	2	3	4	4	고졸이하
male	1	4	10	2	2	2	25000	3	1	3	2	3	3	4	4	4	고졸이하
female	1	3	8	1	1	7	100000	2	1	5	2	3	3	3	1	2	고졸이하
male	2	4	4	3	2	2	50000	1	2	5	2	2	3	4	3	3	고졸이하
male	2	4	4	5	2	1	80000	2	2	1	2	3	3	3	3	3	고졸이하
female	2	6	7	5	1	2	NA	3	3	2	3	2	3	4	3	3	대졸이상
female	2	7	7	4	1	2	NA	2	2	4	3	2	3	3	3	3	대졸이상
male	1	2	1	2	2	5	3000	1	1	2	1	1	1	1	1	1	중졸이하
male	1	4	2	3	2	6	4000	1	1	1	1	4	2	1	1	1	고졸이하
female	2	4	4	2	1	10	NA	3	2	2	1	2	3	4	4	4	고졸이하
female	2	7	8	1	2	3	100000	1	2	1	1	5	2	5	4	4	대졸이상
male	1	4	6	1	3	2	20000	3	1	1	1	3	4	3	3	2	고졸이하
female	2	6	10	1	1	10	NA	3	2	3	1	3	2	3	3	3	대졸이상

table(propensity,skin) # purchase propensity by skin type; counts only (for proportions, use Excel)

          skin
propensity  1  2  3  4  5
         1 22  1 12 11 14
         2 39  8 47 16 23
         3 20 11 10  4  9
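Proportions can also be obtained without leaving R, using `prop.table()`. The sketch below rebuilds the table from the counts printed above so that it runs on its own; with the real data, `table(propensity, skin)` could presumably be passed in directly.

```r
# Counts copied from the table(propensity, skin) output above
cross <- as.table(matrix(c(22,  1, 12, 11, 14,
                           39,  8, 47, 16, 23,
                           20, 11, 10,  4,  9),
                         nrow = 3, byrow = TRUE,
                         dimnames = list(propensity = as.character(1:3),
                                         skin = as.character(1:5))))

round(prop.table(cross), 3)              # share of the whole table
round(prop.table(cross, margin = 1), 3)  # row proportions (within each propensity level)
round(prop.table(cross, margin = 2), 3)  # column proportions (within each skin type)
addmargins(cross)                        # counts with row and column totals
```

Row proportions answer "how are skin types distributed within each propensity group", while column proportions answer the reverse question.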

chisq.test(propensity, skin, correct = F) # test statistic

Warning message in chisq.test(propensity, skin, correct = F):
"Chi-squared approximation may be incorrect"

	Pearson's Chi-squared test

data:  propensity and skin
X-squared = 24.275, df = 8, p-value = 0.002061

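Since the p-value is 0.002061, which is below 0.05, the null hypothesis is rejected: purchase propensity and skin type appear to be associated. The warning above indicates that some expected counts are small, so the chi-square approximation should be read with some caution.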
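The "Chi-squared approximation may be incorrect" warning is raised when some expected counts fall below 5. A minimal sketch for inspecting those counts, with a simulated p-value as one possible alternative; the matrix is rebuilt from the counts printed above, and the seed and the number of replicates `B` are arbitrary choices.

```r
# Counts copied from the table(propensity, skin) output above
cross <- matrix(c(22,  1, 12, 11, 14,
                  39,  8, 47, 16, 23,
                  20, 11, 10,  4,  9),
                nrow = 3, byrow = TRUE)

res <- suppressWarnings(chisq.test(cross, correct = FALSE))
round(res$expected, 1)     # expected counts under independence
sum(res$expected < 5)      # number of cells with an expected count below 5

# One option: a Monte Carlo p-value, which does not rely on the large-sample approximation
set.seed(123)
chisq.test(cross, simulate.p.value = TRUE, B = 10000)
```

Merging sparse categories before testing is another common way to deal with small expected counts.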

# Does awareness differ by gender?

table(gender,aware)

        aware
gender    1  2  3  4  5  6  7  8 10 14 15 19 28 29 31
  male   18 83  3 10  3  0  1  2  0  3  2  0  1  0  6
  female 59 43  2  2  2  1  0  2  1  1  0  1  0  1  0

chisq.test(gender,aware,correct = F)

Warning message in chisq.test(gender, aware, correct = F):
"Chi-squared approximation may be incorrect"

	Pearson's Chi-squared test

data:  gender and aware
X-squared = 54.35, df = 14, p-value = 1.119e-06

install.packages('gmodels')

package 'gmodels' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\MyCom\AppData\Local\Temp\Rtmp8OPV9d\downloaded_packages

library(gmodels)

Warning message:
"package 'gmodels' was built under R version 3.6.3"
Attaching package: 'gmodels'

The following object is masked from 'package:descr':

    CrossTable

CrossTable(x=propensity, y= skin, chisq=T) # also reports the chi-square test

Warning message in chisq.test(t, correct = FALSE, ...):
"Chi-squared approximation may be incorrect"



   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|


Total Observations in Table:  247


             | skin
  propensity |         1 |         2 |         3 |         4 |         5 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
           1 |        22 |         1 |        12 |        11 |        14 |        60 |
             |     0.274 |     3.064 |     1.352 |     1.599 |     0.715 |           |
             |     0.367 |     0.017 |     0.200 |     0.183 |     0.233 |     0.243 |
             |     0.272 |     0.050 |     0.174 |     0.355 |     0.304 |           |
             |     0.089 |     0.004 |     0.049 |     0.045 |     0.057 |           |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
           2 |        39 |         8 |        47 |        16 |        23 |       133 |
             |     0.488 |     0.712 |     2.609 |     0.029 |     0.126 |           |
             |     0.293 |     0.060 |     0.353 |     0.120 |     0.173 |     0.538 |
             |     0.481 |     0.400 |     0.681 |     0.516 |     0.500 |           |
             |     0.158 |     0.032 |     0.190 |     0.065 |     0.093 |           |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
           3 |        20 |        11 |        10 |         4 |         9 |        54 |
             |     0.297 |    10.046 |     1.714 |     1.138 |     0.111 |           |
             |     0.370 |     0.204 |     0.185 |     0.074 |     0.167 |     0.219 |
             |     0.247 |     0.550 |     0.145 |     0.129 |     0.196 |           |
             |     0.081 |     0.045 |     0.040 |     0.016 |     0.036 |           |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
Column Total |        81 |        20 |        69 |        31 |        46 |       247 |
             |     0.328 |     0.081 |     0.279 |     0.126 |     0.186 |           |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|


Statistics for All Table Factors


Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 =  24.27468     d.f. =  8     p =  0.002060876

cross1<-table(propensity, skin)

cross1

          skin
propensity  1  2  3  4  5
         1 22  1 12 11 14
         2 39  8 47 16 23
         3 20 11 10  4  9

barplot(as.matrix(cross1))

[Figure: stacked bar plot of propensity counts by skin type]


SeongJae Yoo

Analysis with categorical and continuous data, and cross-tabulation analysis
