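Data Scaling and Transforming Categorical Features

Data Scaling

  • Scaling means transforming every feature by the same rule so that all features end up on a common scale. When continuous features are measured in different units, the difference in magnitude alone can make the estimated parameters too large or too small.

  • Scaling the feature variables is an important preprocessing step before machine learning or deep learning.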

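  1. Min-Max Scaling
  • One of the most widely used scaling methods in machine learning and deep learning
  • Subtract the feature's minimum from each value, then divide by (maximum - minimum)
  • Every value then falls between 0 and 1, so no value is negative
  • Sensitive to outliers: a single extreme value sets the minimum or maximum and compresses all the other values into a narrow range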
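
As a minimal sketch of the min-max formula, the calculation fits in a few lines of base R. The helper name `min_max` and the sample values are illustrative assumptions, not part of the original exercise.

```r
# Min-max scaling: (x - min) / (max - min), so results fall in [0, 1].
# `min_max` is a hypothetical helper name used only for this sketch.
min_max <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Illustrative values only (not the full dataset).
income <- c(3, 3, 2, 1, 2, 2)
scaled <- min_max(income)
print(scaled)

# Adding one extreme value squeezes the original values toward 0.
with_outlier <- min_max(c(income, 100))
print(round(with_outlier, 3))
```

The second call shows the outlier problem: once 100 becomes the maximum, the six original values all land very close to 0.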


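  2. Standardization
  • The standard statistical way to rescale data: each feature is transformed to have mean 0 and standard deviation 1
  • Why standardize?

Models such as Support Vector Machine, Linear Regression, and Logistic Regression are often described as assuming Gaussian-distributed data. Strictly speaking they do not require Gaussian features, but fitting them is often sensitive to feature scale (for example through distance-based margins, regularization penalties, or gradient-based optimization). Standardizing the training data beforehand can therefore matter for predictive performance.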
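
Standardization can be sketched in base R as follows. The helper name `standardize` and the sample values are assumptions for illustration; base R's `scale()` performs the same calculation.

```r
# Standardization (z-score): subtract the mean, divide by the standard deviation.
# `standardize` is a hypothetical helper name used only for this sketch.
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

# Illustrative values only (not the full dataset).
score_intention <- c(4.0, 3.0, 2.8, 2.6, 2.4, 3.8)
z <- standardize(score_intention)

print(mean(z))  # numerically zero
print(sd(z))    # one

# scale() centers and scales the same way (it also uses the n - 1 denominator).
z_builtin <- as.numeric(scale(score_intention))
print(all.equal(z, z_builtin))
```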

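One-Hot Encoding of Categorical Features

  • Categorical features need to be one-hot encoded before most machine learning models can use them

  • Each category becomes a dummy variable that takes the value 0 or 1

Categorical columns are converted by one-hot encoding into columns of 0s and 1s.

Continuous columns are converted by Min-Max Scaling or standardization.

  • Try both Min-Max Scaling and standardization, then keep whichever gives the more accurate model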
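
To make the dummy-variable idea concrete, here is a small base R sketch that one-hot encodes a single factor with `model.matrix()`. The toy data frame is an assumption for illustration; the walkthrough itself uses caret's `dummyVars`.

```r
# A toy categorical column (illustrative values only).
df <- data.frame(
  region = factor(c("Sudo", "Honam", "Sudo", "Others"),
                  levels = c("Sudo", "Honam", "Others"))
)

# Dropping the intercept (- 1) gives one 0/1 column per level.
one_hot <- model.matrix(~ region - 1, data = df)
print(one_hot)

# Each row has exactly one 1, in the column for that row's category.
print(rowSums(one_hot))
```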

Practice

1. Load the data & separate the categorical / continuous / label columns

data <- read.csv("data/vote.csv", header = TRUE)
head(data)
gender region edu income age score_gov score_progress score_intention vote parties
1 4 3 3 3 2 2 4.0 1 2
1 5 2 3 3 2 4 3.0 0 3
1 3 1 2 4 1 3 2.8 1 4
2 1 2 1 3 5 4 2.6 1 1
1 1 1 2 4 4 3 2.4 1 1
1 1 1 2 4 1 4 3.8 1 2

data
gender region edu income age score_gov score_progress score_intention vote parties
1 4 3 3 3 2 2 4.0 1 2
1 5 2 3 3 2 4 3.0 0 3
1 3 1 2 4 1 3 2.8 1 4
2 1 2 1 3 5 4 2.6 1 1
1 1 1 2 4 4 3 2.4 1 1
1 1 1 2 4 1 4 3.8 1 2
1 1 1 2 4 4 4 2.0 1 1
1 5 2 4 4 3 4 3.6 1 3
1 2 1 2 4 2 2 2.0 0 2
1 1 1 2 3 4 2 3.0 1 1
1 1 1 2 3 2 4 2.2 0 2
2 4 1 1 3 3 2 2.6 1 1
1 5 1 2 4 3 2 3.0 1 1
1 2 2 4 4 3 3 2.4 1 3
1 4 3 4 3 3 4 3.6 1 3
1 1 2 3 3 3 3 3.2 1 4
1 5 2 4 3 4 3 4.0 1 4
2 1 2 2 3 5 4 2.6 1 1
2 3 1 2 2 3 2 3.0 0 4
2 5 3 4 3 3 3 3.0 1 1
1 1 1 1 2 3 3 2.0 0 2
2 1 3 2 2 3 4 2.2 1 4
2 4 2 2 2 3 3 1.4 1 4
1 1 1 3 3 3 4 1.6 1 4
1 4 2 3 3 3 2 3.6 1 2
1 1 2 2 2 3 2 3.2 1 4
2 1 2 3 2 3 4 2.8 1 1
2 1 2 4 3 4 4 3.0 1 1
1 1 1 3 3 1 2 3.0 1 2
2 1 2 4 4 3 3 2.2 1 3
..............................
1 1 2 1 2 3 5 2.2 1 1
1 2 2 1 2 4 3 3.4 1 4
2 1 1 1 1 4 2 3.4 1 4
1 4 1 1 2 1 4 2.8 1 4
2 1 1 1 1 3 4 2.8 1 2
2 1 1 1 2 2 4 3.0 0 4
1 5 2 2 3 4 3 3.4 0 1
1 1 2 1 1 3 3 2.8 1 4
1 1 1 1 2 3 3 3.0 1 4
1 5 1 1 1 2 2 2.6 0 2
2 1 2 1 3 4 4 4.4 1 4
2 2 1 1 1 3 4 2.8 0 4
1 5 1 1 2 3 4 3.4 1 4
1 2 2 1 2 2 1 2.2 1 2
1 5 1 1 1 3 3 3.0 0 4
2 1 2 2 3 2 4 3.0 1 4
2 1 1 1 3 2 3 3.0 1 4
1 1 2 1 2 4 4 5.0 1 1
1 1 1 1 2 3 2 2.2 1 4
2 1 2 1 2 4 3 3.0 1 1
2 1 2 1 2 4 4 3.6 0 1
1 1 2 1 2 3 3 3.4 1 1
2 1 2 2 2 3 3 3.6 1 4
2 1 1 1 1 5 4 3.2 0 1
2 1 1 3 4 3 2 1.0 1 2
1 4 1 4 4 3 3 1.8 1 2
1 1 2 1 2 3 4 2.6 1 4
1 2 2 1 2 3 3 2.6 1 2
1 1 2 3 4 3 2 4.0 1 4
2 1 2 2 2 3 3 3.8 1 2

Separating the categorical columns

data_cat <- subset(data, select = c(gender, region))
data_cat
gender region
1 4
1 5
1 3
2 1
1 1
1 1
1 1
1 5
1 2
1 1
1 1
2 4
1 5
1 2
1 4
1 1
1 5
2 1
2 3
2 5
1 1
2 1
2 4
1 1
1 4
1 1
2 1
2 1
1 1
2 1
......
1 1
1 2
2 1
1 4
2 1
2 1
1 5
1 1
1 1
1 5
2 1
2 2
1 5
1 2
1 5
2 1
2 1
1 1
1 1
2 1
2 1
1 1
2 1
2 1
2 1
1 4
1 1
1 2
1 1
2 1

Separating the continuous columns

data_num <- subset(data, select = c(edu, income, age, score_gov, score_progress, score_intention))
data_num
edu income age score_gov score_progress score_intention
3 3 3 2 2 4.0
2 3 3 2 4 3.0
1 2 4 1 3 2.8
2 1 3 5 4 2.6
1 2 4 4 3 2.4
1 2 4 1 4 3.8
1 2 4 4 4 2.0
2 4 4 3 4 3.6
1 2 4 2 2 2.0
1 2 3 4 2 3.0
1 2 3 2 4 2.2
1 1 3 3 2 2.6
1 2 4 3 2 3.0
2 4 4 3 3 2.4
3 4 3 3 4 3.6
2 3 3 3 3 3.2
2 4 3 4 3 4.0
2 2 3 5 4 2.6
1 2 2 3 2 3.0
3 4 3 3 3 3.0
1 1 2 3 3 2.0
3 2 2 3 4 2.2
2 2 2 3 3 1.4
1 3 3 3 4 1.6
2 3 3 3 2 3.6
2 2 2 3 2 3.2
2 3 2 3 4 2.8
2 4 3 4 4 3.0
1 3 3 1 2 3.0
2 4 4 3 3 2.2
..................
2 1 2 3 5 2.2
2 1 2 4 3 3.4
1 1 1 4 2 3.4
1 1 2 1 4 2.8
1 1 1 3 4 2.8
1 1 2 2 4 3.0
2 2 3 4 3 3.4
2 1 1 3 3 2.8
1 1 2 3 3 3.0
1 1 1 2 2 2.6
2 1 3 4 4 4.4
1 1 1 3 4 2.8
1 1 2 3 4 3.4
2 1 2 2 1 2.2
1 1 1 3 3 3.0
2 2 3 2 4 3.0
1 1 3 2 3 3.0
2 1 2 4 4 5.0
1 1 2 3 2 2.2
2 1 2 4 3 3.0
2 1 2 4 4 3.6
2 1 2 3 3 3.4
2 2 2 3 3 3.6
1 1 1 5 4 3.2
1 3 4 3 2 1.0
1 4 4 3 3 1.8
2 1 2 3 4 2.6
2 1 2 3 3 2.6
2 3 4 3 2 4.0
2 2 2 3 3 3.8

Separating the label columns

data_class <- subset(data, select = c(vote, parties))
data_class
vote parties
1 2
0 3
1 4
1 1
1 1
1 2
1 1
1 3
0 2
1 1
0 2
1 1
1 1
1 3
1 3
1 4
1 4
1 1
0 4
1 1
0 2
1 4
1 4
1 4
1 2
1 4
1 1
1 1
1 2
1 3
......
1 1
1 4
1 4
1 4
1 2
0 4
0 1
1 4
1 4
0 2
1 4
0 4
1 4
1 2
0 4
1 4
1 4
1 1
1 4
1 1
0 1
1 1
1 4
0 1
1 2
1 2
1 4
1 2
1 4
1 2

2. One-hot encoding of the categorical features

data_cat$gender <- factor(data_cat$gender, labels = c("male", "female"))
str(data_cat)
'data.frame':	211 obs. of  2 variables:
 $ gender: Factor w/ 2 levels "male","female": 1 1 1 2 1 1 1 1 1 1 ...
 $ region: int  4 5 3 1 1 1 1 5 2 1 ...

data_cat$region <- factor(data_cat$region, labels = c('Sudo', 'Chungcheung', 'Honam', 'Youngnam', 'Others'))
str(data_cat)
'data.frame':	211 obs. of  2 variables:
 $ gender: Factor w/ 2 levels "male","female": 1 1 1 2 1 1 1 1 1 1 ...
 $ region: Factor w/ 5 levels "Sudo","Chungcheung",..: 4 5 3 1 1 1 1 5 2 1 ...

data_cat
gender region
male Youngnam
male Others
male Honam
female Sudo
male Sudo
male Sudo
male Sudo
male Others
male Chungcheung
male Sudo
male Sudo
female Youngnam
male Others
male Chungcheung
male Youngnam
male Sudo
male Others
female Sudo
female Honam
female Others
male Sudo
female Sudo
female Youngnam
male Sudo
male Youngnam
male Sudo
female Sudo
female Sudo
male Sudo
female Sudo
......
male Sudo
male Chungcheung
female Sudo
male Youngnam
female Sudo
female Sudo
male Others
male Sudo
male Sudo
male Others
female Sudo
female Chungcheung
male Others
male Chungcheung
male Others
female Sudo
female Sudo
male Sudo
male Sudo
female Sudo
female Sudo
male Sudo
female Sudo
female Sudo
female Sudo
male Youngnam
male Sudo
male Chungcheung
male Sudo
female Sudo
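
The `factor()` calls above rely on labels being assigned to the codes in sorted order. A minimal base R sketch with toy codes (assumed here for illustration) makes the mapping explicit by also passing `levels`:

```r
# Toy region codes (illustrative values only).
region_code <- c(4, 5, 3, 1, 1, 2)

# Naming the levels explicitly documents which code gets which label,
# and keeps the mapping stable even if a code is absent from the data.
region <- factor(
  region_code,
  levels = 1:5,
  labels = c("Sudo", "Chungcheung", "Honam", "Youngnam", "Others")
)

print(as.character(region))
print(levels(region))
```

Without `levels`, `factor()` derives the levels from the sorted unique values, so a missing code would shift the labels or raise a length error.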

install.packages("caret")
library(caret)

  There is a binary version available but the source version is later:
      binary source needs_compilation
caret 6.0-86 6.0-90              TRUE

  Binaries will be installed
package 'caret' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\MyCom\AppData\Local\Temp\RtmpEDGEx8\downloaded_packages


Warning message:
"package 'caret' was built under R version 3.6.3"
Loading required package: lattice
Loading required package: ggplot2

one_hot <- dummyVars(" ~ .", data = data_cat)
one_hot
Dummy Variable Object

Formula: ~.
<environment: 0x00000000691091c0>
2 variables, 2 factors
Variables and levels will be separated by '.'
A less than full rank encoding is used
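
The message "A less than full rank encoding is used" means every level gets its own column, so the dummy columns for one factor always sum to 1. The base R sketch below (toy data, not caret) contrasts that with a full-rank encoding, which drops one level as the reference. In caret, the corresponding switch is the `fullRank` argument of `dummyVars`.

```r
# Toy factor (illustrative values only).
gender <- factor(c("male", "male", "female", "male"),
                 levels = c("male", "female"))

# One column per level: the "less than full rank" style.
all_levels <- model.matrix(~ gender - 1)
print(colnames(all_levels))

# Full rank: an intercept plus one column, with "male" as the reference level.
full_rank <- model.matrix(~ gender)
print(colnames(full_rank))
```

Keeping every level is convenient for tree-based or neural models; linear models with an intercept usually want the full-rank form to avoid perfectly collinear columns.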

Converting the result to a data frame

data_cat2 <- data.frame(predict(one_hot, newdata = data_cat))
head(data_cat2)
gender.male gender.female region.Sudo region.Chungcheung region.Honam region.Youngnam region.Others
1 0 0 0 0 1 0
1 0 0 0 0 0 1
1 0 0 0 1 0 0
0 1 1 0 0 0 0
1 0 1 0 0 0 0
1 0 1 0 0 0 0

3. Scaling the continuous features

3-1. Standardization (mean 0, standard deviation 1)

Continuous columns are converted by Min-Max Scaling or standardization.

  • Try both Min-Max Scaling and standardization, then keep whichever gives the more accurate model

library(caret)
StandardScale <- preProcess(data_num, method = c("center", "scale"))
print(StandardScale)
Created from 211 samples and 6 variables

Pre-processing:
  - centered (6)
  - ignored (0)
  - scaled (6)

  • Apply the fitted transformation to get the standardized data frame

data_standard <- predict(StandardScale, data_num)
head(data_standard)
edu income age score_gov score_progress score_intention
1.8095327 0.7421712 0.3966776 -1.1190329 -1.13873228 1.5020451
0.2119955 0.7421712 0.3966776 -1.1190329 0.94154920 0.1228827
-1.3855418 -0.1955421 1.5432389 -2.1778487 -0.09859154 -0.1529498
0.2119955 -1.1332554 0.3966776 2.0574147 0.94154920 -0.4287823
-1.3855418 -0.1955421 1.5432389 0.9985988 -0.09859154 -0.7046147
-1.3855418 -0.1955421 1.5432389 -2.1778487 0.94154920 1.2262127
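
The two-step pattern here (fit with `preProcess`, then apply with `predict`) matters once there is new data: the means and standard deviations learned from the training rows are reused unchanged. A rough base R sketch of the same idea, with toy data frames assumed for illustration:

```r
# Toy training data and new rows (illustrative values only).
train <- data.frame(edu = c(3, 2, 1, 2, 1, 1), income = c(3, 3, 2, 1, 2, 2))
new_rows <- data.frame(edu = c(2, 3), income = c(4, 1))

# "Fit": learn the centering and scaling constants from the training data only.
center <- colMeans(train)
spread <- sapply(train, sd)

# "Apply": reuse the same constants for both the training data and the new rows.
train_std <- scale(train, center = center, scale = spread)
new_std <- scale(new_rows, center = center, scale = spread)

print(round(colMeans(train_std), 10))  # zero for every column
print(new_std)                         # new rows are not re-centered on themselves
```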

3-2. Min-Max Scaling

MinMaxScale <- preProcess(data_num, method = c("range"))
print(MinMaxScale)
Created from 211 samples and 6 variables

Pre-processing:
  - ignored (0)
  - re-scaling to [0, 1] (6)

data_minmax <- predict(MinMaxScale, data_num)
head(data_minmax)
edu income age score_gov score_progress score_intention
1.0 0.6666667 0.6666667 0.25 0.25 0.75
0.5 0.6666667 0.6666667 0.25 0.75 0.50
0.0 0.3333333 1.0000000 0.00 0.50 0.45
0.5 0.0000000 0.6666667 1.00 0.75 0.40
0.0 0.3333333 1.0000000 0.75 0.50 0.35
0.0 0.3333333 1.0000000 0.00 0.75 0.70
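
One consequence of fitting the range on the training data is worth seeing once: a later value outside that range is scaled to something outside [0, 1]. The sketch below uses toy income values with an assumed minimum of 1 and maximum of 4; `rescale` is a hypothetical helper name.

```r
# Toy training values (illustrative only); the range learned here is 1 to 4.
train_income <- c(3, 3, 2, 1, 2, 4)
lo <- min(train_income)
hi <- max(train_income)

# Apply the learned range to any vector.
rescale <- function(x) (x - lo) / (hi - lo)

print(rescale(train_income))  # all within [0, 1]
print(rescale(c(2, 5)))       # 5 exceeds the training maximum, so it maps above 1
```

With this range an income of 3 maps to (3 - 1) / (4 - 1), which is the 0.6666667 shown in the table.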

4. Combining and saving the data

# Combine the pieces column-wise with cbind; the same result can be reached in several other ways
Fvote <- cbind(data_cat2, data_standard, data_class)
Fvote
gender.male gender.female region.Sudo region.Chungcheung region.Honam region.Youngnam region.Others edu income age score_gov score_progress score_intention vote parties
1 0 0 0 0 1 0 1.8095327 0.7421712 0.3966776 -1.11903285 -1.13873228 1.5020451 1 2
1 0 0 0 0 0 1 0.2119955 0.7421712 0.3966776 -1.11903285 0.94154920 0.1228827 0 3
1 0 0 0 1 0 0 -1.3855418 -0.1955421 1.5432389 -2.17784869 -0.09859154 -0.1529498 1 4
0 1 1 0 0 0 0 0.2119955 -1.1332554 0.3966776 2.05741467 0.94154920 -0.4287823 1 1
1 0 1 0 0 0 0 -1.3855418 -0.1955421 1.5432389 0.99859883 -0.09859154 -0.7046147 1 1
1 0 1 0 0 0 0 -1.3855418 -0.1955421 1.5432389 -2.17784869 0.94154920 1.2262127 1 2
1 0 1 0 0 0 0 -1.3855418 -0.1955421 1.5432389 0.99859883 0.94154920 -1.2562797 1 1
1 0 0 0 0 0 1 0.2119955 1.6798845 1.5432389 -0.06021701 0.94154920 0.9503802 1 3
1 0 0 1 0 0 0 -1.3855418 -0.1955421 1.5432389 -1.11903285 -1.13873228 -1.2562797 0 2
1 0 1 0 0 0 0 -1.3855418 -0.1955421 0.3966776 0.99859883 -1.13873228 0.1228827 1 1
1 0 1 0 0 0 0 -1.3855418 -0.1955421 0.3966776 -1.11903285 0.94154920 -0.9804472 0 2
0 1 0 0 0 1 0 -1.3855418 -1.1332554 0.3966776 -0.06021701 -1.13873228 -0.4287823 1 1
1 0 0 0 0 0 1 -1.3855418 -0.1955421 1.5432389 -0.06021701 -1.13873228 0.1228827 1 1
1 0 0 1 0 0 0 0.2119955 1.6798845 1.5432389 -0.06021701 -0.09859154 -0.7046147 1 3
1 0 0 0 0 1 0 1.8095327 1.6798845 0.3966776 -0.06021701 0.94154920 0.9503802 1 3
1 0 1 0 0 0 0 0.2119955 0.7421712 0.3966776 -0.06021701 -0.09859154 0.3987152 1 4
1 0 0 0 0 0 1 0.2119955 1.6798845 0.3966776 0.99859883 -0.09859154 1.5020451 1 4
0 1 1 0 0 0 0 0.2119955 -0.1955421 0.3966776 2.05741467 0.94154920 -0.4287823 1 1
0 1 0 0 1 0 0 -1.3855418 -0.1955421 -0.7498837 -0.06021701 -1.13873228 0.1228827 0 4
0 1 0 0 0 0 1 1.8095327 1.6798845 0.3966776 -0.06021701 -0.09859154 0.1228827 1 1
1 0 1 0 0 0 0 -1.3855418 -1.1332554 -0.7498837 -0.06021701 -0.09859154 -1.2562797 0 2
0 1 1 0 0 0 0 1.8095327 -0.1955421 -0.7498837 -0.06021701 0.94154920 -0.9804472 1 4
0 1 0 0 0 1 0 0.2119955 -0.1955421 -0.7498837 -0.06021701 -0.09859154 -2.0837772 1 4
1 0 1 0 0 0 0 -1.3855418 0.7421712 0.3966776 -0.06021701 0.94154920 -1.8079447 1 4
1 0 0 0 0 1 0 0.2119955 0.7421712 0.3966776 -0.06021701 -1.13873228 0.9503802 1 2
1 0 1 0 0 0 0 0.2119955 -0.1955421 -0.7498837 -0.06021701 -1.13873228 0.3987152 1 4
0 1 1 0 0 0 0 0.2119955 0.7421712 -0.7498837 -0.06021701 0.94154920 -0.1529498 1 1
0 1 1 0 0 0 0 0.2119955 1.6798845 0.3966776 0.99859883 0.94154920 0.1228827 1 1
1 0 1 0 0 0 0 -1.3855418 0.7421712 0.3966776 -2.17784869 -1.13873228 0.1228827 1 2
0 1 1 0 0 0 0 0.2119955 1.6798845 1.5432389 -0.06021701 -0.09859154 -0.9804472 1 3
.............................................
1 0 1 0 0 0 0 0.2119955 -1.1332554 -0.7498837 -0.06021701 1.98168994 -0.9804472 1 1
1 0 0 1 0 0 0 0.2119955 -1.1332554 -0.7498837 0.99859883 -0.09859154 0.6745477 1 4
0 1 1 0 0 0 0 -1.3855418 -1.1332554 -1.8964449 0.99859883 -1.13873228 0.6745477 1 4
1 0 0 0 0 1 0 -1.3855418 -1.1332554 -0.7498837 -2.17784869 0.94154920 -0.1529498 1 4
0 1 1 0 0 0 0 -1.3855418 -1.1332554 -1.8964449 -0.06021701 0.94154920 -0.1529498 1 2
0 1 1 0 0 0 0 -1.3855418 -1.1332554 -0.7498837 -1.11903285 0.94154920 0.1228827 0 4
1 0 0 0 0 0 1 0.2119955 -0.1955421 0.3966776 0.99859883 -0.09859154 0.6745477 0 1
1 0 1 0 0 0 0 0.2119955 -1.1332554 -1.8964449 -0.06021701 -0.09859154 -0.1529498 1 4
1 0 1 0 0 0 0 -1.3855418 -1.1332554 -0.7498837 -0.06021701 -0.09859154 0.1228827 1 4
1 0 0 0 0 0 1 -1.3855418 -1.1332554 -1.8964449 -1.11903285 -1.13873228 -0.4287823 0 2
0 1 1 0 0 0 0 0.2119955 -1.1332554 0.3966776 0.99859883 0.94154920 2.0537101 1 4
0 1 0 1 0 0 0 -1.3855418 -1.1332554 -1.8964449 -0.06021701 0.94154920 -0.1529498 0 4
1 0 0 0 0 0 1 -1.3855418 -1.1332554 -0.7498837 -0.06021701 0.94154920 0.6745477 1 4
1 0 0 1 0 0 0 0.2119955 -1.1332554 -0.7498837 -1.11903285 -2.17887302 -0.9804472 1 2
1 0 0 0 0 0 1 -1.3855418 -1.1332554 -1.8964449 -0.06021701 -0.09859154 0.1228827 0 4
0 1 1 0 0 0 0 0.2119955 -0.1955421 0.3966776 -1.11903285 0.94154920 0.1228827 1 4
0 1 1 0 0 0 0 -1.3855418 -1.1332554 0.3966776 -1.11903285 -0.09859154 0.1228827 1 4
1 0 1 0 0 0 0 0.2119955 -1.1332554 -0.7498837 0.99859883 0.94154920 2.8812076 1 1
1 0 1 0 0 0 0 -1.3855418 -1.1332554 -0.7498837 -0.06021701 -1.13873228 -0.9804472 1 4
0 1 1 0 0 0 0 0.2119955 -1.1332554 -0.7498837 0.99859883 -0.09859154 0.1228827 1 1
0 1 1 0 0 0 0 0.2119955 -1.1332554 -0.7498837 0.99859883 0.94154920 0.9503802 0 1
1 0 1 0 0 0 0 0.2119955 -1.1332554 -0.7498837 -0.06021701 -0.09859154 0.6745477 1 1
0 1 1 0 0 0 0 0.2119955 -0.1955421 -0.7498837 -0.06021701 -0.09859154 0.9503802 1 4
0 1 1 0 0 0 0 -1.3855418 -1.1332554 -1.8964449 2.05741467 0.94154920 0.3987152 0 1
0 1 1 0 0 0 0 -1.3855418 0.7421712 1.5432389 -0.06021701 -1.13873228 -2.6354421 1 2
1 0 0 0 0 1 0 -1.3855418 1.6798845 1.5432389 -0.06021701 -0.09859154 -1.5321122 1 2
1 0 1 0 0 0 0 0.2119955 -1.1332554 -0.7498837 -0.06021701 0.94154920 -0.4287823 1 4
1 0 0 1 0 0 0 0.2119955 -1.1332554 -0.7498837 -0.06021701 -0.09859154 -0.4287823 1 2
1 0 1 0 0 0 0 0.2119955 0.7421712 1.5432389 -0.06021701 -1.13873228 1.5020451 1 4
0 1 1 0 0 0 0 0.2119955 -0.1955421 -0.7498837 -0.06021701 -0.09859154 1.2262127 1 2

write.csv(Fvote, file = "Fvote2.csv", row.names = TRUE)
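
One detail of the save step: with `row.names = TRUE`, `write.csv` writes the row names as an extra unnamed first column, which `read.csv` then reads back as a column called `X`. The sketch below demonstrates this with small toy data frames and a temporary file, both assumed here for illustration.

```r
# Toy pieces standing in for the encoded, scaled, and label data frames.
cat_part <- data.frame(gender.male = c(1, 0), gender.female = c(0, 1))
num_part <- data.frame(edu = c(1.81, 0.21))
label_part <- data.frame(vote = c(1, 0))

# cbind joins data frames side by side; they must have the same number of rows.
combined <- cbind(cat_part, num_part, label_part)

path <- tempfile(fileext = ".csv")

# row.names = TRUE adds an index column that comes back as "X".
write.csv(combined, file = path, row.names = TRUE)
with_index <- read.csv(path)
print(names(with_index))

# row.names = FALSE round-trips the original columns only.
write.csv(combined, file = path, row.names = FALSE)
without_index <- read.csv(path)
print(names(without_index))
```

If the saved file is going straight into a model, `row.names = FALSE` avoids carrying the index along as an accidental feature.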
