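Study notes
When data do not follow a normal distribution, they need to be transformed.
Three common transformations: reciprocal (1/x), logarithm (log) and square root (sqrt)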
> mydata <- runif(100,min = 1,max = 2)
> shapiro.test(mydata)
Shapiro-Wilk normality test
data: mydata
W = 0.96198, p-value = 0.005578
> shapiro.test(1/mydata)
Shapiro-Wilk normality test
data: 1/mydata
W = 0.96442, p-value = 0.008406
> shapiro.test(log(mydata))
Shapiro-Wilk normality test
data: log(mydata)
W = 0.96941, p-value = 0.01994
> shapiro.test(sqrt(mydata))
Shapiro-Wilk normality test
data: sqrt(mydata)
W = 0.96717, p-value = 0.01348
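# All three transformations give a p-value below 0.05, so normality is rejected in every case (the sample comes from runif(), so the exact numbers change from run to run)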
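The three checks above can also be run in one pass. This is a minimal sketch, assuming a fixed seed so the numbers are reproducible; the seed value and the names `transforms` and `p_values` are illustrative choices, not part of the original notes.

```r
# Seed value is an arbitrary assumption; the original notes did not set one
set.seed(42)
mydata <- runif(100, min = 1, max = 2)

# Candidate transformations, with the untransformed data kept for reference
transforms <- list(
  identity   = function(x) x,
  reciprocal = function(x) 1 / x,
  log        = log,
  sqrt       = sqrt
)

# Shapiro-Wilk p-value for each transformed version of the data
p_values <- sapply(transforms, function(f) shapiro.test(f(mydata))$p.value)
print(round(p_values, 4))

# TRUE where normality is rejected at the 5% level
print(p_values < 0.05)
```

Collecting the p-values in one named vector makes it easier to compare transformations side by side than reading four separate test printouts.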
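Method 4: boxcox() from the MASS package, a power transformation controlled by a parameter lambda
When none of the three transformations above works, boxcox() is worth trying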
> library(MASS)
> head(trees,3)
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
> shapiro.test(trees$Volume) # p-value below 0.05, normality is rejected
Shapiro-Wilk normality test
data: trees$Volume
W = 0.88757, p-value = 0.003579
> shapiro.test(log(trees$Volume)) # after the log transformation the p-value is above 0.05
Shapiro-Wilk normality test
data: log(trees$Volume)
W = 0.96427, p-value = 0.3766
# The boxcox() approach
> bc <- boxcox(Volume ~ log(Height) + log(Girth), data = trees,
+ lambda = seq(-0.25, 0.25, length = 10))
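# boxcox() first draws the profile log-likelihood plot: Volume is modelled from log(Height) and log(Girth) using the trees data, with lambda evaluated on a grid of 10 values between -0.25 and 0.25; it returns the lambda values (x) and their log-likelihoods (y)
> lambda <- bc$x[which.max(bc$y)] # x at the maximum y, i.e. the lambda with the highest log-likelihood
> Volume_bc <- (trees$Volume^lambda-1)/lambda # Box-Cox transformation (valid for lambda != 0; use log() when lambda is 0)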
> Volume_bc
[1] 2.156176 2.156176 2.147851 2.546707 2.659044 2.697267
[7] 2.505310 2.632460 2.808820 2.705508 2.863994 2.749305
[13] 2.764627 2.760825 2.671999 2.794373 3.129821 2.963511
[19] 2.912290 2.886918 3.145933 3.079254 3.185814 3.227720
[25] 3.310407 3.512022 3.516128 3.550759 3.456365 3.448906
[31] 3.759623
> shapiro.test(Volume_bc)
Shapiro-Wilk normality test
data: Volume_bc
W = 0.96431, p-value = 0.3776 # p-value above 0.05, normality is not rejected
> qqnorm(Volume_bc)
> qqline(Volume_bc)
# The points lie close to the line, which is consistent with a normal distribution
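The Box-Cox steps above can be wrapped in a small helper that also covers the lambda = 0 case, where the formula reduces to log(x). This is a sketch only: `box_cox` is a hypothetical helper name, and the finer grid of 101 lambda values with `plotit = FALSE` is an assumption made here for illustration, not the setting used in the notes.

```r
library(MASS)  # provides boxcox()

# Box-Cox transform of a positive vector; falls back to log() when lambda is ~0
box_cox <- function(x, lambda, eps = 1e-8) {
  stopifnot(all(x > 0))
  if (abs(lambda) < eps) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood over a grid of lambda values, without drawing the plot
bc <- boxcox(Volume ~ log(Height) + log(Girth), data = trees,
             lambda = seq(-0.25, 0.25, length.out = 101), plotit = FALSE)

# Lambda with the highest log-likelihood on the grid
lambda_hat <- bc$x[which.max(bc$y)]

volume_bc <- box_cox(trees$Volume, lambda_hat)
print(lambda_hat)
print(shapiro.test(volume_bc))
```

Note that the lambda found this way is tuned for the regression model on the left, not for the marginal distribution of Volume, so the normality check on `volume_bc` is a sanity check and not the quantity being optimised.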
Method 5: choose the transformation from the lambda estimated by the forecast package
- -1 : 1/x
- -0.5 : 1/sqrt(x)
- 0 : log(x)
- 0.5 : sqrt(x)
- 1 : no transformation needed
> library(forecast)
> lambda <- BoxCox.lambda(trees$Volume);lambda # estimate lambda, then pick the transformation from the table above
[1] -0.4954451
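# Close to -0.5, so use the reciprocal square root
> new_volum <- 1/sqrt(trees$Volume) # reciprocal square root transformation
> shapiro.test(new_volum ) # p-value above 0.05, normality is not rejected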
Shapiro-Wilk normality test
data: new_volum
W = 0.94523, p-value = 0.1152
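The lookup from lambda to a simple transformation can be written as a function. This is a minimal sketch in base R: `ladder_transform` is a hypothetical helper, not a forecast function, and the lambda value is copied from the output in the notes so that the forecast package is not needed to run it.

```r
# Hypothetical helper (not part of forecast): snap an estimated lambda to the
# nearest value in the table and apply the matching simple transformation
ladder_transform <- function(x, lambda) {
  ladder <- c(-1, -0.5, 0, 0.5, 1)
  nearest <- ladder[which.min(abs(ladder - lambda))]
  switch(as.character(nearest),
         "-1"   = 1 / x,
         "-0.5" = 1 / sqrt(x),
         "0"    = log(x),
         "0.5"  = sqrt(x),
         "1"    = x)
}

lambda <- -0.4954451  # value reported by BoxCox.lambda(trees$Volume) in the notes
new_volume <- ladder_transform(trees$Volume, lambda)
print(shapiro.test(new_volume))
```

Rounding lambda to a table value trades a little fit for a transformation that is easy to name and interpret.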
Method 6: powerTransform() from the car package
> library(car)
> powerTransform(trees$Volume) # estimate the power (exponent)
Estimated transformation parameter
trees$Volume
-0.07476608
> new_volum <- trees$Volume^-0.07476608 # build the new variable from the estimated exponent
> shapiro.test(new_volum ) # p-value above 0.05, normality is not rejected
Shapiro-Wilk normality test
data: new_volum
W = 0.96428, p-value = 0.3768
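One detail worth checking: with a negative exponent, the raw power x^lambda reverses the ordering of the data, while the Box-Cox form (x^lambda - 1)/lambda preserves it. The sketch below, in base R with the exponent copied from the output above, compares the two. Because the Shapiro-Wilk statistic is unaffected by shifting, rescaling or flipping the sign of the data, both versions should give essentially the same W.

```r
lambda <- -0.07476608  # exponent reported by powerTransform() in the notes

raw_power <- trees$Volume^lambda                 # decreasing in Volume when lambda < 0
scaled    <- (trees$Volume^lambda - 1) / lambda  # Box-Cox form, increasing in Volume

# Shapiro-Wilk W statistic for each version
w_raw    <- unname(shapiro.test(raw_power)$statistic)
w_scaled <- unname(shapiro.test(scaled)$statistic)
print(c(raw = w_raw, scaled = w_scaled))

# Direction of each transformation relative to the original values
print(cor(trees$Volume, raw_power) < 0)
print(cor(trees$Volume, scaled) > 0)
```

If the transformed variable is used later in a model, the scaled form may be easier to interpret because larger volumes stay larger after the transformation.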
In short, no single transformation suits every dataset; try several and check the result.