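Study notes
When the original data do not follow a normal distribution, the data need to be transformed.
Three common transformations: reciprocal, logarithm, and square root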
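> mydata <- runif(100, min = 1, max = 2)
> shapiro.test(mydata)

    Shapiro-Wilk normality test

data:  mydata
W = 0.96198, p-value = 0.005578

> shapiro.test(1/mydata)

    Shapiro-Wilk normality test

data:  1/mydata
W = 0.96442, p-value = 0.008406

> shapiro.test(log(mydata))

    Shapiro-Wilk normality test

data:  log(mydata)
W = 0.96941, p-value = 0.01994

> shapiro.test(sqrt(mydata))

    Shapiro-Wilk normality test

data:  sqrt(mydata)
W = 0.96717, p-value = 0.01348

With all three transformations the p-value is still below 0.05, so normality is rejected in every case and none of them is suitable for this data. (The data come from runif() without a fixed seed, so the exact values will differ between runs.)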
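The same comparison can be run in one pass by collecting the Shapiro-Wilk p-values for each candidate transformation. This is only a minimal sketch: the seed and the names `candidates` and `p_values` are assumptions for illustration, not part of the original notes.

```r
# Assumed seed, used only so that the sketch is reproducible
set.seed(1)
x <- runif(100, min = 1, max = 2)

# The raw data plus the three common transformations
candidates <- list(
  raw        = x,
  reciprocal = 1 / x,
  log        = log(x),
  sqrt       = sqrt(x)
)

# Shapiro-Wilk p-value for each candidate
p_values <- sapply(candidates, function(v) shapiro.test(v)$p.value)

# TRUE means normality is rejected at the 0.05 level
p_values < 0.05
```

Putting the p-values in one named vector makes it easy to see at a glance which transformation, if any, passes the 0.05 threshold.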
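Method 4: the Box-Cox transformation with boxcox() from the MASS package, a family of power transformations controlled by a parameter lambda
If none of the three transformations above works, consider boxcox().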
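For reference, the Box-Cox family for a positive variable y is usually written as below. The formula is not stated in these notes, but its first branch is the same expression as the conversion step `(trees$Volume^lambda-1)/lambda` used in the example that follows.

```latex
y^{(\lambda)} =
\begin{cases}
  \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
  \log y, & \lambda = 0.
\end{cases}
```

The second branch is the limit of the first as lambda approaches 0, which is why a lambda close to 0 behaves much like a log transformation.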
> library(MASS)
> head(trees, 3)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
> shapiro.test(trees$Volume)  # p-value below 0.05: normality is rejected

    Shapiro-Wilk normality test

data:  trees$Volume
W = 0.88757, p-value = 0.003579

> shapiro.test(log(trees$Volume))  # after the log transformation the p-value is above 0.05

    Shapiro-Wilk normality test

data:  log(trees$Volume)
W = 0.96427, p-value = 0.3766

The Box-Cox approach:

> bc <- boxcox(Volume ~ log(Height) + log(Girth), data = trees,
+              lambda = seq(-0.25, 0.25, length = 10))

This call first draws the profile log-likelihood plot. Volume is modeled from log(Height) and log(Girth) using the trees data, and lambda is evaluated over a grid of 10 values between -0.25 and 0.25; the value with the highest log-likelihood is taken as the best one. Note that this lambda is estimated for the regression model, so it targets normality of the model residuals.

> lambda <- bc$x[which.max(bc$y)]  # the x at the maximum y is the estimated lambda
> Volume_bc <- (trees$Volume^lambda - 1)/lambda  # apply the transformation
> Volume_bc
 [1] 2.156176 2.156176 2.147851 2.546707 2.659044 2.697267
 [7] 2.505310 2.632460 2.808820 2.705508 2.863994 2.749305
[13] 2.764627 2.760825 2.671999 2.794373 3.129821 2.963511
[19] 2.912290 2.886918 3.145933 3.079254 3.185814 3.227720
[25] 3.310407 3.512022 3.516128 3.550759 3.456365 3.448906
[31] 3.759623
> shapiro.test(Volume_bc)  # p-value above 0.05: normality is not rejected

    Shapiro-Wilk normality test

data:  Volume_bc
W = 0.96431, p-value = 0.3776

> qqnorm(Volume_bc)
> qqline(Volume_bc)  # the points lie close to the line, consistent with a normal distribution
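The conversion step `(trees$Volume^lambda-1)/lambda` breaks down when lambda is exactly 0, where the Box-Cox transformation is defined as the logarithm. A small wrapper can cover both cases. The function name `box_cox` and the tolerance `tol` are hypothetical choices for this sketch, not functions from MASS.

```r
# Hypothetical helper: Box-Cox transformation of a positive vector x
# for a given lambda, falling back to log(x) when lambda is (almost) 0.
box_cox <- function(x, lambda, tol = 1e-8) {
  stopifnot(all(x > 0))            # Box-Cox requires strictly positive data
  if (abs(lambda) < tol) {
    log(x)
  } else {
    (x^lambda - 1) / lambda
  }
}

# Example with the built-in trees data and a lambda close to 0
volume_log <- box_cox(trees$Volume, 0)      # same as log(trees$Volume)
volume_bc  <- box_cox(trees$Volume, -0.07)  # mild power transformation
```

With the `lambda` estimated from `boxcox()`, `box_cox(trees$Volume, lambda)` should reproduce the `Volume_bc` vector computed in the transcript.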


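Method 5: choose the transformation from the lambda value estimated by the forecast package
- lambda = -1: 1/x (reciprocal)
- lambda = -0.5: 1/sqrt(x) (reciprocal square root)
- lambda = 0: log(x)
- lambda = 0.5: sqrt(x)
- lambda = 1: no transformation needed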
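> library(forecast)
> lambda <- BoxCox.lambda(trees$Volume); lambda  # estimate lambda, then pick the transformation from the list above
[1] -0.4954451
The estimate is close to -0.5, so the reciprocal square root is used.
> new_volum <- 1/sqrt(trees$Volume)  # reciprocal square root transformation
> shapiro.test(new_volum)  # p-value above 0.05: normality is not rejected

    Shapiro-Wilk normality test

data:  new_volum
W = 0.94523, p-value = 0.1152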
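The lookup from lambda to a transformation can also be written as a function, which avoids picking the wrong row by eye (for example, confusing -0.5 with 0.5). This is a sketch in base R only; `apply_nearest_power` is a hypothetical helper, not part of forecast, and it simply snaps lambda to the nearest of the five conventional values.

```r
# Hypothetical helper (not from forecast): snap an estimated lambda to the
# nearest conventional value and apply the matching transformation.
apply_nearest_power <- function(x, lambda) {
  candidates <- c(-1, -0.5, 0, 0.5, 1)
  nearest <- candidates[which.min(abs(candidates - lambda))]
  switch(as.character(nearest),
    "-1"   = 1 / x,        # reciprocal
    "-0.5" = 1 / sqrt(x),  # reciprocal square root
    "0"    = log(x),       # logarithm
    "0.5"  = sqrt(x),      # square root
    "1"    = x             # no transformation
  )
}

# The lambda reported for trees$Volume in the notes snaps to -0.5
new_volume <- apply_nearest_power(trees$Volume, -0.4954451)
```

Snapping to a conventional value keeps the transformed variable easy to interpret; using the exact estimated lambda is the alternative when interpretability matters less.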
Method 6: powerTransform() from the car package
> library(car)
> powerTransform(trees$Volume)  # estimates the power (exponent) to use
Estimated transformation parameter 
trees$Volume 
 -0.07476608 
> new_volum <- trees$Volume^(-0.07476608)  # build the new variable from the estimated power
> shapiro.test(new_volum)  # p-value above 0.05: normality is not rejected

    Shapiro-Wilk normality test

data:  new_volum
W = 0.96428, p-value = 0.3768
In short, no single transformation suits every dataset; you need to try several and check the result each time.
