Study notes

When the original data do not follow a normal distribution, they need to be transformed.

Three common transformations: reciprocal (1/x), logarithm (log(x)) and square root (sqrt(x))

  > mydata <- runif(100, min = 1, max = 2)
  > shapiro.test(mydata)
  Shapiro-Wilk normality test
  data: mydata
  W = 0.96198, p-value = 0.005578
  > shapiro.test(1/mydata)
  Shapiro-Wilk normality test
  data: 1/mydata
  W = 0.96442, p-value = 0.008406
  > shapiro.test(log(mydata))
  Shapiro-Wilk normality test
  data: log(mydata)
  W = 0.96941, p-value = 0.01994
  > shapiro.test(sqrt(mydata))
  Shapiro-Wilk normality test
  data: sqrt(mydata)
  W = 0.96717, p-value = 0.01348
  # With all three transformations the p-value is still below 0.05, so none of them gives normally distributed data
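The same comparison can be collected into one helper so the p-values sit side by side. This is a minimal sketch in base R: the function name `compare_transforms` is hypothetical (it is not part of the original notes), it assumes a strictly positive numeric vector, and `set.seed()` is added only so repeated runs agree. The numbers will not match the listing above, which was drawn without a seed.

```r
# Hypothetical helper: Shapiro-Wilk p-values for the raw data and
# the three common transformations (requires x > 0).
compare_transforms <- function(x) {
  stopifnot(is.numeric(x), all(x > 0))
  candidates <- list(
    raw        = x,
    reciprocal = 1 / x,
    log        = log(x),
    sqrt       = sqrt(x)
  )
  # shapiro.test() returns an "htest" object; p.value holds the p-value
  vapply(candidates, function(v) shapiro.test(v)$p.value, numeric(1))
}

set.seed(1)  # assumption: seed chosen only for reproducibility
mydata <- runif(100, min = 1, max = 2)
p_values <- compare_transforms(mydata)
print(round(p_values, 4))
print(p_values > 0.05)  # TRUE means normality is not rejected at the 5% level
```

A p-value above 0.05 only means the test found no evidence against normality; it does not prove the data are normal.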

Method 4: the Box-Cox transformation, boxcox() in the MASS package, a power transformation whose form is determined by the value of lambda

If none of the three transformations works, consider boxcox().
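For reference, the Box-Cox family is usually written as follows for a strictly positive variable y. The lambda = 0 case is the limit of the power form as lambda approaches 0, which is why it corresponds to the log transformation.

```latex
y^{(\lambda)} =
\begin{cases}
  \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
  \log y, & \lambda = 0.
\end{cases}
```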

  > library(MASS)
  > head(trees, 3)
    Girth Height Volume
  1   8.3     70   10.3
  2   8.6     65   10.3
  3   8.8     63   10.2
  > shapiro.test(trees$Volume)  # p-value below 0.05: normality is rejected
  Shapiro-Wilk normality test
  data: trees$Volume
  W = 0.88757, p-value = 0.003579
  > shapiro.test(log(trees$Volume))  # after the log transformation the p-value is above 0.05
  Shapiro-Wilk normality test
  data: log(trees$Volume)
  W = 0.96427, p-value = 0.3766
  # The Box-Cox approach
  > bc <- boxcox(Volume ~ log(Height) + log(Girth), data = trees,
  +              lambda = seq(-0.25, 0.25, length = 10))
  # This call first draws the plot (see the figure below): Volume is modelled on log(Height) and log(Girth) using the trees data, and lambda is evaluated over the range -0.25 to 0.25, starting from a grid of 10 values, to find the best value
  > lambda <- bc$x[which.max(bc$y)]  # take the x at which y is largest; that x is the lambda value
  > Volume_bc <- (trees$Volume^lambda - 1)/lambda  # apply the transformation
  > Volume_bc
  [1] 2.156176 2.156176 2.147851 2.546707 2.659044 2.697267
  [7] 2.505310 2.632460 2.808820 2.705508 2.863994 2.749305
  [13] 2.764627 2.760825 2.671999 2.794373 3.129821 2.963511
  [19] 2.912290 2.886918 3.145933 3.079254 3.185814 3.227720
  [25] 3.310407 3.512022 3.516128 3.550759 3.456365 3.448906
  [31] 3.759623
  > shapiro.test(Volume_bc)
  Shapiro-Wilk normality test
  data: Volume_bc
  W = 0.96431, p-value = 0.3776  # p-value above 0.05: consistent with a normal distribution
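The Box-Cox steps above can also be written without the plot and with the lambda = 0 case handled explicitly. This is a sketch under a few assumptions: the helper name `box_cox` and its `eps` argument are hypothetical, and the lambda grid uses a step of 0.01 rather than the 10-point grid in the notes, so the selected lambda may differ slightly from the one used above.

```r
library(MASS)  # provides boxcox(); MASS ships with standard R installations

# Hypothetical helper: Box-Cox transform with the lambda = 0 case handled
box_cox <- function(y, lambda, eps = 1e-8) {
  stopifnot(all(y > 0))
  if (abs(lambda) < eps) log(y) else (y^lambda - 1) / lambda
}

# plotit = FALSE skips the plot and returns the lambda grid (x)
# together with the profile log-likelihood (y)
bc <- boxcox(Volume ~ log(Height) + log(Girth), data = trees,
             lambda = seq(-0.25, 0.25, by = 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

Volume_bc <- box_cox(trees$Volume, lambda_hat)
print(lambda_hat)
print(shapiro.test(Volume_bc)$p.value)
```

Note that boxcox() estimates lambda for the regression model, so the value it selects is the one that suits the model of Volume on log(Height) and log(Girth), not necessarily Volume taken alone.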

R Statistics 3: Data Transformation - Figure 1

  > qqnorm(Volume_bc)
  > qqline(Volume_bc)
  # The points lie close to the line, consistent with a normal distribution

R Statistics 3: Data Transformation - Figure 2
R Statistics 3: Data Transformation - Figure 3

Method 5: choose the transformation from the lambda value estimated by the forecast package

  • -1: 1/x
  • -0.5: 1/sqrt(x)
  • 0: log(x)
  • 0.5: sqrt(x)
  • 1: no transformation needed
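  > library(forecast)
  > lambda <- BoxCox.lambda(trees$Volume); lambda  # estimate lambda, then pick the transformation from the list above
  [1] -0.4954451
  # Close to -0.5, so use the reciprocal square root, 1/sqrt(x)
  > new_volum <- 1/sqrt(trees$Volume)  # reciprocal square-root transformation
  > shapiro.test(new_volum)  # p-value above 0.05: consistent with a normal distribution
  Shapiro-Wilk normality test
  data: new_volum
  W = 0.94523, p-value = 0.1152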
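Picking the transformation by eye from the lambda list can be automated. The sketch below, in base R, snaps an estimated lambda to the nearest listed value and applies the matching transformation. The function name `apply_nearest_power` is hypothetical, and the lambda is copied from the BoxCox.lambda() output above so that the forecast package is not needed to run it.

```r
# Hypothetical helper: snap an estimated lambda to the nearest value in the
# list (-1, -0.5, 0, 0.5, 1) and apply the matching transformation (x > 0).
apply_nearest_power <- function(x, lambda) {
  stopifnot(all(x > 0))
  grid <- c(-1, -0.5, 0, 0.5, 1)
  nearest <- grid[which.min(abs(grid - lambda))]
  transformed <- switch(as.character(nearest),
    "-1"   = 1 / x,
    "-0.5" = 1 / sqrt(x),
    "0"    = log(x),
    "0.5"  = sqrt(x),
    "1"    = x
  )
  list(lambda = nearest, values = transformed)
}

# lambda copied from the BoxCox.lambda() output shown in these notes
res <- apply_nearest_power(trees$Volume, -0.4954451)
print(res$lambda)
print(shapiro.test(res$values)$p.value)
```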

Method 6: powerTransform() in the car package

  > library(car)
  > powerTransform(trees$Volume)  # estimate the power (exponent)
  Estimated transformation parameter
  trees$Volume
  -0.07476608
  > new_volum <- trees$Volume^-0.07476608  # build the new variable from the estimated exponent
  > shapiro.test(new_volum)  # p-value above 0.05: consistent with a normal distribution
  Shapiro-Wilk normality test
  data: new_volum
  W = 0.96428, p-value = 0.3768
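Method 6 raises the data to the plain power x^lambda, while Method 4 uses the scaled form (x^lambda - 1)/lambda. For a fixed lambda the two differ only by a shift and a rescaling, so the Shapiro-Wilk statistic should be the same for both; what changes is the direction, because x^lambda reverses the ordering of the data when lambda is negative. The base R sketch below checks this, with the exponent copied from the powerTransform() output above (the variable names are illustrative).

```r
lambda <- -0.07476608  # exponent copied from the powerTransform() output above
x <- trees$Volume

plain_power <- x^lambda                 # form used in Method 6
scaled_form <- (x^lambda - 1) / lambda  # Box-Cox form used in Method 4

w_plain  <- shapiro.test(plain_power)$statistic
w_scaled <- shapiro.test(scaled_form)$statistic
print(c(w_plain, w_scaled))

# The scaled form increases with x; x^lambda decreases when lambda < 0
print(cor(x, scaled_form) > 0)
print(cor(x, plain_power) < 0)
```

If the transformed variable is used in later modelling, the scaled form is often easier to interpret because larger original values stay larger after the transformation.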

In short, no single transformation suits every data set; you need to try several and check the result each time.