Continuous data

Distribution-based tests

t-test

    t.test(1:10, 10:20)

    ##
    ## Welch Two Sample t-test
    ##
    ## data: 1:10 and 10:20
    ## t = -7, df = 19, p-value = 2e-06
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  -12.4  -6.6
    ## sample estimates:
    ## mean of x mean of y
    ##       5.5      15.0

Paired t-test:

    t.test(rnorm(10), rnorm(10, mean = 1), paired = TRUE)

    ##
    ## Paired t-test
    ##
    ## data: rnorm(10) and rnorm(10, mean = 1)
    ## t = -2, df = 9, p-value = 0.03
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  -1.981 -0.096
    ## sample estimates:
    ## mean of the differences
    ##                   -1.04
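
A paired t-test is equivalent to a one-sample t-test on the pairwise differences. The sketch below checks this numerically; the seed and the names `x`, `y`, `paired`, and `one_smp` are illustrative choices, not part of the original example.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
x <- rnorm(10)
y <- rnorm(10, mean = 1)

paired  <- t.test(x, y, paired = TRUE)   # paired test on the two vectors
one_smp <- t.test(x - y, mu = 0)         # one-sample test on the differences

# Both calls yield the same statistic and the same p-value
all.equal(unname(paired$statistic), unname(one_smp$statistic))
all.equal(paired$p.value, one_smp$p.value)
```

Storing the samples in variables first also makes the result reproducible, which calling rnorm() inside t.test() does not.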

Using the formula interface:

    df <- data.frame(
      value = c(rnorm(10), rnorm(10, mean = 1)),
      group = c(rep("case", 10), rep("control", 10))
    )
    t.test(value ~ group, data = df)

    ##
    ## Welch Two Sample t-test
    ##
    ## data: value by group
    ## t = -0.7, df = 18, p-value = 0.5
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  -0.933  0.447
    ## sample estimates:
    ##    mean in group case mean in group control
    ##                 0.539                 0.782

Assuming equal variances in the two groups:

    t.test(value ~ group, data = df, var.equal = TRUE)

    ##
    ## Two Sample t-test
    ##
    ## data: value by group
    ## t = -0.7, df = 18, p-value = 0.5
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  -0.933  0.447
    ## sample estimates:
    ##    mean in group case mean in group control
    ##                 0.539                 0.782
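
With var.equal = TRUE the two sample variances are pooled into a single estimate. As a rough sketch of what that means, the statistic can be recomputed by hand; the seed and the helper names `sp2` and `t_manual` are assumptions introduced here for illustration.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
df <- data.frame(
  value = c(rnorm(10), rnorm(10, mean = 1)),
  group = rep(c("case", "control"), each = 10)
)
x <- df$value[df$group == "case"]
y <- df$value[df$group == "control"]
n1 <- length(x)
n2 <- length(y)

# Pooled variance: the two sample variances weighted by their degrees of freedom
sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
t_manual <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))

fit <- t.test(value ~ group, data = df, var.equal = TRUE)
all.equal(t_manual, unname(fit$statistic))   # same t statistic
fit$parameter                                # df = n1 + n2 - 2
```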

Comparing population variances

    var.test(value ~ group, data = df)

    ##
    ## F test to compare two variances
    ##
    ## data: value by group
    ## F = 0.8, num df = 9, denom df = 9, p-value = 0.8
    ## alternative hypothesis: true ratio of variances is not equal to 1
    ## 95 percent confidence interval:
    ##  0.202 3.269
    ## sample estimates:
    ## ratio of variances
    ##              0.812

    bartlett.test(value ~ group, data = df)

    ##
    ## Bartlett test of homogeneity of variances
    ##
    ## data: value by group
    ## Bartlett's K-squared = 0.09, df = 1, p-value = 0.8

Comparing means across multiple groups

Use ANOVA to compare more than two groups.

    aov(wt ~ factor(cyl), data = mtcars)

    ## Call:
    ##    aov(formula = wt ~ factor(cyl), data = mtcars)
    ##
    ## Terms:
    ##                 factor(cyl) Residuals
    ## Sum of Squares         18.2      11.5
    ## Deg. of Freedom           2        29
    ##
    ## Residual standard error: 0.63
    ## Estimated effects may be unbalanced

    # View more detailed information
    model.tables(aov(wt ~ factor(cyl), data = mtcars))

    ## Tables of effects
    ##
    ##  factor(cyl)
    ##           4       6      8
    ##     -0.9315 -0.1001  0.782
    ## rep 11.0000  7.0000 14.000

ANOVA assumes that the variances of all groups are equal. If you know (or suspect) that they are not, use the oneway.test() function instead.

    oneway.test(wt ~ cyl, data = mtcars)

    ##
    ## One-way analysis of means (not assuming equal variances)
    ##
    ## data: wt and cyl
    ## F = 20, num df = 2, denom df = 19, p-value = 2e-05
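
To see how the equal-variance assumption enters, the following sketch compares the classical ANOVA F statistic with oneway.test() run both with and without var.equal = TRUE. The variable names are illustrative; the comparison assumes the `mtcars` data used above.

```r
fit <- aov(wt ~ factor(cyl), data = mtcars)
aov_tab <- summary(fit)[[1]]             # ANOVA table as a data frame
f_aov <- aov_tab[["F value"]][1]         # F statistic for factor(cyl)

classic <- oneway.test(wt ~ cyl, data = mtcars, var.equal = TRUE)
welch   <- oneway.test(wt ~ cyl, data = mtcars)   # default: unequal variances

# With var.equal = TRUE, oneway.test() reproduces the ANOVA F statistic
all.equal(f_aov, unname(classic$statistic))

# The Welch version uses a smaller, non-integer denominator df
c(classic = unname(classic$parameter[2]), welch = unname(welch$parameter[2]))
```

Note that printing an aov object does not show a p-value; summary() is what produces the F test.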

Pairwise t-tests across multiple groups

    pairwise.t.test(mtcars$wt, mtcars$cyl)

    ##
    ## Pairwise comparisons using t tests with pooled SD
    ##
    ## data: mtcars$wt and mtcars$cyl
    ##
    ##   4     6
    ## 6 0.01  -
    ## 8 6e-07 0.01
    ##
    ## P value adjustment method: holm
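
The p-values above are adjusted for multiple comparisons with the Holm method, which is the default. A small sketch of how the adjustment relates to the unadjusted p-values, using the p.adjust.method argument (object names are illustrative):

```r
holm <- pairwise.t.test(mtcars$wt, mtcars$cyl)   # default adjustment: Holm
bonf <- pairwise.t.test(mtcars$wt, mtcars$cyl, p.adjust.method = "bonferroni")
raw  <- pairwise.t.test(mtcars$wt, mtcars$cyl, p.adjust.method = "none")

# Drop the NA cells of the triangular p-value matrices
p_raw  <- raw$p.value[!is.na(raw$p.value)]
p_holm <- holm$p.value[!is.na(holm$p.value)]
p_bonf <- bonf$p.value[!is.na(bonf$p.value)]

# The Holm results match p.adjust() applied to the unadjusted p-values
all.equal(p.adjust(p_raw, method = "holm"), p_holm)

# Bonferroni is never less conservative than Holm
all(p_bonf >= p_holm - 1e-12)
```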

Normality tests

    shapiro.test(rnorm(30))

    ##
    ## Shapiro-Wilk normality test
    ##
    ## data: rnorm(30)
    ## W = 1, p-value = 0.6

    qqnorm(rnorm(30))

[Figure: Rplot.png, the normal Q-Q plot drawn by qqnorm()]
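
The two calls above each draw a fresh random sample, so the test and the plot describe different data. A minimal sketch that applies both checks to the same vector and adds a reference line with qqline(); the seed and the name `x` are assumptions made for reproducibility.

```r
set.seed(123)             # arbitrary seed, only for reproducibility
x <- rnorm(30)            # one sample, reused by both checks

res <- shapiro.test(x)
res$p.value               # a large p-value gives no evidence against normality

qqnorm(x)                 # points should fall close to a straight line
qqline(x, col = "red")    # reference line through the first and third quartiles
```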

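Testing against a reference distribution

Use the Kolmogorov-Smirnov test to check whether a vector was drawn from a specified distribution (not limited to the normal distribution).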

    ks.test(rnorm(10), pnorm)

    ##
    ## One-sample Kolmogorov-Smirnov test
    ##
    ## data: rnorm(10)
    ## D = 0.3, p-value = 0.2
    ## alternative hypothesis: two-sided

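The first argument is the data to test. The second argument specifies the reference distribution: a numeric vector (which gives a two-sample test), a character string naming a cumulative distribution function, or the distribution function itself.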

    ks.test(rnorm(10), "pnorm")

    ##
    ## One-sample Kolmogorov-Smirnov test
    ##
    ## data: rnorm(10)
    ## D = 0.3, p-value = 0.3
    ## alternative hypothesis: two-sided

    ks.test(rpois(10, lambda = 1), "pnorm")

    ## Warning in ks.test(rpois(10, lambda = 1), "pnorm"): ties should not be present for the
    ## Kolmogorov-Smirnov test

    ##
    ## One-sample Kolmogorov-Smirnov test
    ##
    ## data: rpois(10, lambda = 1)
    ## D = 0.5, p-value = 0.006
    ## alternative hypothesis: two-sided
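
The examples above compare against the standard normal distribution, because "pnorm" is used with its default parameters. Additional arguments to ks.test() are passed on to the distribution function, so a reference distribution with other parameters can be specified. A sketch with illustrative values (mean 5, standard deviation 2, arbitrary seed):

```r
set.seed(7)                              # arbitrary seed, only for reproducibility
x <- rnorm(50, mean = 5, sd = 2)

# Against the standard normal N(0, 1): the sample is clearly shifted
ks_std <- ks.test(x, "pnorm")

# Against a normal with mean 5 and sd 2: extra arguments go to pnorm()
ks_fit <- ks.test(x, "pnorm", mean = 5, sd = 2)

c(standard = ks_std$p.value, matched = ks_fit$p.value)
```

The parameters here are fixed in advance. Estimating them from the same sample and then testing against the fitted distribution is not what the standard test assumes.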

Testing whether two vectors follow the same distribution

    ks.test(rnorm(20), rnorm(30))

    ##
    ## Two-sample Kolmogorov-Smirnov test
    ##
    ## data: rnorm(20) and rnorm(30)
    ## D = 0.2, p-value = 0.4
    ## alternative hypothesis: two-sided

Correlation analysis

    cor.test(mtcars$mpg, mtcars$wt)

    ##
    ## Pearson's product-moment correlation
    ##
    ## data: mtcars$mpg and mtcars$wt
    ## t = -10, df = 30, p-value = 1e-10
    ## alternative hypothesis: true correlation is not equal to 0
    ## 95 percent confidence interval:
    ##  -0.934 -0.744
    ## sample estimates:
    ##    cor
    ## -0.868
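
The t statistic reported for Pearson's correlation follows directly from the correlation coefficient r and the sample size n. The sketch below recomputes it and, as an aside, shows the rank-based Spearman variant selected with the method argument; `t_manual` is an illustrative name.

```r
fit <- cor.test(mtcars$mpg, mtcars$wt)
r <- unname(fit$estimate)                # sample correlation coefficient
n <- nrow(mtcars)

# t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
all.equal(t_manual, unname(fit$statistic))

# Rank-based alternative; exact = FALSE avoids the warning caused by ties
rho <- cor.test(mtcars$mpg, mtcars$wt, method = "spearman", exact = FALSE)
rho$estimate
```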

Distribution-free tests

Location tests

The Wilcoxon tests are the nonparametric counterparts of the t-tests.

    # Rank sum test (two independent samples)
    wilcox.test(1:10, 10:20)

    ## Warning in wilcox.test.default(1:10, 10:20): cannot compute exact p-value with ties

    ##
    ## Wilcoxon rank sum test with continuity correction
    ##
    ## data: 1:10 and 10:20
    ## W = 0.5, p-value = 1e-04
    ## alternative hypothesis: true location shift is not equal to 0

    # Signed rank test (paired samples)
    wilcox.test(1:10, 10:19, paired = TRUE)

    ## Warning in wilcox.test.default(1:10, 10:19, paired = TRUE): cannot compute exact p-value with
    ## ties

    ##
    ## Wilcoxon signed rank test with continuity correction
    ##
    ## data: 1:10 and 10:19
    ## V = 0, p-value = 0.002
    ## alternative hypothesis: true location shift is not equal to 0
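
The rank sum statistic W is the sum of the ranks of the first sample in the pooled data, minus its smallest possible value. A short sketch recomputing W for the first example; `w_manual` is an illustrative name.

```r
x <- 1:10
y <- 10:20

# Rank the pooled sample; the value 10 occurs in both vectors,
# so both copies receive the average rank 10.5
r <- rank(c(x, y))
n_x <- length(x)
w_manual <- sum(r[seq_len(n_x)]) - n_x * (n_x + 1) / 2

fit <- wilcox.test(x, y, exact = FALSE)   # exact = FALSE avoids the ties warning
c(manual = w_manual, wilcox = unname(fit$statistic))   # both are 0.5
```

The tie at 10 is also why the original calls warn that an exact p-value cannot be computed.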

Comparing multiple groups

    # Kruskal-Wallis rank sum test
    kruskal.test(wt ~ factor(cyl), data = mtcars)

    ##
    ## Kruskal-Wallis rank sum test
    ##
    ## data: wt by factor(cyl)
    ## Kruskal-Wallis chi-squared = 23, df = 2, p-value = 1e-05

Variance tests

Use the Fligner-Killeen (median) test to compare variances across groups.

    fligner.test(wt ~ cyl, data = mtcars)

    ##
    ## Fligner-Killeen test of homogeneity of variances
    ##
    ## data: wt by cyl
    ## Fligner-Killeen:med chi-squared = 0.5, df = 2, p-value = 0.8

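Discrete data

Proportion tests

Use prop.test() to test whether an observed proportion equals a hypothesized value (0.5 by default), or whether the proportions in two or more groups differ.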

    heads <- rbinom(1, size = 100, prob = .5)
    prop.test(heads, 100)

    ##
    ## 1-sample proportions test with continuity correction
    ##
    ## data: heads out of 100, null probability 0.5
    ## X-squared = 2, df = 1, p-value = 0.1
    ## alternative hypothesis: true p is not equal to 0.5
    ## 95 percent confidence interval:
    ##  0.477 0.677
    ## sample estimates:
    ##    p
    ## 0.58

    prop.test(heads, 100, p = 0.3, correct = FALSE)

    ##
    ## 1-sample proportions test without continuity correction
    ##
    ## data: heads out of 100, null probability 0.3
    ## X-squared = 37, df = 1, p-value = 1e-09
    ## alternative hypothesis: true p is not equal to 0.3
    ## 95 percent confidence interval:
    ##  0.482 0.672
    ## sample estimates:
    ##    p
    ## 0.58
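
Both calls above are one-sample tests. To compare two groups, pass vectors of success counts and trial counts. The counts below are hypothetical and chosen only for illustration.

```r
# Hypothetical data: 45 successes out of 100 in group A, 60 out of 100 in group B
successes <- c(A = 45, B = 60)
trials    <- c(A = 100, B = 100)

fit <- prop.test(successes, trials)
fit$estimate     # observed proportions: 0.45 and 0.60
fit$p.value      # tests the null hypothesis that both proportions are equal
```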

Binomial test

    binom.test(c(682, 243), p = 3/4)

    ##
    ## Exact binomial test
    ##
    ## data: c(682, 243)
    ## number of successes = 682, number of trials = 925, p-value = 0.4
    ## alternative hypothesis: true probability of success is not equal to 0.75
    ## 95 percent confidence interval:
    ##  0.708 0.765
    ## sample estimates:
    ## probability of success
    ##                  0.737

Contingency tables

These tests determine whether two categorical variables are associated.

Fisher's exact test

    TeaTasting <-
      matrix(c(3, 1, 1, 3),
             nrow = 2,
             dimnames = list(Guess = c("Milk", "Tea"),
                             Truth = c("Milk", "Tea")))
    fisher.test(TeaTasting, alternative = "greater")

    ##
    ## Fisher's Exact Test for Count Data
    ##
    ## data: TeaTasting
    ## p-value = 0.2
    ## alternative hypothesis: true odds ratio is greater than 1
    ## 95 percent confidence interval:
    ##  0.314   Inf
    ## sample estimates:
    ## odds ratio
    ##       6.41

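When the sample is large, use the chi-squared test instead. The tea-tasting table is too small for the chi-squared approximation, which is why the call below produces a warning.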

    chisq.test(TeaTasting)

    ## Warning in chisq.test(TeaTasting): Chi-squared approximation may be incorrect

    ##
    ## Pearson's Chi-squared test with Yates' continuity correction
    ##
    ## data: TeaTasting
    ## X-squared = 0.5, df = 1, p-value = 0.5
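
The warning is triggered by small expected cell counts. The sketch below inspects them through the `expected` component of the result and shows the simulate.p.value option as one possible alternative for small tables; the seed and the number of replicates are arbitrary choices.

```r
TeaTasting <- matrix(c(3, 1, 1, 3), nrow = 2,
                     dimnames = list(Guess = c("Milk", "Tea"),
                                     Truth = c("Milk", "Tea")))

# suppressWarnings() only silences the approximation warning shown above
fit <- suppressWarnings(chisq.test(TeaTasting))
fit$expected              # every expected count is 2, well below 5

# Monte Carlo p-value, which does not rely on the chi-squared approximation
set.seed(1)               # arbitrary seed, only for reproducibility
sim <- chisq.test(TeaTasting, simulate.p.value = TRUE, B = 2000)
sim$p.value
```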

For three-way tables, use the Cochran-Mantel-Haenszel test, which checks whether two variables are associated after stratifying by a third.

    Rabbits <-
      array(c(0, 0, 6, 5,
              3, 0, 3, 6,
              6, 2, 0, 4,
              5, 6, 1, 0,
              2, 5, 0, 0),
            dim = c(2, 2, 5),
            dimnames = list(
              Delay = c("None", "1.5h"),
              Response = c("Cured", "Died"),
              Penicillin.Level = c("1/8", "1/4", "1/2", "1", "4")))
    Rabbits

    ## , , Penicillin.Level = 1/8
    ##
    ##       Response
    ## Delay  Cured Died
    ##   None     0    6
    ##   1.5h     0    5
    ##
    ## , , Penicillin.Level = 1/4
    ##
    ##       Response
    ## Delay  Cured Died
    ##   None     3    3
    ##   1.5h     0    6
    ##
    ## , , Penicillin.Level = 1/2
    ##
    ##       Response
    ## Delay  Cured Died
    ##   None     6    0
    ##   1.5h     2    4
    ##
    ## , , Penicillin.Level = 1
    ##
    ##       Response
    ## Delay  Cured Died
    ##   None     5    1
    ##   1.5h     6    0
    ##
    ## , , Penicillin.Level = 4
    ##
    ##       Response
    ## Delay  Cured Died
    ##   None     2    0
    ##   1.5h     5    0

    mantelhaen.test(Rabbits)

    ##
    ## Mantel-Haenszel chi-squared test with continuity correction
    ##
    ## data: Rabbits
    ## Mantel-Haenszel X-squared = 4, df = 1, p-value = 0.05
    ## alternative hypothesis: true common odds ratio is not equal to 1
    ## 95 percent confidence interval:
    ##   1.03 47.73
    ## sample estimates:
    ## common odds ratio
    ##                 7

Nonparametric tests for tabular data

The Friedman rank sum test is a nonparametric version of two-way ANOVA for unreplicated blocked data.

    ## Hollander & Wolfe (1973), p. 140ff.
    ## Comparison of three methods ("round out", "narrow angle", and
    ## "wide angle") for rounding first base. For each of 22 players
    ## and the three methods, the average time of two runs from a point on
    ## the first base line 35ft from home plate to a point 15ft short of
    ## second base is recorded.
    RoundingTimes <-
      matrix(c(5.40, 5.50, 5.55,
               5.85, 5.70, 5.75,
               5.20, 5.60, 5.50,
               5.55, 5.50, 5.40,
               5.90, 5.85, 5.70,
               5.45, 5.55, 5.60,
               5.40, 5.40, 5.35,
               5.45, 5.50, 5.35,
               5.25, 5.15, 5.00,
               5.85, 5.80, 5.70,
               5.25, 5.20, 5.10,
               5.65, 5.55, 5.45,
               5.60, 5.35, 5.45,
               5.05, 5.00, 4.95,
               5.50, 5.50, 5.40,
               5.45, 5.55, 5.50,
               5.55, 5.55, 5.35,
               5.45, 5.50, 5.55,
               5.50, 5.45, 5.25,
               5.65, 5.60, 5.40,
               5.70, 5.65, 5.55,
               6.30, 6.30, 6.25),
             nrow = 22,
             byrow = TRUE,
             dimnames = list(1 : 22,
                             c("Round Out", "Narrow Angle", "Wide Angle")))
    friedman.test(RoundingTimes)

    ##
    ## Friedman rank sum test
    ##
    ## data: RoundingTimes
    ## Friedman chi-squared = 11, df = 2, p-value = 0.004
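
In the matrix form, rows are blocks (players) and columns are groups (methods). When the data are in long format instead, friedman.test() also accepts separate vectors or a formula of the form response ~ groups | blocks. A sketch using the built-in `warpbreaks` data, averaged so that each wool and tension combination has exactly one value; the names `wb`, `by_vec`, and `by_formula` are illustrative.

```r
# Mean number of warp breaks for each wool type (w) and tension level (t)
wb <- aggregate(warpbreaks$breaks,
                by = list(w = warpbreaks$wool, t = warpbreaks$tension),
                FUN = mean)

# Vector interface: response, groups, blocks
by_vec <- friedman.test(wb$x, wb$w, wb$t)

# Formula interface: response ~ groups | blocks
by_formula <- friedman.test(x ~ w | t, data = wb)

# Both interfaces describe the same test
all.equal(unname(by_vec$statistic), unname(by_formula$statistic))
```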