base

Use summary() to get the basic descriptive statistics directly.

    df <- mtcars[, c("mpg", "cyl", "disp", "am")]
    summary(df)
    # out
    #       mpg             cyl             disp             am
    #  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   :0.0000
    #  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.:0.0000
    #  Median :19.20   Median :6.000   Median :196.3   Median :0.0000
    #  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :0.4062
    #  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:1.0000
    #  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :1.0000

The summary() output does not include skewness or kurtosis.

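Skewness measures the asymmetry of a random variable's probability distribution.
When skewness < 0, the distribution is left-skewed (the longer tail is on the left).
When skewness = 0, the data are spread roughly evenly on both sides of the mean, although the distribution is not necessarily perfectly symmetric.
When skewness > 0, the distribution is right-skewed (the longer tail is on the right).

Kurtosis measures how sharply peaked a random variable's probability distribution is. Data that follow a normal distribution exactly have a kurtosis of 3. The larger the kurtosis, the taller and sharper the peak; the smaller the kurtosis, the flatter the distribution. Many implementations report excess kurtosis (kurtosis minus 3), which is 0 for a normal distribution.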
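
For reference, one common sample form of the two measures is written out here. This is a sketch rather than the only definition: several variants exist that differ in small-sample corrections. The symbols are notation introduced for this note: n is the number of observations, x̄ the sample mean, and s the sample standard deviation with the n - 1 denominator (what R's sd() returns). Subtracting 3 from the kurtosis gives the excess kurtosis, which is 0 for a normal distribution.

```latex
\text{skewness} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s}\right)^{3},
\qquad
\text{excess kurtosis} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s}\right)^{4} - 3
```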

We can compute them by hand:

    # Skewness: mean of the cubed standardized values
    # (sd() uses the n - 1 denominator)
    sapply(df, function(x) {
      sum(((x - mean(x)) / sd(x))^3) / length(x)
    })
    # out
    #        mpg        cyl       disp         am
    #  0.6106550 -0.1746119  0.3816570  0.3640159

    # Excess kurtosis: mean of the fourth power of the standardized values, minus 3
    sapply(df, function(x) {
      (sum(((x - mean(x)) / sd(x))^4) / length(x)) - 3
    })
    # out
    #       mpg       cyl      disp        am
    # -0.372766 -1.762120 -1.207212 -1.924741
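
The two anonymous functions above can be wrapped into reusable helpers, which makes it easy to sanity-check the sign conventions on small vectors. This is a minimal sketch: the names skewness_manual and excess_kurtosis_manual are hypothetical (they do not come from any package), and dropping NA values is an assumption added here.

```r
# Hypothetical helper names, not from any package.
skewness_manual <- function(x) {
  x <- x[!is.na(x)]               # assumption: silently drop missing values
  z <- (x - mean(x)) / sd(x)      # standardize with the sample sd (n - 1)
  mean(z^3)                       # same as sum(z^3) / length(x)
}

excess_kurtosis_manual <- function(x) {
  x <- x[!is.na(x)]
  z <- (x - mean(x)) / sd(x)
  mean(z^4) - 3                   # subtract 3 so a normal distribution gives 0
}

symmetric  <- c(-2, -1, 0, 1, 2)  # symmetric around 0
right_tail <- c(1, 1, 1, 2, 10)   # one large value stretches the right tail

skewness_manual(symmetric)        # approximately 0
skewness_manual(right_tail)       # positive (right-skewed)
sapply(mtcars[, c("mpg", "cyl", "disp", "am")], skewness_manual)
```

The last call should reproduce the skewness values shown above, since the formula is unchanged.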

psych

Use describe() or describeBy() from the psych package to get descriptive statistics.

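    library(psych)
    # Overall statistics, one row per variable
    describe(df)
    # Per-group statistics; group = df$am also works
    describeBy(df, group = "am")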
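
describe() also reports skew and kurtosis columns, so it can be compared with the hand-written formula from the base section. The sketch below assumes the psych package is installed and that describe() is left at its documented default, type = 3, which to my understanding uses the same formula as the manual version. The names manual_skew and cmp are introduced only for this illustration.

```r
library(psych)

df <- mtcars[, c("mpg", "cyl", "disp", "am")]

# describe() returns one row per variable; skew and kurtosis are among the columns.
d <- describe(df)   # assumption: default type = 3 for skew/kurtosis

# Hand-written skewness, same formula as in the base section.
manual_skew <- sapply(df, function(x) sum(((x - mean(x)) / sd(x))^3) / length(x))

# Side-by-side comparison, rounded for display.
cmp <- data.frame(manual = round(manual_skew, 2), psych = round(d$skew, 2))
cmp
```

If the two columns disagree, the type argument of describe() is the first thing to check.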

Grouped descriptive statistics

aggregate

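FUN works best when it returns a single value per group. If it returns several values, aggregate() stores them as a matrix column, which is awkward to work with.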

    # by may contain more than one grouping variable
    aggregate(df, by = list(gr = df$am), FUN = mean)
    # out
    #   gr      mpg      cyl     disp am
    # 1  0 17.14737 6.947368 290.3789  0
    # 2  1 24.39231 5.076923 143.5308  1
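
The comment in the code above says that by can hold several grouping variables. A minimal sketch of that, using the formula interface of aggregate() instead of the by = list(...) form, is given here together with what happens when FUN returns more than one value. The object names by_two, multi and flat are chosen only for this example.

```r
df <- mtcars[, c("mpg", "cyl", "disp", "am")]

# Two grouping variables: one row per observed (am, cyl) combination.
by_two <- aggregate(cbind(mpg, disp) ~ am + cyl, data = df, FUN = mean)
by_two

# FUN returning two values: the result column `mpg` becomes a two-column matrix.
multi <- aggregate(mpg ~ am, data = df,
                   FUN = function(x) c(mean = mean(x), sd = sd(x)))
is.matrix(multi$mpg)

# Flatten the matrix column into ordinary columns: am, mpg.mean, mpg.sd.
flat <- do.call(data.frame, multi)
names(flat)
```

Flattening with do.call(data.frame, ...) is one common workaround; when many statistics per group are needed, dplyr's summarise() is usually more convenient.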

group_by and summarise

    library(dplyr)
    # One row per value of am, with several statistics per group
    df %>% group_by(am) %>% summarise(
      count = n(),
      disp.mean = mean(disp, na.rm = TRUE),
      disp.sd = sd(disp, na.rm = TRUE),
      mpg.mean = mean(mpg, na.rm = TRUE),
      mpg.sd = sd(mpg, na.rm = TRUE)
    )
    # out
    # A tibble: 2 x 6
    #      am count disp.mean disp.sd mpg.mean mpg.sd
    #   <dbl> <int>     <dbl>   <dbl>    <dbl>  <dbl>
    # 1     0    19      290.   110.      17.1   3.83
    # 2     1    13      144.    87.2     24.4   6.17
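
When the same statistics are needed for several columns, writing one line per column and statistic gets repetitive. A possible alternative is across(), sketched below under the assumption that dplyr 1.0.0 or later is installed. The result name res is chosen only for this example, and the output columns are named by across() in the column_function pattern.

```r
library(dplyr)

df <- mtcars[, c("mpg", "cyl", "disp", "am")]

res <- df %>%
  group_by(am) %>%
  summarise(
    count = n(),
    # Apply mean and sd to both columns; names become mpg_mean, mpg_sd, ...
    across(c(mpg, disp), list(mean = mean, sd = sd)),
    .groups = "drop"   # return an ungrouped tibble
  )
names(res)
```

Adding another column or statistic then only requires extending the c(...) selection or the function list.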