base
Use summary() to get the basic statistics directly:
df <- mtcars[, c("mpg", "cyl", "disp", "am")]
summary(df)
# out
      mpg             cyl             disp             am
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   :0.0000
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.:0.0000
 Median :19.20   Median :6.000   Median :196.3   Median :0.0000
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :0.4062
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:1.0000
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :1.0000
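The summary() output does not include skewness or kurtosis.
Skewness measures the asymmetry of a variable's probability distribution. When skewness < 0, the distribution is left-skewed (longer left tail). When skewness = 0, the data are spread roughly evenly on both sides of the mean, although the distribution is not necessarily perfectly symmetric. When skewness > 0, the distribution is right-skewed (longer right tail).
Kurtosis measures how peaked the distribution is. Perfectly normally distributed data have a kurtosis of 3; the larger the value, the taller and sharper the distribution, and the smaller the value, the flatter it is. The kurtosis code below subtracts 3, so it reports excess kurtosis, which is 0 for a normal distribution.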
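Written as formulas, the two quantities computed by the hand-written code in this section are as follows. This is one convention among several: here s is the sample standard deviation with an n - 1 denominator (what R's sd() returns), so other packages that use a different definition may give slightly different numbers.

```latex
\[
\text{skewness} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s}\right)^{3},
\qquad
\text{excess kurtosis} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s}\right)^{4} - 3,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^{2}}
\]
```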
We can compute both by hand:
# Skewness
sapply(df, function(x) {
  sum(((x - mean(x)) / sd(x))^3) / length(x)
})
# out
#        mpg        cyl       disp         am
#  0.6106550 -0.1746119  0.3816570  0.3640159
# Excess kurtosis (kurtosis minus 3)
sapply(df, function(x) {
  (sum(((x - mean(x)) / sd(x))^4) / length(x)) - 3
})
# out
#       mpg       cyl      disp        am
# -0.372766 -1.762120 -1.207212 -1.924741
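If these statistics are needed more than once, the two sapply() calls can be wrapped into small helpers. The sketch below is only one way to do it: the names skew_manual and kurt_manual are made up for illustration (they are not base R functions), and the na.rm argument is an added assumption for handling missing values.

```r
# Hypothetical helpers (not part of base R): same formulas as the
# sapply() calls above, with optional removal of missing values.
skew_manual <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  z <- (x - mean(x)) / sd(x)   # standardise with the sample sd (n - 1)
  mean(z^3)
}

kurt_manual <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  z <- (x - mean(x)) / sd(x)
  mean(z^4) - 3                # subtract 3 so a normal distribution gives 0
}

df <- mtcars[, c("mpg", "cyl", "disp", "am")]

# One row per variable, one column per statistic
shape <- t(sapply(df, function(x) {
  c(skew = skew_manual(x), kurtosis = kurt_manual(x))
}))
shape
```

The values should match the two outputs above, just arranged as a single table.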
psych
Use describe(), or describeBy() for grouped output, to get descriptive statistics:
library(psych)
describe(df)
describeBy(df, group = "am")
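describe() reports many columns at once, so it can help to keep only the ones of interest. The sketch below assumes the psych package is installed. As far as I know, with its default settings the skew and kurtosis columns use the same definitions as the hand-written formulas earlier, so the numbers should agree up to rounding.

```r
library(psych)  # assumes the psych package is installed

df <- mtcars[, c("mpg", "cyl", "disp", "am")]

# describe() returns a data-frame-like object with one row per variable
desc <- as.data.frame(describe(df))

# Keep only a few of the reported columns
desc[, c("n", "mean", "sd", "skew", "kurtosis")]
```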
Grouped descriptions
aggregate
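FUN is best given a function that returns a single value per group; if it returns several values, they are packed into a matrix column instead of separate columns.
# by can hold several grouping variables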
aggregate(df, by = list(gr = df$am), FUN = mean)
# out
#   gr      mpg      cyl     disp am
# 1  0 17.14737 6.947368 290.3789  0
# 2  1 24.39231 5.076923 143.5308  1
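Two variations on the call above may be useful: grouping by more than one variable, and seeing what happens when FUN returns more than one value. This is a minimal sketch using only base R; the object names by_am_cyl, mpg_stats and mpg_flat are made up for illustration.

```r
df <- mtcars[, c("mpg", "cyl", "disp", "am")]

# Two grouping variables via the formula interface
by_am_cyl <- aggregate(cbind(mpg, disp) ~ am + cyl, data = df, FUN = mean)
by_am_cyl

# FUN returning two values: the result holds them in a matrix column "mpg"
mpg_stats <- aggregate(mpg ~ am, data = df,
                       FUN = function(x) c(mean = mean(x), sd = sd(x)))
mpg_stats

# Flatten the matrix column into ordinary columns
mpg_flat <- do.call(data.frame, mpg_stats)
names(mpg_flat)
```

The matrix column prints like separate columns but is a single column internally, which is why the flattening step is handy before further processing.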
group_by and summarize
library(dplyr)
df %>% group_by(am) %>% summarise(
  count = n(),
  disp.mean = mean(disp, na.rm = TRUE),
  disp.sd = sd(disp, na.rm = TRUE),
  mpg.mean = mean(mpg, na.rm = TRUE),
  mpg.sd = sd(mpg, na.rm = TRUE)
)
# out
# A tibble: 2 x 6
#      am count disp.mean disp.sd mpg.mean mpg.sd
#   <dbl> <int>     <dbl>   <dbl>    <dbl>  <dbl>
# 1     0    19      290.   110.      17.1   3.83
# 2     1    13      144.    87.2     24.4   6.17
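When the same statistics are wanted for several columns, writing one line per column and statistic gets repetitive. A possible shortcut, assuming dplyr 1.0.0 or later (where across() is available), is sketched below; the object name by_am and the "{.col}.{.fn}" naming pattern are choices made for this example.

```r
library(dplyr)  # assumes dplyr >= 1.0.0, where across() is available

df <- mtcars[, c("mpg", "cyl", "disp", "am")]

by_am <- df %>%
  group_by(am) %>%
  summarise(
    count = n(),
    # apply mean and sd to both columns; names become e.g. mpg.mean
    across(c(mpg, disp), list(mean = mean, sd = sd), .names = "{.col}.{.fn}")
  )
by_am
```

The result has one row per value of am, like the summary above, with the column names built from the pattern.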