介绍
更多知识分享请到 https://zouhua.top/。dplyr是data manipulation的包,其包含多个处理数据的函数。主要函数有:
- mutate() 添加新变量.
- select() 选择列名.
- filter() 过滤行.
- summarise() 求和统计.
- arrange() 排序.
- group_by() 分组处理.
安装
#install.packages("dplyr")#install.packages("nycflights13")#devtools::install_github("tidyverse/dplyr")library(dplyr)library(nycflights13)
tibbles数据类型
tibbles可以取代data.frame,虽然前者仍然可以认为是数据框类型,在数据处理过程中,tibbles数据类型消耗资源更少,处理速度更快。dplyr的函数可以直接处理tibbles数据类型。
#install.packages("tibble")library(tibble)as_tibble(iris)
## # A tibble: 150 x 5## Sepal.Length Sepal.Width Petal.Length Petal.Width Species## <dbl> <dbl> <dbl> <dbl> <fct>## 1 5.1 3.5 1.4 0.2 setosa## 2 4.9 3 1.4 0.2 setosa## 3 4.7 3.2 1.3 0.2 setosa## 4 4.6 3.1 1.5 0.2 setosa## 5 5 3.6 1.4 0.2 setosa## 6 5.4 3.9 1.7 0.4 setosa## 7 4.6 3.4 1.4 0.3 setosa## 8 5 3.4 1.5 0.2 setosa## 9 4.4 2.9 1.4 0.2 setosa## 10 4.9 3.1 1.5 0.1 setosa## # ... with 140 more rows
tibble(x = 1:5,y = 1,z = x ^ 2 + y)
## # A tibble: 5 x 3## x y z## <int> <dbl> <dbl>## 1 1 1 2## 2 2 1 5## 3 3 1 10## 4 4 1 17## 5 5 1 26
Add new variables with mutate()
添加新的列:新的列一般在数据集的最后一列
flights %>%select(year:day,ends_with("delay"),distance,air_time) %>%mutate(gain = dep_delay - arr_delay,speed = distance / air_time * 60)
## # A tibble: 336,776 x 9## year month day dep_delay arr_delay distance air_time gain speed## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>## 1 2013 1 1 2 11 1400 227 -9 370.## 2 2013 1 1 4 20 1416 227 -16 374.## 3 2013 1 1 2 33 1089 160 -31 408.## 4 2013 1 1 -1 -18 1576 183 17 517.## 5 2013 1 1 -6 -25 762 116 19 394.## 6 2013 1 1 -4 12 719 150 -16 288.## 7 2013 1 1 -5 19 1065 158 -24 404.## 8 2013 1 1 -3 -14 229 53 11 259.## 9 2013 1 1 -3 -8 944 140 5 405.## 10 2013 1 1 -2 8 733 138 -10 319.## # ... with 336,766 more rows
如果只想保留新生成的列,则使用transmute():
flights %>%transmute(gain = dep_delay - arr_delay,hours = air_time / 60,gain_per_hour = gain / hours)
## # A tibble: 336,776 x 3## gain hours gain_per_hour## <dbl> <dbl> <dbl>## 1 -9 3.78 -2.38## 2 -16 3.78 -4.23## 3 -31 2.67 -11.6## 4 17 3.05 5.57## 5 19 1.93 9.83## 6 -16 2.5 -6.4## 7 -24 2.63 -9.11## 8 11 0.883 12.5## 9 5 2.33 2.14## 10 -10 2.3 -4.35## # ... with 336,766 more rows
Select columns with select()
筛选列,可以给出确切列名,也可通过函数匹配列名:
- starts_with(“abc”): matches names that begin with “abc”.
- ends_with(“xyz”): matches names that end with “xyz”.
- contains(“ijk”): matches names that contain “ijk”.
- matches(“(.)\1”): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
- num_range(“x”, 1:3): matches x1, x2 and x3.
flights %>%select(ends_with("time"))
## # A tibble: 336,776 x 5## dep_time sched_dep_time arr_time sched_arr_time air_time## <int> <int> <int> <int> <dbl>## 1 517 515 830 819 227## 2 533 529 850 830 227## 3 542 540 923 850 160## 4 544 545 1004 1022 183## 5 554 600 812 837 116## 6 554 558 740 728 150## 7 555 600 913 854 158## 8 557 600 709 723 53## 9 557 600 838 846 140## 10 558 600 753 745 138## # ... with 336,766 more rows
everything()选择所有剩余列名(除已选择列名外),可以将某些关心的列排在前面
flights %>%select(time_hour, air_time, everything())
## # A tibble: 336,776 x 19## time_hour air_time year month day dep_time sched_dep_time dep_delay arr_time## <dttm> <dbl> <int> <int> <int> <int> <int> <dbl> <int>## 1 2013-01-01 05:00:00 227 2013 1 1 517 515 2 830## 2 2013-01-01 05:00:00 227 2013 1 1 533 529 4 850## 3 2013-01-01 05:00:00 160 2013 1 1 542 540 2 923## 4 2013-01-01 05:00:00 183 2013 1 1 544 545 -1 1004## 5 2013-01-01 06:00:00 116 2013 1 1 554 600 -6 812## 6 2013-01-01 05:00:00 150 2013 1 1 554 558 -4 740## 7 2013-01-01 06:00:00 158 2013 1 1 555 600 -5 913## 8 2013-01-01 06:00:00 53 2013 1 1 557 600 -3 709## 9 2013-01-01 06:00:00 140 2013 1 1 557 600 -3 838## 10 2013-01-01 06:00:00 138 2013 1 1 558 600 -2 753## # ... with 336,766 more rows, and 10 more variables: sched_arr_time <int>, arr_delay <dbl>,## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,## # hour <dbl>, minute <dbl>
Filter rows with filter
根据阈值筛选数据的行: 多个筛选条件,可以通过,链接。判断条件可以是逻辑运算符,如 >, < != 等。
flights %>%filter(month == 1, day == 1)
## # A tibble: 842 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>## 1 2013 1 1 517 515 2 830 819 11 UA## 2 2013 1 1 533 529 4 850 830 20 UA## 3 2013 1 1 542 540 2 923 850 33 AA## 4 2013 1 1 544 545 -1 1004 1022 -18 B6## 5 2013 1 1 554 600 -6 812 837 -25 DL## 6 2013 1 1 554 558 -4 740 728 12 UA## 7 2013 1 1 555 600 -5 913 854 19 B6## 8 2013 1 1 557 600 -3 709 723 -14 EV## 9 2013 1 1 557 600 -3 838 846 -8 B6## 10 2013 1 1 558 600 -2 753 745 8 AA## # ... with 832 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Grouped summaries with summarise()
summarise()单独使用时,直接对数据集加和
flights %>%summarise(delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1## delay## <dbl>## 1 12.6
summarise()结合group_by()使用:分组求和,过滤并画图(使用 %>% 管道符)
library(ggplot2)# 分组flights %>% group_by(dest) %>%# 每组均值summarise(count = n(),dist = mean(distance, na.rm = TRUE), # na.rm=TRUE移除NA值delay = mean(arr_delay, na.rm = TRUE)) %>%# 解除分组ungroup() %>%# 过滤filter(count > 20, dest != "HNL") %>%# 画图ggplot(aes(x = dist, y = delay)) +geom_point(aes(size = count), alpha = 1/3) +geom_smooth(se = FALSE) +theme_bw()

除了mean()函数外,还有其他summary函数:
- Measures of Location:
mean(x),median(x). - Measures of spread:
sd(x),IQR(x),mad(x). - Measures of rank:
min(x),quantile(x, 0.25),max(x). - Measures of position:
first(x),nth(x, 2),last(x).
flights %>%filter(!is.na(dep_delay), !is.na(arr_delay)) %>%group_by(year, month, day) %>%summarise(first_dep = first(dep_time),last_dep = last(dep_time))
## # A tibble: 365 x 5## # Groups: year, month [12]## year month day first_dep last_dep## <int> <int> <int> <int> <int>## 1 2013 1 1 517 2356## 2 2013 1 2 42 2354## 3 2013 1 3 32 2349## 4 2013 1 4 25 2358## 5 2013 1 5 14 2357## 6 2013 1 6 16 2355## 7 2013 1 7 49 2359## 8 2013 1 8 454 2351## 9 2013 1 9 2 2252## 10 2013 1 10 3 2320## # ... with 355 more rows
- Counts :
n()简单计数加和延误航班次数,例如D942DN有四次延误记录。
flights %>%filter(!is.na(dep_delay), !is.na(arr_delay)) %>%group_by(tailnum) %>%summarise(delay = mean(arr_delay),n = n())
## # A tibble: 4,037 x 3## tailnum delay n## <chr> <dbl> <int>## 1 D942DN 31.5 4## 2 N0EGMQ 9.98 352## 3 N10156 12.7 145## 4 N102UW 2.94 48## 5 N103US -6.93 46## 6 N104UW 1.80 46## 7 N10575 20.7 269## 8 N105UW -0.267 45## 9 N107US -5.73 41## 10 N108UW -1.25 60## # ... with 4,027 more rows
flights %>% filter(tailnum == "D942DN")
## # A tibble: 4 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>## 1 2013 2 11 1508 1400 68 1807 1636 91 DL## 2 2013 3 23 1340 1300 40 1638 1554 44 DL## 3 2013 3 24 859 835 24 1142 1140 2 DL## 4 2013 7 5 1253 1259 -6 1518 1529 -11 DL## # ... with 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
统计非NA值使用sum(!is.na(x)),统计unique值使用n_distinct(x).
flights %>%filter(!is.na(dep_delay), !is.na(arr_delay)) %>%group_by(dest) %>%summarise(carriers_unique = n_distinct(carrier),carrriers_nona = sum(!is.na(carrier)),carrriers_n = n()) %>%arrange(desc(carriers_unique)) %>%ungroup()
## # A tibble: 104 x 4## dest carriers_unique carrriers_nona carrriers_n## <chr> <int> <int> <int>## 1 ATL 7 16837 16837## 2 BOS 7 15022 15022## 3 CLT 7 13674 13674## 4 ORD 7 16566 16566## 5 TPA 7 7390 7390## 6 AUS 6 2411 2411## 7 DCA 6 9111 9111## 8 DTW 6 9031 9031## 9 IAD 6 5383 5383## 10 MSP 6 6929 6929## # ... with 94 more rows
flights %>%filter(!is.na(dep_delay), !is.na(arr_delay)) %>%count(dest)
## # A tibble: 104 x 2## dest n## <chr> <int>## 1 ABQ 254## 2 ACK 264## 3 ALB 418## 4 ANC 8## 5 ATL 16837## 6 AUS 2411## 7 AVL 261## 8 BDL 412## 9 BGR 358## 10 BHM 269## # ... with 94 more rows
Grouping by multiple variables
根据多个变量分组计算
flights %>%group_by(year, month, day) %>%summarise(flights = n())
## # A tibble: 365 x 4## # Groups: year, month [12]## year month day flights## <int> <int> <int> <int>## 1 2013 1 1 842## 2 2013 1 2 943## 3 2013 1 3 914## 4 2013 1 4 915## 5 2013 1 5 720## 6 2013 1 6 832## 7 2013 1 7 933## 8 2013 1 8 899## 9 2013 1 9 902## 10 2013 1 10 932## # ... with 355 more rows
Arrange rows with arrange()
按照从大到小对行排序:desc(rownames)
flights %>%arrange(year, desc(month), day)
## # A tibble: 336,776 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>## 1 2013 12 1 13 2359 14 446 445 1 B6## 2 2013 12 1 17 2359 18 443 437 6 B6## 3 2013 12 1 453 500 -7 636 651 -15 US## 4 2013 12 1 520 515 5 749 808 -19 UA## 5 2013 12 1 536 540 -4 845 850 -5 AA## 6 2013 12 1 540 550 -10 1005 1027 -22 B6## 7 2013 12 1 541 545 -4 734 755 -21 EV## 8 2013 12 1 546 545 1 826 835 -9 UA## 9 2013 12 1 549 600 -11 648 659 -11 US## 10 2013 12 1 550 600 -10 825 854 -29 B6## # ... with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
across()
对多个列做同一个操作,可以使用across()处理
flights %>% group_by(dest) %>%summarise(dist = mean(distance, na.rm = TRUE),delay = mean(arr_delay, na.rm = TRUE)) %>%ungroup()
## # A tibble: 105 x 3## dest dist delay## <chr> <dbl> <dbl>## 1 ABQ 1826 4.38## 2 ACK 199 4.85## 3 ALB 143 14.4## 4 ANC 3370 -2.5## 5 ATL 757. 11.3## 6 AUS 1514. 6.02## 7 AVL 584. 8.00## 8 BDL 116 7.05## 9 BGR 378 8.03## 10 BHM 866. 16.9## # ... with 95 more rows
flights %>% select(dest, distance, arr_delay) %>%group_by(dest) %>%summarise(across(where(is.numeric), mean, na.rm = TRUE)) %>%ungroup()
## # A tibble: 105 x 3## dest distance arr_delay## <chr> <dbl> <dbl>## 1 ABQ 1826 4.38## 2 ACK 199 4.85## 3 ALB 143 14.4## 4 ANC 3370 -2.5## 5 ATL 757. 11.3## 6 AUS 1514. 6.02## 7 AVL 584. 8.00## 8 BDL 116 7.05## 9 BGR 378 8.03## 10 BHM 866. 16.9## # ... with 95 more rows
across() 常用场景
df %>% mutate_if(is.numeric, mean, na.rm = TRUE)# ->df %>% mutate(across(where(is.numeric), mean, na.rm = TRUE))df %>% mutate_at(vars(x, starts_with("y")), mean, na.rm = TRUE)# ->df %>% mutate(across(c(x, starts_with("y")), mean, na.rm = TRUE))df %>% mutate_all(mean, na.rm = TRUE)# ->df %>% mutate(across(everything(), mean, na.rm = TRUE))
R information
sessionInfo()
## R version 4.0.2 (2020-06-22)## Platform: x86_64-w64-mingw32/x64 (64-bit)## Running under: Windows 10 x64 (build 19042)#### Matrix products: default#### locale:## [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252## [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C## [5] LC_TIME=English_United States.1252## system code page: 936#### attached base packages:## [1] stats graphics grDevices utils datasets methods base#### other attached packages:## [1] ggplot2_3.3.2 tibble_3.0.3 dplyr_1.0.2 nycflights13_1.0.1#### loaded via a namespace (and not attached):## [1] pillar_1.4.6 compiler_4.0.2 highr_0.8 tools_4.0.2 digest_0.6.25## [6] jsonlite_1.7.1 evaluate_0.14 lifecycle_0.2.0 gtable_0.3.0 nlme_3.1-150## [11] lattice_0.20-41 mgcv_1.8-33 pkgconfig_2.0.3 rlang_0.4.7 Matrix_1.2-18## [16] cli_2.1.0 yaml_2.2.1 xfun_0.18 withr_2.3.0 stringr_1.4.0## [21] knitr_1.30 generics_0.1.0 vctrs_0.3.4 grid_4.0.2 tidyselect_1.1.0## [26] glue_1.4.2 R6_2.5.0 fansi_0.4.1 rmarkdown_2.5 purrr_0.3.4## [31] farver_2.0.3 magrittr_1.5 scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.0## [36] splines_4.0.2 assertthat_0.2.1 colorspace_1.4-1 labeling_0.4.2 utf8_1.1.4## [41] stringi_1.5.3 munsell_0.5.0 crayon_1.3.4
引用
参考文章如引起任何侵权问题,可以与我联系,谢谢。
