介绍

更多知识分享请到 https://zouhua.top/。dplyr是data manipulation的包,其包含多个处理数据的函数。主要函数有:

  • mutate() 添加新变量.
  • select() 选择列名.
  • filter() 过滤行.
  • summarise() 求和统计.
  • arrange() 排序.
  • group_by() 分组处理.

安装

  1. #install.packages("dplyr")
  2. #install.packages("nycflights13")
  3. #devtools::install_github("tidyverse/dplyr")
  4. library(dplyr)
  5. library(nycflights13)

tibbles数据类型

tibbles可以取代data.frame,虽然前者仍然可以认为是数据框类型,在数据处理过程中,tibbles数据类型消耗资源更少,处理速度更快。dplyr的函数可以直接处理tibbles数据类型。

  1. #install.packages("tibble")
  2. library(tibble)
  3. as_tibble(iris)
  1. ## # A tibble: 150 x 5
  2. ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
  3. ## <dbl> <dbl> <dbl> <dbl> <fct>
  4. ## 1 5.1 3.5 1.4 0.2 setosa
  5. ## 2 4.9 3 1.4 0.2 setosa
  6. ## 3 4.7 3.2 1.3 0.2 setosa
  7. ## 4 4.6 3.1 1.5 0.2 setosa
  8. ## 5 5 3.6 1.4 0.2 setosa
  9. ## 6 5.4 3.9 1.7 0.4 setosa
  10. ## 7 4.6 3.4 1.4 0.3 setosa
  11. ## 8 5 3.4 1.5 0.2 setosa
  12. ## 9 4.4 2.9 1.4 0.2 setosa
  13. ## 10 4.9 3.1 1.5 0.1 setosa
  14. ## # ... with 140 more rows
  1. tibble(
  2. x = 1:5,
  3. y = 1,
  4. z = x ^ 2 + y
  5. )
  1. ## # A tibble: 5 x 3
  2. ## x y z
  3. ## <int> <dbl> <dbl>
  4. ## 1 1 1 2
  5. ## 2 2 1 5
  6. ## 3 3 1 10
  7. ## 4 4 1 17
  8. ## 5 5 1 26

Add new variables with mutate()

添加新的列:新的列一般在数据集的最后一列

  1. flights %>%
  2. select(
  3. year:day,
  4. ends_with("delay"),
  5. distance,
  6. air_time) %>%
  7. mutate(gain = dep_delay - arr_delay,
  8. speed = distance / air_time * 60
  9. )
  1. ## # A tibble: 336,776 x 9
  2. ## year month day dep_delay arr_delay distance air_time gain speed
  3. ## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
  4. ## 1 2013 1 1 2 11 1400 227 -9 370.
  5. ## 2 2013 1 1 4 20 1416 227 -16 374.
  6. ## 3 2013 1 1 2 33 1089 160 -31 408.
  7. ## 4 2013 1 1 -1 -18 1576 183 17 517.
  8. ## 5 2013 1 1 -6 -25 762 116 19 394.
  9. ## 6 2013 1 1 -4 12 719 150 -16 288.
  10. ## 7 2013 1 1 -5 19 1065 158 -24 404.
  11. ## 8 2013 1 1 -3 -14 229 53 11 259.
  12. ## 9 2013 1 1 -3 -8 944 140 5 405.
  13. ## 10 2013 1 1 -2 8 733 138 -10 319.
  14. ## # ... with 336,766 more rows

如果只想保留新生成的列,则使用transmute():

  1. flights %>%
  2. transmute(gain = dep_delay - arr_delay,
  3. hours = air_time / 60,
  4. gain_per_hour = gain / hours)
  1. ## # A tibble: 336,776 x 3
  2. ## gain hours gain_per_hour
  3. ## <dbl> <dbl> <dbl>
  4. ## 1 -9 3.78 -2.38
  5. ## 2 -16 3.78 -4.23
  6. ## 3 -31 2.67 -11.6
  7. ## 4 17 3.05 5.57
  8. ## 5 19 1.93 9.83
  9. ## 6 -16 2.5 -6.4
  10. ## 7 -24 2.63 -9.11
  11. ## 8 11 0.883 12.5
  12. ## 9 5 2.33 2.14
  13. ## 10 -10 2.3 -4.35
  14. ## # ... with 336,766 more rows

Select columns with select()

筛选列,可以给出确切列名,也可通过函数匹配列名:

  • starts_with(“abc”): matches names that begin with “abc”.
  • ends_with(“xyz”): matches names that end with “xyz”.
  • contains(“ijk”): matches names that contain “ijk”.
  • matches(“(.)\1”): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
  • num_range(“x”, 1:3): matches x1, x2 and x3.
  1. flights %>%
  2. select(ends_with("time"))
  1. ## # A tibble: 336,776 x 5
  2. ## dep_time sched_dep_time arr_time sched_arr_time air_time
  3. ## <int> <int> <int> <int> <dbl>
  4. ## 1 517 515 830 819 227
  5. ## 2 533 529 850 830 227
  6. ## 3 542 540 923 850 160
  7. ## 4 544 545 1004 1022 183
  8. ## 5 554 600 812 837 116
  9. ## 6 554 558 740 728 150
  10. ## 7 555 600 913 854 158
  11. ## 8 557 600 709 723 53
  12. ## 9 557 600 838 846 140
  13. ## 10 558 600 753 745 138
  14. ## # ... with 336,766 more rows

everything()选择所有剩余列名(除已选择列名外),可以将某些关心的列排在前面

  1. flights %>%
  2. select(time_hour, air_time, everything())
  1. ## # A tibble: 336,776 x 19
  2. ## time_hour air_time year month day dep_time sched_dep_time dep_delay arr_time
  3. ## <dttm> <dbl> <int> <int> <int> <int> <int> <dbl> <int>
  4. ## 1 2013-01-01 05:00:00 227 2013 1 1 517 515 2 830
  5. ## 2 2013-01-01 05:00:00 227 2013 1 1 533 529 4 850
  6. ## 3 2013-01-01 05:00:00 160 2013 1 1 542 540 2 923
  7. ## 4 2013-01-01 05:00:00 183 2013 1 1 544 545 -1 1004
  8. ## 5 2013-01-01 06:00:00 116 2013 1 1 554 600 -6 812
  9. ## 6 2013-01-01 05:00:00 150 2013 1 1 554 558 -4 740
  10. ## 7 2013-01-01 06:00:00 158 2013 1 1 555 600 -5 913
  11. ## 8 2013-01-01 06:00:00 53 2013 1 1 557 600 -3 709
  12. ## 9 2013-01-01 06:00:00 140 2013 1 1 557 600 -3 838
  13. ## 10 2013-01-01 06:00:00 138 2013 1 1 558 600 -2 753
  14. ## # ... with 336,766 more rows, and 10 more variables: sched_arr_time <int>, arr_delay <dbl>,
  15. ## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
  16. ## # hour <dbl>, minute <dbl>

Filter rows with filter

根据阈值筛选数据的行: 多个筛选条件,可以通过,链接。判断条件可以是逻辑运算符,如 >, < != 等。

  1. flights %>%
  2. filter(month == 1, day == 1)
  1. ## # A tibble: 842 x 19
  2. ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
  3. ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
  4. ## 1 2013 1 1 517 515 2 830 819 11 UA
  5. ## 2 2013 1 1 533 529 4 850 830 20 UA
  6. ## 3 2013 1 1 542 540 2 923 850 33 AA
  7. ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
  8. ## 5 2013 1 1 554 600 -6 812 837 -25 DL
  9. ## 6 2013 1 1 554 558 -4 740 728 12 UA
  10. ## 7 2013 1 1 555 600 -5 913 854 19 B6
  11. ## 8 2013 1 1 557 600 -3 709 723 -14 EV
  12. ## 9 2013 1 1 557 600 -3 838 846 -8 B6
  13. ## 10 2013 1 1 558 600 -2 753 745 8 AA
  14. ## # ... with 832 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
  15. ## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Grouped summaries with summarise()

summarise()单独使用时,直接对数据集加和

  1. flights %>%
  2. summarise(delay = mean(dep_delay, na.rm = TRUE))
  1. ## # A tibble: 1 x 1
  2. ## delay
  3. ## <dbl>
  4. ## 1 12.6

summarise()结合group_by()使用:分组求和,过滤并画图(使用 %>% 管道符)

  1. library(ggplot2)
  2. # 分组
  3. flights %>% group_by(dest) %>%
  4. # 每组均值
  5. summarise(count = n(),
  6. dist = mean(distance, na.rm = TRUE), # na.rm=TRUE移除NA值
  7. delay = mean(arr_delay, na.rm = TRUE)) %>%
  8. # 解除分组
  9. ungroup() %>%
  10. # 过滤
  11. filter(count > 20, dest != "HNL") %>%
  12. # 画图
  13. ggplot(aes(x = dist, y = delay)) +
  14. geom_point(aes(size = count), alpha = 1/3) +
  15. geom_smooth(se = FALSE) +
  16. theme_bw()

R包:dplyr包数据塑形利器 - 图1

除了mean()函数外,还有其他summary函数:

  • Measures of Location: mean(x), median(x).
  • Measures of spread: sd(x), IQR(x), mad(x).
  • Measures of rank: min(x), quantile(x, 0.25), max(x).
  • Measures of position: first(x), nth(x, 2), last(x).
  1. flights %>%
  2. filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
  3. group_by(year, month, day) %>%
  4. summarise(
  5. first_dep = first(dep_time),
  6. last_dep = last(dep_time)
  7. )
  1. ## # A tibble: 365 x 5
  2. ## # Groups: year, month [12]
  3. ## year month day first_dep last_dep
  4. ## <int> <int> <int> <int> <int>
  5. ## 1 2013 1 1 517 2356
  6. ## 2 2013 1 2 42 2354
  7. ## 3 2013 1 3 32 2349
  8. ## 4 2013 1 4 25 2358
  9. ## 5 2013 1 5 14 2357
  10. ## 6 2013 1 6 16 2355
  11. ## 7 2013 1 7 49 2359
  12. ## 8 2013 1 8 454 2351
  13. ## 9 2013 1 9 2 2252
  14. ## 10 2013 1 10 3 2320
  15. ## # ... with 355 more rows
  • Counts : n()简单计数加和延误航班次数,例如D942DN有四次延误记录。
  1. flights %>%
  2. filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
  3. group_by(tailnum) %>%
  4. summarise(
  5. delay = mean(arr_delay),
  6. n = n())
  1. ## # A tibble: 4,037 x 3
  2. ## tailnum delay n
  3. ## <chr> <dbl> <int>
  4. ## 1 D942DN 31.5 4
  5. ## 2 N0EGMQ 9.98 352
  6. ## 3 N10156 12.7 145
  7. ## 4 N102UW 2.94 48
  8. ## 5 N103US -6.93 46
  9. ## 6 N104UW 1.80 46
  10. ## 7 N10575 20.7 269
  11. ## 8 N105UW -0.267 45
  12. ## 9 N107US -5.73 41
  13. ## 10 N108UW -1.25 60
  14. ## # ... with 4,027 more rows
  1. flights %>% filter(tailnum == "D942DN")
  1. ## # A tibble: 4 x 19
  2. ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
  3. ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
  4. ## 1 2013 2 11 1508 1400 68 1807 1636 91 DL
  5. ## 2 2013 3 23 1340 1300 40 1638 1554 44 DL
  6. ## 3 2013 3 24 859 835 24 1142 1140 2 DL
  7. ## 4 2013 7 5 1253 1259 -6 1518 1529 -11 DL
  8. ## # ... with 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
  9. ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

统计非NA值使用sum(!is.na(x)),统计unique值使用n_distinct(x).

  1. flights %>%
  2. filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
  3. group_by(dest) %>%
  4. summarise(carriers_unique = n_distinct(carrier),
  5. carrriers_nona = sum(!is.na(carrier)),
  6. carrriers_n = n()) %>%
  7. arrange(desc(carriers_unique)) %>%
  8. ungroup()
  1. ## # A tibble: 104 x 4
  2. ## dest carriers_unique carrriers_nona carrriers_n
  3. ## <chr> <int> <int> <int>
  4. ## 1 ATL 7 16837 16837
  5. ## 2 BOS 7 15022 15022
  6. ## 3 CLT 7 13674 13674
  7. ## 4 ORD 7 16566 16566
  8. ## 5 TPA 7 7390 7390
  9. ## 6 AUS 6 2411 2411
  10. ## 7 DCA 6 9111 9111
  11. ## 8 DTW 6 9031 9031
  12. ## 9 IAD 6 5383 5383
  13. ## 10 MSP 6 6929 6929
  14. ## # ... with 94 more rows
  1. flights %>%
  2. filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
  3. count(dest)
  1. ## # A tibble: 104 x 2
  2. ## dest n
  3. ## <chr> <int>
  4. ## 1 ABQ 254
  5. ## 2 ACK 264
  6. ## 3 ALB 418
  7. ## 4 ANC 8
  8. ## 5 ATL 16837
  9. ## 6 AUS 2411
  10. ## 7 AVL 261
  11. ## 8 BDL 412
  12. ## 9 BGR 358
  13. ## 10 BHM 269
  14. ## # ... with 94 more rows

Grouping by multiple variables

根据多个变量分组计算

  1. flights %>%
  2. group_by(year, month, day) %>%
  3. summarise(flights = n())
  1. ## # A tibble: 365 x 4
  2. ## # Groups: year, month [12]
  3. ## year month day flights
  4. ## <int> <int> <int> <int>
  5. ## 1 2013 1 1 842
  6. ## 2 2013 1 2 943
  7. ## 3 2013 1 3 914
  8. ## 4 2013 1 4 915
  9. ## 5 2013 1 5 720
  10. ## 6 2013 1 6 832
  11. ## 7 2013 1 7 933
  12. ## 8 2013 1 8 899
  13. ## 9 2013 1 9 902
  14. ## 10 2013 1 10 932
  15. ## # ... with 355 more rows

Arrange rows with arrange()

按照从大到小对行排序:desc(rownames)

  1. flights %>%
  2. arrange(year, desc(month), day)
  1. ## # A tibble: 336,776 x 19
  2. ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
  3. ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
  4. ## 1 2013 12 1 13 2359 14 446 445 1 B6
  5. ## 2 2013 12 1 17 2359 18 443 437 6 B6
  6. ## 3 2013 12 1 453 500 -7 636 651 -15 US
  7. ## 4 2013 12 1 520 515 5 749 808 -19 UA
  8. ## 5 2013 12 1 536 540 -4 845 850 -5 AA
  9. ## 6 2013 12 1 540 550 -10 1005 1027 -22 B6
  10. ## 7 2013 12 1 541 545 -4 734 755 -21 EV
  11. ## 8 2013 12 1 546 545 1 826 835 -9 UA
  12. ## 9 2013 12 1 549 600 -11 648 659 -11 US
  13. ## 10 2013 12 1 550 600 -10 825 854 -29 B6
  14. ## # ... with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
  15. ## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

across()

对多个列做同一个操作,可以使用across()处理

  1. flights %>% group_by(dest) %>%
  2. summarise(dist = mean(distance, na.rm = TRUE),
  3. delay = mean(arr_delay, na.rm = TRUE)) %>%
  4. ungroup()
  1. ## # A tibble: 105 x 3
  2. ## dest dist delay
  3. ## <chr> <dbl> <dbl>
  4. ## 1 ABQ 1826 4.38
  5. ## 2 ACK 199 4.85
  6. ## 3 ALB 143 14.4
  7. ## 4 ANC 3370 -2.5
  8. ## 5 ATL 757. 11.3
  9. ## 6 AUS 1514. 6.02
  10. ## 7 AVL 584. 8.00
  11. ## 8 BDL 116 7.05
  12. ## 9 BGR 378 8.03
  13. ## 10 BHM 866. 16.9
  14. ## # ... with 95 more rows
  1. flights %>% select(dest, distance, arr_delay) %>%
  2. group_by(dest) %>%
  3. summarise(across(where(is.numeric), mean, na.rm = TRUE)) %>%
  4. ungroup()
  1. ## # A tibble: 105 x 3
  2. ## dest distance arr_delay
  3. ## <chr> <dbl> <dbl>
  4. ## 1 ABQ 1826 4.38
  5. ## 2 ACK 199 4.85
  6. ## 3 ALB 143 14.4
  7. ## 4 ANC 3370 -2.5
  8. ## 5 ATL 757. 11.3
  9. ## 6 AUS 1514. 6.02
  10. ## 7 AVL 584. 8.00
  11. ## 8 BDL 116 7.05
  12. ## 9 BGR 378 8.03
  13. ## 10 BHM 866. 16.9
  14. ## # ... with 95 more rows

across() 常用场景

  1. df %>% mutate_if(is.numeric, mean, na.rm = TRUE)
  2. # ->
  3. df %>% mutate(across(where(is.numeric), mean, na.rm = TRUE))
  4. df %>% mutate_at(vars(x, starts_with("y")), mean, na.rm = TRUE)
  5. # ->
  6. df %>% mutate(across(c(x, starts_with("y")), mean, na.rm = TRUE))
  7. df %>% mutate_all(mean, na.rm = TRUE)
  8. # ->
  9. df %>% mutate(across(everything(), mean, na.rm = TRUE))

R information

  1. sessionInfo()
  1. ## R version 4.0.2 (2020-06-22)
  2. ## Platform: x86_64-w64-mingw32/x64 (64-bit)
  3. ## Running under: Windows 10 x64 (build 19042)
  4. ##
  5. ## Matrix products: default
  6. ##
  7. ## locale:
  8. ## [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
  9. ## [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
  10. ## [5] LC_TIME=English_United States.1252
  11. ## system code page: 936
  12. ##
  13. ## attached base packages:
  14. ## [1] stats graphics grDevices utils datasets methods base
  15. ##
  16. ## other attached packages:
  17. ## [1] ggplot2_3.3.2 tibble_3.0.3 dplyr_1.0.2 nycflights13_1.0.1
  18. ##
  19. ## loaded via a namespace (and not attached):
  20. ## [1] pillar_1.4.6 compiler_4.0.2 highr_0.8 tools_4.0.2 digest_0.6.25
  21. ## [6] jsonlite_1.7.1 evaluate_0.14 lifecycle_0.2.0 gtable_0.3.0 nlme_3.1-150
  22. ## [11] lattice_0.20-41 mgcv_1.8-33 pkgconfig_2.0.3 rlang_0.4.7 Matrix_1.2-18
  23. ## [16] cli_2.1.0 yaml_2.2.1 xfun_0.18 withr_2.3.0 stringr_1.4.0
  24. ## [21] knitr_1.30 generics_0.1.0 vctrs_0.3.4 grid_4.0.2 tidyselect_1.1.0
  25. ## [26] glue_1.4.2 R6_2.5.0 fansi_0.4.1 rmarkdown_2.5 purrr_0.3.4
  26. ## [31] farver_2.0.3 magrittr_1.5 scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.0
  27. ## [36] splines_4.0.2 assertthat_0.2.1 colorspace_1.4-1 labeling_0.4.2 utf8_1.1.4
  28. ## [41] stringi_1.5.3 munsell_0.5.0 crayon_1.3.4

引用

  1. dplyr
  2. R for Data Science
  3. dplyr across

参考文章如引起任何侵权问题,可以与我联系,谢谢。