dplyr

dplyrtidyverse数据科学包中的重要成员,提供了许多数据操控(data manipulation)的语法,用于快速处理数据。

Verbs 语法

dplyr包中,提供了一致的动词语法,帮助解决最常见的数据操作:

  • filter(): 根据值筛选数据
  • arrange(): 数据排序
  • select()rename(): 根据名字选择变量
  • mutate()transmute(): 修改数据
  • summarise(): 数据摘要
  • sample_n()sample_frac(): 随机选择样本

filter()

  1. library(nycflights13)
  2. dim(flights)
  1. ## [1] 336776 19
  1. flights
  1. ## # A tibble: 336,776 x 19
  2. ## year month day dep_time sched_dep_time dep_delay
  3. ## <int> <int> <int> <int> <int> <dbl>
  4. ## 1 2013 1 1 517 515 2
  5. ## 2 2013 1 1 533 529 4
  6. ## 3 2013 1 1 542 540 2
  7. ## 4 2013 1 1 544 545 -1
  8. ## 5 2013 1 1 554 600 -6
  9. ## 6 2013 1 1 554 558 -4
  10. ## 7 2013 1 1 555 600 -5
  11. ## 8 2013 1 1 557 600 -3
  12. ## 9 2013 1 1 557 600 -3
  13. ## 10 2013 1 1 558 600 -2
  14. ## # ... with 336,766 more rows, and 13 more variables:
  15. ## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>,
  16. ## # carrier <chr>, flight <int>, tailnum <chr>,
  17. ## # origin <chr>, dest <chr>, air_time <dbl>,
  18. ## # distance <dbl>, hour <dbl>, minute <dbl>,
  19. ## # time_hour <dttm>

根据条件筛选行:

  1. filter(flights, month == 1, day == 1)
  1. ## # A tibble: 842 x 19
  2. ## year month day dep_time sched_dep_time dep_delay
  3. ## <int> <int> <int> <int> <int> <dbl>
  4. ## 1 2013 1 1 517 515 2
  5. ## 2 2013 1 1 533 529 4
  6. ## 3 2013 1 1 542 540 2
  7. ## 4 2013 1 1 544 545 -1
  8. ## 5 2013 1 1 554 600 -6
  9. ## 6 2013 1 1 554 558 -4
  10. ## 7 2013 1 1 555 600 -5
  11. ## 8 2013 1 1 557 600 -3
  12. ## 9 2013 1 1 557 600 -3
  13. ## 10 2013 1 1 558 600 -2
  14. ## # ... with 832 more rows, and 13 more variables:
  15. ## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>,
  16. ## # carrier <chr>, flight <int>, tailnum <chr>,
  17. ## # origin <chr>, dest <chr>, air_time <dbl>,
  18. ## # distance <dbl>, hour <dbl>, minute <dbl>,
  19. ## # time_hour <dttm>

这等价于flights[flights$month == 1 & flights$day == 1, ]

arrange()

数据排序

  1. arrange(flights, year, month, day)
  1. ## # A tibble: 336,776 x 19
  2. ## year month day dep_time sched_dep_time dep_delay
  3. ## <int> <int> <int> <int> <int> <dbl>
  4. ## 1 2013 1 1 517 515 2
  5. ## 2 2013 1 1 533 529 4
  6. ## 3 2013 1 1 542 540 2
  7. ## 4 2013 1 1 544 545 -1
  8. ## 5 2013 1 1 554 600 -6
  9. ## 6 2013 1 1 554 558 -4
  10. ## 7 2013 1 1 555 600 -5
  11. ## 8 2013 1 1 557 600 -3
  12. ## 9 2013 1 1 557 600 -3
  13. ## 10 2013 1 1 558 600 -2
  14. ## # ... with 336,766 more rows, and 13 more variables:
  15. ## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>,
  16. ## # carrier <chr>, flight <int>, tailnum <chr>,
  17. ## # origin <chr>, dest <chr>, air_time <dbl>,
  18. ## # distance <dbl>, hour <dbl>, minute <dbl>,
  19. ## # time_hour <dttm>

降序排列

  1. arrange(flights, desc(arr_delay))
  1. ## # A tibble: 336,776 x 19
  2. ## year month day dep_time sched_dep_time dep_delay
  3. ## <int> <int> <int> <int> <int> <dbl>
  4. ## 1 2013 1 9 641 900 1301
  5. ## 2 2013 6 15 1432 1935 1137
  6. ## 3 2013 1 10 1121 1635 1126
  7. ## 4 2013 9 20 1139 1845 1014
  8. ## 5 2013 7 22 845 1600 1005
  9. ## 6 2013 4 10 1100 1900 960
  10. ## 7 2013 3 17 2321 810 911
  11. ## 8 2013 7 22 2257 759 898
  12. ## 9 2013 12 5 756 1700 896
  13. ## 10 2013 5 3 1133 2055 878
  14. ## # ... with 336,766 more rows, and 13 more variables:
  15. ## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>,
  16. ## # carrier <chr>, flight <int>, tailnum <chr>,
  17. ## # origin <chr>, dest <chr>, air_time <dbl>,
  18. ## # distance <dbl>, hour <dbl>, minute <dbl>,
  19. ## # time_hour <dttm>

select()

变量筛选

  1. # Select columns by name
  2. select(flights, year, month, day)
  1. ## # A tibble: 336,776 x 3
  2. ## year month day
  3. ## <int> <int> <int>
  4. ## 1 2013 1 1
  5. ## 2 2013 1 1
  6. ## 3 2013 1 1
  7. ## 4 2013 1 1
  8. ## 5 2013 1 1
  9. ## 6 2013 1 1
  10. ## 7 2013 1 1
  11. ## 8 2013 1 1
  12. ## 9 2013 1 1
  13. ## 10 2013 1 1
  14. ## # ... with 336,766 more rows
  1. # Select all columns between year and day (inclusive)
  2. select(flights, year:day)
  1. ## # A tibble: 336,776 x 3
  2. ## year month day
  3. ## <int> <int> <int>
  4. ## 1 2013 1 1
  5. ## 2 2013 1 1
  6. ## 3 2013 1 1
  7. ## 4 2013 1 1
  8. ## 5 2013 1 1
  9. ## 6 2013 1 1
  10. ## 7 2013 1 1
  11. ## 8 2013 1 1
  12. ## 9 2013 1 1
  13. ## 10 2013 1 1
  14. ## # ... with 336,766 more rows
  1. # Select all columns except those from year to day (inclusive)
  2. select(flights, -(year:day))
  1. ## # A tibble: 336,776 x 16
  2. ## dep_time sched_dep_time dep_delay arr_time sched_arr_time
  3. ## <int> <int> <dbl> <int> <int>
  4. ## 1 517 515 2 830 819
  5. ## 2 533 529 4 850 830
  6. ## 3 542 540 2 923 850
  7. ## 4 544 545 -1 1004 1022
  8. ## 5 554 600 -6 812 837
  9. ## 6 554 558 -4 740 728
  10. ## 7 555 600 -5 913 854
  11. ## 8 557 600 -3 709 723
  12. ## 9 557 600 -3 838 846
  13. ## 10 558 600 -2 753 745
  14. ## # ... with 336,766 more rows, and 11 more variables:
  15. ## # arr_delay <dbl>, carrier <chr>, flight <int>,
  16. ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
  17. ## # distance <dbl>, hour <dbl>, minute <dbl>,
  18. ## # time_hour <dttm>

利用select和其他辅助函数进行组合,进行更多类型的操作

  1. # 提取变量并重命名
  2. select(flights, tail_num = tailnum)
  1. ## # A tibble: 336,776 x 1
  2. ## tail_num
  3. ## <chr>
  4. ## 1 N14228
  5. ## 2 N24211
  6. ## 3 N619AA
  7. ## 4 N804JB
  8. ## 5 N668DN
  9. ## 6 N39463
  10. ## 7 N516JB
  11. ## 8 N829AS
  12. ## 9 N593JB
  13. ## 10 N3ALAA
  14. ## # ... with 336,766 more rows

mutate()

基于已有变量,添加新变量

  1. mutate(flights,
  2. gain = arr_delay - dep_delay,
  3. speed = distance / air_time * 60
  4. )
  1. ## # A tibble: 336,776 x 21
  2. ## year month day dep_time sched_dep_time dep_delay
  3. ## <int> <int> <int> <int> <int> <dbl>
  4. ## 1 2013 1 1 517 515 2
  5. ## 2 2013 1 1 533 529 4
  6. ## 3 2013 1 1 542 540 2
  7. ## 4 2013 1 1 544 545 -1
  8. ## 5 2013 1 1 554 600 -6
  9. ## 6 2013 1 1 554 558 -4
  10. ## 7 2013 1 1 555 600 -5
  11. ## 8 2013 1 1 557 600 -3
  12. ## 9 2013 1 1 557 600 -3
  13. ## 10 2013 1 1 558 600 -2
  14. ## # ... with 336,766 more rows, and 15 more variables:
  15. ## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>,
  16. ## # carrier <chr>, flight <int>, tailnum <chr>,
  17. ## # origin <chr>, dest <chr>, air_time <dbl>,
  18. ## # distance <dbl>, hour <dbl>, minute <dbl>,
  19. ## # time_hour <dttm>, gain <dbl>, speed <dbl>

并且,可以依据刚建立的新列进行操作

  1. mutate(flights,
  2. gain = arr_delay - dep_delay,
  3. gain_per_hour = gain / (air_time / 60)
  4. )
  1. ## # A tibble: 336,776 x 21
  2. ## year month day dep_time sched_dep_time dep_delay
  3. ## <int> <int> <int> <int> <int> <dbl>
  4. ## 1 2013 1 1 517 515 2
  5. ## 2 2013 1 1 533 529 4
  6. ## 3 2013 1 1 542 540 2
  7. ## 4 2013 1 1 544 545 -1
  8. ## 5 2013 1 1 554 600 -6
  9. ## 6 2013 1 1 554 558 -4
  10. ## 7 2013 1 1 555 600 -5
  11. ## 8 2013 1 1 557 600 -3
  12. ## 9 2013 1 1 557 600 -3
  13. ## 10 2013 1 1 558 600 -2
  14. ## # ... with 336,766 more rows, and 15 more variables:
  15. ## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>,
  16. ## # carrier <chr>, flight <int>, tailnum <chr>,
  17. ## # origin <chr>, dest <chr>, air_time <dbl>,
  18. ## # distance <dbl>, hour <dbl>, minute <dbl>,
  19. ## # time_hour <dttm>, gain <dbl>, gain_per_hour <dbl>

若只想保留新列,可以使用transmute()

  1. transmute(flights,
  2. gain = arr_delay - dep_delay,
  3. gain_per_hour = gain / (air_time / 60)
  4. )
  1. ## # A tibble: 336,776 x 2
  2. ## gain gain_per_hour
  3. ## <dbl> <dbl>
  4. ## 1 9 2.38
  5. ## 2 16 4.23
  6. ## 3 31 11.6
  7. ## 4 -17 -5.57
  8. ## 5 -19 -9.83
  9. ## 6 16 6.4
  10. ## 7 24 9.11
  11. ## 8 -11 -12.5
  12. ## 9 -5 -2.14
  13. ## 10 10 4.35
  14. ## # ... with 336,766 more rows

summarise()

根据要求提供数据概要

  1. summarise(flights,
  2. delay = mean(dep_delay, na.rm = TRUE)
  3. )
  1. ## # A tibble: 1 x 1
  2. ## delay
  3. ## <dbl>
  4. ## 1 12.6

sample_n() 和 sample_frac()

随机选择变量

  1. # 提取变量并重命名
  2. sample_n(flights, 10)
  1. ## # A tibble: 10 x 19
  2. ## year month day dep_time sched_dep_time dep_delay
  3. ## <int> <int> <int> <int> <int> <dbl>
  4. ## 1 2013 4 19 1948 1930 18
  5. ## 2 2013 5 17 1328 1329 -1
  6. ## 3 2013 6 27 732 600 92
  7. ## 4 2013 7 17 1010 1014 -4
  8. ## 5 2013 8 17 1241 1229 12
  9. ## 6 2013 7 14 700 700 0
  10. ## 7 2013 9 24 1015 1020 -5
  11. ## 8 2013 11 14 1451 1455 -4
  12. ## 9 2013 3 17 847 850 -3
  13. ## 10 2013 11 9 1629 1630 -1
  14. ## # ... with 13 more variables: arr_time <int>,
  15. ## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
  16. ## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
  17. ## # air_time <dbl>, distance <dbl>, hour <dbl>,
  18. ## # minute <dbl>, time_hour <dttm>
  1. # 按比例选择
  2. sample_frac(flights, 0.01)
  1. ## # A tibble: 3,368 x 19
  2. ## year month day dep_time sched_dep_time dep_delay
  3. ## <int> <int> <int> <int> <int> <dbl>
  4. ## 1 2013 4 2 812 815 -3
  5. ## 2 2013 8 6 542 545 -3
  6. ## 3 2013 2 6 608 610 -2
  7. ## 4 2013 10 5 1138 1145 -7
  8. ## 5 2013 6 14 1515 1455 20
  9. ## 6 2013 7 17 608 609 -1
  10. ## 7 2013 10 2 1214 1223 -9
  11. ## 8 2013 6 9 1739 1629 70
  12. ## 9 2013 6 13 651 700 -9
  13. ## 10 2013 10 1 2026 2029 -3
  14. ## # ... with 3,358 more rows, and 13 more variables:
  15. ## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>,
  16. ## # carrier <chr>, flight <int>, tailnum <chr>,
  17. ## # origin <chr>, dest <chr>, air_time <dbl>,
  18. ## # distance <dbl>, hour <dbl>, minute <dbl>,
  19. ## # time_hour <dttm>

总结

dplyr动词语法的操作规律:

  1. 第一个参数是data.frame
  2. 第二个参数是要进行的操作
  3. 最后返回一个data.frame