对于数据进行操作时,往往不限于单个数据的操作,多数情况需要对数据进行整合,过滤,标准化。在dplyr包中,提供了一系列的快捷操作,用于两个数据之间的操作处理

合并(join)操作

  1. library("nycflights13")
  2. # Drop unimportant variables so it's easier to understand the join results.
  3. flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
  4. ## 向前边的数据添加变量
  5. flights2 %>%
  6. left_join(airlines)
  1. ## Joining, by = "carrier"
  1. ## # A tibble: 336,776 x 9
  2. ## year month day hour origin dest tailnum carrier name
  3. ## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
  4. ## 1 2013 1 1 5 EWR IAH N14228 UA Unit~
  5. ## 2 2013 1 1 5 LGA IAH N24211 UA Unit~
  6. ## 3 2013 1 1 5 JFK MIA N619AA AA Amer~
  7. ## 4 2013 1 1 5 JFK BQN N804JB B6 JetB~
  8. ## 5 2013 1 1 6 LGA ATL N668DN DL Delt~
  9. ## 6 2013 1 1 5 EWR ORD N39463 UA Unit~
  10. ## 7 2013 1 1 6 EWR FLL N516JB B6 JetB~
  11. ## 8 2013 1 1 6 LGA IAD N829AS EV Expr~
  12. ## 9 2013 1 1 6 JFK MCO N593JB B6 JetB~
  13. ## 10 2013 1 1 6 LGA ORD N3ALAA AA Amer~
  14. ## # ... with 336,766 more rows

join变量参照

在进行join操作时,选取变量作为参照,对数据进行合并,在默认情况下时NULL,意思为根据所有变量进行合并

  1. flights2 %>% left_join(weather)
  1. ## Joining, by = c("year", "month", "day", "hour", "origin")
  1. ## # A tibble: 336,776 x 18
  2. ## year month day hour origin dest tailnum carrier temp
  3. ## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <dbl>
  4. ## 1 2013 1 1 5 EWR IAH N14228 UA 39.0
  5. ## 2 2013 1 1 5 LGA IAH N24211 UA 39.9
  6. ## 3 2013 1 1 5 JFK MIA N619AA AA 39.0
  7. ## 4 2013 1 1 5 JFK BQN N804JB B6 39.0
  8. ## 5 2013 1 1 6 LGA ATL N668DN DL 39.9
  9. ## 6 2013 1 1 5 EWR ORD N39463 UA 39.0
  10. ## 7 2013 1 1 6 EWR FLL N516JB B6 37.9
  11. ## 8 2013 1 1 6 LGA IAD N829AS EV 39.9
  12. ## 9 2013 1 1 6 JFK MCO N593JB B6 37.9
  13. ## 10 2013 1 1 6 LGA ORD N3ALAA AA 39.9
  14. ## # ... with 336,766 more rows, and 9 more variables:
  15. ## # dewp <dbl>, humid <dbl>, wind_dir <dbl>,
  16. ## # wind_speed <dbl>, wind_gust <dbl>, precip <dbl>,
  17. ## # pressure <dbl>, visib <dbl>, time_hour <dttm>

同样,也可以设定特殊的列作为参考进行数据合并

  1. flights2 %>% left_join(planes, by = "tailnum")
  1. ## # A tibble: 336,776 x 16
  2. ## year.x month day hour origin dest tailnum carrier
  3. ## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
  4. ## 1 2013 1 1 5 EWR IAH N14228 UA
  5. ## 2 2013 1 1 5 LGA IAH N24211 UA
  6. ## 3 2013 1 1 5 JFK MIA N619AA AA
  7. ## 4 2013 1 1 5 JFK BQN N804JB B6
  8. ## 5 2013 1 1 6 LGA ATL N668DN DL
  9. ## 6 2013 1 1 5 EWR ORD N39463 UA
  10. ## 7 2013 1 1 6 EWR FLL N516JB B6
  11. ## 8 2013 1 1 6 LGA IAD N829AS EV
  12. ## 9 2013 1 1 6 JFK MCO N593JB B6
  13. ## 10 2013 1 1 6 LGA ORD N3ALAA AA
  14. ## # ... with 336,766 more rows, and 8 more variables:
  15. ## # year.y <int>, type <chr>, manufacturer <chr>,
  16. ## # model <chr>, engines <int>, seats <int>, speed <int>,
  17. ## # engine <chr>

当在两个数据中,需要参考的变量的名称不一样怎么办?使用by = c("x" = "a")进行操作

  1. flights2 %>% left_join(airports, c("dest" = "faa"))
  1. ## # A tibble: 336,776 x 15
  2. ## year month day hour origin dest tailnum carrier name
  3. ## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
  4. ## 1 2013 1 1 5 EWR IAH N14228 UA Geor~
  5. ## 2 2013 1 1 5 LGA IAH N24211 UA Geor~
  6. ## 3 2013 1 1 5 JFK MIA N619AA AA Miam~
  7. ## 4 2013 1 1 5 JFK BQN N804JB B6 <NA>
  8. ## 5 2013 1 1 6 LGA ATL N668DN DL Hart~
  9. ## 6 2013 1 1 5 EWR ORD N39463 UA Chic~
  10. ## 7 2013 1 1 6 EWR FLL N516JB B6 Fort~
  11. ## 8 2013 1 1 6 LGA IAD N829AS EV Wash~
  12. ## 9 2013 1 1 6 JFK MCO N593JB B6 Orla~
  13. ## 10 2013 1 1 6 LGA ORD N3ALAA AA Chic~
  14. ## # ... with 336,766 more rows, and 6 more variables:
  15. ## # lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>,
  16. ## # tzone <chr>

join的类型

inner_join()

合并后保留两个数据公有的观测值,保留所有unique的变量

  1. df1 <- tibble(x = c(1, 2), y = 2:1)
  2. df2 <- tibble(x = c(1, 3), a = 10, b = "a")
  3. df1 %>% inner_join(df2)
  1. ## Joining, by = "x"
  1. ## # A tibble: 1 x 4
  2. ## x y a b
  3. ## <dbl> <int> <dbl> <chr>
  4. ## 1 1 2 10 a

left_join()

合并后包含第一个数据所有的观测值,第二个数据中没有的值用NA表示

  1. df1 %>% left_join(df2)
  1. ## Joining, by = "x"
  1. ## # A tibble: 2 x 4
  2. ## x y a b
  3. ## <dbl> <int> <dbl> <chr>
  4. ## 1 1 2 10 a
  5. ## 2 2 1 NA <NA>

right_join()

合并后保留第二个数据所有的观测值,等同于left_join(y,x),不过变量顺序发生了改变

  1. df1 %>% right_join(df2)
  1. ## Joining, by = "x"
  1. ## # A tibble: 2 x 4
  2. ## x y a b
  3. ## <dbl> <int> <dbl> <chr>
  4. ## 1 1 2 10 a
  5. ## 2 3 NA 10 a
  1. df2 %>% left_join(df1)
  1. ## Joining, by = "x"
  1. ## # A tibble: 2 x 4
  2. ## x a b y
  3. ## <dbl> <dbl> <chr> <int>
  4. ## 1 1 10 a 2
  5. ## 2 3 10 a NA

full_join()

合并后包含两个数据所有的观测值,没有的值用NA补足

  1. df1 %>% full_join(df2)
  1. ## Joining, by = "x"
  1. ## # A tibble: 3 x 4
  2. ## x y a b
  3. ## <dbl> <int> <dbl> <chr>
  4. ## 1 1 2 10 a
  5. ## 2 2 1 NA <NA>
  6. ## 3 3 NA 10 a

semi_join()

合并后保留第一个数据中所有在第二个数据中匹配到的观测值

  1. df1 <- tibble(x = c(1, 1, 3, 4), y = 1:4)
  2. df2 <- tibble(x = c(1, 1, 2), z = c("a", "b", "a"))
  3. df1 %>% semi_join(df2, by = "x")
  1. ## # A tibble: 2 x 2
  2. ## x y
  3. ## <dbl> <int>
  4. ## 1 1 1
  5. ## 2 1 2

anti_join()

合并后保留第一个数据中所有在第二个数据中未匹配到的观测值

  1. df1 %>% anti_join(df2, by = "x")
  1. ## # A tibble: 2 x 2
  2. ## x y
  3. ## <dbl> <int>
  4. ## 1 3 3
  5. ## 2 4 4

集合操作

在基础R函数中的向量取集操作运用到data.frame水平

  1. df1 <- tibble(x = 1:2, y = c(1L, 1L))
  2. df2 <- tibble(x = 1:2, y = 1:2)
  1. intersect(df1, df2)
  1. ## # A tibble: 1 x 2
  2. ## x y
  3. ## <int> <int>
  4. ## 1 1 1
  1. union(df1, df2)
  1. ## # A tibble: 3 x 2
  2. ## x y
  3. ## <int> <int>
  4. ## 1 1 1
  5. ## 2 2 1
  6. ## 3 2 2
  1. setdiff(df1, df2)
  1. ## # A tibble: 1 x 2
  2. ## x y
  3. ## <int> <int>
  4. ## 1 2 1
  1. setdiff(df2, df1)
  1. ## # A tibble: 1 x 2
  2. ## x y
  3. ## <int> <int>
  4. ## 1 2 2