title: R_7 tidyr
date: 2021-07-11
tags: R language
categories: Learning
mathjax: true

Reshaping data with tidyr

tidyr reshapes data into a structure suited to analysis and visualization.

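1. Data suited to analysis

The three principles of tidy data:
(1) each variable is in its own column;
(2) each observation is in its own row;
(3) each value is in its own cell.

Wide format is compact to display but awkward to analyze; long format is suited to analysis but repeats some values.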
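To make the wide/long distinction concrete, here is a minimal sketch that writes the same four prices in both layouts by hand. The city names, band names, and prices are made up for illustration and do not come from any file used in this post.

```r
# Hypothetical prices for two bands in two cities (illustrative values only)
wide <- data.frame(
  city = c("Seattle", "Portland"),
  band_a = c(40, 35),
  band_b = c(60, 55)
)

# The same four prices in long format: one row per (city, band) observation.
# Note that each city name and each band name now appears twice.
long <- data.frame(
  city = c("Seattle", "Portland", "Seattle", "Portland"),
  band = c("band_a", "band_a", "band_b", "band_b"),
  price = c(40, 35, 60, 55)
)

dim(wide) # 2 rows, 3 columns
dim(long) # 4 rows, 3 columns
```

In the long layout `price` is a single variable in a single column, which is what the three principles ask for.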

Load the tidyr package:

# install.packages("tidyr")  # run once if tidyr is not yet installed
library("tidyr")

2. From columns to rows: gather()

gather() converts wide-format data to long format: it collects values spread across several columns into a single column.

band_data_wide <- read.csv("band_price.csv")
band_data_long <- gather(
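  band_data_wide, # source data frame
  key = band, # name of the new column that holds the former column names
  value = price, # name of the new column that holds the values
  -city # columns to gather; the minus sign means every column except city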
)
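Since `band_price.csv` is not included here, the following sketch runs the same gather() call on a small inline data frame. The contents of that data frame are hypothetical; only the column name `city` is taken from the call above.

```r
library(tidyr)

# Hypothetical stand-in for band_price.csv (illustrative values only)
band_data_wide <- data.frame(
  city = c("Seattle", "Portland"),
  band_a = c(40, 35),
  band_b = c(60, 55)
)

# Gather every column except `city` into a band/price pair of columns
band_data_long <- gather(band_data_wide, key = band, value = price, -city)

print(band_data_long)
```

In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(); a roughly equivalent call would be `pivot_longer(band_data_wide, cols = -city, names_to = "band", values_to = "price")`, although the row order of the result differs.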

3. From rows to columns: spread()

spread() converts long-format data back to wide format.

price_by_band <- spread(
  band_data_long,
  key = city,
  value = price
)
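The choice of `key` decides which variable becomes the new column headers. A minimal sketch on hypothetical long-format data (the values are made up, and `price_by_city` is a name introduced here for illustration):

```r
library(tidyr)

# Hypothetical long-format data (illustrative values only)
band_data_long <- data.frame(
  city = c("Seattle", "Portland", "Seattle", "Portland"),
  band = c("band_a", "band_a", "band_b", "band_b"),
  price = c(40, 35, 60, 55)
)

# key = city: one row per band, one column per city
price_by_band <- spread(band_data_long, key = city, value = price)

# key = band: one row per city, one column per band
price_by_city <- spread(band_data_long, key = band, value = price)

print(price_by_band)
print(price_by_city)
```

In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(), which takes `names_from` and `values_from` in place of `key` and `value`.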

Use unite() to merge several columns into one, and separate() to split one column into several.
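As a minimal sketch of the two functions, the example below merges a year and a month column and then splits them apart again. The data frame and its column names are hypothetical and serve only as an illustration.

```r
library(tidyr)

# Hypothetical data (illustrative values only)
dates <- data.frame(
  year = c(2020, 2021),
  month = c(7, 11),
  sales = c(100, 150)
)

# unite(): paste `year` and `month` into a single `year_month` column;
# the two input columns are removed by default
united <- unite(dates, col = "year_month", year, month, sep = "-")

# separate(): split `year_month` back into two columns;
# convert = TRUE turns the resulting strings back into numbers
separated <- separate(
  united,
  col = year_month,
  into = c("year", "month"),
  sep = "-",
  convert = TRUE
)

print(united)
print(separated)
```

Without `convert = TRUE`, separate() leaves the new columns as character strings.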

A tidyr example:

# Exercise 1: analyzing avocado sales with the `tidyr` package

# Load necessary packages (`tidyr`, `dplyr`, and `ggplot2`)
library("tidyr")
library("dplyr")
library("ggplot2")

# Set your working directory using the RStudio menu:
# Session > Set Working Directory > To Source File Location

# Load the `data/avocado.csv` file into a variable `avocados`
# Make sure strings are *not* read in as factors
avocados <- read.csv("avocado.csv", stringsAsFactors = FALSE)

# To tell R to treat the `Date` column as a date (not just a string)
# Redefine that column as a date using the `as.Date()` function
# (hint: use the `mutate` function)
avocados <- mutate(avocados, Date = as.Date(Date, "%Y-%m-%d"))

# The file had some uninformative column names, so rename these columns:
# `X4046` to `small_haas`
# `X4225` to `large_haas`
# `X4770` to `xlarge_haas`
avocados <- rename(
  avocados,
  small_haas = X4046,
  large_haas = X4225,
  xlarge_haas = X4770
)

# The data only has sales for haas avocados. Create a new column `other_avos`
# that is the Total.Volume minus all haas avocados (small, large, xlarge)
avocados <- mutate(
  avocados,
  other_avos = Total.Volume - small_haas - large_haas - xlarge_haas
)

# To perform analysis by avocado size, create a dataframe `by_size` that has
# only `Date`, `other_avos`, `small_haas`, `large_haas`, `xlarge_haas`
by_size <- select(avocados, Date, other_avos, small_haas, large_haas, xlarge_haas)

# In order to visualize this data, it needs to be reshaped. The four columns
# `other_avos`, `small_haas`, `large_haas`, `xlarge_haas` need to be 
# **gathered** together into a single column called `size`. The volume of sales
# (currently stored in each column) should be stored in a new column called 
# `volume`. Create a new dataframe `size_gathered` by passing the `by_size` 
# data frame to the `gather()` function. `size_gathered` will only have 3 
# columns: `Date`, `size`, and `volume`.
size_gathered <- by_size %>%
  gather(
    key = size,
    value = volume,
    small_haas, large_haas, xlarge_haas, other_avos
  )

# Using `size_gathered`, compute the average sales volume of each size 
# (hint, first `group_by` size, then compute using `summarize`)
avg_sales_value <- size_gathered %>%
  group_by(size) %>%
  summarize(
    avg = mean(volume, na.rm = TRUE)
  )

# This shape also facilitates the visualization of sales over time
# (how to write this code is covered in Chapter 16)
ggplot(size_gathered) +
  geom_smooth(mapping = aes(x = Date, y = volume, color = size), se = FALSE)


# We can also investigate sales by avocado type (conventional, organic).
# Create a new data frame `by_type` by grouping the `avocados` dataframe by
# `Date` and `type`, and calculating the sum of the `Total.Volume` for that type
# in that week (resulting in a data frame with 2 rows per week).
# (`volume` is a column name chosen here; the exercise does not specify one)
by_type <- avocados %>%
  group_by(Date, type) %>%
  summarize(volume = sum(Total.Volume), .groups = "drop")

# To make a (visual) comparison of conventional versus organic sales, you 
# need to **spread** out the `type` column into two different columns. Create a 
# new data frame `by_type_wide` by passing the `by_type` data frame to 
# the `spread()` function!
by_type_wide <- by_type %>%
  spread(key = type, value = volume)


# Now you can create a scatterplot comparing conventional to organic sales!
# (how to write this code is covered in Chapter 16)
ggplot(by_type_wide) +
  geom_point(mapping = aes(x = conventional, y = organic, color = Date))
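
To check the group-then-spread step without `avocado.csv`, the sketch below runs it on a tiny hand-made data frame. `toy_avocados` and its values are hypothetical; only the column names `Date`, `type`, and `Total.Volume` are taken from the exercise. The summary column name `volume` is likewise an assumption.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the avocado data (illustrative values only)
toy_avocados <- data.frame(
  Date = as.Date(c("2018-01-07", "2018-01-07", "2018-01-07",
                   "2018-01-14", "2018-01-14")),
  type = c("conventional", "conventional", "organic",
           "conventional", "organic"),
  Total.Volume = c(100, 200, 30, 250, 40),
  stringsAsFactors = FALSE
)

# Sum the volume per week and type: 2 rows per week
by_type <- toy_avocados %>%
  group_by(Date, type) %>%
  summarize(volume = sum(Total.Volume), .groups = "drop")

# Spread `type` into one column per avocado type: 1 row per week
by_type_wide <- spread(by_type, key = type, value = volume)

print(by_type_wide)
```

Each week now occupies a single row with a `conventional` and an `organic` column, which is the shape the scatterplot above expects.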