title: R_7 tidyr
date: 2021-07-11
tags: R language
categories: Learning
mathjax: true
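Reshaping data with tidyr
tidyr converts data into a structure suited to analysis and visualization.
1. Data suited to analysis
The three principles of tidy data:
(1) each variable is in its own column;
(2) each observation is in its own row;
(3) each value is in its own cell.
Wide format is compact to display but poorly suited to analysis; long format is suited to analysis but repeats some values.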
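To make the contrast concrete, here is a minimal sketch that stores the same four prices once in wide form and once in long form. The city names, the `band_a`/`band_b` columns, and all values are made up for illustration and do not come from the original data file.

```r
# Hypothetical prices; values and column names are for illustration only.
prices_wide <- data.frame(
  city = c("Beijing", "Shanghai"),
  band_a = c(10, 12),
  band_b = c(20, 22),
  stringsAsFactors = FALSE
)

# The same four prices in long form: one row per (city, band) observation.
prices_long <- data.frame(
  city = c("Beijing", "Shanghai", "Beijing", "Shanghai"),
  band = c("band_a", "band_a", "band_b", "band_b"),
  price = c(10, 12, 20, 22),
  stringsAsFactors = FALSE
)

# Long data feeds directly into grouped summaries, e.g. the mean price per band.
mean_by_band <- tapply(prices_long$price, prices_long$band, mean)
print(mean_by_band)
```

Each city name appears twice in `prices_long`; that repetition is the redundancy the long format trades for being easy to group and summarize.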
Load the tidyr package
# install.packages("tidyr")  # run once if tidyr is not installed
library(tidyr)
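2. From columns to rows: gather()
Use gather() to convert wide-format data to long format, that is, to collect values spread across several columns into a single column.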
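band_data_wide <- read.csv("band_price.csv")
band_data_long <- gather(
  band_data_wide, # source data frame
  key = band,     # name of the new column that holds the old column names
  value = price,  # name of the new column that holds the values
  -city           # columns to gather; the minus sign means every column except city
)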
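Since `band_price.csv` is not shown, the following sketch runs the same reshape on a small in-memory data frame with made-up values (the `band_a`/`band_b` columns are assumptions). It also shows `pivot_longer()`, which tidyr 1.0.0 and later recommends over `gather()` for new code.

```r
library(tidyr)

# Hypothetical stand-in for band_price.csv; values are for illustration only.
band_data_wide <- data.frame(
  city = c("Beijing", "Shanghai"),
  band_a = c(10, 12),
  band_b = c(20, 22),
  stringsAsFactors = FALSE
)

# Collect every column except city into band/price pairs.
band_data_long <- gather(band_data_wide, key = band, value = price, -city)

# Equivalent reshape with pivot_longer() (tidyr >= 1.0.0).
band_data_long2 <- pivot_longer(
  band_data_wide,
  cols = -city,
  names_to = "band",
  values_to = "price"
)

print(band_data_long)
```

Both results have the columns `city`, `band`, and `price`; only the row order differs, because `gather()` stacks the data column by column while `pivot_longer()` works row by row.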
3. From rows to columns: spread()
Use spread() to convert long-format data to wide format.
price_by_band <- spread(
  band_data_long,
  key = city,
  value = price
)
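As a self-contained sketch of the same call, the example below spreads a small long-format data frame (made-up values, hypothetical `band_a`/`band_b` labels) so that each city becomes its own column. `pivot_wider()` is the replacement that tidyr 1.0.0 and later recommends for new code.

```r
library(tidyr)

# Hypothetical long-format prices; values are for illustration only.
band_data_long <- data.frame(
  city = c("Beijing", "Shanghai", "Beijing", "Shanghai"),
  band = c("band_a", "band_a", "band_b", "band_b"),
  price = c(10, 12, 20, 22),
  stringsAsFactors = FALSE
)

# One row per band, one column per city.
price_by_band <- spread(band_data_long, key = city, value = price)

# Equivalent reshape with pivot_wider() (tidyr >= 1.0.0).
price_by_band2 <- pivot_wider(
  band_data_long,
  names_from = city,
  values_from = price
)

print(price_by_band)
```

Choosing `key = city` turns the city names into column headers; using `key = band` instead would give one row per city and one column per band.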
Use unite() and separate() to combine several columns into one or to split one column into several.
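A minimal sketch of both functions on made-up data (the `year`, `month`, and `sales` columns are assumptions for illustration): `unite()` pastes two columns together, and `separate()` undoes it.

```r
library(tidyr)

# Hypothetical data; column names and values are for illustration only.
dates_split <- data.frame(
  year = c("2021", "2021"),
  month = c("07", "08"),
  sales = c(100, 120),
  stringsAsFactors = FALSE
)

# unite(): paste year and month into a single year_month column.
dates_united <- unite(dates_split, col = "year_month", year, month, sep = "-")

# separate(): split year_month back into its two parts.
dates_again <- separate(
  dates_united,
  col = year_month,
  into = c("year", "month"),
  sep = "-"
)

print(dates_united)
print(dates_again)
```

By default both functions drop the input columns they consume, and `separate()` returns the pieces as character columns unless `convert = TRUE` is passed.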
tidyr example:
# Exercise 1: analyzing avocado sales with the `tidyr` package
# Load necessary packages (`tidyr`, `dplyr`, and `ggplot2`)
library(tidyr)
library(dplyr)
library(ggplot2)
# Set your working directory using the RStudio menu:
# Session > Set Working Directory > To Source File Location
# Load the `data/avocado.csv` file into a variable `avocados`
# Make sure strings are *not* read in as factors
avocados <- read.csv("avocado.csv", stringsAsFactors = FALSE)
# To tell R to treat the `Date` column as a date (not just a string)
# Redefine that column as a date using the `as.Date()` function
# (hint: use the `mutate` function)
avocados <- mutate(avocados, Date = as.Date(Date, format = "%Y-%m-%d"))
# The file had some uninformative column names, so rename these columns:
# `X4046` to `small_haas`
# `X4225` to `large_haas`
# `X4770` to `xlarge_haas`
avocados <- rename(
  avocados,
  small_haas = X4046,
  large_haas = X4225,
  xlarge_haas = X4770
)
# The data only has sales for haas avocados. Create a new column `other_avos`
# that is the Total.Volume minus all haas avocados (small, large, xlarge)
avocados <- mutate(
  avocados,
  other_avos = Total.Volume - small_haas - large_haas - xlarge_haas
)
# To perform analysis by avocado size, create a dataframe `by_size` that has
# only `Date`, `other_avos`, `small_haas`, `large_haas`, `xlarge_haas`
by_size <- select(avocados, Date, other_avos, small_haas, large_haas, xlarge_haas)
# In order to visualize this data, it needs to be reshaped. The four columns
# `other_avos`, `small_haas`, `large_haas`, `xlarge_haas` need to be
# **gathered** together into a single column called `size`. The volume of sales
# (currently stored in each column) should be stored in a new column called
# `volume`. Create a new dataframe `size_gathered` by passing the `by_size`
# data frame to the `gather()` function. `size_gathered` will only have 3
# columns: `Date`, `size`, and `volume`.
size_gathered <- by_size %>%
  gather(
    key = size,
    value = volume,
    small_haas, large_haas, xlarge_haas, other_avos
  )
# Using `size_gathered`, compute the average sales volume of each size
# (hint, first `group_by` size, then compute using `summarize`)
avg_sales_value <- size_gathered %>%
  group_by(size) %>%
  summarize(
    avg = mean(volume, na.rm = TRUE)
  )
# This shape also facilitates the visualization of sales over time
# (how to write this code is covered in Chapter 16)
ggplot(size_gathered) +
  geom_smooth(mapping = aes(x = Date, y = volume, col = size), se = FALSE)
# We can also investigate sales by avocado type (conventional, organic).
# Create a new data frame `by_type` by grouping the `avocados` dataframe by
# `Date` and `type`, and calculating the sum of the `Total.Volume` for that type
# in that week (resulting in a data frame with 2 rows per week).
# (one possible solution; the summary column name `volume` is a choice made here)
by_type <- avocados %>%
  group_by(Date, type) %>%
  summarize(volume = sum(Total.Volume)) %>%
  ungroup()
# To make a (visual) comparison of conventional versus organic sales, you
# need to **spread** out the `type` column into two different columns. Create a
# new data frame `by_type_wide` by passing the `by_type` data frame to
# the `spread()` function!
by_type_wide <- by_type %>%
  spread(key = type, value = volume)
# Now you can create a scatterplot comparing conventional to organic sales!
# (how to write this code is covered in Chapter 16)
ggplot(by_type_wide) +
  geom_point(mapping = aes(x = conventional, y = organic, color = Date))