01 quality_control

zhuang xioajin
6月-23-2021
1. Loading libraries
2.合并seurat对象
3.Assessing the quality metrics

1. Loading libraries
2.合并seurat对象
3.Assessing the quality metrics

zhuang xioajin

6月-23-2021

1. Loading libraries

rm(list = ls())
# Load libraries
library(SingleCellExperiment)
library(Seurat)
library(tidyverse)
library(Matrix)
library(scales)
library(cowplot)
library(RCurl)
## 对于每个单独的样本，你将有以下三个文件。
## 1. barcodes.tsv 这是一个文本文件，包含该样品的所有细胞条码。条形码是按照矩阵文件中的数据顺序排列的（即这些是列名）。
## 2. features.tsv 这些基因的顺序与矩阵文件中的行的顺序相对应（即这些是行名）
## 3. matrix.mtx 这是一个文本文件，其中包含一个计数值的矩阵。行与上述基因ID相关，列与细胞条码对应。请注意，在这个矩阵中，有许多零值。(可以将矩阵转换为稀疏矩阵)
# How to read in 10X data for a single sample (output is a sparse matrix)
ctrl_counts <- Read10X(data.dir = "data/ctrl_raw_feature_bc_matrix")
# 创建 Seurat对象
ctrl <- CreateSeuratObject(counts = ctrl_counts,
                           min.features = 100)
# min.features参数指定了每个细胞需要检测的最小基因数量。
# Explore the metadata
head(ctrl@meta.data)

##                     orig.ident nCount_RNA nFeature_RNA
## AAACATACAATGCC-1 SeuratProject       2344          874
## AAACATACATTTCC-1 SeuratProject       3125          896
## AAACATACCAGAAA-1 SeuratProject       2578          725
## AAACATACCAGCTA-1 SeuratProject       3261          979
## AAACATACCATGCA-1 SeuratProject        746          362
## AAACATACCTCGCT-1 SeuratProject       3519          866

orig.ident: this often contains the sample identity if known, but will default to “SeuratProject”
nCount_RNA: number of UMIs per cell
nFeature_RNA: number of genes detected per cell

# Create each individual Seurat object for every sample
for (file in c("ctrl_raw_feature_bc_matrix", "stim_raw_feature_bc_matrix")){
     #file <- "ctrl_raw_feature_bc_matrix"
        seurat_data <- Read10X(data.dir = paste0("data/", file))
        seurat_obj <- CreateSeuratObject(counts = seurat_data, 
                                         min.features = 100, 
                                         project = file)
        assign(file, seurat_obj) ## 重命名
}
# Check the metadata in the new Seurat objects
head(ctrl_raw_feature_bc_matrix@meta.data)

##                                  orig.ident nCount_RNA nFeature_RNA
## AAACATACAATGCC-1 ctrl_raw_feature_bc_matrix       2344          874
## AAACATACATTTCC-1 ctrl_raw_feature_bc_matrix       3125          896
## AAACATACCAGAAA-1 ctrl_raw_feature_bc_matrix       2578          725
## AAACATACCAGCTA-1 ctrl_raw_feature_bc_matrix       3261          979
## AAACATACCATGCA-1 ctrl_raw_feature_bc_matrix        746          362
## AAACATACCTCGCT-1 ctrl_raw_feature_bc_matrix       3519          866

head(stim_raw_feature_bc_matrix@meta.data)

##                                  orig.ident nCount_RNA nFeature_RNA
## AAACATACCAAGCT-1 stim_raw_feature_bc_matrix       1221          606
## AAACATACCCCTAC-1 stim_raw_feature_bc_matrix       1782          807
## AAACATACCCGTAA-1 stim_raw_feature_bc_matrix       1451          605
## AAACATACCCTCGT-1 stim_raw_feature_bc_matrix       1549          747
## AAACATACGAGGTG-1 stim_raw_feature_bc_matrix       1303          558
## AAACATACGCGAAG-1 stim_raw_feature_bc_matrix       5445         1330

2.合并seurat对象

# Create a merged Seurat object
merged_seurat <- merge(x = ctrl_raw_feature_bc_matrix, 
                       y = stim_raw_feature_bc_matrix, 
                       add.cell.id = c("ctrl", "stim"))
merged_seurat@assays$RNA

## Assay data with 33538 features for 31444 cells
## First 10 features:
##  MIR1302-2HG, FAM138A, OR4F5, AL627309.1, AL627309.3, AL627309.2,
## AL627309.4, AL732372.1, OR4F29, AC114498.1

# Check that the merged object has the appropriate sample-specific prefixes
head(merged_seurat@meta.data)

##                                       orig.ident nCount_RNA nFeature_RNA
## ctrl_AAACATACAATGCC-1 ctrl_raw_feature_bc_matrix       2344          874
## ctrl_AAACATACATTTCC-1 ctrl_raw_feature_bc_matrix       3125          896
## ctrl_AAACATACCAGAAA-1 ctrl_raw_feature_bc_matrix       2578          725
## ctrl_AAACATACCAGCTA-1 ctrl_raw_feature_bc_matrix       3261          979
## ctrl_AAACATACCATGCA-1 ctrl_raw_feature_bc_matrix        746          362
## ctrl_AAACATACCTCGCT-1 ctrl_raw_feature_bc_matrix       3519          866

tail(merged_seurat@meta.data)

##                                       orig.ident nCount_RNA nFeature_RNA
## stim_TTTGCATGCGACAT-1 stim_raw_feature_bc_matrix        620          295
## stim_TTTGCATGCTAAGC-1 stim_raw_feature_bc_matrix       1641          545
## stim_TTTGCATGGGACGA-1 stim_raw_feature_bc_matrix       1233          518
## stim_TTTGCATGGTGAGG-1 stim_raw_feature_bc_matrix       1084          469
## stim_TTTGCATGGTTTGG-1 stim_raw_feature_bc_matrix        818          432
## stim_TTTGCATGTCTTAC-1 stim_raw_feature_bc_matrix       1104          438

目标：要过滤数据只包括那些高质量的真的细胞，这样，当我们聚集我们的细胞更容易识别不同的细胞类型的人群要确定任何不合格的样品，要么试图挽救数据或分析除去，除了，试图理解为什么样品不合格
挑战：从不太复杂的细胞中描绘出质量差的细胞选择合适的过滤阈值，在不去除生物学相关细胞类型的情况下保留高质量细胞
建议：在执行 QC 之前，请清楚了解您对存在的细胞类型的期望。例如，您是否希望样品中含有低复杂性细胞或线粒体表达水平较高的细胞？如果是这样，那么我们在评估数据质量时需要考虑这种生物学。

# Explore merged metadata
View(merged_seurat@meta.data)

number of genes detected per UMI:这个指标可以让我们了解我们的数据集的复杂性（每个UMI检测到的基因越多，我们的数据越复杂）。
mitochondrial ratio:这个指标将给我们一个来自线粒体基因的细胞读数的百分比。

每个细胞的每个UMI的基因数量很容易计算，我们将把结果进行对数10转换，以便更好地在样品之间进行比较。

# Add number of genes per UMI for each cell to metadata
merged_seurat$log10GenesPerUMI <- log10(merged_seurat$nFeature_RNA) / log10(merged_seurat$nCount_RNA)

Seurat有一个方便的函数，可以让我们计算映射到线粒体基因的转录物的比例。PercentageFeatureSet() will take a pattern and search the gene identifiers. For each column (cell) it will take the sum of the counts slot for features belonging to the set, divide by the column sum for all features and multiply by 100. Since we want the ratio value for plotting, we will reverse that step by then dividing by 100.

# Compute percent mito ratio
merged_seurat$mitoRatio <- PercentageFeatureSet(object = merged_seurat, pattern = "^MT-")
merged_seurat$mitoRatio <- merged_seurat@meta.data$mitoRatio / 100

注意：提供的模式（“^MT-”）适用于人类基因名称。你可能需要根据你感兴趣的生物体来调整。如果你没有使用基因名称作为基因ID，那么这个功能就不能工作。我们有 代码可以让你自己计算这个指标。

我们需要为我们的QC指标的metadata添加额外的信息，比如细胞ID、条件信息和各种指标。虽然使用$操作符直接向Seurat对象中的metadata槽添加信息很容易，但我们将把数据帧提取到一个单独的变量中。通过这种方式，我们可以继续插入我们的质量控制分析所需要的额外指标，而不会有影响我们的merged_seurat对象的风险。

# Create metadata dataframe
metadata <- merged_seurat@meta.data
# Add cell IDs to metadata
metadata$cells <- rownames(metadata)
# Rename columns
metadata <- metadata %>%
        dplyr::rename(seq_folder = orig.ident,
                      nUMI = nCount_RNA,
                      nGene = nFeature_RNA)
# Create sample column
metadata$sample <- NA
metadata$sample[which(str_detect(metadata$cells, "^ctrl_"))] <- "ctrl"
metadata$sample[which(str_detect(metadata$cells, "^stim_"))] <- "stim"
# Add metadata back to Seurat object
merged_seurat@meta.data <- metadata

# Create .RData object to load at any time
save(merged_seurat, file="data/merged_filtered_seurat.RData")

3.Assessing the quality metrics

现在我们已经生成了要评估的各种指标，我们可以用可视化的方式探索它们。我们将评估各种指标，然后决定哪些细胞是低质量的，应该从分析中删除。

Cell counts
UMI counts per cell
Genes detected per cell
UMIs vs. genes detected
Mitochondrial counts ratio
Novelty
3.1Cell counts
Cell counts是由检测到的独特的细胞条形码的数量决定的。对于这个实验，预计有12,000-13,000个细胞。在一个理想的世界里，你会期望独特的细胞条形码的数量与你装载的细胞数量相关。然而，情况并非如此，因为细胞的捕获率只是装载量的一个比例。例如，inDrops的细胞捕获效率较高（70-80%），而10X的细胞捕获率则在50-60%之间。注意：如果用于制备文库的细胞浓度不准确，捕获效率可能会低很多。细胞浓度不应该由FACS机器或Bioanalyzer确定（这些工具的浓度测定不准确），而应该使用血细胞计数器或自动细胞计数器来计算细胞浓度。细胞数量也可以因协议而异，产生的细胞数量比我们装载的要高得多。例如，在inDrops协议中，细胞条形码存在于水凝胶中，水凝胶被包裹在液滴中，有一个细胞和裂解/反应混合物。虽然每个水凝胶应该有一个与之相关的细胞条形码，但偶尔一个水凝胶可以有一个以上的细胞条形码。同样，在10X协议中，有可能在乳化液滴（GEM）中只获得一个条形码的珠子，而没有实际的细胞。这两种情况，再加上垂死细胞的存在，会导致细胞条形码的数量高于细胞。
```
# Visualize the number of cell counts per sample
metadata %>% 
  ggplot(aes(x=sample, fill=sample)) + 
  geom_bar() +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  theme(plot.title = element_text(hjust=0.5, face="bold")) +
  ggtitle("NCells")
```
```
metadata %>% 
count(sample)
```
```
##   sample     n
## 1   ctrl 15688
## 2   stim 15756
```
我们看到每个样本有超过15,000个细胞，这比预期的12-13,000个多出不少。很明显，我们很可能有一些垃圾’细胞’存在。
3.2UMI counts (transcripts) per cell
每个细胞的UMI计数一般应在500以上，那是我们期望的低端。如果UMI计数在500-1000个之间，它是可用的，但细胞可能应该被更深入地测序。
```
# Visualize the number UMIs/transcripts per cell
metadata %>% 
  ggplot(aes(color=sample, x=nUMI, fill= sample)) + 
  geom_density(alpha = 0.2) + 
  scale_x_log10() + 
  theme_classic() +
  ylab("Cell density") +
  geom_vline(xintercept = 500)
```
```
metadata %>% 
group_by(sample) %>% 
summarise(mean(nUMI))
```
```
## # A tibble: 2 x 2
##   sample `mean(nUMI)`
## * <chr>         <dbl>
## 1 ctrl          1600.
## 2 stim          1689.
```
3.3Genes detected per cell
我们对基因检测的期望与UMI检测相似，尽管它可能比UMI低一点。对于高质量的数据，比例直方图应该包含一个单一的大峰，代表被包埋的细胞。如果我们看到主峰右侧有一个小的肩部（在我们的数据中没有），或者细胞的双峰分布，这可以说明一些问题。这可能是有一组细胞由于某种原因而失败。也可能是有生物学上不同类型的细胞（即静止的细胞群，不太复杂的感兴趣的细胞），和/或一种类型比另一种类型小得多（即高计数的细胞可能是尺寸较大的细胞）。因此，这个阈值应与我们在本课中描述的其他指标一起评估。
```
# Visualize the distribution of genes detected per cell via histogram
metadata %>% 
  ggplot(aes(color=sample, x=nGene, fill= sample)) + 
  geom_density(alpha = 0.2) + 
  theme_classic() +
  scale_x_log10() + 
  geom_vline(xintercept = 300)
```
```
# Visualize the distribution of genes detected per cell via boxplot
metadata %>% 
  ggplot(aes(x=sample, y=log10(nGene), fill=sample)) + 
  geom_boxplot() + 
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  theme(plot.title = element_text(hjust=0.5, face="bold")) +
  ggtitle("NCells vs NGenes")
```
3.4UMIs vs. genes detected
经常一起评估的两个指标是UMI的数量和每个细胞检测到的基因数量。we have plotted the number of genes versus the numnber of UMIs coloured by the fraction of mitochondrial reads。线粒体读数部分只在检测到的基因很少的特别低的细胞中高（浅蓝颜色）。这可能表明受损/衰老的细胞，其细胞质的mRNA已经通过破损的膜渗出，因此，只有位于线粒体的mRNA仍然保存。这些细胞被我们的计数和基因数阈值过滤掉了。联合可视化计数和基因阈值显示了联合过滤的效果。
质量差的细胞很可能每个细胞的基因和UMI都很低，对应于图中左下角的数据点。好的细胞一般会表现出每个细胞的基因数和UMI数都比较高。
通过这个图，我们还可以评估线的斜率，以及图中右下角象限的任何数据点的散布。这些细胞有较多的UMI，但只有少数的基因数量。这些可能是垂死的细胞，但也可能代表低复杂度的细胞类型（即红细胞）的群体。
```
# Visualize the correlation between genes detected and number of UMIs and determine whether strong presence of cells with low numbers of genes/UMIs
metadata %>% 
  ggplot(aes(x=nUMI, y=nGene, color=mitoRatio)) + 
  geom_point() + 
  scale_colour_gradient(low = "gray90", high = "black") +
  stat_smooth(method=lm) +
  scale_x_log10() + 
  scale_y_log10() + 
  theme_classic() +
  geom_vline(xintercept = 500) +
  geom_hline(yintercept = 250) +
  facet_wrap(~sample)
```
3.5Mitochondrial counts ratio
这个指标可以识别是否有大量来自 死亡或垂死细胞的线粒体污染。我们将线粒体计数的劣质样品定义为超过0.2线粒体比率标志的细胞，当然，除非你的样品中预期会出现这种情况。
```
# Visualize the distribution of mitochondrial gene expression detected per cell
metadata %>% 
  ggplot(aes(color=sample, x=mitoRatio, fill=sample)) + 
  geom_density(alpha = 0.2) + 
  scale_x_log10() + 
  theme_classic() +
  geom_vline(xintercept = 0.2)
```
3.6Complexity
我们可以看到，我们对每个细胞测序较少的样本有较高的整体复杂性，那是因为我们对这些样本的任何特定基因都没有开始饱和测序。这些样品中的离群细胞可能是具有比其他细胞更不复杂的RNA物种的细胞。有时我们可以通过这个指标检测到低复杂度的细胞类型如红细胞的污染。一般来说，我们希望新颖性得分在0.80以上。
```
# Visualize the overall complexity of the gene expression by visualizing the genes detected per UMI
metadata %>%
  ggplot(aes(x=log10GenesPerUMI, color = sample, fill=sample)) +
  geom_density(alpha = 0.2) +
  theme_classic() +
  geom_vline(xintercept = 0.8)
```
注意：Reads per cell是另一个可以用来探索的指标；但是，使用的工作流程需要保存这个信息来评估。一般来说，通过这个指标，你希望看到所有的样品在每细胞10,000到100,000个读数之间的相对相同的位置有峰值。

3.7Cell-level filtering

现在我们已经将各种指标可视化，我们可以决定应用的阈值，这将导致去除低质量的细胞。通常，前面提到的建议是一个粗略的指导方针，具体的实验需要告知所选择的确切阈值。我们将使用以下阈值。

nUMI > 500
nGene > 250
log10GenesPerUMI > 0.8

mitoRatio < 0.2
为了过滤，我们将回到我们的Seurat对象并使用subset()函数。

# Filter out low quality reads using selected thresholds - these will change with experiment
filtered_seurat <- subset(x = merged_seurat, 
                       subset= (nUMI >= 500) & 
                         (nGene >= 250) & 
                         (log10GenesPerUMI > 0.80) & 
                         (mitoRatio < 0.20))

3.8Gene-level filtering

在我们的数据中，我们会有许多 zero counts的基因。这些基因会极大地降低细胞的平均表达量，因此我们将从数据中删除它们。首先， 我们将删除所有细胞中表达量为零的基因。此外，我们将按流行率进行一些过滤。如果一个基因只在少数几个细胞中表达，那么它就没有特别的意义，因为它仍然会降低所有其他没有表达的细胞的平均水平。对于我们的数据，我们选择只 保留在10个或以上细胞中表达的基因。

# Output a logical vector for every gene on whether the more than zero counts per cell
# Extract counts
counts <- GetAssayData(object = filtered_seurat, slot = "counts")
# Output a logical vector for every gene on whether the more than zero counts per cell
nonzero <- counts > 0
# Sums all TRUE values and returns TRUE if more than 10 TRUE values per gene
keep_genes <- Matrix::rowSums(nonzero) >= 10
# Only keeping those genes expressed in more than 10 cells
filtered_counts <- counts[keep_genes, ]
# Reassign to filtered Seurat object
filtered_seurat <- CreateSeuratObject(filtered_counts, meta.data = filtered_seurat@meta.data)

3.9Re-assess QC metrics

在进行过滤后，建议回顾一下指标，以确保你的数据符合你的期望，并有利于下游分析。

从filtered_seurat对象中提取新的meta.data，以通过与未过滤的数据相同的图示。

3.10Saving filtered cells
```
# Create .RData object to load at any time
save(filtered_seurat, file="data/seurat_filtered.RData")
```
参考来源： Single-cell RNA-seq analysis workshop

These materials have been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

zhuang xioajin

6月-23-2021

1. Loading libraries

2.合并seurat对象

3.Assessing the quality metrics

3.1Cell counts

3.2UMI counts (transcripts) per cell

3.3Genes detected per cell

3.4UMIs vs. genes detected

3.5Mitochondrial counts ratio

3.6Complexity

3.7Cell-level filtering

3.8Gene-level filtering

3.9Re-assess QC metrics

3.10Saving filtered cells