Seurat-多样本整合(rPCA)

R版本与运行环境信息

  1. Date:2021-7-21
  2. sessionInfo("Seurat")
  3. R version 4.0.3 (2020-10-10)
  4. Platform: x86_64-w64-mingw32/x64 (64-bit)
  5. Running under: Windows 10 x64 (build 18363)
  6. version: Seurat_4.0.1
  1. Tim S, Andrew Butler, Paul Hoffman , et al. Comprehensive integration of single cell data[J].Cell,2019.

数据整合原理

见Seurat-多样本整合(CCA)

https://www.jianshu.com/p/32ca61450fe9

通常一下情况可以选择使用rPCA的方法对数据进行整合

  • 一个数据集中有大部分的cell无法与另一个数据集进行匹配( substantial fraction of cells in one dataset have no matching type in the other)
  • 数据集均来自同一平台,如10X( Datasets originate from the same platform)
  • 数据集非常的大(There are a large number of datasets or cells to integrate)

与基于CCA的数据整合的区别

  • 在确定两个数据集共享的变异来源时,CCA更适合细胞类型比较保守,但是基因表达差异很大的数据,但是CCA可能会造成过度矫正,尤其是在非常大的数据集已经共享细胞较少的数据集
  1. By identifying shared sources of variation between datasets, CCA is well-suited for identifying anchors when cell types are conserved, but there are very substantial differences in gene expression across experiments. CCA-based integration therefore enables integrative analysis when experimental conditions or disease states introduce very strong expression shifts, or when integrating datasets across modalities and species. However, CCA-based integration may also lead to overcorrection, especially when a large proportion of cells are non-overlapping across datasets.
  • rPCA是一种运行速度更快的方法,同时也更为保守,因此整合后的数据可能并没有那么“整齐”,同时在执行rPCA寻找锚点时,需要先对每个需要整合的数据运行PCA分析,因此rPCA的方法更适合上述三种情况
  1. RPCA-based integration runs significantly faster, and also represents a more conservative approach where cells in different biological states are less likely to align after integration.

使用rPCA的方法整合数据

载入数据与相关包

  1. rm(list = ls())
  2. library(SeuratData)
  3. library(Seurat)
  4. library(tidyverse)
  5. # install dataset
  6. #InstallData("ifnb")
  7. #载入ifnb数据
  8. LoadData("ifnb")

分割对象并对每组数据进行处理

  1. #分割对象
  2. ifnb.list <- SplitObject(ifnb,split.by = "stim")
  3. #———查看数据----
  4. > ifnb.list
  5. $CTRL
  6. An object of class Seurat
  7. 14053 features across 6548 samples within 1 assay
  8. Active assay: RNA (14053 features, 0 variable features)
  9. $STIM
  10. An object of class Seurat
  11. 14053 features across 7451 samples within 1 assay
  12. Active assay: RNA (14053 features, 0 variable features)
  13. #————————----
  14. #归一化,寻找高变基因
  15. ifnb.list <- lapply(ifnb.list, function(x){
  16. x <- NormalizeData(x)
  17. x <- FindVariableFeatures(x,selection.method = "vst",nfeatures = 2000)
  18. })
  19. #确定用于数据整合的features
  20. features <- SelectIntegrationFeatures(ifnb.list)
  21. #数据标准化和PCA分析
  22. ifnb.list <- lapply(ifnb.list,function(x){
  23. x <- ScaleData(x,features = features)
  24. x <- RunPCA(x,features = features)
  25. })

使用rPCA确定锚点并整合数据

  1. #确定锚点
  2. ifnb.imm <- FindIntegrationAnchors(object.list = ifnb.list,
  3. anchor.features = features,
  4. reduction = "rpca")
  5. #整合数据
  6. ifnb.immdata <- IntegrateData(ifnb.imm)

整合数据标准分析

  1. #设定分析用的数据集
  2. DefaultAssay(ifnb.immdata) <- "integrated"
  3. #标准分析
  4. ifnb.immdata <- ScaleData(ifnb.immdata, verbose = FALSE)
  5. ifnb.immdata <- RunPCA(ifnb.immdata, npcs = 30, verbose = FALSE)
  6. ifnb.immdata <- RunUMAP(ifnb.immdata, reduction = "pca", dims = 1:30)
  7. ifnb.immdata <- FindNeighbors(ifnb.immdata, reduction = "pca", dims = 1:30)
  8. ifnb.immdata <- FindClusters(ifnb.immdata, resolution = 0.5)
  9. p1 <- DimPlot(ifnb.immdata, reduction = "umap", group.by = "stim")
  10. p2 <- DimPlot(ifnb.immdata, reduction = "umap", group.by = "seurat_annotations", label = TRUE,
  11. repel = TRUE)
  12. p1 + p2

scRNA-多样本整合(rPCA) - 图1

从上图可以看出,与CCA的整合相比,rPCA的整合结果要略差于CCA,但是可以通过调整FindIntegrationAnchors()的参数进一步优化整合结果

优化rPCA寻找锚点的结果

通过调整k.anchor参数, 默认为5

**k.anchor**: How many neighbors (k) to use when picking anchors, 5

  1. #调整k.anchor
  2. ifnb.imm <- FindIntegrationAnchors(object.list = ifnb.list,
  3. anchor.features = features,
  4. reduction = "rpca",
  5. k.anchor = 20)
  6. ifnb.immdata <- IntegrateData(ifnb.imm)
  7. DefaultAssay(ifnb.immdata) <- "integrated"
  8. # Run the standard workflow for visualization and clustering
  9. ifnb.immdata <- ScaleData(ifnb.immdata, verbose = FALSE)
  10. ifnb.immdata <- RunPCA(ifnb.immdata, npcs = 30, verbose = FALSE)
  11. ifnb.immdata <- RunUMAP(ifnb.immdata, reduction = "pca", dims = 1:30)
  12. ifnb.immdata <- FindNeighbors(ifnb.immdata, reduction = "pca", dims = 1:30)
  13. ifnb.immdata <- FindClusters(ifnb.immdata, resolution = 0.5)
  14. p1 <- DimPlot(ifnb.immdata, reduction = "umap", group.by = "stim")
  15. p2 <- DimPlot(ifnb.immdata, reduction = "umap", group.by = "seurat_annotations", label = TRUE,
  16. repel = TRUE)
  17. p1 + p2

scRNA-多样本整合(rPCA) - 图2

可以看出整合的效果要好于默认参数

使用SCTransform对数据标准化

上服务器

NormalizeData()FindVariableFeatures()ScaleData()SCTransform替代

  1. library(glmGamPoi)
  2. LoadData("ifnb")
  3. ifnb.list <- SplitObject(ifnb, split.by = "stim")
  4. #执行SCTransform标准化
  5. ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform, method = "glmGamPoi")
  6. #选择用于整合的features
  7. features <- SelectIntegrationFeatures(object.list = ifnb.list, nfeatures = 3000)
  8. #Prepare an object list normalized with sctransform for integration.
  9. ?PrepSCTIntegration
  10. #PCA分析
  11. ifnb.list <- lapply(X = ifnb.list, FUN = RunPCA, features = features)

数据整合

  • 在常规分析中,使用少量的PC既能关注到关键的生物学差异,又能够不引入更多的技术差异,相当于一种保守性的做法。是的,它会失去一些生物差异信息,但是同时又在常规手段中比较安全。
  • 这里使用的sctransform,显然更“自信”一些,能提取更多的生物差异,并且兼顾不引入技术误差

常规分析中的FindVariableFeatures默认得到2000个高变异基因(HVGs),而这里的**sctransform**因为使用了更多的PCs,算法也更优化,所以默认会得到3000个HVGs。sctransform认为:新增加的这1000个基因就包含了之前没有检测到的微弱的生物学差异。而且,即使使用全部的全部的基因去做下游分析,得到的结果也是和sctransform这3000个基因的结果相似

  1. #确定锚点
  2. immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list,
  3. normalization.method = "SCT",
  4. anchor.features = features,
  5. dims = 1:30,
  6. reduction = "rpca",
  7. k.anchor = 5)
  8. #数据整合
  9. immune.combined.sct <- IntegrateData(anchorset = immune.anchors, normalization.method = "SCT", dims = 1:30)

整合后的数据UMAP分析

  1. immune.combined.sct <- RunPCA(immune.combined.sct, verbose = FALSE)
  2. immune.combined.sct <- RunUMAP(immune.combined.sct, reduction = "pca", dims = 1:30)
  3. p1 <- DimPlot(immune.combined.sct, reduction = "umap", group.by = "stim")
  4. p2 <- DimPlot(immune.combined.sct, reduction = "umap", group.by = "seurat_annotations", label = TRUE,
  5. repel = TRUE)
  6. cowplot::plot_grid(p1,p2,nrow = 2)

scRNA-多样本整合(rPCA) - 图3