使用SnapATAC分析单细胞ATAC-seq数据(三):Integrative Analysis of PBMC scATAC-seq and scRNA-seq - 图1

简介

在本教程中,我们将对PBMC的两个scATAC-seq数据集(5K和10K)和一个scRNA-seq数据集进行整合分析。这三个数据集均来自10X genomics测序平台产生的数据,可以直接在10x官网下载使用。
具体来说,我们将主要执行以下分析内容:

  • Cell selection for PBMC 5k and 10k scATAC;
  • Randomly sample 10,000 cells as landmarks;
  • Unsupervised clustering of landmarks;
  • Project the remaining (query) cells onto the landmarks;
  • Supervised annotation of scATAC clusters using scRNA dataset;
  • Downstream analysis including peak calling, differential analysis, prediction of gene-enhancer pairing.

分析步骤

  • Step 0. Data download
  • Step 1. Barcode selection
  • Step 2. Add cell-by-bin matrix
  • Step 3. Matrix binarization
  • Step 4. Bin filtering
  • Step 5. Dimensionality reduction of landmarks
  • Step 6. Determine significant components
  • Step 7. Graph-based clustering
  • Step 8. Visualization
  • Step 9. scRNA-seq based annotation
  • Step 10. Create psudo multiomics cells
  • Step 11. Remove cells of low prediction score
  • Step 12. Gene expression projected into UMAP
  • Step 13. Identify peak
  • Step 14. Create a cell-by-peak matrix
  • Step 15. Identify differentially accessible regions
  • Step 16. Motif variability analysis
  • Step 17. De novo motif discovery
  • Step 18. Predict gene-enhancer pairs

Step 0. Data Download

开始分析之前,我们需要下载snaptools生成的snap文件和cellranger-atac产生的singlecell.csv文件。

  1. # 下载所需的数据集和基因注释信息
  2. $ wget http://renlab.sdsc.edu/r3fang//share/github/PBMC_ATAC_RNA/atac_pbmc_5k_nextgem.snap
  3. $ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_singlecell.csv
  4. $ wget http://renlab.sdsc.edu/r3fang//share/github/PBMC_ATAC_RNA/atac_pbmc_10k_nextgem.snap
  5. $ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_singlecell.csv
  6. $ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/hg19.blacklist.bed.gz
  7. $ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/gencode.v19.annotation.gene.bed

Step 1. Barcode selection

首先,我们根据以下两个主要的标准来选择高质量的barcodes:

  1. number of unique fragments;
  2. fragments in promoter ratio;
  1. # 加载所需R包
  2. > library(SnapATAC);
  3. > snap.files = c(
  4. "atac_pbmc_5k_nextgem.snap",
  5. "atac_pbmc_10k_nextgem.snap"
  6. );
  7. > sample.names = c(
  8. "PBMC 5K",
  9. "PBMC 10K"
  10. );
  11. > barcode.files = c(
  12. "atac_pbmc_5k_nextgem_singlecell.csv",
  13. "atac_pbmc_10k_nextgem_singlecell.csv"
  14. );
  15. # 读取snap文件
  16. > x.sp.ls = lapply(seq(snap.files), function(i){
  17. createSnap(
  18. file=snap.files[i],
  19. sample=sample.names[i]
  20. );
  21. })
  22. > names(x.sp.ls) = sample.names;
  23. # 读取barcode信息
  24. > barcode.ls = lapply(seq(snap.files), function(i){
  25. barcodes = read.csv(
  26. barcode.files[i],
  27. head=TRUE
  28. );
  29. # remove NO BAROCDE line
  30. barcodes = barcodes[2:nrow(barcodes),];
  31. barcodes$logUMI = log10(barcodes$passed_filters + 1);
  32. barcodes$promoter_ratio = (barcodes$promoter_region_fragments+1) / (barcodes$passed_filters + 1);
  33. barcodes
  34. })
  35. # 质控指标数据可视化
  36. > plots = lapply(seq(snap.files), function(i){
  37. p1 = ggplot(
  38. barcode.ls[[i]],
  39. aes(x=logUMI, y=promoter_ratio)) +
  40. geom_point(size=0.3, col="grey") +
  41. theme_classic() +
  42. ggtitle(sample.names[[i]]) +
  43. ylim(0, 1) + xlim(0, 6) +
  44. labs(x = "log10(UMI)", y="promoter ratio")
  45. p1
  46. })
  47. > plots

使用SnapATAC分析单细胞ATAC-seq数据(三):Integrative Analysis of PBMC scATAC-seq and scRNA-seq - 图2

  1. # 查看snap对象信息
  2. > x.sp.ls
  3. ## $`PBMC 5K`
  4. ## number of barcodes: 20000
  5. ## number of bins: 0
  6. ## number of genes: 0
  7. ## number of peaks: 0
  8. ## number of motifs: 0
  9. ##
  10. ## $`PBMC 10K`
  11. ## number of barcodes: 20000
  12. ## number of bins: 0
  13. ## number of genes: 0
  14. ## number of peaks: 0
  15. ## number of motifs: 0
  16. # for both datasets, we identify usable barcodes using [3.5-5] for log10(UMI) and [0.4-0.8] for promoter ratio as cutoff.
  17. # 设定质控阈值进行筛选过滤
  18. > cutoff.logUMI.low = c(3.5, 3.5);
  19. > cutoff.logUMI.high = c(5, 5);
  20. > cutoff.FRIP.low = c(0.4, 0.4);
  21. > cutoff.FRIP.high = c(0.8, 0.8);
  22. # 根据过滤的阈值进行barcode的筛选
  23. > barcode.ls = lapply(seq(snap.files), function(i){
  24. barcodes = barcode.ls[[i]];
  25. idx = which(
  26. barcodes$logUMI >= cutoff.logUMI.low[i] &
  27. barcodes$logUMI <= cutoff.logUMI.high[i] &
  28. barcodes$promoter_ratio >= cutoff.FRIP.low[i] &
  29. barcodes$promoter_ratio <= cutoff.FRIP.high[i]
  30. );
  31. barcodes[idx,]
  32. });
  33. > x.sp.ls = lapply(seq(snap.files), function(i){
  34. barcodes = barcode.ls[[i]];
  35. x.sp = x.sp.ls[[i]];
  36. barcode.shared = intersect(x.sp@barcode, barcodes$barcode);
  37. x.sp = x.sp[match(barcode.shared, x.sp@barcode),];
  38. barcodes = barcodes[match(barcode.shared, barcodes$barcode),];
  39. x.sp@metaData = barcodes;
  40. x.sp
  41. })
  42. > names(x.sp.ls) = sample.names;
  43. > x.sp.ls
  44. ## $`PBMC 5K`
  45. ## number of barcodes: 4526
  46. ## number of bins: 0
  47. ## number of genes: 0
  48. ## number of peaks: 0
  49. ## number of motifs: 0
  50. ##
  51. ## $`PBMC 10K`
  52. ## number of barcodes: 9039
  53. ## number of bins: 0
  54. ## number of genes: 0
  55. ## number of peaks: 0
  56. ## number of motifs: 0
  57. # combine two snap object
  58. # combine two snap object
  59. # 使用Reduce函数将两个snap对象进行合并
  60. > x.sp = Reduce(snapRbind, x.sp.ls);
  61. > x.sp@metaData["sample"] = x.sp@sample;
  62. > x.sp
  63. ## number of barcodes: 13565
  64. ## number of bins: 0
  65. ## number of genes: 0
  66. ## number of peaks: 0
  67. > table(x.sp@sample);
  68. ## PBMC 10K PBMC 5K
  69. ## 9039 4526

Step 2. Add cell-by-bin matrix

接下来,我们使用addBmatToSnap函数生成5kb分辨率的cell-by-bin计数矩阵,并将其添加到snap对象中。该函数将自动从两个snap文件中读取cell-by-bin矩阵,并将连接后的矩阵添加到snap对象的bmat属性中。

  1. # 使用addBmatToSnap函数计算cell-by-bin计数矩阵
  2. > x.sp = addBmatToSnap(x.sp, bin.size=5000);

Step 3. Matrix binarization

我们使用makeBinary函数将cell-by-bin计数矩阵转换为二进制矩阵。在count矩阵中,某些项具有异常高的覆盖率,这可能是由比对错误造成的。因此,我们将删除计数矩阵中top 0.1%的项,并将其余非零的值转换为1。

  1. # 使用makeBinary函数将cell-by-bin计数矩阵转换成二进制矩阵
  2. > x.sp = makeBinary(x.sp, mat="bmat");

Step 4. Bin filtering

首先,我们过滤掉任何与ENCODE中blacklist区域重叠的bins,以防止潜在的artifacts。

  1. > library(GenomicRanges);
  2. # 读取blacklist信息
  3. > black_list = read.table("hg19.blacklist.bed.gz");
  4. > black_list.gr = GRanges(
  5. black_list[,1],
  6. IRanges(black_list[,2], black_list[,3])
  7. );
  8. > idy = queryHits(
  9. findOverlaps(x.sp@feature, black_list.gr)
  10. );
  11. > if(length(idy) > 0){
  12. x.sp = x.sp[,-idy, mat="bmat"];
  13. };
  14. > x.sp
  15. ## number of barcodes: 13565
  16. ## number of bins: 624794
  17. ## number of genes: 0
  18. ## number of peaks: 0
  19. ## number of motifs: 0

接下来,我们移除那些不需要的染色体信息。

  1. > chr.exclude = seqlevels(x.sp@feature)[grep("random|chrM", seqlevels(x.sp@feature))];
  2. > idy = grep(paste(chr.exclude, collapse="|"), x.sp@feature);
  3. > if(length(idy) > 0){
  4. x.sp = x.sp[,-idy, mat="bmat"]
  5. };
  6. > x.sp
  7. ## number of barcodes: 13565
  8. ## number of bins: 624297
  9. ## number of genes: 0
  10. ## number of peaks: 0
  11. ## number of motifs: 0

第三,bins的覆盖率大致是服从对数正态分布的。我们将与不变特征(如管家基因的启动子)重叠的前5%的bins进行删除 。

  1. > bin.cov = log10(Matrix::colSums(x.sp@bmat)+1);
  2. > hist(
  3. bin.cov[bin.cov > 0],
  4. xlab="log10(bin cov)",
  5. main="log10(Bin Cov)",
  6. col="lightblue",
  7. xlim=c(0, 5)
  8. );
  9. > bin.cutoff = quantile(bin.cov[bin.cov > 0], 0.95);
  10. > idy = which(bin.cov <= bin.cutoff & bin.cov > 0);
  11. > x.sp = x.sp[, idy, mat="bmat"];
  12. > x.sp
  13. ## number of barcodes: 13565
  14. ## number of bins: 534985
  15. ## number of genes: 0
  16. ## number of peaks: 0
  17. ## number of motifs: 0

使用SnapATAC分析单细胞ATAC-seq数据(三):Integrative Analysis of PBMC scATAC-seq and scRNA-seq - 图3

最后,我们将进一步删除bin覆盖率小于1000的任何细胞,这是因为尽管有些细胞可能含有较高的unique fragments片段,但是过滤后的bin覆盖率却很低。此步骤是可选的,但强烈建议这样做。

  1. > idx = which(Matrix::rowSums(x.sp@bmat) > 1000);
  2. > x.sp = x.sp[idx,];
  3. > x.sp
  4. ## number of barcodes: 13434
  5. ## number of bins: 534985
  6. ## number of genes: 0
  7. ## number of peaks: 0
  8. ## number of motifs: 0

Step 5. Dimentionality reduction

SnapATAC采用diffusion maps算法进行数据降维,这是一种非线性降维的技术,它通过对数据执行random walk来发现低维的manifold,并且对噪音和扰动具有很强的鲁棒性。

但是,diffusion maps算法的计算成本会随着细胞数目的增加而呈现指数级增长的趋势。为了克服这一局限,我们将Nyström方法(a sampling technique)和diffusion maps算法相结合,给出Nyström Landmark diffusion map来生成大规模数据集的低维嵌入。

Nyström landmark diffusion maps算法主要包括以下三个步骤:

  • sampling: sample a subset of K (K≪N) cells from N total cells as “landmarks”. Instead of random sampling, here we adopted a density-based sampling approach developed in SCTransform to preserve the density distribution of the N original points;
  • embedding: compute a diffusion map embedding for K landmarks;
  • extension: project the remaining N-K cells onto the low-dimensional embedding as learned from the landmarks to create a joint embedding space for all cells.

在本示例中,我们将采样10,000个细胞作为landmarks,并将余下的query细胞投射到嵌入landmarks的diffusion maps上。

  1. > row.covs.dens <- density(
  2. x = x.sp@metaData[,"logUMI"],
  3. bw = 'nrd', adjust = 1
  4. );
  5. > sampling_prob <- 1 / (approx(x = row.covs.dens$x, y = row.covs.dens$y, xout = x.sp@metaData[,"logUMI"])$y + .Machine$double.eps);
  6. > set.seed(1);
  7. > idx.landmark.ds <- sort(sample(x = seq(nrow(x.sp)), size = 10000, prob = sampling_prob));

将x.sp对象分割为landmark(x.landmark.sp)细胞和query(x.query.sp)细胞。

  1. > x.landmark.sp = x.sp[idx.landmark.ds,];
  2. > x.query.sp = x.sp[-idx.landmark.ds,];

Run diffusion maps on the landmark cells.
使用landmark细胞进行diffusion maps降维处理

  1. > x.landmark.sp = runDiffusionMaps(
  2. obj= x.landmark.sp,
  3. input.mat="bmat",
  4. num.eigs=50
  5. );
  6. > x.landmark.sp@metaData$landmark = 1;

Porject query cells to landmarks.
将query细胞映射到landmarks上

  1. > x.query.sp = runDiffusionMapsExtension(
  2. obj1=x.landmark.sp,
  3. obj2=x.query.sp,
  4. input.mat="bmat"
  5. );
  6. > x.query.sp@metaData$landmark = 0;

Combine landmark and query cells.
合并landmark和query细胞
Note: To merge snap objects, all the matrix (bmat, gmat, pmat) and metaData must be of the same number of columns between snap objects.

  1. > x.sp = snapRbind(x.landmark.sp, x.query.sp);
  2. > x.sp = x.sp[order(x.sp@metaData[,"sample"])]; #IMPORTANT

Step 6. Determine significant components

接下来,我们将确定特征向量(eigen-vectors)的数目,以便用于后续的分析。在下面的例子中,我们选择前15个特征向量。

  1. > plotDimReductPW(
  2. obj=x.sp,
  3. eigs.dims=1:50,
  4. point.size=0.3,
  5. point.color="grey",
  6. point.shape=19,
  7. point.alpha=0.6,
  8. down.sample=5000,
  9. pdf.file.name=NULL,
  10. pdf.height=7,
  11. pdf.width=7
  12. );

使用SnapATAC分析单细胞ATAC-seq数据(三):Integrative Analysis of PBMC scATAC-seq and scRNA-seq - 图4

Step 7. Graph-based clustering

接下来,我们使用所选的特征向量成分,来构造K近邻(KNN)聚类图。其中,每个节点是一个细胞,根据欧氏距离来确定每个细胞的k个近邻细胞。

  1. > x.sp = runKNN(
  2. obj=x.sp,
  3. eigs.dims=1:20,
  4. k=15
  5. );
  6. > library(leiden);
  7. > x.sp=runCluster(
  8. obj=x.sp,
  9. tmp.folder=tempdir(),
  10. louvain.lib="leiden",
  11. seed.use=10,
  12. resolution=0.7
  13. );

Step 8. Visualization

SnapATAC可以使用tSNE(FI-tsne)和UMAP方法对降维聚类后的细胞进行可视化的展示和探索。在本示例中,我们使用UMAP方法进行展示。

  1. > library(umap);
  2. > x.sp = runViz(
  3. obj=x.sp,
  4. tmp.folder=tempdir(),
  5. dims=2,
  6. eigs.dims=1:20,
  7. method="umap",
  8. seed.use=10
  9. );
  10. > par(mfrow = c(2, 2));
  11. > plotViz(
  12. obj= x.sp,
  13. method="umap",
  14. main="Cluster",
  15. point.color=x.sp@cluster,
  16. point.size=0.2,
  17. point.shape=19,
  18. text.add=TRUE,
  19. text.size=1,
  20. text.color="black",
  21. down.sample=10000,
  22. legend.add=FALSE
  23. );
  24. > plotFeatureSingle(
  25. obj=x.sp,
  26. feature.value=x.sp@metaData[,"logUMI"],
  27. method="umap",
  28. main="Read Depth",
  29. point.size=0.2,
  30. point.shape=19,
  31. down.sample=10000,
  32. quantiles=c(0.01, 0.99)
  33. );
  34. > plotViz(
  35. obj= x.sp,
  36. method="umap",
  37. main="Sample",
  38. point.size=0.2,
  39. point.shape=19,
  40. point.color=x.sp@sample,
  41. text.add=FALSE,
  42. text.size=1.5,
  43. text.color="black",
  44. down.sample=10000,
  45. legend.add=TRUE
  46. );
  47. > plotViz(
  48. obj= x.sp,
  49. method="umap",
  50. main="Landmark",
  51. point.size=0.2,
  52. point.shape=19,
  53. point.color=x.sp@metaData[,"landmark"],
  54. text.add=FALSE,
  55. text.size=1.5,
  56. text.color="black",
  57. down.sample=10000,
  58. legend.add=TRUE
  59. );

使用SnapATAC分析单细胞ATAC-seq数据(三):Integrative Analysis of PBMC scATAC-seq and scRNA-seq - 图5

Step 9. scRNA-seq based annotation

在本示例中,我们将使用相应的scRNA-seq数据集来对scATAC-seq数据的细胞类群进行注释。我们可以通过(https://www.dropbox.com/s/3f3p5nxrn5b3y4y/pbmc_10k_v3.rds?dl=1)地址下载所需的10X PBMC scRNA-seq(pbmc_10k_v3.rds)的Seurat对象。

  1. > library(Seurat);
  2. > pbmc.rna = readRDS("pbmc_10k_v3.rds");
  3. > pbmc.rna$tech = "rna";
  4. > variable.genes = VariableFeatures(object = pbmc.rna);
  5. > genes.df = read.table("gencode.v19.annotation.gene.bed");
  6. > genes.gr = GRanges(genes.df[,1], IRanges(genes.df[,2], genes.df[,3]), name=genes.df[,4]);
  7. > genes.sel.gr = genes.gr[which(genes.gr$name %in% variable.genes)];
  8. ## reload the bmat, this is optional but highly recommanded
  9. > x.sp = addBmatToSnap(x.sp);
  10. > x.sp = createGmatFromMat(
  11. obj=x.sp,
  12. input.mat="bmat",
  13. genes=genes.sel.gr,
  14. do.par=TRUE,
  15. num.cores=10
  16. );

接下来,我们将snap对象转换为Seurat对象,用于后续的数据整合。

# 使用snapToSeurat函数将snap对象转换为Seurat对象
> pbmc.atac <- snapToSeurat(
    obj=x.sp, 
    eigs.dims=1:20, 
    norm=TRUE,
    scale=TRUE
  );

# 识别整合的Anchors信息
> transfer.anchors <- FindTransferAnchors(
    reference = pbmc.rna, 
    query = pbmc.atac, 
    features = variable.genes, 
    reference.assay = "RNA", 
    query.assay = "ACTIVITY", 
    reduction = "cca"
  );

# 使用TransferData函数进行数据映射整合
> celltype.predictions <- TransferData(
    anchorset = transfer.anchors, 
    refdata = pbmc.rna$celltype,
    weight.reduction = pbmc.atac[["SnapATAC"]],
    dims = 1:20
  );
> x.sp@metaData$predicted.id = celltype.predictions$predicted.id;
> x.sp@metaData$predict.max.score = apply(celltype.predictions[,-1], 1, max);
> x.sp@cluster = as.factor(x.sp@metaData$predicted.id);

Step 10. Create psudo multiomics cells

现在,snap对象x.sp中包含了的每个细胞的染色质可及性@bmat和基因表达@gmat的信息。

> refdata <- GetAssayData(
    object = pbmc.rna, 
    assay = "RNA", 
    slot = "data"
  );
> imputation <- TransferData(
    anchorset = transfer.anchors, 
    refdata = refdata, 
    weight.reduction = pbmc.atac[["SnapATAC"]], 
    dims = 1:20
  );
> x.sp@gmat = t(imputation@data);
> rm(imputation); # free memory
> rm(refdata);    # free memory
> rm(pbmc.rna);   # free memory
> rm(pbmc.atac); # free memory

Step 11. Remove cells of low prediction score

> hist(
    x.sp@metaData$predict.max.score, 
    xlab="prediction score", 
    col="lightblue", 
    xlim=c(0, 1),
    main="PBMC 10X"
  );
> abline(v=0.5, col="red", lwd=2, lty=2);

> table(x.sp@metaData$predict.max.score > 0.5);
## FALSE  TRUE
##  331 13103

> x.sp = x.sp[x.sp@metaData$predict.max.score > 0.5,];
> x.sp
## number of barcodes: 13045
## number of bins: 627478
## number of genes: 19089
## number of peaks: 0

> plotViz(
    obj=x.sp,
    method="umap", 
    main="PBMC 10X",
    point.color=x.sp@metaData[,"predicted.id"], 
    point.size=0.5, 
    point.shape=19, 
    text.add=TRUE,
    text.size=1,
    text.color="black",
    down.sample=10000,
    legend.add=FALSE
  );

使用SnapATAC分析单细胞ATAC-seq数据(三):Integrative Analysis of PBMC scATAC-seq and scRNA-seq - 图6

Step 12. Gene expression projected onto UMAP

接下来,我们使用UMAP方法对一些marker基因的表达水平进行可视化展示。

> marker.genes = c(
    "IL32", "LTB", "CD3D",
    "IL7R", "LDHB", "FCGR3A", 
    "CD68", "MS4A1", "GNLY", 
    "CD3E", "CD14", "CD14", 
    "FCGR3A", "LYZ", "PPBP", 
    "CD8A", "PPBP", "CST3", 
    "NKG7", "MS4A7", "MS4A1", 
    "CD8A"
  );

> par(mfrow = c(3, 3));
> for(i in 1:9){
    j = which(colnames(x.sp@gmat) == marker.genes[i])
    plotFeatureSingle(
        obj=x.sp,
        feature.value=x.sp@gmat[,j],
        method="umap", 
        main=marker.genes[i],
        point.size=0.1, 
        point.shape=19, 
        down.sample=10000,
        quantiles=c(0.01, 0.99)
 )};

使用SnapATAC分析单细胞ATAC-seq数据(三):Integrative Analysis of PBMC scATAC-seq and scRNA-seq - 图7

Step 13. Identify peaks

接下来,我们将每个cluster群中的reads进行聚合,创建用于peak calling和可视化的集成track。执行完peak calling后,将生成一个.narrowPeak的文件,该文件中包含了识别出的peaks信息,和一个.bedgraph的文件,可以用于可视化展示。为了获得最robust的结果,我们不建议对细胞数目小于100的cluster群执行此步骤。

> system("which snaptools");
/home/r3fang/anaconda2/bin/snaptools
> system("which macs2");
/home/r3fang/anaconda2/bin/macs2

# 使用runMACS函数调用macs2进行peak calling
> peaks = runMACS(
    obj=x.sp[which(x.sp@cluster=="CD4 Naive"),], 
    output.prefix="PBMC.CD4_Naive",
    path.to.snaptools="/home/r3fang/anaconda2/bin/snaptools",
    path.to.macs="/home/r3fang/anaconda2/bin/macs2",
    gsize="hs", # mm, hs, etc
    buffer.size=500, 
    num.cores=10,
    macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
    tmp.folder=tempdir()
    );

为了对所有的cluster群执行peak calling,我们提供了一个简便的脚本来实现该步骤。

# call peaks for all cluster with more than 100 cells
# 选出那些细胞数目大于100的cluster群
> clusters.sel = names(table(x.sp@cluster))[which(table(x.sp@cluster) > 100)];

# 批量对不同的cluster进行peak calling
> peaks.ls = mclapply(seq(clusters.sel), function(i){
    print(clusters.sel[i]);
    peaks = runMACS(
        obj=x.sp[which(x.sp@cluster==clusters.sel[i]),], 
        output.prefix=paste0("PBMC.", gsub(" ", "_", clusters.sel)[i]),
        path.to.snaptools="/home/r3fang/anaconda2/bin/snaptools",
        path.to.macs="/home/r3fang/anaconda2/bin/macs2",
        gsize="hs", # mm, hs, etc
        buffer.size=500, 
        num.cores=1,
        macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
        tmp.folder=tempdir()
   );
    peaks
     }, mc.cores=5);

> peaks.names = system("ls | grep narrowPeak", intern=TRUE);

> peak.gr.ls = lapply(peaks.names, function(x){
    peak.df = read.table(x)
    GRanges(peak.df[,1], IRanges(peak.df[,2], peak.df[,3]))
  })

# 合并peak信息
> peak.gr = reduce(Reduce(c, peak.gr.ls));

我们可以将每个cluster群生成的bdg文件导入到IGV或其他基因组浏览器(如UW genome browser)进行可视化的展示,下面是来自UW genome browser的FOXJ2基因及其侧翼区域的screenshot。
使用SnapATAC分析单细胞ATAC-seq数据(三):Integrative Analysis of PBMC scATAC-seq and scRNA-seq - 图8

Step 14. Create a cell-by-peak matrix

接下来,我们使用合并后的peaks作为参考,来创建一个cell-by-peak计数矩阵,并将其添加到snap对象中。
首先,我们将合并后的peaks信息写入到peaks.combined.bed文件中

> peaks.df = as.data.frame(peak.gr)[,1:3];
> write.table(peaks.df,file = "peaks.combined.bed",append=FALSE,
        quote= FALSE,sep="\t", eol = "\n", na = "NA", dec = ".", 
        row.names = FALSE, col.names = FALSE, qmethod = c("escape", "double"),
        fileEncoding = "")

接下来,我们创建一个cell-by-peak计数矩阵,并将其添加到snap对象中。这一步可能需要一段时间。

$ snaptools snap-add-pmat \
    --snap-file atac_pbmc_10k_nextgem.snap \
    --peak-file peaks.combined.bed &
$ snaptools snap-add-pmat \
    --snap-file atac_pbmc_5k_nextgem.snap \
    --peak-file peaks.combined.bed

然后,我们将计算好的cell-by-peak计数矩阵添加到已有的snap对象中。

> x.sp = addPmatToSnap(x.sp);

Step 15. Identify differentially accessible peaks

对于一组给定的细胞Ci,我们首先在diffusion component空间中寻找邻近的细胞Cj (|Ci|=|Cj|)作为“background”细胞进行比较。如果Ci细胞的数目占总细胞数目的一半以上,则使用剩余的细胞作为local background。接下来,我们将Ci和Cj细胞进行聚合,来创建两个raw-count向量,即Vci和Vcj。然后,我们使用R包edgeR (v3.18.1)的精确检验(exact test)来对Vci和Vcj进行差异分析,其中,对于小鼠设置BCV=0.1,而人类设置BCV= 0.4。最后,再使用Benjamini-Hochberg多重检验校正方法,来调整p-value值为错误发现率(FDR)值。对于FDR值小于0.05的peaks,被选为显著的DARs。

我们发现这种方法也存在一定的局限性,特别是对于那些细胞数目较少的cluster群,统计的有效性可能不够强大。对于那些未能识别出显著差异peaks的cluster群,我们将根据富集p-value值对元素进行排序,并挑选出最具代表性的2,000个peaks进行后续的motif分析。

> DARs = findDAR(
    obj=x.sp,
    input.mat="pmat",
    cluster.pos="CD14+ Monocytes",
    cluster.neg.method="knn",
    test.method="exactTest",
    bcv=0.4, #0.4 for human, 0.1 for mouse
    seed.use=10
  );
> DARs$FDR = p.adjust(DARs$PValue, method="BH");
> idy = which(DARs$FDR < 5e-2 & DARs$logFC > 0);

> par(mfrow = c(1, 2));
> plot(DARs$logCPM, DARs$logFC, 
    pch=19, cex=0.1, col="grey", 
    ylab="logFC", xlab="logCPM",
    main="CD14+ Monocytes"
  );
> points(DARs$logCPM[idy], 
    DARs$logFC[idy], 
    pch=19, 
    cex=0.5, 
    col="red"
  );
> abline(h = 0, lwd=1, lty=2);

> covs = Matrix::rowSums(x.sp@pmat);
> vals = Matrix::rowSums(x.sp@pmat[,idy]) / covs;
> vals.zscore = (vals - mean(vals)) / sd(vals);

> plotFeatureSingle(
    obj=x.sp,
    feature.value=vals.zscore,
    method="umap", 
    main="CD14+ Monocytes",
    point.size=0.1, 
    point.shape=19, 
    down.sample=5000,
    quantiles=c(0.01, 0.99)
  );

Step 16. Motif variability analysis

SnapATAC可以调用chromVAR(Schep et al)程序进行motif可变性分析。

# 加载所需的R包
> library(chromVAR);
> library(motifmatchr);
> library(SummarizedExperiment);
> library(BSgenome.Hsapiens.UCSC.hg19);

> x.sp = makeBinary(x.sp, "pmat");

# 使用runChromVAR函数调用ChromVAR进行motif可变性分析
> x.sp@mmat = runChromVAR(
    obj=x.sp,
    input.mat="pmat",
    genome=BSgenome.Hsapiens.UCSC.hg19,
    min.count=10,
    species="Homo sapiens"
  );
> x.sp;
## number of barcodes: 13103
## number of bins: 627478
## number of genes: 19089
## number of peaks: 157750
## number of motifs: 271

> motif_i = "MA0071.1_RORA";
> dat = data.frame(x=x.sp@metaData$predicted.id, y=x.sp@mmat[,motif_i]);
> p <- ggplot(dat, aes(x=x, y=y, fill=x)) + 
    theme_classic() +
    geom_violin() + 
    xlab("cluster") +
    ylab("motif enrichment") + 
    ggtitle("MA0071.1_RORA") +
    theme(
          plot.margin = margin(5,1,5,1, "cm"),
          axis.text.x = element_text(angle = 90, hjust = 1),
          axis.ticks.x=element_blank(),
          legend.position = "none"
   );

使用SnapATAC分析单细胞ATAC-seq数据(三):Integrative Analysis of PBMC scATAC-seq and scRNA-seq - 图9

Step 17. De novo motif discovery

SnapATAC还可以调用homer用于对识别出的差异可及性区域(DARs)进行motif的识别和富集分析.

# 查看findMotifsGenome.pl程序安装的路径
> system("which findMotifsGenome.pl");
/projects/ps-renlab/r3fang/public_html/softwares/homer/bin/findMotifsGenome.pl

# 使用findDAR函数鉴定DAR区域
> DARs = findDAR(
    obj=x.sp,
    input.mat="pmat",
    cluster.pos="Double negative T cell",
    cluster.neg.method="knn",
    test.method="exactTest",
    bcv=0.4, #0.4 for human, 0.1 for mouse
    seed.use=10
  );
> DARs$FDR = p.adjust(DARs$PValue, method="BH");
> idy = which(DARs$FDR < 5e-2 & DARs$logFC > 0);

# 使用runHomer函数调用homer进行de novo motif discovery
> motifs = runHomer(
    x.sp[,idy,"pmat"], 
    mat = "pmat",
    path.to.homer = "/projects/ps-renlab/r3fang/public_html/softwares/homer/bin/findMotifsGenome.pl",
    result.dir = "./homer/DoubleNegativeTcell",
    num.cores=5,
    genome = 'hg19',
    motif.length = 10,
    scan.size = 300,
    optimize.count = 2,
    background = 'automatic',
    local.background = FALSE,
    only.known = TRUE,
    only.denovo = FALSE,
    fdr.num = 5,
    cache = 100,
    overwrite = TRUE,
    keep.minimal = FALSE
    );

使用SnapATAC分析单细胞ATAC-seq数据(三):Integrative Analysis of PBMC scATAC-seq and scRNA-seq - 图10

Step 18. Predict gene-enhancer pairs

最后,我们还可以使用一种“pseudo”细胞的方法,根据单个细胞中基因的表达和远端调控元件的染色质可及性的关系,将远端调控元件连接到目标基因上。对于给定的marker基因,我们首先识别出marker基因侧翼区域内的peak。对于每个侧翼的peak,我们使用基因表达作为输入变量进行逻辑回归来预测binarized的染色质状态。产生的结果估计了染色质可及性与基因表达之间联系的重要性。

> TSS.loci = GRanges("chr12", IRanges(8219067, 8219068));

> pairs = predictGenePeakPair(
    x.sp, 
    input.mat="pmat",
    gene.name="C3AR1", 
    gene.loci=resize(TSS.loci, width=500000, fix="center"),
    do.par=FALSE
    );

# convert the pair to genome browser arc plot format
> pairs.df = as.data.frame(pairs);
> pairs.df = data.frame(
    chr1=pairs.df[,"seqnames"],
    start1=pairs.df[,"start"],
    end1=pairs.df[,"end"],
    chr2="chr2",
    start2=8219067,
    end2=8219068,
    Pval=pairs.df[,"logPval"]
    );
> head(pairs.df)
##    chr1  start1    end1 chr2  start2    end2       Pval
## 1 chr12 7984734 7985229 chr2 8219067 8219068 14.6075918
## 2 chr12 7987561 7988085 chr2 8219067 8219068  5.6718381
## 3 chr12 7989776 7990567 chr2 8219067 8219068 24.2564608
## 4 chr12 7996454 7996667 chr2 8219067 8219068  0.6411017
## 5 chr12 8000059 8000667 chr2 8219067 8219068  2.0324922
## 6 chr12 8012404 8013040 chr2 8219067 8219068  0.0000000

使用SnapATAC分析单细胞ATAC-seq数据(三):Integrative Analysis of PBMC scATAC-seq and scRNA-seq - 图11

参考来源:https://gitee.com/booew/SnapATAC/blob/master/examples/10X_PBMC_15K/README.md