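Introduction

In this tutorial, we perform an integrative analysis of single-cell ATAC-seq data from adult mouse brain generated with two technologies, 10X and snATAC-seq. All data used in this example can be downloaded from: http://renlab.sdsc.edu/r3fang/share/github/Mouse_Brain_10X_snATAC/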
[Figure 1: Integrative Analysis of 10X and snATAC]

Analysis workflow

  • Step 0. Download data
  • Step 1. Create snap object
  • Step 2. Select barcode
  • Step 3. Add cell-by-bin matrix
  • Step 4. Combine snap objects
  • Step 5. Binarize matrix
  • Step 6. Filter bins
  • Step 7. Reduce dimensionality
  • Step 8. Determine significant components
  • Step 9. Remove batch effect
  • Step 10. Graph-based cluster
  • Step 11. Visualization

Step 0. Download data

  # Download the required datasets
  $ wget http://renlab.sdsc.edu/r3fang/share/github/Mouse_Brain_10X_snATAC/CEMBA180305_2B.snap
  $ wget http://renlab.sdsc.edu/r3fang/share/github/Mouse_Brain_10X_snATAC/CEMBA180305_2B.barcode.txt
  $ wget http://renlab.sdsc.edu/r3fang/share/github/Mouse_Brain_10X_snATAC/atac_v1_adult_brain_fresh_5k.snap
  $ wget http://renlab.sdsc.edu/r3fang/share/github/Mouse_Brain_10X_snATAC/atac_v1_adult_brain_fresh_5k.barcode.txt

Step 1. Create snap object

First, we read the two datasets into a list of snap objects.

  # Load the SnapATAC package
  > library(SnapATAC);
  > file.list = c("CEMBA180305_2B.snap", "atac_v1_adult_brain_fresh_5k.snap");
  > sample.list = c("snATAC", "10X");
  # Read the snap files
  > x.sp.ls = lapply(seq(file.list), function(i){
      x.sp = createSnap(file=file.list[i], sample=sample.list[i]);
      x.sp
    })
  > names(x.sp.ls) = sample.list;
  # Inspect the snap objects
  > x.sp.ls
  ## $snATAC
  ## number of barcodes: 15136
  ## number of bins: 0
  ## number of genes: 0
  ## number of peaks: 0
  ## number of motifs: 0
  ##
  ## $`10X`
  ## number of barcodes: 20000
  ## number of bins: 0
  ## number of genes: 0
  ## number of peaks: 0
  ## number of motifs: 0

Step 2. Select barcode

Next, we read the barcode information for the two datasets and keep only the high-quality barcodes.

  > barcode.file.list = c("CEMBA180305_2B.barcode.txt", "atac_v1_adult_brain_fresh_5k.barcode.txt");
  # Read the barcode lists
  > barcode.list = lapply(barcode.file.list, function(file){
      read.table(file)[,1];
    })
  # Keep only the barcodes that appear in the corresponding barcode list
  > x.sp.list = lapply(seq(x.sp.ls), function(i){
      x.sp = x.sp.ls[[i]];
      x.sp = x.sp[x.sp@barcode %in% barcode.list[[i]],];
      x.sp
    })
  > names(x.sp.list) = sample.list;
  > x.sp.list
  ## $snATAC
  ## number of barcodes: 9646
  ## number of bins: 0
  ## number of genes: 0
  ## number of peaks: 0
  ## number of motifs: 0
  ##
  ## $`10X`
  ## number of barcodes: 4100
  ## number of bins: 0
  ## number of genes: 0
  ## number of peaks: 0
  ## number of motifs: 0
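
The selection above is plain set membership on barcode strings. A minimal base-R sketch on made-up barcodes shows what the `%in%` filter keeps; `all.barcodes` and `pass.barcodes` are hypothetical stand-ins for `x.sp@barcode` and the contents of a barcode file.

```r
# Toy stand-ins (hypothetical values, not taken from the real data)
all.barcodes  <- c("AAAC", "AAAG", "AAAT", "AACA", "AACC")  # every barcode in the snap file
pass.barcodes <- c("AAAG", "AACA", "TTTT")                  # barcodes listed as high quality

# Logical mask: TRUE where a barcode from the snap file is in the pass list
keep <- all.barcodes %in% pass.barcodes

selected <- all.barcodes[keep]
print(selected)

# Barcodes in the pass list that are absent from the snap file are simply ignored
not.found <- setdiff(pass.barcodes, all.barcodes)
print(not.found)
```

Checking `not.found` can be a useful sanity test: a long list there may point to a barcode file that belongs to a different sample.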

Step 3. Add cell-by-bin matrix

  # Use addBmatToSnap to add the cell-by-bin count matrix (5 kb bins) to each snap object
  > x.sp.list = lapply(seq(x.sp.list), function(i){
      x.sp = addBmatToSnap(x.sp.list[[i]], bin.size=5000);
      x.sp
    })
  # lapply over an index drops the list names, so restore them
  > names(x.sp.list) = sample.list;
  > x.sp.list
  ## $snATAC
  ## number of barcodes: 9646
  ## number of bins: 545118
  ## number of genes: 0
  ## number of peaks: 0
  ## number of motifs: 0
  ##
  ## $`10X`
  ## number of barcodes: 4100
  ## number of bins: 546206
  ## number of genes: 0
  ## number of peaks: 0
  ## number of motifs: 0

Note that the two snap objects contain different numbers of bins, because the two datasets were processed with slightly different reference genomes.

Step 4. Combine snap objects

Next, we combine the two datasets. Because the two snap objects have different bins, we first select the bins they share.

  # Select the bins shared by both datasets
  > bin.shared = Reduce(intersect, lapply(x.sp.list, function(x.sp) x.sp@feature$name));
  > x.sp.list <- lapply(x.sp.list, function(x.sp){
      idy = match(bin.shared, x.sp@feature$name);
      x.sp[,idy, mat="bmat"];
    })
  # Combine the two datasets
  > x.sp = Reduce(snapRbind, x.sp.list);
  > rm(x.sp.list); # free memory
  > gc();
  > table(x.sp@sample);
  ##    10X snATAC
  ##   4100   9646
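
The `Reduce(intersect, ...)` plus `match()` pattern can be tried on small base-R matrices. The sketch below uses hypothetical bin names and counts; it only illustrates how the shared columns are selected and aligned before stacking, and does not use SnapATAC objects.

```r
# Toy bin names for two datasets (hypothetical)
bins.a <- c("chr1:1-5000", "chr1:5001-10000", "chr2:1-5000", "chrX:1-5000")
bins.b <- c("chr2:1-5000", "chr1:1-5000", "chrY:1-5000")

# Toy count matrices: rows are cells, columns are bins
mat.a <- matrix(1:8, nrow = 2, dimnames = list(c("a1", "a2"), bins.a))
mat.b <- matrix(1:6, nrow = 2, dimnames = list(c("b1", "b2"), bins.b))

# Bins present in every dataset, in the order of the first dataset
bin.shared <- Reduce(intersect, list(bins.a, bins.b))

# match() gives the column position of each shared bin in each matrix,
# so both matrices end up with identical column order
mat.a.sub <- mat.a[, match(bin.shared, colnames(mat.a)), drop = FALSE]
mat.b.sub <- mat.b[, match(bin.shared, colnames(mat.b)), drop = FALSE]

# With aligned columns the cells can be stacked row-wise
combined <- rbind(mat.a.sub, mat.b.sub)
print(combined)
```

Aligning by name rather than by position matters here because the same bin can sit at different column indices in the two datasets.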

Step 5. Binarize matrix

  # Use makeBinary to convert the count matrix into a binary matrix
  > x.sp = makeBinary(x.sp, mat="bmat");
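
Conceptually, binarization records only whether a bin is accessible in a cell, not how many reads fell in it. The base-R sketch below shows that idea on a hypothetical dense count matrix; it is a simplification, and `makeBinary` itself works on the sparse matrix inside the snap object and may apply additional handling that is not reproduced here.

```r
# Toy cell-by-bin count matrix (hypothetical values)
counts <- matrix(
  c(0, 3, 1,
    2, 0, 0,
    0, 0, 7),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("cell", 1:3), paste0("bin", 1:3))
)

# Any non-zero count becomes 1; zeros stay 0
binary <- (counts > 0) * 1

print(binary)
```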

Step 6. Filter bins

First, we filter out bins that overlap the ENCODE blacklist regions to prevent potential artifacts.

  > system("wget http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz");
  > library(GenomicRanges);
  > black_list = read.table("mm10.blacklist.bed.gz");
  > black_list.gr = GRanges(
      black_list[,1],
      IRanges(black_list[,2], black_list[,3])
    );
  # Indices of bins that overlap any blacklist region
  > idy = queryHits(findOverlaps(x.sp@feature, black_list.gr));
  > if(length(idy) > 0){x.sp = x.sp[,-idy, mat="bmat"]};
  > x.sp
  ## number of barcodes: 13746
  ## number of bins: 545015
  ## number of genes: 0
  ## number of peaks: 0
  ## number of motifs: 0

Second, we remove bins on unwanted chromosomes (random contigs and chrM).

  > chr.exclude = seqlevels(x.sp@feature)[grep("random|chrM", seqlevels(x.sp@feature))];
  > idy = grep(paste(chr.exclude, collapse="|"), x.sp@feature);
  > if(length(idy) > 0){x.sp = x.sp[,-idy, mat="bmat"]};
  > x.sp
  ## number of barcodes: 13746
  ## number of bins: 545011
  ## number of genes: 0
  ## number of peaks: 0
  ## number of motifs: 0
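
One caveat about a pattern built with `paste(..., collapse="|")`: if no chromosome matches `random|chrM`, `chr.exclude` is empty, the pattern collapses to an empty string, and `grep("")` matches every element, so the subsequent filter would drop all bins. The base-R sketch below demonstrates this on hypothetical chromosome names and shows one possible guard, exact matching with `%in%`. It is an illustration, not part of the original workflow.

```r
# Hypothetical chromosome label for each bin
bin.chr <- c("chr1", "chr1", "chr2", "chrM", "chr1_GL456210_random")

# Case 1: some chromosomes match the exclusion pattern
chr.exclude <- unique(bin.chr[grep("random|chrM", bin.chr)])
idy <- which(bin.chr %in% chr.exclude)   # exact match on chromosome names
print(idy)

# Case 2: nothing matches, so the exclusion list is empty
clean.chr <- c("chr1", "chr2", "chr3")
none <- unique(clean.chr[grep("random|chrM", clean.chr)])
empty.pattern <- paste(none, collapse = "|")   # collapses to ""
hits.grep <- grep(empty.pattern, clean.chr)    # empty pattern matches everything
hits.in   <- which(clean.chr %in% none)        # %in% matches nothing
print(hits.grep)
print(hits.in)
```

In this dataset the exclusion list is not empty, so the original code behaves as intended; the guard only matters when the same snippet is reused on data without such chromosomes.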

Third, bin coverage roughly follows a log-normal distribution. We remove the top 5% of bins by coverage, which tend to overlap invariant features such as the promoters of housekeeping genes. Bins with zero coverage are removed as well.

  > bin.cov = log10(Matrix::colSums(x.sp@bmat)+1);
  > bin.cutoff = quantile(bin.cov[bin.cov > 0], 0.95);
  > idy = which(bin.cov <= bin.cutoff & bin.cov > 0);
  > x.sp = x.sp[, idy, mat="bmat"];
  > x.sp
  ## number of barcodes: 13746
  ## number of bins: 479127
  ## number of genes: 0
  ## number of peaks: 0
  ## number of motifs: 0
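
The coverage filter can be reproduced on simulated numbers to see what the two conditions do. The sketch below draws hypothetical per-bin coverage from a log-normal distribution (the parameters are arbitrary) and applies the same `quantile` cutoff as above.

```r
set.seed(42)

# Simulated per-bin coverage (hypothetical): number of cells with signal in each bin
n.bins <- 1000
bin.counts <- round(rlnorm(n.bins, meanlog = 3, sdlog = 1))
bin.counts[1:50] <- 0   # some bins with no coverage at all

# Same transformation as in the tutorial: log10(coverage + 1)
bin.cov <- log10(bin.counts + 1)

# 95th percentile among covered bins only
bin.cutoff <- quantile(bin.cov[bin.cov > 0], 0.95)

# Keep bins that are covered but not in the top 5%
idy <- which(bin.cov <= bin.cutoff & bin.cov > 0)

cat("bins before:", n.bins, "\n")
cat("bins kept:  ", length(idy), "\n")
```

Computing the quantile on covered bins only keeps the large number of empty bins from dragging the cutoff down.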

Step 7. Reduce dimensionality

We reduce dimensionality with diffusion maps, using the landmark approach. We first sample 10,000 cells as landmarks, with sampling probability inversely proportional to the density of cell coverage, and compute the diffusion maps on them. The remaining query cells are then projected onto the resulting diffusion maps embedding.

  # Per-cell coverage and its density estimate
  > row.covs = log10(Matrix::rowSums(x.sp@bmat)+1);
  > row.covs.dens = density(
      x = row.covs,
      bw = 'nrd', adjust = 1
    );
  # Sampling probability is the inverse of the density at each cell's coverage
  > sampling_prob = 1 / (approx(x = row.covs.dens$x, y = row.covs.dens$y, xout = row.covs)$y + .Machine$double.eps);
  > set.seed(1);
  > idx.landmark.ds = sort(sample(x = seq(nrow(x.sp)), size = 10000, prob = sampling_prob));
  > x.landmark.sp = x.sp[idx.landmark.ds,];
  > x.query.sp = x.sp[-idx.landmark.ds,];
  > x.landmark.sp = runDiffusionMaps(
      obj= x.landmark.sp,
      input.mat="bmat",
      num.eigs=50
    );
  > x.query.sp = runDiffusionMapsExtension(
      obj1=x.landmark.sp,
      obj2=x.query.sp,
      input.mat="bmat"
    );
  > x.landmark.sp@metaData$landmark = 1;
  > x.query.sp@metaData$landmark = 0;
  # Combine landmark and query cells
  > x.sp = snapRbind(x.landmark.sp, x.query.sp);
  > x.sp = x.sp[order(x.sp@sample),]; # IMPORTANT
  > rm(x.landmark.sp, x.query.sp); # free memory
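
The landmark selection is a density-weighted sample: cells whose coverage lies in sparsely populated regions of the distribution receive a higher sampling probability. The base-R sketch below reproduces that step on simulated coverage values; the two populations, their sizes, and the number of landmarks are hypothetical choices made only for illustration.

```r
set.seed(1)

# Simulated per-cell coverage on a log10 scale (hypothetical values):
# most cells cluster around 3.5, a small population sits near 4.5
row.covs <- c(rnorm(900, mean = 3.5, sd = 0.2), rnorm(100, mean = 4.5, sd = 0.2))

# Kernel density estimate of the coverage distribution
dens <- density(x = row.covs, bw = "nrd", adjust = 1)

# Density at each cell's coverage, by linear interpolation on the density grid
dens.at.cell <- approx(x = dens$x, y = dens$y, xout = row.covs)$y

# Sampling weight is the inverse density: cells in sparse regions are favoured
sampling.prob <- 1 / (dens.at.cell + .Machine$double.eps)

# Draw 200 landmarks without replacement; the rest are query cells
idx.landmark <- sort(sample(x = seq_along(row.covs), size = 200, prob = sampling.prob))
idx.query <- setdiff(seq_along(row.covs), idx.landmark)

# The rare high-coverage population tends to be over-represented among landmarks
frac.rare.all <- mean(seq_along(row.covs) > 900)
frac.rare.landmark <- mean(idx.landmark > 900)
cat("rare fraction overall:        ", frac.rare.all, "\n")
cat("rare fraction among landmarks:", frac.rare.landmark, "\n")
```

Weighting by inverse density is one way to keep small populations represented among the landmarks, which a uniform sample of the same size might miss.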

Step 8. Determine significant components

> plotDimReductPW(
    obj=x.sp, 
    eigs.dims=1:50,
    point.size=0.3,
    point.color="grey",
    point.shape=19,
    point.alpha=0.6,
    down.sample=5000,
    pdf.file.name=NULL, 
    pdf.height=7, 
    pdf.width=7
  );

[Figure 2: Integrative Analysis of 10X and snATAC]

Step 9. Remove batch effect

> library(harmony);
# Use runHarmony to correct the batch effect between the two samples
> x.after.sp = runHarmony(
    obj=x.sp, 
    eigs.dims=1:22, 
    meta_data=x.sp@sample # sample index
  );

Step 10. Graph-based cluster

> x.after.sp = runKNN(
    obj= x.after.sp,
    eigs.dims=1:22,
    k=15
  );

> x.after.sp = runCluster(
     obj=x.after.sp,
     tmp.folder=tempdir(),
     louvain.lib="R-igraph",
     path.to.snaptools=NULL,
     seed.use=10
  );

> x.after.sp@metaData$cluster = x.after.sp@cluster;
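
A simple check after clustering is to tabulate how the two samples are distributed across clusters. The base-R sketch below does this on simulated labels; with the real data the inputs would presumably be `x.after.sp@cluster` and `x.after.sp@sample`, and the 0.95 threshold is an arbitrary illustrative choice.

```r
set.seed(10)

# Simulated labels (hypothetical): 300 cells, two samples, three clusters
sample.label  <- factor(rep(c("10X", "snATAC"), times = c(100, 200)))
cluster.label <- factor(sample(1:3, size = 300, replace = TRUE))

# Cells per cluster and sample
tab <- table(cluster = cluster.label, sample = sample.label)
print(tab)

# Fraction of each cluster contributed by each sample (rows sum to 1)
frac <- prop.table(tab, margin = 1)
print(round(frac, 2))

# Clusters made up almost entirely of one sample deserve a closer look
suspect <- rownames(frac)[apply(frac, 1, max) > 0.95]
print(suspect)
```

A cluster dominated by one sample is not necessarily a batch artifact, since a cell type may be present in only one dataset, so this table is a starting point for inspection rather than a verdict.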

Step 11. Visualization

> x.sp = runViz(
    obj=x.sp, 
    tmp.folder=tempdir(),
    dims=2,
    eigs.dims=1:22, 
    method="Rtsne",
    seed.use=10
  );
> x.after.sp = runViz(
    obj=x.after.sp, 
    tmp.folder=tempdir(),
    dims=2,
    eigs.dims=1:22, 
    method="Rtsne",
    seed.use=10
  );
> par(mfrow = c(2, 3));
> plotViz(
    obj=x.sp,
    method="tsne", 
    main="Before Harmony",
    point.color=x.sp@sample, 
    point.size=0.1, 
    text.add= FALSE,
    down.sample=10000,
    legend.add=TRUE
  );
> plotViz(
    obj=x.after.sp,
    method="tsne", 
    main="After Harmony",
    point.color=x.after.sp@sample, 
    point.size=0.1, 
    text.add=FALSE,
    down.sample=10000,
    legend.add=TRUE
  );
> plotViz(
    obj=x.after.sp,
    method="tsne", 
    main="Cluster",
    point.color=x.after.sp@cluster, 
    point.size=0.1, 
    text.add=TRUE,
    text.size=1,
    text.color="black",
    text.halo.add=TRUE,
    text.halo.color="white",
    text.halo.width=0.2,
    down.sample=10000,
    legend.add=FALSE
  );

[Figure 3: Integrative Analysis of 10X and snATAC]

Source: https://gitee.com/booew/SnapATAC/blob/master/examples/10X_snATAC/README.md