本文来自“单细胞组学”公众号 原文链接:https://mp.weixin.qq.com/s/jUw79trMDRSAwCphSIbweA

ATAC-seq即transposase-accessible chromatin using sequencing,是一种检测染色体开放区域的技术。该方法由斯坦福大学Greenleaf实验室在2013年首次发表[1],两年后该实验室又发表基于单细胞的ATAC-seq技术[2]。

ATAC-seq相比于之前的FAIRE-seq和DNase-seq,ATAC-seq用Tn5转座酶对染色体开放区域进行剪切,并加上adaptors序列(图1的红蓝片段)。最后得到的DNA片段,包括了开放区域的剪切片段,以及横跨一个或多个核小体的长片段
scATAC-seq建库原理,质控方法和新R包Signac的使用 - 图1
图1. ATAC-seq[1]

根据片段长度可以分为Fragments in nucleosome-free regions(<147 base pairs)和**Fragments flanking a single nucleosome** (147~294 base pairs), 以及更长的多核片段。片段长度分布如下图,**没有跨越核小体的小片段最多**,其次是单核片段,依次递减。
scATAC-seq建库原理,质控方法和新R包Signac的使用 - 图2
图2. DNA片段长度分布

scATAC-seq建库原理

以10x 建库方法为例[5],比较scATAC-seq 和scRNA-seq建库方法的异同。
二者都用胶珠(GEMs)的方法,不一样的是ATAC胶珠上的序列中不用UMI,因为基因组只有一对序列,无需像RNA一样定量。另外序列末端用接头引物Read 1N代替PolyT。
scATAC-seq建库原理,质控方法和新R包Signac的使用 - 图3
scRNA-seq通过结合cDNA的PolyA尾进行扩增,而scATAC-seq的DNA片段没有PolyA尾,取而代之的是Tn5酶转座剪切时插入的adaptors片段,可以与胶珠上的Read 1N序列互补。
scATAC-seq建库原理,质控方法和新R包Signac的使用 - 图4
DNA片段接上胶珠后,在另一端加Read2和Sample index序列。在此之前,scRNA-seq需要将cDNA酶切至合适的片段长度,而scATAC-seq的片段不进行打碎,接上Sample index和P7序列后进行扩增。
scATAC-seq建库原理,质控方法和新R包Signac的使用 - 图5
最后上机测序。scRNAseq如果是3‘单端测序,Read2读取最近的100bp读长,而Read1只读取16bp的细胞barcode序列和10bp的UMI序列,共26bp。scATAC-seq则用双末端测序,读长一般不低于45bp[3]。
scATAC-seq建库原理,质控方法和新R包Signac的使用 - 图6

scATAC-seq最后可以得到4个原始文件:

scATAC-seq建库原理,质控方法和新R包Signac的使用 - 图7
其中I1/2分别是barcode和sample index,R1/2是目的片段的双末端。

10x提供cellranger软件对原始数据进行初步分析,如质控,比对,peak calling等。

  1. $ cellranger-atac count --id=sample345 \
  2. --reference=/opt/refdata-cellranger-atac-GRCh38-1.2.0 \
  3. --fastqs=/home/jdoe/runs/HAWT7ADXX/outs/fastq_path \
  4. --sample=mysample \
  5. --localcores=8 \
  6. --localmem=64

质控方法

重要指标:
1. Fraction of fragments overlapping any targeted region
覆盖注释区域的片段比例。目标区域包括TSS,DNase HS,增强子和启动子等。一般要求大于55%。

2. Fraction of transposition events in peaks in cell barcodes
转座位点位于peaks区域的比例。一般大于25%。如果比例过小,有可能是细胞状态不好,或者测序深度不够。

  1. Enrichment score of transcription start sites(TSS)
    转录起始位点的标准化分数。阈值选择根据基因组而定。人类一般选择TSS分数大于5或者6的细胞样本[3]。如果分数太低说明细胞染色质结构瓦解,或者实验过程细胞裂解方式不当。image.png
    0ca1923c89e1b8870cc74353b6e2df20.png
    TSS标准化分数的计算方法不同,可能导致阈值偏差。10X的标准化方法是cut sites数除以最小值,而ENCODE的标准是在cut sites数除以两个末端各100bp的cut site数的平均值。由于一般越靠近中间数值越大,ENCODE的标准化分数整体比10x的分数小一些
    image.png42a2f846146e81728d23fe737661d28d.png
    其他指标:

  2. Fraction of read pairs with a valid barcode > 75%;

  3. Q30 bases in R1 > 65%;
  4. Q30 bases in R2 > 65%;
  5. Q30 bases in barcode > 65%;
  6. Q30 bases in Sample Index > 90%;
  7. Estimated Number of Cells 500~10000 +- 20%;
  8. Median fragments per cell barcode > 500;
  9. Fragments in nucleosomefree regions >55%;
  10. Fraction of total read pairs mapped confidently to genome (>30 mapq) >80%;
  11. Fraction of total read pairs in mitochondria and in cell barcodes < 40%;

    下游分析(以Signac为例)

    Signac包由Seurat同一团队开发,独立于Seurat包,在2020年8月开始发布在GitHub上。目前仍是1.0.0版本[4]。

我们可以用10x官网的PBMC数据做演示。

首先加载所需R包:

  1. library(Signac)
  2. library(Seurat)
  3. library(GenomeInfoDb)
  4. library(EnsDb.Hsapiens.v75)
  5. library(ggplot2)
  6. library(patchwork)
  7. set.seed(1234)

加载peaks, 细胞注释和片段分布数据,并创建object。这个object和Seurat object类似,只是在assay里多了peaks等信息。

  1. counts <- Read10X_h5(filename = "atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
  2. metadata <- read.csv(
  3. file = "atac_v1_pbmc_10k_singlecell.csv",
  4. header = TRUE,
  5. row.names = 1
  6. )
  7. chrom_assay <- CreateChromatinAssay(
  8. counts = counts,
  9. sep = c(":", "-"),
  10. genome = 'hg19',
  11. fragments = 'atac_v1_pbmc_10k_fragments.tsv.gz',
  12. min.cells = 10,
  13. min.features = 200
  14. )
  15. pbmc <- CreateSeuratObject(
  16. counts = chrom_assay,
  17. assay = "peaks",
  18. meta.data = metadata
  19. )

总共8728个细胞,87405个features。features不是基因,是基因组的注释区域,如启动子,增强子等。

  1. pbmc[['peaks']]
  2. ## ChromatinAssay data with 87405 features for 8728 cells
  3. ## Variable features: 0
  4. ## Genome: hg19
  5. ## Annotation present: FALSE
  6. ## Motifs present: FALSE
  7. ## Fragment files: 1

加载注释:

  1. # extract gene annotations from EnsDb
  2. annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v75)
  3. # change to UCSC style since the data was mapped to hg19
  4. seqlevelsStyle(annotations) <- 'UCSC'
  5. genome(annotations) <- "hg19"
  6. # add the gene information to the object
  7. Annotation(pbmc) <- annotations

TSS和blacklist比例计算。

  1. # compute nucleosome signal score per cell
  2. pbmc <- NucleosomeSignal(object = pbmc)
  3. # compute TSS enrichment score per cell
  4. pbmc <- TSSEnrichment(object = pbmc, fast = FALSE)
  5. # add blacklist ratio and fraction of reads in peaks
  6. pbmc$pct_reads_in_peaks <- pbmc$peak_region_fragments / pbmc$passed_filters * 100
  7. pbmc$blacklist_ratio <- pbmc$blacklist_region_fragments / pbmc$peak_region_fragments
  8. pbmc$high.tss <- ifelse(pbmc$TSS.enrichment > 2, 'High', 'Low')
  9. TSSPlot(pbmc, group.by = 'high.tss') + NoLegend()

81d9fdf15c1c5ee4ea0178dded3fdd6c.png
质控。Signac包用5个指标做过滤:

  1. TSS富集分数,大于2;
  2. blacklist比例,小于0.05;
  3. 缠绕核小体的片段与非核小体片段(< 147bp)的比例,小于4;
  4. 匹配peaks区域的片段比例,大于15%;
  5. 匹配peaks区域的片段数,大于3000,小于20000。
    1. VlnPlot(
    2. object = pbmc,
    3. features = c('pct_reads_in_peaks', 'peak_region_fragments', 'TSS.enrichment', 'blacklist_ratio', 'nucleosome_signal'),
    4. pt.size = 0.1,
    5. ncol = 5
    6. )
    58f8cf9ed3443d242ccc08857e21d115.png ```r pbmc <- subset( x = pbmc, subset = peak_region_fragments > 3000 & peak_region_fragments < 20000 & pct_reads_in_peaks > 15 & blacklist_ratio < 0.05 & nucleosome_signal < 4 & TSS.enrichment > 2 )

pbmc

An object of class Seurat

87405 features across 7060 samples within 1 assay

Active assay: peaks (87405 features, 0 variable features)

  1. 降维聚类。
  2. ```r
  3. pbmc <- RunTFIDF(pbmc)
  4. pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0')
  5. pbmc <- RunSVD(pbmc)
  6. pbmc <- RunUMAP(object = pbmc, reduction = 'lsi', dims = 2:30)
  7. pbmc <- FindNeighbors(object = pbmc, reduction = 'lsi', dims = 2:30)
  8. pbmc <- FindClusters(object = pbmc, verbose = FALSE, algorithm = 3)
  9. DimPlot(object = pbmc, label = TRUE) + NoLegend()

ce742ec0bfa092f2463f97459c2d16ae.png
创建基因活性矩阵。之前的聚类区域所用的features是peaks,为了展示不同分群基因活性的差异,首先要创建一个类似RNA表达的矩阵。用基因加上游2000bp区域的比对片段数代表该基因的活性。

  1. gene.activities <- GeneActivity(pbmc)
  2. # add the gene activity matrix to the Seurat object as a new assay and normalize it
  3. pbmc[['RNA']] <- CreateAssayObject(counts = gene.activities)
  4. pbmc <- NormalizeData(
  5. object = pbmc,
  6. assay = 'RNA',
  7. normalization.method = 'LogNormalize',
  8. scale.factor = median(pbmc$nCount_RNA)
  9. )

展示marker基因活性:

  1. DefaultAssay(pbmc) <- 'RNA'
  2. FeaturePlot(
  3. object = pbmc,
  4. features = c('MS4A1', 'CD3D', 'LEF1', 'NKG7', 'TREM1', 'LYZ'),
  5. pt.size = 0.1,
  6. max.cutoff = 'q95',
  7. ncol = 3
  8. )

3720ce010478840b5b78cb2b9fd66b2a.pngimage.png
与scRNA-seq数据的整合分析。单细胞转录组数据地址:https://www.dropbox.com/s/zn6khirjafoyyxl/pbmc_10k_v3.rds?dl=0。

  1. # Load the pre-processed scRNA-seq data for PBMCs
  2. pbmc_rna <- readRDS("/home/stuartt/github/chrom/vignette_data/pbmc_10k_v3.rds")
  3. transfer.anchors <- FindTransferAnchors(
  4. reference = pbmc_rna,
  5. query = pbmc,
  6. reduction = 'cca'
  7. )
  8. predicted.labels <- TransferData(
  9. anchorset = transfer.anchors,
  10. refdata = pbmc_rna$celltype,
  11. weight.reduction = pbmc[['lsi']],
  12. dims = 2:30
  13. )
  14. pbmc <- AddMetaData(object = pbmc, metadata = predicted.labels)

0336e81cf5e74e840d7623a8a0cc7ed9.pngimage.png
寻找细胞分群特异的peaks。

  1. # change back to working with peaks instead of gene activities
  2. DefaultAssay(pbmc) <- 'peaks'
  3. da_peaks <- FindMarkers(
  4. object = pbmc,
  5. ident.1 = "CD4 Naive",
  6. ident.2 = "CD14 Mono",
  7. min.pct = 0.2,
  8. test.use = 'LR',
  9. latent.vars = 'peak_region_fragments'
  10. )
  11. plot1 <- VlnPlot(
  12. object = pbmc,
  13. features = rownames(da_peaks)[1],
  14. pt.size = 0.1,
  15. idents = c("CD4 Memory","CD14 Mono")
  16. )
  17. plot2 <- FeaturePlot(
  18. object = pbmc,
  19. features = rownames(da_peaks)[1],
  20. pt.size = 0.1
  21. )
  22. plot1 | plot2

7179296c8bead4c02d451267d79971ca.pngimage.png
展示基因在不同细胞类型的开放程度:

  1. # set plotting order
  2. levels(pbmc) <- c("CD4 Naive","CD4 Memory","CD8 Naive","CD8 Effector","DN T","NK CD56bright","NK CD56Dim","pre-B",'pro-B',"pDC","DC","CD14 Mono",'CD16 Mono')
  3. CoveragePlot(
  4. object = pbmc,
  5. region = rownames(da_peaks)[1],
  6. extend.upstream = 40000,
  7. extend.downstream = 20000
  8. )

scATAC-seq建库原理,质控方法和新R包Signac的使用 - 图21
此外还有其他分析,如TF footprinting等。footprinting顾名思义是指转录因子留下的印记,由于Tn5酶不能剪切到TF结合的区域,所以footprinting图相对与TSS图,中间有“凹陷”,凹陷的程度根据TF结合的时间确定[6]。
scATAC-seq建库原理,质控方法和新R包Signac的使用 - 图22

总的来说,Signac包是一个亲民实用的工具。虽然有一些不足的地方,如染色体的注释目前只能选择UCSC的,不能选Ensemble。

参考文献:
[1] Buenrostro, J., Giresi, P., Zaba, L. et al. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods 10, 1213–1218 (2013). https://doi.org/10.1038/nmeth.2688
[2] Buenrostro, J., Wu, B., Litzenburger, U. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015). https://doi.org/10.1038/nature14590
[3] https://www.encodeproject.org/atac-seq/
[4] https://satijalab.org/signac/index.html
[5]https://assets.ctfassets.net/an68im79xiti/Cts31zFxXFXVwJ1lzU3Pc/fe66343ffd3039de73ecee6a1a6f5b7b/CG000202_TechnicalNote_InterpretingCellRangerATACWebSummaryFiles_Rev-A.pdf
[6] Sung, M., Baek, S. & Hager, G. Genome-wide footprinting: ready for prime time?. Nat Methods 13, 222–228 (2016). https://doi.org/10.1038/nmeth.3766