使用cellassign包进行单细胞类型注释分析(一):Constructing marker genes from purified scRNA-seq data - 图1

cellassign包简介

cellassign基于marker基因的信息自动将单细胞RNA-seq数据分配注释到已知的细胞类型中。它以marker基因的细胞类型矩阵作为输入,提供先验的已知marker基因是否属于某种细胞类型。然后,cellassign会概率性地将每个细胞分配给一个细胞类型,从而消除了典型无监督聚类中的主观偏见。
使用cellassign包进行单细胞类型注释分析(一):Constructing marker genes from purified scRNA-seq data - 图2

参考文献:Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling, Nature Methods 2019

cellassign包的安装

Installing from GitHub

cellassign is built using Google's Tensorflow, and as such requires installation of the R package tensorflow:

  1. # 安装tensorflow包
  2. install.packages("tensorflow")
  3. tensorflow::install_tensorflow(extra_packages='tensorflow-probability', version = "2.1.0")
  4. # Please ensure this installs version 2 of tensorflow. You can check this by calling
  5. tensorflow::tf_config()
  6. # TensorFlow v2.1.0 (/usr/local/lib/python3.7/site-packages/tensorflow)
  7. # cellassign can then be installed from github:
  8. # 使用devtools安装cellassign包
  9. install.packages("devtools") # If not already installed
  10. devtools::install_github("Irrationone/cellassign")

Installing from conda

With conda, install the current release version of cellassign as follows:

# 使用conda安装cellassign包
conda install -c conda-forge -c bioconda r-cellassign

基础用法

cellassign需要以下输入文件:

  • exprs_obj(基因表达矩阵): Cell-by-gene matrix of raw counts (or SingleCellExperiment with counts assay)
  • marker_gene_info(标记基因列表): Binary gene-by-celltype marker gene matrix or list relating cell types to marker genes
  • s: Size factors
  • X: Design matrix for any patient/batch specific effects

The model can be run as follows:

cas <- cellassign(exprs_obj = gene_expression_data,
                  marker_gene_info = marker_gene_info,
                  s = s,
                  X = X)

# An example set of markers for the human tumour microenvironment can be loaded by calling
data(example_TME_markers)

常用函数

使用cellassign包进行单细胞类型注释分析(一):Constructing marker genes from purified scRNA-seq data - 图3

数据准备

下面,我们通过差异表达分析获得marker标记基因。
我们从Holik等人的bulk RNA-seq数据(Nucleic Acids Research 2017)中获得了3种不同细胞系的marker标记基因,数据被打包到cellassign中的holik_data中:

# 加载cellassign包
library(cellassign)

# 加载holik_data数据
data(holik_data)
# which contains a matrix of counts, where each row is a gene (index by entrez ID) and each column is a sample:

# 查看holik_data数据
head(holik_data$counts[,1:2])
#>    RSCE_10_BC2CTUACXX_AGTTCC_L006_R1.bam RSCE_12_BC2CTUACXX_TGACCA_L005_R1.bam
#> 1                                     13                                    12
#> 2                                      0                                     0
#> 9                                    346                                   286
#> 10                                     0                                     1
#> 12                                    10                                    17
#> 13                                     1                                     0

# as well as a vector with the cell line of origin for each sample:
# 查看细胞系类型
head(holik_data$cell_line)
#> [1] "HCC827" "HCC827" "H2228"  "H2228"  "H838"   "H838"

We first provide a map from entrez IDs to gene symbols:

# 转换ENTREZ基因id类型为gene symbols
library(org.Hs.eg.db)
entrez_map <- select(org.Hs.eg.db, 
                     as.character(rownames(holik_data$counts)), 
                     c("SYMBOL"), "ENTREZID")
#> 'select()' returned 1:1 mapping between keys and columns
gene_annotations <- entrez_map %>%
  dplyr::rename(GeneID=ENTREZID,
                Symbol=SYMBOL)

Then construct the DGEList object for input to limma voom, filtering out lowly expressed genes:

library(limma)
# 构建DGEList对象
dge <- DGEList(counts = holik_data$counts, 
               group = holik_data$cell_line, 
               genes = gene_annotations, 
               remove.zeros = TRUE)
#> Removing 3620 rows with all zero counts
# 过滤低表达的基因
genes_to_keep <- rowSums(cpm(dge$counts) > 0.5) >= 2
dge_filt <- dge[genes_to_keep,]

and finally calculate the normalization factors:

# 计算归一化因子
dge_filt <- calcNormFactors(dge_filt, method="TMM")

进行差异表达分析

接下来,我们使用limma包的voom方法对三个不同的细胞样本(HCC827, H2228, H1975)进行差异分析:

# 提取数据子集
dge_subset <- dge_filt[,dge_filt$samples$group %in% c("HCC827", "H2228", "H1975")]
# 设计差异分析建模矩阵
design <- model.matrix(~ 0+dge_subset$samples$group)
colnames(design) <- levels(dge_subset$samples$group)
# 使用voom函数进行差异分析
v <- voom(dge_subset, design)
fit <- lmFit(v, design)

Next, fit constrasts to find differentially expressed genes between cell types:

# 构建比较矩阵
contrast.matrix <- makeContrasts(H2228 - H1975, 
                                 HCC827 - H1975, 
                                 HCC827 - H2228, 
                                 levels = design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)

Finally, compute gene summary statistics and filter to only significantly differentially expressed geens (FDR < 0.05):

# 提取显著的差异表达基因
tt <- topTable(fit2, n=Inf)
tt_sig <- tt %>%
  dplyr::filter(adj.P.Val < 0.05)

# 查看差异表达基因的信息
head(tt_sig)
#>   GeneID    Symbol H2228...H1975 HCC827...H1975 HCC827...H2228  AveExpr
#> 1   1956      EGFR    0.68777340       4.303436       3.615662 9.208711
#> 2  23480    SEC61G   -0.40115418       4.414580       4.815734 7.784072
#> 3  81552     VOPP1   -0.09640406       4.347392       4.443796 6.998559
#> 4   1362       CPD    2.12265278      -2.103571      -4.226224 8.704180
#> 5 729086 LOC729086   -0.04338772       4.384976       4.428363 6.555193
#> 6  55915    LANCL2   -0.57454448       4.021963       4.596508 6.568625
#>          F      P.Value    adj.P.Val
#> 1 2026.616 3.447052e-12 2.982899e-08
#> 2 2003.024 3.623981e-12 2.982899e-08
#> 3 1766.548 6.199752e-12 3.232724e-08
#> 4 1671.376 7.854996e-12 3.232724e-08
#> 5 1506.362 1.224645e-11 3.628466e-08
#> 6 1479.497 1.322488e-11 3.628466e-08

提取Marker标记基因

为了获得marker基因信息,我们首先使用H1975细胞系作为基准表达(baseline expression)水平来构建对数倍数变化矩阵(log fold change matrix):

lfc_table <- tt_sig[,c("H2228...H1975", "HCC827...H1975")]
lfc_table <- lfc_table %>%
  dplyr::mutate(H1975=0,
                H2228=H2228...H1975,
                HCC827=HCC827...H1975) %>%
  dplyr::select(H1975, H2228, HCC827)

rownames(lfc_table) <- tt_sig$GeneID

然后,对于每个基因,我们减去最小对数倍数变化,因为我们想知道那些相对于最小表达水平的基因过表达的基因,我们将这些基因定义为marker标记基因:

# 定义marker基因
lfc_table <- as.matrix(lfc_table)
lfc_table <- lfc_table - rowMins(lfc_table)
lfc_table <- as.data.frame(lfc_table)

现在,我们定义一个辅助函数,用于将对数倍数变化转换为二进制矩阵。这需要一个矩阵和一个阈值,将所有小于或等于阈值的值都设置为0,其他的值都设置为1:

binarize <- function(x, threshold) {
  x[x <= threshold] <- -Inf
  x[x > -Inf] <- 1
  x[x == -Inf] <- 0
  return(x)
}

接下来,我们使用定义好的辅助函数将对数倍数变化矩阵转换为二进制矩阵。实际上,我们寻找每个基因表达的最大“gap”,并指定那些表达水平高于该“gap”的基因作为该细胞类型的marker标记基因:

# Find the biggest difference
# 寻找最大差异
maxdiffs <- apply(lfc_table, 1, function(x) max(diff(sort(x))))
# 设置阈值
thres_vals <- apply(lfc_table, 1, function(x) sort(x)[which.max(diff(sort(x)))])

# 转换为二进制矩阵
expr_mat_thres <- plyr::rbind.fill(lapply(1:nrow(lfc_table), function(i) {
  binarize(lfc_table[i,], thres_vals[i])
}))

rownames(expr_mat_thres) <- rownames(lfc_table)
marker_gene_mat <- expr_mat_thres[(maxdiffs >= quantile(maxdiffs, c(.99))) 
                                  & (thres_vals <= log(2)),] %>% as.matrix

Finally, we add back in gene symbols rather than entrez ids:

suppressMessages({
  symbols <- plyr::mapvalues(
    rownames(marker_gene_mat),
    from = gene_annotations$GeneID,
    to = gene_annotations$Symbol
  )
})

is_na <- is.na(symbols)

marker_gene_mat <- marker_gene_mat[!is_na,]
rownames(marker_gene_mat) <- symbols[!is_na]

And there we have a marker gene matrix for our cell types(最终得到marker基因矩阵):

# 查看marker基因矩阵
head(marker_gene_mat)
#>          H1975 H2228 HCC827
#> NRK          0     0      1
#> MTAP         1     0      1
#> CP           0     1      0
#> HIST1H3F     1     0      1
#> HIST1H4E     1     0      1
#> ADIRF        1     1      0

# marker标记基因可视化
library(pheatmap)
pheatmap(t(marker_gene_mat))

使用cellassign包进行单细胞类型注释分析(一):Constructing marker genes from purified scRNA-seq data - 图4

参考来源:https://irrationone.github.io/cellassign/articles/constructing-markers-from-purified-data.html