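CHETAH (CHaracterization of cEll Types Aided by Hierarchical classification) is an R package for cell type identification in single-cell RNA sequencing (scRNA-seq) data.

CHETAH needs two things:

  • input data
  • a reference dataset with known cell types

CHETAH builds a classification tree by hierarchically clustering the reference dataset. Guided by this tree, each input cell is assigned, at every node, to either the left or the right branch. CHETAH computes a confidence score for each of these assignments; when the confidence score falls below the threshold (default = 0.1), classification of that cell stops at that node.

This results in two kinds of classification:

  • final types: cells that are classified to one of the leaf nodes of the classification tree (i.e. the cell types of the reference).
  • intermediate types: cells whose confidence score falls below the threshold at some node and that are therefore assigned to an intermediate node of the tree. This happens when a cell is about equally similar to the cell types in the left and the right branch of that node.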
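
The stopping rule described above can be made concrete with a small toy example. This is a sketch only, not CHETAH's implementation: the tree, the function classify_cell, and the per-node choices and confidence values are all hypothetical and exist just to show how a cell ends up as either a final or an intermediate type.

```r
# Toy illustration only: this is NOT CHETAH's implementation.
# Each internal node has a left and a right child; leaves are cell types.
tree <- list(
  Node1 = c(left = "Node2", right = "Fibroblast"),
  Node2 = c(left = "B cell", right = "T cell")
)

# Hypothetical helper: walk one cell down the tree.
# choices:    branch chosen at each node ("left" or "right")
# confidence: confidence score of that choice at each node
classify_cell <- function(choices, confidence, thresh = 0.1) {
  node <- "Node1"
  while (node %in% names(tree)) {
    # Confidence below the threshold: stop here (intermediate type)
    if (confidence[[node]] < thresh) return(node)
    node <- tree[[node]][[choices[[node]]]]
  }
  node  # a leaf was reached (final type)
}

cell_a <- classify_cell(choices = c(Node1 = "left", Node2 = "right"),
                        confidence = c(Node1 = 0.9, Node2 = 0.6))
cell_b <- classify_cell(choices = c(Node1 = "left", Node2 = "right"),
                        confidence = c(Node1 = 0.9, Node2 = 0.05))
print(c(cell_a = cell_a, cell_b = cell_b))
#>   cell_a   cell_b
#> "T cell"  "Node2"
```

Both toy cells take the same branches, but the second one has a confidence of 0.05 at Node2, which is below the threshold of 0.1, so it stays at Node2 as an intermediate type.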

CHETAH is part of Bioconductor as of release 3.9 and can be installed with:

## Install BiocManager if necessary
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("CHETAH")
# Load the package
library(CHETAH)

The development version can be installed from the development version of Bioconductor (in R v3.6):

## Install BiocManager if necessary
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("CHETAH", version = "devel")
# Load the package
library(CHETAH)

At a glance

To run CHETAH on an input count matrix input_counts with t-SNE coordinates in input_tsne, and a reference count matrix ref_counts with a cell type vector ref_ct, run:

# Load the package
library(CHETAH)
## Make SingleCellExperiments
reference <- SingleCellExperiment(assays = list(counts = ref_counts),
                                  colData = DataFrame(celltypes = ref_ct))
input <- SingleCellExperiment(assays = list(counts = input_counts),
                              reducedDims = SimpleList(TSNE = input_tsne))
## Run CHETAH
input <- CHETAHclassifier(input = input, ref_cells = reference)
## Plot the classification
PlotCHETAH(input)
## Extract celltypes:
celltypes <- input$celltype_CHETAH


Required data


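CHETAH requires the following data:

  • an scRNA-seq expression count matrix of the input cells: a data.frame or matrix, with cells in the columns and genes in the rows
  • a normalized scRNA-seq expression count matrix of the reference cells
  • the cell types of the reference cells
  • (optional) a 2D reduced-dimension representation of the input data for visualization, e.g. t-SNE or PCA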


## To prepare the data from the package's internal data, run:
# Cell types of the reference data
celltypes_hn <- headneck_ref$celltypes
# Expression matrix of the reference data
counts_hn <- assay(headneck_ref)
# Expression matrix of the input data
counts_melanoma <- assay(input_mel)
# Reduced dimensions (t-SNE) of the input data
tsne_melanoma <- reducedDim(input_mel)

## The input data: a Matrix
class(counts_melanoma)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
counts_melanoma[1:5, 1:5]
#> 5 x 5 sparse Matrix of class "dgCMatrix"
#>              mel_cell1 mel_cell2 mel_cell3 mel_cell4 mel_cell5
#> ELMO2                .         .         .    4.5633         .
#> PNMA1                .    4.3553         .         .         .
#> MMP2                 .         .         .         .         .
#> TMEM216              .         .         .         .    5.5624
#> TRAF3IP2-AS1    2.1299    4.0542    2.4209    1.6531    1.3144

## The reduced dimensions of the input cells: a 2-column matrix
tsne_melanoma[1:5, ]
#>               tSNE_1    tSNE_2
#> mel_cell1  4.5034553 13.596680
#> mel_cell2 -4.0025667 -7.075722
#> mel_cell3  0.4734054  9.277648
#> mel_cell4  3.2201815 11.445236
#> mel_cell5 -0.3354758  5.092415
all.equal(rownames(tsne_melanoma), colnames(counts_melanoma))
#> [1] TRUE

## The reference data: a matrix
class(counts_hn)
#> [1] "matrix" "array"
counts_hn[1:5, 1:5]
#>              hn_cell1 hn_cell2 hn_cell3 hn_cell4 hn_cell5
#> ELMO2         0.00000        0  0.00000  1.55430   4.2926
#> PNMA1         0.00000        0  0.00000  4.55360   0.0000
#> MMP2          0.00000        0  7.02880  4.50910   6.3006
#> TMEM216       0.00000        0  0.00000  0.00000   0.0000
#> TRAF3IP2-AS1  0.14796        0  0.65352  0.28924   3.6365

## The cell types of the reference: a named character vector
str(celltypes_hn)
#>  Named chr [1:180] "Fibroblast" "Fibroblast" "Fibroblast" "Fibroblast" ...
#>  - attr(*, "names")= chr [1:180] "hn_cell1" "hn_cell2" "hn_cell3" "hn_cell4" ...

## The names of the cell types correspond with the colnames of the reference counts:
all.equal(names(celltypes_hn), colnames(counts_hn))
#> [1] TRUE


A SingleCellExperiment holds three things:

  • counts: assays (as a list of matrices)
  • meta-data: colData (as a DataFrame)
  • reduced dimensions (e.g. t-SNE, PCA): reducedDims (as a SimpleList of 2-column data.frames or matrices)

CHETAH needs:

  • a reference SingleCellExperiment with:
      1. an assay
      2. a colData column with the corresponding cell types (default "celltypes")
  • an input SingleCellExperiment with:
      1. an assay
      2. a reducedDim (e.g. t-SNE)

For the example data, we would make the two objects by running:

## For the reference we define a "counts" assay and "celltypes" metadata
# Build the SingleCellExperiment object for the reference data
headneck_ref <- SingleCellExperiment(assays = list(counts = counts_hn),
                                     colData = DataFrame(celltypes = celltypes_hn))

class: SingleCellExperiment 
dim: 7943 180 
assays(1): counts
rownames(7943): ELMO2 PNMA1 ... SLC39A6 CTSC
rowData names(0):
colnames(180): hn_cell1 hn_cell2 ... hn_cell179 hn_cell180
colData names(1): celltypes

## For the input we define a "counts" assay and "TSNE" reduced dimensions
# Build the SingleCellExperiment object for the input data
input_mel <- SingleCellExperiment(assays = list(counts = counts_melanoma),
                                  reducedDims = SimpleList(TSNE = tsne_melanoma))

class: SingleCellExperiment 
dim: 7943 150 
assays(1): counts
rownames(7943): ELMO2 PNMA1 ... SLC39A6 CTSC
rowData names(0):
colnames(150): mel_cell1 mel_cell2 ... mel_cell149 mel_cell150
colData names(0):
reducedDimNames(1): TSNE


Now that the data is prepared, running CHETAH is easy:

# Run CHETAH with CHETAHclassifier to annotate the cell types of the input
input_mel <- CHETAHclassifier(input = input_mel,
                              ref_cells = headneck_ref)
#> Preparing data....
#> Running analysis...

class: SingleCellExperiment 
dim: 7943 150 
assays(1): counts
rownames(7943): ELMO2 PNMA1 ... SLC39A6 CTSC
rowData names(0):
colnames(150): mel_cell1 mel_cell2 ... mel_cell149 mel_cell150
colData names(1): celltype_CHETAH
reducedDimNames(1): TSNE

    mel_cell1     mel_cell2     mel_cell3     mel_cell4     mel_cell5 
 "CD8 T cell" "Endothelial"       "Node7"  "CD8 T cell" "reg. T cell" 

The output

CHETAH returns the input object with the following added:

  • input$celltype_CHETAH: a named character vector that can be used directly in any other workflow or method.
  • "hidden" int_colData and int_metadata, which are not meant for direct interaction but can be viewed and explored using PlotCHETAH and CHETAHshiny.

Standard plots

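The classification can be inspected with PlotCHETAH, which plots the classification tree together with the t-SNE (or any other supplied reduced-dimension) map. In these plots, either the final types or the intermediate types are colored, and the types that are not colored are shown in grayscale.

To plot the final types: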

PlotCHETAH(input = input_mel)

Conversely, to color the intermediate types:

PlotCHETAH(input = input_mel, interm = TRUE)

If you would like to use the classification, and thus the colors, in another package (e.g. Seurat), you can extract the colors using:

colors <- PlotCHETAH(input = input_mel, return_col = TRUE)

       Unassigned             Node1             Node2             Node3 
         "gray90"          "gray86"          "gray82"          "gray78" 
            Node4             Node5             Node6             Node7 
         "gray74"          "gray70"          "gray66"          "gray62" 
            Node8        Fibroblast            B cell        Macrophage 
         "gray58"            "blue"            "gold"           "cyan3" 
      Endothelial         Dendritic              Mast        CD4 T cell 
           "navy"     "forestgreen"          "orange" "darkolivegreen3" 
      reg. T cell        CD8 T cell 
          "brown"           "green"
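
Because the returned colors are a named character vector, they can be indexed by cell type to get one color per cell. The sketch below uses small stand-in vectors copied from the outputs shown in this document rather than a real CHETAH run; the variable cell_colors is a name introduced here for illustration.

```r
# Stand-ins for PlotCHETAH(..., return_col = TRUE) and input_mel$celltype_CHETAH,
# copied from the outputs shown in this document (subset only)
colors <- c(Unassigned = "gray90", Node7 = "gray62",
            "CD8 T cell" = "green", Endothelial = "navy",
            "reg. T cell" = "brown")
celltypes <- c(mel_cell1 = "CD8 T cell", mel_cell2 = "Endothelial",
               mel_cell3 = "Node7", mel_cell4 = "CD8 T cell",
               mel_cell5 = "reg. T cell")

# Index the named color vector by cell type: one color per cell
cell_colors <- unname(colors[celltypes])
names(cell_colors) <- names(celltypes)
print(cell_colors)
#> mel_cell1 mel_cell2 mel_cell3 mel_cell4 mel_cell5
#>   "green"    "navy"  "gray62"   "green"   "brown"
```

A per-cell vector like this can then be passed to any plotting function that accepts one color per point, for example the col argument of base R's plot().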


CHETAHshiny opens an interactive Shiny application in which you can view:

  • the confidence of all assignments
  • the classification in an interactive window
  • the genes used by CHETAH, and their expression in the input data
  • a lot more

CHETAHshiny(input = input_mel)

Changing classification

Confidence score


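CHETAH computes a confidence score for each assignment of an input cell to a branch. This score:

  • has a value between 0 and 2
  • normally lies between 0 and 1
  • is 0 when the assignment has the lowest possible confidence, while 1 indicates high confidence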



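The default confidence threshold is 0.1, and it can be changed:

  • With a confidence threshold of 0, all input cells are classified to a final type. Note that this classification can be noisy and may contain incorrect classifications.
  • Thresholds such as 0.2, 0.3 or 0.4 progressively reduce the number of cells classified to a final type, while the cells that remain have increasingly high confidence at every node of the classification tree.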
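
The effect of the threshold can be illustrated with made-up numbers. The sketch below is a simplification and not CHETAH's code: it assumes four hypothetical cells that all pass through the same two nodes, and treats a cell as reaching a final type only when its confidence clears the threshold at every node.

```r
# Toy confidence scores (rows: cells, columns: nodes on the cell's path).
# All values are made up for illustration.
conf <- rbind(cell1 = c(0.90, 0.85),
              cell2 = c(0.70, 0.15),
              cell3 = c(0.40, 0.05),
              cell4 = c(1.20, 0.30))

# The weakest node on the path decides whether a cell reaches a leaf
min_conf <- apply(conf, 1, min)

# Count the cells that reach a final type at several thresholds
thresholds <- c(0, 0.1, 0.2, 0.8)
n_final <- sapply(thresholds, function(th) sum(min_conf >= th))
names(n_final) <- thresholds
print(n_final)
#>   0 0.1 0.2 0.8
#>   4   3   2   1
```

With a threshold of 0 every toy cell is classified to a final type, and each higher threshold leaves fewer, more confidently classified cells.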

For example, to only classify cells with very high confidence:

input_mel <- Classify(input = input_mel, 0.8)
PlotCHETAH(input = input_mel, tree = FALSE)

Conversely, to classify all cells:

input_mel <- Classify(input_mel, 0)
PlotCHETAH(input = input_mel, tree = FALSE)

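Renaming types

Types can also be renamed with RenameBelowNode. For example, to rename all types below Node6 to "T cell":

input_mel <- RenameBelowNode(input_mel, whichnode = 6, replacement = "T cell")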
PlotCHETAH(input = input_mel, tree = FALSE)

To reset the classification to its default, just run Classify again:

input_mel <- Classify(input_mel) ## the same as Classify(input_mel, 0.1)
PlotCHETAH(input = input_mel, tree = FALSE)

Creating a reference

Step 0: Obtain a reference.

  • Download an scRNA-seq dataset with known cell type labels, or use your own.
  • Build a SingleCellExperiment object from it.

Step 1: good reference characteristics


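  • Better classification results are obtained when the reference and input datasets come from the same biological type, or at least consist of cells in the same cell state. For example, a bone marrow reference could be used for an input dataset of PBMCs, but because bone marrow cells are mostly naive or precursor cells, this may negatively affect the classification. In that case, another PBMC dataset would work best.
  • The quality of the reference annotation directly affects the final classification: the more accurate the cell type labels of the reference, the better the classification.
  • The sparser the reference data, the more reference cells are needed to build a reliable reference. For high-coverage Smart-Seq2 data, only 10-20 cells per cell type are needed, whereas for sparse 10X Genomics data, more than 100 cells per type usually gives the best results.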
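
Whether a reference has enough cells per type can be checked with a simple tabulation before running CHETAH. The sketch below uses a made-up label vector; ref_ct stands in for the cell type column of a reference, and min_cells is a hypothetical cutoff that should be chosen according to how sparse the data is.

```r
# Made-up cell type labels; in practice this would be the reference's
# cell type column (e.g. ref$celltypes)
ref_ct <- rep(c("B cell", "CD4 T cell", "Mast"), times = c(180, 112, 8))

# Number of reference cells per cell type
cells_per_type <- table(ref_ct)
print(cells_per_type)

# Hypothetical cutoff: flag types with fewer cells than this
min_cells <- 20
sparse_types <- names(cells_per_type)[cells_per_type < min_cells]
print(sparse_types)
#> [1] "Mast"
```

Types flagged this way are candidates for collecting more cells, merging with a closely related type, or leaving out of the reference.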


To subsample the reference to at most 200 cells per cell type:

cell_selection <- unlist(lapply(unique(ref$celltypes), function(type) {
    type_cells <- which(ref$celltypes == type)
    if (length(type_cells) > 200) {
        # Randomly keep 200 cells of this type
        type_cells[sample(length(type_cells), 200)]
    } else type_cells
}))

ref_new <- ref[, cell_selection]

Step 1: normalization


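The reference has to be normalized. For example, to scale every cell to a total of 100,000 and log2-transform the result:

assay(headneck_ref, "counts") <- apply(assay(headneck_ref, "counts"), 2,
                                       function(column) log2((column/sum(column) * 100000) + 1))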

             hn_cell1 hn_cell2 hn_cell3 hn_cell4 hn_cell5
ELMO2        0.000000        0 0.000000 4.249018 5.939126
PNMA1        0.000000        0 0.000000 5.748898 0.000000
MMP2         0.000000        0 6.716108 5.734995 6.485250
TMEM216      0.000000        0 0.000000 0.000000 0.000000
TRAF3IP2-AS1 2.206329        0 3.417146 2.121777 5.704061
ERCC5        7.551864        0 6.326401 0.000000 0.000000
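
To see what this normalization does, it helps to run it on a tiny matrix. The sketch below uses made-up counts and also shows an equivalent vectorized form with sweep(); the names counts, norm_apply and norm_sweep are introduced here for illustration only.

```r
# Toy count matrix (genes in rows, cells in columns); values are made up
counts <- matrix(c(1, 1, 2,
                   0, 5, 5),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("cell1", "cell2")))

# Column-wise version, same formula as in the snippet above
norm_apply <- apply(counts, 2,
                    function(column) log2((column / sum(column) * 100000) + 1))

# Equivalent vectorized version: divide every column by its total
norm_sweep <- log2(sweep(counts, 2, colSums(counts), "/") * 100000 + 1)

# Both approaches give the same matrix
stopifnot(isTRUE(all.equal(norm_apply, norm_sweep)))

# Undoing the log shows that every cell now sums to 100,000
print(colSums(2^norm_apply - 1))
```

Note that a cell with a total count of zero would produce NaN values (0/0), so empty cells are best removed before normalizing.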

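Step 2: discarding of house-keeping genes

House-keeping genes, such as ribosomal genes, can be removed from the reference. For example, using a file that lists ribosomal gene names in its first column:

ribo <- read.table("~/ribosomal.txt", header = FALSE, sep = '\t')
headneck_ref <- headneck_ref[!rownames(headneck_ref) %in% ribo[,1], ]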
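
When no curated gene list is at hand, genes are sometimes filtered by name with the pattern "^RP". The toy example below, on a handful of real gene symbols, sketches why such a filter is fast but imperfect; the pattern "^RP[SL]" is an assumption added here for comparison, not something CHETAH prescribes.

```r
# A few gene symbols: two ribosomal protein genes, two look-alikes, two others
genes <- c("RPL13", "RPS6", "RPA1", "RPS6KA1", "MMP2", "ELMO2")

# Quick name-based filter: anything starting with "RP"
quick <- grepl("^RP", genes)
print(genes[quick])
#> [1] "RPL13"   "RPS6"    "RPA1"    "RPS6KA1"

# Slightly narrower pattern (an assumption for this sketch): RPS... / RPL...
narrow <- grepl("^RP[SL]", genes)
print(genes[narrow])
#> [1] "RPL13"   "RPS6"    "RPS6KA1"

# Applying the quick filter to the rows of a toy expression matrix
mat <- matrix(0, nrow = length(genes), ncol = 2,
              dimnames = list(genes, c("cell1", "cell2")))
filtered <- mat[!quick, , drop = FALSE]
print(rownames(filtered))
#> [1] "MMP2"  "ELMO2"
```

RPA1 (replication protein A1) and RPS6KA1 (a ribosomal protein S6 kinase) are not ribosomal protein genes, yet the first pattern removes both and the second still removes RPS6KA1. A curated list, as in the snippet above, avoids these false hits.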

Step 3: Reference QC


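The performance of CHETAH depends on the quality of the reference, which is determined by:

  • the quality of the scRNA-seq data
  • the accuracy of the cell type labels

To inspect how the reference cell types relate to each other, CorrelateReference correlates them with one another:

CorrelateReference(ref_cells = headneck_ref)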
#> Running... in case of 1000s of cells, this may take a couple of minutes



ClassifyReference uses the reference to classify the reference cells themselves:

ClassifyReference(ref_cells = headneck_ref)
#> Preparing data....
#> Running analysis...

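In the plot produced by ClassifyReference, each row is an original cell type label and each column is a label assigned by the CHETAH classification. The color and size of each square indicate which fraction of the cells of the row type is classified as the column type. For example, row 4, column 2 shows that 5-10% of the CD4 T cells are classified as regulatory T cells.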
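
The quantity behind such a plot is a row-normalized confusion table, which can be sketched in base R. The labels below are made up and are not output of ClassifyReference; they are chosen so that 10% of the toy CD4 T cells end up labeled as regulatory T cells.

```r
# Made-up labels: 20 CD4 T cells and 10 regulatory T cells
original <- c(rep("CD4 T cell", 20), rep("reg. T cell", 10))
# Labels after a hypothetical classification:
# 2 of the 20 CD4 T cells are called reg. T cell,
# 1 of the 10 reg. T cells stays at an intermediate node
assigned <- c(rep("CD4 T cell", 18), rep("reg. T cell", 2),
              rep("reg. T cell", 9), "Node7")

# Rows: original label, columns: assigned label
confusion <- table(original = original, assigned = assigned)

# Fraction of each row type that ends up in each column type
fractions <- prop.table(confusion, margin = 1)
print(round(fractions, 2))
```

Each row of fractions sums to 1, and the entry in row "CD4 T cell", column "reg. T cell" is 2/20 = 0.1, the kind of off-diagonal value the plot highlights.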


Optimizing the classification

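CHETAH is optimized to give good results in most analyses, but imperfect classifications can still occur. If CHETAH does not give the desired output (too few cells classified, a visually random classification, etc.), the following steps can be taken, in this order:

  • Check that the reference was created correctly (see above; this is the most important step!).
  • If too few cells are classified, lower the confidence threshold to 0.05 or 0.01 (beware of false positives! Always check that the results make sense).
  • Remove ribosomal genes from the input. Using input[!(grepl("^RP", rownames(input))), ] is an imperfect, but very quick, way to do so.
  • Try classifying with a different number of genes (the n_genes parameter). The default is 200, but sometimes 100 or (in sparse data) 500 gives better results.
  • Look for another, better reference dataset.
  • Try other available methods for cell type annotation of scRNA-seq data.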
