作者:Zuguang Gu8
编译:Steven Shen
原文:2.3 Clustering


Clustering might be the key component of the heatmap visualization. In ComplexHeatmap package, hierarchical clustering is supported with great flexibility. You can specify the clustering either by:

  • a pre-defined distance method (e.g. "euclidean" or "pearson"),
  • a distance function,
  • a object that already contains clustering (a hclust or dendrogram object or object that can be coerced to dendrogram class),
  • a clustering function.

聚类可能是热图可视化的关键组成部分。 在 ComplexHeatmap 包中,支持分层聚类,具有很大的灵活性。 您可以通过以下方式指定聚类:

  • 预定距离法(例如“欧几里德”或“皮尔逊”),
  • 距离函数,
  • 已经包含聚类的对象(一个 hclustdendrogram 对象,或者可以强制为 dendrogram 类的对象),
  • 一个聚类函数。

It is also possible to render the dendrograms with different colors and styles for different branches for better revealing structures of the dendrogram (e.g. by dendextend::color_branches()).
也可以为不同的分支渲染具有不同颜色和样式的树形图,以更好地揭示树形图的结构(例如通过 dendextend::color_branches() )。

First, there are general settings for the clustering, e.g. whether apply clustering or show dendrograms, the side of the dendrograms and heights of the dendrograms.
首先,有一般的聚类设置,例如,是否应用聚类或显示树状图,哪一侧绘制树形图和树形图的高度。

  1. Heatmap(mat, name = "mat", cluster_rows = FALSE) # turn off row clustering

2.3 聚类 - 图1

  1. Heatmap(mat, name = "mat", show_column_dend = FALSE) # hide column dendrogram

2.3 聚类 - 图2

  1. Heatmap(mat, name = "mat", row_dend_side = "right", column_dend_side = "bottom")

2.3 聚类 - 图3

  1. Heatmap(mat, name = "mat", column_dend_height = unit(4, "cm"),
  2. row_dend_width = unit(4, "cm"))

2.3 聚类 - 图4

2.3.1 距离方法

Hierarchical clustering is performed in two steps: calculate the distance matrix and apply clustering. There are three ways to specify distance metric for clustering:

  • specify distance as a pre-defined option. The valid values are the supported methods in dist() function and in "pearson", "spearman" and "kendall". The correlation distance is defined as 1 - cor(x, y, method). All these built-in distance methods allow NA values.
  • a self-defined function which calculates distance from a matrix. The function should only contain one argument. Please note for clustering on columns, the matrix will be transposed automatically.
  • a self-defined function which calculates distance from two vectors. The function should only contain two arguments. Note this might be slow because it is implemented by two nested for loop.

分层聚类分两步执行:计算距离矩阵并应用聚类。有三种方法可以指定聚类的距离尺度:

  • 将距离指定为预定义选项。有效值是在 dist() 函数和 "pearson""spearman""kendall" 中支持的方法。相关距离定义为 1 - cor(x, y, method) 。所有这些内置距离方法都允许 NA 值。
  • 自定义函数,用于计算距矩阵的距离。该函数应该只包含一个参数。请注意,对于列上的聚类,矩阵将自动转置。
  • 一个自定义函数,用于计算两个向量的距离。该函数应该只包含两个参数。请注意,这可能很慢,因为它是由两个嵌套的 for 循环实现的。
    1. Heatmap(mat, name = "mat", clustering_distance_rows = "pearson",
    2. column_title = "pre-defined distance method (1 - pearson)")
    2.3 聚类 - 图5
  1. Heatmap(mat, name = "mat", clustering_distance_rows = function(m) dist(m),
  2. column_title = "a function that calculates distance matrix")

2.3 聚类 - 图6

  1. Heatmap(mat, name = "mat", clustering_distance_rows = function(x, y) 1 - cor(x, y),
  2. column_title = "a function that calculates pairwise distance")

2.3 聚类 - 图7

Based on these features, we can apply clustering which is robust to outliers based on the pairwise distance. Note here we set the color mapping function because we don’t want outliers affect the colors.
基于这些特征,在成对距离的基础上我们可以应用对异常值具有稳健性(robust)的聚类。请注意,我们设置颜色映射功能,因为我们不希望异常值影响颜色。

  1. mat_with_outliers = mat
  2. for(i in 1:10) mat_with_outliers[i, i] = 1000
  3. robust_dist = function(x, y) {
  4. qx = quantile(x, c(0.1, 0.9))
  5. qy = quantile(y, c(0.1, 0.9))
  6. l = x > qx[1] & x < qx[2] & y > qy[1] & y < qy[2]
  7. x = x[l]
  8. y = y[l]
  9. sqrt(sum((x - y)^2))
  10. }

We can compare the two heatmaps with or without the robust distance method:
我们可以比较一下使用或不使用 robust distance 方法的两种热图:

  1. Heatmap(mat_with_outliers, name = "mat",
  2. col = colorRamp2(c(-2, 0, 2), c("green", "white", "red")),
  3. column_title = "dist")
  4. Heatmap(mat_with_outliers, name = "mat",
  5. col = colorRamp2(c(-2, 0, 2), c("green", "white", "red")),
  6. clustering_distance_rows = robust_dist,
  7. clustering_distance_columns = robust_dist,
  8. column_title = "robust_dist")

2.3 聚类 - 图8

If there are proper distance methods (like methods in stringdist package), you can also cluster a character matrix. cell_fun argument will be introduced in Section 2.9.
如果有适当的距离方法(如 stringdist 包中的方法),您还可以聚类字符矩阵。 cell_fun 参数将在 2.9 节中介绍。

  1. mat_letters = matrix(sample(letters[1:4], 100, replace = TRUE), 10)
  2. # distance in the ASCII table
  3. dist_letters = function(x, y) {
  4. x = strtoi(charToRaw(paste(x, collapse = "")), base = 16)
  5. y = strtoi(charToRaw(paste(y, collapse = "")), base = 16)
  6. sqrt(sum((x - y)^2))
  7. }
  8. Heatmap(mat_letters, name = "letters", col = structure(2:5, names = letters[1:4]),
  9. clustering_distance_rows = dist_letters, clustering_distance_columns = dist_letters,
  10. cell_fun = function(j, i, x, y, w, h, col) { # add text to each grid
  11. grid.text(mat_letters[i, j], x, y)
  12. })

2.3 聚类 - 图9

2.3.2 聚类方法

Method to perform hierarchical clustering can be specified by clustering_method_rows and clustering_method_columns. Possible methods are those supported in hclust() function.
可以通过 clustering_method_rowsclustering_method_columns 指定执行分层聚类的方法。可能的方法是 hclust() 函数支持的方法。

  1. Heatmap(mat, name = "mat", clustering_method_rows = "single")

2.3 聚类 - 图10

If you already have a clustering object or a function which directly returns a clustering object, you can ignore the distance settings and set cluster_rows or cluster_columns to the clustering objects or clustering functions. If it is a clustering function, the only argument should be the matrix and it should return a hclust or dendrogram object or a object that can be coerced to the dendrogram class.
如果已有聚类对象或已经有一个可以直接返回聚类对象的函数,则可以忽略距离设置并将 cluster_rowscluster_columns 设置为聚类对象或聚类函数。如果它是一个聚类函数,矩阵应该是它唯一的参数,它应该返回一个 hclust 或树形图对象或一个可以强制转换为树形图类的对象。

In following example, we perform clustering with methods from cluster package either by a pre-calculated clustering object or a clustering function:
在下面的示例中,我们通过使用 cluster 包中预先计算的聚类对象,或聚类函数,执行聚类:

  1. library(cluster)
  2. Heatmap(mat, name = "mat", cluster_rows = diana(mat),
  3. cluster_columns = agnes(t(mat)), column_title = "clustering objects")

2.3 聚类 - 图11

  1. # if cluster_columns is set as a function, you don't need to transpose the matrix
  2. Heatmap(mat, name = "mat", cluster_rows = diana,
  3. cluster_columns = agnes, column_title = "clustering functions")

2.3 聚类 - 图12

The last command is as same as :
最后一个命令与下面是一样的:

  1. # code only for demonstration
  2. Heatmap(mat, name = "mat", cluster_rows = function(m) as.dendrogram(diana(m)),
  3. cluster_columns = function(m) as.dendrogram(agnes(m)),
  4. column_title = "clutering functions")

Please note, when cluster_rows is set as a function, the argument m is the input mat itself, while for cluster_columns, m is the transpose of mat.
请注意,当 cluster_rows 设置为函数时,参数 m 是输入 mat 本身,而对于 cluster_columnsmmat 的转置。

fastcluster::hclust implements a faster version of hclust(). You can set it to cluster_rows and cluster_columns to use the faster version of hclust().
fastcluster::hclust 实现了更快版本的 hclust() 。您可以将其设置为 cluster_rowscluster_columns 以使用更快版本的 hclust()

  1. # code only for demonstration
  2. fh = function(x) fastcluster::hclust(dist(x))
  3. Heatmap(mat, name = "mat", cluster_rows = fh, cluster_columns = fh)

To make it more convinient to use the faster version of hclust() (assuming you have many heatmaps to construct), it can be set as a global option. The usage of ht_opt is introduced in Section 4.13.
为了更方便地使用更快版本的 hclust() (假设您要构建许多热图),可以将其设置为全局选项。第 4.13 节介绍了 ht_opt 的用法。

  1. # code only for demonstration
  2. ht_opt$fast_hclust = TRUE
  3. # now fastcluster::hclust is used in all heatmaps

This is one specific scenario that you might already have a subgroup classification for the matrix rows or columns, and you only want to perform clustering for the features in the same subgroup. There is one way that you can split the heatmap by the subgroup variable (see Section 2.7), or you can use cluster_within_group() clustering function to generate a special dendrogram.
这是一个特定的场景,您可能已经有矩阵的行或列组成的子组分类,并且您只想对同一子组中的要素执行聚类。有一种方法可以通过子组变量拆分热图(参见第 2.7 节),或者可以使用 cluster_within_group() 聚类函数生成特殊的树形图。

  1. group = kmeans(t(mat), centers = 3)$cluster
  2. Heatmap(mat, name = "mat", cluster_columns = cluster_within_group(mat, group))

2.3 聚类 - 图13
In above example, columns in a same group are still clustered, but the dendrogram is degenerated as a flat line. The dendrogram on columns shows the hierarchy of the groups.
在上面的示例中,同一组中的列仍然是聚类的,但树形图退化为扁平线。列上的树形图显示了组的层次结构。

2.3.3 渲染树状图

If you want to render the dendrogram, normally you need to generate a dendrogram object and render it in the first place, then send it to the cluster_rows or cluster_columns argument.
如果要渲染树形图,通常需要生成一个 dendrogram 对象并在第一个位置渲染它,然后将其传递给 cluster_rowscluster_columns 参数。

You can render your dendrogram object by the dendextend package to make a more customized visualization of the dendrogram. Note ComplexHeatmap only allows rendering on the dendrogram lines.
您可以通过 dendextend 包渲染树形图对象,以更加自定义树形图的可视化。注意 ComplexHeatmap 仅允许在树形图线上进行渲染。

  1. library(dendextend)
  2. row_dend = as.dendrogram(hclust(dist(mat)))
  3. row_dend = color_branches(row_dend, k = 2) # `color_branches()` returns a dendrogram object
  4. Heatmap(mat, name = "mat", cluster_rows = row_dend)

2.3 聚类 - 图14

row_dend_gp and column_dend_gp control the global graphic setting for dendrograms. Note e.g. graphic settings in row_dend will be overwritten by row_dend_gp.
row_dend_gpcolumn_dend_gp 控制树形图的全局图形设置。注意,例如 row_dend 中的图形设置将被 row_dend_gp 覆盖。

  1. Heatmap(mat, name = "mat", cluster_rows = row_dend, row_dend_gp = gpar(col = "red"))

2.3 聚类 - 图15

2.3.4 重排序树状图

In the Heatmap() function, dendrograms are reordered to make features with larger difference more separated from each others (please refer to the documentation of reorder.dendrogram()). Here the difference (or it is called the weight) is measured by the row means if it is a row dendrogram or by the column means if it is a column dendrogram. row_dend_reorder and column_dend_reordercontrol whether to apply dendrogram reordering if the value is set as logical. The two arguments also control the weight for the reordering if they are set to numeric vectors (it will be sent to the wts argument of reorder.dendrogram()). The reordering can be turned off by setting e.g. row_dend_reorder = FALSE.
Heatmap() 函数中,重新排序树形图以使具有较大差异的特征彼此更加分离(请参阅 reorder.dendrogram() 的文档)。这里的差异(或称为权重)由行表示,如果它是行树形图;或者由列表示,如果它是列树状图。 row_dend_reordercolumn_dend_reordercontrol 用于控制是否对树形图重新排序,如果将它们的值设置为逻辑值。如果将它们设置为数字向量,则这两个参数还控制重新排序的权重(它将被传递给 reorder.dendrogram()wts 参数)。可以通过设置,如 row_dend_reorder = FALSE ,来关闭重新排序。

By default, dendrogram reordering is turned on if cluster_rows/cluster_columns is set as logical value or a clustering function. It is turned off if cluster_rows/cluster_columns is set as clustering object.
默认情况下,如果将 cluster_rows / cluster_columns 设置为逻辑值或聚类函数,则会开启树形图重新排序。如果将 cluster_rows / cluster_columns 设置为聚类对象,则会将其关闭。

Compare following two heatmaps:
比较以下两个热图:

  1. m2 = matrix(1:100, nr = 10, byrow = TRUE)
  2. Heatmap(m2, name = "mat", row_dend_reorder = FALSE, column_title = "no reordering")
  3. Heatmap(m2, name = "mat", row_dend_reorder = TRUE, column_title = "apply reordering")

2.3 聚类 - 图16

There are many other methods for reordering dendrograms, e.g. the dendsort package. Basically, all these methods still return a dendrogram that has been reordered, thus, we can firstly generate the row or column dendrogram based on the data matrix, reorder it by some method, and assign it back to cluster_rows or cluster_columns.
还有许多其他重新排序树形图的方法,例如, dendsort 包。基本上,所有这些方法仍然返回已重新排序的树形图,因此,我们可以首先根据数据矩阵生成行或列树形图,然后通过某种方法对其重新排序,并将其分配回 cluster_rowscluster_columns

Compare following two reorderings. Can you tell which is better?
比较以下两个重新排序。 你能说出哪个更好吗?

  1. Heatmap(mat, name = "mat", column_title = "default reordering")
  2. library(dendsort)
  3. dend = dendsort(hclust(dist(mat)))
  4. Heatmap(mat, name = "mat", cluster_rows = dend, column_title = "reorder by dendsort")

2.3 聚类 - 图17

—— 本节完 ——