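Introduction

clustlasso is an R package that implements both the standard lasso and the cluster-lasso strategy; the method is described in the paper "Interpreting k-mer based signatures for antibiotic resistance prediction". More posts are available at [https://zouhua.top/](https://zouhua.top/).

The standard cross-validated lasso workflow for classification or regression is:

  1. Choose the cross-validation dataset (data split);
  2. Select the best model (parameter tuning);
  3. Evaluate the model on the test set (fix the final model);

Reading the source code shows that, compared with standard lasso classification or regression, clustlasso adds a clustering step, which groups the predictors according to the size of their pairwise correlation coefficients.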
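To make the idea of grouping predictors by correlation concrete, here is a minimal base R sketch on simulated data. It is not the package's code: the toy matrix `X.toy`, the threshold `toy.thresh`, and the choice of complete-linkage hierarchical clustering are all assumptions made for illustration.

```r
# Toy data (hypothetical): f1-f3 are near-copies, f4-f5 are near-copies, f6 is independent.
set.seed(1)
n <- 50
base1 <- rnorm(n)
base2 <- rnorm(n)
X.toy <- cbind(
  f1 = base1,
  f2 = base1 + rnorm(n, sd = 0.01),
  f3 = base1 + rnorm(n, sd = 0.01),
  f4 = base2,
  f5 = base2 + rnorm(n, sd = 0.01),
  f6 = rnorm(n)
)

# Distance = 1 - |correlation|, so strongly correlated features are close together.
d <- as.dist(1 - abs(cor(X.toy)))

# Complete linkage cut at 1 - threshold: every pair inside a cluster
# has absolute correlation of at least toy.thresh.
toy.thresh <- 0.95
hc <- hclust(d, method = "complete")
clusters <- cutree(hc, h = 1 - toy.thresh)

# List the features that ended up in each cluster.
print(split(colnames(X.toy), clusters))
```

With complete linkage the threshold has a direct reading (a lower bound on within-cluster correlation), which is why it is used in this sketch; other linkages would group features differently.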

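[Figure 1 of "Data analysis: building a classifier with clustlasso, feature selection based on cluster analysis"]

Loading the R package and the data

Download the package's tar.gz file from GitLab, then install it locally (this works on both Windows and Linux).
[Figure 2]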

    install.packages("NMF")
    install.packages("D:/Downloads/clustlasso-master.tar.gz", repos = NULL, type = "source")
    suppressWarnings(suppressMessages(library(clustlasso)))

[Figure 3]

Load the required data

    # specify / set random seed
    seed = 42
    set.seed(seed)
    # load example dataset
    input.file = system.file("data", "NG-dataset.Rdata", package = "clustlasso")
    load(input.file)

Then split the data into 80% for training and 20% for testing

    # pick 20% for test
    test.frac = 0.2
    # stratify by origin / population structure
    ind.by.struct = split(seq(nrow(meta)), meta$pop_structure)
    ind.sample = sapply(ind.by.struct, function(x) {
        sample(x, round(test.frac * length(x)))
    })
    ind.test = unlist(ind.sample)
    # split
    X.test = X[ind.test, ]
    y.test = y[ind.test]
    meta.test = meta[ind.test, ]
    X.train = X[-ind.test, ]
    y.train = y[-ind.test]
    meta.train = meta[-ind.test, ]
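The stratified split can be checked on toy data without the package. The sketch below uses a hypothetical metadata frame `meta.toy` with three made-up groups, and a slightly more defensive variant of the sampling step; it only illustrates the technique and is not taken from clustlasso.

```r
set.seed(42)
# Hypothetical metadata: three population groups of unequal size.
meta.toy <- data.frame(pop_structure = rep(c("A", "B", "C"), times = c(50, 30, 20)))
test.frac <- 0.2

# Sample test indices within each group. lapply always returns a list, and
# indexing x by sample.int avoids sample() treating a single number x as 1:x.
ind.by.struct <- split(seq_len(nrow(meta.toy)), meta.toy$pop_structure)
ind.test <- unlist(lapply(ind.by.struct, function(x) {
  x[sample.int(length(x), round(test.frac * length(x)))]
}), use.names = FALSE)

# Each group should contribute about test.frac of its samples.
print(table(meta.toy$pop_structure[ind.test]))
```

One more thing worth guarding against in real data: if `ind.test` ends up empty, negative indexing such as `X[-ind.test, ]` selects no rows at all rather than all of them.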

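Standard cross-validated lasso

This procedure does not use the clustering step.

  1. Choose the cross-validation dataset (data split);
  2. Select the best model (parameter tuning);
  3. Evaluate the model on the test set (fix the final model);

Cross-validation process

Cross-validation is used to tune the model; the parameter being tuned is the lasso's lambda. The n.folds and n.repeat arguments can also be set.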

    # specify cross-validation parameters
    n.folds = 10
    n.lambda = 100
    n.repeat = 3
    # run cross-validation process
    cv.res.lasso = lasso_cv(X.train, y.train, subgroup = meta.train$pop_structure,
        n.lambda = n.lambda, n.folds = n.folds, n.repeat = n.repeat,
        seed = seed, verbose = FALSE)
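To give a feel for what a setting like 10 folds repeated 3 times means, here is a small base R sketch of repeated, class-stratified fold assignment. How the package actually builds its folds (and how it uses `subgroup`) is not shown in this post, so the helper `make_folds` and the labels `y.toy` are hypothetical.

```r
set.seed(42)
y.toy <- factor(rep(c("R", "S"), times = c(30, 70)))  # hypothetical class labels
n.folds <- 10
n.repeat <- 3

# Assign fold ids separately within each class so that every fold keeps
# roughly the same class balance as the full data set.
make_folds <- function(y, k) {
  folds <- integer(length(y))
  for (idx in split(seq_along(y), y)) {
    folds[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  folds
}

# One column of fold ids per repeat.
fold.ids <- replicate(n.repeat, make_folds(y.toy, n.folds))

print(dim(fold.ids))
print(table(fold.ids[, 1], y.toy))
```

Each repeat reshuffles the fold assignment, so performance for a given lambda can be averaged over n.folds * n.repeat held-out evaluations instead of n.folds.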

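show_cv_overall displays the cross-validation results, with the best model determined by the modsel.criterion and best.eps arguments. The output shows both the model selection criterion and the best features.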

    par(mfcol = c(1, 3))
    show_cv_overall(cv.res.lasso, modsel.criterion = "balanced.accuracy.best", best.eps = 1)

[Figure 4]

Selecting the best model

The best model is chosen according to the modsel.criterion argument, which can be based on auc or balanced.accuracy.best.

    layout(matrix(c(1, 2, 3), nrow = 1, byrow = TRUE), widths = c(0.3, 0.3, 0.4), heights = c(1))
    perf.best.lasso = show_cv_best(cv.res.lasso, modsel.criterion = "balanced.accuracy.best", best.eps = 1, method = "lasso")

[Figure 5]

    # print cross-validation performance of best model
    print(perf.best.lasso)

[Figure 6]

Extract the best model with extract_best_model.

    best.model.lasso = extract_best_model(cv.res.lasso, modsel.criterion = "balanced.accuracy.best", best.eps = 1)

Making predictions and measuring performance

The best model selected in the previous step is applied to the test set to assess its performance.

    # make predictions
    preds.lasso = predict_clustlasso(X.test, best.model.lasso)
    # compute performance
    perf.lasso = compute_perf(preds.lasso$preds, preds.lasso$probs, y.test)
    # print
    print(t(perf.lasso$perf))

[Figure 7]

Visualizing the results

    par(mfcol = c(1, 2))
    plot(perf.lasso$roc.curves[[1]], lwd = 2, main = "lasso - test set ROC curve")
    grid()
    plot(perf.lasso$pr.curves[[1]], lwd = 2, main = "lasso - test set precision / recall curve")
    grid()

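[Figure 8]

Summary: after tuning, the criterion used to pick the best parameters and fix the final model matters a great deal when building a classifier. balanced.accuracy.best is used here rather than auc (you can try auc and compare the results).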
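For readers who want to see what the two criteria measure, the following base R sketch computes balanced accuracy and AUC by hand on made-up labels and probabilities. The vectors `y.true` and `probs` and the 0.5 cut-off are assumptions for illustration; this is not how compute_perf is implemented, and it reports fractions rather than percentages.

```r
# Hypothetical labels (1 = positive class) and predicted probabilities.
y.true <- c(1, 1, 1, 0, 0, 0, 0, 0)
probs  <- c(0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.6, 0.05)
preds  <- as.integer(probs >= 0.5)

# Balanced accuracy: mean of sensitivity and specificity at the chosen cut-off.
sensi <- sum(preds == 1 & y.true == 1) / sum(y.true == 1)
speci <- sum(preds == 0 & y.true == 0) / sum(y.true == 0)
bal.acc <- (sensi + speci) / 2

# AUC via the rank-sum identity: the fraction of (positive, negative) pairs
# in which the positive sample receives the higher score.
r <- rank(probs)
n.pos <- sum(y.true == 1)
n.neg <- sum(y.true == 0)
auc <- (sum(r[y.true == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)

print(c(sensitivity = sensi, specificity = speci, balanced.accuracy = bal.acc, auc = auc))
```

Balanced accuracy depends on the decision threshold, whereas AUC summarizes the ranking over all thresholds, so the two criteria can favor different values of lambda.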

Cluster-lasso

Similar to the standard lasso workflow above, but with an added clustering step.

Cross-validation process

This procedure takes two additional arguments, screen.thresh and clust.thresh, which are used by the clustering step.
[Figure 9]

    # specify cross-validation parameters
    n.folds = 10
    n.lambda = 100
    n.repeat = 3
    # specify screening and clustering thresholds
    screen.thresh = 0.95
    clust.thresh = 0.95
    # run cross-validation process
    cv.res.cluster = clusterlasso_cv(X.train, y.train, subgroup = meta.train$pop_structure,
        n.lambda = n.lambda, n.folds = n.folds, n.repeat = n.repeat,
        seed = seed, screen.thresh = screen.thresh, clust.thresh = clust.thresh,
        verbose = FALSE)
    par(mfcol = c(1, 3))
    show_cv_overall(cv.res.cluster, modsel.criterion = "balanced.accuracy.best",
        best.eps = 1)

[Figure 10]
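The post does not spell out what a screening threshold does. One plausible reading, offered here as an assumption rather than a description of the package internals, is that features strongly correlated with the ones a first lasso selected are added back before clustering. A minimal base R sketch of that idea, with a hypothetical active set `active` and threshold `toy.screen.thresh`:

```r
set.seed(1)
n <- 50
base1 <- rnorm(n)
# Hypothetical feature matrix: f2 is a near-copy of f1, f3 and f4 are unrelated.
X.toy <- cbind(
  f1 = base1,
  f2 = base1 + rnorm(n, sd = 0.01),
  f3 = rnorm(n),
  f4 = rnorm(n)
)

# Assumption: a first lasso fit kept only f1.
active <- "f1"
toy.screen.thresh <- 0.95

# Absolute correlation of every feature with each active feature.
cors <- abs(cor(X.toy, X.toy[, active, drop = FALSE]))

# Extended set: features correlated above the threshold with at least one active feature.
extended <- rownames(cors)[apply(cors, 1, max) >= toy.screen.thresh]
print(extended)
```

Under this reading, a lasso that arbitrarily kept one of several near-identical predictors would still end up with the whole correlated group available for clustering.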

Selecting the best model

    layout(matrix(c(1, 2, 3, 4, 4, 4), nrow = 2, byrow = TRUE), widths = c(0.3, 0.3, 0.4), heights = c(0.6, 0.4))
    perf.best.cluster = show_cv_best(cv.res.cluster, modsel.criterion = "balanced.accuracy.best",
        best.eps = 1, method = "clusterlasso")

[Figure 11]

    # print cross-validation performance of best model
    print(perf.best.cluster)

[Figure 12]

    best.model.cluster = extract_best_model(cv.res.cluster, modsel.criterion = "balanced.accuracy.best",
        best.eps = 1, method = "clusterlasso")

Making predictions and measuring performance

    # make predictions
    preds.cluster = predict_clustlasso(X.test, best.model.cluster,
        method = "clusterlasso")
    # compute performance
    perf.cluster = compute_perf(preds.cluster$preds, preds.cluster$probs, y.test)
    # print
    print(t(perf.cluster$perf))

[Figure 13]

Comparing the results of the two methods

Compare the standard lasso and the cluster-lasso in terms of predictive performance on the test set and of the features they select.

ROC curves

    plot(perf.lasso$roc.curves[[1]], lwd = 2, main = "test set ROC curves")
    points(1 - (perf.lasso$perf$speci)/100, perf.lasso$perf$sensi/100, pch = 19, col = 1, cex = 1.25)
    plot(perf.cluster$roc.curves[[1]], lwd = 2, col = 2, add = TRUE)
    points(1 - (perf.cluster$perf$speci)/100, perf.cluster$perf$sensi/100,
        pch = 17, col = 2, cex = 1.25)
    grid()
    abline(0, 1, lty = 2)
    legend("bottomright", c("lasso", "cluster-lasso"), col = c(1, 2), lwd = 2)

[Figure 14]

Features

    heatmap_correlation_signatures(X, best.model.lasso, best.model.cluster,
        clust.min = 5, plot.title = "features correlation matrix")

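[Figure 15]

Note: the orange and blue bands at the top of the heatmap mark the features selected by the lasso and by the cluster-lasso; the two selections overlap to a large extent. Judging from the heatmap, the clustering of the features is similar to the grouping produced by the cluster-lasso.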
