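For more R content, follow the public account 医学和生信笔记.
The 医学和生信笔记 public account mainly shares: 1. short medical notes, including anorectal medicine; 2. data analysis, visualization, and machine learning with R and Python; 3. bioinformatics learning materials and my own study notes.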
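mlr3pipelines is a dataflow programming toolkit. A complete machine learning workflow can be expressed as a Graph (pipeline) that covers data preprocessing, modelling, and the comparison of multiple models. Different models need different preprocessing, and there are also ensemble methods and various nonlinear graph structures; all of these can be handled with mlr3pipelines.
There are many R packages for data preprocessing, such as caret and recipes. mlr3pipelines takes a different approach: it expresses the workflow as a graph through which the data flows.
PipeOps
In mlr3pipelines, the individual data preprocessing methods are called PipeOps. They currently cover most common preprocessing tasks, such as one-hot encoding, sparse matrices, missing-value handling, dimensionality reduction, standardization, and collapsing factor levels.
PipeOps can be used to connect preprocessing with a model, or to build more complex modelling workflows, for example several different preprocessing steps feeding several different models.
List all available PipeOps: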
library(mlr3pipelines)
as.data.table(mlr_pipeops) # 64 PipeOps at the time of writing
##                       key                          packages
##  1: boxcox                mlr3pipelines,bestNormalize
##  2: branch                mlr3pipelines
##  3: chunk                 mlr3pipelines
##  4: classbalancing        mlr3pipelines
##  5: classifavg            mlr3pipelines,stats
##  6: classweights          mlr3pipelines
##  7: colapply              mlr3pipelines
##  8: collapsefactors       mlr3pipelines
##  9: colroles              mlr3pipelines
## 10: copy                  mlr3pipelines
## 11: datefeatures          mlr3pipelines
## 12: encode                mlr3pipelines,stats
## 13: encodeimpact          mlr3pipelines
## 14: encodelmer            mlr3pipelines,lme4,nloptr
## 15: featureunion          mlr3pipelines
## 16: filter                mlr3pipelines
## 17: fixfactors            mlr3pipelines
## 18: histbin               mlr3pipelines,graphics
## 19: ica                   mlr3pipelines,fastICA
## 20: imputeconstant        mlr3pipelines
## 21: imputehist            mlr3pipelines,graphics
## 22: imputelearner         mlr3pipelines
## 23: imputemean            mlr3pipelines
## 24: imputemedian          mlr3pipelines,stats
## 25: imputemode            mlr3pipelines
## 26: imputeoor             mlr3pipelines
## 27: imputesample          mlr3pipelines
## 28: kernelpca             mlr3pipelines,kernlab
## 29: learner               mlr3pipelines
## 30: learner_cv            mlr3pipelines
## 31: missind               mlr3pipelines
## 32: modelmatrix           mlr3pipelines,stats
## 33: multiplicityexply     mlr3pipelines
## 34: multiplicityimply     mlr3pipelines
## 35: mutate                mlr3pipelines
## 36: nmf                   mlr3pipelines,MASS,NMF
## 37: nop                   mlr3pipelines
## 38: ovrsplit              mlr3pipelines
## 39: ovrunite              mlr3pipelines
## 40: pca                   mlr3pipelines
## 41: proxy                 mlr3pipelines
## 42: quantilebin           mlr3pipelines,stats
## 43: randomprojection      mlr3pipelines
## 44: randomresponse        mlr3pipelines
## 45: regravg               mlr3pipelines
## 46: removeconstants       mlr3pipelines
## 47: renamecolumns         mlr3pipelines
## 48: replicate             mlr3pipelines
## 49: scale                 mlr3pipelines
## 50: scalemaxabs           mlr3pipelines
## 51: scalerange            mlr3pipelines
## 52: select                mlr3pipelines
## 53: smote                 mlr3pipelines,smotefamily
## 54: spatialsign           mlr3pipelines
## 55: subsample             mlr3pipelines
## 56: targetinvert          mlr3pipelines
## 57: targetmutate          mlr3pipelines
## 58: targettrafoscalerange mlr3pipelines
## 59: textvectorizer        mlr3pipelines,quanteda,stopwords
## 60: threshold             mlr3pipelines
## 61: tunethreshold         mlr3pipelines,bbotk
## 62: unbranch              mlr3pipelines
## 63: vtreat                mlr3pipelines,vtreat
## 64: yeojohnson            mlr3pipelines,bestNormalize
##                       key                          packages
##     tags
##  1: data transform
##  2: meta
##  3: meta
##  4: imbalanced data,data transform
##  5: ensemble
##  6: imbalanced data,data transform
##  7: data transform
##  8: data transform
##  9: data transform
## 10: meta
## 11: data transform
## 12: encode,data transform
## 13: encode,data transform
## 14: encode,data transform
## 15: ensemble
## 16: feature selection,data transform
## 17: robustify,data transform
## 18: data transform
## 19: data transform
## 20: missings
## 21: missings
## 22: missings
## 23: missings
## 24: missings
## 25: missings
## 26: missings
## 27: missings
## 28: data transform
## 29: learner
## 30: learner,ensemble,data transform
## 31: missings,data transform
## 32: data transform
## 33: multiplicity
## 34: multiplicity
## 35: data transform
## 36: data transform
## 37: meta
## 38: target transform,multiplicity
## 39: multiplicity,ensemble
## 40: data transform
## 41: meta
## 42: data transform
## 43: data transform
## 44: abstract
## 45: ensemble
## 46: robustify,data transform
## 47: data transform
## 48: multiplicity
## 49: data transform
## 50: data transform
## 51: data transform
## 52: feature selection,data transform
## 53: imbalanced data,data transform
## 54: data transform
## 55: data transform
## 56: abstract
## 57: target transform
## 58: target transform
## 59: data transform
## 60: target transform
## 61: target transform
## 62: meta
## 63: encode,missings,data transform
## 64: data transform
##     tags
##     feature_types                                        input.num output.num
##  1: numeric,integer                                      1         1
##  2: NA                                                   1         NA
##  3: NA                                                   1         NA
##  4: logical,integer,numeric,character,factor,ordered,... 1         1
##  5: NA                                                   NA        1
##  6: logical,integer,numeric,character,factor,ordered,... 1         1
##  7: logical,integer,numeric,character,factor,ordered,... 1         1
##  8: factor,ordered                                       1         1
##  9: logical,integer,numeric,character,factor,ordered,... 1         1
## 10: NA                                                   1         NA
## 11: POSIXct                                              1         1
## 12: factor,ordered                                       1         1
## 13: factor,ordered                                       1         1
## 14: factor,ordered                                       1         1
## 15: NA                                                   NA        1
## 16: logical,integer,numeric,character,factor,ordered,... 1         1
## 17: factor,ordered                                       1         1
## 18: numeric,integer                                      1         1
## 19: numeric,integer                                      1         1
## 20: logical,integer,numeric,character,factor,ordered,... 1         1
## 21: integer,numeric                                      1         1
## 22: logical,factor,ordered                               1         1
## 23: numeric,integer                                      1         1
## 24: numeric,integer                                      1         1
## 25: factor,integer,logical,numeric,ordered               1         1
## 26: character,factor,integer,numeric,ordered             1         1
## 27: factor,integer,logical,numeric,ordered               1         1
## 28: numeric,integer                                      1         1
## 29: NA                                                   1         1
## 30: logical,integer,numeric,character,factor,ordered,... 1         1
## 31: logical,integer,numeric,character,factor,ordered,... 1         1
## 32: logical,integer,numeric,character,factor,ordered,... 1         1
## 33: NA                                                   1         NA
## 34: NA                                                   NA        1
## 35: logical,integer,numeric,character,factor,ordered,... 1         1
## 36: numeric,integer                                      1         1
## 37: NA                                                   1         1
## 38: NA                                                   1         1
## 39: NA                                                   1         1
## 40: numeric,integer                                      1         1
## 41: NA                                                   NA        1
## 42: numeric,integer                                      1         1
## 43: numeric,integer                                      1         1
## 44: NA                                                   1         1
## 45: NA                                                   NA        1
## 46: logical,integer,numeric,character,factor,ordered,... 1         1
## 47: logical,integer,numeric,character,factor,ordered,... 1         1
## 48: NA                                                   1         1
## 49: numeric,integer                                      1         1
## 50: numeric,integer                                      1         1
## 51: numeric,integer                                      1         1
## 52: logical,integer,numeric,character,factor,ordered,... 1         1
## 53: logical,integer,numeric,character,factor,ordered,... 1         1
## 54: numeric,integer                                      1         1
## 55: logical,integer,numeric,character,factor,ordered,... 1         1
## 56: NA                                                   2         1
## 57: NA                                                   1         2
## 58: NA                                                   1         2
## 59: character                                            1         1
## 60: NA                                                   1         1
## 61: NA                                                   1         1
## 62: NA                                                   NA        1
## 63: logical,integer,numeric,character,factor,ordered,... 1         1
## 64: numeric,integer                                      1         1
##     feature_types                                        input.num output.num
##     input.type.train input.type.predict  output.type.train output.type.predict
##  1: Task             Task                Task              Task
##  2: *                *                   *                 *
##  3: Task             Task                Task              Task
##  4: TaskClassif      TaskClassif         TaskClassif       TaskClassif
##  5: NULL             PredictionClassif   NULL              PredictionClassif
##  6: TaskClassif      TaskClassif         TaskClassif       TaskClassif
##  7: Task             Task                Task              Task
##  8: Task             Task                Task              Task
##  9: Task             Task                Task              Task
## 10: *                *                   *                 *
## 11: Task             Task                Task              Task
## 12: Task             Task                Task              Task
## 13: Task             Task                Task              Task
## 14: Task             Task                Task              Task
## 15: Task             Task                Task              Task
## 16: Task             Task                Task              Task
## 17: Task             Task                Task              Task
## 18: Task             Task                Task              Task
## 19: Task             Task                Task              Task
## 20: Task             Task                Task              Task
## 21: Task             Task                Task              Task
## 22: Task             Task                Task              Task
## 23: Task             Task                Task              Task
## 24: Task             Task                Task              Task
## 25: Task             Task                Task              Task
## 26: Task             Task                Task              Task
## 27: Task             Task                Task              Task
## 28: Task             Task                Task              Task
## 29: TaskClassif      TaskClassif         NULL              PredictionClassif
## 30: TaskClassif      TaskClassif         TaskClassif       TaskClassif
## 31: Task             Task                Task              Task
## 32: Task             Task                Task              Task
## 33: [*]              [*]                 *                 *
## 34: *                *                   [*]               [*]
## 35: Task             Task                Task              Task
## 36: Task             Task                Task              Task
## 37: *                *                   *                 *
## 38: TaskClassif      TaskClassif         [TaskClassif]     [TaskClassif]
## 39: [NULL]           [PredictionClassif] NULL              PredictionClassif
## 40: Task             Task                Task              Task
## 41: *                *                   *                 *
## 42: Task             Task                Task              Task
## 43: Task             Task                Task              Task
## 44: NULL             Prediction          NULL              Prediction
## 45: NULL             PredictionRegr      NULL              PredictionRegr
## 46: Task             Task                Task              Task
## 47: Task             Task                Task              Task
## 48: *                *                   [*]               [*]
## 49: Task             Task                Task              Task
## 50: Task             Task                Task              Task
## 51: Task             Task                Task              Task
## 52: Task             Task                Task              Task
## 53: Task             Task                Task              Task
## 54: Task             Task                Task              Task
## 55: Task             Task                Task              Task
## 56: NULL,NULL        function,Prediction NULL              Prediction
## 57: Task             Task                NULL,Task         function,Task
## 58: TaskRegr         TaskRegr            NULL,TaskRegr     function,TaskRegr
## 59: Task             Task                Task              Task
## 60: NULL             PredictionClassif   NULL              PredictionClassif
## 61: Task             Task                NULL              Prediction
## 62: *                *                   *                 *
## 63: Task             Task                Task              Task
## 64: Task             Task                Task              Task
##     input.type.train input.type.predict  output.type.train output.type.predict
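That is a long list, but only about ten of these PipeOps are used regularly.
A preprocessing step can be created in either of the following ways:
pca <- mlr_pipeops$get("pca")
# or use the shorthand
pca <- po("pca")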
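A PipeOp can also be trained and applied on its own, outside of any graph. The following is a minimal sketch, assuming the built-in iris task; the variable names are only illustrative. It shows that train() and predict() both take a list of inputs and return a list of outputs.

```r
library(mlr3)
library(mlr3pipelines)

task <- tsk("iris")
pca <- po("pca")

# train() fits the rotation on the task and returns the transformed task
trained <- pca$train(list(task))[[1]]
trained$feature_names

# predict() re-uses the stored state on new data (here: the first 5 rows)
new_task <- task$clone()$filter(1:5)
predicted <- pca$predict(list(new_task))[[1]]
predicted$nrow
```

The list wrapping is needed because a PipeOp may have several input and output channels; a Graph takes care of this wiring automatically.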
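An important point: the same mechanism is used not only to create preprocessing steps, but also to wrap learners, feature-selection methods, and so on:
# choose a learner/algorithm
library(mlr3)
learner <- po("learner", lrn("classif.rpart"))
# choose a feature-selection method and set its parameter
filter <- po("filter", filter = mlr3filters::flt("variance"), filter.frac = 0.5)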
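The pipe operator in mlr3pipelines: %>>%
This is a dedicated operator introduced by the mlr3 team. It connects PipeOps to each other, for example one preprocessing step to the next, or preprocessing to a model:
gr <- po("scale") %>>% po("pca")
gr$plot(html = FALSE)

Many of the more powerful features are built on this operator.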
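To see what %>>% actually builds, one can inspect the resulting Graph. The sketch below assumes the same two PipeOps as above and compares the operator with the more verbose Graph methods add_pipeop() and add_edge(); gr2 is just an illustrative name.

```r
library(mlr3)
library(mlr3pipelines)

# build the graph with the operator
gr <- po("scale") %>>% po("pca")
gr$ids()    # ids of the PipeOps contained in the graph
gr$edges    # one row per connection between two PipeOps

# build the same structure step by step
gr2 <- Graph$new()
gr2$add_pipeop(po("scale"))
gr2$add_pipeop(po("pca"))
gr2$add_edge("scale", "pca")
gr2$edges
```

Both graphs contain the same nodes and a single edge from scale to pca; the operator is simply the shorter way to write it.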
Building a model
A simple example: preprocess the data first, then train.
# connect preprocessing and a model, somewhat like a tidymodels workflow
mutate <- po("mutate")
filter <- po("filter",
  filter = mlr3filters::flt("variance"),
  param_vals = list(filter.frac = 0.5))
graph <- mutate %>>%
  filter %>>%
  po("learner", learner = lrn("classif.rpart"))
This graph now behaves like a learner that includes its own preprocessing steps, and it can be trained and used for prediction directly, as described earlier:
task <- tsk("iris")
graph$train(task)
## $classif.rpart.output
## NULL
Prediction:
graph$predict(task)
## $classif.rpart.output
## <PredictionClassif> for 150 observations:
##     row_ids     truth  response
##           1    setosa    setosa
##           2    setosa    setosa
##           3    setosa    setosa
## ---
##         148 virginica virginica
##         149 virginica virginica
##         150 virginica virginica
In addition, the graph can be converted into a GraphLearner object, which can then be used with resample(), benchmark(), and so on:
glrn <- as_learner(graph) # convert to a GraphLearner
cv5 <- rsmp("cv", folds = 5)
resample(task, glrn, cv5)
## INFO [21:01:53.145] [mlr3] Applying learner 'mutate.variance.classif.rpart' on task 'iris' (iter 2/5)
## INFO [21:01:53.230] [mlr3] Applying learner 'mutate.variance.classif.rpart' on task 'iris' (iter 5/5)
## INFO [21:01:53.291] [mlr3] Applying learner 'mutate.variance.classif.rpart' on task 'iris' (iter 1/5)
## INFO [21:01:53.350] [mlr3] Applying learner 'mutate.variance.classif.rpart' on task 'iris' (iter 4/5)
## INFO [21:01:53.407] [mlr3] Applying learner 'mutate.variance.classif.rpart' on task 'iris' (iter 3/5)
## <ResampleResult> of 5 iterations
## * Task: iris
## * Learner: mutate.variance.classif.rpart
## * Warnings: 0 in 0 iterations
## * Errors: 0 in 0 iterations
Many preprocessing steps have parameters of their own that need tuning. With mlr3pipelines, the hyperparameters of the learner and the parameters of the preprocessing steps can be tuned together.
library(paradox)
search_space <- ps(
  classif.rpart.cp = p_dbl(0, 0.05),     # hyperparameter of the learner
  variance.filter.frac = p_dbl(0.25, 1)  # parameter of the feature filter
)
library(mlr3tuning)
instance <- TuningInstanceSingleCrit$new(
  task = task,
  learner = glrn,
  resampling = rsmp("holdout", ratio = 0.7),
  measure = msr("classif.acc"),
  search_space = search_space,
  terminator = trm("evals", n_evals = 20)
)
tuner <- tnr("random_search")
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")
tuner$optimize(instance)
##    classif.rpart.cp variance.filter.frac learner_param_vals  x_domain
## 1:       0.02162802            0.3852356          <list[5]> <list[2]>
##    classif.acc
## 1:   0.9777778
instance$result_y
## classif.acc
##   0.9777778
instance$result_learner_param_vals
## $mutate.mutation
## list()
##
## $mutate.delete_originals
## [1] FALSE
##
## $variance.filter.frac
## [1] 0.3852356
##
## $classif.rpart.xval
## [1] 0
##
## $classif.rpart.cp
## [1] 0.02162802
The result reports both the learner's hyperparameter (cp) and the parameter of the feature-selection step (filter.frac).
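The names used in the search space are the parameter ids of the GraphLearner, which are prefixed with the id of the PipeOp they belong to. A small sketch, assuming the same mutate/filter/rpart graph as above, shows how to look them up before writing a search space:

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3filters)

graph <- po("mutate") %>>%
  po("filter", filter = flt("variance"), filter.frac = 0.5) %>>%
  po("learner", lrn("classif.rpart"))
glrn <- as_learner(graph)

# all tunable parameter ids, prefixed with the PipeOp id
ids <- glrn$param_set$ids()

# the two parameters used in the search space above
intersect(ids, c("classif.rpart.cp", "variance.filter.frac"))
```

Checking the ids first avoids typos in the search space, which would otherwise only surface when the tuning instance is constructed.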
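Nonlinear graphs
- Branching: one node leads to several branches, which is useful, for example, when comparing several feature-selection methods. Only one branch is executed.
- Copying: one node leads to several branches and all of them are executed, but only one branch at a time; parallel execution is not supported yet.
- Stacking: graphs are stacked on top of each other, so that the output of one graph becomes the input of the next.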
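Branching & copying
Branching is implemented with PipeOpBranch and PipeOpUnbranch. The concept is illustrated in the figure below:
The following example demonstrates branching. Every branch must be closed again with an unbranch step:
graph <- po("branch", c("nop", "pca", "scale")) %>>% # open the branches
  gunion(list(
    po("nop", id = "null1"), # branch 1, given the id null1
    po("pca"),               # branch 2
    po("scale")              # branch 3
  )) %>>%
  po("unbranch", c("nop", "pca", "scale")) # close the branches
graph$plot(html = FALSE)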
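Which branch is executed is controlled by the branch.selection parameter of the graph. The sketch below rebuilds the branching graph from above and selects the pca branch; the iris task is an assumption made only for illustration.

```r
library(mlr3)
library(mlr3pipelines)

graph <- po("branch", c("nop", "pca", "scale")) %>>%
  gunion(list(
    po("nop", id = "null1"),
    po("pca"),
    po("scale")
  )) %>>%
  po("unbranch", c("nop", "pca", "scale"))

# pick the branch to run; the other two are skipped
graph$param_set$values$branch.selection <- "pca"

out <- graph$train(tsk("iris"))[[1]]
out$feature_names   # principal components, since the pca branch was active
```

Because branch.selection is an ordinary parameter, it can also be placed in a tuning search space, which is how several preprocessing options are usually compared.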

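Bagging
Bagging is a form of ensemble learning. The concept is not explained here (interested readers can look it up), but the figure below gives an overview:
The basic usage is shown below.
single_pred <- po("subsample", frac = 0.7) %>>%
  po("learner", lrn("classif.rpart")) # build a single model
pred_set <- ppl("greplicate", single_pred, 10L) # replicate it 10 times
bagging <- pred_set %>>%
  po("classifavg", innum = 10)
bagging$plot(html = FALSE)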

Convert the object above into a GraphLearner, and it can be trained and used for prediction:
task <- tsk("iris")
split <- partition(task, ratio = 0.7, stratify = TRUE)
baglrn <- as_learner(bagging)
baglrn$train(task, row_ids = split$train)
baglrn$predict(task, row_ids = split$test)
## <PredictionClassif> for 45 observations:
##     row_ids     truth  response prob.setosa prob.versicolor prob.virginica
##           4    setosa    setosa           1               0              0
##           6    setosa    setosa           1               0              0
##           8    setosa    setosa           1               0              0
## ---
##         141 virginica virginica           0               0              1
##         147 virginica virginica           0               0              1
##         150 virginica virginica           0               0              1
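Stacking
Stacking is another way to improve model performance. The concept is shown in the figure below:
To avoid overfitting, PipeOpLearnerCV is used here to produce out-of-fold predictions; it runs the resampling internally on the training data.
First create the level 0 learner from a clone of the base learner and give it an id:
lrn <- lrn("classif.rpart")
lrn_0 <- po("learner_cv", lrn$clone())
lrn_0$id <- "rpart_cv"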
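Next, combine gunion with PipeOpNOP so that the untouched task is passed on to the next level. This way the task that went through the decision tree and the unprocessed task both reach the next level.
level_0 <- gunion(list(lrn_0, po("nop")))
Merge the two outputs and attach the final learner:
combined <- level_0 %>>% po("featureunion", 2)
stack <- combined %>>% po("learner", lrn$clone())
stack$plot(html = FALSE)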
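It can help to look at the task that the final learner actually receives. The sketch below rebuilds only the level 0 part and the feature union, then trains it on iris (an assumption for illustration; base is an illustrative name) to list the resulting features.

```r
library(mlr3)
library(mlr3pipelines)

base <- lrn("classif.rpart")
lrn_0 <- po("learner_cv", base$clone(), id = "rpart_cv")

# level 0: cross-validated predictions next to the untouched features
combined <- gunion(list(lrn_0, po("nop"))) %>>% po("featureunion", 2)

out <- combined$train(tsk("iris"))[[1]]
feats <- out$feature_names
feats   # the prediction column of rpart_cv plus the four original features
```

The prediction column is named after the id of the learner_cv PipeOp, so giving each level 0 learner a distinct id keeps the stacked features apart.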

Now the stack can be trained and used for prediction:
stacklrn <- as_learner(stack)
stacklrn$train(task, split$train)
stacklrn$predict(task, split$test)
## <PredictionClassif> for 45 observations:
##     row_ids     truth  response
##           4    setosa    setosa
##           6    setosa    setosa
##           8    setosa    setosa
## ---
##         141 virginica virginica
##         147 virginica virginica
##         150 virginica virginica
A much more complex example
This example combines several different preprocessing steps with several different learners.
library("magrittr")
library("mlr3learners")
rprt = lrn("classif.rpart", predict_type = "prob")
glmn = lrn("classif.glmnet", predict_type = "prob")
# create the learners
lrn_0 = po("learner_cv", rprt, id = "rpart_cv_1")
lrn_0$param_set$values$maxdepth = 5L
lrn_1 = po("pca", id = "pca1") %>>% po("learner_cv", rprt, id = "rpart_cv_2")
lrn_1$param_set$values$rpart_cv_2.maxdepth = 1L
lrn_2 = po("pca", id = "pca2") %>>% po("learner_cv", glmn)
# level 0
level_0 = gunion(list(lrn_0, lrn_1, lrn_2, po("nop", id = "NOP1")))
# level 1
level_1 = level_0 %>>%
  po("featureunion", 4) %>>%
  po("copy", 3) %>>%
  gunion(list(
    po("learner_cv", rprt, id = "rpart_cv_l1"),
    po("learner_cv", glmn, id = "glmnt_cv_l1"),
    po("nop", id = "NOP_l1")
  ))
# level 2
level_2 = level_1 %>>%
  po("featureunion", 3, id = "u2") %>>%
  po("learner", rprt, id = "rpart_l2")
level_2$plot(html = FALSE)

Now train and predict:
task = tsk("iris")
lrn = as_learner(level_2)
lrn$train(task, split$train)$predict(task, split$test)$score()
## classif.ce
## 0.08888889
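Some special preprocessing steps
These are in fact very commonly used steps.
Missing-value handling: PipeOpImpute
Missing values are extremely common. mlr3pipelines can impute both numeric and factor features.
pom <- po("missind")
pon <- po("imputehist", # impute numeric features by sampling from a histogram
  id = "impute_num",    # give the PipeOp an id
  affect_columns = selector_type(c("integer", "numeric")) # columns to process; affect_columns expects a Selector
)
pof = po("imputeoor", id = "imputer_fct", affect_columns = selector_type("factor")) # handle factors
imputer = pom %>>% pon %>>% pof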
Connect a learner:
polrn <- po("learner", lrn("classif.rpart"))
lrn <- as_learner(imputer %>>% polrn)
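To check that an imputation step really removes the missing values, it can be applied to a task that contains some. The sketch below is not the pipeline from above; it uses a single imputehist PipeOp on the built-in pima task, which is an assumption chosen because that task has missing numeric values.

```r
library(mlr3)
library(mlr3pipelines)

task <- tsk("pima")
n_missing_before <- sum(task$missings())   # missing values per column, summed

# impute numeric features by sampling from their histogram
imp <- po("imputehist")
out <- imp$train(list(task))[[1]]
n_missing_after <- sum(out$missings())

c(before = n_missing_before, after = n_missing_after)
```

The same check works for a whole imputation graph: train it on the task and call missings() on the task it returns.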
Creating new variables: PipeOpMutate
pom <- po("mutate",
  mutation = list(
    Sepal.Sum = ~ Sepal.Length + Sepal.Width,
    Petal.Sum = ~ Petal.Length + Petal.Width,
    Sepal.Petal.Ratio = ~ (Sepal.Length / Petal.Length)
  ))
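Each formula in mutation is evaluated on the task data, and the result is added as a new feature. A short sketch, assuming the iris task and only one of the mutations above, shows the effect:

```r
library(mlr3)
library(mlr3pipelines)

pom <- po("mutate",
  mutation = list(Sepal.Sum = ~ Sepal.Length + Sepal.Width))

task <- tsk("iris")
out <- pom$train(list(task))[[1]]

out$feature_names            # the original features plus Sepal.Sum
head(out$data(cols = "Sepal.Sum"))
```

By default the original columns are kept (delete_originals is FALSE, as the tuning output earlier in this post also shows), so the learner sees both the raw and the derived features.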
Training on subsets: PipeOpChunk
When a dataset is too large, splitting it into chunks and training one model per chunk is a good option.
chks = po("chunk", 4)
lrns = ppl("greplicate", po("learner", lrn("classif.rpart")), 4)
mjv = po("classifavg", 4)
pipeline = chks %>>% lrns %>>% mjv
pipeline$plot(html = FALSE)

task = tsk("iris")
train.idx = sample(seq_len(task$nrow), 120)
test.idx = setdiff(seq_len(task$nrow), train.idx)
pipelrn = as_learner(pipeline)
pipelrn$train(task, train.idx)$predict(task, train.idx)$score()
## classif.ce
## 0.3333333
Feature selection: PipeOpFilter and PipeOpSelect
PipeOpFilter makes the feature-selection methods from mlr3filters available inside mlr3pipelines.
po("filter", mlr3filters::flt("information_gain"))
## PipeOp: <information_gain> (not trained)
## values: <list()>
## Input channels <name [train type, predict type]>:
##   input [Task,Task]
## Output channels <name [train type, predict type]>:
##   output [Task,Task]
The parameters filter.nfeat, filter.frac, and filter.cutoff determine which variables/features are kept.
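As a small illustration of these parameters, the sketch below keeps the two features with the highest variance via filter.nfeat, and then uses PipeOpSelect with selector_grep() to keep features by name. The iris task and the variance filter are assumptions made for the example.

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3filters)

task <- tsk("iris")

# PipeOpFilter: keep the 2 features with the highest filter score
po_flt <- po("filter", filter = flt("variance"), filter.nfeat = 2)
kept_by_filter <- po_flt$train(list(task))[[1]]$feature_names
kept_by_filter

# PipeOpSelect: keep features whose names match a pattern
po_sel <- po("select", selector = selector_grep("^Petal"))
kept_by_select <- po_sel$train(list(task))[[1]]$feature_names
kept_by_select
```

PipeOpFilter decides from the data which features to keep, while PipeOpSelect applies a fixed rule, so the two are often combined with branching or tuning.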
