For more R content, follow the WeChat official account: 医学和生信笔记

The 医学和生信笔记 account mainly shares: 1. medical tidbits and proctology notes; 2. R- and Python-based data analysis, visualization, and machine learning; 3. bioinformatics learning resources and the author's own study notes!

mlr3pipelines is a dataflow-programming toolkit. A complete machine-learning workflow, called a Graph or Pipeline, can include data preprocessing, model building, comparison of multiple models, and more. Different models require different preprocessing, and tasks such as ensemble learning and various nonlinear workflows can all be handled with mlr3pipelines.

There are many R packages for data preprocessing, such as caret and recipes; mlr3pipelines is distinctive in using a graph-flow approach.
[Figure 1]

pipeops

The individual preprocessing operations in mlr3pipelines are called PipeOps. They currently cover most common preprocessing methods, such as one-hot encoding, sparse-matrix handling, missing-value imputation, dimensionality reduction, scaling, collapsing factor levels, and more.

PipeOps can be used to connect preprocessing to models, or to build complex modeling workflows, for example linking several different preprocessing steps to several different models.

List all available PipeOps:

  library(mlr3pipelines)
  as.data.table(mlr_pipeops) # currently 64 PipeOps
  ## key packages
  ## 1: boxcox mlr3pipelines,bestNormalize
  ## 2: branch mlr3pipelines
  ## 3: chunk mlr3pipelines
  ## 4: classbalancing mlr3pipelines
  ## 5: classifavg mlr3pipelines,stats
  ## 6: classweights mlr3pipelines
  ## 7: colapply mlr3pipelines
  ## 8: collapsefactors mlr3pipelines
  ## 9: colroles mlr3pipelines
  ## 10: copy mlr3pipelines
  ## 11: datefeatures mlr3pipelines
  ## 12: encode mlr3pipelines,stats
  ## 13: encodeimpact mlr3pipelines
  ## 14: encodelmer mlr3pipelines,lme4,nloptr
  ## 15: featureunion mlr3pipelines
  ## 16: filter mlr3pipelines
  ## 17: fixfactors mlr3pipelines
  ## 18: histbin mlr3pipelines,graphics
  ## 19: ica mlr3pipelines,fastICA
  ## 20: imputeconstant mlr3pipelines
  ## 21: imputehist mlr3pipelines,graphics
  ## 22: imputelearner mlr3pipelines
  ## 23: imputemean mlr3pipelines
  ## 24: imputemedian mlr3pipelines,stats
  ## 25: imputemode mlr3pipelines
  ## 26: imputeoor mlr3pipelines
  ## 27: imputesample mlr3pipelines
  ## 28: kernelpca mlr3pipelines,kernlab
  ## 29: learner mlr3pipelines
  ## 30: learner_cv mlr3pipelines
  ## 31: missind mlr3pipelines
  ## 32: modelmatrix mlr3pipelines,stats
  ## 33: multiplicityexply mlr3pipelines
  ## 34: multiplicityimply mlr3pipelines
  ## 35: mutate mlr3pipelines
  ## 36: nmf mlr3pipelines,MASS,NMF
  ## 37: nop mlr3pipelines
  ## 38: ovrsplit mlr3pipelines
  ## 39: ovrunite mlr3pipelines
  ## 40: pca mlr3pipelines
  ## 41: proxy mlr3pipelines
  ## 42: quantilebin mlr3pipelines,stats
  ## 43: randomprojection mlr3pipelines
  ## 44: randomresponse mlr3pipelines
  ## 45: regravg mlr3pipelines
  ## 46: removeconstants mlr3pipelines
  ## 47: renamecolumns mlr3pipelines
  ## 48: replicate mlr3pipelines
  ## 49: scale mlr3pipelines
  ## 50: scalemaxabs mlr3pipelines
  ## 51: scalerange mlr3pipelines
  ## 52: select mlr3pipelines
  ## 53: smote mlr3pipelines,smotefamily
  ## 54: spatialsign mlr3pipelines
  ## 55: subsample mlr3pipelines
  ## 56: targetinvert mlr3pipelines
  ## 57: targetmutate mlr3pipelines
  ## 58: targettrafoscalerange mlr3pipelines
  ## 59: textvectorizer mlr3pipelines,quanteda,stopwords
  ## 60: threshold mlr3pipelines
  ## 61: tunethreshold mlr3pipelines,bbotk
  ## 62: unbranch mlr3pipelines
  ## 63: vtreat mlr3pipelines,vtreat
  ## 64: yeojohnson mlr3pipelines,bestNormalize
  ## key packages
  ## tags
  ## 1: data transform
  ## 2: meta
  ## 3: meta
  ## 4: imbalanced data,data transform
  ## 5: ensemble
  ## 6: imbalanced data,data transform
  ## 7: data transform
  ## 8: data transform
  ## 9: data transform
  ## 10: meta
  ## 11: data transform
  ## 12: encode,data transform
  ## 13: encode,data transform
  ## 14: encode,data transform
  ## 15: ensemble
  ## 16: feature selection,data transform
  ## 17: robustify,data transform
  ## 18: data transform
  ## 19: data transform
  ## 20: missings
  ## 21: missings
  ## 22: missings
  ## 23: missings
  ## 24: missings
  ## 25: missings
  ## 26: missings
  ## 27: missings
  ## 28: data transform
  ## 29: learner
  ## 30: learner,ensemble,data transform
  ## 31: missings,data transform
  ## 32: data transform
  ## 33: multiplicity
  ## 34: multiplicity
  ## 35: data transform
  ## 36: data transform
  ## 37: meta
  ## 38: target transform,multiplicity
  ## 39: multiplicity,ensemble
  ## 40: data transform
  ## 41: meta
  ## 42: data transform
  ## 43: data transform
  ## 44: abstract
  ## 45: ensemble
  ## 46: robustify,data transform
  ## 47: data transform
  ## 48: multiplicity
  ## 49: data transform
  ## 50: data transform
  ## 51: data transform
  ## 52: feature selection,data transform
  ## 53: imbalanced data,data transform
  ## 54: data transform
  ## 55: data transform
  ## 56: abstract
  ## 57: target transform
  ## 58: target transform
  ## 59: data transform
  ## 60: target transform
  ## 61: target transform
  ## 62: meta
  ## 63: encode,missings,data transform
  ## 64: data transform
  ## tags
  ## feature_types input.num output.num
  ## 1: numeric,integer 1 1
  ## 2: NA 1 NA
  ## 3: NA 1 NA
  ## 4: logical,integer,numeric,character,factor,ordered,... 1 1
  ## 5: NA NA 1
  ## 6: logical,integer,numeric,character,factor,ordered,... 1 1
  ## 7: logical,integer,numeric,character,factor,ordered,... 1 1
  ## 8: factor,ordered 1 1
  ## 9: logical,integer,numeric,character,factor,ordered,... 1 1
  ## 10: NA 1 NA
  ## 11: POSIXct 1 1
  ## 12: factor,ordered 1 1
  ## 13: factor,ordered 1 1
  ## 14: factor,ordered 1 1
  ## 15: NA NA 1
  ## 16: logical,integer,numeric,character,factor,ordered,... 1 1
  ## 17: factor,ordered 1 1
  ## 18: numeric,integer 1 1
  ## 19: numeric,integer 1 1
  ## 20: logical,integer,numeric,character,factor,ordered,... 1 1
  ## 21: integer,numeric 1 1
  ## 22: logical,factor,ordered 1 1
  ## 23: numeric,integer 1 1
  ## 24: numeric,integer 1 1
  ## 25: factor,integer,logical,numeric,ordered 1 1
  ## 26: character,factor,integer,numeric,ordered 1 1
  ## 27: factor,integer,logical,numeric,ordered 1 1
  ## 28: numeric,integer 1 1
  ## 29: NA 1 1
  ## 30: logical,integer,numeric,character,factor,ordered,... 1 1
  ## 31: logical,integer,numeric,character,factor,ordered,... 1 1
  ## 32: logical,integer,numeric,character,factor,ordered,... 1 1
  ## 33: NA 1 NA
  ## 34: NA NA 1
  ## 35: logical,integer,numeric,character,factor,ordered,... 1 1
  ## 36: numeric,integer 1 1
  ## 37: NA 1 1
  ## 38: NA 1 1
  ## 39: NA 1 1
  ## 40: numeric,integer 1 1
  ## 41: NA NA 1
  ## 42: numeric,integer 1 1
  ## 43: numeric,integer 1 1
  ## 44: NA 1 1
  ## 45: NA NA 1
  ## 46: logical,integer,numeric,character,factor,ordered,... 1 1
  ## 47: logical,integer,numeric,character,factor,ordered,... 1 1
  ## 48: NA 1 1
  ## 49: numeric,integer 1 1
  ## 50: numeric,integer 1 1
  ## 51: numeric,integer 1 1
  ## 52: logical,integer,numeric,character,factor,ordered,... 1 1
  ## 53: logical,integer,numeric,character,factor,ordered,... 1 1
  ## 54: numeric,integer 1 1
  ## 55: logical,integer,numeric,character,factor,ordered,... 1 1
  ## 56: NA 2 1
  ## 57: NA 1 2
  ## 58: NA 1 2
  ## 59: character 1 1
  ## 60: NA 1 1
  ## 61: NA 1 1
  ## 62: NA NA 1
  ## 63: logical,integer,numeric,character,factor,ordered,... 1 1
  ## 64: numeric,integer 1 1
  ## feature_types input.num output.num
  ## input.type.train input.type.predict output.type.train output.type.predict
  ## 1: Task Task Task Task
  ## 2: * * * *
  ## 3: Task Task Task Task
  ## 4: TaskClassif TaskClassif TaskClassif TaskClassif
  ## 5: NULL PredictionClassif NULL PredictionClassif
  ## 6: TaskClassif TaskClassif TaskClassif TaskClassif
  ## 7: Task Task Task Task
  ## 8: Task Task Task Task
  ## 9: Task Task Task Task
  ## 10: * * * *
  ## 11: Task Task Task Task
  ## 12: Task Task Task Task
  ## 13: Task Task Task Task
  ## 14: Task Task Task Task
  ## 15: Task Task Task Task
  ## 16: Task Task Task Task
  ## 17: Task Task Task Task
  ## 18: Task Task Task Task
  ## 19: Task Task Task Task
  ## 20: Task Task Task Task
  ## 21: Task Task Task Task
  ## 22: Task Task Task Task
  ## 23: Task Task Task Task
  ## 24: Task Task Task Task
  ## 25: Task Task Task Task
  ## 26: Task Task Task Task
  ## 27: Task Task Task Task
  ## 28: Task Task Task Task
  ## 29: TaskClassif TaskClassif NULL PredictionClassif
  ## 30: TaskClassif TaskClassif TaskClassif TaskClassif
  ## 31: Task Task Task Task
  ## 32: Task Task Task Task
  ## 33: [*] [*] * *
  ## 34: * * [*] [*]
  ## 35: Task Task Task Task
  ## 36: Task Task Task Task
  ## 37: * * * *
  ## 38: TaskClassif TaskClassif [TaskClassif] [TaskClassif]
  ## 39: [NULL] [PredictionClassif] NULL PredictionClassif
  ## 40: Task Task Task Task
  ## 41: * * * *
  ## 42: Task Task Task Task
  ## 43: Task Task Task Task
  ## 44: NULL Prediction NULL Prediction
  ## 45: NULL PredictionRegr NULL PredictionRegr
  ## 46: Task Task Task Task
  ## 47: Task Task Task Task
  ## 48: * * [*] [*]
  ## 49: Task Task Task Task
  ## 50: Task Task Task Task
  ## 51: Task Task Task Task
  ## 52: Task Task Task Task
  ## 53: Task Task Task Task
  ## 54: Task Task Task Task
  ## 55: Task Task Task Task
  ## 56: NULL,NULL function,Prediction NULL Prediction
  ## 57: Task Task NULL,Task function,Task
  ## 58: TaskRegr TaskRegr NULL,TaskRegr function,TaskRegr
  ## 59: Task Task Task Task
  ## 60: NULL PredictionClassif NULL PredictionClassif
  ## 61: Task Task NULL Prediction
  ## 62: * * * *
  ## 63: Task Task Task Task
  ## 64: Task Task Task Task
  ## input.type.train input.type.predict output.type.train output.type.predict

That is a long list of preprocessing methods, but in practice only a dozen or so are used routinely.

A preprocessing step can be created as follows:

  pca <- mlr_pipeops$get("pca")
  # or, using the shorthand
  pca <- po("pca")

Importantly, this mechanism is not limited to preprocessing steps: the same syntax can also wrap learners/algorithms and feature-selection methods:

  # wrap a learner/algorithm
  library(mlr3)
  learner <- po("learner", lrn("classif.rpart"))
  # wrap a feature-selection method and set its parameters
  filter <- po("filter",
    filter = mlr3filters::flt("variance"),
    filter.frac = 0.5
  )

The pipe operator in mlr3pipelines: %>>%

This is a dedicated pipe operator created by the mlr3 team. It connects preprocessing steps to each other and to models:

  gr <- po("scale") %>>% po("pca")
  gr$plot(html = FALSE)

[Figure 2]

Many of the package's most powerful operations build on this pipe operator.

Building a model

A simple example: preprocess the data first, then train a model.

  # chain preprocessing and a model, somewhat like a tidymodels workflow
  mutate <- po("mutate")
  filter <- po("filter",
    filter = mlr3filters::flt("variance"),
    param_vals = list(filter.frac = 0.5))
  graph <- mutate %>>%
    filter %>>%
    po("learner", learner = lrn("classif.rpart"))

This graph now acts as a learner with built-in preprocessing, and can be used for training and prediction just as shown earlier:

  task <- tsk("iris")
  graph$train(task)
  ## $classif.rpart.output
  ## NULL

Prediction:

  graph$predict(task)
  ## $classif.rpart.output
  ## <PredictionClassif> for 150 observations:
  ## row_ids truth response
  ## 1 setosa setosa
  ## 2 setosa setosa
  ## 3 setosa setosa
  ## ---
  ## 148 virginica virginica
  ## 149 virginica virginica
  ## 150 virginica virginica

In addition, the graph can be converted into a GraphLearner object, which can then be used with resample() and benchmark():

  glrn <- as_learner(graph) # convert to a GraphLearner
  cv5 <- rsmp("cv", folds = 5)
  resample(task, glrn, cv5)
  ## INFO [21:01:53.145] [mlr3] Applying learner 'mutate.variance.classif.rpart' on task 'iris' (iter 2/5)
  ## INFO [21:01:53.230] [mlr3] Applying learner 'mutate.variance.classif.rpart' on task 'iris' (iter 5/5)
  ## INFO [21:01:53.291] [mlr3] Applying learner 'mutate.variance.classif.rpart' on task 'iris' (iter 1/5)
  ## INFO [21:01:53.350] [mlr3] Applying learner 'mutate.variance.classif.rpart' on task 'iris' (iter 4/5)
  ## INFO [21:01:53.407] [mlr3] Applying learner 'mutate.variance.classif.rpart' on task 'iris' (iter 3/5)
  ## <ResampleResult> of 5 iterations
  ## * Task: iris
  ## * Learner: mutate.variance.classif.rpart
  ## * Warnings: 0 in 0 iterations
  ## * Errors: 0 in 0 iterations

Many preprocessing steps have tunable parameters of their own. mlr3pipelines supports tuning not only a learner's hyperparameters but also the parameters of the preprocessing steps.

  library(paradox)
  ps <- ps(
    classif.rpart.cp = p_dbl(0, 0.05),      # hyperparameter of the learner
    variance.filter.frac = p_dbl(0.25, 1)   # parameter of the feature filter
  )
  library(mlr3tuning)
  instance <- TuningInstanceSingleCrit$new(
    task = task,
    learner = glrn,
    resampling = rsmp("holdout", ratio = 0.7),
    measure = msr("classif.acc"),
    search_space = ps,
    terminator = trm("evals", n_evals = 20)
  )
  tuner <- tnr("random_search")
  lgr::get_logger("mlr3")$set_threshold("warn")
  lgr::get_logger("bbotk")$set_threshold("warn")
  tuner$optimize(instance)
  ## classif.rpart.cp variance.filter.frac learner_param_vals x_domain
  ## 1: 0.02162802 0.3852356 <list[5]> <list[2]>
  ## classif.acc
  ## 1: 0.9777778
  instance$result_y
  ## classif.acc
  ## 0.9777778
  instance$result_learner_param_vals
  ## $mutate.mutation
  ## list()
  ##
  ## $mutate.delete_originals
  ## [1] FALSE
  ##
  ## $variance.filter.frac
  ## [1] 0.3852356
  ##
  ## $classif.rpart.xval
  ## [1] 0
  ##
  ## $classif.rpart.cp
  ## [1] 0.02162802

The result reports both the tuned learner hyperparameters and the tuned feature-filter parameter.

Nonlinear graphs

  • Branching: one node leads into several branches, useful for example when comparing several feature-selection methods. Only one branch path is actually executed.
  • Copying: one node leads into several branches that are all executed, but only one branch runs at a time; parallel execution is not yet supported.
  • Stacking: individual graphs are stacked on top of each other, the output of one graph serving as the input of the next.
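As a minimal sketch of the copying pattern (hypothetical example, not from the original; it mirrors the po("copy", 3) usage that appears later in this article), one input task can be duplicated so that two preprocessing branches both execute:

```r
library(mlr3pipelines)

# Copying: the input is duplicated and BOTH branches run
# (contrast with branching, where only one path executes).
copied <- po("copy", 2) %>>%
  gunion(list(po("scale"), po("pca")))
copied$plot(html = FALSE)
```

Unlike branching, no unbranch step is needed; copied outputs are typically merged again later, e.g. with po("featureunion").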

branching & copying

Branching is implemented with PipeOpBranch and PipeOpUnbranch. The concept is illustrated in the figure below:
[Figure 3]

The following example demonstrates branching; every branch that is opened must later be closed with unbranch:

  graph <- po("branch", c("nop", "pca", "scale")) %>>% # open the branches
    gunion(list(
      po("nop", id = "null1"), # branch 1, given the id "null1"
      po("pca"),               # branch 2
      po("scale")              # branch 3
    )) %>>%
    po("unbranch", c("nop", "pca", "scale")) # close the branches
  graph$plot(html = FALSE)

[Figure 4]

bagging

Bagging is a type of ensemble learning; the concept is not covered here, but it is illustrated in the figure below for interested readers:
[Figure 5]

The basic usage is demonstrated below.

  single_pred <- po("subsample", frac = 0.7) %>>%
    po("learner", lrn("classif.rpart")) # build one base model
  pred_set <- ppl("greplicate", single_pred, 10L) # replicate it 10 times
  bagging <- pred_set %>>%
    po("classifavg", innum = 10)
  bagging$plot(html = FALSE)

[Figure 6]

Convert the object above into a GraphLearner, and it can be trained and used for prediction:

  task <- tsk("iris")
  split <- partition(task, ratio = 0.7, stratify = TRUE)
  baglrn <- as_learner(bagging)
  baglrn$train(task, row_ids = split$train)
  baglrn$predict(task, row_ids = split$test)
  ## <PredictionClassif> for 45 observations:
  ## row_ids truth response prob.setosa prob.versicolor prob.virginica
  ## 4 setosa setosa 1 0 0
  ## 6 setosa setosa 1 0 0
  ## 8 setosa setosa 1 0 0
  ## ---
  ## 141 virginica virginica 0 0 1
  ## 147 virginica virginica 0 0 1
  ## 150 virginica virginica 0 0 1

stacking

Stacking is another way to improve model performance; the idea is shown in the figure below:
[Figure 7]

Here, to guard against overfitting, PipeOpLearnerCV is used to generate out-of-fold predictions; it automatically performs the required nested resampling internally.

First create the level-0 learner, clone it, and give it a new id:

  lrn <- lrn("classif.rpart")
  lrn_0 <- po("learner_cv", lrn$clone())
  lrn_0$id <- "rpart_cv"

Then use gunion together with PipeOpNOP to pass the untouched task to the next level, so that the decision-tree predictions and the unprocessed task arrive at the next level together.

  level_0 <- gunion(list(lrn_0, po("nop")))

Combine the two outputs passed down from above:

  combined <- level_0 %>>% po("featureunion", 2)
  stack <- combined %>>% po("learner", lrn$clone())
  stack$plot(html = FALSE)

[Figure 8]

Then train and predict:

  stacklrn <- as_learner(stack)
  stacklrn$train(task, split$train)
  stacklrn$predict(task, split$test)
  ## <PredictionClassif> for 45 observations:
  ## row_ids truth response
  ## 4 setosa setosa
  ## 6 setosa setosa
  ## 8 setosa setosa
  ## ---
  ## 141 virginica virginica
  ## 147 virginica virginica
  ## 150 virginica virginica

A much more complex example

This example combines several different preprocessing steps with several different learners.

  library("magrittr")
  library("mlr3learners")
  rprt = lrn("classif.rpart", predict_type = "prob")
  glmn = lrn("classif.glmnet", predict_type = "prob")
  # create the learners
  lrn_0 = po("learner_cv", rprt, id = "rpart_cv_1")
  lrn_0$param_set$values$maxdepth = 5L
  lrn_1 = po("pca", id = "pca1") %>>% po("learner_cv", rprt, id = "rpart_cv_2")
  lrn_1$param_set$values$rpart_cv_2.maxdepth = 1L
  lrn_2 = po("pca", id = "pca2") %>>% po("learner_cv", glmn)
  # level 0
  level_0 = gunion(list(lrn_0, lrn_1, lrn_2, po("nop", id = "NOP1")))
  # level 1
  level_1 = level_0 %>>%
    po("featureunion", 4) %>>%
    po("copy", 3) %>>%
    gunion(list(
      po("learner_cv", rprt, id = "rpart_cv_l1"),
      po("learner_cv", glmn, id = "glmnt_cv_l1"),
      po("nop", id = "NOP_l1")
    ))
  # level 2
  level_2 = level_1 %>>%
    po("featureunion", 3, id = "u2") %>>%
    po("learner", rprt, id = "rpart_l2")
  level_2$plot(html = FALSE)

[Figure 9]

Now it can be trained and used for prediction:

  task = tsk("iris")
  lrn = as_learner(level_2)
  lrn$
    train(task, split$train)$
    predict(task, split$test)$
    score()
  ## classif.ce
  ## 0.08888889

Some special preprocessing steps

These are actually some very commonly used steps…

Missing-value imputation: PipeOpImpute

Missing values are extremely common; mlr3pipelines can impute both numeric and factor columns.

  pom <- po("missind")
  pon <- po("imputehist",        # impute numerics by sampling from a histogram
    id = "impute_num",           # give it an id
    affect_columns = is.numeric  # which columns to process
  )
  pof = po("imputeoor", id = "imputer_fct", affect_columns = is.factor) # handle factors
  imputer = pom %>>% pon %>>% pof

Attach a learner:

  polrn <- po("learner", lrn("classif.rpart"))
  lrn <- as_learner(imputer %>>% polrn)

Creating new variables: PipeOpMutate

  pom <- po("mutate",
    mutation = list(
      Sepal.Sum = ~ Sepal.Length + Sepal.Width,
      Petal.Sum = ~ Petal.Length + Petal.Width,
      Sepal.Petal.Ratio = ~ (Sepal.Length / Petal.Length)
    )
  )

Training on subsets: PipeOpChunk

When a dataset is too large, splitting it into smaller chunks and training on each chunk separately is a useful strategy.

  chks = po("chunk", 4)
  lrns = ppl("greplicate", po("learner", lrn("classif.rpart")), 4)
  mjv = po("classifavg", 4)
  pipeline = chks %>>% lrns %>>% mjv
  pipeline$plot(html = FALSE)

[Figure 10]

  task = tsk("iris")
  train.idx = sample(seq_len(task$nrow), 120)
  test.idx = setdiff(seq_len(task$nrow), train.idx)
  pipelrn = as_learner(pipeline)
  pipelrn$train(task, train.idx)$
    predict(task, train.idx)$
    score()
  ## classif.ce
  ## 0.3333333

Feature selection: PipeOpFilter and PipeOpSelect

A PipeOpFilter object embeds the feature-selection (filter) methods from mlr3filters into mlr3pipelines.

  po("filter", mlr3filters::flt("information_gain"))
  ## PipeOp: <information_gain> (not trained)
  ## values: <list()>
  ## Input channels <name [train type, predict type]>:
  ## input [Task,Task]
  ## Output channels <name [train type, predict type]>:
  ## output [Task,Task]

Use filter.nfeat, filter.frac, or filter.cutoff to decide which features are kept.
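As a small sketch of these parameters (a hypothetical example, not from the original; assuming current mlr3/mlr3pipelines/mlr3filters versions), filter.nfeat keeps a fixed number of features, while filter.frac keeps a fraction and filter.cutoff keeps features whose score exceeds a threshold:

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3filters)

# Keep the 2 features with the highest variance score;
# filter.frac and filter.cutoff work analogously.
po_flt <- po("filter", flt("variance"), filter.nfeat = 2)

# A PipeOp's $train() takes and returns a list of inputs/outputs.
filtered <- po_flt$train(list(tsk("iris")))[[1]]
filtered$feature_names # the two retained features
```

Exactly one of these three parameters should be set on a given PipeOpFilter.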
