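For more R content, follow the WeChat official account 医学和生信笔记.

The 医学和生信笔记 account mainly shares: 1. general medical and proctology notes; 2. data analysis, visualization, and machine learning with R and Python; 3. bioinformatics learning materials and the author's own study notes.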

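I do not work with model interpretation very often, so this post is a brief walkthrough.

In principle, any general-purpose model interpretation framework can be used with mlr3: you only need to extract the trained model from the Learner object.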
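A minimal sketch of pulling the fitted model out of a Learner, assuming mlr3 and rpart are installed. The built-in iris task and the decision tree learner are illustrative choices made here to keep dependencies small; they are not taken from the original post.

```r
library(mlr3)

# Assumption: built-in iris task and an rpart tree, chosen only for illustration
task <- tsk("iris")
learner <- lrn("classif.rpart", predict_type = "prob")
learner$train(task)

# The underlying fitted object is stored in the $model field
fit <- learner$model
class(fit)

# The extracted object works with its native predict method
probs <- predict(fit, newdata = iris[1:3, ], type = "prob")
probs
```

Once extracted, `fit` is an ordinary rpart object, so any tool that understands rpart models should be able to work with it.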

The two most popular frameworks at present are:

  • iml
  • DALEX

IML

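There is a dedicated book on model interpretation that accompanies the iml package: IML Book.
Only a brief introduction is given here.

Penguin task

The penguins data set contains 8 variables and 344 penguins (344 rows).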

    data("penguins", package = "palmerpenguins")
    str(penguins)

    ## tibble [344 x 8] (S3: tbl_df/tbl/data.frame)
    ## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
    ## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
    ## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
    ## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
    ## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
    ## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
    ## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
    ## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Create the task:

    library(iml)
    library(mlr3)
    library(mlr3learners)
    set.seed(1)
    # Drop rows with missing values (344 -> 333 rows)
    penguins <- na.omit(penguins)
    task_peng <- as_task_classif(penguins, target = "species")

Choose a learner, train it, and inspect the fitted model:

    learner <- lrn("classif.ranger", predict_type = "prob")
    learner$train(task_peng)
    learner$model

    ## Ranger result
    ##
    ## Call:
    ## ranger::ranger(dependent.variable.name = task$target_names, data = task$data(), probability = self$predict_type == "prob", case.weights = task$weights$weight, num.threads = 1L)
    ##
    ## Type: Probability estimation
    ## Number of trees: 500
    ## Sample size: 333
    ## Number of independent variables: 7
    ## Mtry: 2
    ## Target node size: 10
    ## Variable importance mode: none
    ## Splitrule: gini
    ## OOB prediction error (Brier s.): 0.01790106

    # Feature columns only (everything except the target)
    x <- penguins[which(names(penguins) != "species")]
    # Wrap the trained learner so that iml can query its predictions
    model <- Predictor$new(learner, data = x, y = penguins$species)

FeatureEffects

    # Numeric features to plot
    num_features <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "year")
    effect <- FeatureEffects$new(model)
    plot(effect, features = num_features)

Machine learning in R with mlr3: model interpretation - Figure 1
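To give a feel for what a feature effect curve represents, here is a base R sketch of a partial dependence curve. Note the assumptions: iml's FeatureEffects uses accumulated local effects (ALE) by default, which is a different estimator, and the linear model on mtcars plus the helper `partial_dependence()` are stand-ins invented for this illustration.

```r
# Assumption: a linear model on mtcars stands in for the trained learner
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Hypothetical helper: fix one feature at each grid value for all rows,
# then average the model predictions over the data set
partial_dependence <- function(model, data, feature, grid_size = 10) {
  grid <- seq(min(data[[feature]]), max(data[[feature]]), length.out = grid_size)
  avg_pred <- vapply(grid, function(value) {
    modified <- data
    modified[[feature]] <- value
    mean(predict(model, newdata = modified))
  }, numeric(1))
  data.frame(value = grid, prediction = avg_pred)
}

pd_wt <- partial_dependence(fit, mtcars, "wt")
pd_wt
```

For a linear model the curve is a straight line whose slope equals the fitted coefficient, which makes the sketch easy to check by hand. A random forest would typically give a non-linear curve.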

Shapley

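    x <- penguins[which(names(penguins) != "species")]
    # Here the full data frame is passed and the target is named by column
    model <- Predictor$new(learner, data = penguins, y = "species")
    # Explain the prediction for the first penguin
    x.interest <- data.frame(penguins[1, ])
    shapley <- Shapley$new(model, x.interest = x.interest)
    plot(shapley)

Machine learning in R with mlr3: model interpretation - Figure 2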
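The idea behind Shapley values can be sketched in base R by enumerating every feature subset for a tiny model. This is an exact computation that is only feasible with a handful of features; iml's Shapley class estimates the values by sampling instead. The linear model on mtcars and the helper `value()` are assumptions made for this illustration.

```r
# Assumption: a small linear model stands in for the trained learner
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
features <- c("wt", "hp", "qsec")
background <- mtcars[, features]
x_interest <- background[1, , drop = FALSE]

# Hypothetical helper: mean prediction when the features in S are fixed
# to the values of the observation of interest
value <- function(S) {
  d <- background
  for (f in S) d[[f]] <- x_interest[[f]]
  mean(predict(fit, newdata = d))
}

p <- length(features)
phi <- setNames(numeric(p), features)
for (j in features) {
  others <- setdiff(features, j)
  # Enumerate all subsets of the remaining features via binary digits
  for (mask in 0:(2^length(others) - 1)) {
    S <- others[(mask %/% 2^(seq_along(others) - 1)) %% 2 == 1]
    w <- factorial(length(S)) * factorial(p - length(S) - 1) / factorial(p)
    phi[j] <- phi[j] + w * (value(c(S, j)) - value(S))
  }
}
phi
```

The contributions add up to the difference between the prediction for the chosen observation and the average prediction, which is the property that makes Shapley plots readable as a decomposition.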

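FeatureImp

    # Permutation feature importance with classification error ("ce") as the loss
    effect <- FeatureImp$new(model, loss = "ce")
    effect$plot(features = num_features)

Machine learning in R with mlr3: model interpretation - Figure 3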
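Permutation feature importance, the idea behind FeatureImp, can be sketched in a few lines of base R: shuffle one feature, measure how much the loss grows, and repeat. The sketch below is an illustration under stated assumptions, namely a linear regression on mtcars with mean squared error as the loss, rather than the random forest and "ce" loss used above.

```r
set.seed(1)

# Assumption: linear regression with MSE replaces the classifier and "ce" loss
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
features <- c("wt", "hp", "qsec")

mse <- function(data) mean((data$mpg - predict(fit, newdata = data))^2)
baseline <- mse(mtcars)

# Importance = average ratio of the loss after shuffling to the baseline loss
importance <- sapply(features, function(f) {
  ratios <- replicate(20, {
    shuffled <- mtcars
    shuffled[[f]] <- sample(shuffled[[f]])
    mse(shuffled) / baseline
  })
  mean(ratios)
})
sort(importance, decreasing = TRUE)
```

A ratio close to 1 suggests the model barely relies on the feature, while larger ratios indicate features whose shuffling hurts the predictions more.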

Independent test data

    # Hold out 20% of the rows as a test set
    split <- partition(task_peng, ratio = 0.8)
    train_set <- split$train
    test_set <- split$test
    learner$train(task_peng, row_ids = train_set)
    prediction <- learner$predict(task_peng, row_ids = test_set)

    # Training set
    model <- Predictor$new(learner, data = penguins[train_set, ], y = "species")
    effect <- FeatureImp$new(model, loss = "ce")
    plot_train <- plot(effect, features = num_features)
    # Test set
    model <- Predictor$new(learner, data = penguins[test_set, ], y = "species")
    effect <- FeatureImp$new(model, loss = "ce")
    plot_test <- plot(effect, features = num_features)
    # Show both plots side by side
    library("patchwork")
    plot_train + plot_test

Machine learning in R with mlr3: model interpretation - Figure 4

View FeatureEffects separately for the training and test sets

    # Training set
    model <- Predictor$new(learner, data = penguins[train_set, ], y = "species")
    effect <- FeatureEffects$new(model)
    plot(effect, features = num_features)

Machine learning in R with mlr3: model interpretation - Figure 5

    # Test set
    model <- Predictor$new(learner, data = penguins[test_set, ], y = "species")
    effect <- FeatureEffects$new(model)
    plot(effect, features = num_features)

Machine learning in R with mlr3: model interpretation - Figure 6

DALEX

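The methods implemented in this package are also covered in a book: Explanatory Model Analysis

The DALEX package lets you look inside predictive models and helps explore, explain, and visualize model behavior. The fifa20 data set is used for the demonstration.

What the package does can be understood from the figure below:
Machine learning in R with mlr3: model interpretation - Figure 7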

Load the data

    library(DALEX)

    ## Welcome to DALEX (version: 2.3.0).
    ## Find examples and detailed introduction at: http://ema.drwhy.ai/
    ##
    ## Attaching package: 'DALEX'
    ## The following object is masked from 'package:generics':
    ##
    ## explain
    ## The following object is masked from 'package:dplyr':
    ##
    ## explain

    data(fifa, package = "DALEX")
    fifa[1:2, c("value_eur", "age", "height_cm", "nationality", "attacking_crossing")]

    ## value_eur age height_cm nationality attacking_crossing
    ## L. Messi 95500000 32 170 Argentina 88
    ## Cristiano Ronaldo 58500000 34 187 Portugal 84

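The data set has 5000 players and 42 columns:

    dim(fifa)

    ## [1] 5000 42

A little preprocessing makes the results easier to interpret:

    # Drop four columns, then convert every remaining column to numeric
    fifa[, c("nationality", "overall", "potential", "wage_eur")] <- NULL
    for (i in seq_len(ncol(fifa))) fifa[, i] <- as.numeric(fifa[, i])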

Modeling

    library(mlr3)
    library(mlr3learners)
    # Regression task: predict a player's market value
    fifa_task <- as_task_regr(fifa, target = "value_eur")
    fifa_ranger <- lrn("regr.ranger", num.trees = 250)
    fifa_ranger$train(fifa_task)
    fifa_ranger

    ## <LearnerRegrRanger:regr.ranger>
    ## * Model: ranger
    ## * Parameters: num.threads=1, num.trees=250
    ## * Packages: mlr3, mlr3learners, ranger
    ## * Predict Type: response
    ## * Feature types: logical, integer, numeric, character, factor, ordered
    ## * Properties: hotstart_backward, importance, oob_error, weights

General DALEX workflow

    # Schematic only: replace each ... with real arguments
    model %>%
      explain_mlr3(data = ..., y = ..., label = ...) %>%
      model_parts() %>%
      plot()

    library("DALEX")
    library("DALEXtra")

    ## Anaconda not found on your computer. Conda related functionality such as create_env.R and condaenv and yml parameters from explain_scikitlearn will not be available

    # Wrap the trained mlr3 learner in a DALEX explainer
    ranger_exp <- explain_mlr3(fifa_ranger,
                               data = fifa,
                               y = fifa$value_eur,
                               label = "Ranger RF",
                               colorize = FALSE)

    ## Preparation of a new explainer is initiated
    ## -> model label : Ranger RF
    ## -> data : 5000 rows 38 cols
    ## -> target variable : 5000 values
    ## -> predict function : yhat.LearnerRegr will be used ( default )
    ## -> predicted values : No value for predict function target column. ( default )
    ## -> model_info : package mlr3 , ver. 0.13.2.9000 , task regression ( default )
    ## -> predicted values : numerical, min = 509536.7 , mean = 7472248 , max = 92074300
    ## -> residual function : difference between y and yhat ( default )
    ## -> residuals : numerical, min = -8364287 , mean = 1039.203 , max = 17510200
    ## A new explainer has been created!

Dataset-level exploration

    # Permutation-based variable importance
    fifa_vi <- model_parts(ranger_exp)
    head(fifa_vi)

    ## variable mean_dropout_loss label
    ## 1 _full_model_ 1339676 Ranger RF
    ## 2 value_eur 1339676 Ranger RF
    ## 3 weight_kg 1400918 Ranger RF
    ## 4 movement_balance 1402226 Ranger RF
    ## 5 goalkeeping_kicking 1405259 Ranger RF
    ## 6 height_cm 1409160 Ranger RF

    plot(fifa_vi, max_vars = 12, show_boxplots = FALSE)

Machine learning in R with mlr3: model interpretation - Figure 8

    selected_variables <- c("age", "movement_reactions",
                            "skill_ball_control", "skill_dribbling")
    # Aggregated (partial dependence) profiles for the selected variables
    fifa_pd <- model_profile(ranger_exp,
                             variables = selected_variables)$agr_profiles
    fifa_pd

    ## Top profiles :
    ## _vname_ _label_ _x_ _yhat_ _ids_
    ## 1 skill_ball_control Ranger RF 5 7535469 0
    ## 2 skill_dribbling Ranger RF 7 7911763 0
    ## 3 skill_dribbling Ranger RF 11 7904604 0
    ## 4 skill_dribbling Ranger RF 12 7903967 0
    ## 5 skill_dribbling Ranger RF 13 7902823 0
    ## 6 skill_dribbling Ranger RF 14 7901248 0

    library("ggplot2")
    plot(fifa_pd) +
      scale_y_continuous("Estimated value in Euro", labels = scales::dollar_format(suffix = "€", prefix = "")) +
      ggtitle("Partial Dependence profiles for selected variables")

Machine learning in R with mlr3: model interpretation - Figure 9

Instance-level exploration

    ronaldo <- fifa["Cristiano Ronaldo", ]
    # Break-down contributions for a single prediction
    ronaldo_bd_ranger <- predict_parts(ranger_exp,
                                       new_observation = ronaldo)
    head(ronaldo_bd_ranger)

    ## contribution
    ## Ranger RF: intercept 7472248
    ## Ranger RF: movement_reactions = 96 11845999
    ## Ranger RF: skill_ball_control = 92 7170577
    ## Ranger RF: mentality_positioning = 95 4565939
    ## Ranger RF: attacking_finishing = 94 4874197
    ## Ranger RF: attacking_short_passing = 83 4279799

    plot(ronaldo_bd_ranger)

Machine learning in R with mlr3: model interpretation - Figure 10
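The break-down output above starts from the average prediction (the intercept) and adds one contribution per variable. A rough base R sketch of that sequential idea follows. It rests on several assumptions: the variables are processed in a fixed order chosen here, whereas DALEX picks the order with its own heuristic, and a linear model on mtcars stands in for the random forest.

```r
# Assumption: a small linear model stands in for the trained learner
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
features <- c("wt", "hp", "qsec")  # assumption: fixed order, no ordering heuristic
background <- mtcars[, features]
x_interest <- background["Mazda RX4", , drop = FALSE]

current <- background
intercept <- mean(predict(fit, newdata = current))
contribution <- setNames(numeric(length(features)), features)
previous <- intercept
for (f in features) {
  # Fix this feature to the observation's value for every row
  current[[f]] <- x_interest[[f]]
  updated <- mean(predict(fit, newdata = current))
  contribution[f] <- updated - previous
  previous <- updated
}
c(intercept = intercept, contribution, prediction = previous)
```

After the last variable is fixed, every row equals the observation of interest, so the intercept plus all contributions reproduces the model's prediction for that observation.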

    # Shapley-value contributions for the same observation
    ronaldo_shap_ranger <- predict_parts(ranger_exp,
                                         new_observation = ronaldo,
                                         type = "shap")
    plot(ronaldo_shap_ranger) +
      scale_y_continuous("Estimated value in Euro", labels = scales::dollar_format(suffix = "€", prefix = ""))

Machine learning in R with mlr3: model interpretation - Figure 11

    selected_variables <- c("age", "movement_reactions",
                            "skill_ball_control", "skill_dribbling")
    # Ceteris paribus profiles: vary one variable, hold the others fixed
    ronaldo_cp_ranger <- predict_profile(ranger_exp, ronaldo, variables = selected_variables)
    plot(ronaldo_cp_ranger, variables = selected_variables) +
      scale_y_continuous("Estimated value of Cristiano Ronaldo", labels = scales::dollar_format(suffix = "€", prefix = ""))

Machine learning in R with mlr3: model interpretation - Figure 12
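A ceteris paribus profile like the one plotted by predict_profile can be sketched in base R: take one observation, vary a single variable over a grid, and keep everything else fixed. The helper `ceteris_paribus()`, the linear model, and the mtcars data are assumptions made for this illustration and are not part of DALEX.

```r
# Assumption: a small linear model stands in for the trained learner
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
x_interest <- mtcars["Mazda RX4", ]

# Hypothetical helper: replicate the observation once per grid value,
# overwrite one feature, and predict
ceteris_paribus <- function(model, observation, feature, grid) {
  profile <- observation[rep(1, length(grid)), , drop = FALSE]
  profile[[feature]] <- grid
  data.frame(value = grid, prediction = unname(predict(model, newdata = profile)))
}

hp_grid <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 5)
cp_hp <- ceteris_paribus(fit, x_interest, "hp", hp_grid)
cp_hp
```

Averaging such profiles over many observations gives a partial dependence curve, which is how the dataset-level and instance-level views relate to each other.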
