For more R tips, follow the WeChat official account: 医学和生信笔记 (Medical and Bioinformatics Notes).

The account mainly shares: 1. medical tips and proctology notes; 2. data analysis, visualization, and machine learning with R and Python; 3. bioinformatics learning materials and personal study notes.

We will predict house prices in the King County area, learning to use mlr3 and its ecosystem for data preprocessing, modeling, resampling, hyperparameter tuning, and more.

Load the data and R packages

```r
library(mlr3verse)
set.seed(123)  # set a seed for reproducibility
lgr::get_logger("mlr3")$set_threshold("warn")   # reduce console logging
lgr::get_logger("bbotk")$set_threshold("warn")
data("kc_housing", package = "mlr3data")  # load the data
```

Data exploration

```r
str(kc_housing)
## 'data.frame': 21613 obs. of  20 variables:
##  $ date         : POSIXct, format: "2014-10-13" "2014-12-09" ...
##  $ price        : num  221900 538000 180000 604000 510000 ...
##  $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
##  $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
##  $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ view         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
##  $ sqft_basement: int  NA 400 NA 910 NA 1530 NA NA 730 NA ...
##  $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
##  $ yr_renovated : int  NA 1991 NA NA NA NA NA NA NA NA ...
##  $ zipcode      : int  98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
##  $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
##  $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
##  - attr(*, "index")= int(0)
```
```r
dim(kc_housing)  # 21613 rows, 20 columns
## [1] 21613    20
```
```r
summary(kc_housing)
##       date                         price            bedrooms
##  Min.   :2014-05-02 00:00:00   Min.   :  75000   Min.   : 0.000
##  1st Qu.:2014-07-22 00:00:00   1st Qu.: 321950   1st Qu.: 3.000
##  Median :2014-10-16 00:00:00   Median : 450000   Median : 3.000
##  Mean   :2014-10-29 03:58:09   Mean   : 540088   Mean   : 3.371
##  3rd Qu.:2015-02-17 00:00:00   3rd Qu.: 645000   3rd Qu.: 4.000
##  Max.   :2015-05-27 00:00:00   Max.   :7700000   Max.   :33.000
##
##    bathrooms      sqft_living       sqft_lot           floors
##  Min.   :0.000   Min.   :  290   Min.   :    520   Min.   :1.000
##  1st Qu.:1.750   1st Qu.: 1427   1st Qu.:   5040   1st Qu.:1.000
##  Median :2.250   Median : 1910   Median :   7618   Median :1.500
##  Mean   :2.115   Mean   : 2080   Mean   :  15107   Mean   :1.494
##  3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10688   3rd Qu.:2.000
##  Max.   :8.000   Max.   :13540   Max.   :1651359   Max.   :3.500
##
##  waterfront          view          condition         grade
##  Mode :logical   Min.   :0.0000   Min.   :1.000   Min.   : 1.000
##  FALSE:21450     1st Qu.:0.0000   1st Qu.:3.000   1st Qu.: 7.000
##  TRUE :163       Median :0.0000   Median :3.000   Median : 7.000
##                  Mean   :0.2343   Mean   :3.409   Mean   : 7.657
##                  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.: 8.000
##                  Max.   :4.0000   Max.   :5.000   Max.   :13.000
##
##    sqft_above   sqft_basement       yr_built     yr_renovated     zipcode
##  Min.   : 290   Min.   :  10.0   Min.   :1900   Min.   :1934   Min.   :98001
##  1st Qu.:1190   1st Qu.: 450.0   1st Qu.:1951   1st Qu.:1987   1st Qu.:98033
##  Median :1560   Median : 700.0   Median :1975   Median :2000   Median :98065
##  Mean   :1788   Mean   : 742.4   Mean   :1971   Mean   :1996   Mean   :98078
##  3rd Qu.:2210   3rd Qu.: 980.0   3rd Qu.:1997   3rd Qu.:2007   3rd Qu.:98118
##  Max.   :9410   Max.   :4820.0   Max.   :2015   Max.   :2015   Max.   :98199
##                 NA's   :13126                   NA's   :20699
##
##       lat             long        sqft_living15    sqft_lot15
##  Min.   :47.16   Min.   :-122.5   Min.   : 399   Min.   :   651
##  1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490   1st Qu.:  5100
##  Median :47.57   Median :-122.2   Median :1840   Median :  7620
##  Mean   :47.56   Mean   :-122.2   Mean   :1987   Mean   : 12768
##  3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360   3rd Qu.: 10083
##  Max.   :47.78   Max.   :-121.3   Max.   :6210   Max.   :871200
```

Data preprocessing

price is the outcome variable (target); the remaining columns are predictors (features).

First, convert the date column from a date to a number: days elapsed since the earliest date in the data.

Convert the zip code to a factor.

Add a new column renovated recording whether the house has been renovated.

Add a new column has_basement recording whether the house has a basement.

Rescale the house price from dollars to units of $1,000.

Remove rows with missing values.

```r
library(anytime)

dates <- anytime(kc_housing$date)
kc_housing$date <- as.numeric(difftime(dates, min(dates), units = "days"))
kc_housing$renovated <- as.numeric(!is.na(kc_housing$yr_renovated))
kc_housing$has_basement <- as.numeric(!is.na(kc_housing$sqft_basement))
kc_housing$yr_renovated <- NULL
kc_housing$sqft_basement <- NULL
kc_housing$price <- kc_housing$price / 1000
```
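The NA-as-information trick above deserves a closer look: a missing yr_renovated means the house was never renovated, so the missingness itself becomes a feature before the raw column is dropped. A minimal self-contained sketch of the same pattern, using a toy data frame with made-up values (not the real dataset):

```r
# Toy data frame mimicking two columns of kc_housing (made-up values)
df <- data.frame(
  price = c(221900, 538000, 180000),
  yr_renovated = c(NA, 1991, NA)
)
# Encode "was it renovated?" from the missingness, then drop the raw column
df$renovated <- as.numeric(!is.na(df$yr_renovated))
df$yr_renovated <- NULL
# Rescale price to units of $1,000, as in the preprocessing above
df$price <- df$price / 1000
df
##   price renovated
## 1 221.9         0
## 2 538.0         1
## 3 180.0         0
```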

A quick look at the price distribution:

```r
library(ggplot2)
ggplot(kc_housing, aes(x = price)) + geom_density() + theme_minimal()
```

(Figure 2: density plot of house prices)

Create the task

```r
task <- as_task_regr(kc_housing, target = "price")
task
## <TaskRegr:kc_housing> (21613 x 20)
## * Target: price
## * Properties: -
## * Features (19):
##   - int (11): bedrooms, condition, grade, sqft_above, sqft_living,
##     sqft_living15, sqft_lot, sqft_lot15, view, yr_built, zipcode
##   - dbl (7): bathrooms, date, floors, has_basement, lat, long,
##     renovated
##   - lgl (1): waterfront

autoplot(task) + facet_wrap(~ condition)
```

(Figure 3: price distribution faceted by condition)

```r
# relationships between pairs of variables
autoplot(task$clone()$select(task$feature_names[c(3, 17)]), type = "pairs")
```

(Figure 4: pairs plot of the selected features)

Split the data and start modeling

```r
split <- partition(task, ratio = 0.7)
train_idx <- split$train
test_idx <- split$test
task_train <- task$clone()$filter(train_idx)
task_test <- task$clone()$filter(test_idx)
```
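Conceptually, partition(task, ratio = 0.7) amounts to a random 70/30 split of the row ids. A rough base-R sketch of the idea (not mlr3's actual implementation, which also handles stratification):

```r
# Randomly assign 70% of row ids to training, the rest to testing
n <- 21613  # nrow(kc_housing)
set.seed(123)
train_ids <- sample(n, size = floor(0.7 * n))
test_ids <- setdiff(seq_len(n), train_ids)
length(train_ids)  # 15129
length(test_ids)   # 6484
```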

Decision tree

```r
# leave out the zipcode column for now
task_nozip <- task_train$clone()$select(setdiff(task$feature_names, "zipcode"))
# fit the model
lrn <- lrn("regr.rpart")
lrn$train(task_nozip, row_ids = train_idx)
# visualize the decision tree
library(rpart.plot)
rpart.plot(lrn$model)
```

(Figure 5: the fitted decision tree)

The decision tree splits on grade, sqft_living, lat, and so on. Next, draw a map to see how longitude and latitude relate to price.

```r
library(ggmap)

qmplot(long, lat, maptype = "watercolor", color = log(price),
       data = kc_housing[train_idx[1:3000], ]) +
  scale_colour_viridis_c()
```

(Figure 6: map of log(price) by location)

Clearly, houses near the water are more expensive! Latitude and longitude also have some effect on price.

Next, look at how price varies across zip-code areas.

```r
qmplot(long, lat, maptype = "watercolor", color = zipcode,
       data = kc_housing[train_idx[1:3000], ]) +
  guides(color = "none")
```

(Figure 7: map colored by zipcode)

It looks like the zip-code area does affect price.

Now refit with zipcode included, using 3-fold cross-validation for a more stable performance estimate:

```r
lrn_rpart <- lrn("regr.rpart")
cv3 <- rsmp("cv", folds = 3)
res <- resample(task_train, lrn_rpart, cv3, store_models = TRUE)
res$aggregate(msr("regr.rmse"))
## regr.rmse
##  221.0799
```
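As a reminder of what the 221.08 above means: prices were rescaled to units of $1,000, so the cross-validated RMSE corresponds to a typical error of roughly $221,000. The msr("regr.rmse") measure reduces to the usual root-mean-square-error formula; a base-R sketch of that computation (not mlr3 internals), on toy values:

```r
# Root mean squared error: sqrt of the mean squared residual
rmse <- function(truth, response) sqrt(mean((truth - response)^2))

# toy predictions in the same $1,000 units as price
rmse(truth = c(450, 600, 320), response = c(430, 650, 300))  # 33.166...
```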

xgboost

```r
lrn_xgboost <- lrn("regr.xgboost")
lrn_xgboost$param_set  # view the tunable hyperparameters
## <ParamSet>
##     id class lower upper nlevels default
##  1: alpha ParamDbl 0 Inf Inf 0
##  2: approxcontrib ParamLgl NA NA 2 FALSE
##  3: base_score ParamDbl -Inf Inf Inf 0.5
##  4: booster ParamFct NA NA 3 gbtree
##  5: callbacks ParamUty NA NA Inf <list[0]>
##  6: colsample_bylevel ParamDbl 0 1 Inf 1
##  7: colsample_bynode ParamDbl 0 1 Inf 1
##  8: colsample_bytree ParamDbl 0 1 Inf 1
##  9: disable_default_eval_metric ParamLgl NA NA 2 FALSE
## 10: early_stopping_rounds ParamInt 1 Inf Inf
## 11: eta ParamDbl 0 1 Inf 0.3
## 12: eval_metric ParamUty NA NA Inf rmse
## 13: feature_selector ParamFct NA NA 5 cyclic
## 14: feval ParamUty NA NA Inf
## 15: gamma ParamDbl 0 Inf Inf 0
## 16: grow_policy ParamFct NA NA 2 depthwise
## 17: interaction_constraints ParamUty NA NA Inf <NoDefault[3]>
## 18: iterationrange ParamUty NA NA Inf <NoDefault[3]>
## 19: lambda ParamDbl 0 Inf Inf 1
## 20: lambda_bias ParamDbl 0 Inf Inf 0
## 21: max_bin ParamInt 2 Inf Inf 256
## 22: max_delta_step ParamDbl 0 Inf Inf 0
## 23: max_depth ParamInt 0 Inf Inf 6
## 24: max_leaves ParamInt 0 Inf Inf 0
## 25: maximize ParamLgl NA NA 2
## 26: min_child_weight ParamDbl 0 Inf Inf 1
## 27: missing ParamDbl -Inf Inf Inf NA
## 28: monotone_constraints ParamUty NA NA Inf 0
## 29: normalize_type ParamFct NA NA 2 tree
## 30: nrounds ParamInt 1 Inf Inf <NoDefault[3]>
## 31: nthread ParamInt 1 Inf Inf 1
## 32: ntreelimit ParamInt 1 Inf Inf
## 33: num_parallel_tree ParamInt 1 Inf Inf 1
## 34: objective ParamUty NA NA Inf reg:squarederror
## 35: one_drop ParamLgl NA NA 2 FALSE
## 36: outputmargin ParamLgl NA NA 2 FALSE
## 37: predcontrib ParamLgl NA NA 2 FALSE
## 38: predictor ParamFct NA NA 2 cpu_predictor
## 39: predinteraction ParamLgl NA NA 2 FALSE
## 40: predleaf ParamLgl NA NA 2 FALSE
## 41: print_every_n ParamInt 1 Inf Inf 1
## 42: process_type ParamFct NA NA 2 default
## 43: rate_drop ParamDbl 0 1 Inf 0
## 44: refresh_leaf ParamLgl NA NA 2 TRUE
## 45: reshape ParamLgl NA NA 2 FALSE
## 46: sample_type ParamFct NA NA 2 uniform
## 47: sampling_method ParamFct NA NA 2 uniform
## 48: save_name ParamUty NA NA Inf
## 49: save_period ParamInt 0 Inf Inf
## 50: scale_pos_weight ParamDbl -Inf Inf Inf 1
## 51: seed_per_iteration ParamLgl NA NA 2 FALSE
## 52: single_precision_histogram ParamLgl NA NA 2 FALSE
## 53: sketch_eps ParamDbl 0 1 Inf 0.03
## 54: skip_drop ParamDbl 0 1 Inf 0
## 55: strict_shape ParamLgl NA NA 2 FALSE
## 56: subsample ParamDbl 0 1 Inf 1
## 57: top_k ParamInt 0 Inf Inf 0
## 58: training ParamLgl NA NA 2 FALSE
## 59: tree_method ParamFct NA NA 5 auto
## 60: tweedie_variance_power ParamDbl 1 2 Inf 1.5
## 61: updater ParamUty NA NA Inf <NoDefault[3]>
## 62: verbose ParamInt 0 2 3 1
## 63: watchlist ParamUty NA NA Inf
## 64: xgb_model ParamUty NA NA Inf
##     id class lower upper nlevels default
##     parents value
##  1:
##  2:
##  3:
##  4:
##  5:
##  6:
##  7:
##  8:
##  9:
## 10:
## 11:
## 12:
## 13: booster
## 14:
## 15:
## 16: tree_method
## 17:
## 18:
## 19:
## 20: booster
## 21: tree_method
## 22:
## 23:
## 24: grow_policy
## 25:
## 26:
## 27:
## 28:
## 29: booster
## 30:             1
## 31:             1
## 32:
## 33:
## 34:
## 35: booster
## 36:
## 37:
## 38:
## 39:
## 40:
## 41: verbose
## 42:
## 43: booster
## 44:
## 45:
## 46: booster
## 47: booster
## 48:
## 49:
## 50:
## 51:
## 52: tree_method
## 53: tree_method
## 54: booster
## 55:
## 56:
## 57: booster,feature_selector
## 58:
## 59: booster
## 60: objective
## 61:
## 62:             0
## 63:
## 64:
##     parents value
```
```r
search_space <- ps(
  eta = p_dbl(lower = 0.2, upper = 0.4),
  min_child_weight = p_dbl(lower = 1, upper = 20),
  subsample = p_dbl(lower = 0.7, upper = 0.8),
  colsample_bytree = p_dbl(lower = 0.9, upper = 1),
  colsample_bylevel = p_dbl(lower = 0.5, upper = 0.7),
  nrounds = p_int(lower = 1L, upper = 25L)
)
at <- auto_tuner(
  method = "random_search",
  learner = lrn_xgboost,
  resampling = rsmp("holdout"),
  measure = msr("regr.rmse"),
  search_space = search_space,
  term_evals = 10,
  batch_size = 8
)
res <- resample(task_nozip, at, cv3, store_models = TRUE)
res$aggregate()
## regr.mse
## 19122.95
```

xgboost performs much better than the decision tree!
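One caveat when comparing the two numbers directly: the decision tree was scored with regr.rmse (221.08), while res$aggregate() above used the default regr.mse (19122.95), so the two are on different scales. Putting both on the RMSE scale makes the gap concrete:

```r
# Convert the xgboost MSE back to RMSE so the units match the tree's score
tree_rmse <- 221.0799
xgb_rmse <- sqrt(19122.95)
xgb_rmse  # about 138.3, i.e. a typical error of roughly $138,000
```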
