离散型的二分类变量(预测值只有两种,如0或1)在没有做转化是无法进行回归分析(一般回归分析适用于连续型变量),我们可以借助广义线性回归模型logistic回归对二分类变量进行转化(对预测值使用了逻辑函数,也即是预测值的范围是0到1),最终实现回归分析。更多知识分享请到 https://zouhua.top/

logistic regression基础

logistic 函数:
数据分析:二分类变量的logistic regression计算及ROC可视化 - 图1

logistic方程和logistic曲线:逻辑曲线在z=0时,十分敏感,在z>>0或z<<0处,都不敏感,将预测值限定为(0,1)
数据分析:二分类变量的logistic regression计算及ROC可视化 - 图2

加载数据

  1. # install.packages("ISLR") # 使用该包的数据
  2. data <- ISLR::Default
  3. head(data)
  1. default student balance income
  2. 1 No No 729.5265 44361.625
  3. 2 No Yes 817.1804 12106.135
  4. 3 No No 1073.5492 31767.139
  5. 4 No No 529.2506 35704.494
  6. 5 No No 785.6559 38463.496
  7. 6 No Yes 919.5885 7491.559

训练模型

  • 构建训练集和测试集
  • 训练模型
  1. set.seed(123)
  2. sample <- sample(c(TRUE, FALSE), nrow(data), replace=TRUE, prob=c(0.7,0.3))
  3. train <- data[sample, ]
  4. test <- data[-sample, ]
  5. fit <- glm(default ~ student+balance+income, family="binomial", data=train)
  6. summary(fit)
  1. Call:
  2. glm(formula = default ~ student + balance + income, family = "binomial",
  3. data = train)
  4. Deviance Residuals:
  5. Min 1Q Median 3Q Max
  6. -2.5586 -0.1353 -0.0519 -0.0177 3.7973
  7. Coefficients:
  8. Estimate Std. Error z value Pr(>|z|)
  9. (Intercept) -1.148e+01 6.234e-01 -18.412 <2e-16 ***
  10. studentYes -4.933e-01 2.857e-01 -1.726 0.0843 .
  11. balance 5.988e-03 2.938e-04 20.384 <2e-16 ***
  12. income 7.857e-06 9.965e-06 0.788 0.4304
  13. ---
  14. Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 1
  15. (Dispersion parameter for binomial family taken to be 1)
  16. Null deviance: 2021.1 on 6963 degrees of freedom
  17. Residual deviance: 1065.4 on 6960 degrees of freedom
  18. AIC: 1073.4
  19. Number of Fisher Scoring iterations: 8

ROC曲线

  • 获取预测值: type的值可以是“link”, “response”, “terms” (需要了解下有何区别)
    • link:
    • response:因变量分类变量预测为”Yes|No”的概率值
    • terms:自变量预测的概率值(待定)
  1. predicted <- predict(fit, test, type="response")
  2. head(predicted)
  1. 4 6 7 15 17 18
  2. 3.259679e-04 1.648925e-03 1.762607e-03 9.694038e-03 1.536866e-05 1.709650e-04

Notes:每个样本对应的预测值是数字而不是我以为的 default的 ”Yes|No“
Notes:测试样本的default预测为”Yes|No“的概率值(待定)

  • 获取ROC曲线:通过pROC的roc和auc函数分别获取roc对象和auc值
  1. library(ggplot2)
  2. library(pROC)
  3. rocobj <- roc(test$default, predicted)
  4. auc <- round(auc(test$default, predicted),4)
  • 可视化结果
  • legacy.axes 控制specificity(x 轴)是否升序排列
  • geom_abline 添加对角线
  • annotate 添加AUC结果
  • coord_cartesian 设置坐标轴范围
  • family=”serif” 设置新罗马字体
  1. ggroc(rocobj, color = "red", linetype = 1, size = 1, alpha = 1, legacy.axes = T)+
  2. geom_abline(intercept = 0, slope = 1, color="grey", size = 1, linetype=1)+
  3. labs(x = "False Positive Rate (1 - Specificity)",
  4. y = "True Positive Rate (Sensivity or Recall)")+
  5. annotate("text",x = .75, y = .25,label=paste("AUC =", auc),
  6. size = 5, family="serif")+
  7. coord_cartesian(xlim = c(0, 1), ylim = c(0, 1))+
  8. theme_bw()+
  9. theme(panel.background = element_rect(fill = 'transparent'),
  10. axis.ticks.length = unit(0.4, "lines"),
  11. axis.ticks = element_line(color='black'),
  12. axis.line = element_line(size=.5, colour = "black"),
  13. axis.title = element_text(colour='black', size=12,face = "bold"),
  14. axis.text = element_text(colour='black',size=10,face = "bold"),
  15. text = element_text(size=8, color="black", family="serif"))

数据分析:二分类变量的logistic regression计算及ROC可视化 - 图3

R Information

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ISLR_1.2            forcats_0.5.0       stringr_1.4.0       purrr_0.3.4         readr_1.4.0         tidyr_1.1.2        
 [7] tidyverse_1.3.0     xgboost_1.3.1.1     mlbench_2.1-1       survminer_0.4.8     ggpubr_0.4.0        survcomp_1.40.0    
[13] prodlim_2019.11.13  survival_3.2-7      caretEnsemble_2.0.1 pROC_1.16.2         caret_6.0-86        ggplot2_3.3.3      
[19] lattice_0.20-41     data.table_1.13.6   tibble_3.0.4        dplyr_1.0.2        

loaded via a namespace (and not attached):
  [1] readxl_1.3.1         backports_1.2.0      plyr_1.8.6           splines_4.0.3        digest_0.6.27       
  [6] SuppDists_1.1-9.5    foreach_1.5.1        htmltools_0.5.0      fansi_0.4.1          magrittr_1.5        
 [11] openxlsx_4.2.3       recipes_0.1.15       modelr_0.1.8         gower_0.2.2          colorspace_2.0-0    
 [16] rvest_0.3.6          haven_2.3.1          xfun_0.19            crayon_1.3.4         jsonlite_1.7.1      
 [21] libcoin_1.0-7        zoo_1.8-8            iterators_1.0.13     glue_1.4.2           gtable_0.3.0        
 [26] ipred_0.9-9          questionr_0.7.3      car_3.0-10           kernlab_0.9-29       abind_1.4-5         
 [31] scales_1.1.1         mvtnorm_1.1-1        DBI_1.1.0            rstatix_0.6.0        miniUI_0.1.1.1      
 [36] Rcpp_1.0.5           xtable_1.8-4         Cubist_0.2.3         foreign_0.8-80       km.ci_0.5-2         
 [41] Formula_1.2-4        stats4_4.0.3         lava_1.6.8.1         httr_1.4.2           ellipsis_0.3.1      
 [46] pkgconfig_2.0.3      farver_2.0.3         nnet_7.3-14          dbplyr_2.0.0         utf8_1.1.4          
 [51] tidyselect_1.1.0     labeling_0.4.2       rlang_0.4.8          reshape2_1.4.4       later_1.1.0.1       
 [56] munsell_0.5.0        cellranger_1.1.0     tools_4.0.3          cli_2.1.0            generics_0.1.0      
 [61] broom_0.7.3          evaluate_0.14        fastmap_1.0.1        yaml_2.2.1           bootstrap_2019.6    
 [66] ModelMetrics_1.2.2.2 knitr_1.30           fs_1.5.0             zip_2.1.1            survMisc_0.5.5      
 [71] caTools_1.18.0       randomForest_4.6-14  pbapply_1.4-3        nlme_3.1-150         mime_0.9            
 [76] xml2_1.3.2           compiler_4.0.3       rstudioapi_0.12      curl_4.3             e1071_1.7-4         
 [81] ggsignif_0.6.0       reprex_0.3.0         klaR_0.6-15          stringi_1.5.3        highr_0.8           
 [86] Matrix_1.2-18        gbm_2.1.8            ggsci_2.9            survivalROC_1.0.3    KMsurv_0.1-5        
 [91] vctrs_0.3.4          pillar_1.4.6         lifecycle_0.2.0      combinat_0.0-8       cowplot_1.1.1       
 [96] bitops_1.0-6         httpuv_1.5.4         R6_2.5.0             promises_1.1.1       KernSmooth_2.23-18  
[101] gridExtra_2.3        C50_0.1.3.1          rio_0.5.16           codetools_0.2-18     MASS_7.3-53         
[106] assertthat_0.2.1     withr_2.3.0          parallel_4.0.3       hms_0.5.3            grid_4.0.3          
[111] rpart_4.1-15         labelled_2.7.0       timeDate_3043.102    class_7.3-17         rmarkdown_2.5       
[116] inum_1.0-1           carData_3.0-4        partykit_1.2-11      shiny_1.5.0          lubridate_1.7.9     
[121] rmeta_3.0

参考

  1. logistic回归介绍
  2. logistic回归模型基础
  3. How to Plot a ROC curve using ggplot2

参考文章如引起任何侵权问题,可以与我联系,谢谢。