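A binary categorical response (one that takes only two values, such as 0 or 1) cannot be modeled directly with ordinary linear regression, which assumes a continuous response. Logistic regression, a generalized linear model, handles this case by passing the linear predictor through the logistic function, so that the predicted value is a probability between 0 and 1. That transformation is what makes regression analysis on a binary response possible.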

Logistic regression basics

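The logistic function:
Data analysis: logistic regression for a binary variable and ROC visualization - Figure 1

Data analysis: logistic regression for a binary variable and ROC visualization - Figure 2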
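For reference, the standard logistic (sigmoid) function and the corresponding log-odds form of a logistic regression model can be written as follows. The symbols z, p, x and β are notation introduced here for illustration and are an assumption about what the original figures showed.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
p = P(y = 1 \mid x) = \sigma(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k), \qquad
\log\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
```

The last identity is why the model is linear on the log-odds scale even though the predicted probability p stays between 0 and 1.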


    # install.packages("ISLR")  # provides the Default data set used here
    data <- ISLR::Default
    head(data)

      default student   balance    income
    1      No      No  729.5265 44361.625
    2      No     Yes  817.1804 12106.135
    3      No      No 1073.5492 31767.139
    4      No      No  529.2506 35704.494
    5      No      No  785.6559 38463.496
    6      No     Yes  919.5885  7491.559


  • Build the training and test sets
  • Train the model
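    set.seed(123)
    # Randomly assign about 70% of the rows to the training set
    is_train <- sample(c(TRUE, FALSE), nrow(data), replace = TRUE, prob = c(0.7, 0.3))
    train <- data[is_train, ]
    # Use !is_train: applying "-" to a logical vector gives 0/-1, which would drop only row 1
    test <- data[!is_train, ]
    # Logistic regression: a binomial GLM with the default logit link
    fit <- glm(default ~ student + balance + income, family = "binomial", data = train)
    summary(fit)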
    Call:
    glm(formula = default ~ student + balance + income, family = "binomial",
        data = train)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -2.5586  -0.1353  -0.0519  -0.0177   3.7973

    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
    (Intercept) -1.148e+01  6.234e-01 -18.412   <2e-16 ***
    studentYes  -4.933e-01  2.857e-01  -1.726   0.0843 .
    balance      5.988e-03  2.938e-04  20.384   <2e-16 ***
    income       7.857e-06  9.965e-06   0.788   0.4304
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 2021.1  on 6963  degrees of freedom
    Residual deviance: 1065.4  on 6960  degrees of freedom
    AIC: 1073.4

    Number of Fisher Scoring iterations: 8
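The coefficients are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below copies the estimates from the summary output rather than refitting the model. The example customer (a non-student with a balance of 1500 and an income of 40000) is made up for illustration.

```r
# Coefficient estimates copied from the summary output above
coefs <- c("(Intercept)" = -1.148e+01,
           studentYes    = -4.933e-01,
           balance       =  5.988e-03,
           income        =  7.857e-06)

# Odds ratios for a one-unit increase in each predictor
odds_ratios <- exp(coefs[-1])
print(odds_ratios)

# Multiplicative change in the odds of default for a 100-unit higher balance
or_balance_100 <- exp(100 * coefs[["balance"]])
print(or_balance_100)

# Predicted probability for a hypothetical customer (illustrative values only)
x <- c("(Intercept)" = 1, studentYes = 0, balance = 1500, income = 40000)
log_odds <- sum(coefs * x)
prob <- plogis(log_odds)  # plogis() is the logistic function 1 / (1 + exp(-z))
print(prob)
```

With these estimates, a 100-unit higher balance multiplies the odds of default by exp(0.5988), roughly 1.8, while the income coefficient is not distinguishable from zero at the usual significance levels.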


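  • Get the predictions: the type argument of predict can be "link", "response", or "terms"
    • link: the linear predictor, i.e. the log-odds of default = "Yes" (this is the default for a glm)
    • response: the predicted probability that the response is "Yes"
    • terms: a matrix with one column per model term, giving each term's contribution to the linear predictor (not a probability)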
    # type = "response" returns the predicted probability of default = "Yes"
    predicted <- predict(fit, test, type = "response")
    head(predicted)

               4            6            7           15           17           18
    3.259679e-04 1.648925e-03 1.762607e-03 9.694038e-03 1.536866e-05 1.709650e-04
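How the three type values relate to each other can be checked on a small example. This is a minimal sketch on simulated data (not the ISLR Default data); the names sim, sim_fit, x1, and x2 are hypothetical.

```r
set.seed(1)

# Simulated stand-in data, for illustration only
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x1 - 0.8 * sim$x2))

sim_fit <- glm(y ~ x1 + x2, family = "binomial", data = sim)

p_link     <- predict(sim_fit, sim, type = "link")      # log-odds
p_response <- predict(sim_fit, sim, type = "response")  # probabilities in (0, 1)
p_terms    <- predict(sim_fit, sim, type = "terms")     # one column per model term

# The response scale is the logistic transform of the link scale
print(all.equal(plogis(p_link), p_response))

# The term contributions plus the stored constant reproduce the link scale
link_from_terms <- rowSums(p_terms) + attr(p_terms, "constant")
print(all.equal(unname(link_from_terms), unname(p_link)))
```

Both comparisons should print TRUE: "response" is the logistic transform of "link", and "terms" splits the linear predictor into per-term contributions.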

Note: the prediction for each sample is a number (a probability), not the "Yes"/"No" label of default that I had expected.
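To turn the probabilities into "Yes"/"No" labels, one common approach is to apply a cutoff. A minimal sketch, assuming a 0.5 cutoff and a small made-up vector of probabilities and true labels (neither comes from the analysis above):

```r
# Made-up probabilities and true labels, for illustration only
probs <- c(0.02, 0.61, 0.35, 0.90, 0.08, 0.47)
truth <- factor(c("No", "Yes", "No", "Yes", "No", "Yes"), levels = c("No", "Yes"))

# Assumed cutoff; 0.5 is a convention, not a value taken from the analysis
cutoff <- 0.5
pred_class <- factor(ifelse(probs > cutoff, "Yes", "No"), levels = c("No", "Yes"))

# Confusion matrix: rows are predictions, columns are the true classes
conf_mat <- table(Predicted = pred_class, Actual = truth)
print(conf_mat)

# Overall accuracy
accuracy <- mean(pred_class == truth)
print(accuracy)
```

For a rare outcome such as default, a cutoff below 0.5 may be more appropriate. The ROC curve summarizes performance across all possible cutoffs.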

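  • Compute the ROC curve: the roc and auc functions from pROC return the ROC object and the AUC value, respectively
    library(ggplot2)
    library(pROC)
    # Build the ROC object from the true labels and the predicted probabilities
    rocobj <- roc(test$default, predicted)
    # AUC as a plain number rounded to 4 decimals; the name auc is reused by the plotting code
    auc <- round(as.numeric(auc(rocobj)), 4)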
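To make clear what an ROC curve and its AUC represent, the following base-R sketch computes both by hand on made-up scores. It illustrates the definitions and is not pROC's implementation; manual_roc and the toy vectors are hypothetical.

```r
# Made-up labels (1 = positive) and scores, for illustration only
labels <- c(0, 0, 1, 0, 1, 1, 0, 1)
scores <- c(0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.20, 0.90)

# Hypothetical helper: TPR and FPR when predicting positive for score >= threshold
manual_roc <- function(labels, scores) {
  thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
  tpr <- sapply(thresholds, function(t) sum(scores >= t & labels == 1) / sum(labels == 1))
  fpr <- sapply(thresholds, function(t) sum(scores >= t & labels == 0) / sum(labels == 0))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

roc_points <- manual_roc(labels, scores)
print(roc_points)

# AUC by the trapezoidal rule over the ROC points
auc_trapezoid <- with(roc_points, sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2))

# AUC as the probability that a random positive outscores a random negative
pos <- scores[labels == 1]
neg <- scores[labels == 0]
auc_rank <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

print(c(trapezoid = auc_trapezoid, rank = auc_rank))
```

Both calculations give 0.9375 on this toy data, because 15 of the 16 positive-negative pairs are ranked correctly.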
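  • Visualize the result
  • legacy.axes: when TRUE, the x-axis shows 1 - specificity increasing from 0 to 1, instead of specificity decreasing from 1 to 0
  • geom_abline: adds the diagonal reference line
  • annotate: adds the AUC value to the plot
  • coord_cartesian: sets the axis ranges
  • family="serif": uses the generic serif font family (often rendered as Times New Roman, depending on the graphics device)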
    # Note: in ggplot2 3.4.0 and later, linewidth replaces size for line widths
    ggroc(rocobj, color = "red", linetype = 1, size = 1, alpha = 1, legacy.axes = TRUE) +
      geom_abline(intercept = 0, slope = 1, color = "grey", size = 1, linetype = 1) +
      labs(x = "False Positive Rate (1 - Specificity)",
           y = "True Positive Rate (Sensitivity or Recall)") +
      annotate("text", x = .75, y = .25, label = paste("AUC =", auc),
               size = 5, family = "serif") +
      coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
      theme_bw() +
      theme(panel.background = element_rect(fill = 'transparent'),
            axis.ticks.length = unit(0.4, "lines"),
            axis.ticks = element_line(color = 'black'),
            axis.line = element_line(size = .5, colour = "black"),
            axis.title = element_text(colour = 'black', size = 12, face = "bold"),
            axis.text = element_text(colour = 'black', size = 10, face = "bold"),
            text = element_text(size = 8, color = "black", family = "serif"))

Data analysis: logistic regression for a binary variable and ROC visualization - Figure 3

