How do you build a scorecard model quickly? This article walks through a basic scorecard model from 0 to 1; the complete code is attached at the end.

    Step overview:
    1. Prepare the samples (split into three sets: training (dev), validation (val), out-of-time (off))
    2. Clean the data (missing values, outliers, data type conversion, etc.)
    3. Feature mining (binning and IV calculation, PSI calculation, variable combination, dropping worthless variables)
    4. Train binary classification models (LR, XGBoost, GBDT, etc.)
    5. Evaluate each model on each sample (compute F1, ROC, KS, and PSI; pick a suitable model)
    6. Generate the model report and build the scorecard (turn the model's feature weights into concrete per-variable scores)
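
    For orientation, each step maps onto a toad/sklearn call used later in this article (all of these appear in the full code at the end):

    # Step -> main API used in this walkthrough
    # 1. Split samples      -> data_all[data_all['samp_type'] == ...]
    # 2. Inspect/clean data -> toad.detector.detect
    # 3. Feature mining     -> toad.selection.select, toad.transform.Combiner,
    #                          toad.transform.WOETransformer, toad.metrics.PSI,
    #                          toad.selection.stepwise
    # 4. Binary models      -> sklearn LogisticRegression, xgboost.XGBClassifier
    # 5. Evaluation         -> toad.metrics.F1 / KS / AUC, toad.metrics.KS_bucket
    # 6. Scorecard          -> toad.scorecard.ScoreCard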

    1. Prepare the data
    Pick samples that fit the business context, then split them into three sets: training, validation, and out-of-time (the last is used to verify model stability).

    The sample used in this case:
    scorecard.txt
    Details: [95806 rows x 13 columns]; the last column is the sample type: training (dev), validation (val), or out-of-time (off)
    [Figure: preview of scorecard.txt]
    Split into development, validation, and out-of-time samples:
    training (dev) : validation (val) : out-of-time (off) = 7 : 1.5 : 1.5 = 65304 : 14527 : 15975
    dev = data_all[(data_all['samp_type'] == 'dev')]  # 65304
    val = data_all[(data_all['samp_type'] == 'val')]  # 14527
    off = data_all[(data_all['samp_type'] == 'off')]  # 15975
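
    As a quick sanity check (a minimal sketch, not in the original code), confirm the split sizes and that the bad rate is comparable across the three sets:

    # Sizes and bad rates per sample type
    print(data_all['samp_type'].value_counts())
    print(data_all.groupby('samp_type')['bad_ind'].mean())  # bad rate of each set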

    2. Clean the data
    Because this sample is clean and has no anomalous data, this step is skipped.
    Code: toad.detector.detect(data_all)  # inspect missing rates, distinct values, mean, standard deviation, quantiles, min/max, etc.
    [Figure: output of toad.detector.detect]
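
    If cleaning were needed, a minimal sketch might look like this (the median fill and 1%/99% winsorization are illustrative assumptions, not steps from the original pipeline):

    # Illustrative only -- this dataset did not need cleaning
    num_cols = data_all.select_dtypes(include='number').columns.drop('bad_ind')
    data_all[num_cols] = data_all[num_cols].fillna(data_all[num_cols].median())  # fill numeric NaNs
    for col in num_cols:  # cap extreme outliers at the 1st/99th percentiles
        lo, hi = data_all[col].quantile([0.01, 0.99])
        data_all[col] = data_all[col].clip(lo, hi)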

    3. Feature mining
    Note: toad can filter large numbers of features, such as those with high missing rates, low IV, or high correlation. It can also bin with various strategies and perform the WOE transformation.

    Compute IV, Gini, and entropy:
    # A shortcut for computing IV
    # toad.quality(dataframe, target): returns quality metrics for each feature, including IV, Gini, and entropy, which helps surface potentially useful information.
    quality = toad.quality(data_all, 'bad_ind')
    quality.sort_values('iv', ascending=False)
    [Figure: feature quality table sorted by IV]
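
    For intuition, a feature's IV is the sum over its bins of (good% - bad%) * ln(good% / bad%). A minimal hand-rolled sketch (the quartile binning and the epsilon are illustrative assumptions, not toad's internals):

    import numpy as np
    import pandas as pd

    def manual_iv(feature, target, bins=4):
        # Bin the feature, then compare each bin's share of goods vs. bads
        binned = pd.qcut(feature, q=bins, duplicates='drop')
        tab = pd.crosstab(binned, target)
        good = tab.iloc[:, 0] / tab.iloc[:, 0].sum()  # distribution of goods (target == 0)
        bad = tab.iloc[:, 1] / tab.iloc[:, 1].sum()   # distribution of bads (target == 1)
        woe = np.log((good + 1e-6) / (bad + 1e-6))    # epsilon avoids log(0)
        return ((good - bad) * woe).sum()

    # e.g. manual_iv(data_all['act_info'], data_all['bad_ind'])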

    Next, filter features with thresholds of missing rate 0.7, IV 0.03, and correlation 1 (a correlation threshold of 1 keeps more features). The drop logic is an OR: a feature that fails any single condition is removed. [In practice corr=0.7 works better, since it removes multicollinear variables, i.e. those with VIF > 10; see the VIF sketch below.]
    dev_slct1, drop_lst = toad.selection.select(dev, dev['bad_ind'],
                                                empty=0.7, iv=0.03,
                                                corr=1,
                                                return_drop=True,
                                                exclude=ex_list)
    print('keep:', dev_slct1.shape[1],
          'drop empty:', len(drop_lst['empty']),
          'drop iv:', len(drop_lst['iv']),
          'drop corr:', len(drop_lst['corr']))
    >> keep: 12 drop empty: 0 drop iv: 1 drop corr: 0
    # One variable dropped: rh_score (its IV fell below 0.03)
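
    Since the note above uses VIF > 10 as the multicollinearity criterion, here is a minimal sketch of checking VIF with statsmodels (not part of the original pipeline; the fillna(0) is an illustrative assumption):

    from statsmodels.stats.outliers_influence import variance_inflation_factor

    feats = [c for c in dev_slct1.columns if c not in ex_list]
    X = dev_slct1[feats].fillna(0).values  # VIF needs a complete numeric matrix
    for i, col in enumerate(feats):
        print(col, variance_inflation_factor(X, i))  # features with VIF > 10 are collinear candidates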

    Then look at IV on the chi-square binning results:
    bin_plot(dev_slct2, x='act_info', target='bad_ind')  # toad.plot.bin_plot shows the post-binning IV on the chart
    bin_plot(val2, x='act_info', target='bad_ind')
    bin_plot(off2, x='act_info', target='bad_ind')
    Compare the variable's IV across the different samples:
    [Figures: bin_plot of act_info on the dev, val, and off samples]
    The bin boundaries can also be adjusted:
    [Figures: bin_plot of act_info on dev, val, and off after merging to two split points]
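
    The adjustment itself uses combiner.set_rules, as in the full code at the end: keep only two split points (three bins) for act_info, re-apply the binning, and redraw the plots.

    adj_bin = {'act_info': [0.16666666666666666, 0.35897435897435903]}
    combiner.set_rules(adj_bin)
    dev_slct3 = combiner.transform(dev_slct1)
    val3 = combiner.transform(val[dev_slct1.columns])
    off3 = combiner.transform(off[dev_slct1.columns])
    bin_plot(dev_slct3, x='act_info', target='bad_ind')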
    Then drop features by PSI:
    Keep only features with PSI below 0.13 (computed on the WOE-encoded training and validation samples):
    psi_df = toad.metrics.PSI(dev_slct2_woe, val_woe).sort_values(0)
    psi_df = psi_df.reset_index()
    psi_df.columns = ['feature', 'psi']
    psi_013 = list(psi_df[psi_df.psi < 0.13].feature)
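
    For reference, per-feature PSI is the sum over bins of (actual% - expected%) * ln(actual% / expected%); a minimal hand-rolled sketch (the decile binning and the epsilon are illustrative assumptions):

    import numpy as np
    import pandas as pd

    def manual_psi(expected, actual, bins=10):
        # Bin on the expected (dev) distribution, apply the same edges to actual
        edges = np.unique(np.quantile(expected.dropna(), np.linspace(0, 1, bins + 1)))
        e = pd.cut(expected, edges, include_lowest=True).value_counts(normalize=True, sort=False)
        a = pd.cut(actual, edges, include_lowest=True).value_counts(normalize=True, sort=False)
        return ((a - e) * np.log((a + 1e-6) / (e + 1e-6))).sum()

    # e.g. manual_psi(dev['act_info'], off['act_info'])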

    Re-run feature selection:
    dev_woe_psi2, drop_lst = toad.selection.select(dev_woe_psi,
                                                   dev_woe_psi['bad_ind'],
                                                   empty=0.6,
                                                   iv=0.001,
                                                   corr=0.5,
                                                   return_drop=True,
                                                   exclude=ex_list)
    print('keep:', dev_woe_psi2.shape[1],
          'drop empty:', len(drop_lst['empty']),
          'drop iv:', len(drop_lst['iv']),
          'drop corr:', len(drop_lst['corr']))
    # keep: 7 drop empty: 0 drop iv: 4 drop corr: 0 -- four more variables dropped after the PSI stability filter

    After this second pass on missing rate, IV, and correlation, the retained columns are: 'uid', 'samp_type', 'zcx_score', 'bad_ind', 'credit_info', 'act_info', 'person_info'

    Then drop variables via stepwise regression:
    dev_woe_psi_stp = toad.selection.stepwise(dev_woe_psi2,
                                              dev_woe_psi2['bad_ind'],
                                              exclude=ex_list,
                                              direction='both',
                                              criterion='aic',
                                              estimator='ols',
                                              intercept=False)
    val_woe_psi_stp = val_woe_psi[dev_woe_psi_stp.columns]
    off_woe_psi_stp = off_woe_psi[dev_woe_psi_stp.columns]
    data = pd.concat([dev_woe_psi_stp, val_woe_psi_stp, off_woe_psi_stp])
    print(data.shape)  # (95806, 6) -- one more variable dropped: zcx_score
    The final 6 columns are ['uid', 'samp_type', 'bad_ind', 'credit_info', 'act_info', 'person_info']; only the last three are model features (X variables).

    4. Train binary classification models
    Logistic regression and XGBoost models: see the full code at the end.

    5. Evaluate each model on each sample
    [Figures: ROC curves on the dev, val, and off samples]
    Results of the runs (forward = trained on dev; reverse = trained on the out-of-time set and evaluated back on dev, as a stability check):
    # Results
    # LR forward:
    # train_ks: 0.4175617517173731
    # val_ks: 0.3588328912466844
    # off_ks: 0.36930421478753034
    # LR reverse:
    # train_ks: 0.38527996483390414
    # val_ks: 0.37396631393463364
    # off_ks: 0.40858234950836764

    # XGBoost forward:
    # train_ks: 0.4237042266146137
    # val_ks: 0.3595635995538518
    # off_ks: 0.37518296190183736
    # XGBoost reverse:
    # train_ks: 0.3938834608675862
    # val_ks: 0.37747626322745126
    # off_ks: 0.39338531134694127
    All models perform acceptably; KS is roughly 0.4 across the board.
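
    KS here is the maximum vertical gap between the cumulative TPR and FPR curves, computed from sklearn's roc_curve exactly as in the full code:

    from sklearn.metrics import roc_curve
    fpr, tpr, _ = roc_curve(y, y_pred)  # y_pred = predicted bad probability
    ks = abs(fpr - tpr).max()           # KS = max |TPR - FPR|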

    Compute the F1 score, KS, and AUC:
    prob_dev = lr.predict_proba(x)[:, 1]
    print('Training set')
    print('F1:', F1(prob_dev, y))
    print('KS:', KS(prob_dev, y))
    print('AUC:', AUC(prob_dev, y))
    # Training set
    # F1: 0.029351208260445533
    # KS: 0.4175617517173731
    # AUC: 0.7721143736676812

    6. Generate the model report and build the scorecard
    # KS report on the out-of-time sample
    toad.metrics.KS_bucket(prob_off, offy,
                           bucket=10,
                           method='quantile')

    # Build the scorecard
    card = ScoreCard(combiner=combiner,
                     transer=t, C=0.1,
                     class_weight='balanced',
                     base_score=600,
                     base_odds=35,
                     pdo=60,
                     rate=2)
    card.fit(x, y)
    final_card = card.export(to_frame=True)  # dump the score per bin per variable
    print(final_card)

    [Figure: the exported scorecard table]
    This yields a score for each value range of the three variables, and the simple scorecard is complete.
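
    For intuition on the ScoreCard parameters: base_score=600 is the score assigned at odds of base_odds=35, and the score moves by pdo=60 points each time the odds change by a factor of rate=2. A minimal sketch of that standard scaling (the odds direction is an assumption, not verified against toad's internals; check against card.predict before relying on it):

    import numpy as np

    def prob_to_score(p, base_score=600, base_odds=35, pdo=60, rate=2):
        # Standard points-to-odds scaling; assumed (not verified) to match
        # toad's convention, with odds = p_bad / (1 - p_bad)
        factor = pdo / np.log(rate)
        offset = base_score + factor * np.log(base_odds)
        return offset - factor * np.log(p / (1 - p))

    # Scores decrease as the predicted bad probability rises.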

    Full code:

    # -*- coding: utf-8 -*-
    """
    Created on Fri Mar 25 09:26:19 2022
    @author: yingtao.xiang
    """
    # 1. Load libraries
    # Base libraries
    import warnings
    warnings.filterwarnings('ignore')
    import math
    import matplotlib
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import toad
    from toad.plot import bin_plot, badrate_plot
    from toad.metrics import KS, F1, AUC
    from toad.scorecard import ScoreCard
    # Modeling libraries
    import xgboost as xgb
    from sklearn.metrics import roc_auc_score, roc_curve, auc
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    # 2. Load the data
    data_all = pd.read_csv(r'C:\Users\yingtao.xiang\Desktop\风控模型\CREDIT_SCORING_CARD_MODEL-master\CREDIT_SCORING_CARD_MODEL-master\v1/data/scorecard.txt')
    # Columns excluded from training
    ex_list = ['uid', 'samp_type', 'bad_ind']
    # Total sample size and good/bad counts
    # data_all['bad_ind'].value_counts()  # 95806 users in total
    # 0.0    94008
    # 1.0     1798
    # Name: bad_ind, dtype: int64
    # Training columns: 10 features in total
    ft_list = data_all.columns.tolist()
    for col in ex_list:
        ft_list.remove(col)

    # 3. Split into development, validation, and out-of-time samples (7 : 1.5 : 1.5)
    dev = data_all[(data_all['samp_type'] == 'dev')]  # 65304
    val = data_all[(data_all['samp_type'] == 'val')]  # 14527
    off = data_all[(data_all['samp_type'] == 'off')]  # 15975
    # 4. Exploratory data analysis
    toad.detector.detect(data_all)  # missing rates, distinct values, mean, std, quantiles, min/max, etc.

    # 5. Feature filtering (thresholds are configurable)
    dev_slct1, drop_lst = toad.selection.select(dev, dev['bad_ind'],
                                                empty=0.7, iv=0.03,
                                                corr=1,
                                                return_drop=True,
                                                exclude=ex_list)
    # Alternative, stricter IV threshold: same call with iv=0.1
    print('keep:', dev_slct1.shape[1],
          'drop empty:', len(drop_lst['empty']),
          'drop iv:', len(drop_lst['iv']),
          'drop corr:', len(drop_lst['corr']))
    # One variable dropped: rh_score

    # 6. Chi-square binning
    # Learn the split points
    combiner = toad.transform.Combiner()
    combiner.fit(dev_slct1, dev_slct1['bad_ind'], method='chi',
                 min_samples=0.05, exclude=ex_list)
    # Export the bin boundaries
    bins = combiner.export()
    print(bins)

    # 7. Binning plots
    # Apply the binning
    dev_slct2 = combiner.transform(dev_slct1)  # map each value to its bin
    val2 = combiner.transform(val[dev_slct1.columns])
    off2 = combiner.transform(off[dev_slct1.columns])
    # Plot the bins; use bad_rate to check bin stability on the validation and
    # out-of-time sets
    bin_plot(dev_slct2, x='act_info', target='bad_ind')  # bin_plot shows the post-binning IV on the chart
    bin_plot(val2, x='act_info', target='bad_ind')
    bin_plot(off2, x='act_info', target='bad_ind')
    # Inspect one variable's split points
    bins['act_info']
    # Keep only two split points (three bins) and redraw the bivar plots
    adj_bin = {'act_info': [0.16666666666666666, 0.35897435897435903,]}
    combiner.set_rules(adj_bin)
    dev_slct3 = combiner.transform(dev_slct1)
    val3 = combiner.transform(val[dev_slct1.columns])
    off3 = combiner.transform(off[dev_slct1.columns])
    # Redraw
    bin_plot(dev_slct3, x='act_info', target='bad_ind')
    bin_plot(val3, x='act_info', target='bad_ind')
    bin_plot(off3, x='act_info', target='bad_ind')

    # 8. Bad-rate comparison plots
    # No misalignment across sets
    data = pd.concat([dev_slct3, val3, off3], join='inner')
    # Plot each bin's bad rate across datasets: x is the dimension to compare
    # (e.g. train/test sets or different months), by is the binned variable
    badrate_plot(data, x='samp_type', target='bad_ind', by='person_info')
    badrate_plot(dev_slct3, x='samp_type', target='bad_ind', by='person_info')
    # 9. WOE encoding
    t = toad.transform.WOETransformer()
    dev_slct2_woe = t.fit_transform(dev_slct3, dev_slct3['bad_ind'], exclude=ex_list)
    val_woe = t.transform(val3[dev_slct3.columns])
    off_woe = t.transform(off3[dev_slct3.columns])
    data = pd.concat([dev_slct2_woe, val_woe, off_woe])

    # 10. Keep features with PSI below 0.13
    # Compute PSI between the training and validation samples and drop features
    # above 0.13. A common rule of thumb is PSI below 0.02 per feature; adjust
    # to the situation at hand.
    psi_df = toad.metrics.PSI(dev_slct2_woe, val_woe).sort_values(0)
    psi_df = psi_df.reset_index()
    psi_df.columns = ['feature', 'psi']
    psi_013 = list(psi_df[psi_df.psi < 0.13].feature)
    for i in ex_list:
        if i not in psi_013:
            psi_013.append(i)
    data = data[psi_013]
    dev_woe_psi = dev_slct2_woe[psi_013]
    val_woe_psi = val_woe[psi_013]
    off_woe_psi = off_woe[psi_013]
    print(data.shape)

    # 11. Re-run feature selection
    dev_woe_psi2, drop_lst = toad.selection.select(dev_woe_psi,
                                                   dev_woe_psi['bad_ind'],
                                                   empty=0.6,
                                                   iv=0.001,
                                                   corr=0.5,
                                                   return_drop=True,
                                                   exclude=ex_list)
    print('keep:', dev_woe_psi2.shape[1],
          'drop empty:', len(drop_lst['empty']),
          'drop iv:', len(drop_lst['iv']),
          'drop corr:', len(drop_lst['corr']))
    # keep: 7 drop empty: 0 drop iv: 4 drop corr: 0 -- 4 variables dropped after the PSI filter

    # 12. Stepwise regression
    dev_woe_psi_stp = toad.selection.stepwise(dev_woe_psi2,
                                              dev_woe_psi2['bad_ind'],
                                              exclude=ex_list,
                                              direction='both',
                                              criterion='aic',
                                              estimator='ols',
                                              intercept=False)
    val_woe_psi_stp = val_woe_psi[dev_woe_psi_stp.columns]
    off_woe_psi_stp = off_woe_psi[dev_woe_psi_stp.columns]
    data = pd.concat([dev_woe_psi_stp, val_woe_psi_stp, off_woe_psi_stp])
    print(data.shape)  # (95806, 6)
    # 13. Model training
    # 13.1 Logistic regression helper
    def lr_model(x, y, valx, valy, offx, offy, C):
        model = LogisticRegression(C=C, class_weight='balanced')
        model.fit(x, y)

        y_pred = model.predict_proba(x)[:, 1]
        fpr_dev, tpr_dev, _ = roc_curve(y, y_pred)
        train_ks = abs(fpr_dev - tpr_dev).max()
        print('train_ks: ', train_ks)

        y_pred = model.predict_proba(valx)[:, 1]
        fpr_val, tpr_val, _ = roc_curve(valy, y_pred)
        val_ks = abs(fpr_val - tpr_val).max()
        print('val_ks: ', val_ks)

        y_pred = model.predict_proba(offx)[:, 1]
        fpr_off, tpr_off, _ = roc_curve(offy, y_pred)
        off_ks = abs(fpr_off - tpr_off).max()
        print('off_ks: ', off_ks)

        plt.plot(fpr_dev, tpr_dev, label='dev')
        plt.plot(fpr_val, tpr_val, label='val')
        plt.plot(fpr_off, tpr_off, label='off')
        plt.plot([0, 1], [0, 1], 'k--')
        plt.xlabel('False positive rate')
        plt.ylabel('True positive rate')
        plt.title('ROC Curve')
        plt.legend(loc='best')
        plt.show()
    # 13.2 XGBoost helper
    def xgb_model(x, y, valx, valy, offx, offy):
        model = xgb.XGBClassifier(learning_rate=0.05,
                                  n_estimators=400,
                                  max_depth=2,
                                  class_weight='balanced',  # not a core XGBoost parameter; triggers a warning
                                  min_child_weight=1,
                                  subsample=1,
                                  nthread=-1,
                                  scale_pos_weight=1,
                                  random_state=1,
                                  n_jobs=-1,
                                  reg_lambda=300)
        model.fit(x, y)

        y_pred = model.predict_proba(x)[:, 1]
        fpr_dev, tpr_dev, _ = roc_curve(y, y_pred)
        train_ks = abs(fpr_dev - tpr_dev).max()
        print('train_ks: ', train_ks)

        y_pred = model.predict_proba(valx)[:, 1]
        fpr_val, tpr_val, _ = roc_curve(valy, y_pred)
        val_ks = abs(fpr_val - tpr_val).max()
        print('val_ks: ', val_ks)

        y_pred = model.predict_proba(offx)[:, 1]
        fpr_off, tpr_off, _ = roc_curve(offy, y_pred)
        off_ks = abs(fpr_off - tpr_off).max()
        print('off_ks: ', off_ks)

        plt.plot(fpr_dev, tpr_dev, label='dev')
        plt.plot(fpr_val, tpr_val, label='val')
        plt.plot(fpr_off, tpr_off, label='off')
        plt.plot([0, 1], [0, 1], 'k--')
        plt.xlabel('False positive rate')
        plt.ylabel('True positive rate')
        plt.title('ROC Curve')
        plt.legend(loc='best')
        plt.show()
    # 13.3 Run the models
    def bi_train(data, dep='bad_ind', exclude=None):
        std_scaler = StandardScaler()
        # Feature columns
        lis = data.columns.tolist()
        for i in exclude:
            lis.remove(i)
        data[lis] = std_scaler.fit_transform(data[lis])
        devv = data[(data['samp_type'] == 'dev')]
        vall = data[(data['samp_type'] == 'val')]
        offf = data[(data['samp_type'] == 'off')]
        x, y = devv[lis], devv[dep]
        valx, valy = vall[lis], vall[dep]
        offx, offy = offf[lis], offf[dep]
        # Logistic regression, forward (train on dev)
        print('LR forward:')
        lr_model(x, y, valx, valy, offx, offy, 0.1)
        # Logistic regression, reverse (train on off, evaluate back on dev)
        print('LR reverse:')
        lr_model(offx, offy, valx, valy, x, y, 0.1)
        # XGBoost, forward
        print('XGBoost forward:')
        xgb_model(x, y, valx, valy, offx, offy)
        # XGBoost, reverse
        print('XGBoost reverse:')
        xgb_model(offx, offy, valx, valy, x, y)

    bi_train(data, dep='bad_ind', exclude=ex_list)
    # Results
    # LR forward:
    # train_ks: 0.41733648227995124
    # val_ks:   0.3593935758405114
    # off_ks:   0.3758086175640308
    # LR reverse:
    # train_ks: 0.3892612859630226
    # val_ks:   0.3717891855920369
    # off_ks:   0.4061965880072622
    # XGBoost forward:
    # (XGBoost warns that { class_weight } might not be used: it is not a core
    #  booster parameter and may be ignored.)
    # train_ks: 0.42521927400747045
    # val_ks:   0.3595542266920359
    # off_ks:   0.37437103192850807
    # XGBoost reverse:
    # train_ks: 0.3939473708822855
    # val_ks:   0.3799497614606668
    # off_ks:   0.3936270948436908
    # 14. Fit a single logistic regression model
    dep = 'bad_ind'
    lis = list(data.columns)
    for i in ex_list:
        lis.remove(i)
    devv = data[data['samp_type'] == 'dev']
    vall = data[data['samp_type'] == 'val']
    offf = data[data['samp_type'] == 'off']
    x, y = devv[lis], devv[dep]
    valx, valy = vall[lis], vall[dep]
    offx, offy = offf[lis], offf[dep]
    lr = LogisticRegression()
    lr.fit(x, y)

    # 15. Compute F1, KS, and AUC
    prob_dev = lr.predict_proba(x)[:, 1]
    print('Training set')
    print('F1:', F1(prob_dev, y))
    print('KS:', KS(prob_dev, y))
    print('AUC:', AUC(prob_dev, y))
    # Training set
    # F1: 0.029351208260445533
    # KS: 0.4175617517173731
    # AUC: 0.7721143736676812
    prob_val = lr.predict_proba(valx)[:, 1]
    print('Validation set')
    print('F1:', F1(prob_val, valy))
    print('KS:', KS(prob_val, valy))
    print('AUC:', AUC(prob_val, valy))
    # Validation set
    # F1: 0.03797468354430379
    # KS: 0.3588328912466844
    # AUC: 0.7222256797668032
    prob_off = lr.predict_proba(offx)[:, 1]
    print('Out-of-time set')
    print('F1:', F1(prob_off, offy))
    print('KS:', KS(prob_off, offy))
    print('AUC:', AUC(prob_off, offy))
    # Out-of-time set
    # F1: 0.022392178506662464
    # KS: 0.3696948842371405
    # AUC: 0.7436315034285385
    # Measure stability from two angles: model-level PSI and per-feature PSI
    print('Model PSI:', toad.metrics.PSI(prob_dev, prob_off))
    print('Feature PSI:', '\n', toad.metrics.PSI(x, offx).sort_values(0))
    # Model PSI: 0.33682698167332736
    # Feature PSI:
    # credit_info    0.098585
    # act_info       0.122049
    # person_info    0.127833
    # dtype: float64

    # KS report on the out-of-time sample
    toad.metrics.KS_bucket(prob_off, offy,
                           bucket=10,
                           method='quantile')
    # Same report for the other samples:
    # toad.metrics.KS_bucket(prob_val, valy, bucket=10, method='quantile')
    # toad.metrics.KS_bucket(prob_dev, y, bucket=10, method='quantile')

    # Build the scorecard
    card = ScoreCard(combiner=combiner,
                     transer=t, C=0.1,
                     class_weight='balanced',
                     base_score=600,
                     base_odds=35,
                     pdo=60,
                     rate=2)
    card.fit(x, y)
    final_card = card.export(to_frame=True)  # score per bin per variable
    print(final_card)