Source: https://blog.csdn.net/LuYi_WeiLin/article/details/85060190
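The pre-loan admission process roughly follows this flowchart: [Figure: flowchart of the pre-loan admission process]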
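Why build a scorecard?
Every model exists to serve the business. So what business problem calls for a scorecard model? Let us start with the traditional pricing approach of financial institutions.
Banks charge interest when they lend money. How much interest is reasonable?
Interest is essentially rent. When a bank lends money to a customer, the customer gets the right to use that money for a period of time and pays rent for it, just as you pay rent for an apartment. Lending money differs from renting out an apartment, though, in that it carries credit risk: the borrower may be unable to repay at maturity. Interest therefore covers two components of value:
Interest = time value + risk value
The time value depends on market supply and demand (the interbank offered rate), and the risk value depends on the customer's default rate.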
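As a rough illustration of how the two components add up, the sketch below prices a one-period loan so that the lender breaks even in expectation. The break-even condition, the zero-recovery assumption, and the names `break_even_rate`, `risk_premium`, and `funding_rate` are illustrative assumptions, not something the original post specifies.

```python
def break_even_rate(funding_rate, default_rate):
    """Lending rate at which the expected repayment equals the funding cost.

    Assumption (not from the post): one period, nothing recovered on default,
    so (1 - default_rate) * (1 + r) = 1 + funding_rate.
    """
    if not 0 <= default_rate < 1:
        raise ValueError("default_rate must be in [0, 1)")
    return (1 + funding_rate) / (1 - default_rate) - 1


def risk_premium(funding_rate, default_rate):
    """Part of the lending rate that compensates for credit risk."""
    return break_even_rate(funding_rate, default_rate) - funding_rate


if __name__ == "__main__":
    # Same time value (3% funding cost), different default rates.
    for default_rate in (0.01, 0.03, 0.05):
        rate = break_even_rate(0.03, default_rate)
        print(default_rate, round(rate, 4), round(rate - 0.03, 4))
```

Under these assumptions the time value is the same for every borrower, while the risk premium grows with the default rate, which is the gap that risk-based pricing exploits.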
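The core problem is therefore how to quantify the risk value. The traditional approach is to compute one average default rate for a whole class of customers from their historical performance. What goes wrong with this approach?
Suppose there are two banks in the market, A and B. A uses traditional pricing (with a computed average default rate of 3%), and B uses risk-based pricing (a different price for customers with different default rates). What happens?
Adverse selection (bad money drives out good): good customers whose default rate is below 3% move from bank A to bank B, because B offers them a lower interest rate that matches their own default rate. Bank A loses its good customers, so its average default rate rises, say to 5%, and it is forced to raise its lending rate. Customers whose default rate is below 5% then leave bank A in turn...
Banks therefore need to assign risk grades to borrowers and price each grade differently, which effectively avoids adverse selection. But why should a model do this job rather than people?
Stability: human judgment is somewhat arbitrary and volatile, and is less stable than a model.
Efficiency: a model makes decisions far faster than a human does.
Cost: the marginal cost of a model is close to 0; once the model has been developed, additional customers add almost no cost.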
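The spiral described above can be sketched as a small simulation. It assumes, as a deliberate simplification, that every customer whose own default rate is below bank A's pooled average leaves at once; the function name `adverse_selection` and the customer list are hypothetical. Rates are given in basis points to keep the arithmetic exact.

```python
def adverse_selection(default_rates_bp, max_rounds=10):
    """Track bank A's pooled default rate as better-than-average customers leave.

    default_rates_bp: default rate of each customer in basis points (1% = 100).
    Returns the pooled average of each round until nobody else leaves.
    """
    customers = list(default_rates_bp)
    history = []
    for _ in range(max_rounds):
        average = sum(customers) / len(customers)
        history.append(average)
        # Customers who are safer than the pooled price move to bank B.
        remaining = [rate for rate in customers if rate >= average]
        if len(remaining) == len(customers):
            break
        customers = remaining
    return history


if __name__ == "__main__":
    # Five customers with default rates of 1% to 5%; the pooled average starts at 3%.
    print(adverse_selection([100, 200, 300, 400, 500]))
```

In this toy portfolio the pooled rate climbs round after round until only the riskiest customer is left.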
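Steps for building a scorecard
Data acquisition: collect data on existing and potential customers. Existing customers are those who already conduct financing business with the securities firm, both individuals and institutions. Potential customers are those who intend to conduct such business with the securities firm in the future, mainly institutions; drawing on them is also a common way to address the small sample sizes in the securities industry. These potential institutional customers include listed companies, issuers of publicly offered bonds, companies listed on the NEEQ (New Third Board), companies listed on regional equity exchanges, non-standard financing institutions, and so on.
Data preprocessing: mainly data cleaning, missing-value handling, and outlier handling, in order to turn the raw data into formatted data that can be used for model development.
Exploratory data analysis: get an overall picture of the sample. The main tools for describing the sample are histograms, box plots, and so on.
Variable selection: use statistical methods to pick the indicators that most significantly affect default status. The main approaches are univariate feature selection and methods based on machine learning models.
Model development: this consists of three parts: variable binning, the WOE (weight of evidence) transformation of the variables, and logistic regression estimation.
Model evaluation: assess the model's discriminatory power, predictive power, and stability, produce a model evaluation report, and conclude whether the model can be used.
Credit scoring: determine the scoring method from the logistic regression coefficients and the WOE values, converting the logistic model into a standard score format.
Building the scoring system: build an automated credit scoring system based on the scoring method.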
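To make the WOE transformation in the model development step concrete, here is a minimal sketch that computes WOE and IV from per-bin counts of good and bad customers. It uses the same sign convention as the `CalcWOE` function in this post, ln(share of goods / share of bads); the function name `woe_iv` and the example counts are made up for illustration.

```python
import math


def woe_iv(bin_counts):
    """Compute WOE per bin and the total IV.

    bin_counts maps a bin label to (good_count, bad_count).
    Every bin must contain both good and bad samples, otherwise WOE is undefined.
    """
    total_good = sum(good for good, _ in bin_counts.values())
    total_bad = sum(bad for _, bad in bin_counts.values())
    woe = {}
    iv = 0.0
    for label, (good, bad) in bin_counts.items():
        if good == 0 or bad == 0:
            raise ValueError("bin {!r} lacks good or bad samples; merge it first".format(label))
        good_pcnt = good / total_good
        bad_pcnt = bad / total_bad
        woe[label] = math.log(good_pcnt / bad_pcnt)
        iv += (good_pcnt - bad_pcnt) * woe[label]
    return woe, iv


if __name__ == "__main__":
    # Hypothetical counts: the "low" bin is mostly good, the "high" bin mostly bad.
    woe, iv = woe_iv({"low": (80, 20), "high": (20, 80)})
    print(woe, iv)
```

A bin with a positive WOE holds relatively more good customers than the population as a whole, which is why the logistic regression coefficients on WOE-encoded variables are expected to be negative when the target is the bad flag.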
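If anything is unclear, see my related posts: the application scorecard (A-card) series has six posts in total. The code is attached below, and the dataset can be downloaded from the resources section of my blog.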
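Data dictionary of the dataset
Field                     Description
member_id                 ID
loan_amnt                 Requested loan amount
term                      Loan term
int_rate                  Interest rate
emp_length                Length of employment
home_ownership            Home ownership status
annual_inc                Annual income
verification_status       Income verification status
desc                      Description
purpose                   Loan purpose
title                     Loan purpose description
zip_code                  Postal code of the contact address
addr_state                State of the contact address
delinq_2yrs               Number of delinquencies in the 2 years before the application date
inq_last_6mths            Number of credit inquiries in the 6 months before the application date
mths_since_last_delinq    Months since the last delinquency
mths_since_last_record    Months since the last public record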
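open_acc                  Number of open credit lines (credit products currently in use)
pub_rec                   Number of derogatory public records
total_acc                 Total number of credit lines recorded in the credit file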
pub_rec_bankruptcies      Number of public record bankruptcies
earliest_cr_line          Date of the first credit line
loan_status               Loan status (target variable)
Code:
# -*- coding:utf-8 -*-
import pandas as pd
import re
import time
import datetime
import numpy as np
import pickle
from dateutil.relativedelta import relativedelta
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LogisticRegressionCV
import statsmodels.api as sm
from sklearn.ensemble import RandomForestClassifier
from numpy import log
from sklearn.metrics import roc_auc_score
'''
Author: 小象学院
Date: 20190224
'''
# When a continuous variable has too many distinct values (>100), split it coarsely first
def SplitData(df, col, numOfSplit, special_attribute=[]):
    '''
    :param df: the dataset, sorted by col
    :param col: the variable to be binned
    :param numOfSplit: number of groups to split into
    :param special_attribute: special values to exclude when splitting the dataset
    :return: the sorted split points used to regroup the fine-grained values of col into coarser values before bins are merged
    '''
    df2 = df.copy()
    if special_attribute != []:
        df2 = df.loc[~df[col].isin(special_attribute)]
    N = df2.shape[0]
    n = N//numOfSplit  # number of samples per group
    splitPointIndex = [i*n for i in range(1,numOfSplit)]  # positions of the split points
    rawValues = sorted(list(df2[col]))  # sort the values
    splitPoint = [rawValues[i] for i in splitPointIndex]  # values at the split points
    splitPoint = sorted(list(set(splitPoint)))
    return splitPoint
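def Chi2(df, total_col, bad_col, overallRate):
    '''
    :param df: data frame holding the total count and the bad count of each group
    :param total_col: column with the number of all samples
    :param bad_col: column with the number of bad samples
    :param overallRate: share of bad samples in the whole population
    :return: the chi-square value
    '''
    df2 = df.copy()
    # expected number of bad samples = number of samples * overall bad rate
    df2['expected'] = df[total_col].apply(lambda x: x*overallRate)
    combined = zip(df2['expected'], df2[bad_col])
    chi = [(i[0]-i[1])**2/i[0] for i in combined]
    chi2 = sum(chi)
    return chi2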
def BinBadRate(df, col, target, grantRateIndicator=0):
    '''
    :param df: the dataset for which bad rates are computed
    :param col: the feature for which bad rates are computed
    :param target: the good/bad label
    :param grantRateIndicator: 1 to also return the overall bad rate, 0 otherwise
    :return: the bad rate of each bin, plus the overall bad rate when grantRateIndicator==1
    '''
    total = df.groupby([col])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df.groupby([col])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    regroup['bad_rate'] = regroup.apply(lambda x: x.bad * 1.0 / x.total, axis=1)
    dicts = dict(zip(regroup[col],regroup['bad_rate']))
    if grantRateIndicator==0:
        return (dicts, regroup)
    N = sum(regroup['total'])
    B = sum(regroup['bad'])
    overallRate = B * 1.0 / N
    return (dicts, regroup, overallRate)
### ChiMerge_MaxInterval: split the continuous variable using Chi-square value by specifying the max number of intervals
def ChiMerge(df, col, target, max_interval=5,special_attribute=[],minBinPcnt=0):
    '''
    :param df: data frame containing the target and the attribute to be binned
    :param col: the attribute to be binned
    :param target: the target variable, taking the value 0 or 1
    :param max_interval: maximum number of bins. If the attribute has no more distinct values than this, no merging is done
    :param special_attribute: attribute values that do not take part in binning
    :param minBinPcnt: minimum share of samples per bin, 0 by default
    :return: the binning result as a list of cut-off points
    '''
    colLevels = sorted(list(set(df[col])))
    N_distinct = len(colLevels)  # number of distinct values
    if N_distinct <= max_interval:  # nothing to merge if there are no more distinct values than max_interval
        print ("The number of original levels for {} is less than or equal to max intervals".format(col))
        return colLevels[:-1]
    else:
        if len(special_attribute)>=1:
            df1 = df.loc[df[col].isin(special_attribute)]
            df2 = df.loc[~df[col].isin(special_attribute)].copy()
        else:
            df2 = df.copy()
        N = df2.shape[0]  # number of samples taking part in binning
        N_distinct = len(list(set(df2[col])))  # number of distinct values of the feature

        # Step 1: group the dataset by col and get the total and bad counts of each group
        if N_distinct > 100:
            split_x = SplitData(df2, col, 100)
            df2['temp'] = df2[col].map(lambda x: AssignGroup(x, split_x))
        else:
            df2['temp'] = df2[col]
        # The overall bad rate is used to compute the expected bad count
        (binBadRate, regroup, overallRate) = BinBadRate(df2, 'temp', target, grantRateIndicator=1)

        # At first, every distinct value forms its own group
        # The values are sorted, then adjacent groups are merged pairwise
        colLevels = sorted(list(set(df2['temp'])))
        groupIntervals = [[i] for i in colLevels]

        # Step 2: loop, merging the best pair of adjacent groups each time, until:
        # 1. the number of bins is <= the preset maximum number of bins
        # 2. the share of each bin is not below the preset value (optional)
        # 3. every bin contains both good and bad samples
        # With special values, the number of regular bins = preset maximum - number of special values
        split_intervals = max_interval - len(special_attribute)
        while (len(groupIntervals) > split_intervals):  # stop when the number of bins reaches the preset number
            # In each iteration, compute the chi-square value of every adjacent pair.
            # The pair with the smallest chi-square value is the best one to merge
            chisqList = []
            for k in range(len(groupIntervals)-1):
                temp_group = groupIntervals[k] + groupIntervals[k+1]
                df2b = regroup.loc[regroup['temp'].isin(temp_group)]
                chisq = Chi2(df2b, 'total', 'bad', overallRate)
                chisqList.append(chisq)
            best_comnbined = chisqList.index(min(chisqList))
            groupIntervals[best_comnbined] = groupIntervals[best_comnbined] + groupIntervals[best_comnbined+1]
            # after combining two intervals, we need to remove one of them
            groupIntervals.remove(groupIntervals[best_comnbined+1])
        groupIntervals = [sorted(i) for i in groupIntervals]
        cutOffPoints = [max(i) for i in groupIntervals[:-1]]

        # Check whether any bin lacks good or bad samples. If so, merge it with a neighbor until every bin has both
        groupedvalues = df2['temp'].apply(lambda x: AssignBin(x, cutOffPoints))
        df2['temp_Bin'] = groupedvalues
        (binBadRate,regroup) = BinBadRate(df2, 'temp_Bin', target)
        [minBadRate, maxBadRate] = [min(binBadRate.values()),max(binBadRate.values())]
        while minBadRate ==0 or maxBadRate == 1:
            # Find the bins that contain only good or only bad samples
            indexForBad01 = regroup[regroup['bad_rate'].isin([0,1])].temp_Bin.tolist()
            bin_label = indexForBad01[0]
            # The last bin is merged with the previous one, so the last cut-off point is removed
            if bin_label == max(regroup.temp_Bin):
                cutOffPoints = cutOffPoints[:-1]
            # The first bin is merged with the next one, so the first cut-off point is removed
            elif bin_label == min(regroup.temp_Bin):
                cutOffPoints = cutOffPoints[1:]
            # A middle bin is merged with the neighbor that gives the smaller chi-square value
            else:
                # Merge with the previous bin and compute the chi-square value
                currentIndex = list(regroup.temp_Bin).index(bin_label)
                prevIndex = list(regroup.temp_Bin)[currentIndex - 1]
                df3 = df2.loc[df2['temp_Bin'].isin([prevIndex, bin_label])]
                (binBadRate, df2b) = BinBadRate(df3, 'temp_Bin', target)
                chisq1 = Chi2(df2b, 'total', 'bad', overallRate)
                # Merge with the next bin and compute the chi-square value
                laterIndex = list(regroup.temp_Bin)[currentIndex + 1]
                df3b = df2.loc[df2['temp_Bin'].isin([laterIndex, bin_label])]
                (binBadRate, df2b) = BinBadRate(df3b, 'temp_Bin', target)
                chisq2 = Chi2(df2b, 'total', 'bad', overallRate)
                if chisq1 < chisq2:
                    cutOffPoints.remove(cutOffPoints[currentIndex - 1])
                else:
                    cutOffPoints.remove(cutOffPoints[currentIndex])
            # After merging, recheck whether every bin contains both good and bad samples
            groupedvalues = df2['temp'].apply(lambda x: AssignBin(x, cutOffPoints))
            df2['temp_Bin'] = groupedvalues
            (binBadRate, regroup) = BinBadRate(df2, 'temp_Bin', target)
            [minBadRate, maxBadRate] = [min(binBadRate.values()), max(binBadRate.values())]

        # Check the minimum share of samples per bin
        if minBinPcnt > 0:
            groupedvalues = df2['temp'].apply(lambda x: AssignBin(x, cutOffPoints))
            df2['temp_Bin'] = groupedvalues
            valueCounts = groupedvalues.value_counts().to_frame(name='cnt')
            valueCounts['pcnt'] = valueCounts['cnt'].apply(lambda x: x * 1.0 / N)
            valueCounts = valueCounts.sort_index()
            minPcnt = min(valueCounts['pcnt'])
            while minPcnt < minBinPcnt and len(cutOffPoints) > 2:
                # Find the bin with the smallest share
                indexForMinPcnt = valueCounts[valueCounts['pcnt'] == minPcnt].index.tolist()[0]
                # The last bin is merged with the previous one, so the last cut-off point is removed
                if indexForMinPcnt == max(valueCounts.index):
                    cutOffPoints = cutOffPoints[:-1]
                # The first bin is merged with the next one, so the first cut-off point is removed
                elif indexForMinPcnt == min(valueCounts.index):
                    cutOffPoints = cutOffPoints[1:]
                # A middle bin is merged with the neighbor that gives the smaller chi-square value
                else:
                    # Merge with the previous bin and compute the chi-square value
                    currentIndex = list(valueCounts.index).index(indexForMinPcnt)
                    prevIndex = list(valueCounts.index)[currentIndex - 1]
                    df3 = df2.loc[df2['temp_Bin'].isin([prevIndex, indexForMinPcnt])]
                    (binBadRate, df2b) = BinBadRate(df3, 'temp_Bin', target)
                    chisq1 = Chi2(df2b, 'total', 'bad', overallRate)
                    # Merge with the next bin and compute the chi-square value
                    laterIndex = list(valueCounts.index)[currentIndex + 1]
                    df3b = df2.loc[df2['temp_Bin'].isin([laterIndex, indexForMinPcnt])]
                    (binBadRate, df2b) = BinBadRate(df3b, 'temp_Bin', target)
                    chisq2 = Chi2(df2b, 'total', 'bad', overallRate)
                    if chisq1 < chisq2:
                        cutOffPoints.remove(cutOffPoints[currentIndex - 1])
                    else:
                        cutOffPoints.remove(cutOffPoints[currentIndex])
                # Recompute the shares under the new cut-off points so that the loop can terminate
                groupedvalues = df2['temp'].apply(lambda x: AssignBin(x, cutOffPoints))
                df2['temp_Bin'] = groupedvalues
                valueCounts = groupedvalues.value_counts().to_frame(name='cnt')
                valueCounts['pcnt'] = valueCounts['cnt'].apply(lambda x: x * 1.0 / N)
                valueCounts = valueCounts.sort_index()
                minPcnt = min(valueCounts['pcnt'])
        cutOffPoints = special_attribute + cutOffPoints
        return cutOffPoints
def UnsupervisedSplitBin(df,var,numOfSplit = 5, method = 'equal freq'):
    '''
    :param df: the dataset
    :param var: the variable to be binned; numeric only
    :param numOfSplit: number of bins, 5 by default
    :param method: binning method; 'equal freq' (the default) means equal-frequency, anything else means equal-width
    :return: the list of split points
    '''
    if method == 'equal freq':
        N = df.shape[0]
        n = N // numOfSplit  # integer division, because n is used to build list indices
        splitPointIndex = [i * n for i in range(1, numOfSplit)]
        rawValues = sorted(list(df[var]))
        splitPoint = [rawValues[i] for i in splitPointIndex]
        splitPoint = sorted(list(set(splitPoint)))
        return splitPoint
    else:
        var_max, var_min = max(df[var]), min(df[var])
        interval_len = (var_max - var_min)*1.0/numOfSplit
        splitPoint = [var_min + i*interval_len for i in range(1,numOfSplit)]
        return splitPoint
def AssignGroup(x, bin):
    '''
    :param x: a value of some variable
    :param bin: the split points of that variable
    :return: the group that x maps to under these split points
    '''
    N = len(bin)
    if x<=min(bin):
        return min(bin)
    elif x>max(bin):
        return 10e10
    else:
        for i in range(N-1):
            if bin[i] < x <= bin[i+1]:
                return bin[i+1]
def BadRateEncoding(df, col, target):
    '''
    :param df: dataframe containing feature and target
    :param col: the feature that needs to be encoded with bad rate, usually categorical type
    :param target: good/bad indicator
    :return: the assigned bad rate to encode the categorical feature
    '''
    regroup = BinBadRate(df, col, target, grantRateIndicator=0)[1]
    br_dict = regroup[[col,'bad_rate']].set_index([col]).to_dict(orient='index')
    for k, v in br_dict.items():
        br_dict[k] = v['bad_rate']
    badRateEncoding = df[col].map(lambda x: br_dict[x])
    return {'encoding':badRateEncoding, 'bad_rate':br_dict}
def AssignBin(x, cutOffPoints,special_attribute=[]):
    '''
    :param x: a value of some variable
    :param cutOffPoints: the binning result of that variable, given as cut-off points
    :param special_attribute: special values that do not take part in binning
    :return: the label of the bin that x falls into, counting from 0
    for example, if cutOffPoints = [10,20,30], if x = 7, return Bin 0. If x = 35, return Bin 3
    '''
    numBin = len(cutOffPoints) + 1 + len(special_attribute)
    if x in special_attribute:
        i = special_attribute.index(x)+1
        return 'Bin {}'.format(0-i)
    if x<=cutOffPoints[0]:
        return 'Bin 0'
    elif x > cutOffPoints[-1]:
        return 'Bin {}'.format(numBin-1)
    else:
        for i in range(0,numBin-1):
            if cutOffPoints[i] < x <= cutOffPoints[i+1]:
                return 'Bin {}'.format(i+1)
def CalcWOE(df, col, target):
    '''
    :param df: data frame containing the variable whose WOE is computed and the target
    :param col: the variable whose WOE and IV are computed; it must already be binned, or be a categorical variable that needs no binning
    :param target: the target variable; 0 and 1 stand for good and bad
    :return: the WOE of each bin and the IV of the variable
    '''
    total = df.groupby([col])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df.groupby([col])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    N = sum(regroup['total'])
    B = sum(regroup['bad'])
    regroup['good'] = regroup['total'] - regroup['bad']
    G = N - B
    regroup['bad_pcnt'] = regroup['bad'].map(lambda x: x*1.0/B)
    regroup['good_pcnt'] = regroup['good'].map(lambda x: x * 1.0 / G)
    # WOE = ln(share of goods in the bin / share of bads in the bin)
    regroup['WOE'] = regroup.apply(lambda x: np.log(x.good_pcnt*1.0/x.bad_pcnt),axis = 1)
    WOE_dict = regroup[[col,'WOE']].set_index(col).to_dict(orient='index')
    for k, v in WOE_dict.items():
        WOE_dict[k] = v['WOE']
    IV = regroup.apply(lambda x: (x.good_pcnt-x.bad_pcnt)*np.log(x.good_pcnt*1.0/x.bad_pcnt),axis = 1)
    IV = sum(IV)
    return {"WOE": WOE_dict, 'IV':IV}
## Check whether the bad rate of a variable is monotone
def BadRateMonotone(df, sortByVar, target,special_attribute = []):
    '''
    :param df: data frame containing the variable to check and the target
    :param sortByVar: the variable whose bad rate is checked
    :param target: the target variable; 0 and 1 stand for good and bad
    :param special_attribute: special values that do not take part in the check
    :return: whether the bad rate is monotone
    '''
    df2 = df.loc[~df[sortByVar].isin(special_attribute)]
    if len(set(df2[sortByVar])) <= 2:
        return True
    regroup = BinBadRate(df2, sortByVar, target)[1]
    combined = zip(regroup['total'],regroup['bad'])
    badRate = [x[1]*1.0/x[0] for x in combined]
    # A bin breaks monotonicity when its bad rate is a strict local minimum or maximum
    badRateNotMonotone = [badRate[i]<badRate[i+1] and badRate[i] < badRate[i-1] or badRate[i]>badRate[i+1] and badRate[i] > badRate[i-1]
                          for i in range(1,len(badRate)-1)]
    if True in badRateNotMonotone:
        return False
    else:
        return True
def MergeBad0(df,col,target, direction='bad'):
    '''
    :param df: data frame to check for groups with a 0% or 100% bad rate
    :param col: a binned or categorical variable. Groups without bad samples or without good samples must be merged
    :param target: the target variable; 0 and 1 stand for good and bad
    :param direction: 'bad' merges groups without bad samples; any other value merges groups without good samples
    :return: the merging scheme under which every group contains both good and bad samples
    '''
    regroup = BinBadRate(df, col, target)[1]
    if direction == 'bad':
        # Groups with a 0 bad rate are merged with the group that has the smallest non-zero bad rate
        regroup = regroup.sort_values(by = 'bad_rate')
    else:
        # Groups with a 0 good rate are merged with the group that has the smallest non-zero good rate
        regroup = regroup.sort_values(by='bad_rate',ascending=False)
    regroup.index = range(regroup.shape[0])
    col_regroup = [[i] for i in regroup[col]]
    del_index = []
    for i in range(regroup.shape[0]-1):
        col_regroup[i+1] = col_regroup[i] + col_regroup[i+1]
        del_index.append(i)
        if direction == 'bad':
            if regroup['bad_rate'][i+1] > 0:
                break
        else:
            if regroup['bad_rate'][i+1] < 1:
                break
    col_regroup2 = [col_regroup[i] for i in range(len(col_regroup)) if i not in del_index]
    newGroup = {}
    for i in range(len(col_regroup2)):
        for g2 in col_regroup2[i]:
            newGroup[g2] = 'Bin '+str(i)
    return newGroup
def Prob2Score(prob, basePoint, PDO):
    # Convert a default probability into an integer score (higher score = lower risk)
    y = np.log(prob/(1-prob))
    return int(basePoint+PDO/np.log(2)*(-y))
### Compute the KS statistic
def KS(df, score, target):
    '''
    :param df: dataset containing the target and the predicted value, dataframe
    :param score: name of the score or probability column, str
    :param target: name of the target column, str
    :return: the KS value
    '''
    total = df.groupby([score])[target].count()
    bad = df.groupby([score])[target].sum()
    stats = pd.DataFrame({'total':total, 'bad':bad})
    stats['good'] = stats['total'] - stats['bad']
    stats[score] = stats.index
    stats = stats.sort_values(by=score,ascending=False)
    stats.index = range(len(stats))
    stats['badCumRate'] = stats['bad'].cumsum() / stats['bad'].sum()
    stats['goodCumRate'] = stats['good'].cumsum() / stats['good'].sum()
    ks_gap = stats.apply(lambda x: x.badCumRate - x.goodCumRate, axis=1)
    return max(ks_gap)
def CareerYear(x):
    # Convert the length of employment to an integer
    if str(x).find('nan') > -1:
        return -1
    elif str(x).find("10+")>-1:   # "10+ years" becomes 11
        return 11
    elif str(x).find('< 1') > -1:  # "< 1 year" becomes 0
        return 0
    else:
        return int(re.sub(r"\D", "", x))   # otherwise strip "years" and convert to an integer
def DescExisting(x):
    # Reduce desc to two states: with a description and without one
    if type(x).__name__ == 'float':
        return 'no desc'
    else:
        return 'desc'
def ConvertDateStr(x):
    # Convert a 'Mon-yy' string such as 'Dec-11' into a datetime on the first day of that month
    mth_dict = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10,
                'Nov': 11, 'Dec': 12}
    if str(x) == 'nan':
        # Missing dates get a far-future placeholder (January 9900)
        return datetime.datetime(9900, 1, 1)
    else:
        yr = int(x[4:6])
        if yr <=17:
            yr = 2000+yr
        else:
            yr = 1900 + yr
        mth = mth_dict[x[:3]]
        return datetime.datetime(yr,mth,1)
def MonthGap(earlyDate, lateDate):
    if lateDate > earlyDate:
        gap = relativedelta(lateDate,earlyDate)
        yr = gap.years
        mth = gap.months
        return yr*12+mth
    else:
        return 0
def MakeupMissing(x):
    if np.isnan(x):
        return -1
    else:
        return x
# Data preparation
# 1. read the data
# 2. select a suitable modeling sample
# 3. split the dataset into a training set and a test set
folderOfData = 'H:/信贷风控/资料/'
allData = pd.read_csv(folderOfData + '数据集.csv',header = 0, encoding = 'latin1',engine ='python')
allData['term'] = allData['term'].apply(lambda x: int(x.replace(' months','')))
# Build the label: Fully Paid is a good customer; Charged Off is a defaulted customer
allData['y'] = allData['loan_status'].map(lambda x: int(x == 'Charged Off'))
'''
Loans come with different terms. The default probability estimated by an application scorecard must refer to one common term,
and that term should not be too long, so only the samples with term = 36 months are kept
'''
allData1 = allData.loc[allData.term == 36]
trainData, testData = train_test_split(allData1,test_size=0.4)
# Persist the split so that the same training and test sets are reused later
with open(folderOfData+'trainData.pkl','wb') as trainDataFile:
    pickle.dump(trainData, trainDataFile)
with open(folderOfData+'testData.pkl','wb') as testDataFile:
    pickle.dump(testData, testDataFile)
'''
Step 1: data preprocessing, including
(1) data cleaning
(2) format conversion
(3) missing-value imputation
'''
# Convert percentages with a % sign into floats
trainData['int_rate_clean'] = trainData['int_rate'].map(lambda x: float(x.replace('%',''))/100)
# Convert the length of employment, which would otherwise sort incorrectly; CareerYear is defined above
trainData['emp_length_clean'] = trainData['emp_length'].map(CareerYear)
# Loan description: treat a missing desc as one state and a non-missing desc as another
trainData['desc_clean'] = trainData['desc'].map(DescExisting)
# Dates: earliest_cr_line (date of the first credit line) has inconsistent formats, so unify it and convert to Python dates
trainData['app_date_clean'] = trainData['issue_d'].map(lambda x: ConvertDateStr(x))
trainData['earliest_cr_line_clean'] = trainData['earliest_cr_line'].map(lambda x: ConvertDateStr(x))
# mths_since_last_delinq (months since the last delinquency): the raw values include 0, so missing values are replaced by -1
trainData['mths_since_last_delinq_clean'] = trainData['mths_since_last_delinq'].map(lambda x:MakeupMissing(x))
# mths_since_last_record (months since the last public record)
trainData['mths_since_last_record_clean'] = trainData['mths_since_last_record'].map(lambda x:MakeupMissing(x))
# pub_rec_bankruptcies (number of public record bankruptcies)
trainData['pub_rec_bankruptcies_clean'] = trainData['pub_rec_bankruptcies'].map(lambda x:MakeupMissing(x))
'''
Step 2: derived variables
'''
# Ratio of the requested amount to income (limit_income)
trainData['limit_income'] = trainData.apply(lambda x: x.loan_amnt / x.annual_inc, axis = 1)
# Span from earliest_cr_line to the application date, in months
trainData['earliest_cr_to_app'] = trainData.apply(lambda x: MonthGap(x.earliest_cr_line_clean,x.app_date_clean), axis = 1)
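'''
Step 3: binning with ChiMerge. After binning we require that:
(1) there are at most 5 bins
(2) the bad rate is monotone
(3) every bin contains both good and bad samples
(4) special values such as -1 form a bin of their own
Continuous variables can be binned directly
Categorical variables:
(a) with many distinct values: encode with the bad rate first, then bin like a continuous variable
(b) with few distinct values:
    (b1) if every category contains both good and bad samples, no binning is needed
    (b2) if some category contains only good or only bad samples, it must be merged
'''
# Split the source variables by type
# num_features: continuous variables
num_features = ['int_rate_clean','emp_length_clean','annual_inc', 'dti', 'delinq_2yrs', 'earliest_cr_to_app','inq_last_6mths',
                'mths_since_last_record_clean', 'mths_since_last_delinq_clean','open_acc','pub_rec','total_acc','limit_income']
# cat_features: categorical variables
cat_features = ['home_ownership', 'verification_status','desc_clean', 'purpose', 'zip_code','addr_state','pub_rec_bankruptcies_clean']
# more_value_features holds categorical variables with more than 5 distinct values; less_value_features holds those with 5 or fewer
more_value_features = []
less_value_features = []
# First, check which categorical variables have more than 5 distinct values
for var in cat_features:
    valueCounts = len(set(trainData[var]))
    print(valueCounts)
    if valueCounts > 5:
        more_value_features.append(var)  # more than 5 values: encode with the bad rate, then bin with ChiMerge
    else:
        less_value_features.append(var)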
# (i) With 5 or fewer values: no binning if every category contains both good and bad samples; otherwise merge
merge_bin_dict = {}  # variables that need merging, and how to merge them
var_bin_list = []   # variables merged because some value has no good or no bad samples
for col in less_value_features:
    binBadRate = BinBadRate(trainData, col, 'y')[0]
    if min(binBadRate.values()) == 0 :  # merge because some value has no bad samples
        print('{} need to be combined due to 0 bad rate'.format(col))
        combine_bin = MergeBad0(trainData, col, 'y',direction = 'bad')
        merge_bin_dict[col] = combine_bin
        newVar = col + '_Bin'
        trainData[newVar] = trainData[col].map(combine_bin)
        var_bin_list.append(newVar)
    if max(binBadRate.values()) == 1:  # merge because some value has no good samples
        print('{} need to be combined due to 0 good rate'.format(col))
        combine_bin = MergeBad0(trainData, col, 'y',direction = 'good')
        merge_bin_dict[col] = combine_bin
        newVar = col + '_Bin'
        trainData[newVar] = trainData[col].map(combine_bin)
        var_bin_list.append(newVar)
# Save merge_bin_dict
with open(folderOfData+'merge_bin_dict.pkl','wb') as file1:
    pickle.dump(merge_bin_dict,file1)
# What remains in less_value_features are the variables that need no merging
less_value_features = [i for i in less_value_features if i + '_Bin' not in var_bin_list]
# (ii) With more than 5 values: encode with the bad rate and treat as a continuous variable
br_encoding_dict = {}   # variables encoded with the bad rate, and their encodings
for col in more_value_features:
    br_encoding = BadRateEncoding(trainData, col, 'y')
    trainData[col+'_br_encoding'] = br_encoding['encoding']
    br_encoding_dict[col] = br_encoding['bad_rate']
    # Once encoded with the bad rate, the variable is handled exactly like a continuous one
    num_features.append(col+'_br_encoding')
with open(folderOfData+'br_encoding_dict.pkl','wb') as file2:
    pickle.dump(br_encoding_dict,file2)
# (iii) Bin the continuous variables, including those from (ii)
continous_merged_dict = {}
for col in num_features:
    print("{} is in processing".format(col))
    if -1 not in set(trainData[col]):   # -1 is treated as a special value; without it, all values take part in binning
        max_interval = 5   # maximum number of bins
        cutOff = ChiMerge(trainData, col, 'y', max_interval=max_interval,special_attribute=[],minBinPcnt=0)
        trainData[col+'_Bin'] = trainData[col].map(lambda x: AssignBin(x, cutOff,special_attribute=[]))
        monotone = BadRateMonotone(trainData, col+'_Bin', 'y')   # check whether the binned bad rate is monotone
        while(not monotone):
            # If the bad rate is not monotone, reduce the number of bins and try again
            max_interval -= 1
            cutOff = ChiMerge(trainData, col, 'y', max_interval=max_interval, special_attribute=[],
                              minBinPcnt=0)
            trainData[col + '_Bin'] = trainData[col].map(lambda x: AssignBin(x, cutOff, special_attribute=[]))
            if max_interval == 2:
                # With 2 bins the bad rate is necessarily monotone
                break
            monotone = BadRateMonotone(trainData, col + '_Bin', 'y')
        newVar = col + '_Bin'
        trainData[newVar] = trainData[col].map(lambda x: AssignBin(x, cutOff, special_attribute=[]))
        var_bin_list.append(newVar)
    else:
        max_interval = 5
        # With -1 present, the remaining values take part in binning
        cutOff = ChiMerge(trainData, col, 'y', max_interval=max_interval, special_attribute=[-1],
                          minBinPcnt=0)
        trainData[col + '_Bin'] = trainData[col].map(lambda x: AssignBin(x, cutOff, special_attribute=[-1]))
        monotone = BadRateMonotone(trainData, col + '_Bin', 'y',['Bin -1'])
        while (not monotone):
            max_interval -= 1
            # With -1 present, the bad rate of the -1 bin is left out of the monotonicity check
            cutOff = ChiMerge(trainData, col, 'y', max_interval=max_interval, special_attribute=[-1],
                              minBinPcnt=0)
            trainData[col + '_Bin'] = trainData[col].map(lambda x: AssignBin(x, cutOff, special_attribute=[-1]))
            if max_interval == 3:
                # With 3-1=2 regular bins the bad rate is necessarily monotone
                break
            monotone = BadRateMonotone(trainData, col + '_Bin', 'y',['Bin -1'])
        newVar = col + '_Bin'
        trainData[newVar] = trainData[col].map(lambda x: AssignBin(x, cutOff, special_attribute=[-1]))
        var_bin_list.append(newVar)
    continous_merged_dict[col] = cutOff
with open(folderOfData+'continous_merged_dict.pkl','wb') as file3:
    pickle.dump(continous_merged_dict,file3)
'''
Step 4: WOE encoding and IV calculation
'''
WOE_dict = {}
IV_dict = {}
# Encode the binned variables, which are:
# 1. categorical variables with at most 5 values that need no merging, stored in less_value_features
# 2. categorical variables with at most 5 values that need merging; the merged variables are stored in var_bin_list
# 3. categorical variables with more than 5 values, bad-rate encoded and then binned; the new variables are stored in var_bin_list
# 4. continuous variables; the binned variables are stored in var_bin_list
all_var = var_bin_list + less_value_features
for var in all_var:
    woe_iv = CalcWOE(trainData, var, 'y')
    WOE_dict[var] = woe_iv['WOE']
    IV_dict[var] = woe_iv['IV']
with open(folderOfData+'WOE_dict.pkl','wb') as file4:
    pickle.dump(WOE_dict,file4)
# Sort the IV values in descending order to make variable selection easier
IV_dict_sorted = sorted(IV_dict.items(), key=lambda x: x[1], reverse=True)
IV_values = [i[1] for i in IV_dict_sorted]
IV_name = [i[0] for i in IV_dict_sorted]
plt.title('feature IV')
plt.bar(range(len(IV_values)),IV_values)
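'''
Step 5: univariate and multivariate analysis, both based on the WOE-encoded values.
(1) keep the variables whose IV is at least 0.02
(2) compare pairwise linear correlation. If the absolute correlation coefficient exceeds the threshold, drop the variable with the lower IV
'''
# Keep the variables with IV >= 0.02
high_IV = {k:v for k, v in IV_dict.items() if v >= 0.02}
high_IV_sorted = sorted(high_IV.items(),key=lambda x:x[1],reverse=True)
short_list = high_IV.keys()
short_list_2 = []
for var in short_list:
    newVar = var + '_WOE'
    trainData[newVar] = trainData[var].map(WOE_dict[var])
    short_list_2.append(newVar)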
# For the variables kept above, compute the correlation matrix and draw a heatmap
trainDataWOE = trainData[short_list_2]
f, ax = plt.subplots(figsize=(10, 8))
corr = trainDataWOE.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True),square=True, ax=ax)
# Pairwise linear correlation check
# 1. sort the candidate variables by IV in descending order
# 2. compute the linear correlation coefficient of every pair of remaining variables
# 3. when the coefficient of a pair exceeds the threshold, drop the variable with the lower IV
deleted_index = []
cnt_vars = len(high_IV_sorted)
for i in range(cnt_vars):
    if i in deleted_index:
        continue
    x1 = high_IV_sorted[i][0]+"_WOE"
    for j in range(cnt_vars):
        if i == j or j in deleted_index:
            continue
        y1 = high_IV_sorted[j][0]+"_WOE"
        roh = np.corrcoef(trainData[x1],trainData[y1])[0,1]
        if abs(roh)>0.7:
            x1_IV = high_IV_sorted[i][1]
            y1_IV = high_IV_sorted[j][1]
            if x1_IV > y1_IV:
                deleted_index.append(j)
            else:
                deleted_index.append(i)
multi_analysis_vars_1 = [high_IV_sorted[i][0]+"_WOE" for i in range(cnt_vars) if i not in deleted_index]
'''
Multivariate analysis: VIF
'''
X = trainData[multi_analysis_vars_1].values
VIF_list = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
max_VIF = max(VIF_list)
print(max_VIF)
# The largest VIF is 1.32267733123 (below 10), so no multicollinearity is assumed at this step
multi_analysis = multi_analysis_vars_1
'''
Step 6: logistic regression model.
Requirements:
1. the variables are significant
2. the coefficients are negative
'''
### (1) Feed the variables that survive the multivariate analysis into the LR model
y = trainData['y']
X = trainData[multi_analysis].copy()
X['intercept'] = [1]*X.shape[0]
LR = sm.Logit(y, X).fit()
#summary = LR.summary()
pvals = LR.pvalues
pvals = pvals.to_dict()
### Some variables are not significant and are removed step by step
varLargeP = {k: v for k,v in pvals.items() if v >= 0.1}
varLargeP = sorted(varLargeP.items(), key=lambda d:d[1], reverse = True)
while(len(varLargeP) > 0 and len(multi_analysis) > 0):
    # In each iteration, remove the least significant variable, until
    # (1) all remaining variables are significant, or
    # (2) no features are left
    varMaxP = varLargeP[0][0]
    print(varMaxP)
    if varMaxP == 'intercept':
        print('the intercept is not significant!')
        break
    multi_analysis.remove(varMaxP)
    y = trainData['y']
    X = trainData[multi_analysis].copy()
    X['intercept'] = [1] * X.shape[0]
    LR = sm.Logit(y, X).fit()
    #summary = LR.summary()
    pvals = LR.pvalues
    pvals = pvals.to_dict()
    varLargeP = {k: v for k, v in pvals.items() if v >= 0.1}
    varLargeP = sorted(varLargeP.items(), key=lambda d: d[1], reverse=True)
# Save the model
with open(folderOfData+'LR_Model_Normal.pkl','wb') as saveModel:
    pickle.dump(LR,saveModel)
def ModifyDf(x, new_value):
    if np.isnan(x):
        return new_value
    else:
        return x
'''
Apply the model to the test dataset
'''
with open(folderOfData+'testData.pkl','rb') as testDataFile:
    testData = pickle.load(testDataFile)
'''
Step 1: data preprocessing
In practice, only the fields actually used by the model need to be cleaned
'''
# Convert percentages with a % sign into floats
testData['int_rate_clean'] = testData['int_rate'].map(lambda x: float(x.replace('%',''))/100)
# Convert the length of employment, which would otherwise sort incorrectly
testData['emp_length_clean'] = testData['emp_length'].map(CareerYear)
# Treat a missing desc as one state and a non-missing desc as another
testData['desc_clean'] = testData['desc'].map(DescExisting)
# Dates: earliest_cr_line has inconsistent formats, so unify it and convert to Python dates
testData['app_date_clean'] = testData['issue_d'].map(lambda x: ConvertDateStr(x))
testData['earliest_cr_line_clean'] = testData['earliest_cr_line'].map(lambda x: ConvertDateStr(x))
# mths_since_last_delinq: the raw values include 0, so missing values are replaced by -1
testData['mths_since_last_delinq_clean'] = testData['mths_since_last_delinq'].map(lambda x:MakeupMissing(x))
testData['mths_since_last_record_clean'] = testData['mths_since_last_record'].map(lambda x:MakeupMissing(x))
testData['pub_rec_bankruptcies_clean'] = testData['pub_rec_bankruptcies'].map(lambda x:MakeupMissing(x))
'''
Step 2: derived variables
'''
# Ratio of the requested amount to income
testData['limit_income'] = testData.apply(lambda x: x.loan_amnt / x.annual_inc, axis = 1)
# Span from earliest_cr_line to the application date, in months
testData['earliest_cr_to_app'] = testData.apply(lambda x: MonthGap(x.earliest_cr_line_clean,x.app_date_clean), axis = 1)
'''
Step 3: bin the variables and substitute the WOE values
'''
with open(folderOfData+'LR_Model_Normal.pkl','rb') as modelFile:
    LR = pickle.load(modelFile)
# Only the variables that entered the model need to be processed
var_in_model = list(LR.pvalues.index)
var_in_model.remove('intercept')
with open(folderOfData+'merge_bin_dict.pkl','rb') as file1:
    merge_bin_dict = pickle.load(file1)
with open(folderOfData+'br_encoding_dict.pkl','rb') as file2:
    br_encoding_dict = pickle.load(file2)
with open(folderOfData+'continous_merged_dict.pkl','rb') as file3:
    continous_merged_dict = pickle.load(file3)
with open(folderOfData+'WOE_dict.pkl','rb') as file4:
    WOE_dict = pickle.load(file4)
for var in var_in_model:
    var1 = var.replace('_Bin_WOE','')
    # Variables with few values that nevertheless need merging
    if var1 in merge_bin_dict.keys():
        print("{} need to be regrouped".format(var1))
        testData[var1 + '_Bin'] = testData[var1].map(merge_bin_dict[var1])
    # Variables that must be encoded with the bad rate
    if var1.find('_br_encoding')>-1:
        var2 =var1.replace('_br_encoding','')
        print("{} need to be encoded by bad rate".format(var2))
        testData[var1] = testData[var2].map(br_encoding_dict[var2])
        # Some values in the test sample may not occur in the training sample, so their bad rate is unknown.
        # Such values are encoded with the worst (largest) bad rate; Series.max() skips the missing values
        max_br = testData[var1].max()
        testData[var1] = testData[var1].map(lambda x: ModifyDf(x, max_br))
    # After the steps above, bin the bad-rate encoded variables together with the continuous ones
    if var1 in continous_merged_dict:
        if -1 not in set(testData[var1]):
            testData[var1+'_Bin'] = testData[var1].map(lambda x: AssignBin(x, continous_merged_dict[var1]))
        else:
            testData[var1 + '_Bin'] = testData[var1].map(lambda x: AssignBin(x, continous_merged_dict[var1],[-1]))
    # WOE encoding
    var3 = var.replace('_WOE','')
    testData[var] = testData[var3].map(WOE_dict[var3])
'''
Step 4: feed the WOE values into the LR model and compute probabilities and scores
'''
testData['intercept'] = [1]*testData.shape[0]
# In the scoring dataset, the variable order must match the variable order of the LR model
# For example, if the debt ratio comes before the loan purpose in the training set, it must also come before the loan purpose in the test set
testData2 = testData[list(LR.params.index)]
testData['prob'] = LR.predict(testData2)
# Compute KS and AUC
auc = roc_auc_score(testData['y'],testData['prob'])
ks = KS(testData, 'prob', 'y')
basePoint = 250
PDO = 200
testData['score'] = testData['prob'].map(lambda x:Prob2Score(x, basePoint, PDO))
testData = testData.sort_values(by = 'score')
# Plot the score distribution
plt.hist(testData['score'], 100)
plt.xlabel('score')
plt.ylabel('freq')
plt.title('distribution')
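To see what the basePoint and PDO settings mean, the standalone sketch below reproduces the formula used by `Prob2Score` without the final `int()` truncation. The name `prob_to_score` is hypothetical and the probabilities are made-up inputs rather than model output. With this formula a default probability of 0.5 (odds of 1:1) maps to the base point, and every halving of the default odds adds PDO points.

```python
import math


def prob_to_score(prob, base_point, pdo):
    """Map a default probability to a score.

    score = base_point - pdo / ln(2) * ln(prob / (1 - prob))
    so the score equals base_point at prob = 0.5 and rises by pdo
    each time the odds of default are halved.
    """
    if not 0 < prob < 1:
        raise ValueError("prob must be strictly between 0 and 1")
    log_odds = math.log(prob / (1 - prob))
    return base_point - pdo / math.log(2) * log_odds


if __name__ == "__main__":
    # Same settings as in the post: basePoint = 250, PDO = 200.
    for prob in (0.5, 0.2, 0.05):
        print(prob, round(prob_to_score(prob, 250, 200), 1))
```

Because `Prob2Score` truncates the result with `int()`, the scores it produces can differ from these values by less than one point.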