Source: https://zhuanlan.zhihu.com/p/107255691?utm_source=wechat_timeline

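Basic concepts of Toad

  • Toad is a library that makes data analysis in financial scenarios very convenient. In this post I work through its documentation with examples.
  • Toad is split into 9 submodules; eight of them are listed below.
  1. toad.detector module: a more detailed version of describe
  2. toad.merge module: dedicated to binning
  3. toad.metrics module: finance-oriented model evaluation metrics that sklearn does not provide
  4. toad.plot module: plotting
  5. toad.scorecard module: builds a scorecard directly
  6. toad.selection module: judging by its functions, it drops features according to various evaluation metrics
  7. toad.stats module: computes feature entropy, Gini, IV, bad rate, and so on
  8. toad.transform module: WOE transformation

Official Toad documentation: https://toad.readthedocs.io/en/stable/toad.detector.html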

Basic Tutorial For Toad

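Next, following the official documentation, we go through Toad's basic features. The dataset used is german_credit_data.csv, linked from the original post. The example has five parts:

  1. EDA
  2. Feature selection and WOE binning
  3. Model selection
  4. Model validation
  5. Score transformation

```python
# Notebook cell: install or upgrade toad
!pip install --upgrade toad

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import toad  # Our Main Character Today!

data = pd.read_csv('german_credit_data.csv')
data.drop('Unnamed: 0', axis=1, inplace=True)
# Encode the label: good -> 0, bad -> 1
data.replace({'good': 0, 'bad': 1}, inplace=True)
```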

![image.png](https://cdn.nlark.com/yuque/0/2022/png/21930551/1648261605539-fad30f02-7e25-416c-8c5d-c9c373c3182a.png#clientId=u1a8c8094-e5a0-4&from=paste&id=u484f030c&originHeight=161&originWidth=720&originalType=url&ratio=1&rotation=0&showTitle=false&size=56910&status=done&style=none&taskId=uf0cbddee-22bf-44d6-8628-4bb10b086f5&title=)

```python
# Split into train and test, then tag each part so they can be told apart later
Xtr, Xts, Ytr, Yts = train_test_split(
    data.drop('Risk', axis=1), data['Risk'], test_size=0.25, random_state=450
)
data_tr = pd.concat([Xtr, Ytr], axis=1)
data_tr['type'] = 'train'
data_ts = pd.concat([Xts, Yts], axis=1)
data_ts['type'] = 'test'  # tag the test frame (data_ts), not the full data
print(data_tr.shape)
```

Use toad.detector.detect() to generate an EDA report of the data

```python
toad.detector.detect(data_tr).columns
# Index(['type', 'size', 'missing', 'unique', 'mean_or_top1', 'std_or_top2',
#        'min_or_top3', '1%_or_top4', '10%_or_top5', '50%_or_bottom5',
#        '75%_or_bottom4', '90%_or_bottom3', '99%_or_bottom2', 'max_or_bottom1'],
#       dtype='object')
toad.detector.detect(data_tr)
```

Feature selection and WOE transformation

  • Use toad.selection.select() to filter features by missing rate (empty), IV (iv), and correlation between features (corr)

```python
selected_data, drop_lst = toad.selection.select(
    data_tr, target='Risk', empty=0.5,
    iv=0.05, corr=0.7, return_drop=True, exclude=['type']
)

# Keep the same columns in the test set
selected_test = data_ts[selected_data.columns]
print(drop_lst)
# {'empty': array([], dtype=float64), 'iv': array(['Sex', 'Job'], dtype=object), 'corr': array([], dtype=object)}

quality = toad.quality(data, 'Risk')
quality.sort_values('iv', ascending=False)
```

![image.png](https://cdn.nlark.com/yuque/0/2022/png/21930551/1648261668101-03e2b4aa-886e-4625-bea5-cc333a098751.png#clientId=u1a8c8094-e5a0-4&from=paste&id=uc92727c9&originHeight=568&originWidth=714&originalType=url&ratio=1&rotation=0&showTitle=false&size=180469&status=done&style=none&taskId=ubfc10bbb-db8d-4091-9508-d872a889302&title=)
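For intuition about the `iv` threshold passed to `toad.selection.select()` and the `iv` column returned by `toad.quality()`, here is a minimal sketch of how information value is commonly defined for an already-binned feature. It is a plain-Python illustration, not toad's implementation; the function name `information_value`, the `eps` smoothing constant, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter

def information_value(bins, target, eps=1e-6):
    """IV of a binned feature against a 0/1 target (1 = bad).

    bins and target are equal-length sequences. eps is an assumed
    smoothing constant that avoids log(0) when a bin has no goods or bads.
    """
    good = Counter(b for b, y in zip(bins, target) if y == 0)
    bad = Counter(b for b, y in zip(bins, target) if y == 1)
    n_good = sum(good.values())
    n_bad = sum(bad.values())
    iv = 0.0
    for b in set(bins):
        p_bad = max(bad[b] / n_bad, eps)     # share of all bads in this bin
        p_good = max(good[b] / n_good, eps)  # share of all goods in this bin
        iv += (p_bad - p_good) * math.log(p_bad / p_good)
    return iv

# Toy data: bin "a" is mostly good, bin "b" is mostly bad.
iv_informative = information_value(["a"] * 4 + ["b"] * 4, [0, 0, 0, 1, 0, 1, 1, 1])
# A feature whose bins have identical bad rates carries no information.
iv_useless = information_value(["a", "a", "b", "b"], [0, 1, 0, 1])
print(iv_informative, iv_useless)
```

A feature such as `Sex` or `Job` is dropped above because its IV falls below the chosen 0.05 cut-off, which is the kind of near-zero value the second toy feature produces.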
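## **Binning and merging with the Combiner() object**

1. toad.transform.Combiner() merges the bins of numerical or categorical features. Toad supports chi-square binning, decision-tree binning, and percentile binning.
2. combiner().fit(data, y='target', method='chi', min_samples=None, n_bins=None): fits the bins. The method parameter supports 'chi', 'dt', 'percentile', and 'step'.
3. combiner().set_rules(dict): sets the confirmed bins.
4. combiner().transform(data): converts features into the confirmed bins.
5. toad.transform.WOETransformer(): applies the WOE transformation to the binned data.
6. WOETransformer().fit_transform(data, y_true, exclude=None): WOE-transforms the data; exclude takes the columns that should not be transformed.
7. WOETransformer().transform(data): applies the already-fitted transformer to the test and validation sets.

### **Plotting to help adjust the binning**

1. toad.plot.badrate_plot(data, target='target', x=None, by=None): visualizes how the bad rate of each bin changes between the training and test sets.
2. toad.plot.proportion_plot(data[col]): shows the share of samples that falls into each bin of a feature.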
```python
# Instantiate a Combiner object
combiner = toad.transform.Combiner()
# Fit it and choose the binning algorithm
combiner.fit(selected_data, y='Risk', method='chi', min_samples=0.05, exclude='type')
# Export the bins
bins = combiner.export()
bins
# {'Age': [26, 28, 35, 39, 49],
#  'Housing': [['own'], ['free'], ['rent']],
#  'Saving accounts': [['nan'],
#   ['rich'],
#   ['quite rich'],
#   ['little'],
#   ['moderate']],
#  'Checking account': [['nan'], ['rich'], ['moderate'], ['little']],
#  'Credit amount': [2145, 3914],
#  'Duration': [9, 12, 18, 33],
#  'Purpose': [['domestic appliances', 'radio/TV'],
#   ['car'],
#   ['furniture/equipment', 'repairs', 'business'],
#   ['education', 'vacation/others']]}

# Use badrate_plot to find better bins
%matplotlib inline
from toad.plot import badrate_plot, proportion_plot

adj_bin = {'Age': [26, 28, 35, 39, 49]}
c2 = toad.transform.Combiner()
c2.set_rules(adj_bin)
data_ = pd.concat([data_tr, data_ts], axis=0)
temp_data = c2.transform(data_[['Age', 'Risk', 'type']])
badrate_plot(temp_data, target='Risk', x='type', by='Age')
proportion_plot(temp_data['Age'])
```

```python
# Try a different set of bins
adj_bin = {'Age': [20, 25, 28, 30, 35, 39, 49]}
c2.set_rules(adj_bin)
temp_data = c2.transform(data_[['Age', 'Risk', 'type']])
badrate_plot(temp_data, target='Risk', x='type', by='Age')
```
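What `badrate_plot` draws can be approximated by hand: assign each value to a bin from the cut points, then take the mean of the 0/1 target inside each bin. The sketch below does that in plain Python. The helper `bad_rate_by_bin`, the toy ages, and the left-closed interval convention are assumptions for illustration and are not taken from toad's source.

```python
from bisect import bisect_right
from collections import defaultdict

def bad_rate_by_bin(values, target, cuts):
    """Bad rate (mean of a 0/1 target) per bin.

    cuts are ascending split points. A value v lands in bin i, where i is
    the number of cuts <= v (assumed left-closed intervals [lo, hi)).
    """
    total = defaultdict(int)
    bad = defaultdict(int)
    for v, y in zip(values, target):
        i = bisect_right(cuts, v)
        total[i] += 1
        bad[i] += y
    return {i: bad[i] / total[i] for i in sorted(total)}

ages = [22, 24, 27, 31, 36, 45, 52, 60]
risk = [1, 1, 0, 1, 0, 0, 0, 0]
rates = bad_rate_by_bin(ages, risk, cuts=[26, 35, 49])
print(rates)
```

Running the same calculation separately on the train and test rows is what lets you judge whether a bin's bad rate is stable, or whether neighbouring bins should be merged.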

```python
# Once the bins are confirmed, apply them
combiner.set_rules(adj_bin)
binned_data = combiner.transform(selected_data)
# Fit the WOE mapping on the training set, then reuse it on the test set
transer = toad.transform.WOETransformer()
data_tr_woe = transer.fit_transform(binned_data, binned_data['Risk'], exclude=['Risk', 'type'])
data_ts_woe = transer.transform(combiner.transform(selected_test))
```
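The fit-on-train, apply-on-test pattern used with `WOETransformer` can be illustrated with a tiny hand-rolled version. This is a sketch under stated assumptions, not toad's code: the sign convention (WOE = ln of bad share over good share, so a riskier bin gets a larger value), the helper names `fit_woe` and `apply_woe`, and the `eps` smoothing are all choices made for this example, and other libraries use the opposite sign.

```python
import math
from collections import Counter

def fit_woe(bins, target, eps=1e-6):
    """Learn a bin -> WOE mapping from training data (target: 1 = bad)."""
    good = Counter(b for b, y in zip(bins, target) if y == 0)
    bad = Counter(b for b, y in zip(bins, target) if y == 1)
    n_good = sum(good.values())
    n_bad = sum(bad.values())
    return {
        b: math.log(max(bad[b] / n_bad, eps) / max(good[b] / n_good, eps))
        for b in set(bins)
    }

def apply_woe(bins, woe_map):
    """Replace each bin label with the WOE learned on the training data."""
    return [woe_map[b] for b in bins]

train_bins = ["young"] * 4 + ["old"] * 4
train_y = [1, 1, 1, 0, 0, 0, 0, 1]
woe_map = fit_woe(train_bins, train_y)
# The test set never influences the mapping; it is only looked up.
test_woe = apply_woe(["old", "young"], woe_map)
print(woe_map, test_woe)
```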

```python
# Now ready to model. Fit a LR.
Xtr = data_tr_woe.drop(['Risk', 'type'], axis=1)
Ytr = data_tr_woe['Risk']
Xts = data_ts_woe.drop(['Risk', 'type'], axis=1)
Yts = data_ts_woe['Risk']
lr = LogisticRegression()
lr.fit(Xtr, Ytr)
```

Model validation with a variety of metrics

  • KS, F1, AUC, and more are supported

```python
from toad.metrics import KS, F1, AUC

# Predicted probability of the positive (bad) class on the training set
EYtr_proba = lr.predict_proba(Xtr)[:, 1]
EYtr = lr.predict(Xtr)

print('Training error')
print('F1:', F1(EYtr_proba, Ytr))
print('KS:', KS(EYtr_proba, Ytr))
print('AUC:', AUC(EYtr_proba, Ytr))

EYts_proba = lr.predict_proba(Xts)[:, 1]
EYts = lr.predict(Xts)

print('\nTest error')
print('F1:', F1(EYts_proba, Yts))
print('KS:', KS(EYts_proba, Yts))
print('AUC:', AUC(EYts_proba, Yts))
# Training error
# F1: 0.4540763673890609
# KS: 0.45453626569857064
# AUC: 0.7812139385618382
#
# Test error
# F1: 0.44720496894409933
# KS: 0.46993266775017406
# AUC: 0.7755978639424193
```
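KS is the least familiar of these three metrics outside credit risk, so here is a minimal sketch of the usual definition: the largest gap between the cumulative share of bads and the cumulative share of goods as the score threshold sweeps upward. The function `ks_statistic` and the toy scores are assumptions for illustration, and the sketch is not claimed to match toad's `KS` numerically in every edge case.

```python
def ks_statistic(scores, target):
    """Maximum gap between the cumulative bad and good distributions.

    scores: predicted probabilities or scores; target: 0/1 labels (1 = bad).
    """
    pairs = sorted(zip(scores, target))
    n_bad = sum(target)
    n_good = len(target) - n_bad
    cum_bad = cum_good = 0
    ks = 0.0
    for idx, (score, y) in enumerate(pairs):
        if y == 1:
            cum_bad += 1
        else:
            cum_good += 1
        # Evaluate the gap only after all samples tied at this score are counted.
        last_of_tie = idx + 1 == len(pairs) or pairs[idx + 1][0] != score
        if last_of_tie:
            ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
    return ks

# Perfect separation: every good scores below every bad.
ks_perfect = ks_statistic([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
# Partial overlap between the two classes.
ks_mixed = ks_statistic([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(ks_perfect, ks_mixed)
```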

Compute the PSI between the training set and the test set

```python
psi = toad.metrics.PSI(data_tr_woe, data_ts_woe)
psi.sort_values(0, ascending=False)
```
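The population stability index compares how one feature is distributed in two samples. Because the WOE-encoded columns take only a handful of distinct values, a per-column PSI can be sketched by counting categories. The code below is a plain-Python illustration of the standard formula; the function name `psi_for_column`, the `eps` smoothing, and the toy data are assumptions, and it is not toad's implementation.

```python
import math
from collections import Counter

def psi_for_column(expected, actual, eps=1e-6):
    """PSI between two samples of one discrete (already binned) feature."""
    e_counts = Counter(expected)
    a_counts = Counter(actual)
    total = 0.0
    for k in set(e_counts) | set(a_counts):
        p_e = max(e_counts[k] / len(expected), eps)  # share in the reference sample
        p_a = max(a_counts[k] / len(actual), eps)    # share in the comparison sample
        total += (p_a - p_e) * math.log(p_a / p_e)
    return total

train_col = ["a"] * 5 + ["b"] * 5
psi_same = psi_for_column(train_col, ["a", "b"] * 10)            # same 50/50 mix
psi_shifted = psi_for_column(train_col, ["a"] * 8 + ["b"] * 2)   # mix moved to 80/20
print(psi_same, psi_shifted)
```

A value near zero means the feature is distributed the same way in both sets; sorting the per-feature values in descending order, as the toad snippet does, puts the least stable features first.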

Generate a model report (I find this one really thoughtful)

```python
tr_bucket = toad.metrics.KS_bucket(EYtr_proba, Ytr, bucket=10, method='quantile')
tr_bucket
```

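Scorecard score transformation

  • Once the bins are confirmed, just pass in the combiner and transer objects together with the model hyperparameters. The scorecard can also return the score assigned to each bin of each feature.

```python
card = toad.scorecard.ScoreCard(combiner=combiner, transer=transer, C=0.1)
card.fit(Xtr, Ytr)
card.export(to_frame=True)
# Voila, the scorecard is done
```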

```python
pred_scores = card.predict(data_ts)
print('Sample scores:', pred_scores[:10])
print('Test KS: ', KS(pred_scores, data_ts['Risk']))
# Sample scores: [588.39992196 473.34800722 657.21263451 498.44359981 577.26501354
#  604.90807613 615.34696972 502.9847795  590.77572458 530.03966734]
# Test KS:  0.45468616980109905
```
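Scores in the hundreds like the ones printed here usually come from a linear rescaling of the model's log-odds, often called "points to double the odds" (PDO) scaling. The sketch below shows that rescaling in isolation. The numbers `base_score=600`, `base_odds=50`, and `pdo=20`, the good-to-bad odds convention, and the helper `make_scaler` are illustrative assumptions and are not the settings of the `ScoreCard` fitted above.

```python
import math

def make_scaler(base_score=600.0, base_odds=50.0, pdo=20.0):
    """Build a function that maps P(bad) to a score.

    Assumed convention: odds are good:bad, a sample at base_odds gets
    base_score, and the score rises by pdo each time the odds double.
    """
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)

    def to_score(p_bad):
        odds = (1.0 - p_bad) / p_bad
        return offset + factor * math.log(odds)

    return to_score

to_score = make_scaler()
s_base = to_score(1 / 51)     # odds of 50:1
s_double = to_score(1 / 101)  # odds of 100:1, i.e. doubled
print(s_base, s_double)
```

Since a logistic regression on WOE features is linear in the log-odds, the same scaling can be distributed over the bins of each feature, which is why a scorecard can list points per bin.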

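GBDT encoding, as a preprocessing step for GBDT + LR modeling

```python
# final_data and col are not defined earlier in this post;
# substitute your own DataFrame and list of feature names.
gbdt_transer = toad.transform.GBDTTransformer()
gbdt_transer.fit(final_data[col + ['target']], 'target', n_estimators=10, max_depth=2)
gbdt_vars = gbdt_transer.transform(final_data[col])
gbdt_vars.shape
# (43576, 40)
```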

A complete code example

```python
import pandas as pd
import numpy as np
import toad
from sklearn.datasets import load_iris

# Load the data
iris = load_iris()
target = iris['target']
iris = pd.DataFrame(iris['data'], columns=iris['feature_names'])
iris['target'] = target

# Data insight
eda = toad.detect(iris)
qualitys = toad.quality(iris, 'target', iv_only=True)

# Binning
c = toad.transform.Combiner()
# Fit on the feature-selected data: use the stable chi-square binning and require
# at least 5% of the samples in each bin. Missing values are merged into the best bin.
c.fit(iris, y='target', method='chi', min_samples=0.05)  # empty_separate=False
# For demonstration, show only the bins of the first three features
bins = c.export()
for name in list(bins)[:3]:
    print(name + ':', bins[name])

# Plot the bins of one feature
from toad.plot import bin_plot
col = 'sepal length (cm)'
bin_plot(c.transform(iris[[col, 'target']], labels=True), x=col, target='target')
# Bars show the share of samples; the red line shows the share of positives (e.g. the bad rate)

# Tree-model feature output for GBDT + LR
gbdt_transer = toad.transform.GBDTTransformer()
gbdt_transer.fit(iris, 'target', n_estimators=10, max_depth=2)
gbdt_vars = gbdt_transer.transform(
    iris[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
)
iris.shape, gbdt_vars.shape
```