I. Random Forest

Ensemble algorithms

Ensemble learning is not a single machine-learning algorithm. Instead, it builds several models on the data and combines all of their results. Ensembles show up in practically every area of machine learning: they are used to model marketing simulations, to analyze customer acquisition, retention and churn, and to predict disease risk and patient susceptibility. In today's algorithm competitions, ensemble methods such as random forests, gradient-boosted trees (GBDT) and XGBoost are everywhere.

The goal of an ensemble algorithm: consider the modeling results of several estimators and aggregate them into one combined result, in order to obtain better regression or classification performance than any single model.

The combined model is called the ensemble estimator, and each model that goes into it is called a base estimator. There are generally three families of ensemble methods: bagging, boosting, and stacking.
The core idea of bagging is to build many mutually independent estimators and combine their predictions by averaging or by majority vote; the representative bagging model is the random forest. In boosting, the base estimators are dependent and are built one after another: each round concentrates the weak estimators' effort on the samples that are still hard to predict, gradually forming a strong estimator. Representative boosting models are AdaBoost and gradient-boosted trees.
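As a quick illustration (my addition, not part of the original notes; the dataset and settings are only placeholders), sklearn exposes both families directly: BaggingClassifier trains independent trees on bootstrap samples and votes, while AdaBoostClassifier builds trees sequentially, reweighting the samples the previous trees misclassified. A minimal sketch:

from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y = True)
tree = DecisionTreeClassifier(max_depth = 3)

# bagging: independent trees, results combined by voting
bag = BaggingClassifier(tree, n_estimators = 25)
# boosting: trees built one after another, each focusing on the previously misclassified samples
boost = AdaBoostClassifier(tree, n_estimators = 25)

print(cross_val_score(bag, X, y, cv = 5).mean())
print(cross_val_score(boost, X, y, cv = 5).mean())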

The random forest classifier: RandomForestClassifier

The random forest is the canonical bagging ensemble: every base estimator is a decision tree. A forest built from classification trees is a random forest classifier; a forest built from regression trees is a random forest regressor.

Parameters that control the base estimators

| Parameter | Meaning |
| --- | --- |
| criterion | Impurity measure; either Gini impurity or information entropy |
| max_depth | Maximum depth of each tree; branches beyond this depth are pruned |
| min_samples_leaf | Every child node produced by a split must contain at least min_samples_leaf training samples, otherwise the split is not made |
| min_samples_split | A node must contain at least min_samples_split training samples before it is allowed to split |
| max_features | Limits the number of features considered at each split; features beyond the limit are discarded. Defaults to the square root of the total number of features (rounded) |
| min_impurity_decrease | Splits whose impurity decrease (information gain) falls below this threshold are not made |
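These are the same pruning parameters one would set on a single DecisionTreeClassifier; the forest simply passes them on to every tree it builds. A minimal sketch (the values are illustrative, not tuned):

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(
    n_estimators = 100,           # number of trees in the forest
    criterion = 'gini',           # impurity measure, 'gini' or 'entropy'
    max_depth = 6,                # prune branches deeper than 6
    min_samples_leaf = 5,         # every leaf keeps at least 5 training samples
    min_samples_split = 10,       # a node needs at least 10 samples before it may split
    max_features = 'sqrt',        # consider sqrt(n_features) candidate features per split
    min_impurity_decrease = 0.0   # only split when impurity decreases by at least this much
)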

The basic sklearn modeling workflow

  1. Split into training and test sets

[Note] In sklearn, features and labels are passed in separately, X before y and train before test: Xtrain, Xtest, Ytrain, Ytest

  2. Instantiate the model
  3. Fit the instantiated model on the training set, using the fit interface
  4. Feed the test set into the trained model through its other interfaces to get results: accuracy via the score interface, or predictions for Y_test via the predict interface (a compact sketch of steps 1 to 4 follows this list)
  5. Explore the stability of the model

    1. Cross-validation
      1. Instantiate
      2. Pass the full feature matrix and label vector; cross-validation splits the training and test folds automatically
      3. Plot the scores
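A compact sketch of steps 1 through 4, assuming the wine dataset that the rest of these notes use (the full walk-through comes below):

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y = True)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size = 0.3)   # 1. split
rfc = RandomForestClassifier(n_estimators = 100)                         # 2. instantiate
rfc = rfc.fit(Xtrain, Ytrain)                                            # 3. fit on the training set
print(rfc.score(Xtest, Ytest))                                           # 4. score on the test set
print(rfc.predict(Xtest)[:5])                                            #    or get predictions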

Topics covered below: the learning curve and the oob_score_ attribute.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
# cross-validation splits the training and test folds automatically, so train_test_split is not needed there
from scipy.special import comb
import matplotlib.pyplot as plt
%matplotlib inline

wine = load_wine()

Xtrain, Xtest, Ytrain, Ytest = train_test_split(wine.data, wine.target, test_size = 0.3)

clf = DecisionTreeClassifier(random_state = 0)
rfc = RandomForestClassifier(random_state = 0)
clf = clf.fit(Xtrain, Ytrain)
rfc = rfc.fit(Xtrain, Ytrain)
score_c = clf.score(Xtest, Ytest)
score_r = rfc.score(Xtest, Ytest)
print('Single Tree:{}'.format(score_c), '\nRandom Forest:{}'.format(score_r))
print()

Single Tree:0.8703703703703703
Random Forest:0.9814814814814815
C:\anaconda\lib\site-packages\sklearn\ensemble\forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

Cross-validation: plot a comparison of the random forest and the decision tree under one round of cross-validation

clf = DecisionTreeClassifier()
rfc = RandomForestClassifier(n_estimators = 25)
score_c = cross_val_score(clf, wine.data, wine.target, cv = 10)
score_r = cross_val_score(rfc, wine.data, wine.target, cv = 10)
plt.plot(range(1, 11), score_c, label = 'Single Tree')  # one figure can hold several lines; each plot call takes one label
plt.plot(range(1, 11), score_r, label = 'Random Forest')
plt.legend()

<matplotlib.legend.Legend at 0x1e8d82f8cf8>

(plot: per-fold cross-validation scores, single tree vs. random forest)

Plot a comparison of the random forest and the decision tree over ten rounds of cross-validation

score_c = []
score_r = []
for i in range(10):
    clf = DecisionTreeClassifier()
    rfc = RandomForestClassifier(n_estimators = 25)
    score_c.append(cross_val_score(clf, wine.data, wine.target, cv = 10).mean())
    score_r.append(cross_val_score(rfc, wine.data, wine.target, cv = 10).mean())
plt.plot(range(1, 11), score_c, label = 'Single Tree')
plt.plot(range(1, 11), score_r, label = 'Random Forest')
plt.legend()

<matplotlib.legend.Legend at 0x1e8da38fe48>

(plot: mean cross-validation score over ten runs, single tree vs. random forest)

Learning curve for n_estimators

superpa = []
for i in range(200):
    rfc = RandomForestClassifier(n_estimators = i + 1, n_jobs = -1)
    rfc_s = cross_val_score(rfc, wine.data, wine.target, cv = 10).mean()
    superpa.append(rfc_s)
print(max(superpa), superpa.index(max(superpa)) + 1)  # print the best accuracy and the n_estimators (number of trees) that achieved it
plt.figure(figsize = [20, 5])
plt.plot(range(1, 201), superpa)

0.9888888888888889 27
[<matplotlib.lines.Line2D at 0x1e8d95736d8>]

(plot: learning curve of cross-validation accuracy against n_estimators)

Important attribute: estimators_

# probability that a 25-tree majority vote is wrong when each tree errs independently with rate 0.2,
# i.e. the probability that at least 13 of the 25 trees are wrong
np.array([comb(25, i) * (0.2 ** i) * (0.8 ** (25 - i)) for i in range(13, 26)]).sum()

0.00036904803455582827
rfc = RandomForestClassifier(n_estimators = 25, random_state = 2)
rfc = rfc.fit(Xtrain, Ytrain)

rfc.estimators_[:2]

[DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False,
            random_state=1462290116, splitter='best'),
 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False,
            random_state=2007616322, splitter='best')]
for i in range(len(rfc.estimators_)):
    print(rfc.estimators_[i].random_state)

1872583848
794921487
111352301
1853453896
213298710
1922988331
1869695442
2081981515
1805465960
1376693511
1418777250
663257521
878959199
854108747
512264917
515183663
1287007039
2083814687
1146014426
570104212
520265852
1366773364
125164325
786090663
578016451

The oob_score_ attribute
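Background worth spelling out (my addition, not in the original notes): with bootstrap sampling, each tree draws n samples with replacement, so the probability that a given sample is never drawn for a particular tree is (1 - 1/n)^n, which approaches 1/e as n grows. Those roughly 37% of samples are the tree's out-of-bag (oob) samples, and with oob_score = True the forest uses them as a built-in validation set, so no separate train/test split is needed. A quick check of the number:

import numpy as np  # already imported above
n = 178                     # sample count of the wine dataset
print((1 - 1/n) ** n)       # about 0.37
print(np.exp(-1))           # 1/e, also about 0.37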

rfc = RandomForestClassifier(n_estimators = 25, oob_score = True)
rfc.fit(wine.data, wine.target)
rfc.oob_score_

0.9719101123595506

Other important attributes and interfaces

rfc = RandomForestClassifier(n_estimators = 25, oob_score = True)
rfc.fit(Xtrain, Ytrain)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=25,
            n_jobs=None, oob_score=True, random_state=None,
            verbose=0, warm_start=False)

rfc.score(Xtest, Ytest)

0.9814814814814815

rfc.feature_importances_

array([0.04545681, 0.05468689, 0.01279688, 0.03417742, 0.02951136,
       0.05256852, 0.1602701 , 0.00956776, 0.00705849, 0.18598547,
       0.10212229, 0.04885997, 0.25693803])
[*zip(wine.feature_names, rfc.feature_importances_)]

[('alcohol', 0.04545681330313256),
 ('malic_acid', 0.05468689429182469),
 ('ash', 0.01279688364085699),
 ('alcalinity_of_ash', 0.03417741743145594),
 ('magnesium', 0.029511363752189344),
 ('total_phenols', 0.052568517086408285),
 ('flavanoids', 0.16027009795568858),
 ('nonflavanoid_phenols', 0.009567755401484716),
 ('proanthocyanins', 0.007058492408035047),
 ('color_intensity', 0.18598547273378016),
 ('hue', 0.10212229034493969),
 ('od280/od315_of_diluted_wines', 0.0488599676410731),
 ('proline', 0.25693803400913084)]
rfc.apply(Xtest)

array([[ 4,  9, 11, ..., 13,  8,  5],
       [19, 10,  8, ..., 10, 12, 11],
       [17,  4,  3, ...,  2,  9,  5],
       ...,
       [ 4,  9, 13, ..., 16,  8, 17],
       [17,  2,  3, ...,  2,  3,  5],
       [21, 10, 15, ..., 22, 18, 22]], dtype=int64)

rfc.predict(Xtest)

array([2, 1, 1, 1, 1, 2, 1, 0, 2, 1, 1, 1, 2, 2, 1, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 0, 2, 2, 2, 1, 2, 2, 1, 2, 1, 1, 2, 0, 2, 0, 2, 2, 0,
       1, 0, 2, 0, 1, 1, 2, 2, 1, 0])

Ytest

array([2, 1, 1, 1, 1, 2, 1, 0, 2, 1, 1, 1, 2, 1, 1, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 0, 2, 2, 2, 1, 2, 2, 1, 2, 1, 1, 2, 0, 2, 0, 2, 2, 0,
       1, 0, 2, 0, 1, 1, 2, 2, 1, 0])

rfc.predict_proba(Xtest)[:10]

array([[0.  , 0.12, 0.88],
       [0.24, 0.68, 0.08],
       [0.  , 1.  , 0.  ],
       [0.  , 1.  , 0.  ],
       [0.  , 1.  , 0.  ],
       [0.  , 0.04, 0.96],
       [0.04, 0.96, 0.  ],
       [1.  , 0.  , 0.  ],
       [0.08, 0.04, 0.88],
       [0.32, 0.48, 0.2 ]])

Another necessary condition for bagging

np.array([comb(25, i) * (0.5 ** i) * (0.5 ** (25 - i)) for i in range(13, 26)]).sum()

0.5

y = []
for epsilon in np.linspace(0, 1, 20):
    E = np.array([comb(25, i) * (epsilon ** i) * ((1 - epsilon) ** (25 - i)) for i in range(13, 26)]).sum()
    y.append(E)
plt.plot(np.linspace(0, 1, 20), y, '-o', label = 'when estimators are different')
plt.plot(np.linspace(0, 1, 20), np.linspace(0, 1, 20), '--', label = 'if all estimators are same')
plt.xlabel("individual estimator's error")
plt.ylabel("RandomForest's error")
plt.legend()

<matplotlib.legend.Legend at 0x1e8da435cf8>

(plot: forest error as a function of the individual estimator's error)

A random forest only works well if every base classifier is more accurate than 50%.
So before relying on a random forest, make sure the classification trees that compose it each reach at least 50% accuracy (one way to check a fitted forest is sketched below).
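As one way to do that check (a sketch, reusing the rfc, Xtest and Ytest already fitted on the wine data earlier in these notes): every tree in estimators_ is itself a fitted DecisionTreeClassifier and can be scored on its own:

tree_acc = [tree.score(Xtest, Ytest) for tree in rfc.estimators_]
print(min(tree_acc), max(tree_acc))  # every base tree should stay well above 0.5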

The random forest regressor: RandomForestRegressor

from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor

load_boston = load_boston()  # note: this rebinds the name load_boston to the dataset Bunch, not the loader function

rfr1 = RandomForestRegressor(n_estimators = 100, random_state = 0)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(load_boston.data, load_boston.target, test_size = 0.3)
rfr1 = rfr1.fit(Xtrain, Ytrain)
score = cross_val_score(rfr1, load_boston.data, load_boston.target, cv = 10, scoring = 'neg_mean_squared_error')

score

array([-10.72900447,  -5.36049859,  -4.74614178, -20.84946337,
       -12.23497347, -17.99274635,  -6.8952756 , -93.78884428,
       -29.80411702, -15.25776814])

rfr2 = RandomForestRegressor(n_estimators = 100, random_state = 0)
score = cross_val_score(rfr2, load_boston.data, load_boston.target, cv = 10, scoring = 'neg_mean_squared_error')

score

array([-10.72900447,  -5.36049859,  -4.74614178, -20.84946337,
       -12.23497347, -17.99274635,  -6.8952756 , -93.78884428,
       -29.80411702, -15.25776814])

Model scoring metrics in sklearn

from sklearn.metrics import SCORERS

sorted(SCORERS.keys())[:10]

['accuracy',
 'adjusted_mutual_info_score',
 'adjusted_rand_score',
 'average_precision',
 'balanced_accuracy',
 'brier_score_loss',
 'completeness_score',
 'explained_variance',
 'f1',
 'f1_macro']

II. Filling Missing Values with Random Forest Regression

  1. Fill with 0
  2. Fill with the mean
  3. Fill with random forest regression
  4. Compare each imputation method against the original dataset

Using the Boston housing dataset as the example, load the complete dataset and explore it

from sklearn.impute import SimpleImputer  # the class for imputing missing values

dataset = load_boston
dataset.data.shape

(506, 13)

# 506 * 13 = 6578 values in total
x_full, y_full = dataset.data, dataset.target
n_samples = x_full.shape[0]
n_features = x_full.shape[1]

Injecting missing values into the complete dataset

The missing values go into the features, not the labels. The labels must stay complete; otherwise the problem turns into unsupervised learning.

rng = np.random.RandomState(0)  # a seeded random state fixes the random pattern
# decide what proportion of values to blank out; here we assume 50%, i.e. 3289 missing values in total
missing_rate = 0.5
n_missing_samples = int(np.floor(n_samples * n_features * missing_rate))
# 1. np.floor rounds down and returns a float ending in .0
# 2. int() converts that float to an integer
n_missing_samples

3289

# the missing values should be scattered across all rows and columns, and each missing value needs one row index and one column index
# create index arrays: 3289 row indices in the range 0-506 and 3289 column indices in the range 0-13
# then use those indices to set 3289 arbitrary positions in the data to NaN
missing_features = rng.randint(0, n_features, n_missing_samples)
missing_samples = rng.randint(0, n_samples, n_missing_samples)
# if we needed fewer values than the 506 samples, we could sample with np.random.choice, which draws values without repetition
# np.random.choice spreads the values out and keeps them from piling up in just a few rows
# we do not use np.random.choice here, because we are drawing 3289 values, far more than the 506 samples, and that would raise an error

x_missing = x_full.copy()
y_missing = y_full.copy()
x_missing[missing_samples, missing_features] = None
x_missing[:5]

array([[       nan, 1.8000e+01,        nan,        nan, 5.3800e-01,
               nan, 6.5200e+01, 4.0900e+00, 1.0000e+00, 2.9600e+02,
               nan,        nan, 4.9800e+00],
       [2.7310e-02, 0.0000e+00,        nan, 0.0000e+00, 4.6900e-01,
               nan, 7.8900e+01, 4.9671e+00, 2.0000e+00,        nan,
               nan, 3.9690e+02, 9.1400e+00],
       [2.7290e-02,        nan, 7.0700e+00, 0.0000e+00,        nan,
        7.1850e+00, 6.1100e+01,        nan, 2.0000e+00, 2.4200e+02,
               nan,        nan,        nan],
       [       nan,        nan,        nan, 0.0000e+00, 4.5800e-01,
               nan, 4.5800e+01,        nan,        nan, 2.2200e+02,
        1.8700e+01,        nan,        nan],
       [       nan, 0.0000e+00, 2.1800e+00, 0.0000e+00,        nan,
        7.1470e+00,        nan,        nan,        nan,        nan,
        1.8700e+01,        nan, 5.3300e+00]])
x_missing = pd.DataFrame(x_missing)

x_missing.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | NaN | 18.0 | NaN | NaN | 0.538 | NaN | 65.2 | 4.0900 | 1.0 | 296.0 | NaN | NaN | 4.98 |
| 1 | 0.02731 | 0.0 | NaN | 0.0 | 0.469 | NaN | 78.9 | 4.9671 | 2.0 | NaN | NaN | 396.9 | 9.14 |
| 2 | 0.02729 | NaN | 7.07 | 0.0 | NaN | 7.185 | 61.1 | NaN | 2.0 | 242.0 | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | 0.0 | 0.458 | NaN | 45.8 | NaN | NaN | 222.0 | 18.7 | NaN | NaN |
| 4 | NaN | 0.0 | 2.18 | 0.0 | NaN | 7.147 | NaN | NaN | NaN | NaN | 18.7 | NaN | 5.33 |

1. Fill missing values with 0

x_missing = x_full.copy()
y_missing = y_full.copy()
x_missing[missing_samples, missing_features] = None
x_missing = pd.DataFrame(x_missing)
# instantiate the imputer
imp_zero = SimpleImputer(missing_values = np.nan, strategy = 'constant', fill_value = 0)
# the fit_transform interface combines fit (train) and transform (export)
x_missing_zero = imp_zero.fit_transform(x_missing)
x_missing_zero

array([[0.0000e+00, 1.8000e+01, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00],
       ...,
       [0.0000e+00, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 0.0000e+00,
        5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 0.0000e+00, 3.9690e+02,
        7.8800e+00]])
pd.DataFrame(x_missing_zero).isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
dtype: int64

2. Fill missing values with the mean

x_missing = x_full.copy()
y_missing = y_full.copy()
x_missing[missing_samples, missing_features] = None
x_missing = pd.DataFrame(x_missing)
# instantiate the imputer
imp_mean = SimpleImputer(missing_values = np.nan, strategy = 'mean')
# the fit_transform interface combines fit (train) and transform (export)
x_missing_mean = imp_mean.fit_transform(x_missing)
x_missing_mean

array([[3.62757895e+00, 1.80000000e+01, 1.11634641e+01, ...,
        1.85211921e+01, 3.52741952e+02, 4.98000000e+00],
       [2.73100000e-02, 0.00000000e+00, 1.11634641e+01, ...,
        1.85211921e+01, 3.96900000e+02, 9.14000000e+00],
       [2.72900000e-02, 1.07229508e+01, 7.07000000e+00, ...,
        1.85211921e+01, 3.52741952e+02, 1.29917666e+01],
       ...,
       [3.62757895e+00, 1.07229508e+01, 1.19300000e+01, ...,
        2.10000000e+01, 3.52741952e+02, 5.64000000e+00],
       [1.09590000e-01, 0.00000000e+00, 1.19300000e+01, ...,
        2.10000000e+01, 3.93450000e+02, 6.48000000e+00],
       [4.74100000e-02, 0.00000000e+00, 1.19300000e+01, ...,
        1.85211921e+01, 3.96900000e+02, 7.88000000e+00]])
pd.DataFrame(x_missing_mean).isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
dtype: int64

3. Fill missing values with random forest regression

The idea: treat each feature that has missing values as a temporary label. Starting from the feature with the fewest missing values, use the remaining features (with their own gaps temporarily filled with 0) plus the true label y_full as predictors, train a random forest regressor on the rows where the feature is known, and predict the rows where it is missing.

x_missing_reg = x_missing.copy()
# order the features by their number of missing values, from fewest to most
sortindex = np.argsort(x_missing_reg.isnull().sum()).values
# argsort returns the original column indices, ordered by ascending missing-value count
for i in sortindex:
    # build the new feature matrix and the new label
    df = x_missing_reg
    fillc = df.iloc[:, i]
    df = pd.concat([df.iloc[:, df.columns != i], pd.DataFrame(y_full)], axis = 1)
    # fill the remaining gaps in the new matrix with 0
    imp_zero = SimpleImputer(missing_values = np.nan, strategy = 'constant', fill_value = 0)
    df_zero = imp_zero.fit_transform(df)
    # split into a training part (feature value known) and a prediction part (feature value missing)
    Ytrain = fillc[fillc.notnull()]
    Ytest = fillc[fillc.isnull()]
    Xtrain = df_zero[Ytrain.index, :]
    Xtest = df_zero[Ytest.index, :]
    # fill the missing values with a random forest regressor
    rfr = RandomForestRegressor(n_estimators = 100)
    rfr.fit(Xtrain, Ytrain)
    x_missing_reg.loc[x_missing_reg.iloc[:, i].isnull(), i] = rfr.predict(Xtest)
x_missing_reg.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.280164 | 18.00 | 6.6453 | 0.14 | 0.538000 | 6.59824 | 65.200 | 4.090000 | 1.00 | 296.00 | 18.213 | 389.5860 | 4.9800 |
| 1 | 0.027310 | 0.00 | 6.1291 | 0.00 | 0.469000 | 6.18508 | 78.900 | 4.967100 | 2.00 | 305.16 | 18.135 | 396.9000 | 9.1400 |
| 2 | 0.027290 | 11.38 | 7.0700 | 0.00 | 0.462928 | 7.18500 | 61.100 | 4.311964 | 2.00 | 242.00 | 17.824 | 388.4907 | 5.3172 |
| 3 | 0.096453 | 16.69 | 2.8925 | 0.00 | 0.458000 | 6.92689 | 45.800 | 4.784298 | 3.70 | 222.00 | 18.700 | 393.6865 | 5.9258 |
| 4 | 0.115528 | 0.00 | 2.1800 | 0.00 | 0.467328 | 7.14700 | 59.009 | 4.834263 | 3.74 | 247.62 | 18.700 | 392.3946 | 5.3300 |

4. Evaluation

X = [x_missing_zero, x_missing_mean, x_missing_reg, x_full]
mse = []
for x in X:
    estimators = RandomForestRegressor(random_state = 0, n_estimators = 100)
    score = cross_val_score(estimators, x, y_full, scoring = 'neg_mean_squared_error', cv = 5).mean()
    mse.append(score * -1)

[*zip(['x_missing_zero', 'x_missing_mean', 'x_missing_reg', 'x_full'], mse)]

[('x_missing_zero', 49.50657028893417),
 ('x_missing_mean', 40.84405476955929),
 ('x_missing_reg', 21.372347360920205),
 ('x_full', 21.62860460743544)]

x_labels = ['Zero Imputation', 'Mean Imputation', 'Regressor Imputation', 'Full Data']
colors = ['r', 'g', 'b', 'orange']
plt.figure(figsize = (12, 6))  # create the figure
ax = plt.subplot(111)  # create the subplot
for i in np.arange(len(mse)):
    ax.barh(i, mse[i], color = colors[i], alpha = .6, align = 'center')
ax.set_title('Imputation Techniques with Boston Data')
ax.set_xlim(left = np.min(mse) * 0.9, right = np.max(mse) * 1.1)
ax.set_yticks(np.arange(len(mse)))
ax.set_xlabel('MSE')
ax.invert_yaxis()  # flip the y-axis so the first bar is drawn at the top
ax.set_yticklabels(x_labels)

[Text(0, 0, 'Zero Imputation'),
 Text(0, 0, 'Mean Imputation'),
 Text(0, 0, 'Regressor Imputation'),
 Text(0, 0, 'Full Data')]

(plot: horizontal bar chart of MSE for each imputation technique)

The regression-imputed dataset even outperforms the original dataset here, which suggests a risk of overfitting.

III. Tuning a Random Forest on the Breast Cancer Dataset

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV

data = load_breast_cancer()

data.data.shape

(569, 30)

np.unique(data.target)

array([0, 1])

rfc = RandomForestClassifier(n_estimators = 100, random_state = 90)  # instantiate
score_pre = cross_val_score(rfc, data.data, data.target, cv = 10, scoring = 'accuracy').mean()
score_pre

0.9666925935528475

1. n_estimators

score = []
for i in range(0, 200, 10):
    rfc = RandomForestClassifier(n_estimators = i + 1, random_state = 90)
    score_ = cross_val_score(rfc, data.data, data.target, cv = 10).mean()
    score.append(score_)
plt.figure(figsize = (20, 5))
plt.plot(range(1, 201, 10), score)

[<matplotlib.lines.Line2D at 0x1e8e26bd9e8>]

(plot: coarse learning curve for n_estimators, step size 10)

print(max(score), score.index(max(score)) * 10 + 1)

0.9684480598046841 41
score = []
for i in range(35, 45):
    rfc = RandomForestClassifier(n_estimators = i, n_jobs = -1, random_state = 90)
    score_pre = cross_val_score(rfc, data.data, data.target, cv = 10).mean()
    score.append(score_pre)
plt.figure(figsize = (20, 5))
plt.plot(range(35, 45), score)

[<matplotlib.lines.Line2D at 0x1e8e09d7fd0>]

(plot: fine-grained learning curve for n_estimators between 35 and 44)

print(max(score), [*range(35, 45)][score.index(max(score))])

0.9719568317345088 39

2. max_depth

param_grid = {'max_depth': np.arange(1, 20, 1)}
rfc = RandomForestClassifier(n_estimators = 39, random_state = 90)
GS = GridSearchCV(rfc, param_grid, cv = 10)
GS.fit(data.data, data.target)

C:\anaconda\lib\site-packages\sklearn\model_selection\_search.py:813: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=39, n_jobs=None,
                                              oob_score=False, random_state=90,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'max_depth': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

GS.best_params_

{'max_depth': 11}

GS.best_score_

0.9718804920913884
Accuracy fell rather than rose. Since the generalization error went up as the model was simplified, the model is probably on the low-complexity side of the generalization-error curve rather than overfitting.
min_samples_leaf and min_samples_split are pruning parameters that also only reduce complexity, so they are not worth tuning further.
(figure: the generalization error curve against model complexity)

3. max_features

param_grid = {'max_features': np.arange(5, 30, 1)}
rfc = RandomForestClassifier(n_estimators = 39, random_state = 90)
GS = GridSearchCV(rfc, param_grid, cv = 10)
GS.fit(data.data, data.target)

C:\anaconda\lib\site-packages\sklearn\model_selection\_search.py:813: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=39, n_jobs=None,
                                              oob_score=False, random_state=90,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'max_features': array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
       22, 23, 24, 25, 26, 27, 28, 29])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

GS.best_params_

{'max_features': 5}

GS.best_score_

0.9718804920913884
The grid search returned the smallest max_features value in the range we set, and raising max_features brought no improvement, so the model is probably already sitting near the minimum of the generalization error.
In other words, we may be close to the ceiling of what a random forest classifier can achieve on this data.

4. min_samples_leaf

param_grid = {'min_samples_leaf': np.arange(1, 1 + 10, 1)}
rfc = RandomForestClassifier(n_estimators = 39, random_state = 90)
GS = GridSearchCV(rfc, param_grid, cv = 10)
GS.fit(data.data, data.target)

C:\anaconda\lib\site-packages\sklearn\model_selection\_search.py:813: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=39, n_jobs=None,
                                              oob_score=False, random_state=90,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'min_samples_leaf': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

GS.best_params_

{'min_samples_leaf': 1}

GS.best_score_

0.9718804920913884

5. min_samples_split

param_grid = {'min_samples_split': np.arange(2, 2 + 20, 1)}
rfc = RandomForestClassifier(n_estimators = 39, random_state = 90)
GS = GridSearchCV(rfc, param_grid, cv = 10)
GS.fit(data.data, data.target)

C:\anaconda\lib\site-packages\sklearn\model_selection\_search.py:813: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=39, n_jobs=None,
                                              oob_score=False, random_state=90,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'min_samples_split': array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 20, 21])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

GS.best_params_

{'min_samples_split': 2}

GS.best_score_

0.9718804920913884

6. criterion

rfc = RandomForestClassifier(criterion = 'entropy', n_estimators = 39, random_state = 90)
score = cross_val_score(rfc, data.data, data.target, cv = 10).mean()
score

0.9649997839426151

rfc = RandomForestClassifier(criterion = 'gini', n_estimators = 39, random_state = 90)
score = cross_val_score(rfc, data.data, data.target, cv = 10).mean()
score

0.9719568317345088