Bagging is a parallel ensemble method in which there are no strong dependencies among the individual learners, so the learners can be generated simultaneously.

Bagging is based on bootstrap sampling. Given a dataset containing $m$ samples, we randomly draw one sample, put it into the sampling set, and then return it to the initial dataset, so that it may still be selected in later draws. After $m$ such operations we obtain a sampling set of $m$ samples: some samples from the initial training set appear in it multiple times, while others never appear. Since $\lim_{m\to\infty}\left(1-\frac{1}{m}\right)^m = \frac{1}{e} \approx 0.368$, about 63.2% of the samples in the initial training set appear in the sampling set, while about 36.8% do not.
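A quick simulation makes the 63.2% figure concrete. Below is a minimal sketch (bootstrap_coverage is a hypothetical helper and the dataset size m is arbitrary) that draws bootstrap samples and measures what fraction of distinct samples each one covers:

import random

def bootstrap_coverage(m, trials=1000):
    """Estimate the fraction of distinct items that appear in a
    bootstrap sample of size m drawn with replacement from m items."""
    total = 0.0
    for _ in range(trials):
        picked = {random.randrange(m) for _ in range(m)}
        total += len(picked) / m
    return total / trials

print(bootstrap_coverage(1000))  # ~0.632, matching 1 - 1/e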

In this way we can draw $T$ sampling sets, each containing $m$ training samples, train a base learner on each sampling set, and then combine these learners. Bagging typically uses simple voting for classification tasks and simple averaging for regression tasks, as the sketch below illustrates.
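A minimal sketch of the two combination rules (vote and average are illustrative names, not part of the implementation further down):

from collections import Counter

def vote(predictions):
    """Simple voting: return the most common class among the learners' predictions."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Simple averaging of the learners' regression outputs."""
    return sum(predictions) / len(predictions)

print(vote([1, 0, 1, 1]))        # -> 1
print(average([2.0, 3.0, 4.0]))  # -> 3.0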

It is worth mentioning that the bootstrap sampling process brings Bagging another advantage: since each base learner uses only about 63.2% of the samples in the initial training set, the remaining samples can serve as a validation set for an "out-of-bag estimate" of generalization performance.
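A minimal sketch of how an out-of-bag split arises (oob_split is a hypothetical helper; a true bootstrap samples with replacement, so roughly 36.8% of the indices end up out of bag):

import random

def oob_split(m):
    """One bootstrap draw over indices 0..m-1: return (in-bag, out-of-bag)."""
    in_bag = [random.randrange(m) for _ in range(m)]
    oob = sorted(set(range(m)) - set(in_bag))
    return in_bag, oob

in_bag, oob = oob_split(10)
# Train a base learner on in_bag; evaluating it on oob yields an
# out-of-bag estimate, since the learner never saw those samples.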

Random Forest

Random Forest builds on Bagging ensembles of decision-tree base learners and further introduces random attribute selection into the decision trees' training process. Specifically, at each node of a base decision tree, Random Forest first randomly selects a subset of $k$ attributes from the node's attribute set, and then chooses the optimal attribute from this subset for splitting. The parameter $k$ controls the degree of randomness introduced: with $k = d$ (the full attribute set), the base tree is built just like a traditional decision tree; with $k = 1$, a single attribute is chosen at random for splitting; in general, the recommended value is $k = \log_2 d$.
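The node-level attribute sampling can be sketched as follows (candidate_attributes is a hypothetical helper; choosing the optimal split within the subset is left to the tree-building code):

import math
import random

def candidate_attributes(attributes, k=None):
    """Randomly pick k of a node's attributes to consider for splitting;
    defaults to the recommended k = log2(d)."""
    d = len(attributes)
    if k is None:
        k = max(1, int(math.log2(d)))
    return random.sample(attributes, k)

print(candidate_attributes(list(range(16))))  # 4 randomly chosen attribute indices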

Random Forest is often more efficient to train than a plain Bagging ensemble of decision trees: when building individual trees, Bagging uses "deterministic" decision trees that must examine all of a node's attributes when choosing a split, whereas the "randomized" decision trees used by Random Forest only examine a subset of attributes.

As can be seen, Random Forest makes only a small change to Bagging. However, whereas the diversity of the base learners in Bagging comes solely from sample perturbation (sampling the initial training set), the diversity in Random Forest comes from both sample perturbation and attribute perturbation. This allows the generalization performance of the ensemble to improve further as the differences between individual learners increase.

Code Implementation

# -*- coding: UTF-8 -*-
import sys
import math
import random
from collections import Counter

from DecisionTree import DecisionTreeClassifier


def mymode(X, n_tree, n_samples):
    """Majority vote: for each sample, return the label predicted by most trees."""
    predictions = []
    for i in range(n_samples):
        d = {}
        for j in range(n_tree):
            x = X[j][i]
            if x not in d:
                d[x] = 0
            d[x] += 1
        res = sorted(d.items(), key=lambda kv: kv[1], reverse=True)
        predictions.append(res[0][0])
    return predictions


def load_data(file_name):
    """Load a libsvm-style file: each line is 'label index:value index:value ...'."""
    X = []
    y = []
    with open(file_name, "r") as f:
        for l in f:
            sp = l.strip("\n").split(" ")
            label = int(sp[0])
            y.append(label)
            fea = []
            for i in range(1, len(sp)):
                fea.append(float(sp[i].split(":")[1]))
            X.append(fea)
    return X, y


def shuffle_in_unison(a, b):
    """Shuffle two parallel lists with the same permutation."""
    # Note: do NOT reseed here. Reseeding on every call would give every
    # tree an identical subset and destroy ensemble diversity; seed once
    # at program start instead if reproducibility is needed.
    pairs = [(a[i], b[i]) for i in range(len(a))]
    random.shuffle(pairs)
    na = [x[0] for x in pairs]
    nb = [x[1] for x in pairs]
    return na, nb


def gini(Y):
    """Impurity of a label list. Despite the name, this computes Shannon
    entropy -sum(p * log p), not the Gini index 1 - sum(p^2)."""
    distribution = Counter(Y)
    s = 0.0
    total = len(Y)
    for y, num_y in distribution.items():
        probability_y = num_y / float(total)
        s += probability_y * math.log(probability_y)
    return -s


def gini_gain(y, y_true, y_false):
    """Impurity reduction of a split (information gain, given gini() above)."""
    return gini(y) - (gini(y_true) * len(y_true) + gini(y_false) * len(y_false)) / len(y)


class RandomForestClassifier(object):

    def __init__(self, n_estimators=32, max_features=lambda x: x, max_depth=20,
                 min_samples_split=2, bootstrap=0.632):
        self.n_estimators = n_estimators
        self.max_features = max_features
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.bootstrap = bootstrap
        self.forest = []

    def fit(self, X, y):
        self.forest = []
        n_samples = len(y)
        # Approximate bootstrap: shuffle and keep the first ~63.2% of samples
        # (sampling without replacement, unlike a true bootstrap).
        n_sub_samples = int(round(n_samples * self.bootstrap))
        for i in range(self.n_estimators):
            X, y = shuffle_in_unison(X, y)
            X_subset = [X[j] for j in range(n_sub_samples)]
            y_subset = [y[j] for j in range(n_sub_samples)]
            tree = DecisionTreeClassifier(self.max_features, self.max_depth,
                                          self.min_samples_split)
            tree.fit(X_subset, y_subset)
            self.forest.append(tree)

    def predict(self, X):
        n_samples = len(X)
        n_trees = len(self.forest)
        predictions = [self.forest[i].predict(X) for i in range(n_trees)]
        return mymode(predictions, n_trees, n_samples)

    def score(self, X, y):
        y_predict = self.predict(X)
        n_samples = len(y)
        correct = 0
        for i in range(n_samples):
            if y_predict[i] == y[i]:
                correct += 1
        return correct / float(n_samples)


if __name__ == "__main__":
    random.seed(100)  # fixed seed for reproducible shuffles
    train_file = sys.argv[1]
    test_file = sys.argv[2]
    train_x, train_y = load_data(train_file)
    test_x, test_y = load_data(test_file)
    # print("Load data finished...")
    rf = RandomForestClassifier(n_estimators=32, max_depth=20)
    rf.fit(train_x, train_y)
    # print("test acc:", rf.score(test_x, test_y))
    preds = rf.predict(test_x)
    # Build a confusion matrix over all labels seen in training and predictions.
    class_num = 0
    class_map = {}
    class_index = 0
    for i in train_y:
        if i not in class_map:
            class_map[i] = class_index
            class_index += 1
            class_num += 1
    for i in preds:
        if i not in class_map:
            class_map[i] = class_index
            class_index += 1
            class_num += 1
    matrix = [[0 for i in range(class_num)] for j in range(class_num)]
    for i in range(len(test_y)):
        actual = test_y[i]
        pred = preds[i]
        matrix[class_map[actual]][class_map[pred]] += 1
    for row in matrix:
        print(" ".join([str(x) for x in row]))
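Assuming the companion DecisionTree module is importable, the script can be run as python random_forest.py train.txt test.txt (the file names are arbitrary), where both files use the libsvm-style "label index:value ..." format parsed by load_data; it then prints the confusion matrix for the test set.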