Bagging (also called bootstrap aggregation)
Bagging uses a simple majority voting approach to combine multiple classifiers.
It is closely related to the bootstrap technique presented earlier for improving the evaluation of a classifier when the dataset is small.
Let us assume an original dataset $D$ with $|D|$ tuples. To train $k$ classifiers, we first need $k$ training data sets. We iterate over the following steps $k$ times, where $k$ is commonly set to 10.
Method:
1. Construct training data $D_i$: $|D|$ tuples are sampled with replacement (explained below) from the original set of tuples, $D$.
2. Train a classifier: classification model $M_i$ is trained on dataset $D_i$.
Then combine the classifiers for evaluation and run-time prediction:
Given a test tuple, the classification result of each classifier is considered as one vote. The tuple is predicted to belong to the majority class.
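The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the helper names (`bootstrap_sample`, `train_1nn`, `bagging_fit`, `bagging_predict`), the 1-nearest-neighbour base learner, and the toy dataset are all hypothetical choices made for the example, not part of the method's definition. Any classifier could take the place of the base learner.

```python
import random
from collections import Counter


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) tuples uniformly at random, with replacement."""
    return [rng.choice(dataset) for _ in range(len(dataset))]


def train_1nn(training_set):
    """Hypothetical base learner: 1-nearest-neighbour on one numeric feature.

    Each training tuple is a (feature, label) pair.
    """
    def classify(x):
        nearest = min(training_set, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return classify


def bagging_fit(dataset, k=10, seed=0):
    """Train k classifiers, each on its own bootstrap sample of the dataset."""
    rng = random.Random(seed)
    return [train_1nn(bootstrap_sample(dataset, rng)) for _ in range(k)]


def bagging_predict(models, x):
    """Each model casts one vote; the majority class is returned."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Toy dataset (an assumption made for illustration only).
data = [(1.0, "low"), (1.5, "low"), (2.0, "low"),
        (8.0, "high"), (8.5, "high"), (9.0, "high")]
models = bagging_fit(data, k=10)
prediction = bagging_predict(models, 1.2)
print(prediction)
```

Fixing the seed makes the bootstrap samples reproducible, which is convenient when comparing runs. Ties in the vote are resolved here by whichever class was counted first, which is one of several reasonable conventions.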
Bagging can also be applied to numerical prediction. The prediction of a continuous variable is the average of the predictions made by the individual models.
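For numerical prediction, only the combination step changes: the vote is replaced by an average. The sketch below illustrates this under the same kind of assumptions as before. The names `train_1nn_regressor`, `average_predictions`, and `bagged_regression`, the nearest-neighbour base model, and the toy data are hypothetical and chosen only to keep the example short.

```python
import random
from statistics import mean


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) tuples uniformly at random, with replacement."""
    return [rng.choice(dataset) for _ in range(len(dataset))]


def train_1nn_regressor(training_set):
    """Hypothetical base model: predict the target of the nearest training tuple.

    Each training tuple is a (feature, target) pair.
    """
    def predict(x):
        nearest = min(training_set, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict


def average_predictions(models, x):
    """Combine the models by averaging their individual predictions."""
    return mean(model(x) for model in models)


def bagged_regression(dataset, x, k=10, seed=0):
    """Train k models on bootstrap samples and return their average prediction."""
    rng = random.Random(seed)
    models = [train_1nn_regressor(bootstrap_sample(dataset, rng))
              for _ in range(k)]
    return average_predictions(models, x)


# Toy dataset (an assumption made for illustration only).
data = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (4.0, 40.0)]
estimate = bagged_regression(data, 2.5)
print(estimate)
```

Because every individual prediction is one of the observed targets, the averaged estimate always lies between the smallest and largest target in the data.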
Sampling with Replacement
Sampling in general selects some tuple from a data set according to some probability. If the probability is not specified, then the basic assumption is uniform probability (so the probability of sampling any particular tuple from a data set $D$ is $1/|D|$).
- Sampling is called with replacement when a tuple selected at random from the population (dataset) is returned to the population before selecting the next tuple at random. A tuple can be selected more than once. There is no change at all in the size of the population available for sampling at any stage. Therefore a sample of any size can be selected from a given population of any size, even from a population smaller than the sample.
- Sampling without replacement differs because once a tuple is selected, that tuple cannot be selected again in the sampling sequence. Therefore if we sample $|D|$ tuples from $D$ without replacement, the resulting sample $D_i$ is equal to $D$ and $|D_i| = |D|$.
Example:
Sample 3 objects from a dataset in which the value 2 occurs only once. We can obtain the 3 tuples {2,2,3} by sampling with replacement, but it is not possible to obtain this training set by sampling _without_ replacement, because the tuple 2 would have to be selected twice.
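The difference between the two sampling schemes can be tried out with Python's standard `random` module, where `choices` samples with replacement and `sample` samples without replacement. The population `[1, 2, 3, 4]` below is a hypothetical dataset chosen for illustration.

```python
import random

rng = random.Random(0)
population = [1, 2, 3, 4]  # hypothetical dataset, chosen for illustration

# With replacement: the same tuple may be drawn more than once.
with_replacement = rng.choices(population, k=3)

# Without replacement: each tuple can be drawn at most once.
without_replacement = rng.sample(population, k=3)

# With replacement, the sample may be larger than the population.
oversized = rng.choices(population, k=10)

# Without replacement, asking for more tuples than exist is an error.
try:
    rng.sample(population, k=10)
    oversized_without_replacement_ok = True
except ValueError:
    oversized_without_replacement_ok = False

# Drawing |D| tuples without replacement yields the same tuples as D,
# only in a possibly different order.
full_sample = rng.sample(population, k=len(population))

print(with_replacement, without_replacement, len(oversized))
print(oversized_without_replacement_ok, sorted(full_sample) == population)
```

The last check mirrors the remark above: a full-size sample drawn without replacement is just a reordering of the original dataset, which is why bagging relies on sampling with replacement to obtain training sets that differ from one another.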