Feature Engineering: Variance Thresholding

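Variance thresholding is not the same as the analysis of variance (ANOVA) covered in the next section (filter methods). Variance thresholding looks at a single feature on its own, whereas ANOVA examines the relationship between a feature and the label.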

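For a model to separate samples well, the features themselves must first be separable. One way to gauge a feature's separability is its variance: a large variance means the samples are spread out along that feature, which can help the model split them, while a variance close to 0 means the feature is nearly constant and contributes little to the split. We therefore compute the variance of each feature first and keep the features with larger variance as candidates.
Use sklearn.feature_selection.VarianceThreshold to perform this selection.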

>>> from sklearn.feature_selection import VarianceThreshold
>>> # Example from the official scikit-learn documentation
>>> X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
>>> selector = VarianceThreshold()   # default: threshold=0.0
>>> selector.fit_transform(X)
array([[2, 0],
       [1, 4],
       [1, 1]])

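As the example shows, VarianceThreshold() takes a single parameter, threshold, which defaults to 0. With the default, the features whose variance is 0 (here the first and last columns, which are constant) are removed.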
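To make the selection rule concrete, here is a minimal sketch in plain Python. It is not scikit-learn's implementation: `variance_filter` is a hypothetical helper, and it assumes the population variance (dividing by n) and a strict "greater than threshold" comparison, which is enough to reproduce the output above.

```python
from statistics import pvariance


def variance_filter(rows, threshold=0.0):
    """Hypothetical helper: keep the columns whose variance exceeds threshold.

    Returns (variances, kept_indices, filtered_rows).
    """
    columns = list(zip(*rows))  # transpose rows into columns
    # Assumption: population variance (divide by n, not n - 1)
    variances = [pvariance(col) for col in columns]
    # Assumption: a column is kept only if its variance is strictly greater
    kept = [i for i, v in enumerate(variances) if v > threshold]
    filtered = [[row[i] for i in kept] for row in rows]
    return variances, kept, filtered


X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
variances, kept, filtered = variance_filter(X)
print(kept)      # [1, 2]
print(filtered)  # [[2, 0], [1, 4], [1, 1]]
```

Calling `variance_filter(X, threshold=0.5)` on the same data keeps only column index 2, because the second column's variance (about 0.22) no longer clears the threshold.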

>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
>>> selector = VarianceThreshold(threshold=0.5)
>>> selector.fit_transform(X)
array([[0],
       [4],
       [1]])
>>> # Variance of each feature column
>>> selector.variances_
array([0.        , 0.22222222, 2.88888889, 0.        ])
>>> # Boolean mask of the retained features
>>> selector.get_support(indices=False)
array([False, False,  True, False])
>>> # Integer indices of the retained features
>>> selector.get_support(indices=True)
array([2])

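With the threshold raised to 0.5, only one column remains: the third feature (index 2), whose variance of about 2.89 is the only one above the threshold.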
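One thing to keep in mind when picking a threshold such as 0.5 is that variance depends on the unit a feature is measured in. The sketch below uses made-up height values (purely illustrative, not from any dataset) to show the same feature falling below or above a fixed threshold depending only on its unit.

```python
from statistics import pvariance

# Hypothetical data: the same four heights, in metres and in centimetres
heights_m = [1.60, 1.75, 1.82, 1.68]
heights_cm = [h * 100 for h in heights_m]

var_m = pvariance(heights_m)    # population variance in m^2
var_cm = pvariance(heights_cm)  # population variance in cm^2

threshold = 0.5
print(var_m > threshold)   # False: the metre version would be dropped
print(var_cm > threshold)  # True: the centimetre version would be kept
```

Rescaling a feature by a factor of 100 multiplies its variance by 100 squared, so a single threshold is only comparable across features that share a similar scale.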