I. Data preprocessing: preprocessing & impute

1. Feature scaling (non-dimensionalization)

Normalization

  1. from sklearn.preprocessing import MinMaxScaler
  2. import numpy as np
  3. import pandas as pd

Normalization with MinMaxScaler

  1. data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
  2. type(data)
  1. list
  1. data
  1. [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
  1. data = pd.DataFrame(data)
  2. type(data)
  1. pandas.core.frame.DataFrame
  1. data
0 1
0 -1.0 2
1 -0.5 6
2 0.0 10
3 1.0 18
  1. norm = MinMaxScaler() # instantiate
  2. norm.fit(data) # learn the min and max of the original data
  3. result = norm.transform(data) # export the result through the transform interface
  4. result
  1. array([[0. , 0. ],
  2. [0.25, 0.25],
  3. [0.5 , 0.5 ],
  4. [1. , 1. ]])

After normalization the two columns have exactly the same distribution, which shows that the information they carry is very similar, in this case essentially identical (the original columns differ only by a linear transformation).

  1. norm = MinMaxScaler()
  2. result_ = norm.fit_transform(data) # fit and transform in a single step
  3. result_
  1. array([[0. , 0. ],
  2. [0.25, 0.25],
  3. [0.5 , 0.5 ],
  4. [1. , 1. ]])
  1. norm.inverse_transform(result_) # reverse the normalization
  1. array([[-1. , 2. ],
  2. [-0.5, 6. ],
  3. [ 0. , 10. ],
  4. [ 1. , 18. ]])
  1. scaler = MinMaxScaler(feature_range = [5, 10])
  2. scaler.fit_transform(data)
  1. array([[ 5. , 5. ],
  2. [ 6.25, 6.25],
  3. [ 7.5 , 7.5 ],
  4. [10. , 10. ]])

fit and partial_fit

When the data is too large (too many features or samples) for fit to process in one go, it may raise an error; in that case use partial_fit as the training interface, e.g. scaler = scaler.partial_fit(data), as sketched below.
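A minimal sketch of incremental scaling with partial_fit, assuming the data arrives (or is read) in chunks; the random data and the 10-batch split are only for illustration:

  import numpy as np
  from sklearn.preprocessing import MinMaxScaler

  X = np.random.RandomState(0).uniform(size = (10000, 5))   # stand-in data, for illustration only
  scaler = MinMaxScaler()
  for chunk in np.array_split(X, 10):                        # feed the data in 10 batches
      scaler.partial_fit(chunk)                              # data_min_ / data_max_ are updated batch by batch
  X_scaled = scaler.transform(X)                             # same result as fitting on all of X at once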

Normalization with numpy

  1. data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
  2. x = np.array(data)
  1. x.min()
  1. -1.0
  1. x.min(axis = 0)
  1. array([-1., 2.])
  1. x.min(axis = 1)
  1. array([-1. , -0.5, 0. , 1. ])
  1. x_nor = (x - x.min(axis = 0)) / (x.max(axis = 0) - x.min(axis = 0))
  2. x_nor
  1. array([[0. , 0. ],
  2. [0.25, 0.25],
  3. [0.5 , 0.5 ],
  4. [1. , 1. ]])
  1. x_nor * (x.max(axis = 0) - x.min(axis = 0)) + x.min(axis = 0)
  1. array([[-1. , 2. ],
  2. [-0.5, 6. ],
  3. [ 0. , 10. ],
  4. [ 1. , 18. ]])

Standardization

  1. data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
  1. from sklearn.preprocessing import StandardScaler
  2. scaler = StandardScaler()
  3. scaler.fit(data) # learn the mean and variance of the original data
  1. StandardScaler(copy=True, with_mean=True, with_std=True)
  1. scaler.mean_
  1. array([-0.125, 9. ])
  1. scaler.var_
  1. array([ 0.546875, 35. ])
  1. x_std = scaler.transform(data)
  2. x_std
  1. array([[-1.18321596, -1.18321596],
  2. [-0.50709255, -0.50709255],
  3. [ 0.16903085, 0.16903085],
  4. [ 1.52127766, 1.52127766]])
  1. x_std.mean()
  1. 0.0
  1. x_std.var()
  1. 1.0
  1. x_std.std()
  1. 1.0
  1. scaler.inverse_transform(x_std)
  1. array([[-1. , 2. ],
  2. [-0.5, 6. ],
  3. [ 0. , 10. ],
  4. [ 1. , 18. ]])

Choosing a scaling method

  1. In most machine learning algorithms, StandardScaler is the usual choice for feature scaling, because MinMaxScaler is very sensitive to outliers
  2. MinMaxScaler is widely used when distances, gradients and covariances are not involved and the data needs to be compressed into a specific interval
  3. To compress the data without affecting its sparsity (without changing the number of zero entries), scale only, without centering, using MaxAbsScaler
  4. When there are many outliers and the noise is heavy, scale with quantiles instead, using RobustScaler (see the sketch after this list)
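A minimal sketch (with made-up numbers) of how a single outlier distorts MinMaxScaler, while RobustScaler, which centers on the median and scales by the interquartile range, is barely affected:

  import numpy as np
  from sklearn.preprocessing import MinMaxScaler, RobustScaler

  x = np.array([[1.], [2.], [3.], [4.], [1000.]])    # 1000 is an outlier
  print(MinMaxScaler().fit_transform(x).ravel())     # the normal points are squeezed into a tiny range near 0
  print(RobustScaler().fit_transform(x).ravel())     # the normal points keep a reasonable spread around 0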

Sparse data

In a database, sparse data refers to a two-dimensional table containing a large number of empty values;
that is, data in which the vast majority of entries are missing or zero.
Sparse data is by no means useless: the information is merely incomplete, and with appropriate techniques a great deal of useful information can still be extracted from it. A sketch of scaling sparse data without destroying its sparsity follows.
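A minimal sketch, on a made-up toy matrix, of why MaxAbsScaler suits sparse data: it only divides each column by its maximum absolute value and never centers, so zero entries stay zero and the sparsity is preserved:

  import numpy as np
  from scipy import sparse
  from sklearn.preprocessing import MaxAbsScaler

  X = sparse.csr_matrix(np.array([[0., 2.], [0., -4.], [5., 0.]]))
  X_scaled = MaxAbsScaler().fit_transform(X)   # each column divided by its max absolute value
  print(X_scaled.toarray())                    # zeros are untouched
  print(X.nnz, X_scaled.nnz)                   # the number of stored non-zero entries is unchanged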

2. Handling missing values

  1. data = pd.read_csv('Narrativedata.csv', index_col = 0)
  1. data.head()
Age Sex Embarked Survived
0 22.0 male S No
1 38.0 female C Yes
2 26.0 female S Yes
3 35.0 female S Yes
4 35.0 male S No
  1. data.info()
  1. <class 'pandas.core.frame.DataFrame'>
  2. Int64Index: 891 entries, 0 to 890
  3. Data columns (total 4 columns):
  4. Age 714 non-null float64
  5. Sex 891 non-null object
  6. Embarked 889 non-null object
  7. Survived 891 non-null object
  8. dtypes: float64(1), object(3)
  9. memory usage: 34.8+ KB
  1. Age = data.loc[:, 'Age'].values.reshape(-1, 1)
  2. # data.loc[:, 'Age'] extracts a Series (one column of data together with its index)
  3. # data.loc[:, 'Age'].values converts the Series to an array, so that reshape can turn the 1-D ages into a 2-D column
  1. from sklearn.impute import SimpleImputer
  2. imp_mean = SimpleImputer()
  3. imp_median = SimpleImputer(strategy = 'median')
  4. imp_zero = SimpleImputer(strategy = 'constant', fill_value = 0)
  5. imp_mean = imp_mean.fit_transform(Age)
  6. imp_median = imp_median.fit_transform(Age)
  7. imp_zero = imp_zero.fit_transform(Age)
  1. imp_mean[:10]
  1. array([[22. ],
  2. [38. ],
  3. [26. ],
  4. [35. ],
  5. [35. ],
  6. [29.69911765],
  7. [54. ],
  8. [ 2. ],
  9. [27. ],
  10. [14. ]])
  1. imp_median[:10]
  1. array([[22.],
  2. [38.],
  3. [26.],
  4. [35.],
  5. [35.],
  6. [28.],
  7. [54.],
  8. [ 2.],
  9. [27.],
  10. [14.]])
  1. imp_zero[:10]
  1. array([[22.],
  2. [38.],
  3. [26.],
  4. [35.],
  5. [35.],
  6. [ 0.],
  7. [54.],
  8. [ 2.],
  9. [27.],
  10. [14.]])
  1. data['Age'] = imp_median
  2. # Age is usually imputed with the mean, but here the median and the mean are close and the median is an integer, so the median is used
  1. data.info()
  1. <class 'pandas.core.frame.DataFrame'>
  2. Int64Index: 891 entries, 0 to 890
  3. Data columns (total 4 columns):
  4. Age 891 non-null float64
  5. Sex 891 non-null object
  6. Embarked 889 non-null object
  7. Survived 891 non-null object
  8. dtypes: float64(1), object(3)
  9. memory usage: 34.8+ KB
  1. Embarked = data['Embarked'].values.reshape(-1, 1)
  1. imp_mode = SimpleImputer(strategy = 'most_frequent')
  2. imp_mode = imp_mode.fit_transform(Embarked)
  1. data.Embarked = imp_mode
  1. data.info()
  1. <class 'pandas.core.frame.DataFrame'>
  2. Int64Index: 891 entries, 0 to 890
  3. Data columns (total 4 columns):
  4. Age 891 non-null float64
  5. Sex 891 non-null object
  6. Embarked 891 non-null object
  7. Survived 891 non-null object
  8. dtypes: float64(1), object(3)
  9. memory usage: 34.8+ KB

Imputing with pandas and numpy is more convenient

  1. data_ = pd.read_csv('Narrativedata.csv', index_col = 0)
  2. data_.head()
Age Sex Embarked Survived
0 22.0 male S No
1 38.0 female C Yes
2 26.0 female S Yes
3 35.0 female S Yes
4 35.0 male S No
  1. data_.info()
  1. <class 'pandas.core.frame.DataFrame'>
  2. Int64Index: 891 entries, 0 to 890
  3. Data columns (total 4 columns):
  4. Age 714 non-null float64
  5. Sex 891 non-null object
  6. Embarked 889 non-null object
  7. Survived 891 non-null object
  8. dtypes: float64(1), object(3)
  9. memory usage: 34.8+ KB
  1. data_.Age = data_.Age.fillna(data_.Age.median())
  1. data_.info()
  1. <class 'pandas.core.frame.DataFrame'>
  2. Int64Index: 891 entries, 0 to 890
  3. Data columns (total 4 columns):
  4. Age 891 non-null float64
  5. Sex 891 non-null object
  6. Embarked 889 non-null object
  7. Survived 891 non-null object
  8. dtypes: float64(1), object(3)
  9. memory usage: 34.8+ KB
  1. data_.dropna(axis = 0, inplace = True)
  1. data_.info()
  1. <class 'pandas.core.frame.DataFrame'>
  2. Int64Index: 889 entries, 0 to 890
  3. Data columns (total 4 columns):
  4. Age 889 non-null float64
  5. Sex 889 non-null object
  6. Embarked 889 non-null object
  7. Survived 889 non-null object
  8. dtypes: float64(1), object(3)
  9. memory usage: 34.7+ KB

3. Encoding categorical features and labels

LabelEncoder

  1. from sklearn.preprocessing import LabelEncoder
  1. y = data.iloc[:, -1]
  2. le = LabelEncoder() # instantiate
  3. le.fit(y) # fit on the labels
  4. y = le.transform(y) # the transform interface returns the encoded labels
  5. y[:5]
  1. array([0, 2, 2, 2, 0])
  1. le.classes_ # inspect the classes
  1. array(['No', 'Unknown', 'Yes'], dtype=object)
  1. data.iloc[:, -1] = y
  2. data.head()
Age Sex Embarked Survived
0 22.0 male S 0
1 38.0 female C 2
2 26.0 female S 2
3 35.0 female S 2
4 35.0 male S 0
  1. # shorthand version
  2. from sklearn.preprocessing import LabelEncoder
  3. data.iloc[:, -1] = LabelEncoder().fit_transform(data.iloc[:, -1])

OrdinalEncoder

  1. from sklearn.preprocessing import OrdinalEncoder
  1. data_ = data.copy()
  2. data_.head()
Age Sex Embarked Survived
0 22.0 male S 0
1 38.0 female C 2
2 26.0 female S 2
3 35.0 female S 2
4 35.0 male S 0
  1. OrdinalEncoder().fit(data_.iloc[:, 1:-1]).categories_
  1. [array(['female', 'male'], dtype=object), array(['C', 'Q', 'S'], dtype=object)]
  1. data_.iloc[:, 1:-1] = OrdinalEncoder().fit_transform(data_.iloc[:, 1:-1])
  1. OrdinalEncoder().fit(data_.iloc[:, 1:-1]).categories_
  1. [array([0., 1.]), array([0., 1., 2.])]
  1. data_.head()
Age Sex Embarked Survived
0 22.0 1.0 2.0 0
1 38.0 0.0 0.0 2
2 26.0 0.0 2.0 2
3 35.0 0.0 2.0 2
4 35.0 1.0 2.0 0

OneHotEncoder

  1. from sklearn.preprocessing import OneHotEncoder
  2. result = OneHotEncoder(categories = 'auto').fit_transform(data.iloc[:, 1:-1]).toarray()
  3. result
  1. array([[0., 1., 0., 0., 1.],
  2. [1., 0., 1., 0., 0.],
  3. [1., 0., 0., 0., 1.],
  4. ...,
  5. [1., 0., 0., 0., 1.],
  6. [0., 1., 1., 0., 0.],
  7. [0., 1., 0., 1., 0.]])
  1. OneHotEncoder(categories = 'auto').fit(data.iloc[:, 1:-1]).get_feature_names()
  1. array(['x0_female', 'x0_male', 'x1_C', 'x1_Q', 'x1_S'], dtype=object)
  1. result.shape
  1. (891, 5)
  1. newdata = pd.concat([data, pd.DataFrame(result)], axis = 1)
  2. newdata.head()
Age Sex Embarked Survived 0 1 2 3 4
0 22.0 male S 0 0.0 1.0 0.0 0.0 1.0
1 38.0 female C 2 1.0 0.0 1.0 0.0 0.0
2 26.0 female S 2 1.0 0.0 0.0 0.0 1.0
3 35.0 female S 2 1.0 0.0 0.0 0.0 1.0
4 35.0 male S 0 0.0 1.0 0.0 0.0 1.0
  1. newdata.iloc[:, -5:].columns
  1. Index([0, 1, 2, 3, 4], dtype='object')
  1. newdata.drop(['Sex', 'Embarked'], axis = 1, inplace = True) # drop the original categorical columns that have been one-hot encoded
  2. newdata.columns = ['Age', 'Survived', 'female', 'male', 'C', 'Q', 'S']
  3. newdata.head()
Age Survived female male C Q S
0 22.0 0 0.0 1.0 0.0 0.0 1.0
1 38.0 2 1.0 0.0 1.0 0.0 0.0
2 26.0 2 1.0 0.0 0.0 0.0 1.0
3 35.0 2 1.0 0.0 0.0 0.0 1.0
4 35.0 0 0.0 1.0 0.0 0.0 1.0

4. Handling continuous features

In statistics and machine learning, discretization is the process of transforming or partitioning continuous attributes, features or variables into discrete or nominal attributes/features/variables/intervals.
It is a form of binning, for example when building a histogram. Whenever continuous data is discretized there is always some discretization error;
the goal of discretization is to reduce that error to a level that is negligible for modeling.

Binarization: Binarizer

  1. from sklearn.preprocessing import Binarizer
  2. data_2 = data.copy()
  3. X = data_2.iloc[:, 0].values.reshape(-1, 1)
  4. transformer = Binarizer(threshold = 30).fit_transform(X)
  1. transformer[:10]
  1. array([[0.],
  2. [1.],
  3. [0.],
  4. [1.],
  5. [1.],
  6. [0.],
  7. [1.],
  8. [0.],
  9. [0.],
  10. [0.]])
  1. data_2.iloc[:, 0] = transformer
  1. data_2.head()
Age Sex Embarked Survived
0 0.0 male S 0
1 1.0 female C 2
2 0.0 female S 2
3 1.0 female S 2
4 1.0 male S 0
  1. y = data_2.iloc[:, -1].values.reshape(-1, 1)
  2. transformer = Binarizer(threshold = 1).fit_transform(y)
  1. data_2.iloc[:, -1] = transformer
  1. data_2.head()
Age Sex Embarked Survived
0 0.0 male S 0
1 1.0 female C 1
2 0.0 female S 1
3 1.0 female S 1
4 1.0 male S 0

Binning: KBinsDiscretizer

  1. from sklearn.preprocessing import KBinsDiscretizer
  2. X = data.iloc[:, 0].values.reshape(-1, 1)
  3. X_tran = KBinsDiscretizer(n_bins = 3, encode = 'ordinal', strategy = 'uniform').fit_transform(X)
  1. data_3 = pd.DataFrame()
  2. data_3 = pd.concat([data, pd.DataFrame(X_tran)], axis = 1)
  1. data_3.head()
Age Sex Embarked Survived 0
0 22.0 male S 0 0.0
1 38.0 female C 2 1.0
2 26.0 female S 2 0.0
3 35.0 female S 2 1.0
4 35.0 male S 0 1.0
  1. dim_reduce = X_tran.ravel()
  2. set(dim_reduce)
  1. {0.0, 1.0, 2.0}
  1. X_tran = KBinsDiscretizer(n_bins = 3, encode = 'onehot', strategy = 'uniform').fit_transform(X)
  1. X_tran
  1. <891x3 sparse matrix of type '<class 'numpy.float64'>'
  2. with 891 stored elements in Compressed Sparse Row format>
  1. X_tran.toarray()
  1. array([[1., 0., 0.],
  2. [0., 1., 0.],
  3. [1., 0., 0.],
  4. ...,
  5. [0., 1., 0.],
  6. [1., 0., 0.],
  7. [0., 1., 0.]])

II. Feature selection: feature_selection

  1. import pandas as pd
  2. data = pd.read_csv("digit recognizor.csv")
  1. data.head()
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 785 columns

  1. x = data.iloc[:, 1:]
  2. y = data.iloc[:, 0]
  1. x.shape
  1. (42000, 784)

The dataset is large mainly in dimensionality, not in the number of records. Feeding it into a model without any processing would be slow and wasteful;
a dataset like this makes the importance of feature engineering especially clear.

  1. y.shape
  1. (42000,)

1. Filter methods

Variance filtering: VarianceThreshold

  1. from sklearn.feature_selection import VarianceThreshold
  1. x_var0 = VarianceThreshold().fit_transform(x)
  1. x_var0.shape
  1. (42000, 708)

76 features with zero variance were filtered out (784 → 708).

  1. x.var().values[:10]
  1. array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
  1. np.median(x.var())
  1. 1352.286703180131
  1. x_var_half = VarianceThreshold(np.median(x.var())).fit_transform(x)
  1. x_var_half.shape
  1. (42000, 392)

392 features whose variance is below the median variance of the original features were filtered out (784 → 392).

  1. np.array([0, 1, 0, 1, 0, 0, 0, 1, 0, 1]).var()
  1. 0.24

When a feature is binary, its values follow a Bernoulli random variable (0/1).
The variance of a Bernoulli variable is $Var(X) = p(1 - p)$,
where X is the binary feature and p is the proportion of one of its two classes.

  1. # If a feature is Bernoulli and p = 0.8, i.e. one class makes up 80% of the samples, its variance is 0.8 * (1 - 0.8) = 0.16
  2. # Setting threshold = 0.16 therefore removes binary features in which one class accounts for more than 80% of the samples
  3. x_bvar = VarianceThreshold(0.16).fit_transform(x)
  1. x_bvar.shape
  1. (42000, 685)

99 features with variance at or below 0.16 were filtered out (784 → 685).

Does variance filtering actually improve the model? Two things to compare:

  1. The model's predictive performance (accuracy)
  2. The running time

Comparing KNN and random forest under median-variance filtering:

KNN: cross-validation score before filtering 0.9658569700000001, after filtering 0.9659997459999999

Random forest: cross-validation score before filtering 0.9380003861799541, after filtering 0.9388098166696807

  1. # import the modules and prepare the data
  2. from sklearn.ensemble import RandomForestClassifier as RFC
  3. from sklearn.neighbors import KNeighborsClassifier as KNN
  4. from sklearn.model_selection import cross_val_score
  5. x = data.iloc[:, 1:]
  6. y = data.iloc[:, 0]
  7. x_var_half = VarianceThreshold(np.median(x.var())).fit_transform(x)
  1. cross_val_score(KNN(), x, y, cv = 5)
  1. array([0.96799524, 0.96548852, 0.96356709, 0.96332023, 0.96891377])
  1. np.array([0.96799524, 0.96548852, 0.96356709, 0.96332023, 0.96891377]).mean()
  1. 0.9658569700000001
  1. cross_val_score(KNN(), x_var_half, y, cv = 5)
  1. array([0.96799524, 0.96632155, 0.96368615, 0.96332023, 0.96867556])
  1. np.array([0.96799524, 0.96632155, 0.96368615, 0.96332023, 0.96867556]).mean()
  1. 0.9659997459999999
  1. rfc_score1 = cross_val_score(RFC(n_estimators = 10, random_state = 0), x, y, cv = 5)
  2. rfc_score1.mean()
  1. 0.9380003861799541
  1. %%timeit
  2. rfc_score1 = cross_val_score(RFC(n_estimators = 10, random_state = 0), x, y, cv = 5)
  1. 13.6 s ± 1.1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
  1. rfc_score2 = cross_val_score(RFC(n_estimators = 10, random_state = 0), x_var_half, y, cv = 5)
  2. rfc_score2.mean()
  1. 0.9388098166696807
  1. %%timeit
  2. rfc_score2 = cross_val_score(RFC(n_estimators = 10, random_state = 0), x_var_half, y, cv = 5)
  1. 12.5 s ± 319 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

threshold

In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins, rather than learned from the training data. Hyperparameters usually need to be tuned: choosing a good set of them improves the learner's performance. The variance threshold used above is exactly such a hyperparameter; a sketch of tuning it follows.
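A minimal sketch of such tuning: scan a few candidate thresholds with cross-validation, reusing x, y, RFC, VarianceThreshold and cross_val_score from the cells above. The five-point grid is an arbitrary choice, and each iteration is expensive on this dataset:

  import numpy as np

  candidates = np.linspace(0, np.median(x.var()), 5)   # arbitrary grid from 0 up to the median variance
  for t in candidates:
      x_t = VarianceThreshold(threshold = t).fit_transform(x)
      acc = cross_val_score(RFC(n_estimators = 10, random_state = 0), x_t, y, cv = 5).mean()
      print(t, x_t.shape[1], acc)                      # threshold, number of features kept, accuracy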

Correlation filtering

Chi-square filtering, for discrete labels (classification problems)

feature_selection.chi2 computes the chi-square statistic between each non-negative feature and the label; feature_selection.SelectKBest selects the K highest-scoring features according to the given scoring function

  1. from sklearn.feature_selection import chi2
  2. from sklearn.feature_selection import SelectKBest
  1. k = SelectKBest(chi2, k = 300).fit_transform(x_var_half, y)
  2. k.shape
  1. (42000, 300)
  1. cross_val_score(RFC(n_estimators = 10, random_state = 0), k, y, cv = 5).mean()
  1. 0.9333098667649198

Keeping 300 of the 392 features lowers the accuracy.

  1. from matplotlib import pyplot as plt
  2. %matplotlib inline
  1. score = []
  2. for i in range(390, 200, -10):
  3. k = SelectKBest(chi2, k = i).fit_transform(x_var_half, y)
  4. once = cross_val_score(RFC(n_estimators = 10, random_state = 0), k, y, cv = 5).mean()
  5. score.append(once)
  6. plt.plot(range(390, 200, -10), score)
  1. [<matplotlib.lines.Line2D at 0x2461e846a90>]

output_135_1.png

  1. chivalue, pvalue_chi = chi2(x_var_half, y)
  1. chivalue[:10]
  1. array([ 945664.84392643, 1244766.05139164, 1554872.30384525,
  2. 1834161.78305343, 1903618.94085294, 1845226.62427198,
  3. 1602117.23307537, 708535.17489837, 974050.20513718,
  4. 1188092.19961931])
  1. np.unique(pvalue_chi)
  1. array([0.])

All p-values are below 0.05, indicating that every feature is related to the label.

  1. k = chivalue.shape[0] - (pvalue_chi > 0.05).sum()
  2. k
  1. 392

F-test

ANOVA (analysis of variance), used to capture the linear relationship between each feature and the label

feature_selection.f_classif feature_selection.f_regression

  1. from sklearn.feature_selection import f_classif
  2. F, pvalue_f = f_classif(x_var_half, y)
  3. k = F.shape[0] - (pvalue_f > 0.05).sum()
  1. k
  1. 392

Mutual information

Mutual information captures any kind of relationship between each feature and the label, both linear and nonlinear.

feature_selection.mutual_info_classif feature_selection.mutual_info_regression

  1. from sklearn.feature_selection import mutual_info_classif
  1. result = mutual_info_classif(x_var_half, y)
  2. k = result.shape[0] - (result <= 0).sum()
  1. k
  1. 392

2. Embedded methods

feature_selection.SelectFromModel

  1. from sklearn.feature_selection import SelectFromModel
  1. # instantiate the random forest
  2. RFC_ = RFC(n_estimators = 10, random_state = 0)
  3. # instantiate SelectFromModel: SelectFromModel(RFC_, threshold = 0.005)
  4. X_embedded = SelectFromModel(RFC_, threshold = 0.005).fit_transform(x, y)
  1. X_embedded.shape
  1. (42000, 47)
  1. importances = RFC_.fit(x, y).feature_importances_
  2. importances.max()
  1. 0.01276360214820271
  1. threshold = np.linspace(0, importances.max(), 20)
  2. threshold
  1. array([0. , 0.00067177, 0.00134354, 0.00201531, 0.00268707,
  2. 0.00335884, 0.00403061, 0.00470238, 0.00537415, 0.00604592,
  3. 0.00671769, 0.00738945, 0.00806122, 0.00873299, 0.00940476,
  4. 0.01007653, 0.0107483 , 0.01142007, 0.01209183, 0.0127636 ])
  1. score = []
  2. for i in threshold:
  3. X_embedded = SelectFromModel(RFC_, threshold = i).fit_transform(x, y)
  4. once = cross_val_score(RFC_, X_embedded, y, cv = 5).mean()
  5. score.append(once)
  6. plt.plot(threshold, score)
  1. [<matplotlib.lines.Line2D at 0x24620a068d0>]

output_154_1.png

  1. [*zip(threshold, score)]
  1. [(0.0, 0.9380003861799541),
  2. (0.000671768534115932, 0.939905083368037),
  3. (0.001343537068231864, 0.9356900373288164),
  4. (0.002015305602347796, 0.9306673521719839),
  5. (0.002687074136463728, 0.9282624651248446),
  6. (0.0033588426705796603, 0.923095721100568),
  7. (0.004030611204695592, 0.9170958532189901),
  8. (0.0047023797388115246, 0.9015485971667836),
  9. (0.005374148272927456, 0.8915237372973654),
  10. (0.006045916807043388, 0.8517627553998419),
  11. (0.0067176853411593206, 0.8243101686902852),
  12. (0.007389453875275252, 0.7305249229105348),
  13. (0.008061222409391184, 0.6961659491147189),
  14. (0.008732990943507116, 0.6961659491147189),
  15. (0.009404759477623049, 0.6656903457724771),
  16. (0.01007652801173898, 0.5222374843202717),
  17. (0.010748296545854913, 0.2654045352411921),
  18. (0.011420065079970844, 0.18971438901493287),
  19. (0.012091833614086776, 0.18971438901493287),
  20. (0.01276360214820271, 0.18971438901493287)]
  1. X_embedded = SelectFromModel(RFC_, threshold = 0.000671768534115932).fit_transform(x, y)
  2. X_embedded.shape
  1. (42000, 324)
  1. cross_val_score(RFC_, X_embedded, y, cv = 5).mean()
  1. 0.939905083368037
  1. score2 = []
  2. for i in np.linspace(0, 0.00134, 20):
  3. X_embedded = SelectFromModel(RFC_, threshold = i).fit_transform(x, y)
  4. once = cross_val_score(RFC_, X_embedded, y, cv = 5).mean()
  5. score2.append(once)
  6. plt.figure(figsize = (20, 5))
  7. plt.plot(np.linspace(0, 0.00134, 20), score2)
  8. plt.xticks(np.linspace(0, 0.00134, 20))
  1. (plt.xticks returns a list of 20 XTick objects and a list of 20 Text xticklabel objects)

output_158_1.png

  1. X_embedded = SelectFromModel(RFC_, threshold = 0.000564).fit_transform(x, y)
  2. cross_val_score(RFC_, X_embedded, y, cv = 5).mean()
  1. 0.9408335415056387
  1. X_embedded.shape
  1. (42000, 340)
  1. X_embedded = SelectFromModel(RFC_, threshold = 0.000564).fit_transform(x, y)
  2. cross_val_score(RFC(n_estimators = 100, random_state = 0), X_embedded, y, cv = 5).mean()
  1. 0.9639525817795566

3. Wrapper methods

feature_selection.RFE feature_selection.RFECV (the cross-validated variant, sketched at the end of this subsection)

  1. from sklearn.feature_selection import RFE
  2. from sklearn.ensemble import RandomForestClassifier as RFC
  3. from sklearn.model_selection import cross_val_score
  4. import matplotlib.pyplot as plt
  5. %matplotlib inline
  1. RFC_ = RFC(n_estimators = 10, random_state = 0)
  2. selector = RFE(RFC_, n_features_to_select = 340, step = 50).fit(x, y)
  1. selector.support_.sum()
  1. 340
  1. selector.ranking_[:10]
  1. array([10, 9, 8, 7, 6, 6, 6, 6, 6, 6])
  1. X_wrapper = selector.transform(x)
  1. cross_val_score(RFC_, X_wrapper, y, cv = 5).mean()
  1. 0.9389522459432109
  1. # plot the learning curve for the wrapper method
  2. score = []
  3. for i in range(1, 751, 50):
  4. X_wrapper = RFE(RFC_, n_features_to_select = i, step = 50).fit_transform(x, y)
  5. once = cross_val_score(RFC_, X_wrapper, y, cv = 5).mean()
  6. score.append(once)
  7. plt.figure(figsize = [20, 5])
  8. plt.plot(range(1, 751, 50), score)
  9. plt.xticks(range(1, 751, 50))
  1. (plt.xticks returns a list of 15 XTick objects and a list of 15 Text xticklabel objects)

output_169_1.png

  1. [*zip(range(1, 751, 50), score)]
  1. [(1, 0.21014266614748944),
  2. (51, 0.9085484398226157),
  3. (101, 0.924928711799029),
  4. (151, 0.9310242511366686),
  5. (201, 0.9363097850232442),
  6. (251, 0.9353575512540064),
  7. (301, 0.9384290399778331),
  8. (351, 0.938905032072306),
  9. (401, 0.9372626252970612),
  10. (451, 0.9398097941390517),
  11. (501, 0.9387144536851132),
  12. (551, 0.9395716620079153),
  13. (601, 0.9385243603795994),
  14. (651, 0.9380951142673439),
  15. (701, 0.937214229660641)]
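RFECV, mentioned at the start of this subsection but not run in the notebook, wraps RFE in cross-validation and picks the number of features automatically. A minimal sketch, reusing RFC_, x and y from above; step, cv and min_features_to_select are arbitrary choices, and the run is slow on this dataset:

  from sklearn.feature_selection import RFECV

  selector_cv = RFECV(RFC_, step = 50, cv = 5, min_features_to_select = 50).fit(x, y)
  selector_cv.n_features_                  # number of features chosen by cross-validation
  X_wrapper_cv = selector_cv.transform(x)  # the selected feature matrix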

III. Feature creation: dimensionality reduction algorithms in sklearn

PCA重要接口、参数和属性.png

1. Visualizing high-dimensional data

  1. from sklearn.datasets import load_iris
  2. from sklearn.decomposition import PCA
  1. iris = load_iris() # load the dataset
  2. x = iris.data
  3. y = iris.target
  4. x.shape
  1. (150, 4)

x is a two-dimensional array holding four features.

  1. pca = PCA(n_components = 2) # instantiate
  2. pca = pca.fit(x) # fit the model
  3. x_dr = pca.transform(x) # obtain the new (reduced) matrix
  1. x_dr.shape
  1. (150, 2)
  1. plt.figure(figsize = (10, 6))
  2. for i in [0, 1, 2]:
  3. plt.scatter(x_dr[y == i, 0], x_dr[y == i, 1],
  4. label = iris.target_names[i])
  5. plt.legend()
  6. plt.title('PCA of IRIS dataset')
  1. Text(0.5, 1.0, 'PCA of IRIS dataset')

output_178_1.png

Distance-based models are very good at separating this kind of distribution; a quick check follows.
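As a quick, hypothetical check of that claim (not in the original notebook), a distance-based classifier such as KNN can be cross-validated directly on the two PCA components; exact scores will vary:

  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.model_selection import cross_val_score

  cross_val_score(KNeighborsClassifier(), x_dr, y, cv = 5).mean()   # typically well above 0.9 on iris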

2. n_components

Three ways to choose it: from the cumulative explained variance ratio curve, by maximum likelihood estimation ('mle'), or by the fraction of information to keep.

The cumulative explained variance ratio curve

  1. pca.explained_variance_
  2. # the amount of information carried by each new component after reduction,
  3. # i.e. the explained variance
  1. array([4.22824171, 0.24267075])
  1. pca.explained_variance_ratio_
  2. # the share of the original total information carried by each new component,
  3. # i.e. the explained variance ratio
  1. array([0.92461872, 0.05306648])

Most of the information is concentrated in the first component.

  1. pca.explained_variance_ratio_.sum()
  1. 0.9776852063187949
  1. pca_line = PCA().fit(x)
  2. pca_line.explained_variance_ratio_.sum()
  1. 1.0
  1. import numpy as np
  1. np.cumsum(pca_line.explained_variance_ratio_)
  1. array([0.92461872, 0.97768521, 0.99478782, 1. ])
  1. plt_line = PCA().fit(x)
  2. plt.plot([1, 2, 3, 4], # plot against [1, 2, 3, 4] instead of the default index [0, 1, 2, 3]
  3. np.cumsum(pca_line.explained_variance_ratio_))
  4. plt.xticks([1, 2, 3, 4]) # show integer ticks on the axis
  5. plt.xlabel('number of components after dimension reduction')
  6. plt.ylabel('cumulative explained variance ratio')
  1. Text(0, 0.5, 'cumulative explained variance ratio')

output_189_1.png

Choosing the hyperparameter by maximum likelihood estimation

  1. pca_f = PCA(n_components = 'mle')
  2. pca_f = pca_f.fit(x)
  3. pca_f.explained_variance_ratio_.sum()
  1. 0.9947878161267246
  1. pca_f.explained_variance_ratio_
  1. array([0.92461872, 0.05306648, 0.01710261])

Choosing by the fraction of information to keep

This assumes we know the minimum amount of information the model needs in order to work properly.

  1. pca_f = PCA(n_components = 0.99)
  2. pca_f = pca_f.fit(x)
  3. pca_f.explained_variance_ratio_.sum()
  1. 0.9947878161267246
  1. pca_f.explained_variance_ratio_
  1. array([0.92461872, 0.05306648, 0.01710261])
  1. pca_f = PCA(n_components = 0.99, svd_solver = 'full')
  2. pca_f = pca_f.fit(x)
  3. pca_f.explained_variance_ratio_.sum()
  1. 0.9947878161267246
  1. pca_f.explained_variance_ratio_
  1. array([0.92461872, 0.05306648, 0.01710261])

3. svd_solver: the singular value decomposition solver

Obtain the feature space V(k, n), stored in the components_ attribute:

pca_f = PCA(n_components = k, svd_solver = 'full')
pca_f = pca_f.fit(x)

Map x onto the feature space V(k, n) to produce the reduced feature matrix:

x_f = pca_f.transform(x)

  1. PCA(n_components = 2, svd_solver = 'full').fit(x).components_
  1. array([[ 0.36138659, -0.08452251, 0.85667061, 0.3582892 ],
  2. [ 0.65658877, 0.73016143, -0.17337266, -0.07548102]])
  1. PCA(2).fit(x).components_
  1. array([[ 0.36138659, -0.08452251, 0.85667061, 0.3582892 ],
  2. [ 0.65658877, 0.73016143, -0.17337266, -0.07548102]])

svd_solver defaults to 'auto'; the other options are 'full', 'arpack' and 'randomized'. A small comparison follows.
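A minimal sketch comparing the exact 'full' solver with the 'randomized' solver on the iris data from above; on a matrix this small the two give essentially the same result, and 'randomized' only pays off on large matrices (random_state fixes the randomized solver's seed):

  pca_full = PCA(n_components = 2, svd_solver = 'full', random_state = 0).fit(x)
  pca_rand = PCA(n_components = 2, svd_solver = 'randomized', random_state = 0).fit(x)
  pca_full.explained_variance_ratio_
  pca_rand.explained_variance_ratio_   # nearly identical to the exact solver's result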

4. components_

How PCA differs from feature selection:

A feature matrix obtained by feature selection is still interpretable, whereas PCA compresses the existing features: the reduced features are no longer any of the original features and cannot be read directly.

[Note] PCA's goal is to find, from the original features, new feature vectors that concentrate as much information as possible.
In the combined PCA-and-SVD reduction algorithm that sklearn uses, the new feature space formed by these vectors is V(k, n).

When V(k, n) is just a matrix of numbers, we cannot tell how it relates to the original features; but when the original feature matrix consists of images and V(k, n) can be visualized, comparing the two sets of images shows how the new feature space extracts information from the original data.

  1. from sklearn.datasets import fetch_lfw_people
  2. from sklearn.decomposition import PCA
  1. faces = fetch_lfw_people(min_faces_per_person = 60)
  1. faces.data.shape
  1. (964, 2914)
  1. faces.images.shape
  1. (964, 62, 47)
  1. fig, ax = plt.subplots(4, 5, figsize = [8, 4])

output_207_0.png

  1. fig, axes = plt.subplots(4, 5, figsize = [8, 4],
  2. subplot_kw = {'xticks':[], 'yticks':[]} # hide the axis ticks
  3. )

output_208_0.png

  1. fig

output_209_0.png

  1. axes
  1. (a 4 × 5 numpy array of AxesSubplot objects)
  1. axes.shape
  1. (4, 5)
  1. axes[0][0].imshow(faces.images[0, :, :])
  1. <matplotlib.image.AxesImage at 0x17a1208a4a8>
  1. fig

output_213_0.png

  1. [*axes.flat] # axes.flat is a lazy iterator; unpack it to see the subplots
  1. (the 20 AxesSubplot objects as a flat list)
  1. [*enumerate(axes.flat)]
  1. (pairs of (index, AxesSubplot), from index 0 through 19)
  1. for i, ax in enumerate(axes.flat):
  2. ax.imshow(faces.images[i, :, :], cmap = 'gray')
  1. fig

output_217_0.png

  1. pca = PCA(150).fit(faces.data)
  1. V = pca.components_
  1. V.shape
  1. (150, 2914)

Multiplying the original feature matrix X by V(k, n), i.e. projecting X onto it, completes the dimensionality reduction.

  1. fig, axes = plt.subplots(4, 5, figsize = [8, 4],
  2. subplot_kw = {'xticks':[], 'yticks':[]} # hide the axis ticks
  3. )
  4. for i, ax in enumerate(axes.flat):
  5. ax.imshow(V[i, :].reshape(62, 47), cmap = 'gray')

output_222_0.png

  1. x_tran = pca.transform(faces.data)
  1. x_tran.shape
  1. (964, 150)
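As a sanity check (not in the original notebook): sklearn's PCA centers the data and then projects it onto the rows of components_, so x_tran can be reproduced by hand from V:

  import numpy as np

  x_manual = (faces.data - pca.mean_) @ V.T                  # center, then project onto the k components
  np.allclose(x_manual, x_tran, rtol = 1e-3, atol = 1e-2)    # should print True, up to float32 rounding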

5. inverse_transform

  1. import matplotlib.pyplot as plt
  2. %matplotlib inline
  3. import numpy as np
  4. import pandas as pd
  5. from sklearn.datasets import fetch_lfw_people
  6. from sklearn.decomposition import PCA
  1. faces = fetch_lfw_people(min_faces_per_person = 60)
  1. faces.data
  1. array([[ 50.666668, 62. , 90.333336, ..., 68.666664, 78. ,
  2. 80.333336],
  3. [117.666664, 106.666664, 96. , ..., 233.66667 , 234.33333 ,
  4. 227. ],
  5. [ 70.666664, 64. , 58.666668, ..., 115. , 118. ,
  6. 121.333336],
  7. ...,
  8. [139. , 148.33333 , 156.33333 , ..., 49. , 19.333334,
  9. 12. ],
  10. [126.666664, 118.666664, 133. , ..., 68.333336, 64.666664,
  11. 56. ],
  12. [ 65.333336, 86. , 105.666664, ..., 179. , 93.333336,
  13. 10.333333]], dtype=float32)
  1. faces.data.shape
  1. (964, 2914)
  1. faces.images.shape
  1. (964, 62, 47)
  1. pca = PCA(150) # instantiate
  2. x_dr = pca.fit_transform(faces.data) # fit and transform in one step
  1. x_dr
  1. array([[ -687.52765 , 658.6859 , -224.83125 , ..., -5.8305807,
  2. 37.9641 , -68.2529 ],
  3. [ -520.3875 , -337.95184 , -85.0649 , ..., 56.23444 ,
  4. -7.104003 , -27.09345 ],
  5. [ -668.42316 , 177.38837 , 705.5544 , ..., -93.13145 ,
  6. -28.745838 , 41.614784 ],
  7. ...,
  8. [-1426.5477 , -316.7213 , -184.77708 , ..., 73.25698 ,
  9. -9.299059 , -5.395314 ],
  10. [ 1393.6271 , 936.3961 , 599.951 , ..., 38.447098 ,
  11. 25.35701 , -42.811676 ],
  12. [ 784.0877 , 469.102 , -400.20416 , ..., -21.948296 ,
  13. 47.49396 , -11.908577 ]], dtype=float32)
  1. x_dr.shape
  1. (964, 150)
  1. pca.explained_variance_ratio_.sum()
  1. 0.9528588
  1. x_inverse = pca.inverse_transform(x_dr)
  1. x_inverse
  1. array([[ 61.003284 , 65.75134 , 73.682495 , ..., 67.226166 ,
  2. 68.544495 , 72.683105 ],
  3. [102.78923 , 107.009964 , 110.27441 , ..., 233.90904 ,
  4. 227.82428 , 215.0698 ],
  5. [ 65.36822 , 59.9917 , 58.116863 , ..., 113.183395 ,
  6. 105.91263 , 102.84451 ],
  7. ...,
  8. [124.8531 , 130.30939 , 145.5615 , ..., 73.5932 ,
  9. 26.662682 , -2.8888931],
  10. [144.30975 , 138.76814 , 135.97636 , ..., 79.788666 ,
  11. 65.55131 , 53.822388 ],
  12. [ 89.81032 , 98.244835 , 114.291756 , ..., 156.79715 ,
  13. 75.214355 , 12.275055 ]], dtype=float32)
  1. x_inverse.shape
  1. (964, 2914)
  1. fig, axes = plt.subplots(2, 10, figsize = (10, 2.5),
  2. subplot_kw = {'xticks':[], 'yticks':[]} # hide the axis ticks
  3. )
  4. for i in range(10):
  5. axes[0, i].imshow(faces.images[i], cmap = 'binary_r')
  6. axes[1, i].imshow(x_inverse[i].reshape(62, 47), cmap = 'binary_r')

output_238_0.png

  1. fit_transform: dimensionality reduction

(964, 2914) —> (964, 150)

  2. inverse_transform: not a restoration of the original data, but a mapping of the reduced data back up to the original dimensionality

(964, 150) —> (964, 2914). How much of the original information is recovered depends on how much was discarded during the reduction step.

  3. reshape

(964, 2914) —> (964, 62, 47)
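One way to quantify what was lost (our own check, not in the original notebook) is to compare the reconstruction with the original pixels; the error is small precisely because only about 5% of the variance was discarded:

  recon_error = ((faces.data - x_inverse) ** 2).sum(axis = 1).mean()   # average squared pixel error per image
  discarded = 1 - pca.explained_variance_ratio_.sum()                  # ≈ 0.05: the fraction of variance thrown away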

Filtering noise with PCA

Use inverse_transform to keep the original dimensionality while removing the noise.

  1. from sklearn.datasets import load_digits

Inspect the original images

  1. load_digits = load_digits() # note: this rebinds the name to the dataset object, shadowing the imported function
  1. load_digits.data.shape # 1797 samples, 64 features
  1. (1797, 64)
  1. load_digits.images.shape # the handwritten digits are, at heart, images
  1. (1797, 8, 8)
  1. set(load_digits.target) # load_digits is the handwritten digit dataset; the labels are the digits 0 through 9
  1. {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
  1. def plot_digits(data):
  2. # data must have shape (m, n), where n can be rearranged into an (x, y) pixel grid
  3. fig, axes = plt.subplots(4, 10, figsize = (10, 4),
  4. subplot_kw = {'xticks':[], 'yticks':[]})
  5. for i, ax in enumerate(axes.flat):
  6. ax.imshow(data[i].reshape(8, 8) # if the images array is passed in, reshape is not needed
  7. , cmap = 'binary')
  1. plot_digits(load_digits.images)

output_248_0.png

Add noise

  1. rng = np.random.RandomState(42)
  2. noisy = rng.normal(load_digits.data # draw normally distributed values centered on the original data
  3. , 2 # the standard deviation of the added noise is 2
  4. )
  5. noisy.shape
  1. (1797, 64)
  1. plot_digits(noisy)

output_251_0.png

How np.random.normal() works

loc: float, the mean of the distribution (the centre of the whole distribution)

scale: float, the standard deviation of the distribution (its width: the larger the scale, the shorter and wider the curve; the smaller, the taller and narrower)

size: int or tuple of ints, the shape of the output; defaults to None, which returns a single value

Probability density function: $f(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

  1. # sampling:
  2. s = rng.normal(loc=0, scale=.1, size=1000)
  3. # fitting:
  4. count, bins, _ = plt.hist(s, 30, density=True)
  5. # density=True (normed=True in old Matplotlib) normalizes the histogram so it can be compared with the pdf
  6. # count holds the normalized height of each bin
  7. plt.plot(bins, 1./(np.sqrt(2*np.pi)*.1)*np.exp(-(bins-0)**2/(2*.1**2)), lw=2, c='r')
  1. [<matplotlib.lines.Line2D at 0x1442765a898>]

output_253_2.png

  1. pca = PCA(0.5).fit(noisy)
  2. x_dr = pca.transform(noisy)
  3. x_dr.shape
  1. (1797, 6)
  1. x_inverse = pca.inverse_transform(x_dr)
  2. x_inverse.shape
  1. (1797, 64)
  1. plot_digits(x_inverse)

output_256_0.png

6. Case study: PCA on the handwritten digit dataset

  1. from sklearn.ensemble import RandomForestClassifier as RFC
  2. from sklearn.model_selection import cross_val_score
  1. data = pd.read_csv('digit recognizor.csv')
  2. data.shape
  1. (42000, 785)
  1. data.info()
  1. <class 'pandas.core.frame.DataFrame'>
  2. RangeIndex: 42000 entries, 0 to 41999
  3. Columns: 785 entries, label to pixel783
  4. dtypes: int64(785)
  5. memory usage: 251.5 MB
  1. data.head()
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 785 columns

  1. data.columns
  1. Index(['label', 'pixel0', 'pixel1', 'pixel2', 'pixel3', 'pixel4', 'pixel5',
  2. 'pixel6', 'pixel7', 'pixel8',
  3. ...
  4. 'pixel774', 'pixel775', 'pixel776', 'pixel777', 'pixel778', 'pixel779',
  5. 'pixel780', 'pixel781', 'pixel782', 'pixel783'],
  6. dtype='object', length=785)
  1. y = data.iloc[:, 0]
  2. x = data.iloc[:, 1:]
  3. x.shape
  1. (42000, 784)

Plot the cumulative explained variance ratio curve to find a promising range for the reduced dimensionality

  1. pca_line = PCA().fit(x)
  2. plt.figure(figsize = (20, 5))
  3. plt.plot(np.cumsum(pca_line.explained_variance_ratio_))
  4. plt.xlabel('number of components after dimension reduction')
  5. plt.ylabel('cumulative explained variance ratio')
  6. plt.yticks(np.arange(0, 1, 0.1))
  1. Text(0, 0.5, 'cumulative explained variance ratio')

output_265_1.png

Plot a learning curve over the reduced dimensionality to narrow the range

  1. score = []
  2. for i in range(1, 101, 10):
  3. x_dr = PCA(i).fit_transform(x)
  4. once = cross_val_score(RFC(n_estimators = 10, random_state = 0), x_dr, y, cv =5).mean()
  5. score.append(once)
  6. plt.figure(figsize = (20, 5))
  7. plt.plot(score)
  1. [<matplotlib.lines.Line2D at 0x14431964358>]

output_267_1.png

Refine the learning curve to find the best number of components

  1. score = []
  2. for i in range(10, 25):
  3. x_dr = PCA(i).fit_transform(x)
  4. once = cross_val_score(RFC(n_estimators = 10, random_state = 0), x_dr, y, cv =5).mean()
  5. score.append(once)
  6. plt.figure(figsize = (20, 5))
  7. plt.plot(range(10, 25), score)
  1. [<matplotlib.lines.Line2D at 0x14432c25240>]

output_269_1.png

  1. print(max(score), score.index(max(score)) + 10)
  1. 0.9193339375531199 22

Reduce to the best number of components and check the model's performance

  1. x_dr = PCA(22).fit_transform(x)
  2. score = cross_val_score(RFC(n_estimators = 100, random_state = 0), x_dr, y, cv =5).mean()
  3. score
  1. 0.9441668641981771

Learning curve over KNN's k value

  1. from sklearn.neighbors import KNeighborsClassifier as KNN
  2. x_dr = PCA(22).fit_transform(x)
  3. score = cross_val_score(KNN(), x_dr, y, cv =5).mean()
  4. score
  1. 0.9687611825661353
  1. score = []
  2. for i in range(1, 11):
  3. x_dr = PCA(22).fit_transform(x)
  4. once = cross_val_score(KNN(i), x_dr, y, cv =5).mean()
  5. score.append(once)
  6. plt.figure(figsize = (20, 5))
  7. plt.plot(range(1, 11), score)
  1. [<matplotlib.lines.Line2D at 0x144289b90f0>]

output_275_1.png

  1. cross_val_score(KNN(3), x_dr, y, cv =5).mean()
  1. 0.969023172398515