I. Machine Learning Concepts

Basic concepts

  • Feature: one of the multiple attributes describing a data point
  • Label: the target value for a sample, assigned by humans
  • Supervised learning: e.g. classification (discrete labels) and regression (continuous labels); there is a "standard answer"
  • Unsupervised learning: e.g. clustering; there is no "standard answer"
  • Data: the lifeblood of machine learning
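In scikit-learn terms, the features form a 2-D array X of shape (n_samples, n_features) and the labels a 1-D array y with one entry per sample. A made-up illustration (the numbers are arbitrary):

```python
import numpy as np

# 4 samples, each described by 2 features (e.g. petal length and width)
X = np.array([[1.4, 0.2],
              [4.7, 1.4],
              [1.3, 0.2],
              [5.1, 1.9]])
# One label per sample: 0 and 1 stand for two flower classes
y = np.array([0, 1, 0, 1])

print(X.shape)  # (4, 2) -> (n_samples, n_features)
print(y.shape)  # (4,)
```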

Basic framework diagram

Using the scikit-learn Library - Figure 1
Using the scikit-learn Library - Figure 2

II. The Stages/Steps of Machine Learning

Notes on sklearn

Official site: scikit-learn

  1. Provides clustering, classification, regression and other algorithms
    e.g. random forest, k-means, SVM
  2. Provides model selection, dimensionality reduction, preprocessing and other tools
  3. Pay special attention to the details of installing the package; see the previous post for specifics

The general SOP for machine learning with sklearn

  1. Prepare the dataset
    • Data analysis: reshape the data into a 2-D array of shape (n_samples, n_features) with np.reshape()
    • Split the dataset: train_test_split()
    • Feature engineering: feature extraction, feature normalization
  2. Choose a model
  3. Train the model on the training set and tune its parameters
    • pick parameters based on experience
    • determine the optimal parameters via cross validation
  4. Test the model on the test set
    • predict for predictions, score to compare predictions against the true values, etc.
  5. Save the model
    • import pickle
Principal component analysis (PCA): reducing the feature dimensionality

  • Statistics background: variance (measures the spread along one dimension), covariance (measures whether one dimension affects another, cov(x, y))
  • Linear algebra background: eigenvalues, eigenvectors, the covariance matrix
  • PCA
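A minimal sketch of PCA in scikit-learn, reducing the four iris features to two principal components (the component count here is arbitrary):

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

# Reduce the 4 iris features to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(iris.data)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # share of total variance kept per component
```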

Related code: HTML page

III. Getting to Know Machine Learning Through scikit-learn

Loading the example datasets

```python
from sklearn import datasets
iris = datasets.load_iris()  # datasets bundled with sklearn
digits = datasets.load_digits()
# location of the datasets on disk:
# C:\Users\wztli\Anaconda3\pkgs\scikit-learn-0.21.3-py37h6288b17_0\Lib\site-packages\sklearn\datasets\data
```
```python
# Inspect the iris dataset
print(iris.data[:5])
print(iris.data.shape)
print(iris.target_names)
print(iris.target)
```

```
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
(150, 4)
['setosa' 'versicolor' 'virginica']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
```
```python
# Inspect the digits dataset
print(digits.data)
print(digits.data.shape)
print(digits.target_names)
print(digits.target)
```

```
[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]
(1797, 64)
[0 1 2 3 4 5 6 7 8 9]
[0 1 2 ... 8 9 8]
```

Training a model on the training set

```python
# Manually split into training and test sets
n_test = 100  # number of test samples
train_X = digits.data[:-n_test, :]
train_y = digits.target[:-n_test]
test_X = digits.data[-n_test:, :]
y_true = digits.target[-n_test:]
```
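The manual slicing above can also be written with train_test_split; passing shuffle=False keeps the original sample order, so the last 100 samples again become the test set (a sketch):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
# shuffle=False reproduces the "last 100 samples are the test set" split
train_X, test_X, train_y, y_true = train_test_split(
    digits.data, digits.target, test_size=100, shuffle=False)

print(train_X.shape, test_X.shape)  # (1697, 64) (100, 64)
```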
```python
# Choose an SVM model
from sklearn import svm
svm_model = svm.SVC(gamma=0.001, C=100.)
# svm_model = svm.SVC(gamma=100., C=1.)
# Train the model: fit takes two arguments,
# the samples' feature data and the samples' labels
svm_model.fit(train_X, train_y)
```

```
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
```
```python
# Choose an LR (logistic regression) model
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression()
# Train the model
lr_model.fit(train_X, train_y)
```

```
C:\Users\wztli\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\wztli\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
```

Testing the model on the test set

```python
y_pred_svm = svm_model.predict(test_X)
y_pred_lr = lr_model.predict(test_X)
```

```python
# Inspect the results with an evaluation metric
from sklearn.metrics import accuracy_score
# print('Predicted labels:', y_pred)
# print('True labels:', y_true)
print('SVM result:', accuracy_score(y_true, y_pred_svm))
print('LR result:', accuracy_score(y_true, y_pred_lr))
```

```
SVM result: 0.98
LR result: 0.94
```
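Beyond accuracy_score, sklearn.metrics also offers per-class breakdowns such as confusion_matrix and classification_report. A self-contained sketch reproducing the same SVM split as above:

```python
from sklearn import datasets, svm
from sklearn.metrics import classification_report, confusion_matrix

digits = datasets.load_digits()
n_test = 100
model = svm.SVC(gamma=0.001, C=100.)
model.fit(digits.data[:-n_test], digits.target[:-n_test])

y_true = digits.target[-n_test:]
y_pred = model.predict(digits.data[-n_test:])

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
# Per-class precision, recall and F1
print(classification_report(y_true, y_pred))
```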

Saving the model

```python
import pickle
with open('svm_model.pkl', 'wb') as f:
    pickle.dump(svm_model, f)
```

```python
import numpy as np
# Reload the model and use it for prediction
with open('svm_model.pkl', 'rb') as f:
    model = pickle.load(f)
random_samples_index = np.random.randint(0, 1796, 5)
random_samples = digits.data[random_samples_index, :]
random_targets = digits.target[random_samples_index]
random_predict = model.predict(random_samples)
print(random_predict)
print(random_targets)
```

```
[2 2 1 3 8]
[2 2 1 3 8]
```
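As an alternative to pickle, scikit-learn's documentation recommends joblib for persisting models that hold large NumPy arrays (the filename here is illustrative):

```python
import joblib  # installed as a scikit-learn dependency
from sklearn import datasets, svm

digits = datasets.load_digits()
model = svm.SVC(gamma=0.001, C=100.).fit(digits.data, digits.target)

joblib.dump(model, 'svm_model.joblib')      # save to disk
restored = joblib.load('svm_model.joblib')  # load it back
print(restored.predict(digits.data[:1]))
```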

IV. Getting Started with scikit-learn

Preparing the dataset

```python
import numpy as np
from sklearn.model_selection import train_test_split
```

```python
X = np.random.randint(0, 100, (10, 4))
y = np.random.randint(0, 4, 10)
y.sort()
print('Samples:')
print(X)
print('Labels:', y)
```

```
Samples:
[[43 43 18 78]
 [74 24 42 37]
 [36 69 84 47]
 [70 62 77 30]
 [87 38  3 96]
 [68 67 24  7]
 [66 36 72 72]
 [12 94 87 72]
 [66  5 92  6]
 [41 59 60 91]]
Labels: [0 0 0 2 2 2 2 3 3 3]
```
```python
# Split into training and test sets
# random_state ensures the same random split on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=7)
print('Training set:')
print(X_train)
print(y_train)
print('Test set:')
print(X_test)
print(y_test)
```

```
Training set:
[[63 56  7 42]
 [40 47 17 23]
 [41 31 26  8]
 [79 30 22 88]
 [54 85 48 54]
 [89 73 77 41]]
[0 1 1 0 1 1]
Test set:
[[ 3  0 42 86]
 [42 96 83 38]
 [33 45  8 37]
 [ 1 44 75  7]]
[1 1 0 0]
```
```python
# Feature normalization
from sklearn import preprocessing
x1 = np.random.randint(0, 1000, 5).reshape(5, 1)
x2 = np.random.randint(0, 10, 5).reshape(5, 1)
x3 = np.random.randint(0, 100000, 5).reshape(5, 1)
X = np.concatenate([x1, x2, x3], axis=1)
print(X)
```

```
[[  353     4 27241]
 [  999     4 34684]
 [  911     4 78606]
 [  310     6 44593]
 [  817     9  6356]]
```

```python
print(preprocessing.scale(X))
```

```
[[-1.12443958 -0.71443451 -0.46550183]
 [ 1.11060033 -0.71443451 -0.15209341]
 [ 0.80613669 -0.71443451  1.69736578]
 [-1.27321159  0.30618622  0.26515287]
 [ 0.48091416  1.83711731 -1.34492342]]
```
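preprocessing.scale standardizes each column to zero mean and unit variance; preprocessing.minmax_scale is a common alternative that maps each feature into [0, 1]. A sketch using the same matrix as above:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[353, 4, 27241],
              [999, 4, 34684],
              [911, 4, 78606],
              [310, 6, 44593],
              [817, 9,  6356]], dtype=float)

X_std = preprocessing.scale(X)
print(X_std.mean(axis=0))  # ~0 for every column
print(X_std.std(axis=0))   # 1 for every column

X_01 = preprocessing.minmax_scale(X)  # each feature rescaled into [0, 1]
print(X_01.min(axis=0), X_01.max(axis=0))
```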
```python
# Generate classification data to show why scaling is necessary
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
%matplotlib inline
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2,
                           random_state=25, n_clusters_per_class=1, scale=100)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
```

Using the scikit-learn Library - Figure 3

```python
from sklearn import svm
# Comment out the following line to skip feature normalization
X = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=7)
svm_classifier = svm.SVC()
svm_classifier.fit(X_train, y_train)
svm_classifier.score(X_test, y_test)
```

```
C:\Users\wztli\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
0.25
```

(The low 0.25 score shown here appears to come from a run with the scaling line commented out; with preprocessing.scale applied, the accuracy is much higher.)

Training a model

```python
# A regression model
from sklearn import datasets
# note: load_boston was removed in scikit-learn 1.2; this ran on 0.21
boston_data = datasets.load_boston()
X = boston_data.data
y = boston_data.target
print('Samples:')
print(X[:5, :])
print('Labels:')
print(y[:5])
```

```
Samples:
[[6.3200e-03 1.8000e+01 2.3100e+00 0.0000e+00 5.3800e-01 6.5750e+00
  6.5200e+01 4.0900e+00 1.0000e+00 2.9600e+02 1.5300e+01 3.9690e+02
  4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 6.4210e+00
  7.8900e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9690e+02
  9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 7.1850e+00
  6.1100e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9283e+02
  4.0300e+00]
 [3.2370e-02 0.0000e+00 2.1800e+00 0.0000e+00 4.5800e-01 6.9980e+00
  4.5800e+01 6.0622e+00 3.0000e+00 2.2200e+02 1.8700e+01 3.9463e+02
  2.9400e+00]
 [6.9050e-02 0.0000e+00 2.1800e+00 0.0000e+00 4.5800e-01 7.1470e+00
  5.4200e+01 6.0622e+00 3.0000e+00 2.2200e+02 1.8700e+01 3.9690e+02
  5.3300e+00]]
Labels:
[24.  21.6 34.7 33.4 36.2]
```
```python
# Choose a linear regression model
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()
```

```python
from sklearn.model_selection import train_test_split
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=7)
```

```python
# Train the model
lr_model.fit(X_train, y_train)
```

```
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
```

```python
# Return the model's parameters
lr_model.get_params()
```

```
{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': False}
```

```python
lr_model.score(X_train, y_train)
```

```
0.7598132492351114
```

```python
lr_model.score(X_test, y_test)
```

```
0.6693852753319398
```
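The value returned by a regressor's score method is the R² coefficient of determination; mean squared error is another common regression metric. A sketch (using the bundled diabetes dataset here, since load_boston was removed from recent scikit-learn releases):

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# The diabetes regression dataset ships with sklearn
data = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=1/3., random_state=7)

lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)

print(r2_score(y_test, y_pred))  # same value as lr.score(X_test, y_test)
print(mean_squared_error(y_test, y_pred))
```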

Cross-validation

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
%matplotlib inline
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=10)
k_range = range(1, 31)
cv_scores = []
for n in k_range:
    knn = KNeighborsClassifier(n)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')  # for classification
    # scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='neg_mean_squared_error')  # for regression
    cv_scores.append(scores.mean())
plt.plot(k_range, cv_scores)
plt.xlabel('K')
plt.ylabel('Accuracy')
plt.show()
```

Using the scikit-learn Library - Figure 4

```python
# Choose the best K
best_knn = KNeighborsClassifier(n_neighbors=5)
best_knn.fit(X_train, y_train)
print(best_knn.score(X_test, y_test))
print(best_knn.predict(X_test))
```

```
0.96
[1 2 0 1 0 1 2 1 0 1 1 2 1 0 0 2 1 0 0 0 2 2 2 0 1 0 1 1 1 2 1 1 2 2 2 0 2
 2 2 2 0 0 1 0 1 0 1 2 2 2]
```

References