Purpose: to suppress overfitting and underfitting of the model as much as possible.

Validation Set

Why a validation set is needed:
1. The test data is not real-world data, so a model tuned against it may still overfit or underfit when predicting on real values.
2. The tuned model depends on the test set; whenever the test set changes, the model has to be retrained.
What the validation set actually does: it simulates the real environment (sketched below).
1. It takes over the role of the test set (hyperparameters are tuned against it).
2. The real test set then plays the role of real-world data.
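A minimal sketch of this three-way split, assuming scikit-learn's train_test_split (the split proportions and random_state here are illustrative, not from the original):

    # Minimal sketch of a train / validation / test split (proportions are illustrative)
    from sklearn import datasets
    from sklearn.model_selection import train_test_split

    digits = datasets.load_digits()
    X, y = digits.data, digits.target

    # Hold out the final test set first, then split the remainder into train / validation
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=666)

    # Hyperparameters are tuned by scoring on (X_val, y_val);
    # (X_test, y_test) is touched only once, for the final report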

Cross-Validation

Draw the validation folds at random: this keeps the validation data from containing a large share of extreme values, since the score is averaged over several different folds (see the sketch below).
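A small sketch of what random folds look like, assuming scikit-learn's KFold splitter (the toy array and n_splits are illustrative):

    # Sketch: shuffled k-fold splits, so no single validation fold is
    # dominated by a run of extreme values from the original ordering
    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(20).reshape(10, 2)  # 10 toy samples
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(X):
        print("train:", train_idx, "validate:", val_idx)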

Code

    # 1. Prepare the data
    import numpy as np
    from sklearn import datasets
    digits = datasets.load_digits()
    X = digits.data
    y = digits.target

    # 2. Split the data
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=666)

    # 3. Build the model, tuning directly against the test set
    from sklearn.neighbors import KNeighborsClassifier
    best_k, best_p, best_score = 0, 0, 0
    for k in range(2, 11):
        for p in range(1, 6):
            knn_clf = KNeighborsClassifier(weights="distance", n_neighbors=k, p=p)
            knn_clf.fit(X_train, y_train)
            score = knn_clf.score(X_test, y_test)
            if score > best_score:
                best_k, best_p, best_score = k, p, score
    print("Best K =", best_k)          # 3
    print("Best P =", best_p)          # 4
    print("Best Score =", best_score)  # 0.986091794159

    # 4. Use cross-validation
    from sklearn.model_selection import cross_val_score
    knn_clf = KNeighborsClassifier()
    cross_val_score(knn_clf, X_train, y_train)  # array([ 0.98895028, 0.97777778, 0.96629213])
    best_k, best_p, best_score = 0, 0, 0
    for k in range(2, 11):
        for p in range(1, 6):
            knn_clf = KNeighborsClassifier(weights="distance", n_neighbors=k, p=p)
            scores = cross_val_score(knn_clf, X_train, y_train)  # cross-validation: split off validation folds and fit
            score = np.mean(scores)
            if score > best_score:
                best_k, best_p, best_score = k, p, score
    print("Best K =", best_k)          # 2
    print("Best P =", best_p)          # 2
    print("Best Score =", best_score)  # 0.982359987401
    best_knn_clf = KNeighborsClassifier(weights="distance", n_neighbors=2, p=2)
    best_knn_clf.fit(X_train, y_train)
    best_knn_clf.score(X_test, y_test)  # 0.98052851182197498

    # 5. Grid search
    # GridSearchCV has cross-validation built in
    from sklearn.model_selection import GridSearchCV
    param_grid = [
        {
            'weights': ['distance'],
            'n_neighbors': [i for i in range(2, 11)],
            'p': [i for i in range(1, 6)]
        }
    ]
    grid_search = GridSearchCV(knn_clf, param_grid, verbose=1)
    grid_search.fit(X_train, y_train)
    # Fitting 3 folds for each of 45 candidates, totalling 135 fits
    # the training set is split into 3 folds; 45 candidate models, 135 fits in total
    grid_search.best_score_   # 0.98237476808905377
    grid_search.best_params_  # {'n_neighbors': 2, 'p': 2, 'weights': 'distance'}
    best_knn_clf = grid_search.best_estimator_
    best_knn_clf.score(X_test, y_test)  # 0.98052851182197498

    # 6. The cv parameter: how many folds the training set is split into
    cross_val_score(knn_clf, X_train, y_train, cv=5)
    # array([ 0.99543379, 0.96803653, 0.98148148, 0.96261682, 0.97619048])
    grid_search = GridSearchCV(knn_clf, param_grid, verbose=1, cv=5)
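A note on the recorded outputs above: they come from an older scikit-learn release. Since scikit-learn 0.22, the default cv for both cross_val_score and GridSearchCV is 5 folds instead of 3, so a current run would report "Fitting 5 folds ..." and return five scores unless cv=3 is passed explicitly.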

k-fold Cross-Validation

  • Definition: the training set is split into k folds; this is called k-fold cross-validation.
  • Advantage: the evaluation of the model is more accurate.
  • Disadvantage: k models are trained on every pass, so overall training is roughly k times slower (see the sketch below).
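
A rough illustration of the k-fold cost (a sketch, not from the original; cv=5 is an arbitrary choice):

    # Sketch: k-fold CV fits k models per hyperparameter candidate,
    # which is where the ~k-fold slowdown comes from
    from sklearn import datasets
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    digits = datasets.load_digits()
    knn_clf = KNeighborsClassifier()
    scores = cross_val_score(knn_clf, digits.data, digits.target, cv=5)  # 5 separate fits
    print(scores.mean(), scores.std())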

LOO-CV (Leave-One-Out Cross-Validation)

  • Definition: the training set of m samples is split into m folds; m-1 folds are used for training and the remaining single sample for validation, rotating through all m.
  • Advantage: entirely free of the randomness of the split, so it comes closest to the model's true performance; it is commonly used in academic work and papers to back up a model's reported accuracy.
  • Disadvantage: the computational cost is enormous (see the sketch below).
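
A hedged sketch of LOO-CV with scikit-learn's LeaveOneOut splitter (subsampling to 100 rows is my own choice, purely to keep the m fits affordable):

    # Sketch: leave-one-out CV -- one validation sample per fold, so m fits for m samples
    from sklearn import datasets
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    digits = datasets.load_digits()
    X, y = digits.data[:100], digits.target[:100]  # subsample: LOO is expensive
    knn_clf = KNeighborsClassifier()
    scores = cross_val_score(knn_clf, X, y, cv=LeaveOneOut())  # 100 fits, one per sample
    print(scores.mean())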