2.6 Select and Train a Model

At this point you have framed the problem, obtained the data and explored it, sampled a training set and a test set, and written a transformation pipeline that automatically cleans up and prepares the data for machine learning algorithms. You are now ready to select and train a machine learning model.

2.6.1 Training and Evaluating on the Training Set

The good news is that, thanks to all the previous steps, things are now going to be much simpler than you might think. First, let's train a linear regression model:

    from sklearn.linear_model import LinearRegression

    lin_reg = LinearRegression()
    lin_reg.fit(housing_prepared, housing_labels)

Let's try it out on a few instances from the training set:

    >>> some_data = housing.iloc[:5]
    >>> some_labels = housing_labels.iloc[:5]
    >>> some_data_prepared = full_pipeline.transform(some_data)
    >>> print("Predictions:", lin_reg.predict(some_data_prepared))
    Predictions: [210644.60459286 317768.80697211 210956.43331178  59218.98886849
     189747.55849879]
    >>> print("Labels:", list(some_labels))
    Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]
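
As a quick aside (this snippet is not part of the original listing), you can quantify how far off these first few predictions are in relative terms; from the numbers above, the first prediction is off by roughly 26%:

    # Aside: relative error of the first five predictions (illustrative only).
    import numpy as np

    some_predictions = lin_reg.predict(some_data_prepared)
    rel_errors = np.abs(some_predictions - some_labels.values) / some_labels.values
    print(rel_errors.round(3))  # e.g. the first prediction is off by about 26.5%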

It works, although the predictions are not exactly accurate. Let's measure this regression model's RMSE on the whole training set using Scikit-Learn's mean_squared_error() function:

    >>> from sklearn.metrics import mean_squared_error
    >>> housing_predictions = lin_reg.predict(housing_prepared)
    >>> lin_mse = mean_squared_error(housing_labels, housing_predictions)
    >>> lin_rmse = np.sqrt(lin_mse)
    >>> lin_rmse
    68628.19819848922
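
To judge whether an RMSE of $68,628 is acceptable, it helps to compare it against the scale of the labels. A quick check (an aside, not part of the original listing; output omitted):

    # Aside: most districts' median house values fall roughly between
    # $120,000 and $265,000, so a typical error of ~$68,628 is large.
    housing_labels.describe()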

A typical prediction error of $68,628 is not very satisfying: this is an example of a model underfitting the training data. Let's try a more complex model, a decision tree (DecisionTreeRegressor), and see how it does:

    from sklearn.tree import DecisionTreeRegressor

    tree_reg = DecisionTreeRegressor()
    tree_reg.fit(housing_prepared, housing_labels)

Now that the model is trained, let's evaluate it on the training set:

    >>> housing_predictions = tree_reg.predict(housing_prepared)
    >>> tree_mse = mean_squared_error(housing_labels, housing_predictions)
    >>> tree_rmse = np.sqrt(tree_mse)
    >>> tree_rmse
    0.0

No error at all? It is much more likely that the model has badly overfit the data.
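
One simple way to confirm this without touching the test set is to carve a validation set out of the training set and evaluate on that instead. A minimal sketch (not from the original listing; test_size and random_state are illustrative choices):

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_squared_error
    import numpy as np

    # Hold out 20% of the prepared training data as a validation set.
    X_train, X_val, y_train, y_val = train_test_split(
        housing_prepared, housing_labels, test_size=0.2, random_state=42)

    val_tree_reg = DecisionTreeRegressor(random_state=42)
    val_tree_reg.fit(X_train, y_train)
    val_rmse = np.sqrt(mean_squared_error(y_val, val_tree_reg.predict(X_val)))
    print(val_rmse)  # no longer 0.0: the held-out error exposes the overfitting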

2.6.2 Better Evaluation Using Cross-Validation

A great alternative is to use Scikit-Learn's K-fold cross-validation feature. The following code randomly splits the training set into 10 distinct subsets (folds), then trains and evaluates the decision tree model 10 times, each time evaluating on a different fold and training on the other 9; the result is an array of 10 evaluation scores:

    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
    tree_rmse_scores = np.sqrt(-scores)

Note that Scikit-Learn's cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is the negative of the MSE; this is why the code computes -scores before taking the square root.

Let's look at the results:

    >>> def display_scores(scores):
    ...     print("Scores:", scores)
    ...     print("Mean:", scores.mean())
    ...     print("Standard deviation:", scores.std())
    ...
    >>> display_scores(tree_rmse_scores)
    Scores: [69017.96294267 68337.01327495 71147.44499398 68936.07473709
     70929.49079622 74332.00426349 70378.84961035 71515.00970514
     76092.14204638 69396.31461086]
    Mean: 71008.23069811359
    Standard deviation: 2357.134350821029
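
As an aside (not in the original listing), if you want a fuller statistical summary of these 10 scores, wrapping them in a pandas Series is convenient:

    import pandas as pd

    # describe() reports count, mean, std, min, quartiles and max in one call.
    print(pd.Series(tree_rmse_scores).describe())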

Now the decision tree doesn't look as good as it did earlier. In fact, it seems to perform worse than the linear regression model! Just to be sure, let's compute the same scores for the linear regression model:

    >>> lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
    ...                              scoring="neg_mean_squared_error", cv=10)
    >>> lin_rmse_scores = np.sqrt(-lin_scores)
    >>> display_scores(lin_rmse_scores)
    Scores: [66782.73843989 66960.118071   70347.95244419 74739.57052552
     68031.13388938 71193.84183426 64969.63056405 68281.61137997
     71552.91566558 67665.10082067]
    Mean: 69052.46136345083
    Standard deviation: 2731.6740017983443

That's right: the decision tree model is overfitting so badly that it performs worse than the linear regression model.
Let's try one last model: the RandomForestRegressor. We will see how random forests work in Chapter 7.

    >>> from sklearn.ensemble import RandomForestRegressor
    >>> forest_reg = RandomForestRegressor()
    >>> forest_reg.fit(housing_prepared, housing_labels)
    >>> housing_predictions = forest_reg.predict(housing_prepared)
    >>> forest_mse = mean_squared_error(housing_labels, housing_predictions)
    >>> forest_rmse = np.sqrt(forest_mse)
    >>> scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
    ...                          scoring="neg_mean_squared_error", cv=10)
    >>> forest_rmse_scores = np.sqrt(-scores)
    >>> print(forest_rmse)
    22281.753258978104
    >>> display_scores(forest_rmse_scores)
    Scores: [52030.19447749 51014.69502971 51611.39781866 54314.93512734
     52397.04482278 56113.47708197 50850.18305536 50014.30217687
     55606.41546758 52413.36129019]
    Mean: 52636.60063479388
    Standard deviation: 1948.0652821459005

This is much better. Note, however, that the score on the training set is still much lower than on the validation sets, which means the model is still overfitting the training set.
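
Typical remedies at this point are to simplify the model, constrain (regularize) it, or get more training data. As a hedged illustration (the hyperparameter values below are arbitrary, not tuned), you could constrain the forest's trees and re-run cross-validation:

    # Illustrative only: max_depth and min_samples_leaf regularize the trees.
    forest_reg = RandomForestRegressor(max_depth=16, min_samples_leaf=4,
                                       random_state=42)
    scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
    print(np.sqrt(-scores).mean())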