2.6 Select and Train a Model

At this point you have framed the problem, obtained the data and explored it, sampled a training set and a test set, and written a transformation pipeline that automatically cleans up and prepares the data for machine learning algorithms. You are now ready to select and train a machine learning model.

2.6.1 Training and Evaluating on the Training Set

The good news is that, thanks to all the previous steps, things are now going to be much simpler than you might think. First, let's train a linear regression model:

    from sklearn.linear_model import LinearRegression

    lin_reg = LinearRegression()
    lin_reg.fit(housing_prepared, housing_labels)

Let's try it out on a few instances from the training set:

    >>> some_data = housing.iloc[:5]
    >>> some_labels = housing_labels.iloc[:5]
    >>> some_data_prepared = full_pipeline.transform(some_data)
    >>> print("Predictions:", lin_reg.predict(some_data_prepared))
    Predictions: [210644.60459286 317768.80697211 210956.43331178  59218.98886849
     189747.55849879]
    >>> print("Labels:", list(some_labels))
    Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]
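
As a quick aside (this snippet is not part of the original listing), you can quantify how far off these first few predictions are in relative terms; from the numbers above, the first prediction is off by roughly 26%:

    # Aside: relative error of the first five predictions (illustrative only).
    import numpy as np

    some_predictions = lin_reg.predict(some_data_prepared)
    rel_errors = np.abs(some_predictions - some_labels.values) / some_labels.values
    print(rel_errors.round(3))  # e.g. the first prediction is off by about 26.5%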

It works, although the predictions are not exactly accurate. Let's measure this regression model's RMSE on the whole training set using Scikit-Learn's mean_squared_error() function:

    >>> from sklearn.metrics import mean_squared_error
    >>> housing_predictions = lin_reg.predict(housing_prepared)
    >>> lin_mse = mean_squared_error(housing_labels, housing_predictions)
    >>> lin_rmse = np.sqrt(lin_mse)
    >>> lin_rmse
    68628.19819848922
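
To judge whether an RMSE of $68,628 is acceptable, it helps to compare it against the scale of the labels. A quick check (an aside, not part of the original listing; output omitted):

    # Aside: most districts' median house values fall roughly between
    # $120,000 and $265,000, so a typical error of ~$68,628 is large.
    housing_labels.describe()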

A typical prediction error of $68,628 is not very satisfying: this is an example of a model underfitting the training data. Let's try a more complex model, a decision tree (DecisionTreeRegressor), and see how it does:

    from sklearn.tree import DecisionTreeRegressor

    tree_reg = DecisionTreeRegressor()
    tree_reg.fit(housing_prepared, housing_labels)

Now that the model is trained, let's evaluate it on the training set:

    >>> housing_predictions = tree_reg.predict(housing_prepared)
    >>> tree_mse = mean_squared_error(housing_labels, housing_predictions)
    >>> tree_rmse = np.sqrt(tree_mse)
    >>> tree_rmse
    0.0

No error at all? It is much more likely that the model has badly overfit the data.
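
One simple way to confirm this without touching the test set is to carve a validation set out of the training set and evaluate on that instead. A minimal sketch (not from the original listing; test_size and random_state are illustrative choices):

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_squared_error
    import numpy as np

    # Hold out 20% of the prepared training data as a validation set.
    X_train, X_val, y_train, y_val = train_test_split(
        housing_prepared, housing_labels, test_size=0.2, random_state=42)

    val_tree_reg = DecisionTreeRegressor(random_state=42)
    val_tree_reg.fit(X_train, y_train)
    val_rmse = np.sqrt(mean_squared_error(y_val, val_tree_reg.predict(X_val)))
    print(val_rmse)  # no longer 0.0: the held-out error exposes the overfitting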

2.6.2 Better Evaluation Using Cross-Validation

A great alternative is to use Scikit-Learn's K-fold cross-validation feature. The following code randomly splits the training set into 10 distinct subsets (folds), then trains and evaluates the decision tree model 10 times, each time evaluating on a different fold and training on the other 9; the result is an array of 10 evaluation scores:

    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
    tree_rmse_scores = np.sqrt(-scores)

Note that Scikit-Learn's cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is the negative of the MSE; this is why the code computes -scores before taking the square root.

Let's look at the results:

    >>> def display_scores(scores):
    ...     print("Scores:", scores)
    ...     print("Mean:", scores.mean())
    ...     print("Standard deviation:", scores.std())
    ...
    >>> display_scores(tree_rmse_scores)
    Scores: [69017.96294267 68337.01327495 71147.44499398 68936.07473709
     70929.49079622 74332.00426349 70378.84961035 71515.00970514
     76092.14204638 69396.31461086]
    Mean: 71008.23069811359
    Standard deviation: 2357.134350821029
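
As an aside (not in the original listing), if you want a fuller statistical summary of these 10 scores, wrapping them in a pandas Series is convenient:

    import pandas as pd

    # describe() reports count, mean, std, min, quartiles and max in one call.
    print(pd.Series(tree_rmse_scores).describe())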

Now the decision tree doesn't look as good as it did earlier. In fact, it seems to perform worse than the linear regression model! Just to be sure, let's compute the same scores for the linear regression model:

    >>> lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
    ...                              scoring="neg_mean_squared_error", cv=10)
    >>> lin_rmse_scores = np.sqrt(-lin_scores)
    >>> display_scores(lin_rmse_scores)
    Scores: [66782.73843989 66960.118071   70347.95244419 74739.57052552
     68031.13388938 71193.84183426 64969.63056405 68281.61137997
     71552.91566558 67665.10082067]
    Mean: 69052.46136345083
    Standard deviation: 2731.6740017983443

That's right: the decision tree model is overfitting so badly that it performs worse than the linear regression model.
Let's try one last model: the RandomForestRegressor. We will see how random forests work in Chapter 7.

    >>> from sklearn.ensemble import RandomForestRegressor
    >>> forest_reg = RandomForestRegressor()
    >>> forest_reg.fit(housing_prepared, housing_labels)
    >>> housing_predictions = forest_reg.predict(housing_prepared)
    >>> forest_mse = mean_squared_error(housing_labels, housing_predictions)
    >>> forest_rmse = np.sqrt(forest_mse)
    >>> scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
    ...                          scoring="neg_mean_squared_error", cv=10)
    >>> forest_rmse_scores = np.sqrt(-scores)
    >>> print(forest_rmse)
    22281.753258978104
    >>> display_scores(forest_rmse_scores)
    Scores: [52030.19447749 51014.69502971 51611.39781866 54314.93512734
     52397.04482278 56113.47708197 50850.18305536 50014.30217687
     55606.41546758 52413.36129019]
    Mean: 52636.60063479388
    Standard deviation: 1948.0652821459005

This is much better. Note, however, that the score on the training set is still much lower than on the validation sets, which means the model is still overfitting the training set.
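
Typical remedies at this point are to simplify the model, constrain (regularize) it, or get more training data. As a hedged illustration (the hyperparameter values below are arbitrary, not tuned), you could constrain the forest's trees and re-run cross-validation:

    # Illustrative only: max_depth and min_samples_leaf regularize the trees.
    forest_reg = RandomForestRegressor(max_depth=16, min_samples_leaf=4,
                                       random_state=42)
    scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
    print(np.sqrt(-scores).mean())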