2.6 Select and Train a Model
At this point you have framed the problem, obtained the data and explored it, sampled a training set and a test set, and written transformation pipelines that automatically clean up and prepare the data for Machine Learning algorithms. It is now time to select a Machine Learning model and train it.
2.6.1 Training and Evaluating on the Training Set
The good news is that, thanks to all the previous steps, things are now going to be much simpler than you might expect. First, let's train a Linear Regression model:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
Let's try it out on a few instances from the training set:
>>> some_data = housing.iloc[:5]
>>> some_labels = housing_labels.iloc[:5]
>>> some_data_prepared = full_pipeline.transform(some_data)
>>> print("Predictions:", lin_reg.predict(some_data_prepared))
Predictions: [210644.60459286 317768.80697211 210956.43331178  59218.98886849
 189747.55849879]
>>> print("Labels:", list(some_labels))
Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]
Let's measure this regression model's RMSE on the whole training set using Scikit-Learn's mean_squared_error() function:
>>> from sklearn.metrics import mean_squared_error
>>> housing_predictions = lin_reg.predict(housing_prepared)
>>> lin_mse = mean_squared_error(housing_labels, housing_predictions)
>>> lin_rmse = np.sqrt(lin_mse)
>>> lin_rmse
68628.19819848922
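As an aside, depending on your Scikit-Learn version you can skip the manual square root; a minimal sketch (squared=False is available from roughly version 0.22 on, and newer releases provide root_mean_squared_error instead):

from sklearn.metrics import mean_squared_error

# squared=False makes mean_squared_error return the RMSE directly
lin_rmse = mean_squared_error(housing_labels, housing_predictions,
                              squared=False)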
A typical prediction error of $68,628 is a clear case of the model underfitting the training data. Let's try a more complex model, a decision tree (DecisionTreeRegressor), and see how it does:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
Let's evaluate it on the training set:
>>> housing_predictions = tree_reg.predict(housing_prepared)
>>> tree_mse = mean_squared_error(housing_labels, housing_predictions)
>>> tree_rmse = np.sqrt(tree_mse)
>>> tree_rmse
0.0
Wait, no error at all? It is far more likely that the model has badly overfit the data than that it is truly perfect, and we cannot tell from the training set alone.
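One way to expose this kind of overfitting without touching the test set is to hold out part of the training data as a validation set. A minimal sketch, assuming the variables defined above (the split ratio and seed are illustrative):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Hold out 20% of the (already prepared) training data for validation
X_train, X_val, y_train, y_val = train_test_split(
    housing_prepared, housing_labels, test_size=0.2, random_state=42)

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train, y_train)

# Unlike the training RMSE of 0.0, the validation RMSE will be nonzero
val_rmse = np.sqrt(mean_squared_error(y_val, tree_reg.predict(X_val)))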
2.6.2 Better Evaluation Using Cross-Validation

A more thorough option is Scikit-Learn's K-fold cross-validation: the following code splits the training set into 10 distinct folds, then trains and evaluates the decision tree 10 times, each time picking a different fold for evaluation and training on the other 9. The result is an array of 10 evaluation scores:
from sklearn.model_selection import cross_val_score

# Scikit-Learn's cross-validation features expect a utility function
# (greater is better), so the scoring is the negative MSE; hence the
# minus sign before taking the square root.
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
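For intuition, cross_val_score is roughly equivalent to the following manual loop. This is only a sketch (cross_val_score handles more cases, and its default splitter for regressors does not shuffle):

import numpy as np
from sklearn.base import clone
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

kfold = KFold(n_splits=10)
rmse_scores = []
for train_idx, val_idx in kfold.split(housing_prepared):
    model = clone(tree_reg)  # fresh, unfitted copy for each fold
    model.fit(housing_prepared[train_idx], housing_labels.iloc[train_idx])
    predictions = model.predict(housing_prepared[val_idx])
    rmse_scores.append(np.sqrt(mean_squared_error(
        housing_labels.iloc[val_idx], predictions)))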
Let's look at the results:
>>> def display_scores(scores):
...     print("Scores:", scores)
...     print("Mean:", scores.mean())
...     print("Standard deviation:", scores.std())
...
>>> display_scores(tree_rmse_scores)
Scores: [69017.96294267 68337.01327495 71147.44499398 68936.07473709
 70929.49079622 74332.00426349 70378.84961035 71515.00970514
 76092.14204638 69396.31461086]
Mean: 71008.23069811359
Standard deviation: 2357.134350821029
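If you want a fuller statistical summary of the fold scores, they can be wrapped in a pandas Series; a minimal sketch:

import pandas as pd

# count, mean, std, min, quartiles and max of the 10 fold scores
pd.Series(tree_rmse_scores).describe()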
Now the decision tree doesn't look nearly as good as it did earlier. In fact, it seems to perform worse than the linear regression model! To be sure, let's compute the same scores for the linear model:
>>> lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
...                              scoring="neg_mean_squared_error", cv=10)
>>> lin_rmse_scores = np.sqrt(-lin_scores)
>>> display_scores(lin_rmse_scores)
Scores: [66782.73843989 66960.118071   70347.95244419 74739.57052552
 68031.13388938 71193.84183426 64969.63056405 68281.61137997
 71552.91566558 67665.10082067]
Mean: 69052.46136345083
Standard deviation: 2731.6740017983443
That's right: the decision tree model is overfitting so badly that it ends up performing worse than the linear regression model.
Let's try one last model: the RandomForestRegressor. We will see how Random Forests work in Chapter 7.
>>> from sklearn.ensemble import RandomForestRegressor
>>> forest_reg = RandomForestRegressor()
>>> forest_reg.fit(housing_prepared, housing_labels)
>>> housing_predictions = forest_reg.predict(housing_prepared)
>>> forest_mse = mean_squared_error(housing_labels, housing_predictions)
>>> forest_rmse = np.sqrt(forest_mse)
>>> scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
...                          scoring="neg_mean_squared_error", cv=10)
>>> forest_rmse_scores = np.sqrt(-scores)
>>> print(forest_rmse)
22281.753258978104
>>> display_scores(forest_rmse_scores)
Scores: [52030.19447749 51014.69502971 51611.39781866 54314.93512734
 52397.04482278 56113.47708197 50850.18305536 50014.30217687
 55606.41546758 52413.36129019]
Mean: 52636.60063479388
Standard deviation: 1948.0652821459005
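Note that your numbers will differ from run to run, since both the Random Forest and the training procedure involve randomness. Pinning the seed makes the experiment reproducible; a minimal sketch (the hyperparameter values are illustrative):

from sklearn.ensemble import RandomForestRegressor

# n_estimators sets the number of trees; random_state pins the RNG so
# repeated runs produce identical scores (both values are illustrative)
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_reg.fit(housing_prepared, housing_labels)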
This is much better. Note, however, that the RMSE on the training set (about 22,282) is still far lower than on the validation sets (about 52,637 on average), which means the model is still overfitting the training set.
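Before moving on to fine-tuning, it can be worth saving every model you experiment with so that their scores can be compared later without retraining. A minimal sketch using joblib (the filename is illustrative):

import joblib

# Persist the trained model to disk...
joblib.dump(forest_reg, "forest_reg.pkl")
# ...and reload it later, e.g. in another session
forest_reg_loaded = joblib.load("forest_reg.pkl")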
