1. The loss function of Ridge regression
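For reference, one common way to write the Ridge objective is the form minimized by scikit-learn's `Ridge`. The symbols $X$ (feature matrix), $y$ (targets), $w$ (coefficients), and $\alpha$ (regularization strength) are notation chosen here for illustration:

```latex
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2, \qquad \alpha \ge 0
```

With $\alpha = 0$ this reduces to ordinary least squares; larger values of $\alpha$ shrink the coefficients toward zero.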
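2. Obtaining and preprocessing the data
Dataset description: http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
Download location: http://archive.ics.uci.edu/ml/machine-learning-databases/00294/
The data come from a combined cycle power plant. There are 9568 samples, each with 5 columns: AT (ambient temperature), V (exhaust vacuum), AP (ambient pressure), RH (relative humidity), and PE (net electrical energy output). We do not need to dwell on the exact meaning of each column.
Our task is to find a linear relationship in which PE is the sample output and AT, V, AP, and RH are the four sample features. The goal is to obtain, by tuning the hyperparameter α, a linear regression model of the form:
$PE = \theta_0 + \theta_1 \cdot AT + \theta_2 \cdot V + \theta_3 \cdot AP + \theta_4 \cdot RH$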
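To make the linear form concrete, here is a minimal sketch that evaluates such a model for a single sample. The coefficient and intercept values are the ones printed further down in this walkthrough for alpha = 1; the feature row `x` is made up purely for illustration and is not taken from the dataset.

```python
import numpy as np

# Coefficients (for AT, V, AP, RH) and intercept as printed later in this
# walkthrough for the Ridge model fitted with alpha = 1.
theta = np.array([-1.96862642, -0.23930532, 0.05685793, -0.15860993])
theta_0 = 460.04983649

# Hypothetical feature row in the order AT, V, AP, RH (illustrative only).
x = np.array([20.0, 50.0, 1010.0, 70.0])

# A linear model's prediction is the intercept plus a weighted sum of the features.
pe_hat = theta_0 + theta @ x
print(pe_hat)
```

This is the same computation that `predict` performs on a fitted linear model, one row at a time.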
3. Reading the data and splitting it into training and test sets
Import the required libraries:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
Read the data with pandas:
data = pd.read_csv("./data/CCPP/CCPP.csv")
data.head()
We use the four columns AT, V, AP, and RH as the sample features and PE as the sample output:
X = data[["AT", "V", "AP", "RH"]]
y = data[["PE"]]
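Split the data into a training set and a test set:
# sklearn.cross_validation has been removed; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split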
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)
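The call above does not pass `test_size`, so scikit-learn falls back to its default of holding out 25% of the rows. The following sketch checks the resulting split sizes on synthetic data that merely has the same shape as the CCPP table (9568 rows, 4 features); the `*_demo` arrays are stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the CCPP data (9568 rows, 4 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(9568, 4))
y_demo = rng.normal(size=(9568, 1))

# No test_size is given, so the default of 0.25 applies.
# random_state fixes the shuffle so the split is reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

print(X_tr.shape, X_te.shape)  # (7176, 4) (2392, 4)
```

If a different proportion is wanted, `test_size` can be passed explicitly, for example `test_size=0.2`.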
4. Running Ridge regression with scikit-learn
We provisionally set alpha = 1; cross-validation is used later to choose a better value.
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1)
ridge.fit(X_train, y_train)
print(ridge.coef_)
print(ridge.intercept_)
# Output:
[[-1.96862642 -0.23930532 0.05685793 -0.15860993]]
[460.04983649]
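To see what the penalty does to coefficients like these, here is a minimal NumPy sketch of the closed-form Ridge solution with an unpenalized intercept. It runs on synthetic data, not the CCPP file; the helper name `ridge_closed_form` and the "true" weights are assumptions made up for illustration, and scikit-learn may use a different numerical solver internally.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Minimize ||y - Xw - b||^2 + alpha * ||w||^2; the intercept b is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean              # centering removes the intercept
    n_features = X.shape[1]
    # Normal equations with the L2 penalty added on the diagonal.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    b = y_mean - x_mean @ w                      # recover the intercept afterwards
    return w, b

# Synthetic data with made-up weights (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
true_w = np.array([-2.0, -0.25, 0.05, -0.15])
y_demo = X_demo @ true_w + 460.0 + rng.normal(scale=0.5, size=200)

# Larger alpha means a stronger penalty and a smaller coefficient norm.
for alpha in [0.0, 1.0, 100.0, 10000.0]:
    w, b = ridge_closed_form(X_demo, y_demo, alpha)
    print(alpha, np.round(w, 3), round(float(np.linalg.norm(w)), 4))
```

At alpha = 0 the result coincides with ordinary least squares, and the coefficient norm shrinks as alpha grows.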
5. Choosing the Ridge hyperparameter α with scikit-learn
from sklearn.linear_model import RidgeCV
ridgeCV = RidgeCV(alphas=[0.01, 0.1, 0.5, 1, 3, 5, 6, 6.5, 7, 10, 20, 100])
ridgeCV.fit(X_train, y_train)
ridgeCV.alpha_
# Output:
7
The selected value is 7, which means that among the candidate hyperparameters we supplied, α = 7 performs best.
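The selection idea can be sketched by hand: score each candidate alpha by cross-validated error and keep the one with the lowest error. Note the assumptions: when `cv` is not given, `RidgeCV` uses an efficient form of leave-one-out cross-validation, whereas this sketch swaps in plain 5-fold cross-validation for simplicity, and it runs on synthetic data, so the alpha it picks is not expected to match the value above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only, not the CCPP data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([-2.0, -0.25, 0.05, -0.15]) + rng.normal(scale=1.0, size=300)

# Same candidate grid as in the RidgeCV call above.
alphas = [0.01, 0.1, 0.5, 1, 3, 5, 6, 6.5, 7, 10, 20, 100]

mean_mse = {}
for alpha in alphas:
    # scikit-learn reports negated MSE so that "higher is better"; flip the sign back.
    scores = cross_val_score(Ridge(alpha=alpha), X_demo, y_demo,
                             cv=5, scoring="neg_mean_squared_error")
    mean_mse[alpha] = -scores.mean()

# The best alpha is the one with the lowest average validation MSE.
best_alpha = min(mean_mse, key=mean_mse.get)
print("best alpha on the synthetic data:", best_alpha)
```

Because the result is always one of the supplied candidates, it is worth refining the grid around the selected value if it sits at either end of the list.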
6. Model evaluation
y_pred = ridgeCV.predict(X_test)
from sklearn import metrics
# Compute the MSE with scikit-learn
print("MSE:", metrics.mean_squared_error(y_test, y_pred))
# Compute the RMSE with scikit-learn
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
# Output:
MSE: 20.83732091966483
RMSE: 4.5647914431729335
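For readers who want to see what these two metrics compute, here is a minimal NumPy sketch. The helper names `mse` and `rmse` and the four sample values are made up for illustration; they are not taken from the dataset.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE, in the units of y."""
    return float(np.sqrt(mse(y_true, y_pred)))

# Made-up targets and predictions; the residuals are -2, 3, 0, -4.
y_true_demo = [460.0, 470.0, 450.0, 440.0]
y_pred_demo = [462.0, 467.0, 450.0, 444.0]

print(mse(y_true_demo, y_pred_demo))   # 7.25
print(rmse(y_true_demo, y_pred_demo))
```

Because the RMSE is expressed in the same units as the target, it is usually the easier of the two to interpret.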
7. Cross-validation
from sklearn.model_selection import cross_val_predict
predicted = cross_val_predict(ridgeCV, X, y, cv=10)
# Compute the MSE with scikit-learn
print("MSE:", metrics.mean_squared_error(y, predicted))
# Compute the RMSE with scikit-learn
print("RMSE:", np.sqrt(metrics.mean_squared_error(y, predicted)))
# Output:
MSE: 20.793674972051814
RMSE: 4.560008220612307
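What `cross_val_predict` does can be sketched as an explicit loop: each sample is predicted by a model that was trained on the folds that do not contain it. The sketch below is a simplification under stated assumptions: it uses synthetic data and a plain `Ridge` with a fixed alpha = 7, whereas the code above passes `ridgeCV`, which re-selects alpha inside every fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic regression data (illustrative only, not the CCPP data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([-2.0, -0.25, 0.05, -0.15]) + rng.normal(scale=1.0, size=200)

# Manual version: fit on 9 folds, predict the held-out fold, repeat 10 times.
manual_pred = np.empty_like(y_demo)
for train_idx, test_idx in KFold(n_splits=10).split(X_demo):
    model = Ridge(alpha=7).fit(X_demo[train_idx], y_demo[train_idx])
    manual_pred[test_idx] = model.predict(X_demo[test_idx])

# Library version: for a regressor, cv=10 means an unshuffled 10-fold KFold.
library_pred = cross_val_predict(Ridge(alpha=7), X_demo, y_demo, cv=10)

print(np.allclose(manual_pred, library_pred))  # True
```

Since every prediction comes from a model that never saw that sample, the MSE over these predictions estimates out-of-sample error using the whole dataset.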
8. Plotting the results
fig, ax = plt.subplots()
ax.scatter(y, predicted)
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()