1. The Ridge regression loss function

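Loss function, in the form minimized by scikit-learn's Ridge:
$$J(\theta) = \lVert X\theta - y \rVert_2^2 + \alpha \lVert \theta \rVert_2^2$$
where $\alpha \ge 0$ is the regularization strength and $\lVert \theta \rVert_2$ is the L2 norm of $\theta$.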
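
As a rough illustration of what this loss does, the sketch below minimizes it in closed form with NumPy, ignoring the intercept for simplicity. The data is synthetic, and the helper names `ridge_loss` and `ridge_closed_form` are made up for this example; it is not how scikit-learn solves the problem internally.

```python
import numpy as np

def ridge_loss(X, y, theta, alpha):
    """Ridge objective: squared residual norm plus alpha times the squared L2 norm of theta."""
    residual = X @ theta - y
    return residual @ residual + alpha * (theta @ theta)

def ridge_closed_form(X, y, alpha):
    """Minimizer of ridge_loss without an intercept: solve (X^T X + alpha I) theta = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data, made up purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

theta_ols = ridge_closed_form(X, y, alpha=0.0)     # alpha = 0 is ordinary least squares
theta_ridge = ridge_closed_form(X, y, alpha=10.0)  # a larger alpha shrinks the coefficients

print("alpha=0 :", theta_ols)
print("alpha=10:", theta_ridge)
print("norms   :", np.linalg.norm(theta_ols), np.linalg.norm(theta_ridge))
```

The printed norms show the effect of the penalty: the larger α is, the smaller the L2 norm of the fitted coefficients.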

2. Data acquisition and preprocessing

Dataset description: http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
Download location: http://archive.ics.uci.edu/ml/machine-learning-databases/00294/
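The data comes from a combined cycle power plant. There are 9568 samples with 5 columns each: AT (ambient temperature), V (exhaust vacuum), AP (ambient pressure), RH (relative humidity), and PE (electrical energy output). We do not need to dwell on the exact meaning of each column.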
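Our task is to find a linear relationship in which PE is the sample output and AT, V, AP, and RH are the four sample features. The goal is to fit a linear regression model, tuning the hyperparameter α along the way, of the form:
$$PE = \theta_0 + \theta_1 \cdot AT + \theta_2 \cdot V + \theta_3 \cdot AP + \theta_4 \cdot RH$$
The quantities to be learned are the five parameters $\theta_0, \dots, \theta_4$.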

3. Reading the data and splitting it into training and test sets

Import the libraries

  import pandas as pd
  import matplotlib.pyplot as plt
  import numpy as np
  from sklearn import datasets, linear_model

Read the data with pandas

  data = pd.read_csv("./data/CCPP/CCPP.csv")
  data.head()

[Figure: output of data.head(), the first five rows of the dataset]
We use the four columns AT, V, AP, and RH as the sample features and PE as the sample output.

  X = data[["AT", "V", "AP", "RH"]]
  y = data[["PE"]]

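Split the data into a training set and a test set

  # sklearn.cross_validation was removed in newer scikit-learn releases; use model_selection
  from sklearn.model_selection import train_test_split
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)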
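
Because no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows. The sketch below checks the resulting sizes on a stand-in table with the same shape as the CCPP data. The values are random placeholders, not the real measurements.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as the CCPP table (9568 rows, 5 columns);
# the values are random and only serve to demonstrate the split sizes
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(9568, 5)),
                    columns=["AT", "V", "AP", "RH", "PE"])
X = data[["AT", "V", "AP", "RH"]]
y = data[["PE"]]

# Default split: 75% training rows, 25% test rows
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

With 9568 rows this gives 7176 training samples and 2392 test samples.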

4. Running Ridge regression with scikit-learn

For now we set α = 1; later we use cross-validation to choose the best value.

  from sklearn.linear_model import Ridge
  ridge = Ridge(alpha=1)
  ridge.fit(X_train, y_train)
  print(ridge.coef_)       # one coefficient per feature: AT, V, AP, RH
  print(ridge.intercept_)
  # Output:
  # [[-1.96862642 -0.23930532 0.05685793 -0.15860993]]
  # [460.04983649]

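In other words, the fitted model is:
$$PE = 460.04983649 - 1.96862642 \cdot AT - 0.23930532 \cdot V + 0.05685793 \cdot AP - 0.15860993 \cdot RH$$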
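
To make the fitted model concrete, the following sketch evaluates it by hand for a single sample. The coefficients and intercept are copied from the printed output above. The input values and the helper name `predict_pe` are hypothetical, chosen only for illustration.

```python
import numpy as np

# Coefficients and intercept copied from the output printed above
coef = np.array([-1.96862642, -0.23930532, 0.05685793, -0.15860993])  # AT, V, AP, RH
intercept = 460.04983649

def predict_pe(at, v, ap, rh):
    """Evaluate the fitted linear model for one sample."""
    return intercept + coef @ np.array([at, v, ap, rh])

# Hypothetical input values, not taken from the dataset
sample_pe = predict_pe(at=20.0, v=50.0, ap=1010.0, rh=70.0)
print(round(float(sample_pe), 2))
```

This is the same computation that `ridge.predict` performs for each row of its input.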

5. Choosing the Ridge hyperparameter α with scikit-learn

  from sklearn.linear_model import RidgeCV
  # Candidate values of alpha; RidgeCV picks the best one by cross-validation
  ridgeCV = RidgeCV(alphas=[0.01, 0.1, 0.5, 1, 3, 5, 6, 6.5, 7, 10, 20, 100])
  ridgeCV.fit(X_train, y_train)
  ridgeCV.alpha_
  # Output:
  # 7.0

The output is 7.0, which means that among the candidate values we supplied, 7 is the best α.
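
With its default settings, `RidgeCV` selects α using an efficient form of leave-one-out cross-validation. To show the idea more explicitly, the sketch below runs the same kind of search by hand: it scores each candidate α with 5-fold cross-validation and keeps the one with the lowest mean MSE. It uses synthetic data and a different validation scheme, so the value it picks need not match the one above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data, made up for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(scale=0.5, size=200)

alphas = [0.01, 0.1, 0.5, 1, 3, 5, 6, 6.5, 7, 10, 20, 100]

# Mean 5-fold cross-validated MSE for each candidate alpha
cv_mse = {}
for alpha in alphas:
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5,
                             scoring="neg_mean_squared_error")
    cv_mse[alpha] = -scores.mean()  # scores are negated MSE values

best_alpha = min(cv_mse, key=cv_mse.get)
print("best alpha:", best_alpha)
```

The search only considers the values in the list, so the grid may be worth refining around the winner if it lands on an endpoint.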

6. Model evaluation

  from sklearn import metrics
  y_pred = ridgeCV.predict(X_test)
  # Compute the MSE with scikit-learn
  print("MSE:", metrics.mean_squared_error(y_test, y_pred))
  # Compute the RMSE with scikit-learn
  print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
  # Output:
  # MSE: 20.83732091966483
  # RMSE: 4.5647914431729335
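
For reference, both metrics are simple to compute by hand. The sketch below does so with NumPy on three made-up values. The helper names `mse` and `rmse` are introduced here for illustration only.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return np.sqrt(mse(y_true, y_pred))

# Made-up values: residuals are 1, 0 and -2, so the MSE is (1 + 0 + 4) / 3
y_true = [3.0, 5.0, 2.0]
y_hat = [2.0, 5.0, 4.0]
print(mse(y_true, y_hat))
print(rmse(y_true, y_hat))
```

The RMSE is expressed in the same units as the target, which makes it easier to interpret than the MSE.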

7. Cross-validation

  from sklearn.model_selection import cross_val_predict
  # Each sample is predicted by a model trained on the other 9 folds
  predicted = cross_val_predict(ridgeCV, X, y, cv=10)
  # Compute the MSE with scikit-learn
  print("MSE:", metrics.mean_squared_error(y, predicted))
  # Compute the RMSE with scikit-learn
  print("RMSE:", np.sqrt(metrics.mean_squared_error(y, predicted)))
  # Output:
  # MSE: 20.793674972051814
  # RMSE: 4.560008220612307
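
To clarify what `cross_val_predict` returns, the sketch below reproduces it with an explicit `KFold` loop: every sample is predicted exactly once, by a model that never saw it during training. It uses synthetic data and a plain `Ridge` with a fixed α for simplicity, whereas the code above passes a `RidgeCV` that reselects α in each fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic regression data, made up for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.normal(scale=0.3, size=100)

# Manual 10-fold loop: fit on 9 folds, predict the held-out fold
manual = np.empty_like(y)
for train_idx, test_idx in KFold(n_splits=10).split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    manual[test_idx] = model.predict(X[test_idx])

# The same out-of-fold predictions via the library helper
predicted = cross_val_predict(Ridge(alpha=1.0), X, y, cv=10)
print(np.allclose(manual, predicted))
```

Because every prediction is out-of-fold, metrics computed from `predicted` estimate generalization error over the whole dataset rather than over a single held-out split.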

8. Plotting the results

  fig, ax = plt.subplots()
  ax.scatter(y, predicted)
  # Dashed diagonal: points on this line are predicted perfectly
  ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
  ax.set_xlabel('Measured')
  ax.set_ylabel('Predicted')
  plt.show()

[Figure: scatter plot of predicted against measured PE, with the dashed diagonal marking perfect prediction]