Multiple: there are several independent variables x1, x2, x3, …
Interpretation: one quantity is influenced by several factors at once; X = (x1, x2, x3, …) can be viewed as a vector, and y varies with X according to some pattern.
Formula: $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$

0. From simple linear regression to multiple linear regression

In simple linear regression the theoretical regression equation is $y = \theta_0 + \theta_1 x$. In multiple linear regression a single observation of the independent variables is no longer a scalar but a vector, $X^{(i)} = (X^{(i)}_1, X^{(i)}_2, \ldots, X^{(i)}_n)$, while the corresponding observation of the dependent variable is still the scalar $y^{(i)}$. Stacking these observations row by row yields a matrix, which makes matrix notation necessary:

$$X_b = \begin{pmatrix} 1 & X^{(1)}_1 & \cdots & X^{(1)}_n \\ 1 & X^{(2)}_1 & \cdots & X^{(2)}_n \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X^{(m)}_1 & \cdots & X^{(m)}_n \end{pmatrix}, \qquad \theta = (\theta_0, \theta_1, \ldots, \theta_n)^T$$

The multiple linear regression model can then be written as $\hat{y} = X_b \cdot \theta$, where $X_b$ is called the design matrix.
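To make the notation concrete, here is a minimal NumPy sketch (the toy numbers are mine, not from the original): prepending a column of ones turns the intercept into an ordinary coefficient, so all predictions come from a single matrix product.

```python
import numpy as np

# Two observations with three features each (toy data).
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# The design matrix X_b prepends a column of ones so that theta_0
# (the intercept) is handled uniformly with the other coefficients.
X_b = np.hstack([np.ones((len(X), 1)), X])

theta = np.array([0.5, 1.0, 2.0, 3.0])  # theta_0 .. theta_3

# y_hat = X_b . theta computes all m predictions at once.
y_hat = X_b.dot(theta)
print(X_b.shape)  # (2, 4)
print(y_hat)      # [14.5 32.5]
```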

1. Mathematical principle

The solution follows the same idea as simple linear regression.
In three-dimensional space the linear equation is $\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2$.
Generalizing to n dimensions: $\hat{y}^{(i)} = \theta_0 + \theta_1 X^{(i)}_1 + \theta_2 X^{(i)}_2 + \cdots + \theta_n X^{(i)}_n$.

  • Goal: make $\sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right)^2$ as small as possible.

That is, find $\theta = (\theta_0, \theta_1, \ldots, \theta_n)^T$ such that $\sum_{i=1}^{m} \left(y^{(i)} - X^{(i)}_b \theta\right)^2$ is minimized. In matrix form the objective is $(y - X_b\theta)^T (y - X_b\theta)$, and setting its gradient with respect to $\theta$ to zero yields the normal equation:

$$\theta = (X_b^T X_b)^{-1} X_b^T y$$

  • Problem: the time complexity is high, $O(n^3)$ (about $O(n^{2.4})$ with optimized algorithms), so for large problems gradient descent is needed instead.
  • Advantage: there is no scaling issue, so the data need not be normalized. (From $\theta = (X_b^T X_b)^{-1} X_b^T y$ one can see that the raw data enters only through matrix operations, so the units of the features do not matter.)
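The normal equation can be checked directly in NumPy. A minimal sketch on synthetic, noise-free data (the generating coefficients below are made up for illustration): the recovered $\theta$ should match them almost exactly.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 200
X = rng.uniform(0.0, 10.0, size=(m, 2))
true_theta = np.array([4.0, 3.0, -2.0])       # intercept, theta_1, theta_2
y = true_theta[0] + X.dot(true_theta[1:])     # noise-free for clarity

# Build the design matrix and apply the normal equation:
# theta = (X_b^T X_b)^{-1} X_b^T y
X_b = np.hstack([np.ones((m, 1)), X])
theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print(theta)  # ~ [ 4.  3. -2.]
```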

2. Wrapping the model

```python
# LinearRegression.py
import numpy as np
from .metrics import r2_score


class LinearRegression:

    def __init__(self):
        """Initialize the Linear Regression model"""
        self.coef_ = None
        self.intercept_ = None
        self._theta = None

    def fit_normal(self, X_train, y_train):
        """Fit the model on X_train, y_train via the normal equation"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)
        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def predict(self, X_predict):
        """Return the predictions for the dataset X_predict"""
        assert self.intercept_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"
        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        """Compute the R^2 score of the model on X_test, y_test"""
        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"
```
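One caveat worth noting (my addition, not from the original): `np.linalg.inv` fails when $X_b^T X_b$ is singular, e.g. when features are linearly dependent. `np.linalg.lstsq` solves the same least-squares problem and is a numerically safer drop-in:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 + X.dot(np.array([2.0, -1.0, 0.5]))   # known coefficients, no noise

X_b = np.hstack([np.ones((len(X), 1)), X])

# Equivalent to (X_b^T X_b)^{-1} X_b^T y but numerically stabler,
# and it still returns a solution when X_b^T X_b is singular.
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta)  # ~ [ 1.   2.  -1.   0.5]
```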

3. Implementation in code

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# Note: load_boston was removed in scikit-learn 1.2; this requires an older version.
boston = datasets.load_boston()
X = boston.data
y = boston.target
# Drop samples clipped at the top of the price range
X = X[y < 50.0]
y = y[y < 50.0]
X.shape  # (490, 13)

# Split the dataset
from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)

# Fit the model
from playML.LinearRegression import LinearRegression
reg = LinearRegression()
reg.fit_normal(X_train, y_train)
reg.coef_
# array([-1.18919477e-01,  3.63991462e-02, -3.56494193e-02,  5.66737830e-02,
#        -1.16195486e+01,  3.42022185e+00, -2.31470282e-02, -1.19509560e+00,
#         2.59339091e-01, -1.40112724e-02, -8.36521175e-01,  7.92283639e-03,
#        -3.81966137e-01])
reg.intercept_  # 34.16143549625432
reg.score(X_test, y_test)  # R^2 score: 0.812980260265854
```

4. scikit-learn implementation

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the data (load_boston was removed in scikit-learn 1.2)
boston = datasets.load_boston()
X = boston.data
y = boston.target
# Preprocessing: drop samples clipped at the top of the price range
X = X[y < 50.0]
y = y[y < 50.0]
X.shape  # (490, 13)

from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)

# Fit the model
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
lin_reg.coef_
# array([-1.18919477e-01,  3.63991462e-02, -3.56494193e-02,  5.66737830e-02,
#        -1.16195486e+01,  3.42022185e+00, -2.31470282e-02, -1.19509560e+00,
#         2.59339091e-01, -1.40112724e-02, -8.36521175e-01,  7.92283639e-03,
#        -3.81966137e-01])
lin_reg.intercept_  # 34.16143549624639
lin_reg.score(X_test, y_test)  # 0.8129802602658499
```
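Since `load_boston` is gone from scikit-learn 1.2+, the listings above require an older release. The same workflow can be reproduced on a current scikit-learn with another bundled regression dataset, e.g. `load_diabetes` (the scores will of course differ from the Boston numbers above):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Diabetes dataset: 442 samples, 10 features, continuous target
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
print(lin_reg.score(X_test, y_test))  # R^2 on the held-out test set
```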

5. Solving a regression problem with KNN

```python
from sklearn.preprocessing import StandardScaler

# Standardize the features (KNN is distance-based, so scaling matters)
standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)

# Fit a KNN regressor
from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train_standard, y_train)
knn_reg.score(X_test_standard, y_test)

# Grid search over the hyperparameters
from sklearn.model_selection import GridSearchCV
param_grid = [
    {
        "weights": ["uniform"],
        "n_neighbors": [i for i in range(1, 11)]
    },
    {
        "weights": ["distance"],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1, 6)]
    }
]
knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=1)  # n_jobs=-1 uses all CPU cores
grid_search.fit(X_train_standard, y_train)
# Fitting 3 folds for each of 60 candidates, totalling 180 fits
grid_search.best_params_  # best combination: {'n_neighbors': 5, 'p': 1, 'weights': 'distance'}
grid_search.best_score_   # best cross-validation score: 0.799179998909969
grid_search.best_estimator_.score(X_test_standard, y_test)  # 0.880996650994177
```
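A design note on the snippet above (my suggestion, not from the original): fitting the `StandardScaler` on the whole training set before cross-validation leaks scaling statistics across CV folds. Wrapping scaler and regressor in a `Pipeline` keeps scaling inside each fold; the sketch below uses `load_diabetes`, since `load_boston` is absent from recent scikit-learn:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

# Scaler + regressor as one estimator: the scaler is refit on each CV fold
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsRegressor())])

# Parameters of a pipeline step are addressed as "<step>__<param>"
param_grid = {"knn__n_neighbors": list(range(1, 11)),
              "knn__weights": ["uniform", "distance"]}

grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_estimator_.score(X_test, y_test))
```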