• For more basic statistics, see the [Statistics] column. Additions and corrections are welcome.

    # Simple Linear Regression
    # Imports
    import numpy as np
    from sklearn import datasets, linear_model
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split
    import matplotlib
    import matplotlib.pyplot as plt
    import pandas as pd

## 1. Dataset: train_test_split

    # Load the diabetes dataset
    diabetes = datasets.load_diabetes()
    # Use only column 2 (BMI); np.newaxis keeps X two-dimensional, shape (n_samples, 1)
    X = diabetes.data[:, np.newaxis, 2]
    y = diabetes.target
    #X=pd.DataFrame(X)
    # test_size: fraction of samples held out for testing;
    # an integer random_state makes the split identical on every run
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    print('\nShape of X_train :', X_train.shape)  # 2-D
    print('Shape of y_train :', y_train.shape)  # 1-D

Output:

    Shape of X_train : (353, 1)
    Shape of y_train : (353,)
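The indexing `diabetes.data[:, np.newaxis, 2]` can look odd at first. The following is a minimal sketch on a small stand-in array (`data` is made up for illustration and is not the diabetes data). It suggests that the expression is equivalent to the more common `data[:, [2]]` or `reshape(-1, 1)`, while plain `data[:, 2]` yields a 1-D vector that would need reshaping before being used as a feature matrix:

```python
import numpy as np

# Hypothetical stand-in for diabetes.data: 4 samples, 3 features
data = np.arange(12, dtype=float).reshape(4, 3)

col_1d = data[:, 2]                      # 1-D vector of column 2
col_newaxis = data[:, np.newaxis, 2]     # 2-D: one feature column
col_list = data[:, [2]]                  # 2-D: same values via list indexing
col_reshape = data[:, 2].reshape(-1, 1)  # 2-D: same values via reshape

print(col_1d.shape, col_newaxis.shape, col_list.shape, col_reshape.shape)
```

Which of the three 2-D forms to use is a matter of taste; `data[:, [2]]` is probably the easiest to read.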

train_test_split parameters

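  • `X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(train_data, train_target, test_size=0.4, random_state=0, stratify=train_target)`

| Parameter | Description |
| --- | --- |
| train_data | Features (independent variables) to split |
| train_target | Target (dependent variable) to split |
| test_size | Fraction of the samples assigned to the test set |
| random_state | Random seed. Any fixed integer (0 included) gives the same split on every run; the default `None` gives a different split each run |
| stratify | Array of class labels, usually the target. The split keeps the class proportions of the original data: with two classes A and B, the A:B ratio is the same in both parts as in the full data |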
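To make `random_state` and `stratify` concrete, here is a small sketch on made-up classification data (the arrays `X` and `y` are hypothetical, chosen so that class 0 and class 1 appear in a 3:1 ratio). It is an illustration only and has nothing to do with the diabetes data used in this post:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 15 of class 0 and 5 of class 1
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)

# The same integer seed should give an identical split on every call
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.4, random_state=0)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.4, random_state=0)
same_split = np.array_equal(X_test_a, X_test_b)

# stratify=y keeps the 3:1 class ratio in both the train and the test part
_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

print(same_split)
print(np.bincount(y_train_s), np.bincount(y_test_s))
```

Because the class counts divide evenly here, the stratified split puts 9 and 3 samples in the training part and 6 and 2 in the test part. Stratification only makes sense for categorical labels, so it is not used for the continuous diabetes target.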

## 2. Build the model and predict: linear_model.LinearRegression

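  • Unlike R, scikit-learn does not print the whole fitted formula; the intercept and the coefficient have to be read out separately.

```python
# Create the model with default parameters
LR = linear_model.LinearRegression()

# Train the model
LR.fit(X_train, y_train)

# LR.predict(X_test) returns continuous predicted values (not class labels)
y_pred = LR.predict(X_test)

# Print the intercept
print('intercept:%.3f' % LR.intercept_)

# Print the model coefficient (coef_ is an array with one entry per feature)
print('coef:%.3f' % LR.coef_[0])

# Mean squared error, same as ((y_test - y_pred) ** 2).mean()
print('Mean squared error: %.3f' % mean_squared_error(y_test, y_pred))

# R^2, the coefficient of determination; higher is better. Same as
# 1 - ((y_test - y_pred) ** 2).sum() / ((y_test - y_test.mean()) ** 2).sum()
print('Variance score: %.3f' % r2_score(y_test, y_pred))

# For a regressor, score() returns R^2 (not classification accuracy)
print('score: %.3f' % LR.score(X_test, y_test))
```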

Output:

    intercept:152.003
    coef:998.578
    Mean squared error: 4061.826
    Variance score: 0.233
    score: 0.233
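Since scikit-learn does not print the fitted formula, one option is to assemble it by hand from `intercept_` and `coef_`. The sketch below repeats the setup used in this post, builds the equation string, and recomputes MSE and R^2 from their definitions to check that they agree with `mean_squared_error`, `r2_score` and `LR.score`. The names `equation`, `mse_manual` and `r2_manual` are introduced here for illustration only:

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Same setup as in the post: single BMI feature, 80/20 split, fixed seed
diabetes = datasets.load_diabetes()
X = diabetes.data[:, np.newaxis, 2]
y = diabetes.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
LR = linear_model.LinearRegression().fit(X_train, y_train)
y_pred = LR.predict(X_test)

# Assemble the fitted equation by hand from intercept_ and coef_
equation = 'y = %.3f + %.3f * x' % (LR.intercept_, LR.coef_[0])
print(equation)

# Recompute the metrics directly from their definitions
mse_manual = ((y_test - y_pred) ** 2).mean()
ss_res = ((y_test - y_pred) ** 2).sum()         # residual sum of squares
ss_tot = ((y_test - y_test.mean()) ** 2).sum()  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# Each comparison should print True
print(np.isclose(mse_manual, mean_squared_error(y_test, y_pred)))
print(np.isclose(r2_manual, r2_score(y_test, y_pred)))
print(np.isclose(r2_manual, LR.score(X_test, y_test)))
```

This also shows why the last two printed values in the output are identical: for a regressor, `score` is R^2.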

- Judge the model mainly by R^2: the higher it is, the better the fit.

<a name="Eg4L4"></a>
## 3. Visualization

```python
# Visualise the training set results
plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, LR.predict(X_train), color='blue')
plt.title('Training set')
plt.xlabel('BMI (scaled)')
plt.ylabel('Disease progression')
plt.show()

# Visualise the test set: green points are the test samples,
# the red line is the fitted line drawn from LR.predict(X_test)
plt.scatter(X_test, y_test, color='green')
plt.plot(X_test, LR.predict(X_test), color='red', linewidth=3)
plt.title('Test set')
plt.show()
```
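  • The straight line is the fitted regression line (blue on the training plot, red on the test plot). The plots also make it clear that the fit is poor.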
