- For more basic statistics material, see the 【统计学】 (Statistics) column. Additions and corrections are welcome.
# Simple LinearRegression

```python
# Imports
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
```
## 1. Dataset: train_test_split
```python
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use feature column 2 only; np.newaxis keeps X two-dimensional
X = diabetes.data[:, np.newaxis, 2]
y = diabetes.target
# X = pd.DataFrame(X)

# test_size: fraction of the data used for the test set
# random_state: an integer seed makes the split identical on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('\nShape of X_train :', X_train.shape)  # 2-D
print('Shape of y_train :', y_train.shape)    # 1-D

###################################
# Shape of X_train : (353, 1)
# Shape of y_train : (353,)
```
train_test_split parameters
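- `X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(train_data, train_target, test_size=0.4, random_state=0, stratify=train_target)`

| Parameter | Description |
| --- | --- |
| train_data | Independent variables (features) to split |
| train_target | Dependent variable (target) to split |
| test_size | Split ratio: the fraction of the data placed in the test set |
| random_state | Random seed. Any integer (including 0) gives the same split on every run; leaving it at the default `None` gives a different split each time |
| stratify | Array of class labels to stratify by (usually the target). The train and test sets keep the same class proportions as the original data, e.g. with two classes A and B, the A:B ratio is preserved in both parts |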
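The effect of `random_state` and `stratify` is easiest to see on a small example. The sketch below uses made-up toy data (20 samples, classes `A` and `B` in a 3:1 ratio); the variable names are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features, 15 'A' labels and 5 'B' labels
X = np.arange(40).reshape(20, 2)
y = np.array(['A'] * 15 + ['B'] * 5)

# The same integer seed produces the same split on every call
_, a_test, _, _ = train_test_split(X, y, test_size=0.4, random_state=0)
_, b_test, _, _ = train_test_split(X, y, test_size=0.4, random_state=0)
same_split = np.array_equal(a_test, b_test)

# stratify=y keeps the A:B ratio of the full data in both parts
_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)
train_share_B = np.mean(y_train_s == 'B')
test_share_B = np.mean(y_test_s == 'B')

print('same split with the same seed:', same_split)
print('share of B in train / test:', train_share_B, test_share_B)
```

Stratification matters mostly for classification targets. For a continuous target such as the diabetes data used here, a plain random split is the usual choice.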
## 2. Build the model and predict: linear_model.LinearRegression
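- Unlike R, scikit-learn does not print the whole fitted equation; the intercept and the coefficients have to be read separately.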
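If an R-style equation is wanted, it can be assembled from `intercept_` and `coef_`. The following is a minimal, self-contained sketch on the same diabetes split; `format_equation` is a hypothetical helper written for this note, not a scikit-learn function.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

# Same data and split as in this note: one feature (column 2) of the diabetes dataset
diabetes = datasets.load_diabetes()
X = diabetes.data[:, np.newaxis, 2]
y = diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = linear_model.LinearRegression().fit(X_train, y_train)

def format_equation(fitted, feature_names):
    """Hypothetical helper: build 'y = b0 + b1 * x1 ...' from a fitted linear model."""
    parts = [f"y = {fitted.intercept_:.3f}"]
    for coef, name in zip(fitted.coef_, feature_names):
        sign = '+' if coef >= 0 else '-'
        parts.append(f"{sign} {abs(coef):.3f} * {name}")
    return ' '.join(parts)

equation = format_equation(model, ['x'])
print(equation)

# The equation is exactly what predict() computes
manual_pred = model.intercept_ + model.coef_[0] * X_test[:, 0]
matches_predict = np.allclose(manual_pred, model.predict(X_test))
print('matches predict():', matches_predict)
```

For a full R-like summary table (standard errors, p-values), a statistics-oriented package is normally used instead; that is outside the scope of this note.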
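```python
# Create the model with default parameters
LR = linear_model.LinearRegression()

# Train the model
LR.fit(X_train, y_train)

# Predict with LR.predict(X_test); for regression this returns continuous values, not class labels

# Print the intercept
print('intercept:%.3f' % LR.intercept_)

# Print the model coefficient (coef_ is an array, one entry per feature)
print('coef:%.3f' % LR.coef_[0])

# Mean squared error; same as ((y_test - LR.predict(X_test))**2).mean()
print('Mean squared error: %.3f' % mean_squared_error(y_test, LR.predict(X_test)))

# R^2, the coefficient of determination; higher is better. Same as
# 1 - ((y_test - LR.predict(X_test))**2).sum() / ((y_test - y_test.mean())**2).sum()
print('Variance score: %.3f' % r2_score(y_test, LR.predict(X_test)))

# LR.score also returns R^2 for a regressor (it is not a classification accuracy)
print('score: %.3f' % LR.score(X_test, y_test))

###################################
# intercept:152.003
# coef:998.578
# Mean squared error: 4061.826
# Variance score: 0.233
# score: 0.233
```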
- R^2 is the main number to look at when judging the model; the higher it is, the better the fit.

<a name="Eg4L4"></a>
## 3. Visualization

```python
# Visualize the training set results
plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, LR.predict(X_train), color='blue')
plt.title('Training set')
# plt.xlabel('Year of Experience')
# plt.ylabel('Salary')
plt.show()

# Visualize the test set: green points are the test samples,
# the red line is the fitted line drawn from LR.predict(X_test)
plt.scatter(X_test, y_test, color='green')
plt.plot(X_test, LR.predict(X_test), color='red', linewidth=3)
plt.title('Test set')
plt.show()
```
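- The blue line in the training plot is the fitted line (drawn in red in the test plot). The plots also show that the fit is poor.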
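The poor fit can also be checked numerically. As a rough sketch, the MSE and R^2 from section 2 can be recomputed by hand and compared with the scikit-learn functions; the block refits the same model so that it runs on its own, and the variable names are illustrative.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Same data, split and model as in sections 1 and 2
diabetes = datasets.load_diabetes()
X = diabetes.data[:, np.newaxis, 2]
y = diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
LR = linear_model.LinearRegression().fit(X_train, y_train)
y_pred = LR.predict(X_test)

# MSE: mean of the squared residuals
mse_manual = ((y_test - y_pred) ** 2).mean()

# R^2: 1 - (residual sum of squares) / (total sum of squares)
ss_res = ((y_test - y_pred) ** 2).sum()
ss_tot = ((y_test - y_test.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot

print('MSE  manual vs sklearn: %.3f vs %.3f' % (mse_manual, mean_squared_error(y_test, y_pred)))
print('R^2  manual vs sklearn: %.3f vs %.3f' % (r2_manual, r2_score(y_test, y_pred)))
print('R^2  from LR.score:     %.3f' % LR.score(X_test, y_test))
```

An R^2 well below 1 means that a single feature explains only a small part of the variance in the target, which is consistent with the scatter seen in the plots.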


