- As mentioned earlier, the main tasks in machine learning are classification and regression. Classification, as illustrated by Bayes' theorem in the previous section, outputs a category (one of a finite set of discrete values), while regression outputs a real value.
3.1 The Basic Idea of Linear Regression
- For an n-dimensional sample we want to find a linear function $f(\boldsymbol x)=\omega^T\boldsymbol x+b$. By augmenting the sample to n+1 dimensions, letting $\boldsymbol x\leftarrow(1;\boldsymbol x)$ and $\omega\leftarrow(b;\omega)$, the problem becomes $f(\boldsymbol x)=\omega^T\boldsymbol x$, which gives a more uniform expression. Our goal is to learn, from the training data $(x_i,y_i)$, a function $f$ such that $f(x_i)\approx y_i$; the object we solve for is this (n+1)-dimensional vector $\omega$. A quick numerical illustration of the augmentation is sketched below.
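The following minimal NumPy sketch (my own illustration, not part of the original notes) checks that absorbing the bias into an augmented weight vector leaves the prediction unchanged:

```python
import numpy as np

x = np.array([2.0, -1.0, 3.0])        # an n-dimensional sample (n = 3 here)
w = np.array([0.5, 1.0, -2.0])        # weights
b = 0.7                               # bias

x_aug = np.concatenate(([1.0], x))    # (n+1)-dimensional augmented sample
w_aug = np.concatenate(([b], w))      # (n+1)-dimensional augmented weights

# Both forms give the same prediction.
assert np.isclose(w @ x + b, w_aug @ x_aug)
```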
3.2 The Loss Function and Solution of Linear Regression
3.2.1 Derivation
- A commonly used loss function for linear regression is the mean squared error (MSE):
$$MSE=\frac{1}{n}\sum_{i=1}^{n}(y_i-f(x_i,\omega))^2$$
- The residual sum of squares (RSS) is also used; it takes the form:
$$J_n(\omega)=\sum_{i=1}^{n}(y_i-f(x_i,\omega))^2=(\boldsymbol y-X^T\omega)^T(\boldsymbol y-X^T\omega)$$
- Taking the gradient of the RSS expression and setting it to zero:
$$\nabla J_n(\omega)=-2X(\boldsymbol y-X^T\omega)=0$$
- Therefore the linear regression solution that minimizes the RSS is:
$$\omega=(XX^T)^{-1}X\boldsymbol y$$
- Note that:
- Each sample x is a (d+1)-dimensional vector after augmentation, so X is a (d+1)-by-n matrix and y is an n-dimensional vector.
- So when the number of samples n is smaller than the feature dimension d, $XX^T$ is not full rank and its inverse does not exist; in that case linear regression has multiple solutions. This is easy to understand:
- Since the vector we solve for is (d+1)-dimensional, there are d+1 unknowns, so at least d+1 samples are needed to pin down the d+1 parameters.
- When this happens, every one of the possible solutions minimizes the mean squared error.
- In that case we can introduce a regularization term to single out the solution we want. A small numerical check of this rank deficiency is sketched below.
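The following sketch (my own illustration, using the convention of the code below that samples are stored as columns of X) shows that with fewer samples than features, $XX^T$ is rank-deficient:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 3                          # more features than samples
X = rng.normal(size=(d, n))          # samples as columns
G = X @ X.T                          # d-by-d matrix from the normal equation

# rank(X X^T) = rank(X) <= min(d, n) = 3 < 5, so G is singular.
print(np.linalg.matrix_rank(G))
```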
import numpy as np

def linear_regression(X, y):
    """
    LINEAR_REGRESSION Linear Regression.
    INPUT:  X: training sample features, P-by-N matrix.
            y: training sample labels, 1-by-N row vector.
    OUTPUT: w: learned parameters, (P+1)-by-1 column vector.
    """
    P, N = X.shape
    # Samples are stored as columns of X; transpose it and prepend a column of
    # ones, giving an N-by-(P+1) design matrix.
    new_x = np.column_stack((np.ones((N, 1)), X.T))
    # Closed-form least-squares solution w = (X'^T X')^{-1} X'^T y^T.
    part1 = np.linalg.inv(np.matmul(new_x.T, new_x))
    part2 = np.matmul(new_x.T, y.T)
    w = np.matmul(part1, part2)
    # end answer
    return w
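A hedged usage sketch for the function above (my own example, not from the original notes): fit the 1-D line y = 2x + 1 from noisy samples, with the data laid out as in the docstring (X is P-by-N, y is 1-by-N).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1, 50))                # P = 1 feature, N = 50 samples
y = 2 * X + 1 + 0.01 * rng.normal(size=(1, 50))     # 1-by-N noisy labels

w = linear_regression(X, y)
print(w.ravel())    # roughly [1, 2]: w[0] is the bias, w[1] the slope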
3.3 The Statistical Model of Linear Regression
- Real-world data samples usually contain noise, e.g. $y=f(\boldsymbol x,\omega)+\epsilon$, where $\epsilon$ is a random noise term following the normal distribution $N(0,\sigma^2)$. In this case $\omega$ can be estimated by maximum likelihood. Define
$$P(y|\boldsymbol x,\omega,\sigma)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{1}{2\sigma^2}(y-f(\boldsymbol x,\omega))^2\right]$$
By maximum likelihood we have
$$L(D,\omega,\sigma)=\prod_{i=1}^{n}P(y_i|\boldsymbol x_i,\omega,\sigma)$$
and our objective becomes:
$$\omega=\arg\max L(D,\omega,\sigma)=\arg\max\prod_{i=1}^{n}P(y_i|\boldsymbol x_i,\omega,\sigma)$$
Taking the log-likelihood gives
$$l(D,\omega,\sigma)=-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-f(x_i,\omega))^2+c(\sigma)$$
At this point we are back to the RSS, so the expression for the solution is the same as the one derived above.
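To make the last step explicit: $c(\sigma)$ does not depend on $\omega$, so maximizing the log-likelihood over $\omega$ is the same as minimizing the RSS,

$$\arg\max_{\omega}\,l(D,\omega,\sigma)=\arg\min_{\omega}\sum_{i=1}^{n}(y_i-f(x_i,\omega))^2$$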
3.4 Ridge Regression
3.4.1 Derivation of Ridge Regression
- We find that plain linear regression overfits easily: some coefficients take on extreme values, as if they existed solely to fit the points in the training set, and the performance on the test set is then very poor.
- To control the size of the coefficients and obtain a more sensible solution, we can introduce regularization:
$$\omega^{*}=\arg\min\sum_{i=1}^{n}(y_i-\omega^Tx_i)^2+\lambda\sum_{j=1}^{d}\omega_j^2$$
This is effectively a Lagrange-multiplier form of a constrained least-squares problem; as before, it can be written in matrix form:
$$(\boldsymbol y-X^T\omega)^T(\boldsymbol y-X^T\omega)+\lambda\omega^T\omega$$
Taking the gradient and setting it to zero gives
$$\nabla J_n(\omega)=-2X(\boldsymbol y-X^T\omega)+2\lambda\omega=0$$
so the final solution is:
$$\omega^{*}=(XX^T+\lambda I)^{-1}X\boldsymbol y$$
- This is the ridge regression method. The parameter $\lambda$ is chosen by the user, and with $\lambda>0$ we can guarantee that the matrix $XX^T+\lambda I$ is full rank (a one-line argument is given after this list).
- Ridge regression is simply linear regression under L2 regularization.
- Note that in the implementation the matrix $I$ is not exactly the identity: the diagonal entry in row 0 (the one corresponding to the constant term) is 0 and the rest are 1, so the bias is not regularized.
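Why $\lambda>0$ guarantees invertibility (ignoring the unregularized bias entry for simplicity): for any nonzero vector $v$,

$$v^T(XX^T+\lambda I)v=\|X^Tv\|^2+\lambda\|v\|^2>0,$$

so $XX^T+\lambda I$ is positive definite and therefore invertible.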
3.4.2 Code Implementation of Ridge Regression
def ridge(X, y, lmbda):
    """
    RIDGE Ridge Regression.
    INPUT:  X: training sample features, P-by-N matrix.
            y: training sample labels, 1-by-N row vector.
            lmbda: regularization parameter.
    OUTPUT: w: learned parameters, (P+1)-by-1 column vector.
    NOTE: You can use pinv() if the matrix is singular.
    """
    P, N = X.shape
    # Build a (P+1)-dimensional identity matrix, but leave the bias term
    # unregularized: the diagonal entry in row 0 is set to 0.
    I = np.identity(P + 1)
    I[0, 0] = 0
    # Prepend a row of ones to X for the constant term.
    new_x = np.concatenate((np.ones((1, N)), X), axis=0)
    # Closed-form ridge solution w = (X X^T + lambda I)^{-1} X y^T.
    part1 = np.matmul(new_x, new_x.T) + lmbda * I
    part2 = np.matmul(new_x, y.T)
    w = np.matmul(np.linalg.pinv(part1), part2)
    # end answer
    return w
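A hedged usage sketch for ridge() (my own example): with fewer samples than features the plain least-squares problem is ill-posed, but the ridge solution is still well defined, and a larger lmbda shrinks the coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
P, N = 10, 5                                          # more features than samples
X = rng.normal(size=(P, N))
true_w = rng.normal(size=(P, 1))
y = true_w.T @ X + 0.01 * rng.normal(size=(1, N))     # 1-by-N labels

w_small = ridge(X, y, 0.1)
w_large = ridge(X, y, 100.0)
# The second norm is typically much smaller: larger lambda shrinks the weights.
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```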
3.5 Bayesian Linear Regression
- Now let us reconsider the noise present in real samples, i.e. $y=f(\boldsymbol x,\omega)+\epsilon$, where $\epsilon$ follows the normal distribution $N(0,\sigma^2)$. By Bayes' theorem:
$$P(\omega|y,x,\sigma)=\frac{P(y|\omega,x,\sigma)P(\omega|x,\sigma)}{P(y|x,\sigma)}$$
- That is, the posterior is proportional to the product of the likelihood and the prior, so $\ln(\text{posterior})=\ln(\text{likelihood})+\ln(\text{prior})+\text{const}$. In the linear regression problem we already know that the log-likelihood is
$$l(D,\omega,\sigma)=-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-f(x_i,\omega))^2+c(\sigma)$$
and we can choose the prior over $\omega$ as follows:
$$\begin{aligned}p(\omega)=N(\omega|0,\lambda^{-1}I)&=\frac{1}{(2\pi)^{\frac{d}{2}}|\lambda^{-1}I|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}\omega^T(\lambda^{-1}I)^{-1}\omega\right)\\\ln(p(\omega))&=-\frac{\lambda}{2}\omega^T\omega+c\end{aligned}$$
Therefore, under the Bayesian view, the objective that the ridge regression model maximizes is equivalent to:
$$-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-f(x_i,\omega))^2+c(\sigma)-\frac{\lambda}{2}\omega^T\omega+c$$
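A small numerical check of this correspondence (my own sketch, assuming $\sigma=1$ and no separate bias term): the gradient of the negative log-posterior vanishes at the ridge closed-form solution, so the MAP estimate coincides with ridge regression.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 4, 30, 0.5
X = rng.normal(size=(d, n))                   # samples as columns
y = rng.normal(size=n)

# Ridge closed-form solution w* = (X X^T + lambda I)^{-1} X y.
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

# Gradient of 0.5*||y - X^T w||^2 + 0.5*lambda*||w||^2, evaluated at w*.
grad = -X @ (y - X.T @ w_ridge) + lam * w_ridge
print(np.allclose(grad, 0))                   # True: w* is the MAP estimate
```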
3.6 Logistic Regression
3.6.1 Basic Concepts of Logistic Regression
- Logistic regression maps a sample onto an interval through a sigmoid function and uses the result to estimate the probability that the sample belongs to a given class, thereby turning regression into classification. The sigmoid function we usually choose is:
$$y=\sigma(z)=\frac{1}{1+e^{-z}}$$
- For a binary classification problem with labels +1 and -1, the sigmoid maps $\omega^Tx_i$ into the interval (0, 1); if $\omega^Tx_i$ is positive the sample is predicted to be a positive example, and otherwise a negative one. This can be written as:
$$\begin{aligned}P(y_i=1|x_i,\omega)&=\sigma(\omega^Tx_i)=\frac{1}{1+e^{-\omega^Tx_i}}\\P(y_i=-1|x_i,\omega)&=1-\sigma(\omega^Tx_i)=\frac{1}{1+e^{\omega^Tx_i}}\end{aligned}$$
The two expressions above can be written in a unified form:
$$P(y_i|x_i,\omega)=\sigma(y_i\omega^Tx_i)=\frac{1}{1+e^{-y_i\omega^Tx_i}}$$
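A tiny sketch (my own illustration) checking that the unified expression reproduces both cases, using the identity $\sigma(-z)=1-\sigma(z)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.3, -1.2])
x = np.array([1.0, 2.0])
z = w @ x

assert np.isclose(sigmoid(+1 * z), sigmoid(z))        # P(y = +1 | x, w)
assert np.isclose(sigmoid(-1 * z), 1 - sigmoid(z))    # P(y = -1 | x, w)
```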
3.6.2 Parameter Estimation
- The parameters can be estimated by maximum likelihood. Still writing D for the dataset, the process is as follows:
$$\begin{aligned}P(D)&=\prod_{i\in I}\sigma(y_i\omega^Tx_i)\\l(P(D))&=\sum_{i\in I}\ln(\sigma(y_i\omega^Tx_i))=-\sum_{i\in I}\ln(1+e^{-y_i\omega^Tx_i})\end{aligned}$$
Therefore the loss function for maximum-likelihood parameter estimation in logistic regression can be defined as:
$$E(\omega)=\sum_{i\in I}\ln(1+e^{-y_i\omega^Tx_i})$$
- For a binary classification problem, if we use 0 and 1 instead of +1 and -1 as the two class labels, the expression above can be rewritten as follows.
- What we want is the minimizer of this expression:
$$\begin{aligned}E(\omega)&=\sum_{i\in I,\,y_i=1}\ln(1+e^{-\omega^Tx_i})+\sum_{i\in I,\,y_i=0}\ln(1+e^{\omega^Tx_i})\\&=\sum_{i\in I,\,y_i=1}\ln\frac{1+e^{\omega^Tx_i}}{e^{\omega^Tx_i}}+\sum_{i\in I,\,y_i=0}\ln(1+e^{\omega^Tx_i})\\&=\sum_{i\in I}\ln(1+e^{\omega^Tx_i})-\sum_{i\in I,\,y_i=1}\omega^Tx_i\\&=\sum_{i\in I}\left(-y_i\omega^Tx_i+\ln(1+e^{\omega^Tx_i})\right)\end{aligned}$$
- We can show that $E(\omega)$ is a convex function of $\omega$. Since a sum of convex functions is convex, it suffices to show that $-y_i\omega^Tx_i+\ln(1+e^{\omega^Tx_i})$ is convex in $\omega$. Let $g(\omega)=-y_i\omega^Tx_i+\ln(1+e^{\omega^Tx_i})$; taking its gradient gives:
$$\frac{\partial g(\omega)}{\partial\omega}=-y_ix_i+\frac{x_ie^{\omega^Tx_i}}{1+e^{\omega^Tx_i}}$$
This result uses the following fact:
- For an n-dimensional vector $\omega$, we have $\frac{\partial(\omega^Tx)}{\partial\omega}=x$.
Differentiating the gradient once more gives the Hessian:
$$\frac{\partial^2g(\omega)}{\partial\omega^2}=\frac{x_ix_i^Te^{\omega^Tx_i}}{(1+e^{\omega^Tx_i})^2}\succeq 0$$
Therefore the loss function is convex, and its optimum can be found with (stochastic) gradient descent:
$$\omega^{*}=\arg\min_{\omega}E(\omega)$$
- Based on the gradient derived above, the gradient-descent update rule for the logistic regression parameters is:
$$\begin{aligned}\omega_{t+1}&=\omega_t-\eta(t)\sum_{i\in I}\left(-y_ix_i+\frac{x_ie^{\omega_t^Tx_i}}{1+e^{\omega_t^Tx_i}}\right)\\&=\omega_t-\eta(t)\sum_{i\in I}x_i\left(\frac{1}{1+e^{-\omega_t^Tx_i}}-y_i\right)\\&=\omega_t-\eta(t)\,X\left(\sigma(X^T\omega_t)-\boldsymbol y\right)\end{aligned}$$
where $\sigma(\cdot)$ is the sigmoid function applied elementwise and $\eta(t)$ is a user-chosen learning rate, usually a fairly small value that should decay over the iterations.
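Before turning to the implementation, a finite-difference sketch (my own check) that the analytic gradient $X(\sigma(X^T\omega)-\boldsymbol y)$ matches the numerical gradient of $E(\omega)$ in the 0/1-label form:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    # E(w) = sum_i (-y_i w^T x_i + ln(1 + exp(w^T x_i))), samples as columns of X
    z = X.T @ w
    return np.sum(-y * z + np.log1p(np.exp(z)))

def grad(w, X, y):
    return X @ (sigmoid(X.T @ w) - y)

rng = np.random.default_rng(0)
d, n = 3, 20
X = rng.normal(size=(d, n))
y = rng.integers(0, 2, size=n).astype(float)
w = rng.normal(size=d)

eps = 1e-6
num_grad = np.array([(loss(w + eps * e, X, y) - loss(w - eps * e, X, y)) / (2 * eps)
                     for e in np.eye(d)])
print(np.allclose(num_grad, grad(w, X, y), atol=1e-5))   # True
```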
3.6.3 Code Implementation of Logistic Regression
def logistic(X, y):
    """
    LR Logistic Regression.
    INPUT:  X: training sample features, P-by-N matrix.
            y: training sample labels, 1-by-N row vector.
    OUTPUT: w: learned parameters, (P+1)-by-1 column vector.
    """
    P, N = X.shape
    w = np.zeros((P + 1, 1))

    def sigmoid(theta, x):
        return 1.0 / (1 + np.exp(-np.squeeze(np.matmul(theta.T, x))))

    # Prepend a row of ones to X for the constant term.
    X = np.concatenate((np.ones((1, N)), X), axis=0)
    # The original labels are +1/-1; convert them to 1/0 for this formulation.
    y = np.array(y == 1, dtype=float).reshape(N)
    step = 0
    max_step = 100        # maximum number of gradient-descent iterations
    learning_rate = 0.99  # learning rate, the eta(t) in the derivation
    while step < max_step:
        # Gradient of the loss: X (sigma(X^T w) - y).
        grad = np.matmul(X, (sigmoid(w, X) - y).reshape((N, 1)))
        # Decay the learning rate over the iterations.
        learning_rate *= 0.99
        w = w - learning_rate * grad
        step += 1
    return w
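A hedged usage sketch for logistic() (my own example): separate two Gaussian blobs and predict labels by the sign of $\omega^T[1;x]$, assuming the function above is in scope.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
X_pos = rng.normal(loc=+2.0, size=(2, N // 2))
X_neg = rng.normal(loc=-2.0, size=(2, N // 2))
X = np.concatenate((X_pos, X_neg), axis=1)                         # P-by-N features
y = np.concatenate((np.ones(N // 2), -np.ones(N // 2)))[None, :]   # 1-by-N labels in {+1, -1}

w = logistic(X, y)
X_aug = np.concatenate((np.ones((1, N)), X), axis=0)               # prepend the constant row
pred = np.where(w.T @ X_aug > 0, 1, -1)
print(np.mean(pred == y))                                          # training accuracy, close to 1
```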
3.6.4 Regularization for Logistic Regression
- A regularization term can be added to the logistic regression objective, giving the following form:
$$E(\omega)=\sum_{i\in I}\left(-y_i\omega^Tx_i+\ln(1+e^{\omega^Tx_i})\right)+\frac{\lambda}{2}\omega^T\omega$$
- Its gradient is then:
$$\begin{aligned}\frac{\partial E(\omega)}{\partial\omega}&=\sum_{i\in I}\left(-y_ix_i+\frac{x_ie^{\omega^Tx_i}}{1+e^{\omega^Tx_i}}\right)+\lambda\omega\\&=\sum_{i\in I}x_i\left(\frac{1}{1+e^{-\omega^Tx_i}}-y_i\right)+\lambda\omega\\&=X\left(\sigma(X^T\omega)-\boldsymbol y\right)+\lambda\omega\end{aligned}$$
- Therefore each iteration becomes:
$$\omega_{t+1}=\omega_t-\eta(t)\left(X\left(\sigma(X^T\omega_t)-\boldsymbol y\right)+\lambda\omega_t\right)$$
def logistic_r(X, y, lmbda):
    """
    LR Logistic Regression (L2-regularized).
    INPUT:  X: training sample features, P-by-N matrix.
            y: training sample labels, 1-by-N row vector.
            lmbda: regularization parameter.
    OUTPUT: w: learned parameters, (P+1)-by-1 column vector.
    """
    P, N = X.shape
    w = np.zeros((P + 1, 1))
    # YOUR CODE HERE
    # begin answer
    def sigmoid(theta, x):
        return 1.0 / (1 + np.exp(-np.squeeze(np.matmul(theta.T, x))))

    X = np.concatenate((np.ones((1, N)), X), axis=0)
    # Convert +1/-1 labels to 1/0.
    y = np.array(y == 1, dtype=float).reshape(N)
    step = 0
    max_step = 100        # maximum number of gradient-descent iterations
    learning_rate = 0.008
    while step < max_step:
        # L2 penalty term; the bias (row 0) is not regularized.
        regular_item = w * lmbda
        regular_item[0] = 0
        grad = np.matmul(X, (sigmoid(w, X) - y).reshape((N, 1))) + regular_item
        learning_rate *= 0.95
        w = w - learning_rate * grad
        step += 1
    # end answer
    return w