感知机

引入

给定样本S03P01-感知机 - 图1%5C%7D%7Bi%3D1%7D%5EN%2CX_i%5Cin%5Cmathbb%7BR%7D%5Ep#card=math&code=Data%3D%5C%7B%28X_i%2Cy_i%29%5C%7D%7Bi%3D1%7D%5EN%2CX_i%5Cin%5Cmathbb%7BR%7D%5Ep&height=23&width=220),假设这个样本分为两类并可以被一个超平面分开,这种情况称为线性可分。二维情况下如图,红点和蓝点分别表示两类
S03P01-感知机 - 图2
如果我们把两个类别分别用+1和-1来表示,于是可建立如下模型

S03P01-感知机 - 图3%3Dsign(W%5ETX)%2CX%5Cin%5Cmathbb%7BR%7D%5Ep%2CW%5Cin%5Cmathbb%7BR%7D%5Ep%5C%5C%0Asign(x)%3D%0A%5Cbegin%7Bcases%7D%0A%2B1%2C%26x%5Cge0%5C%5C%0A-1%2C%26x%5Clt0%0A%5Cend%7Bcases%7D%0A%5Cend%7Bgathered%7D%0A#card=math&code=%5Cbegin%7Bgathered%7D%0Af%28X%29%3Dsign%28W%5ETX%29%2CX%5Cin%5Cmathbb%7BR%7D%5Ep%2CW%5Cin%5Cmathbb%7BR%7D%5Ep%5C%5C%0Asign%28x%29%3D%0A%5Cbegin%7Bcases%7D%0A%2B1%2C%26x%5Cge0%5C%5C%0A-1%2C%26x%5Clt0%0A%5Cend%7Bcases%7D%0A%5Cend%7Bgathered%7D%0A&height=69&width=280)

我们的目的就是找到空间中一个超平面可将样本切分为两类。如图可看出,有些超平面并不能完全切分样本,会有样本分类错误,于是我们自然想到建立一个loss function为分类错误的样本数量

  • 先来看对样本S03P01-感知机 - 图4分类错误怎么表示

S03P01-感知机 - 图5

于是loss function可以写为

S03P01-感知机 - 图6%3D%5Csum%7Bi%3D1%7D%5ENI(-y_iW%5ETX_i)%5C%5C%0AI(x)%3D%0A%5Cbegin%7Bcases%7D%0A1%2C%26x%5Cge0%5C%5C%0A0%2C%26x%5Clt0%0A%5Cend%7Bcases%7D%0A%5Cend%7Bgathered%7D%0A#card=math&code=%5Cbegin%7Bgathered%7D%0AL%28W%29%3D%5Csum%7Bi%3D1%7D%5ENI%28-y_iW%5ETX_i%29%5C%5C%0AI%28x%29%3D%0A%5Cbegin%7Bcases%7D%0A1%2C%26x%5Cge0%5C%5C%0A0%2C%26x%5Clt0%0A%5Cend%7Bcases%7D%0A%5Cend%7Bgathered%7D%0A&height=98&width=193)

这个损失函数虽然很简单很直观,但是它不连续不可导,数学性质较差,不方便我们求解,所以需要寻找其他loss function。

误分类驱动

感知机的目的是找到一个超平面将样本分为两类,如果一个超平面将所有点都分类错误,那么误分类的点到超平面的距离最大,在误分类点减少时,误分类的点到超平面的距离也会减少,找到满足条件的超平面时这个距离为零,因此可以选择将误分类的样本点到超平面的距离总和作为loss function
S03P01-感知机 - 图7
设误分类样本集合为S03P01-感知机 - 图8,loss function为

S03P01-感知机 - 图9%26%3D%5Csum%7BX_i%5Cin%20D%7D%5Cfrac%7BW%5ETX_i%7D%7B%7C%7CW%7C%7C%7D%5C%5C%0A%26%3D%5Csum%7BXi%5Cin%20D%7D%5Cfrac%7B-y_iW%5ETX_i%7D%7B%7C%7CW%7C%7C%7D%0A%5Cend%7Baligned%7D%0A#card=math&code=%5Cbegin%7Baligned%7D%0AL%28W%29%26%3D%5Csum%7BXi%5Cin%20D%7D%5Cfrac%7BW%5ETX_i%7D%7B%7C%7CW%7C%7C%7D%5C%5C%0A%26%3D%5Csum%7BX_i%5Cin%20D%7D%5Cfrac%7B-y_iW%5ETX_i%7D%7B%7C%7CW%7C%7C%7D%0A%5Cend%7Baligned%7D%0A&height=105&width=185)

这个损失虽然可以使用得到结果,但是对其求导计算十分复杂,我们还可以进一步简化。考虑到我们是为了减少误分类的点,在最终找到超平面没有误分类点时损失函数为零,那么其实分母S03P01-感知机 - 图10对我们的目的是没有影响的,因为只取分子的情况下,在误分类点减少时函数值也是一直减小的直到零,因此可以采用如下loss function

S03P01-感知机 - 图11%3D%5Csum%7BX_i%5Cin%20D%7D-y_iW%5ETX_i%0A#card=math&code=L%28W%29%3D%5Csum%7BX_i%5Cin%20D%7D-y_iW%5ETX_i%0A&height=42&width=174)

学习算法

现在我们将问题简化为了最小化loss function。具体的方法是采用随机梯度下降法,梯度如下计算

S03P01-感知机 - 图12%3D-%5Csum%7BX_i%5Cin%20D%7D-y_iX_i%0A#card=math&code=%5Cnabla_WL%28W%29%3D-%5Csum%7BX_i%5Cin%20D%7D-y_iX_i%0A&height=42&width=190)

那么针对某一个错误样本S03P01-感知机 - 图13#card=math&code=%28X_i%2Cy_i%29&height=20&width=54),按如下方式对S03P01-感知机 - 图14进行更新

S03P01-感知机 - 图15

其中S03P01-感知机 - 图16为步长,也称为学习率(learning rate)
迭代算法如下

(1)给定任意初始S03P01-感知机 - 图17
(2)在训练数据集中选取数据(S03P01-感知机 - 图18,S03P01-感知机 - 图19)
(3)如果S03P01-感知机 - 图20则根据式(*)更新权值
(4)检查是否还有误分类样本点,若有转至(2),若无算法结束

收敛性

上述算法一定能够成功结束吗,我们会不会进入一个无限循环迭代呢?即是不是在有限迭代次数内一定能够找到一个超平面将样本完全正确的的划分。下面证明算法的收敛性
由于数据集线性可分,则一定可以找到一个S03P01-感知机 - 图21将数据集完美划分,不妨设S03P01-感知机 - 图22,对于所有的样本有

S03P01-感知机 - 图23

那么有如下结论

S03P01-感知机 - 图24

假设初始的S03P01-感知机 - 图25,对于迭代算法中的第S03P01-感知机 - 图26个误分类点,则S03P01-感知机 - 图27%7D(W%5E%7B(t)%7D)%5ETX%5E%7B(t)%7D%5Clt0#card=math&code=y%5E%7B%28t%29%7D%28W%5E%7B%28t%29%7D%29%5ETX%5E%7B%28t%29%7D%5Clt0&height=24&width=140),根据式S03P01-感知机 - 图28#card=math&code=%28%2A%29&height=20&width=21)有

S03P01-感知机 - 图29%7D%26%3DW%5E%7B(t-1)%7D%2B%5Ceta%20y%5E%7B(t-1)%7DX%5E%7B(t-1)%7D%5C%5C%0A%26%3DW%5E%7B(t-2)%7D%2B%5Ceta%20y%5E%7B(t-2)%7DX%5E%7B(t-2)%7D%2B%5Ceta%20y%5E%7B(t-1)%7DX%5E%7B(t-1)%7D%5C%5C%0A%26%3DW%5E%7B(0)%7D%2B%5Ceta%20y%5E%7B(0)%7DX%5E%7B(0)%7D%2B%5Ceta%20y%5E%7B(1)%7DX%5E%7B(1)%7D%2B%5Ccdots%2B%5Ceta%20y%5E%7B(t-1)%7DX%5E%7B(t-1)%7D%0A%5Cend%7Baligned%7D%5C%5C%0A%5C%5C%0A%5Cbegin%7Baligned%7D%0A%5CRightarrow%20W%7Bopt%7D%5ETW%5E%7B(t)%7D%26%3DW%7Bopt%7D%5ETW%5E%7B(0)%7D%2B%5Ceta%20y%5E%7B(0)%7DW%7Bopt%7D%5ETX%5E%7B(0)%7D%2B%5Ceta%20y%5E%7B(1)%7DW%7Bopt%7D%5ETX%5E%7B(1)%7D%2B%5Ccdots%2B%5Ceta%20y%5E%7B(t-1)%7DW%7Bopt%7D%5ETX%5E%7B(t-1)%7D%5C%5C%0A%26%5Cge%5Ceta%5Cgamma%2B%5Cge%5Ceta%5Cgamma%2B%5Ccdots%2B%5Ceta%5Cgamma%5C%5C%0A%26%3Dt%5Ceta%5Cgamma%0A%5Cend%7Baligned%7D%0A%5Cend%7Bgathered%7D%0A#card=math&code=%5Cbegin%7Bgathered%7D%0A%5Cbegin%7Baligned%7D%0AW%5E%7B%28t%29%7D%26%3DW%5E%7B%28t-1%29%7D%2B%5Ceta%20y%5E%7B%28t-1%29%7DX%5E%7B%28t-1%29%7D%5C%5C%0A%26%3DW%5E%7B%28t-2%29%7D%2B%5Ceta%20y%5E%7B%28t-2%29%7DX%5E%7B%28t-2%29%7D%2B%5Ceta%20y%5E%7B%28t-1%29%7DX%5E%7B%28t-1%29%7D%5C%5C%0A%26%3DW%5E%7B%280%29%7D%2B%5Ceta%20y%5E%7B%280%29%7DX%5E%7B%280%29%7D%2B%5Ceta%20y%5E%7B%281%29%7DX%5E%7B%281%29%7D%2B%5Ccdots%2B%5Ceta%20y%5E%7B%28t-1%29%7DX%5E%7B%28t-1%29%7D%0A%5Cend%7Baligned%7D%5C%5C%0A%5C%5C%0A%5Cbegin%7Baligned%7D%0A%5CRightarrow%20W%7Bopt%7D%5ETW%5E%7B%28t%29%7D%26%3DW%7Bopt%7D%5ETW%5E%7B%280%29%7D%2B%5Ceta%20y%5E%7B%280%29%7DW%7Bopt%7D%5ETX%5E%7B%280%29%7D%2B%5Ceta%20y%5E%7B%281%29%7DW%7Bopt%7D%5ETX%5E%7B%281%29%7D%2B%5Ccdots%2B%5Ceta%20y%5E%7B%28t-1%29%7DW%7Bopt%7D%5ETX%5E%7B%28t-1%29%7D%5C%5C%0A%26%5Cge%5Ceta%5Cgamma%2B%5Cge%5Ceta%5Cgamma%2B%5Ccdots%2B%5Ceta%5Cgamma%5C%5C%0A%26%3Dt%5Ceta%5Cgamma%0A%5Cend%7Baligned%7D%0A%5Cend%7Bgathered%7D%0A&height=165&width=616)

S03P01-感知机 - 图30,则有

S03P01-感知机 - 图31%7D%7C%7C%5E2%26%3D%7C%7CW%5E%7B(t-1)%7D%7C%7C%5E2%2B2%5Ceta%20y%5E%7B(t-1)%7DW%5E%7B(t-1)%7D%5Ccdot%20X%5E%7B(t-1)%7D%2B%5Ceta%5E2(y%5E%7B(t-1)%7D)%5E2%7C%7CX%5E%7B(t-1)%7D%7C%7C%5E2%5C%5C%0A%26%3D%7C%7CW%5E%7B(t-1)%7D%7C%7C%5E2%2B2%5Ceta%20y%5E%7B(t-1)%7D(W%5E%7B(t-1)%7D)%5ETX%5E%7B(t-1)%7D%2B%5Ceta%5E2%7C%7CX%5E%7B(t-1)%7D%7C%7C%5E2%5C%5C%0A%26%5Clt%7C%7CW%5E%7B(t-1)%7D%7C%7C%5E2%2B%5Ceta%5E2%7C%7CX%5E%7B(t-1)%7D%7C%7C%5E2%5C%5C%0A%26%5Clt%7C%7CW%5E%7B(t-1)%7D%7C%7C%5E2%2B%5Ceta%5E2R%5E2%5C%5C%0A%26%5Clt%7C%7CW%5E%7B(0)%7D%7C%7C%5E2%2B%5Ceta%5E2R%5E2%2B%5Ccdots%2B%5Ceta%5E2R%5E2%5C%5C%0A%26%3Dt%5Ceta%5E2R%5E2%0A%5Cend%7Baligned%7D%0A#card=math&code=%5Cbegin%7Baligned%7D%0A%7C%7CW%5E%7B%28t%29%7D%7C%7C%5E2%26%3D%7C%7CW%5E%7B%28t-1%29%7D%7C%7C%5E2%2B2%5Ceta%20y%5E%7B%28t-1%29%7DW%5E%7B%28t-1%29%7D%5Ccdot%20X%5E%7B%28t-1%29%7D%2B%5Ceta%5E2%28y%5E%7B%28t-1%29%7D%29%5E2%7C%7CX%5E%7B%28t-1%29%7D%7C%7C%5E2%5C%5C%0A%26%3D%7C%7CW%5E%7B%28t-1%29%7D%7C%7C%5E2%2B2%5Ceta%20y%5E%7B%28t-1%29%7D%28W%5E%7B%28t-1%29%7D%29%5ETX%5E%7B%28t-1%29%7D%2B%5Ceta%5E2%7C%7CX%5E%7B%28t-1%29%7D%7C%7C%5E2%5C%5C%0A%26%5Clt%7C%7CW%5E%7B%28t-1%29%7D%7C%7C%5E2%2B%5Ceta%5E2%7C%7CX%5E%7B%28t-1%29%7D%7C%7C%5E2%5C%5C%0A%26%5Clt%7C%7CW%5E%7B%28t-1%29%7D%7C%7C%5E2%2B%5Ceta%5E2R%5E2%5C%5C%0A%26%5Clt%7C%7CW%5E%7B%280%29%7D%7C%7C%5E2%2B%5Ceta%5E2R%5E2%2B%5Ccdots%2B%5Ceta%5E2R%5E2%5C%5C%0A%26%3Dt%5Ceta%5E2R%5E2%0A%5Cend%7Baligned%7D%0A&height=148&width=512)

由以上推导最终可得

S03P01-感知机 - 图32%7D%5Cle%7C%7CW%7Bopt%7D%7C%7C%5Ccdot%7C%7CW%5E%7B(t)%7D%7C%7C%3D%7C%7CW%5E%7B(t)%7D%7C%7C%5Clt%5Csqrt%7Bt%7D%5Ceta%20R%5C%5C%0A%5CRightarrow%20t%5Clt%5Cleft(%7BR%5Cover%5Cgamma%7D%5Cright)%5E2%0A%5Cend%7Bgathered%7D%0A#card=math&code=%5Cbegin%7Bgathered%7D%0At%5Ceta%5Cgamma%5Cle%20W%7Bopt%7D%5ETW%5E%7B%28t%29%7D%5Cle%7C%7CW_%7Bopt%7D%7C%7C%5Ccdot%7C%7CW%5E%7B%28t%29%7D%7C%7C%3D%7C%7CW%5E%7B%28t%29%7D%7C%7C%5Clt%5Csqrt%7Bt%7D%5Ceta%20R%5C%5C%0A%5CRightarrow%20t%5Clt%5Cleft%28%7BR%5Cover%5Cgamma%7D%5Cright%29%5E2%0A%5Cend%7Bgathered%7D%0A&height=76&width=395)

因为在给定样本集后S03P01-感知机 - 图33均为常数,则迭代次数S03P01-感知机 - 图34是有上界的,即数据集线性可分的情况下,在有限迭代次数内一定能够找到一个超平面将样本完全正确的的划分,感知机学习算法是收敛的。但当数据集不是线性可分时,感知机算法不收敛,迭代结果会发生震荡

几何解释

还有另外一种理解感知机的方式:对于空间中的超平面S03P01-感知机 - 图35,其法向量为S03P01-感知机 - 图36,那么对于一个样本点S03P01-感知机 - 图37S03P01-感知机 - 图38,显然决定S03P01-感知机 - 图39符号的关键就是向量S03P01-感知机 - 图40与向量S03P01-感知机 - 图41的夹角
S03P01-感知机 - 图42
如图左边所示,假设红点为正类蓝点为负类。在S03P01-感知机 - 图43%7D#card=math&code=W%5E%7B%28t%29%7D&height=20&width=33)时,我们随机找到一个误分类点如负类点M,从图中容易看到M本应该分类为-1,即S03P01-感知机 - 图44%7D)%5ETX_m%3D%5Cvec%7BW%7D%5E%7B(t)%7D%5Ccdot%20%5Cvec%7BX%7D_m%5Clt0#card=math&code=%28W%5E%7B%28t%29%7D%29%5ETX_m%3D%5Cvec%7BW%7D%5E%7B%28t%29%7D%5Ccdot%20%5Cvec%7BX%7D_m%5Clt0&height=28&width=207),那么S03P01-感知机 - 图45%7D#card=math&code=%5Cvec%7BW%7D%5E%7B%28t%29%7D&height=25&width=32)与S03P01-感知机 - 图46夹角应该大于90度,于是我们要想办法让S03P01-感知机 - 图47%7D#card=math&code=%5Cvec%7BW%7D%5E%7B%28t%2B1%29%7D&height=25&width=48)远离S03P01-感知机 - 图48,增大夹角。于是我们令S03P01-感知机 - 图49%7D%3D%5Cvec%7BW%7D%5E%7B(t)%7D-%5Cvec%7BX%7D_m#card=math&code=%5Cvec%7BW%7D%5E%7B%28t%2B1%29%7D%3D%5Cvec%7BW%7D%5E%7B%28t%29%7D-%5Cvec%7BX%7D_m&height=28&width=150)即可达到目的,又S03P01-感知机 - 图50

S03P01-感知机 - 图51%7D%26%3D%5Cvec%7BW%7D%5E%7B(t)%7D-%5Cvec%7BX%7D_m%5C%5C%0A%26%3DW%5E%7B(t)%7D%2By_mX_m%0A%5Cend%7Baligned%7D%0A#card=math&code=%5Cbegin%7Baligned%7D%0A%5Cvec%7BW%7D%5E%7B%28t%2B1%29%7D%26%3D%5Cvec%7BW%7D%5E%7B%28t%29%7D-%5Cvec%7BX%7D_m%5C%5C%0A%26%3DW%5E%7B%28t%29%7D%2By_mX_m%0A%5Cend%7Baligned%7D%0A&height=52&width=176)

如图右边所示,在S03P01-感知机 - 图52%7D#card=math&code=W%5E%7B%28t%2B1%29%7D&height=20&width=48)时仍有误分类点,因此需要继续优化。我们随机找到一个误分类点如正类点N,从图中容易看到N本应该分类为+1,即S03P01-感知机 - 图53%7D)%5ETX_n%3D%5Cvec%7BW%7D%5E%7B(t%2B1)%7D%5Ccdot%20%5Cvec%7BX%7D_n%5Cgt0#card=math&code=%28W%5E%7B%28t%2B1%29%7D%29%5ETX_n%3D%5Cvec%7BW%7D%5E%7B%28t%2B1%29%7D%5Ccdot%20%5Cvec%7BX%7D_n%5Cgt0&height=28&width=230),那么S03P01-感知机 - 图54%7D#card=math&code=%5Cvec%7BW%7D%5E%7B%28t%2B1%29%7D&height=25&width=48)与S03P01-感知机 - 图55夹角应该小于90度,于是我们要想办法让S03P01-感知机 - 图56%7D#card=math&code=%5Cvec%7BW%7D%5E%7B%28t%2B2%29%7D&height=25&width=48)接近S03P01-感知机 - 图57,减小夹角。于是我们令S03P01-感知机 - 图58%7D%3D%5Cvec%7BW%7D%5E%7B(t%2B1)%7D%2B%5Cvec%7BX%7D_n#card=math&code=%5Cvec%7BW%7D%5E%7B%28t%2B2%29%7D%3D%5Cvec%7BW%7D%5E%7B%28t%2B1%29%7D%2B%5Cvec%7BX%7D_n&height=28&width=162)即可达到目的,又S03P01-感知机 - 图59

S03P01-感知机 - 图60%7D%26%3D%5Cvec%7BW%7D%5E%7B(t%2B1)%7D%2B%5Cvec%7BX%7D_n%5C%5C%0A%26%3DW%5E%7B(t%2B1)%7D%2By_nX_n%0A%5Cend%7Baligned%7D%0A#card=math&code=%5Cbegin%7Baligned%7D%0A%5Cvec%7BW%7D%5E%7B%28t%2B2%29%7D%26%3D%5Cvec%7BW%7D%5E%7B%28t%2B1%29%7D%2B%5Cvec%7BX%7D_n%5C%5C%0A%26%3DW%5E%7B%28t%2B1%29%7D%2By_nX_n%0A%5Cend%7Baligned%7D%0A&height=52&width=185)

于是可以对于一个误分类点S03P01-感知机 - 图61#card=math&code=%28X_i%2Cy_i%29&height=20&width=54)统一为如下迭代公式

S03P01-感知机 - 图62

实际在迭代过程中不用每次加上整个向量S03P01-感知机 - 图63,可以令S03P01-感知机 - 图64,每次迭代加上向量S03P01-感知机 - 图65,即

S03P01-感知机 - 图66

这个形式与感知机算法等价