The goal of the Expectation-Maximization (EM) algorithm is parameter estimation (maximum likelihood estimation) for mixture models with latent variables. The MLE of the parameters of $p(x|\theta)$ is written as $\theta_{MLE}=\mathop{argmax}\limits_\theta\log p(x|\theta)$. The EM algorithm attacks this problem iteratively:
$$
\theta^{t+1}=\mathop{argmax}\limits_{\theta}\int_z\log[p(x,z|\theta)]p(z|x,\theta^t)dz=\mathop{argmax}\limits_{\theta}\mathbb{E}_{p(z|x,\theta^t)}[\log p(x,z|\theta)]
$$
This formula contains the two iterative steps:
- E step: compute the expectation of $\log p(x,z|\theta)$ under the distribution $p(z|x,\theta^t)$
- M step: compute the parameters that maximize this expectation and feed them into the next EM iteration (see the sketch after this list)
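As a concrete illustration of the two steps, here is a minimal sketch of EM for a two-component one-dimensional Gaussian mixture. It is not part of the original derivation; the names `normal_pdf` and `em_gmm` and the synthetic data are illustrative assumptions. The E step computes the posterior $p(z|x,\theta^t)$ (the responsibilities), and the M step maximizes $Q(\theta,\theta^t)$, which for a Gaussian mixture has a closed form.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2), written with plain numpy.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm(x, n_iter=50):
    # theta = (pi, mu1, mu2, s1, s2); crude but valid initialization.
    pi, mu1, mu2, s1, s2 = 0.5, x.min(), x.max(), x.std(), x.std()
    for _ in range(n_iter):
        # E step: responsibilities r_i = p(z_i = 1 | x_i, theta^t).
        p1 = pi * normal_pdf(x, mu1, s1)
        p2 = (1 - pi) * normal_pdf(x, mu2, s2)
        r = p1 / (p1 + p2)
        # M step: closed-form argmax of Q(theta, theta^t).
        pi = r.mean()
        mu1 = (r * x).sum() / r.sum()
        mu2 = ((1 - r) * x).sum() / (1 - r).sum()
        s1 = np.sqrt((r * (x - mu1) ** 2).sum() / r.sum())
        s2 = np.sqrt(((1 - r) * (x - mu2) ** 2).sum() / (1 - r).sum())
    return pi, mu1, mu2, s1, s2

# Synthetic data drawn from two Gaussians, then fitted by EM.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 700)])
print(em_gmm(x))
```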
Claim (convergence): $\log p(x|\theta^t)\le\log p(x|\theta^{t+1})$, i.e. the iterations make the likelihood non-decreasing.

Proof: $\log p(x|\theta)=\log p(z,x|\theta)-\log p(z|x,\theta)$. Integrating both sides against $p(z|x,\theta^t)$ (i.e. taking the expectation with respect to $p(z|x,\theta^t)$):
$$
Left:\int_zp(z|x,\theta^t)\log p(x|\theta)dz=\log p(x|\theta)
$$
$$
Right:\int_zp(z|x,\theta^t)\log p(x,z|\theta)dz-\int_zp(z|x,\theta^t)\log p(z|x,\theta)dz=Q(\theta,\theta^t)-H(\theta,\theta^t)
$$
Therefore:
$$
\log p(x|\theta)=Q(\theta,\theta^t)-H(\theta,\theta^t)
$$
Since $Q(\theta,\theta^t)=\int_zp(z|x,\theta^t)\log p(x,z|\theta)dz$ and $\theta^{t+1}=\mathop{argmax}\limits_{\theta}\int_z\log[p(x,z|\theta)]p(z|x,\theta^t)dz$, we have $Q(\theta^{t+1},\theta^t)\ge Q(\theta^t,\theta^t)$. To show $\log p(x|\theta^t)\le\log p(x|\theta^{t+1})$, it remains to show $H(\theta^t,\theta^t)\ge H(\theta^{t+1},\theta^t)$:
$$
H(\theta^{t+1},\theta^t)-H(\theta^{t},\theta^t)=\int_zp(z|x,\theta^t)\log\frac{p(z|x,\theta^{t+1})}{p(z|x,\theta^t)}dz=-KL(p(z|x,\theta^t),p(z|x,\theta^{t+1}))\le0
$$
Alternatively, since this is the expectation of a log under $p(z|x,\theta^t)$, Jensen's inequality gives
$$
\int_zp(z|x,\theta^t)\log\frac{p(z|x,\theta^{t+1})}{p(z|x,\theta^t)}dz\le\log\int_zp(z|x,\theta^t)\frac{p(z|x,\theta^{t+1})}{p(z|x,\theta^t)}dz=\log\int_zp(z|x,\theta^{t+1})dz=0
$$
Combining the results above:
$$
\log p(x|\theta^t)\le\log p(x|\theta^{t+1})
$$
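The identity $\log p(x|\theta)=Q(\theta,\theta^t)-H(\theta,\theta^t)$ used in this proof can be checked numerically on a toy model with a binary latent variable. The following sketch assumes a hypothetical two-state model with fixed Bernoulli emissions; all numbers are illustrative.

```python
import numpy as np

# Hypothetical toy model: z in {0, 1}, p(z=1|theta) = theta,
# with fixed emission probabilities p(x=1|z).
emit = np.array([0.2, 0.9])
x = 1                                        # a single observed data point

def joint(theta):
    # p(x, z | theta) as a length-2 vector over z, for the observed x.
    prior = np.array([1 - theta, theta])
    lik = emit if x == 1 else 1 - emit
    return prior * lik

theta_t, theta = 0.3, 0.7                    # current and candidate parameters
post_t = joint(theta_t) / joint(theta_t).sum()                  # p(z | x, theta^t)

Q = (post_t * np.log(joint(theta))).sum()                       # Q(theta, theta^t)
H = (post_t * np.log(joint(theta) / joint(theta).sum())).sum()  # H(theta, theta^t)
print(np.log(joint(theta).sum()), Q - H)                        # both equal log p(x | theta)
```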
The proof above shows that the likelihood does not decrease at any step. Next, we look at where the formula in the EM iteration comes from:
$$
\log p(x|\theta)=\log p(z,x|\theta)-\log p(z|x,\theta)=\log\frac{p(z,x|\theta)}{q(z)}-\log\frac{p(z|x,\theta)}{q(z)}
$$
Taking the expectation of both sides with respect to $q(z)$:
$$
\begin{align}
&Left:\int_zq(z)\log p(x|\theta)dz=\log p(x|\theta)\\
&Right:\int_zq(z)\log\frac{p(z,x|\theta)}{q(z)}dz-\int_zq(z)\log\frac{p(z|x,\theta)}{q(z)}dz=ELBO+KL(q(z),p(z|x,\theta))
\end{align}
$$
In the expression above, the first term is the Evidence Lower Bound (ELBO), a lower bound on the evidence, so $\log p(x|\theta)\ge ELBO$, with equality exactly when the KL divergence is 0, i.e. when $q(z)=p(z|x,\theta)$. The EM algorithm aims to maximize the ELBO; by the derivation above, at each EM step we obtain the maximal ELBO and plug the maximizing parameters into the next step:
$$
\hat{\theta}=\mathop{argmax}\limits_{\theta}ELBO=\mathop{argmax}\limits_\theta\int_zq(z)\log\frac{p(x,z|\theta)}{q(z)}dz
$$
Since equality in this bound is attained only when $q(z)=p(z|x,\theta^t)$, we get:
$$
\hat{\theta}=\mathop{argmax}\limits_\theta\int_zp(z|x,\theta^t)\log\frac{p(x,z|\theta)}{p(z|x,\theta^t)}dz=\mathop{argmax}\limits_\theta\int_zp(z|x,\theta^t)\log p(x,z|\theta)dz
$$
This is exactly the formula in the EM iteration above.
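The decomposition $\log p(x|\theta)=ELBO+KL(q(z),p(z|x,\theta))$, and the fact that the bound is tight at $q(z)=p(z|x,\theta)$, can also be checked numerically. This is only a sketch with a made-up two-state joint distribution:

```python
import numpy as np

# Made-up joint p(x, z | theta) for a single observed x, z in {0, 1}.
p_xz = np.array([0.12, 0.48])
log_px = np.log(p_xz.sum())                  # evidence log p(x | theta)
posterior = p_xz / p_xz.sum()                # p(z | x, theta)

def elbo(q):
    # ELBO = E_q[log p(x, z | theta) / q(z)]
    return (q * np.log(p_xz / q)).sum()

def kl(q, p):
    # KL(q, p) = E_q[log q(z) / p(z)]
    return (q * np.log(q / p)).sum()

q = np.array([0.5, 0.5])                     # an arbitrary q(z)
print(log_px, elbo(q) + kl(q, posterior))    # identical: evidence = ELBO + KL
print(log_px, elbo(posterior))               # equal: the KL term vanishes
```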
Starting from Jensen's inequality, the same formula can also be derived:
$$
\log p(x|\theta)=\log\int_zp(x,z|\theta)dz=\log\int_z\frac{p(x,z|\theta)q(z)}{q(z)}dz\\
=\log\mathbb{E}_{q(z)}\left[\frac{p(x,z|\theta)}{q(z)}\right]\ge\mathbb{E}_{q(z)}\left[\log\frac{p(x,z|\theta)}{q(z)}\right]
$$
The right-hand side is exactly the ELBO, and equality holds when $p(x,z|\theta)=Cq(z)$ for some constant $C$. Then:
$$
\int_zq(z)dz=\frac{1}{C}\int_zp(x,z|\theta)dz=\frac{1}{C}p(x|\theta)=1\\
\Rightarrow q(z)=\frac{1}{p(x|\theta)}p(x,z|\theta)=p(z|x,\theta)
$$
We see that this is exactly the equality condition in the maximization above.
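A quick numerical check of this equality condition, using the same illustrative two-state joint: when $q(z)$ is the posterior, the ratio $\frac{p(x,z|\theta)}{q(z)}$ is constant in $z$ and equals $p(x|\theta)$, so Jensen's inequality is tight.

```python
import numpy as np

p_xz = np.array([0.12, 0.48])        # made-up joint p(x, z | theta) over z
posterior = p_xz / p_xz.sum()        # q(z) = p(z | x, theta)
print(p_xz / posterior)              # constant in z: both entries equal p(x | theta)
print(p_xz.sum())                    # p(x | theta), i.e. the constant C
```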
Generalized EM
The EM algorithm solves parameter estimation for probabilistic generative models: the latent variable $z$ is introduced in order to learn $\theta$, and specific models make different assumptions about $z$. The learning task for $p(x|\theta)$ is thus the learning task for $\frac{p(x,z|\theta)}{p(z|x,\theta)}$, since $p(x|\theta)=\frac{p(x,z|\theta)}{p(z|x,\theta)}$. So far we have assumed that, in the E step, $q(z)=p(z|x,\theta)$; but if this posterior $p(z|x,\theta)$ cannot be computed exactly, it must be approximated, e.g. by sampling (MCMC) or by variational inference. Looking at the expression involving the KL divergence above, to maximize the ELBO with $\theta$ fixed we need to minimize the KL divergence, so:
$$
\hat{q}(z)=\mathop{argmin}\limits_qKL(q,p)=\mathop{argmax}\limits_qELBO
$$
This is the basic idea of generalized EM (a coordinate-ascent sketch follows this list):
- E step: $\hat{q}^{t+1}(z)=\mathop{argmax}\limits_q\int_zq(z)\log\frac{p(x,z|\theta^t)}{q(z)}dz$, with $\theta$ fixed
- M step: $\hat{\theta}=\mathop{argmax}\limits_\theta\int_z\hat{q}^{t+1}(z)\log\frac{p(x,z|\theta)}{\hat{q}^{t+1}(z)}dz$, with $\hat{q}$ fixed
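These two steps can be read as coordinate ascent on the ELBO. A minimal sketch follows, under illustrative assumptions (binary latent variable, fixed Bernoulli emissions, a free prior parameter $\theta$, and a per-observation categorical $q_i(z)$); here the E-step optimum happens to be the exact posterior, and the printed ELBO is non-decreasing across iterations.

```python
import numpy as np

# Illustrative model: z in {0, 1}, p(z=1|theta) = theta, fixed p(x=1|z).
emit = np.array([0.2, 0.9])
rng = np.random.default_rng(1)
z_true = (rng.random(500) < 0.7).astype(int)
x = (rng.random(500) < emit[z_true]).astype(int)

def log_joint(theta):
    # log p(x_i, z | theta) for every observation, shape (n, 2).
    prior = np.log(np.array([1 - theta, theta]))
    lik = np.where(x[:, None] == 1, emit, 1 - emit)
    return prior + np.log(lik)

theta = 0.5
for step in range(20):
    # E step: q_i(z) <- p(z | x_i, theta); the exact posterior is tractable here.
    lj = log_joint(theta)
    q = np.exp(lj - lj.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    # M step: theta <- argmax_theta E_q[log p(x, z | theta)] (closed form).
    theta = q[:, 1].mean()
    # ELBO = E_q[log p(x, z | theta)] + Entropy(q); non-decreasing over iterations.
    elbo = (q * (log_joint(theta) - np.log(q))).sum()
    print(step, round(theta, 4), round(elbo, 4))
```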
For the integral above:
$$
ELBO=\int_zq(z)\log\frac{p(x,z|\theta)}{q(z)}dz=\mathbb{E}_{q(z)}[\log p(x,z|\theta)]+Entropy(q(z))
$$
Thus, we see that generalized EM amounts to adding the entropy term to the original expression.
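This decomposition is easy to confirm numerically (toy numbers, not from the text):

```python
import numpy as np

p_xz = np.array([0.12, 0.48])                  # toy joint p(x, z | theta) over z
q = np.array([0.3, 0.7])                       # any q(z)
elbo = (q * np.log(p_xz / q)).sum()
expected_log_joint = (q * np.log(p_xz)).sum()  # E_q[log p(x, z | theta)]
entropy = -(q * np.log(q)).sum()               # Entropy(q)
print(elbo, expected_log_joint + entropy)      # identical
```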
Extensions of EM
The EM algorithm is similar to coordinate ascent: fix some coordinates, optimize the others, and iterate. If the posterior over $z$ cannot be computed within the EM framework, variants of EM are needed to approximate this posterior:
- Variational inference with a mean-field assumption: VBEM/VEM
- Monte Carlo based EM: MCEM