The exponential family is a class of distributions that includes the Gaussian, Bernoulli, binomial, Poisson, Beta, Dirichlet, and Gamma distributions, among others. An exponential family distribution can be written in the unified form:

$$
p(x|\eta)=h(x)\exp(\eta^T\phi(x)-A(\eta))=\frac{1}{\exp(A(\eta))}h(x)\exp(\eta^T\phi(x))
$$
where $\eta$ is the parameter vector and $A(\eta)$ is the log partition function (the normalizing factor). In this expression, $\phi(x)$ is called the sufficient statistic: it carries all the information the samples provide about the parameters, for example the mean and variance in the Gaussian case. Sufficient statistics are useful in online learning: for a data set, one only needs to maintain the sufficient statistics of the samples seen so far.
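As a minimal sketch of this online-learning use, assuming the Gaussian case where $\phi(x)=(x,x^2)^T$, it suffices to keep running sums of $x$ and $x^2$ plus the count $N$ (the class name below is made up for illustration):

```python
# Sketch: for a Gaussian, phi(x) = (x, x^2), so an online learner only needs
# the running sums of x and x^2 plus the sample count N.
class RunningGaussian:
    def __init__(self):
        self.n = 0
        self.sum_x = 0.0    # running sum of phi_1(x) = x
        self.sum_x2 = 0.0   # running sum of phi_2(x) = x^2

    def update(self, x):
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x

    def mean(self):
        return self.sum_x / self.n

    def var(self):
        # E[x^2] - E[x]^2, the (biased) maximum-likelihood variance
        return self.sum_x2 / self.n - self.mean() ** 2

rg = RunningGaussian()
for x in [1.0, 2.0, 3.0, 4.0]:
    rg.update(x)
# mean = 10/4 = 2.5, var = 30/4 - 2.5^2 = 1.25
```

No sample needs to be stored; the three accumulators are all the estimator ever reads.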
Given a model (likelihood) assumption, in solving for the posterior we often seek a conjugate prior, so that the posterior has the same form as the prior. For example, if the likelihood is binomial and the prior is a Beta distribution, then the posterior is also a Beta distribution. Exponential family distributions often possess conjugate priors, which is very convenient for model selection and inference.

Conjugate priors make computation easy. Moreover, exponential family distributions embody the maximum entropy idea (uninformative priors): the distribution derived from an empirical distribution via the maximum entropy principle is an exponential family distribution.

Observe that the exponential family expression resembles a linear model; indeed, exponential family distributions lead naturally to generalized linear models:

$$
y=f(w^Tx),\quad y|x\sim \text{ExpFamily}
$$
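As a hedged illustration of this GLM setup: with a Bernoulli likelihood for $y|x$, the canonical response $f$ is the sigmoid, which recovers logistic regression. The weights and input below are made-up values:

```python
import math

# Bernoulli GLM sketch: the linear predictor eta = w^T x is mapped through
# the sigmoid to give the conditional mean E[y|x]. Values are hypothetical.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [0.5, -0.3]   # hypothetical weights
x = [2.0, 1.0]    # hypothetical input
eta = sum(wi * xi for wi, xi in zip(w, x))  # w^T x = 0.7
p = sigmoid(eta)  # E[y|x], a probability in (0, 1)
```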
Exponential family distributions also play an important role in more complex probabilistic graphical models, for example in undirected models such as the restricted Boltzmann machine. They likewise greatly simplify computation in inference algorithms such as variational inference.
## One-dimensional Gaussian distribution

The one-dimensional Gaussian distribution can be written as:

$$
p(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{(x-\mu)^2}{2\sigma^2})
$$
Rewriting this expression:

$$
\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{1}{2\sigma^2}(x^2-2\mu x+\mu^2))
=\exp(\log(2\pi\sigma^2)^{-1/2})\exp(-\frac{1}{2\sigma^2}\begin{pmatrix}-2\mu&1\end{pmatrix}\begin{pmatrix}x\\x^2\end{pmatrix}-\frac{\mu^2}{2\sigma^2})
$$
so:

$$
\eta=\begin{pmatrix}\eta_1\\\eta_2\end{pmatrix}=\begin{pmatrix}\frac{\mu}{\sigma^2}\\-\frac{1}{2\sigma^2}\end{pmatrix},\quad
\phi(x)=\begin{pmatrix}x\\x^2\end{pmatrix},\quad h(x)=1
$$

and hence $\mu=-\frac{\eta_1}{2\eta_2}$, $\sigma^2=-\frac{1}{2\eta_2}$. Substituting into $A(\eta)=\frac{\mu^2}{2\sigma^2}+\frac{1}{2}\log(2\pi\sigma^2)$:

$$
A(\eta)=-\frac{\eta_1^2}{4\eta_2}+\frac{1}{2}\log(-\frac{\pi}{\eta_2})
$$
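This change of parameters can be checked numerically: with $\eta_1=\mu/\sigma^2$, $\eta_2=-1/(2\sigma^2)$, $h(x)=1$ and $\phi(x)=(x,x^2)^T$, the exponential-family density should coincide with the Gaussian pdf:

```python
import math

# Numerical check: h(x) exp(eta^T phi(x) - A(eta)) equals the Gaussian pdf
# under the parameter mapping derived above. mu, sigma are arbitrary test values.
mu, sigma = 1.5, 2.0
eta1 = mu / sigma**2
eta2 = -1.0 / (2 * sigma**2)
A = -eta1**2 / (4 * eta2) + 0.5 * math.log(-math.pi / eta2)

def expfam_pdf(x):
    return math.exp(eta1 * x + eta2 * x**2 - A)

def gauss_pdf(x):
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)
```

Evaluating both densities at a few points confirms the two forms agree to machine precision.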
## Relation between the sufficient statistic and the log partition function

Integrating the probability density function:

$$
\exp(A(\eta))=\int h(x)\exp(\eta^T\phi(x))dx
$$
Differentiating both sides with respect to the parameter:

$$
\exp(A(\eta))A'(\eta)=\int h(x)\exp(\eta^T\phi(x))\phi(x)dx\\
\Longrightarrow A'(\eta)=\mathbb{E}_{p(x|\eta)}[\phi(x)]
$$
Similarly:

$$
A''(\eta)=\mathrm{Var}_{p(x|\eta)}[\phi(x)]
$$
Since the variance is non-negative, $A(\eta)$ must be a convex function.
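These two identities can be verified numerically for the Bernoulli distribution, for which $\phi(x)=x$, $\eta=\log\frac{p}{1-p}$ and $A(\eta)=\log(1+e^\eta)$; finite differences of $A$ should recover the mean $p$ and the variance $p(1-p)$:

```python
import math

# Check A'(eta) = E[phi(x)] and A''(eta) = Var[phi(x)] for a Bernoulli
# distribution via central finite differences of A(eta) = log(1 + e^eta).
eta = 0.8
p = 1.0 / (1.0 + math.exp(-eta))   # Bernoulli mean implied by eta

def A(e):
    return math.log(1.0 + math.exp(e))

h = 1e-5
A1 = (A(eta + h) - A(eta - h)) / (2 * h)             # approximates A'(eta)
A2 = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2   # approximates A''(eta)
# A1 ~ p (the mean), A2 ~ p * (1 - p) (the variance)
```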
## Sufficient statistics and maximum likelihood estimation

For a data set $\mathcal{D}=\{x_1,\cdots,x_N\}$ obtained by i.i.d. sampling:

$$
\begin{align}
\eta_{MLE}&=\mathop{argmax}_\eta\sum\limits_{i=1}^N\log p(x_i|\eta)\nonumber\\
&=\mathop{argmax}_\eta\sum\limits_{i=1}^N(\eta^T\phi(x_i)-A(\eta))\nonumber\\
&\Longrightarrow A'(\eta_{MLE})=\frac{1}{N}\sum\limits_{i=1}^N\phi(x_i)
\end{align}
$$
Thus, to estimate the parameter, it suffices to know the sufficient statistic.
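A small sketch of solving $A'(\eta_{MLE})=\frac{1}{N}\sum_i\phi(x_i)$ for a Bernoulli sample (the data below is made up): here $\phi(x)=x$ and $A'(\eta)$ is the sigmoid, so inverting gives $\eta_{MLE}=\mathrm{logit}(\bar{x})$, and only the sufficient statistic $\sum_i x_i$ enters the estimate.

```python
import math

# Bernoulli MLE through the sufficient statistic: A'(eta) = sigmoid(eta),
# so A'(eta_MLE) = x_bar gives eta_MLE = log(x_bar / (1 - x_bar)).
data = [1, 0, 1, 1, 0, 1, 1, 0]            # hypothetical binary sample
x_bar = sum(data) / len(data)              # (1/N) sum_i phi(x_i) = 0.625
eta_mle = math.log(x_bar / (1 - x_bar))    # inverts the sigmoid

# sanity check: mapping eta_mle back through A' recovers the sample mean
recovered = 1.0 / (1.0 + math.exp(-eta_mle))
```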
## Maximum entropy

The information entropy is written as:

$$
Entropy=\int-p(x)\log(p(x))dx
$$
In general, a completely random variable (all outcomes equally likely) has maximal entropy.
Our assumption is the maximum entropy principle. Suppose the data is discretely distributed, with the probabilities of the $K$ outcomes being $p_k$; the maximum entropy principle can then be stated as:

$$
\max\{H(p)\}=\min\{\sum\limits_{k=1}^Kp_k\log p_k\}\ s.t.\ \sum\limits_{k=1}^Kp_k=1
$$
Using Lagrange multipliers:

$$
L(p,\lambda)=\sum\limits_{k=1}^Kp_k\log p_k+\lambda(1-\sum\limits_{k=1}^Kp_k)
$$
Setting the derivative with respect to each $p_k$ to zero gives:

$$
p_1=p_2=\cdots=p_K=\frac{1}{K}
$$

Therefore, entropy is maximal when all outcomes are equally likely.
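A quick numerical check of this conclusion, comparing the uniform distribution on $K=4$ outcomes against a few arbitrarily chosen alternatives:

```python
import math

# H(p) = -sum_k p_k log p_k; the uniform distribution attains log K,
# and every non-uniform distribution scores strictly lower.
def entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

uniform = [0.25] * 4
others = [[0.7, 0.1, 0.1, 0.1],
          [0.4, 0.3, 0.2, 0.1],
          [0.97, 0.01, 0.01, 0.01]]
# entropy(uniform) = log 4; each distribution in `others` has lower entropy
```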
For a data set $\mathcal{D}=\{x_1,\cdots,x_N\}$, the empirical distribution is $\hat{p}(x)=\frac{Count(x)}{N}$. In practice, it is impossible for all the empirical probabilities to be equal, so the maximum entropy principle above must be supplemented with constraints from this empirical distribution.
For any function $f(x)$, its expectation under the empirical distribution can be computed:

$$
\mathbb{E}_{\hat{p}}[f(x)]=\Delta
$$
Then:

$$
\max\{H(p)\}=\min\{\sum\limits_{k=1}^Np_k\log p_k\}\ s.t.\ \sum\limits_{k=1}^Np_k=1,\ \mathbb{E}_p[f(x)]=\Delta
$$
The Lagrangian is:

$$
L(p,\lambda_0,\lambda)=\sum\limits_{k=1}^Np_k\log p_k+\lambda_0(1-\sum\limits_{k=1}^Np_k)+\lambda^T(\Delta-\mathbb{E}_p[f(x)])
$$
Differentiating:

$$
\frac{\partial}{\partial p(x)}L=\sum\limits_{k=1}^N(\log p(x)+1)-\sum\limits_{k=1}^N\lambda_0-\sum\limits_{k=1}^N\lambda^Tf(x)\\
\Longrightarrow\sum\limits_{k=1}^N(\log p(x)+1-\lambda_0-\lambda^Tf(x))=0
$$
Since the data set is arbitrary, the sum vanishing means each summand is zero:

$$
p(x)=\exp(\lambda^Tf(x)+\lambda_0-1)
$$
This is exactly an exponential family distribution.
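A small numerical illustration of this result, under the assumption of support $\{0,1,2\}$ with $f(x)=x$ and constraint $\mathbb{E}[x]=1.2$: the maximum-entropy solution has the form $p(x)\propto\exp(\lambda x)$, which we find by bisection on $\lambda$ and then compare against another distribution with the same mean.

```python
import math

# Max-entropy with a mean constraint on {0, 1, 2}: the solution is
# p(x) = exp(lambda * x) / Z. The mean of dist(lambda) is increasing in
# lambda, so bisection pins down the lambda matching E[x] = 1.2.
support = [0, 1, 2]
target_mean = 1.2

def dist(lam):
    w = [math.exp(lam * x) for x in support]
    z = sum(w)
    return [wi / z for wi in w]

def mean(p):
    return sum(pi * x for pi, x in zip(p, support))

lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean(dist(mid)) < target_mean:
        lo = mid
    else:
        hi = mid
p_maxent = dist((lo + hi) / 2)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# an alternative distribution with the same mean 1.2 should have lower entropy
p_alt = [0.2, 0.4, 0.4]
```

The exponential form beats the hand-picked alternative, as the derivation predicts.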
