Basic Definitions
Spaces
- Input space $\mathcal{X}$
- Output space $\mathcal{Y}$
Together they form the sample space $\mathcal{X}\times\mathcal{Y}$.
True function
For a sample $(x,y)\in \mathcal{X}\times\mathcal{Y}$ from the sample space, we assume the relationship between $x$ and $y$ can be described by an unknown true function $y=g(x)$ or a true conditional probability distribution $p_r(y\mid x)$.
The goal of machine learning is to find this true function $y=g(x)$ or true conditional distribution $p_r(y\mid x)$.
Hypothesis space
Since we do not know the concrete form of the true function, and there are infinitely many candidate functions, we need to narrow the search: based on experience (prior knowledge about the problem) we posit a set of functions $\mathcal{F}$, called the hypothesis space, and then look for the best function within it.
The hypothesis space $\mathcal{F}$ is usually a parameterized family of functions
$$
\mathcal{F}=\{f(x;\theta)\mid \theta\in \mathbb{R}^m\}
$$
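As a concrete illustration, here is a minimal Python sketch of one such family (the affine functions; the family and parameter values are assumptions chosen for illustration, not anything fixed by the text). Learning then amounts to searching over $\theta$:

```python
import numpy as np

# A toy parameterized hypothesis space F = {f(x; theta) | theta in R^2}:
# the affine functions f(x; theta) = theta[0] * x + theta[1].
def f(x, theta):
    return theta[0] * x + theta[1]

# Learning = searching this family for a good theta, e.g. by comparing
# candidate parameter vectors on the same inputs.
x = np.linspace(0.0, 1.0, 5)
for theta in [np.array([1.0, 0.0]), np.array([0.5, 0.2]), np.array([2.0, -1.0])]:
    print(theta, f(x, theta))
```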
Training set
The training set $D=\{(x_1,y_1),\ldots,(x_m,y_m)\}$ consists of $m$ independent and identically distributed (i.i.d.) samples, where each sample $(x,y)\in \mathcal{X}\times\mathcal{Y}$ is generated by some unknown joint distribution $p_r(x,y)$ (denoted $\mathcal{D}$ below), which we write as $D\sim \mathcal{D}^m$.
Problem
The goal of machine learning is thus: select from the hypothesis space $\mathcal{F}$ the best hypothesis $f^{*}$, i.e. the one that minimizes the generalization error:
$$
f^{*}=\underset{f\in\mathcal{F}}{\arg\min}\; R(f)
$$
Errors
Generalization error
The expected prediction error of a model $f(x;\theta)$ on data drawn from the true distribution is called the generalization error:
$$
R(f)=\underset{(x, y) \sim \mathcal{D}}{\mathbb{E}}\, L(y, f(\boldsymbol{x} ; \theta))
$$
where $L(\cdot,\cdot)$ is a loss function.
Empirical error
However, computing the generalization error requires knowing the true function $y=g(x)$ or the true conditional distribution $p_r(y\mid x)$. In practice we can usually only compute the empirical error, i.e. the average error on the training set $D=\{(x_n,y_n)\}_{n=1}^m$:
$$
\widehat{R}_{D}(f)=\frac{1}{m} \sum_{n=1}^{m} L\left(y_n, f\left(\boldsymbol{x}_n; \theta\right)\right)
$$
Intuitively, we would expect the expectation of the empirical error to equal the generalization error:
$$
\underset{D\sim \mathcal{D}^m}{\mathbb{E}}\left[ \widehat{R}_D\left( f \right) \right] =R\left( f \right)
$$
And indeed this holds:
$$
\begin{align*}
\underset{D\sim \mathcal{D}^m}{\mathbb{E}}\left[ \widehat{R}_D\left( f \right) \right] &=\underset{\substack{D\sim \mathcal{D}^m\\ \left( x,y \right) \in D}}{\mathbb{E}}\left[ \frac{1}{m}\sum_{n=1}^m{L\left( y_n,f\left( x_n \right) \right)} \right] \\
&=\underset{\left( x,y \right) \sim \mathcal{D}}{\mathbb{E}}\left[ \frac{1}{m}\sum_{n=1}^m{L\left( y,f\left( x \right) \right)} \right] \\
&=\underset{\left( x,y \right) \sim \mathcal{D}}{\mathbb{E}}\left[ L\left( y,f\left( x \right) \right) \right] \\
&=R\left( f \right)
\end{align*}
$$
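This can also be checked numerically. Below is a minimal Monte Carlo sketch (the distribution, hypothesis, and parameters are all assumptions chosen for illustration): averaging the empirical risk over many independently drawn training sets recovers the generalization risk.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, m, trials = 0.5, 20, 20000

def sample(n):
    # An assumed joint distribution p_r(x, y): x ~ U(0, 1), y = x + Gaussian noise.
    x = rng.uniform(0.0, 1.0, n)
    return x, x + rng.normal(0.0, sigma, n)

f = lambda x: 0.8 * x                 # a fixed (deliberately imperfect) hypothesis
L = lambda y, yhat: (y - yhat) ** 2   # squared loss

# Empirical risk on many independent training sets D ~ D^m ...
emp = [L(y, f(x)).mean() for x, y in (sample(m) for _ in range(trials))]
# ... versus a large-sample estimate of the generalization risk R(f).
x_big, y_big = sample(10**6)
print("E[empirical risk] ~", np.mean(emp))
print("R(f)              ~", L(y_big, f(x_big)).mean())
```

The two printed values agree up to Monte Carlo error, as the derivation predicts.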
Bayes error
The Bayes error is the smallest generalization error achievable by any measurable function $f$; note that this $f$ need not lie in the hypothesis space $\mathcal{F}$:
$$
R^{*}=\inf_{f} R(f)
$$
Noise
In most cases the Bayes error exists because of noise. Consider the simplest regression problem
$$
y = x + \epsilon
$$
where the noise $\epsilon$ follows a Gaussian distribution, $\epsilon \sim N(0,\sigma^2)$. The optimal model is then $f^{\star}(x)=x$, and its generalization error is:
$$
\begin{align*}
R\left( f^{*} \right) &=\underset{y-x\sim N\left( 0,\sigma ^2 \right)}{\mathbb{E}}\left[ \left( y-f^{*}\left( x \right) \right) ^2 \right] \\
&=\underset{y-x\sim N\left( 0,\sigma ^2 \right)}{\mathbb{E}}\left[ \left( y-x \right) ^2 \right] \\
&=\underset{\epsilon \sim N\left( 0,\sigma ^2 \right)}{\mathbb{E}}\left[ \epsilon ^2 \right] \\
&=\sigma ^2
\end{align*}
$$
That is, $R^{*}=\sigma^2$: no model can do better than the noise level.
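A quick simulation (a sketch under the same assumed model $y = x + \epsilon$) confirms this: $f^{\star}(x)=x$ attains risk $\approx \sigma^2$, and a perturbed model does strictly worse.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
x = rng.uniform(0.0, 1.0, 10**6)
y = x + rng.normal(0.0, sigma, x.size)     # y = x + eps, eps ~ N(0, sigma^2)

risk = lambda f: np.mean((y - f(x)) ** 2)  # Monte Carlo estimate of R(f)

print("R(f*)     ~", risk(lambda t: t))        # ~ sigma^2 = 0.25 (Bayes error)
print("R(biased) ~", risk(lambda t: t + 0.1))  # ~ sigma^2 + 0.01, strictly worse
```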
No Free Lunch theorem
For a discrete sample space $\mathcal{X}$, let $P(h\mid X,\mathfrak{L}_{a})$ denote the probability that algorithm $\mathfrak{L}_{a}$, trained on a training set $X$, produces the hypothesis $h$, and let $f$ denote the true target function.
The expected error of $h$ on all samples outside the training set is then:
$$
\mathbb{E}\left[\mathbb{I}(h(\boldsymbol{x}) \neq f(\boldsymbol{x})) \mid X, f,\mathfrak{L}_{a}\right]=\sum_{h} \sum_{\boldsymbol{x} \in \mathcal{X}-X} P(\boldsymbol{x})\, \mathbb{I}(h(\boldsymbol{x}) \neq f(\boldsymbol{x}))\, P\left(h \mid X, \mathfrak{L}_{a}\right)
$$
For simplicity, consider binary classification and allow the true target function to be any mapping from $\mathcal{X}$ to $\{0,1\}$. For example, when the sample space contains only two samples, $\mathcal{X}=\{\boldsymbol{x}_1,\boldsymbol{x}_2\}$, the true function $f$ is one of the following four, each equally likely:
$$
\begin{aligned}
f_1&:f_1(\boldsymbol{x}_1)=0,f_1(\boldsymbol{x}_2)=0;\\
f_2&:f_2(\boldsymbol{x}_1)=0,f_2(\boldsymbol{x}_2)=1;\\
f_3&:f_3(\boldsymbol{x}_1)=1,f_3(\boldsymbol{x}_2)=0;\\
f_4&:f_4(\boldsymbol{x}_1)=1,f_4(\boldsymbol{x}_2)=1;
\end{aligned}
$$
We now average the expected error over all possible $f$:
$$
\begin{aligned}
\mathbb{E}_{f}\left[\mathbb{E}\left[\mathbb{I}(h(\boldsymbol{x}) \neq f(\boldsymbol{x})) \mid X, f,\mathfrak{L}_{a}\right]\right] &=\cfrac{1}{2^{|\mathcal{X}|}}\sum_f\sum_h\sum_{\boldsymbol{x}\in\mathcal{X}-X}P(\boldsymbol{x})\,\mathbb{I}(h(\boldsymbol{x})\neq f(\boldsymbol{x}))\,P(h\mid X,\mathfrak{L}_a) \\
&=\cfrac{1}{2^{|\mathcal{X}|}}\sum_{\boldsymbol{x}\in\mathcal{X}-X}P(\boldsymbol{x}) \sum_h P(h\mid X,\mathfrak{L}_a)\sum_f\mathbb{I}(h(\boldsymbol{x})\neq f(\boldsymbol{x})) \\
&=\cfrac{1}{2^{|\mathcal{X}|}}\sum_{\boldsymbol{x}\in\mathcal{X}-X}P(\boldsymbol{x}) \sum_h P(h\mid X,\mathfrak{L}_a)\cdot\cfrac{1}{2}2^{\vert \mathcal{X} \vert} \\
&=\cfrac{1}{2}\sum_{\boldsymbol{x}\in\mathcal{X}-X}P(\boldsymbol{x}) \sum_h P(h\mid X,\mathfrak{L}_a) \\
&=\cfrac{1}{2}\sum_{\boldsymbol{x}\in\mathcal{X}-X}P(\boldsymbol{x}) \cdot 1\\
\end{aligned}
$$
From the second line to the third: since this is binary classification, for any $h$ and any given $\boldsymbol{x}$, exactly half of the target functions $f$ disagree with $h$ at $\boldsymbol{x}$ and half agree. Therefore
$$
\sum_f\mathbb{I}(h(\boldsymbol{x})\neq f(\boldsymbol{x}))=\cfrac{1}{2}2^{\vert \mathcal{X} \vert}
$$
Hence, for any two learning algorithms $\mathfrak{L}_a$ and $\mathfrak{L}_b$, we have:
$$
\mathbb{E}_{f}\left[\mathbb{E}\left[\mathbb{I}(h \neq f) \mid X, f,\mathfrak{L}_{a}\right]\right]=\mathbb{E}_{f}\left[\mathbb{E}\left[\mathbb{I}(h \neq f) \mid X, f, \mathfrak{L}_{b}\right]\right]
$$
In other words, in this setting the expected total error is the same no matter which learning algorithm we use!
This is the core message of the NFL theorem:
:::info
For any two learning algorithms $\mathfrak{L}_a$ and $\mathfrak{L}_b$, algorithm $\mathfrak{L}_a$ cannot outperform algorithm $\mathfrak{L}_b$ on every problem.
:::
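The counting argument can also be checked by brute force. Below is a minimal sketch (the sample space, training set, and both "algorithms" are hypothetical toys): we enumerate every target function $f:\mathcal{X}\to\{0,1\}$ on a four-point space and find that two very different algorithms have exactly the same average off-training-set error.

```python
import itertools
import numpy as np

# Toy discrete sample space with 4 points; the training set fixes labels
# on the first two points, leaving two off-training-set points.
X_space = [0, 1, 2, 3]
train_idx, test_idx = [0, 1], [2, 3]
P = np.full(len(X_space), 0.25)  # P(x): uniform, an arbitrary choice

def ots_error(h, f):
    # Off-training-set error: sum over x in X_space - X of P(x) * I(h(x) != f(x)).
    return sum(P[i] for i in test_idx if h[i] != f[i])

# Two deterministic "learning algorithms" (hypothetical toy rules); each maps
# the observed training labels to one hypothesis over all of X_space.
def algo_a(train_labels):
    # Predicts 1 off the training set iff at least half the training labels are 1.
    guess = int(2 * sum(train_labels) >= len(train_labels))
    return train_labels + (guess, guess)

def algo_b(train_labels):
    # Always predicts 0 off the training set.
    return train_labels + (0, 0)

for algo in (algo_a, algo_b):
    # Average the off-training-set error over every possible target function
    # f: X_space -> {0, 1} (all 2^4 = 16 of them, each equally likely).
    errs = [ots_error(algo(tuple(f[i] for i in train_idx)), f)
            for f in itertools.product([0, 1], repeat=len(X_space))]
    print(algo.__name__, "mean off-training-set error:", np.mean(errs))
```

Both rules come out at the same average error, $\tfrac{1}{2}\sum_{\boldsymbol{x}\in\mathcal{X}-X}P(\boldsymbol{x})=0.25$, exactly as the derivation predicts.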
