Highlights

This paper proposes a subset-selection-based semi-supervised learning method that dynamically selects the subset of unlabeled samples used for training while the model trains, in order to (1) speed up training (lower power consumption, less carbon emission) and (2) keep the model robust in open environments (unlabeled data containing OOD samples / imbalanced class ratios).
Like DS3L [1], the paper casts model training plus unlabeled-sample selection as a bilevel optimization problem: the inner level trains the model with a one-step gradient approximation, and the outer level selects the unlabeled samples that minimize the loss on the labeled data.
For the outer selection, a Taylor expansion yields a relatively efficient sample-selection rule, and by proving that the subset value function is $\alpha$-submodular, the paper derives a theoretical lower bound for the quality of the selected subset.

Details

Methods

Like DS3L [1], this paper also assumes the labeled data are clean, and aligns the distribution of the unlabeled data with that of the labeled data through bilevel optimization. The difference is that the alignment is realized by sample selection instead of sample reweighting, which adds to DS3L the benefit of cutting the computational cost of semi-supervised training.
Concretely, the objective of this paper is as follows:

$$\mathcal{S}_{t}=\underset{\mathcal{S} \subseteq \mathcal{U}:|\mathcal{S}| \leq k}{\operatorname{argmin}}\; L_{S}\left(\mathcal{D},\; \underset{\theta}{\operatorname{argmin}}\left(L_{S}(\mathcal{D}, \theta)+\lambda_{t} \sum_{j \in \mathcal{S}} \boldsymbol{m}_{jt}\, l_{u}(x_{j}, \theta)\right)\right)$$

In the inner optimization, the model seeks the optimal parameters $\theta$ that minimize the loss on the labeled data together with the selected unlabeled data; in the outer optimization, the algorithm selects a subset $\mathcal{S}$ of size at most $k$ that minimizes the model's loss on the labeled data.
As usual for such bilevel problems, solving the inner optimization exactly for the optimal $\theta$ is time-consuming and memory-hungry, so a one-step gradient approximation is used (i.e., the inner optimization performs only a single update). With the inner problem approximated, the objective becomes:

$$\mathcal{S}_{t}=\underset{\mathcal{S} \subseteq \mathcal{U}:|\mathcal{S}| \leq k}{\operatorname{argmin}}\; L_{S}\left(\mathcal{D},\; \theta_{t}-\alpha_{t} \nabla_{\theta} L_{S}(\mathcal{D}, \theta_{t})-\alpha_{t} \lambda_{t} \sum_{j \in \mathcal{S}} \boldsymbol{m}_{jt} \nabla_{\theta} l_{u}(x_{j}, \theta_{t})\right)$$
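To make the one-step approximation concrete, here is a minimal sketch on a toy linear model with squared losses; the helper names (`grad_labeled`, `grad_unlabeled_sum`) and the stand-in unsupervised loss are illustrative assumptions, not the paper's code.

```python
import numpy as np

# One-step inner approximation:
#   theta_S = theta_t - alpha_t * grad(labeled loss)
#             - alpha_t * lambda_t * sum of unsupervised gradients over S.
# Toy linear/squared-loss setup; all names are illustrative.
rng = np.random.default_rng(0)
d = 5
theta_t = rng.normal(size=d)              # current model parameters
X_lab = rng.normal(size=(20, d))          # labeled inputs
y_lab = rng.normal(size=20)               # labels
X_unl = rng.normal(size=(100, d))         # unlabeled pool U
alpha_t, lam_t = 0.1, 1.0                 # step size and unsupervised weight

def grad_labeled(theta):
    # gradient of the labeled loss L_S(D, theta) = mean squared error
    return 2 * X_lab.T @ (X_lab @ theta - y_lab) / len(X_lab)

def grad_unlabeled_sum(theta, subset):
    # sum of per-sample gradients of l_u over the subset; a squared-norm
    # penalty on the predictions stands in for a consistency loss
    Xs = X_unl[list(subset)]
    return 2 * Xs.T @ (Xs @ theta)

S = [3, 7, 42]                            # currently selected subset
theta_S = (theta_t
           - alpha_t * grad_labeled(theta_t)
           - alpha_t * lam_t * grad_unlabeled_sum(theta_t, S))
print(theta_S)
```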

Based on the labeled-data loss inside this objective, the value function for subset selection is defined as:

$$f(\theta_{t}, \mathcal{S})=-L_{S}\left(\mathcal{D},\; \theta_{t}-\alpha_{t} \nabla_{\theta} L_{S}(\mathcal{D}, \theta_{t})-\alpha_{t} \lambda_{t} \sum_{j \in \mathcal{S}} \boldsymbol{m}_{jt} \nabla_{\theta} l_{u}(x_{j}, \theta_{t})\right)$$

With this value function, greedy selection has to recompute $f(\theta_{t}, \mathcal{S} \cup \{e\})$ for every candidate sample $e$, and each evaluation requires first updating the parameters and then computing the loss, which is very expensive. The value function is therefore Taylor-expanded around the current subset $\mathcal{S}$, i.e., around the parameters the model has when the selected subset is $\mathcal{S}$, $\theta_{\mathcal{S}}=\theta_{t}-\alpha_{t} \nabla_{\theta} L_{S}\left(\mathcal{D}, \theta_{t}\right)-\alpha_{t} \lambda_{t} \sum_{j \in \mathcal{S}} \boldsymbol{m}_{jt} \nabla_{\theta} l_{u}\left(x_{j}, \theta_{t}\right)$; this splits the value function into one part that depends only on $\theta_{\mathcal{S}}$ and one part that depends on $e$, reducing the complexity of the greedy algorithm:

$$\hat{f}(\theta_{t}, \mathcal{S} \cup e)=-L_{S}(\mathcal{D}, \theta_{\mathcal{S}})+\alpha_{t} \lambda_{t} \nabla_{\theta} L_{S}(\mathcal{D}, \theta_{\mathcal{S}})^{T} \boldsymbol{m}_{et} \nabla_{\theta} l_{u}(x_{e}, \theta_{t})$$
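The split in $\hat{f}$ follows from a first-order Taylor expansion of $L_{S}$ around $\theta_{\mathcal{S}}$ (a sketch; the mask $\boldsymbol{m}_{et}$ is carried along as in the formulas above). Adding $e$ shifts the parameters by a single gradient term,

$$\theta_{\mathcal{S} \cup e}=\theta_{\mathcal{S}}-\alpha_{t} \lambda_{t}\, \boldsymbol{m}_{et} \nabla_{\theta} l_{u}\left(x_{e}, \theta_{t}\right),$$

so that

$$L_{S}\left(\mathcal{D}, \theta_{\mathcal{S} \cup e}\right) \approx L_{S}\left(\mathcal{D}, \theta_{\mathcal{S}}\right)-\alpha_{t} \lambda_{t}\, \nabla_{\theta} L_{S}\left(\mathcal{D}, \theta_{\mathcal{S}}\right)^{T} \boldsymbol{m}_{et} \nabla_{\theta} l_{u}\left(x_{e}, \theta_{t}\right),$$

and negating both sides gives $\hat{f}$. The first term is constant in $e$, so ranking candidates only requires the gradient dot product.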

In the implementation, the algorithm uses stochastic greedy selection, computes only the last-layer gradients of the model (the classifier gradients) when evaluating gains, and additionally warms up the model before subset selection kicks in.
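A minimal sketch of that selection loop, reusing the toy linear setup from the sketch above (stochastic greedy over the Taylor gain; in the real implementation the gradients would come from the last layer of a deep model, and all names here are illustrative):

```python
import numpy as np

# Stochastic-greedy subset selection with the Taylor gain: the ranking of
# candidates e depends only on  grad L_S(D, theta_S)^T grad l_u(x_e, theta_t).
rng = np.random.default_rng(1)
d, n, k, eps = 5, 200, 20, 0.01
theta_t = rng.normal(size=d)
X_lab, y_lab = rng.normal(size=(30, d)), rng.normal(size=30)
X_unl = rng.normal(size=(n, d))
alpha_t, lam_t = 0.1, 1.0

def grad_labeled(theta):
    return 2 * X_lab.T @ (X_lab @ theta - y_lab) / len(X_lab)

def grad_unlabeled(theta, e):
    x = X_unl[e]
    return 2 * x * (x @ theta)

per_step = max(1, int(n / k * np.log(1 / eps)))      # sample size from [3]
S, remaining = [], set(range(n))
theta_S = theta_t - alpha_t * grad_labeled(theta_t)  # subset is empty so far

for _ in range(k):
    cand = rng.choice(list(remaining), size=min(per_step, len(remaining)),
                      replace=False)
    g_lab = grad_labeled(theta_S)
    # the -L_S(D, theta_S) term of f-hat is constant over candidates, so
    # only the gradient dot product matters for the argmax
    gains = [g_lab @ grad_unlabeled(theta_t, e) for e in cand]
    best = int(cand[int(np.argmax(gains))])
    S.append(best)
    remaining.remove(best)
    # fold the chosen sample's gradient into theta_S before the next step
    theta_S = theta_S - alpha_t * lam_t * grad_unlabeled(theta_t, best)

print("selected subset:", S)
```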

Theorem

The supplementary material proves that the subset-selection value function $f(\theta_{t}, \mathcal{S})$ defined above is $\alpha$-submodular, i.e., for all $\mathbf{X} \subseteq \mathbf{Y}$ and $e \notin \mathbf{Y}$,
$$f(\theta_{t}, \mathbf{X} \cup \{e\})-f(\theta_{t}, \mathbf{X}) \geq (1-\alpha)\left[f(\theta_{t}, \mathbf{Y} \cup \{e\})-f(\theta_{t}, \mathbf{Y})\right].$$
This assumes the norm of every sample is bounded by a constant; the paper derives the corresponding value of $\alpha$ for the case where both the supervised and unsupervised losses are cross-entropy, and for the case where the supervised loss is cross-entropy and the unsupervised loss is mean squared error.
Results from the submodularity literature [2] then directly imply that greedily selecting the subset under this value function attains a $1-\exp(-(1-\alpha))$ fraction of the optimum.
Combined further with the result of [3]: when choosing $k$ samples out of $n$, each step samples only $\frac{n}{k} \log \frac{1}{\epsilon}$ candidates from the $n$ samples and adds the best one to the subset, and the resulting subset achieves a $1-\exp(-1)-\epsilon$ approximation of the optimal subset (for submodular value functions).
In the end, each selection round only needs to evaluate $O(n \log \frac{1}{\epsilon})$ candidates while still achieving a $1-\exp(-(1-\alpha))-\epsilon$ approximation of the optimal subset.
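For completeness, the accounting from [3] behind the last claim (a sketch, writing $n=|\mathcal{U}|$ for the pool size): $k$ greedy steps, each scoring $\frac{n}{k} \log \frac{1}{\epsilon}$ sampled candidates, amount to

$$k \cdot \frac{n}{k} \log \frac{1}{\epsilon}=n \log \frac{1}{\epsilon}$$

gain evaluations per selection round, versus the $O(nk)$ of full greedy.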

Question

  1. Compared with DS3L [1], the only change is strengthening the outer level of the bilevel problem from sample reweighting to sample selection, i.e., a strictly tighter constraint than before. The large drop in computational cost is easy to understand, but the fact that performance actually improves is the part that does not quite add up.
  2. The Intro claims that, intuitively, the method aligns the gradients of the labeled and the unlabeled data. This is not obvious, and it is never directly substantiated later in the paper. I even suspect the claim may not be accurate…

Reference

[1] L.-Z. Guo, Z.-Y. Zhang, Y. Jiang, Y.-F. Li, and Z.-H. Zhou. Safe deep semi-supervised learning for unseen-class unlabeled data. In H. Daumé III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3897–3906. PMLR, 13–18 Jul 2020.
[2] K. Gatmiry and M. Gomez-Rodriguez. On the network visibility problem. CoRR, abs/1811.07863, 2018.
[3] B. Mirzasoleiman, A. Badanidiyuru, A. Karbasi, J. Vondrák, and A. Krause. Lazier than lazy greedy. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, page 1812–1818. AAAI Press, 2015.