Main contributions:
- Proposes a similarity-preserving loss that encourages the student network to learn the teacher network's knowledge of the relationships within the data (how inputs relate to one another in the teacher's representation).
- The loss can be combined with several existing methods to achieve better distillation results.
- The experimental design includes a transfer-learning comparison experiment.
Main idea
This appears to be an ICCV 2019 paper (according to the early statistics compiled in the 极市平台 repository https://github.com/extreme-assistant/iccv2019). It targets the classification task and approaches knowledge distillation from a new angle, aiming for better distillation through a so-called "similarity-preserving" mechanism.
The idea rests on one key premise: semantically similar inputs tend to produce similar activation patterns in a trained network, and vice versa. From this, the paper states its central hypothesis: if two inputs produce highly similar activations in the teacher network, it is beneficial to guide the student toward a parameter configuration that also produces highly similar activations for those inputs (and vice versa), so that the student better absorbs the teacher's capability and knowledge.
The figure shows, for 10,000 CIFAR-10 images, the vector obtained by spatially averaging each channel of the teacher network's last convolutional-layer activation, plotted for all images. The images fall into ten classes, each class occupying 1,000 consecutive rows; the activations of the 1,000 images within a class look alike, while there are clear differences between classes.
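A minimal PyTorch sketch of how such a visualization could be reproduced; `last_conv_activation` is a hypothetical helper (e.g., a forward hook) that returns the teacher's last convolutional feature map, and is not from the paper's code:

```python
import torch
import matplotlib.pyplot as plt

@torch.no_grad()
def channel_mean_vectors(teacher, loader, last_conv_activation):
    """For every image, spatially average each channel of the last conv activation,
    giving one c-dimensional vector per image."""
    feats, labels = [], []
    for x, y in loader:
        a = last_conv_activation(teacher, x)      # hypothetical helper, returns (b, c, h, w)
        feats.append(a.mean(dim=(2, 3)).cpu())    # spatial mean per channel -> (b, c)
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)

# Sort images by ground-truth class so each class forms a contiguous band of rows,
# then plot: same-class rows look alike, different classes differ (cf. the figure).
# feats, labels = channel_mean_vectors(teacher, test_loader, last_conv_activation)
# plt.imshow(feats[labels.argsort()], aspect="auto")
# plt.xlabel("channel"); plt.ylabel("image (grouped by class)"); plt.show()
```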
Related work
Research on knowledge distillation covers not only knowledge capture (i.e., how the distillation loss is defined), which is what this paper addresses, but also the architectural design of the student and teacher networks.
- Motivated by the effectiveness of MobileNet and ShuffleNet, Crowley et al. replace the regular convolutions in the teacher network with cheaper grouped and pointwise convolutions.
- Ashok et al. develop a reinforcement learning approach to learn the student architecture.
- Polino et al. demonstrate how a full-precision teacher can be used to train a quantized student network.
There is also innovative, "orthogonal" work exploring alternatives to the usual student-teacher training paradigm.
- Wang et al. [KDGAN: Knowledge distillation with generative adversarial networks] introduce an additional discriminator network and train the student, teacher, and discriminator jointly using distillation and adversarial losses.
- Lan et al. [Knowledge distillation by on-the-fly native ensemble] propose an on-the-fly native ensemble teacher model, in which the teacher is trained together with multiple students in a multi-branch network architecture.
A brief review of model compression:
Resource efficiency considerations have led to a recent increase in interest in efficient neural architectures, as well as in algorithms for compressing trained deep networks.
- Knowledge distillation was first introduced as a technique for neural network compression.
- Weight pruning methods remove unimportant weights from the network, sparsifying the network connectivity structure. The induced sparsity is unstructured when individual connections are pruned, or structured when entire channels or filters are pruned. Unstructured sparsity usually results in better accuracy but requires specialized sparse matrix multiplication libraries [Faster CNNs with direct sparse convolutions and guided pruning] or hardware engines [EIE: Efficient inference engine on compressed deep neural network] in practice.
- Quantized networks, such as fixed-point, binary, ternary, and arbitrary-bit networks, encode weights and/or activations using a small number of bits, or at lower precision. Fractional or arbitrary-bit quantization encodes individual weights at different precisions, allowing multiple precisions to be used within a single network layer.
- Low-rank factorization methods produce compact low-rank approximations of filter matrices.
- Techniques from different categories have also been optimized jointly or combined sequentially to achieve higher compression rates [Coreset-based neural network compression, Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, CLIP-Q: Deep network compression learning by in-parallel pruning-quantization].
Advantages of knowledge distillation:
- State-of-the-art network compression methods can achieve significant reductions in network size, in some cases by an order of magnitude, but often require specialized software or hardware support. For example, unstructured pruning requires optimized sparse matrix multiplication routines to realize practical acceleration, platform constraint-aware compression requires hardware simulators or empirical measurements, and arbitrary-bit quantization requires specialized hardware.
- One of the advantages of knowledge distillation is that it is easily implemented in any off-the-shelf deep learning framework without the need for extra software or hardware.
- Moreover, distillation can be integrated with other network compression techniques for further gains in performance [Model compression via distillation and quantization].
Network structure
Here, all images in a mini-batch of size b are encoded together, producing a b×b similarity matrix.
Main method
Building on these ideas, the paper constructs a similarity-preserving knowledge distillation loss that is used in addition to the cross-entropy loss for the final classification. Given a mini-batch of size $b$, the teacher's activation map $A_T^{(l)} \in \mathbb{R}^{b\times c\times h\times w}$ at layer $l$ is reshaped into $Q_T^{(l)} \in \mathbb{R}^{b\times chw}$ (and likewise $Q_S^{(l')}$ for the student at layer $l'$), and the loss is built from row-normalized $b\times b$ Gram matrices:

$$\tilde{G}_T^{(l)} = Q_T^{(l)}\,{Q_T^{(l)}}^{\top};\qquad G_{T[i,:]}^{(l)} = \tilde{G}_{T[i,:]}^{(l)} \Big/ \big\|\tilde{G}_{T[i,:]}^{(l)}\big\|_2 \tag{2}$$

$$\tilde{G}_S^{(l')} = Q_S^{(l')}\,{Q_S^{(l')}}^{\top};\qquad G_{S[i,:]}^{(l')} = \tilde{G}_{S[i,:]}^{(l')} \Big/ \big\|\tilde{G}_{S[i,:]}^{(l')}\big\|_2 \tag{3}$$

$$\mathcal{L}_{SP}(G_T, G_S) = \frac{1}{b^2}\sum_{(l,l')\in\mathcal{I}} \big\|G_T^{(l)} - G_S^{(l')}\big\|_F^2 \tag{4}$$

In words, the loss is computed as follows:
- For the same mini-batch, compute the batch-wise similarity (Gram) matrices of the teacher's and the student's features at particular layers (layer l of the teacher paired with layer l' of the student).
- Compute the squared Frobenius norm of the difference between the teacher's and the student's matrices, sum over all matched layer pairs (l, l') in $\mathcal{I}$, and divide by $b^2$ to obtain the final similarity-preserving knowledge distillation loss.
- Note the form of the denominators in Eqs. (2) and (3): the subscript 2 applied to the matrix rows denotes row-wise L2 normalization.
- This should not be confused with the spectral norm of a matrix, which can be written as $\|A\|_2 = \sqrt{\lambda_{\max}(A^\top A)}$, where $\lambda_{\max}(\cdot)$ takes the largest eigenvalue of its argument; the two norms are not the same (the second version of the paper adds an explanation of this point).
- The F-norm in Eq. (4) is the Frobenius norm (the square root of the sum of the squared matrix entries, sometimes loosely called the matrix L2 norm); its advantage is that it is convex and differentiable, and therefore easy to compute and optimize.
During training, the student network is then supervised with the following total loss:

$$\mathcal{L} = \mathcal{L}_{CE}\big(\mathbf{y}, \sigma(\mathbf{z}_S)\big) + \gamma\,\mathcal{L}_{SP}(G_T, G_S) \tag{5}$$

where $\gamma$ is a hyperparameter that balances the two loss terms, $\mathbf{z}_S$ are the student's logits, and $\sigma(\cdot)$ is the softmax.
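Below is a minimal PyTorch sketch (not the authors' released code) of the similarity-preserving loss and of how it could enter the total objective in Eq. (5); tensor shapes and the γ value are the only assumptions beyond the formulas above.

```python
import torch
import torch.nn.functional as F

def sp_loss(feat_t, feat_s):
    """Similarity-preserving loss for one (teacher layer, student layer) pair.
    feat_t, feat_s: activation maps of shape (b, c, h, w); the channel counts may
    differ between teacher and student, since only the b x b Gram matrices are compared."""
    b = feat_t.size(0)
    q_t = feat_t.reshape(b, -1)                    # (b, c*h*w)
    q_s = feat_s.reshape(b, -1)
    g_t = F.normalize(q_t @ q_t.t(), p=2, dim=1)   # row-wise L2-normalized Gram matrix, Eq. (2)
    g_s = F.normalize(q_s @ q_s.t(), p=2, dim=1)   # Eq. (3)
    return (g_t - g_s).pow(2).sum() / (b * b)      # squared Frobenius norm / b^2, Eq. (4)

# Total objective, Eq. (5): cross-entropy on the student's logits plus gamma * L_SP.
# logits_s, feat_s come from the student; feat_t from the (frozen) teacher.
# loss = F.cross_entropy(logits_s, targets) + gamma * sp_loss(feat_t.detach(), feat_s)
```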
Figure 3 visualizes the G matrices for several batches of the CIFAR-10 test set, with activations collected from the last convolutional layer (a small reproduction sketch follows this list).
- Each column corresponds to a single batch, which is identical for both networks.
- Within each batch, the samples are ordered by their ground-truth class; a batch contains 128 images. The G matrices in both rows show a distinctive blockwise pattern, indicating that the last-layer activations of these networks are similar for images of the same class and dissimilar for images of different classes, i.e., high within-class similarity and low cross-class similarity.
- The blocks vary in size mainly because the number of samples per class differs from batch to batch.
- Comparing the top and bottom rows, the blockwise pattern is more pronounced for the larger model (bottom), reflecting its greater capacity for capturing the semantics of the dataset.
- This observation supports the paper's hypothesis and highlights the value of the proposed similarity-preserving loss: it pushes the student network to better mimic how the teacher captures the relational information among the data.
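A minimal sketch of how a Figure-3-style panel could be produced, reusing the Gram-matrix computation from the loss sketch above; the activation-extraction call is a placeholder, not the paper's code:

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

@torch.no_grad()
def gram_for_batch(last_conv_activation, labels):
    """Row-normalized b x b Gram matrix with samples reordered by ground-truth class,
    so that same-class images form contiguous blocks (the pattern shown in Figure 3)."""
    order = labels.argsort()
    q = last_conv_activation[order].flatten(1)     # (b, c*h*w)
    return F.normalize(q @ q.t(), p=2, dim=1)

# g_teacher = gram_for_batch(teacher_activation, labels)  # placeholders for one test batch
# plt.imshow(g_teacher.cpu()); plt.title("G matrix, teacher, one batch"); plt.show()
```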
Differences from previous methods
- Traditional knowledge distillation uses class scores as the medium for transferring knowledge, whereas the proposed similarity-preserving distillation loss is defined on feature activations.
- FitNets, flow-based distillation, and attention transfer also use distillation losses defined on activations, but with a key difference: those earlier methods encourage the student to mimic various aspects of the teacher's representation space. The proposed method departs from this common approach in that it aims to preserve (i.e., have the student reproduce) the pairwise activation similarities of the input samples. Rather than mimicking the representation space itself, the student directly mimics the relationships between features that the teacher has learned, and its behavior is unchanged by a rotation of the teacher's representation space (see the short check below).
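A quick check of the rotation-invariance claim (added here for clarity, not quoted from the paper): if the teacher's flattened activations $Q$ are rotated by any orthogonal matrix $R$ (so $RR^\top = I$), the Gram matrix entering the SP loss is unchanged:

$$(QR)(QR)^\top = Q\,R R^\top Q^\top = Q Q^\top.$$

Methods that match coordinates of the representation space directly do not have this property, since $QR \neq Q$ in general.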
Experimental details
On the experimental design: We now turn to the experimental validation of our distillation approach on three public datasets.
- We start with CIFAR-10 as it is a commonly adopted dataset for comparing distillation methods, and its relatively small size allows multiple student and teacher combinations to be evaluated.
- We then consider the task of transfer learning, and show how distillation and fine-tuning can be combined to perform transfer learning on a texture dataset with limited training data.
- Finally, we report results on the larger CINIC-10 dataset.
CIFAR-10
CIFAR-10 consists of 50,000 training images and 10,000 testing images at a resolution of 32x32. The dataset covers ten object classes, with each class having an equal number of images.
- We conducted experiments using wide residual networks (WideResNets) following [4, 41].
- We adopted the standard protocol for training wide residual networks on CIFAR-10 (SGD with Nesterov momentum; 200 epochs; batch size of 128; and an initial learning rate of 0.1, decayed by a factor of 0.2 at epochs 60, 120, and 160).
- We applied the standard horizontal flip and random crop data augmentation.
- We performed baseline comparisons with respect to traditional knowledge distillation (softened class scores) and attention transfer.
- For traditional knowledge distillation, we set α = 0.9 and T = 4 following the CIFAR-10 experiments in [4, 41].
- Attention transfer losses were applied for each of the three residual block groups. We set the weight of the distillation loss in attention transfer and similarity-preserving distillation by held-out validation on a subset of the training set (β = 1000 for attention transfer, γ = 3000 for similarity-preserving distillation); a sketch of the training setup described in this list follows below.
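A hedged sketch of the training protocol just described, assuming the usual WideResNet defaults for momentum (0.9) and weight decay (5e-4), which the summary above does not specify:

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

# Hypothetical WideResNet student; any 32x32-input classifier works here.
# student = WideResNet(depth=16, widen_factor=1, num_classes=10)

def make_cifar10_optimizer(student):
    """Standard CIFAR-10 WideResNet schedule described above: SGD with Nesterov
    momentum, initial lr 0.1, decayed by a factor of 0.2 at epochs 60/120/160."""
    optimizer = SGD(student.parameters(), lr=0.1, momentum=0.9,
                    nesterov=True, weight_decay=5e-4)  # momentum/weight-decay values assumed
    scheduler = MultiStepLR(optimizer, milestones=[60, 120, 160], gamma=0.2)
    return optimizer, scheduler
```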
For the CIFAR-10 experiments, WideResNet architectures are used (see the paper for details): the larger networks serve as teachers and the corresponding smaller ones as students, so the comparison is between different distillation methods applied to models of the same architecture family but different capacities.
The above similarity-preserving distillation results were produced using only the activations collected from the last convolution layers of the student and teacher networks. We also experimented with using the activations at the end of each WideResNet block, but found no improvement in performance. We therefore used only the activations at the final convolution layers in the subsequent experiments. Activation similarities may be less informative in the earlier layers of the network because these layers encode more generic features, which tend to be present across many images. Progressing deeper in the network, the channels encode increasingly specialized features, and the activation patterns of semantically similar images become more distinctive.
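One simple way to collect only the last-convolution activations for the SP loss is a forward hook; the module names `teacher.last_conv` / `student.last_conv` below are placeholders, not identifiers from the paper:

```python
import torch

def attach_activation_hook(module):
    """Register a forward hook that stores the module's output, so the last-conv
    activations can be fed to the SP loss after each forward pass."""
    store = {}
    def hook(_module, _inputs, output):
        store["activation"] = output
    module.register_forward_hook(hook)
    return store

# teacher_act = attach_activation_hook(teacher.last_conv)   # placeholder module names
# student_act = attach_activation_hook(student.last_conv)
# _ = teacher(images); logits_s = student(images)
# loss = F.cross_entropy(logits_s, targets) + gamma * sp_loss(
#     teacher_act["activation"].detach(), student_act["activation"])
```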
Transfer learning combining distillation with fine-tuning
Suppose we are faced with a novel recognition task in a specialized image domain with limited training data. Here, the transfer is from natural image classification to a texture/material classification task.
- A natural strategy to adopt is to transfer the knowledge of a network pre-trained on ImageNet (or another suitable large-scale dataset) to the new recognition task by fine-tuning.
- Here, we combine knowledge distillation with fine-tuning:
- we initialize the student network with source domain (in this case, ImageNet) pretrained weights,
- then fine-tune the student to the target domain using both distillation and cross-entropy losses (Eq. (5)).
- We analyzed this scenario using the describable textures dataset [Describing textures in the wild], which is composed of 5,640 images covering 47 texture categories.
- Image sizes range from 300x300 to 640x640.
- We applied ImageNet-style data augmentation with horizontal flipping and random resized cropping during training.
- At test time, images were resized to 256x256 and center cropped to 224x224 for input to the networks.
- For evaluation, we adopted the standard ten training-validation-testing splits.
- To demonstrate the versatility of our method on different network architectures, and in particular its compatibility with mobile-friendly architectures, we experimented with variants of MobileNet and MobileNetV2.
- We compared with an attention transfer baseline.
- Softened class score based distillation is not directly comparable in this setting because the classes in the source and target domains are disjoint (the two domains have different class sets). The teacher would first have to be fine-tuned to the target domain, which significantly increases training time and may not be practical when employing expensive teachers or transferring to large datasets.
- Similarity-preserving distillation can be applied directly to train the student, without first fine-tuning the teacher, since it aims to preserve similarities instead of mimicking the teacher’s representation space.
- We set the hyperparameters for attention transfer and similarity-preserving distillation by held-out validation on the ten standard splits.
- All networks were trained using SGD with Nesterov momentum, a batch size of 96, for 60 epochs, with an initial learning rate of 0.01 reduced to 0.001 after 30 epochs (a sketch of this setup follows the list).
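A hedged sketch of this fine-tuning setup, using torchvision's MobileNetV2 as the student; the teacher object, the feature-extraction calls, and the momentum value are assumptions, not details from the paper:

```python
import torch
import torch.nn.functional as F
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR
from torchvision.models import mobilenet_v2

NUM_CLASSES = 47  # describable textures dataset

# Student initialized from ImageNet-pretrained weights (newer torchvision weights API),
# with the classifier replaced for the target task.
student = mobilenet_v2(weights="IMAGENET1K_V1")
student.classifier[1] = torch.nn.Linear(student.last_channel, NUM_CLASSES)

# Teacher: a larger ImageNet-pretrained network, kept frozen; no teacher fine-tuning is
# needed because SP only asks the student to reproduce the teacher's pairwise similarities.
# teacher = some_pretrained_teacher().eval()          # hypothetical teacher
# for p in teacher.parameters():
#     p.requires_grad_(False)

optimizer = SGD(student.parameters(), lr=0.01, momentum=0.9, nesterov=True)  # momentum assumed
scheduler = MultiStepLR(optimizer, milestones=[30], gamma=0.1)  # lr 0.01 -> 0.001 after 30 of 60 epochs

# One training step: cross-entropy on target labels plus gamma * SP loss on last-conv activations.
# loss = F.cross_entropy(student(x), y) + gamma * sp_loss(teacher_feat.detach(), student_feat)
```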
Table 5 shows some of the transfer-learning experimental results.
The results suggest that there may be a challenging domain shift in the important image areas for the network to attend. Moreover, while attention transfer summarizes the activation map by summing out the channel dimension, similarity-preserving distillation makes use of the full activation map in computing the similarity-based distillation loss, which may be more robust in the presence of a domain shift in attention.
CINIC-10
- The CINIC-10 dataset is designed to be a middle option relative to CIFAR-10 and ImageNet: it is composed of 32x32 images in the style of CIFAR-10, but at a total of 270,000 images its scale is closer to that of ImageNet. We adopted CINIC-10 for rapid experimentation because several GPU-months would have been required to perform full held-out validation and training on ImageNet for our method and all baselines.
- For the student and teacher architectures, we experimented with variants of the state-of-the-art mobile architecture ShuffleNetV2.
- We used the standard training-validation-testing split and set the hyperparameters for similarity-preserving distillation and all baselines by held-out validation.
- KD: {α = 0.6, T = 16}
- AT: β = 50
- SP: γ = 2000
- KD+SP: {α = 0.6, T = 16, γ = 2000}
- AT+SP: {β = 30, γ = 2000}
All networks were trained using SGD with Nesterov momentum, a batch size of 96, for 140 epochs with an initial learning rate of 0.01 decayed by a factor of 10 after the 100th and 120th epochs. We applied CIFAR-style data augmentation with horizontal flips and random crops during training.
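For reference, a sketch of how the combined objectives listed above could be assembled; the KD and AT terms follow their standard formulations (assumed here), and the (1 − α)·CE + α·KD weighting is one common convention that may not match the paper exactly:

```python
import torch
import torch.nn.functional as F

def kd_loss(logits_s, logits_t, T=16.0):
    """Standard softened-class-score distillation term: KL divergence between the
    temperature-scaled teacher and student distributions, scaled by T^2."""
    return F.kl_div(F.log_softmax(logits_s / T, dim=1),
                    F.softmax(logits_t / T, dim=1),
                    reduction="batchmean") * (T * T)

def at_loss(feat_s, feat_t):
    """Attention-transfer term (sketch): distance between L2-normalized spatial
    attention maps obtained by summing squared activations over channels."""
    a_s = F.normalize(feat_s.pow(2).sum(dim=1).flatten(1))
    a_t = F.normalize(feat_t.pow(2).sum(dim=1).flatten(1))
    return (a_s - a_t).pow(2).mean()

# KD+SP with the CINIC-10 values listed above (alpha=0.6, T=16, gamma=2000):
# loss = (1 - 0.6) * F.cross_entropy(logits_s, y) + 0.6 * kd_loss(logits_s, logits_t, T=16) \
#        + 2000 * sp_loss(feat_t.detach(), feat_s)
# AT+SP: loss = F.cross_entropy(logits_s, y) + 30 * at_loss(feat_s, feat_t.detach()) \
#        + 2000 * sp_loss(feat_t.detach(), feat_s)
```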
This result shows that similarity-preserving distillation complements attention transfer and captures teacher knowledge that is not fully encoded in spatial attention maps.
Sensitivity analysis
Figure 4 shows the effect of the hyperparameter γ on performance. Varying it from 100 to 16000 over multiple runs gives a fairly intuitive picture of the sensitivity. In all experiments, we set γ by held-out validation.
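A minimal sketch of such a held-out sweep; `train_and_evaluate` is a hypothetical helper that trains a student with the given SP weight and returns accuracy on a validation split carved out of the training set, and the grid values are illustrative rather than the paper's exact settings:

```python
# Pick gamma by held-out validation over a small grid of candidate values.
candidate_gammas = [100, 200, 400, 800, 1600, 3200, 6400, 16000]  # illustrative grid

def pick_gamma(train_and_evaluate):
    scores = {g: train_and_evaluate(gamma=g) for g in candidate_gammas}
    return max(scores, key=scores.get)
```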