https://arxiv.org/abs/1512.03385
Deep Residual Learning for Image Recognition.pdf
Abstract
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets (152 layers is quite a striking depth) [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
The deeper a neural network is, the harder it is to train. The authors propose a residual learning framework that makes training easier even when the network is much deeper. They also note that residual networks won two detection competitions; the papers behind the first and second places in such competitions are well worth studying, and ResNet won both competitions in one go.
In computer vision papers, authors like to put their most striking figure in the top-right corner of the first page, where it catches the eye. The figure here shows that deeper plain networks actually do worse: the 56-layer net has higher training and test error than the 20-layer net. That is why the very first sentence of the abstract says that deeper neural networks are harder to train.
Normally you read the conclusion first, but this paper has no conclusion section: CVPR limits the main text to eight pages, and so much space is spent presenting experimental results that there was no room left for one.
The second figure, Fig. 2, shows how ResNet's core building block is realized.
Figure 4 compares training with and without the residual connections. Without them, deeper networks perform worse; adding the residual connections fixes this problem.
The two visible drops in the error curves occur because the authors lowered the learning rate.
“The learning rate starts from 0.1 and is divided by 10 when the error plateaus,”
Top-1 error here is an ImageNet evaluation metric; besides Top-1 there is also Top-5. "The former is a multi-class classification error, i.e. the proportion of incorrectly classified images; the latter is the main evaluation criterion used in ILSVRC, and is computed as the proportion of images such that the ground-truth category is outside the top-5 predicted categories."
Top-1 error is the probability that the single most likely prediction is wrong; Top-5 error is the probability that none of the five highest-scoring predictions is the correct answer. Top-5 error is therefore never larger than Top-1 error, and is usually smaller.
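As a concrete illustration (not from the paper), here is a small NumPy sketch of how top-1 and top-5 error can be computed from a matrix of class scores; the function and array names are my own for the example.

```python
import numpy as np

def topk_error(scores, labels, k):
    """Fraction of samples whose true label is NOT among the k highest-scoring classes.

    scores: (N, C) array of per-class scores; labels: (N,) array of true class indices.
    """
    # indices of the k largest scores per row (order within the top k does not matter)
    topk = np.argpartition(scores, -k, axis=1)[:, -k:]
    hit = (topk == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

scores = np.random.randn(1000, 1000)            # e.g. 1000 images, 1000 ImageNet classes
labels = np.random.randint(0, 1000, size=1000)  # random ground-truth labels for the demo
print("top-1 error:", topk_error(scores, labels, 1))
print("top-5 error:", topk_error(scores, labels, 5))
```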
1. Introduction
Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]. Deep networks naturally integrate low/mid/high level features [50] and classifiers in an end-to-end multi-layer fashion, and the "levels" of features can be enriched by the number of stacked layers (depth).
The first sentence lists the advantages of deep convolutional networks: "Deep networks naturally integrate low/mid/high level features", i.e. more layers let the network extract richer features, from low-level visual features up to high-level semantic ones.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].
As networks keep getting deeper, a question emerges: is learning a better network simply a matter of stacking more layers? An obstacle is "the notorious problem of vanishing/exploding gradients". The remedies are normalized initialization (weights that start out neither too large nor too small) and intermediate normalization layers such as BN, which keep the means and variances of each layer's outputs and gradients in check. With these methods deep networks do converge, but their performance still degrades as depth increases.
When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.
The point here is that the degradation is not caused by overfitting, because the training error also increases. Overfitting means the training error keeps decreasing while the test error increases.
The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).
The authors first consider a shallower network and then its deeper counterpart. If the shallow network works reasonably well, the deep one should not do worse: the newly added layers could always become an identity mapping (a mapping whose output equals its input), with the lower layers copied from the learned shallow model. In principle, then, accuracy should not degrade; but in practice SGD is "unable to find" such solutions. This paper proposes a way to build that identity mapping in explicitly, the deep residual learning framework, so that making the network deeper does not hurt.
In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) − x. The original mapping is recast into F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.
Here x is what the earlier layers have already learned, and H(x) is the underlying mapping we want the newly added layers to realize. Instead of learning H(x) directly, the stacked layers learn F(x) := H(x) − x, which is the residual; the block then outputs F(x) + x, so the overall mapping is unchanged.
The formulation of F(x) + x can be realized by feedforward neural networks with "shortcut connections" (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.
In a neural network this is implemented with shortcut connections. They add no learnable parameters, do not increase model complexity, add essentially no training computation, and require no changes to the existing solver code.
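To make this concrete, here is a minimal sketch of a residual block with an identity shortcut. It is written in PyTorch purely for illustration (the paper's own implementation used Caffe), and the class and layer names are my own, following the two-3×3-conv structure of Fig. 2.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 conv layers; the input x is added back before the final ReLU: F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(channels)
        self.relu  = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # first 3x3 conv + BN + ReLU
        out = self.bn2(self.conv2(out))           # second 3x3 conv + BN
        return self.relu(out + x)                 # identity shortcut: F(x) + x, then ReLU

# shape check: the block preserves the input shape
print(BasicBlock(64)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```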
We present comprehensive experiments on ImageNet [36] to show the degradation problem and evaluate our method. We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart “plain” nets (that simply stack layers) exhibit higher training error when the depth increases; 2) Our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.
Without residual connections, the plain version performs poorly as depth grows; with residual connections, deeper networks give better accuracy.
At this point you already understand the essence of what ResNet does; if you do not need to dig deeper, it is fine to stop reading here.
3. Deep Residual Learning
3.3 Residual Network
Based on the above plain network, we insert shortcut connections (Fig. 3, right) which turn the network into its counterpart residual version. The identity shortcuts (Eqn.(1)) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Fig. 3). When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in Eqn.(2) is used to match dimensions (done by 1×1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.
How does the residual connection handle input and output of different shapes? 1) Pad the shortcut with extra zero channels so that the dimensions match; 2) use a 1×1 convolution, which changes only the number of channels while leaving the spatial layout untouched (with stride 2 to match the spatial downsampling). A sketch of both options follows.
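Continuing the PyTorch illustration (again my own names, not the paper's code), a hedged sketch of the two options: option A subsamples spatially and pads zero channels, adding no parameters; option B uses a strided 1×1 convolution as a projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def shortcut_option_a(x, out_channels, stride=2):
    """Option A: spatial subsampling plus zero-padded channels; parameter-free."""
    x = x[:, :, ::stride, ::stride]             # subsample height and width
    pad = out_channels - x.shape[1]
    return F.pad(x, (0, 0, 0, 0, 0, pad))       # append zero channels

class ShortcutOptionB(nn.Module):
    """Option B: a strided 1x1 convolution projects the input to the new shape."""
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                              stride=stride, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        return self.bn(self.proj(x))

x = torch.randn(1, 64, 56, 56)
print(shortcut_option_a(x, 128).shape)     # torch.Size([1, 128, 28, 28])
print(ShortcutOptionB(64, 128)(x).shape)   # torch.Size([1, 128, 28, 28])
```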
3.4 Implementation
Our implementation for ImageNet follows the practice in [21, 41]. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [41]. A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21]. The standard color augmentation in [21] is used. We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]. We initialize the weights as in [13] and train all plain/residual nets from scratch. We use SGD with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60 × 10^4 iterations. We use a weight decay of 0.0001 and a momentum of 0.9. We do not use dropout [14], following the practice in [16].
In testing, for comparison studies we adopt the standard 10-crop testing [21]. For best results, we adopt the fully convolutional form as in [41, 13], and average the scores at multiple scales (images are resized such that the shorter side is in {224, 256, 384, 480, 640}).
The per-pixel mean is subtracted and color augmentation is applied, along with batch normalization. The learning rate starts at 0.1 and is divided by ten whenever the error stops improving.
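A hedged sketch of that training setup in PyTorch, for illustration only: `model` is a placeholder, and ReduceLROnPlateau is merely one reasonable way to approximate "divide the learning rate by 10 when the error plateaus"; the paper does not specify this mechanism.

```python
import torch

model = torch.nn.Linear(1, 1)  # placeholder; stands in for a plain or residual net

# SGD with the hyperparameters listed in the paper
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,             # learning rate starts from 0.1
                            momentum=0.9,       # momentum 0.9
                            weight_decay=1e-4)  # weight decay 0.0001

# divide the learning rate by 10 when the validation error stops improving
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                       factor=0.1, patience=10)

# training loop sketch: mini-batch size 256, no dropout
# for images, targets in train_loader: ...
# scheduler.step(val_error)  # call once per evaluation with the current validation error
```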
4. Experiments
4.1 ImageNet Classification
Plain Networks. We first evaluate 18-layer and 34-layer plain nets. The 34-layer plain net is in Fig. 3 (middle). The 18-layer plain net is of a similar form. See Table 1 for detailed architectures.
FLOPs here measure how many floating-point operations the network needs.
For a convolution layer, FLOPs ≈ output height × output width × kernel height × kernel width × input channels × output channels (counting multiply-adds).
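A small worked example of that formula (the helper function is mine, not from the paper), applied to ResNet's first layer: a 7×7 convolution from 3 to 64 channels with stride 2, giving a 112×112 output from a 224×224 input.

```python
def conv_flops(out_h, out_w, k_h, k_w, c_in, c_out):
    """Multiply-add count of a single convolution layer."""
    return out_h * out_w * k_h * k_w * c_in * c_out

# first layer of ResNet: 7x7 conv, 3 -> 64 channels, 112x112 output
print(conv_flops(112, 112, 7, 7, 3, 64))  # about 1.18e8 multiply-adds
```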
The paper does not explain why the architecture is designed exactly this way; the configuration was most likely found by tuning. The 3×3 in Table 1 refers to the size of the convolution kernel.
There are five versions of different depth. The initial 7×7 convolution, the pooling layer, and the final fully connected layer are the same in all of them; what differs between the architectures is the convolutional stages in the middle.
Here, for example, a stage contains three residual blocks; the block counts per stage for each version are given in Table 1 (summarized below).
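For reference, the per-stage block counts from Table 1, written as a small Python dict (the dict itself is just my notation):

```python
# number of residual blocks in stages conv2_x .. conv5_x (Table 1 of the paper)
blocks_per_stage = {
    "resnet18":  [2, 2, 2, 2],    # basic blocks (two 3x3 convs each)
    "resnet34":  [3, 4, 6, 3],
    "resnet50":  [3, 4, 6, 3],    # bottleneck blocks (1x1, 3x3, 1x1)
    "resnet101": [3, 4, 23, 3],
    "resnet152": [3, 8, 36, 3],
}
```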
In the training curves, the bold curves show validation (test) error and the thin, lighter curves show training error. Because data augmentation injects noise during training, the training error can actually come out higher than the test error.
In Table 3 we compare three options: (A) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter free (the same as Table 2 and Fig. 4 right); (B) projection shortcuts are used for increasing dimensions, and other shortcuts are identity; and (C) all shortcuts are projections.
The paper compares three ways of making the residual connection when input and output differ. A: pad with zeros. B: use a projection. C: use projections for every shortcut, i.e. add a 1×1 convolution even when the input and output already have the same shape. C performs slightly better, but it is expensive: it adds considerable computation for little gain, so it is not worth it. Today's ResNets use option B: project only when the dimensions differ.
Deeper Bottleneck Architectures.
Next we describe our deeper nets for ImageNet. Because of concerns on the training time that we can afford, we modify the building block as a bottleneck design. For each residual function F, we use a stack of 3 layers instead of 2 (Fig. 5). The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. Fig. 5 shows an example, where both designs have similar time complexity.
When ResNet goes beyond roughly 50 layers, a design called the bottleneck block is used.
When the network gets deeper, the feature dimension should also grow (e.g. from 64 to 256), so that the deeper layers can learn more patterns. But computation grows roughly quadratically with the number of channels, so the block first projects down to a low dimension with a 1×1 convolution, applies the 3×3 convolution there, and then projects back up to the high dimension. Even though the number of channels is four times larger, the complexity of the block stays about the same.
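A hedged PyTorch sketch of the bottleneck block in Fig. 5 (right), following the 256 → 64 → 64 → 256 example; the class name and layer names are my own.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, with an identity shortcut (Fig. 5, right)."""
    def __init__(self, channels, reduced):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)            # 256 -> 64
        self.bn1   = nn.BatchNorm2d(reduced)
        self.conv2 = nn.Conv2d(reduced, reduced, kernel_size=3, padding=1, bias=False)  # 3x3 at 64
        self.bn2   = nn.BatchNorm2d(reduced)
        self.conv3 = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)            # 64 -> 256
        self.bn3   = nn.BatchNorm2d(channels)
        self.relu  = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # reduce dimensions
        out = self.relu(self.bn2(self.conv2(out)))  # 3x3 conv at the reduced width
        out = self.bn3(self.conv3(out))            # restore dimensions
        return self.relu(out + x)                  # identity shortcut

print(Bottleneck(256, 64)(torch.randn(1, 256, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```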
Going back to Table 1, the designs with 50 or more layers now make sense. One thing I still do not understand: why does a deeper network also need more channels?
Why does ResNet make training work so much better?
ResNet helps avoid vanishing gradients.
If the gradients are too small, the network simply stops making training progress.
Without the residual connections the deep plain network barely trains at all; with them, training actually moves, and the accuracy naturally improves.
Note that "SGD has converged" only means training has stopped making progress; by itself that does not mean much.
The blue curve likewise only indicates convergence.
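One back-of-the-envelope way to see why the shortcut helps gradients flow (this derivation is not spelled out in this paper; a similar analysis appears in the authors' follow-up work on identity mappings): each block computes y = x + F(x), so its Jacobian always contains an identity term,

```latex
% y = x + F(x)  =>  the block's Jacobian contains an identity term
\frac{\partial y}{\partial x} = I + \frac{\partial F}{\partial x}
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}\,\Bigl(I + \frac{\partial F}{\partial x}\Bigr).
```

Even if the learned part ∂F/∂x is small, the gradient passed to earlier layers always has a direct identity path and is not multiplicatively shrunk by every intermediate layer.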