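Two-stage detectors usually achieve higher accuracy: they apply a classifier to a sparse set of candidate boxes generated by the RPN (already filtered by NMS). One-stage methods, by contrast, perform regular, dense sampling over possible object locations (for example, YOLOv3 with a 416×416 input has output feature maps of 13×13, 26×26 and 52×52, giving batchsize×3×(13×13+26×26+52×52) candidate boxes in total). They are faster and simpler, but less accurate than two-stage methods. The authors investigate why this happens and find that the imbalance between foreground and background classes (i.e., positive and negative samples) during training is the central cause. To address this class imbalance they reshape the standard cross-entropy loss so that it down-weights the loss assigned to well-classified examples. The method focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training.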
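The candidate count quoted above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name `yolov3_candidates` is hypothetical, and the strides (32, 16, 8) with 3 anchors per cell are assumed from the 13×13, 26×26 and 52×52 grids mentioned in the text.

```python
def yolov3_candidates(input_size=416, strides=(32, 16, 8), anchors_per_cell=3):
    """Count candidate boxes per image for a YOLOv3-style head (illustrative)."""
    # Assumed strides: 416 // 32 = 13, 416 // 16 = 26, 416 // 8 = 52
    grids = [input_size // s for s in strides]
    return anchors_per_cell * sum(g * g for g in grids)


per_image = yolov3_candidates()
print(per_image)            # candidates per image
print(8 * per_image)        # candidates for a batch of 8 images
```

Only a handful of these boxes overlap a real object, which is the imbalance the rest of the article is about.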


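Focal loss is a loss function that down-weights easy samples. It is a modification of the standard Cross Entropy Loss (CE, or BCE in the binary case): for easy samples (those to which the network already assigns a high probability), Focal Loss assigns a smaller loss than CE does. The expressions for the three losses BCE, CE and FL are introduced first:

CE Loss only counts the term for the true class, i.e., the loss on the probability the network assigns to the ground-truth label; gradient descent on this loss pushes that probability toward 1.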

  • Binary Cross Entropy Loss:

$$\mathrm{BCE}(p, y) = -\left[ y \log p + (1 - y) \log (1 - p) \right]$$
$y = 1$ means positive, $y = 0$ means negative, and $p$ is the predicted probability of the positive class

  • Binary Cross Entropy Loss (in PyTorch):

$$\mathcal{L} = \{l_1,\dots,l_N\}^\top, \quad l_n = - w_n \left[ y_n \cdot \log x_n + (1 - y_n) \cdot \log (1 - x_n) \right]$$
$N$ means batch size, $w_n$ means a manual rescaling weight given to the loss of each batch element.
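As a rough illustration of the per-element formula above, the sketch below evaluates $l_n$ in plain Python. The function name `bce_per_element` and the `eps` clamp are assumptions made for this example; they are not PyTorch's implementation.

```python
import math


def bce_per_element(probs, targets, weights=None, eps=1e-12):
    """Per-element weighted BCE: l_n = -w_n [y_n log x_n + (1 - y_n) log(1 - x_n)]."""
    if weights is None:
        weights = [1.0] * len(probs)
    losses = []
    for x, y, w in zip(probs, targets, weights):
        # Clamp the probability so log(0) cannot occur (illustrative choice).
        x = min(max(x, eps), 1.0 - eps)
        losses.append(-w * (y * math.log(x) + (1 - y) * math.log(1 - x)))
    return losses


# A confident correct positive and a confident correct negative
print(bce_per_element([0.9, 0.1], [1, 0]))
# Doubling w_1 doubles only the first element's loss
print(bce_per_element([0.9, 0.1], [1, 0], weights=[2.0, 1.0]))
```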
  • BCEWithLogitsLoss (in PyTorch):

This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.
$$\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = - w_n \left[ y_n \cdot \log \sigma(x_n) + (1 - y_n) \cdot \log (1 - \sigma(x_n)) \right]$$
$\sigma$ means Sigmoid function
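To see why folding the Sigmoid into the loss helps, the sketch below compares a naive "sigmoid then log" computation with one common stable rewriting, `max(x, 0) - x*y + log(1 + exp(-|x|))`. This is an illustrative plain-Python version under that assumption; it is not claimed to be the exact code PyTorch runs, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_naive(logit, y):
    """Sigmoid followed by BCE; breaks once sigmoid saturates to exactly 1.0."""
    p = sigmoid(logit)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_with_logits(logit, y):
    """Stable form: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * y + math.log1p(math.exp(-abs(logit)))


# Both agree for moderate logits
print(bce_naive(0.5, 1), bce_with_logits(0.5, 1))

# For a large logit with a negative label, the naive version hits log(0)
try:
    bce_naive(40.0, 0)
except ValueError as err:
    print("naive version failed:", err)
print(bce_with_logits(40.0, 0))
```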

It's possible to trade off recall and precision by adding weights ($p_c$) to positive examples. In the case of multi-label classification the loss can be described as:
$$\ell_c(x, y) = L_c = \{l_{1,c},\dots,l_{N,c}\}^\top, \quad l_{n,c} = - w_{n,c} \left[ p_c\, y_{n,c} \cdot \log \sigma(x_{n,c}) + (1 - y_{n,c}) \cdot \log (1 - \sigma(x_{n,c})) \right]$$
where: $c$ is the class number ($c > 1$ for multi-label binary classification, $c = 1$ for single-label binary classification),
$n$ is the number of the sample in the batch and $p_c$ is the weight of the positive answer for the class $c$.
$p_c > 1$ increases the recall, $p_c < 1$ increases the precision.
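The effect of the positive weight can be sketched for a single logit as follows. The function name `bce_logits_pos_weight` is hypothetical; the sketch only mirrors the formula above and is not PyTorch's implementation.

```python
import math


def bce_logits_pos_weight(logit, y, pos_weight=1.0):
    """-[p_c * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))] for one logit."""
    # Stable log(sigmoid(x)) for either sign of x
    if logit >= 0:
        log_sig = -math.log1p(math.exp(-logit))
    else:
        log_sig = logit - math.log1p(math.exp(logit))
    # log(1 - sigmoid(x)) = log(sigmoid(-x)) = log(sigmoid(x)) - x
    log_one_minus_sig = log_sig - logit
    return -(pos_weight * y * log_sig + (1 - y) * log_one_minus_sig)


# A missed positive costs three times as much with pos_weight = 3,
# while the loss on a negative example is unchanged.
print(bce_logits_pos_weight(-1.0, 1), bce_logits_pos_weight(-1.0, 1, pos_weight=3.0))
print(bce_logits_pos_weight(-1.0, 0), bce_logits_pos_weight(-1.0, 0, pos_weight=3.0))
```

Because only the positive term is scaled, a weight above 1 pushes the model toward predicting positives more readily, which is the recall side of the trade-off.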

  • Cross Entropy Loss:

$$\mathrm{CE} = -\sum_{i=1}^{C} y_i \log p_i = -\log p_t$$
$C$ means class number, $p_t$ means the probability of the true label ($y_i$ is the one-hot label, so only the true-class term survives)
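A small sketch of multi-class cross entropy, assuming the class probabilities come from a softmax over raw scores (the text does not say how they are produced). The helper names are illustrative.

```python
import math


def softmax(logits):
    """Softmax with the usual max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_class):
    """CE = -log(p_t): only the probability of the true class contributes."""
    probs = softmax(logits)
    return -math.log(probs[true_class])


print(cross_entropy([0.0, 0.0, 0.0], 1))   # uniform prediction over 3 classes
print(cross_entropy([2.0, 0.0, 0.0], 0))   # true class has the highest score
print(cross_entropy([2.0, 0.0, 0.0], 1))   # true class has a low score
```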

  • Focal Loss:

$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$$
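A minimal sketch of the focal loss as a function of $p_t$, written in plain Python for illustration (the function name is an assumption). Setting `gamma=0` recovers ordinary cross entropy.

```python
import math


def focal_loss(p_t, gamma=2.0):
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t)."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


p_t = 0.6
print(focal_loss(p_t, gamma=0.0))  # same value as plain CE, -log(0.6)
print(focal_loss(p_t, gamma=2.0))  # scaled down by (1 - 0.6)^2 = 0.16
```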

The advantage of Focal Loss

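For example, as in Figure 1 of the paper, at p=0.6 the standard CE still gives a relatively large loss, while Focal Loss gives only a relatively small one. This is effectively a decay applied to the weight updates coming from easy samples. Looking at the figure, when $p_t$ is close to 1 the focal loss is far smaller than CE, so the network's loss concentrates on hard, misclassified samples and the network focuses on improving hard examples. In addition, when $\gamma = 0$, Focal Loss reduces to CE Loss, so FL is a generalization of CE.
Properties of focal loss: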

  1. When an example is misclassified, $p_t$ is small ($p_t \to 0$), so the modulating factor $(1 - p_t)^{\gamma} \to 1$ and the loss is unaffected; when $p_t \to 1$, the modulating factor $(1 - p_t)^{\gamma} \to 0$, so the weight of well-classified examples is reduced.
  2. The focusing parameter $\gamma$ smoothly adjusts the rate at which easy examples are down-weighted. Increasing $\gamma$ strengthens the effect of the modulating factor; experiments found that $\gamma = 2$ works best.

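Intuitively, the modulating factor reduces the loss contribution from easy examples and extends the range in which an example receives a low loss. For instance, with $\gamma = 2$, an example classified with $p_t = 0.9$ has a loss 100 times smaller than under CE ($(1-0.9)^2 = 0.01$). This in turn increases the importance of misclassified examples (more precisely, examples whose probability for the correct class is low).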
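The ratio FL/CE is exactly the modulating factor $(1 - p_t)^{\gamma}$, so the down-weighting can be tabulated directly. The sketch below assumes $\gamma = 2$; the sample values of $p_t$ are chosen only for illustration.

```python
def modulating_factor(p_t, gamma=2.0):
    """Ratio FL / CE for a given p_t, i.e. (1 - p_t)^gamma."""
    return (1.0 - p_t) ** gamma


for p_t in (0.1, 0.5, 0.9, 0.968):
    factor = modulating_factor(p_t)
    print(f"p_t = {p_t}: loss scaled by {factor:.6f} (about {1 / factor:.0f}x smaller)")
```

A badly misclassified example (small $p_t$) keeps most of its loss, while an example with $p_t = 0.9$ is scaled down by about two orders of magnitude.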

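The paper also introduces a second hyperparameter, $\alpha$ (the balancing factor), whose role is to compensate for the imbalance between the numbers of positive and negative samples. The paper uses $\alpha = 0.25$, i.e., positive samples receive a smaller weight than negative ones; this is because negatives are easy to classify and are already heavily down-weighted by the modulating factor. The result is the $\alpha$-balanced variant of Focal Loss:
$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$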
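A sketch of the α-balanced form for a binary label, assuming the usual convention that $\alpha_t = \alpha$ for positives and $1 - \alpha$ for negatives, with $p_t$ defined analogously. The function name is hypothetical.

```python
import math


def alpha_balanced_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """FL = -alpha_t * (1 - p_t)^gamma * log(p_t) for one binary prediction.

    p is the predicted probability of the positive class, y is 1 or 0.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Equally well-classified positive and negative (p_t = 0.9 in both cases):
pos = alpha_balanced_focal_loss(0.9, 1)
neg = alpha_balanced_focal_loss(0.1, 0)
print(pos, neg, neg / pos)  # the negative is weighted (1 - 0.25) / 0.25 = 3x more
```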


Here we compute the gradient of the focal loss for backpropagation.
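As a hedged sketch of that gradient: write the logit as $x$, let $s = +1$ for a positive and $s = -1$ for a negative example (a sign convention assumed here for compactness), and set $p_t = \sigma(s x)$. Differentiating $\mathrm{FL} = -(1 - p_t)^{\gamma} \log p_t$ with $\partial p_t / \partial x = s\, p_t (1 - p_t)$ gives $\partial \mathrm{FL} / \partial x = s (1 - p_t)^{\gamma} (\gamma\, p_t \log p_t + p_t - 1)$. The code below checks this expression against a central finite difference; all function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def focal_loss_from_logit(x, s, gamma=2.0):
    """Focal loss for logit x; s = +1 for a positive label, -1 for a negative."""
    p_t = sigmoid(s * x)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def focal_grad_analytic(x, s, gamma=2.0):
    """dFL/dx = s * (1 - p_t)^gamma * (gamma * p_t * log(p_t) + p_t - 1)."""
    p_t = sigmoid(s * x)
    return s * (1.0 - p_t) ** gamma * (gamma * p_t * math.log(p_t) + p_t - 1.0)


def focal_grad_numeric(x, s, gamma=2.0, h=1e-6):
    """Central finite-difference approximation of dFL/dx."""
    up = focal_loss_from_logit(x + h, s, gamma)
    down = focal_loss_from_logit(x - h, s, gamma)
    return (up - down) / (2.0 * h)


for x in (-2.0, 0.3, 1.5):
    for s in (1, -1):
        print(x, s, focal_grad_analytic(x, s), focal_grad_numeric(x, s))
```

With `gamma=0` the analytic expression collapses to $s (p_t - 1)$, the familiar cross-entropy gradient.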

YOLOv3 with Focal Loss

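First, walk through the YOLOv3 detection process to see where Focal Loss applies:
(1) For every predicted box, if its IoU with all ground-truth boxes is below ignore_thresh, its objectness is penalized; if it is above, it is not penalized
(2) For every true box, determine its size and therefore which layer should detect it (which FPN level)
(3) Once the layer is chosen, find the center of the true box and the predicted boxes near it, and assign them to learn that true box
(4) Adjust the location, objectness and classification terms
Only one place qualifies: stage (1). Once training has stabilized, the ratio of penalized to non-penalized boxes in a batch approaches 300 : 1. This is to be expected: with a 416 input, YOLOv3 has (13×13 + 26×26 + 52×52)×3 = 10647 anchors in total, and clearly most predicted boxes cannot have an IoU with a true box above ignore_thresh (0.5 for VOC, 0.7 for MS COCO).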
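Stage (1) can be sketched as follows. This is an illustrative plain-Python version, assuming boxes are given as `(x1, y1, x2, y2)` corners; the helper names `iou` and `penalized_as_background` are hypothetical and do not come from any YOLOv3 codebase.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def penalized_as_background(pred_boxes, true_boxes, ignore_thresh=0.5):
    """True where a prediction's best IoU with any true box is below ignore_thresh."""
    return [
        max((iou(p, t) for t in true_boxes), default=0.0) < ignore_thresh
        for p in pred_boxes
    ]


true_boxes = [(0, 0, 10, 10)]
pred_boxes = [(0, 0, 10, 10), (0, 0, 9, 10), (50, 50, 60, 60)]
print(penalized_as_background(pred_boxes, true_boxes))
```

In a real image nearly all of the 10647 predictions fall into the last case, which is where the roughly 300 : 1 ratio comes from.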


YOLO Loss = Location Loss + Objectness Loss + Classification Loss
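Here Focal Loss is applied to the Objectness Loss, because this is the only term in which positive and negative (foreground and background) samples are extremely imbalanced.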
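To make the motivation concrete, the toy sketch below compares how much of the total objectness loss comes from easy negatives under plain BCE versus focal loss. All numbers are hypothetical (300 background boxes with predicted objectness 0.05 and one object box with predicted objectness 0.3), and the code is not taken from any YOLOv3 implementation.

```python
import math


def bce(p, y):
    """Binary cross entropy for one predicted objectness p and label y."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def focal(p, y, gamma=2.0):
    """Focal loss without alpha balancing: (1 - p_t)^gamma * BCE."""
    p_t = p if y == 1 else 1.0 - p
    return (1.0 - p_t) ** gamma * bce(p, y)


def negative_share(loss_fn, neg_probs, pos_probs):
    """Fraction of the summed loss that is contributed by negative boxes."""
    neg = sum(loss_fn(p, 0) for p in neg_probs)
    pos = sum(loss_fn(p, 1) for p in pos_probs)
    return neg / (neg + pos)


neg_probs = [0.05] * 300   # hypothetical easy background boxes
pos_probs = [0.3]          # hypothetical hard object box

print(negative_share(bce, neg_probs, pos_probs))    # negatives dominate
print(negative_share(focal, neg_probs, pos_probs))  # the hard positive dominates
```

Under these assumed numbers the background boxes account for most of the BCE objectness loss but only a small fraction of the focal version, which is the behaviour the article is after.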

Original Loss
Focal Loss(Focal Loss for Dense Object Detection) - 图36
Focal Loss(Focal Loss for Dense Object Detection) - 图37
Focal Loss
Focal Loss(Focal Loss for Dense Object Detection) - 图38