Focal Loss for Dense Object Detection

Why is the accuracy of the one-stage algorithm lower than that of the two-stage algorithm?

One of the reasons is the extreme foreground-background class imbalance.

Two-stage algorithms

  • When training the first stage, boxes are sorted by foreground score and the vast majority of easy negatives are removed.
  • When training the second stage, biased sampling is used to construct mini-batches that contain
    a 1:3 ratio of positive to negative examples.

Focal loss

focal_loss.png

Cross entropy: (target Y is one-hot)

$\mathrm{CE}(p, y) = -\sum_{i=1}^{N} y_i \log(p_i) = -\log(p_t)$
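As a tiny numpy check of the one-hot cross entropy (the helper name and example numbers here are mine, not from the paper):

```python
import numpy as np

def cross_entropy(p, y):
    """CE for a one-hot target y: -sum_i y_i * log(p_i) = -log(p_t)."""
    return -np.sum(y * np.log(p))

p = np.array([0.7, 0.2, 0.1])   # predicted class distribution
y = np.array([1.0, 0.0, 0.0])   # one-hot target (true class = 0)

# only the true-class term survives, so CE collapses to -log(p_t)
assert np.isclose(cross_entropy(p, y), -np.log(0.7))
```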

Focal loss:

$\mathrm{FL}(p_t) = -\alpha_t (1-p_t)^{\gamma} \log(p_t)$

where $p_t = p$ if $y = 1$ and $p_t = 1 - p$ otherwise, $\alpha_t$ is the parameter that balances positive and negative examples, and $(1-p_t)^{\gamma}$
is the modulating factor, which down-weights easy examples ($p_t \to 1$) and thus focuses
training on hard examples ($p_t \to 0$).
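A minimal numpy sketch of the binary (sigmoid-style) focal loss above, with the paper's defaults $\alpha = 0.25$, $\gamma = 2$; the `focal_loss` helper and the example probabilities are assumptions of mine, not from the paper:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Per-example binary focal loss.

    p: predicted foreground probability, y: 1 (foreground) or 0 (background).
    """
    p_t = np.where(y == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-balance weight
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# An easy negative (p near 0) is down-weighted by (1-p_t)^gamma,
# while a hard positive keeps almost its full cross-entropy loss.
easy_neg = focal_loss(np.array([0.01]), np.array([0]))
hard_pos = focal_loss(np.array([0.10]), np.array([1]))
assert easy_neg[0] < hard_pos[0]
```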

RetinaNet

RetinaNet.png

Class Imbalance and Model Initialization

sigmoid.png

For the final conv layer of the classification subnet, set the bias initialization to
$b = -\log((1-\pi)/\pi)$ and use a Gaussian weight fill with $\sigma = 0.01$.

Without this bias initialization, the predicted probability of every anchor is 0.5 at the start of training.
Because background anchors vastly outnumber foreground ones, their loss dominates, which makes the start of training unstable and prevents the focal loss from doing its job at the beginning.

Target Y is one-hot; P is the distribution calculated by the model.

  • With the default initialization, $\sigma(0) = 0.5$, so $p = 0.5$ for every anchor. For the dominant negatives the target is $y = 0$ while the prediction is $p = 0.5$, so the optimization directions of the huge negative set and the small positive set conflict.
    This causes the instability at the start of training.
  • With $b = -\log((1-\pi)/\pi)$, $\sigma(b) = \pi = 0.01$, so $p = 0.01$ for every anchor. This already matches the targets of the dominant negatives, so the direction of optimization is
    consistent, and more efficient, while hard positives still receive nearly full focal weight ( $(1-p_t)^{\gamma} \to 1$ ).
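A quick stdlib check that this bias value really makes the initial sigmoid output equal the prior $\pi$ (the variable names are mine):

```python
import math

pi = 0.01                        # prior foreground probability from the paper
b = -math.log((1 - pi) / pi)     # bias of the final classification conv layer

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# With weights near zero, each anchor's logit is ~b, so its
# initial foreground probability is sigma(b) = pi = 0.01.
assert abs(sigmoid(b) - pi) < 1e-9
```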

Comparison to Others

Comparison to SSD

Sort negatives by the highest confidence loss for each prior box and pick the top ones, so that
the ratio of negatives to positives is at most 3:1.
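SSD's selection rule above can be sketched as follows; this is my own minimal numpy version, not SSD's actual implementation:

```python
import numpy as np

def hard_negative_mining(losses, is_pos, neg_pos_ratio=3):
    """Keep all positives plus the k highest-loss negatives, k = 3 * #positives."""
    num_pos = int(is_pos.sum())
    # Mask out positives so they can never be picked as "hard negatives".
    neg_losses = np.where(is_pos, -np.inf, losses)
    k = min(neg_pos_ratio * num_pos, int((~is_pos).sum()))
    order = np.argsort(neg_losses)[::-1]            # negatives by descending loss
    keep = is_pos.copy()
    keep[order[:k]] = True                          # add the k hardest negatives
    return keep

losses = np.array([5.0, 1.0, 2.0, 3.0, 4.0])
is_pos = np.array([True, False, False, False, False])
# 1 positive -> keep the 3 hardest negatives (indices 4, 3, 2).
keep = hard_negative_mining(losses, is_pos)
```

Note that every example not selected here contributes nothing to the loss, which is exactly the contrast with focal loss drawn below.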

SSD only performs hard example mining on negatives, via sampling, to balance the classes, while focal loss
down-weights easy examples among both positives and negatives.

On the other hand, focal loss computes the loss over all positives and negatives, whereas SSD discards most examples through sampling.

Comparison to Faster RCNN

The same as SSD.

My thinking

Let the neural network do the hard data mining.