Focal Loss for Dense Object Detection
Why is the accuracy of the one-stage algorithm lower than that of the two-stage algorithm?
One of the reasons is the extreme foreground-background class imbalance during training.
Two stage algorithm
- When training the first stage, candidate boxes are sorted by their foreground score, which removes the vast majority of easy negatives.
- When training the second stage, biased sampling is used to construct mini-batches that contain roughly a 1:3 ratio of positive to negative examples.
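The biased sampling in the second stage can be sketched as follows. This is a hypothetical minimal helper (the function name, the `proposals` dict format, and the default batch size of 256 are my assumptions, not from the paper):

```python
import random

def sample_minibatch(proposals, batch_size=256, pos_fraction=0.25):
    """Biased sampling as used in two-stage detectors (sketch):
    cap positives at pos_fraction of the batch and fill the rest
    with randomly sampled negatives, giving roughly a 1:3
    positive-to-negative ratio when enough positives exist."""
    positives = [p for p in proposals if p["label"] == 1]
    negatives = [p for p in proposals if p["label"] == 0]
    num_pos = min(len(positives), int(batch_size * pos_fraction))
    num_neg = min(len(negatives), batch_size - num_pos)
    return random.sample(positives, num_pos) + random.sample(negatives, num_neg)
```

Because negatives are discarded down to the 3:1 cap, the easy-background flood never dominates the second-stage loss.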
Focal loss

Cross entropy (the target $y$ is one-hot):
$$\mathrm{CE} = -\sum_{i=1}^{N} y_i \log(p_i) = -\log(p_t)$$
Focal loss:
$$\mathrm{FL}(p_t) = -\alpha_t \left(1 - p_t\right)^{\gamma} \log(p_t)$$
where $\alpha_t$ is the parameter that balances positive and negative examples, and $(1-p_t)^\gamma$ is the adaptive part that down-weights easy examples ($p_t \to 1$, so $(1-p_t)^\gamma \to 0$) and thus focuses training on hard examples ($p_t$ small, so $(1-p_t)^\gamma \to 1$).
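The formula above can be written down directly for the binary (sigmoid) case. A minimal sketch, with the paper's default hyperparameters $\alpha = 0.25$, $\gamma = 2$:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a single prediction.
    p: predicted foreground probability; y: 1 = foreground, 0 = background."""
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

An easy positive ($p = 0.9$) is down-weighted by $(1-0.9)^2 = 0.01$, while a hard positive ($p = 0.1$) keeps almost its full cross-entropy weight, so hard examples dominate the total loss.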
RetinaNet

Class Imbalance and Model Initialization

For the final conv layer of the classification subnet, set the bias initialization to $b = -\log((1-\pi)/\pi)$ with $\pi = 0.01$, and initialize the weights with a Gaussian fill ($\sigma = 0.01$).
Without this bias initialization, the predicted foreground probability of every anchor starts at 0.5. This causes instability at the start of training, and the focal loss does not work as intended at the beginning.
The target $y$ is one-hot; $p$ is the distribution calculated by the model.
- Without the bias init: $\sigma(0) = 0.5$, so every anchor initially predicts foreground with probability 0.5. The optimization direction is opposite to the background-dominated targets, the huge number of negatives contributes a large loss that the modulating factor barely suppresses ($(1-p_t)^\gamma = 0.5^\gamma$), and this causes instability at the start of training.
- With the bias init: $\sigma(-\log((1-\pi)/\pi)) = \pi = 0.01$, so the initial prediction matches the prior that positives are rare. The direction of optimization is consistent, and more efficient: easy negatives ($p_t \approx 0.99$) are strongly down-weighted, while for the rare positives $(1-p_t)^\gamma \to 1$, so they dominate the gradient.
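The effect of the bias initialization is easy to verify numerically. A small sketch (stdlib only; the `sigmoid` helper is mine):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

pi = 0.01                               # prior probability of foreground
b = -math.log((1.0 - pi) / pi)          # bias of the final classification conv

# With this bias, the initial foreground probability equals the prior pi,
# so the overwhelming negatives start with p_t ~ 0.99 and are down-weighted.
p_fg = sigmoid(b)                        # ~ 0.01
neg_factor = (1.0 - (1.0 - p_fg)) ** 2  # modulating factor for a negative, gamma = 2
```

With the default zero bias, the same factor would be $0.5^2 = 0.25$, i.e. 2500x larger per negative anchor.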
Comparison to Others
Comparison to SSD
Sort the negatives by the highest confidence loss for each prior box and pick the top ones, so that the ratio between negatives and positives is at most 3:1.
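SSD's hard negative mining step can be sketched in a few lines (a hypothetical helper operating on per-box loss values; the real implementation works on tensors):

```python
def hard_negative_mining(neg_losses, num_pos, neg_pos_ratio=3):
    """SSD-style hard negative mining (sketch): keep only the
    highest-loss negatives, at most neg_pos_ratio per positive;
    all other negatives are dropped from the loss entirely."""
    k = min(len(neg_losses), neg_pos_ratio * num_pos)
    return sorted(neg_losses, reverse=True)[:k]
```

Note the contrast with focal loss: here low-loss negatives contribute exactly zero, whereas focal loss keeps them with a small but nonzero weight.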
SSD only performs hard example mining on negatives, sampling them to balance the classes, while focal loss performs hard example mining on both positives and negatives.
On the other hand, focal loss computes the loss using all positives and negatives, whereas SSD does not, because of the sampling.
Comparison to Faster RCNN
The same as SSD.
My thinking
Let the loss function do the hard example mining, instead of hand-designed sampling heuristics.
