FCOS: Fully Convolutional One-Stage Object Detection

ICCV2019

Introduction
- Anchor-Based or Anchor-Free?
- Why use FCN?
Architecture
My thinking
Introduction

Anchor-Based or Anchor-Free?

Drawbacks of anchor-based detectors

- Detection performance is sensitive to the sizes, aspect ratios, and number of anchor boxes.
  These hyper-parameters need to be carefully tuned in anchor-based detectors.
- Because these hyper-parameters are kept fixed, detectors have difficulty handling object
  candidates with large shape variations, particularly small objects. Fixed anchors also
  hamper the generalization ability of detectors, as they need to be re-designed for
  new tasks.
- To achieve a high recall rate, an anchor-based detector needs to place anchor boxes densely
  on the input image. (The essence of anchors is dense sampling.) This causes an imbalance between
  positives and negatives, because most of the samples are easy negatives.
- Anchor boxes also involve complicated computation, such as IoU with the ground-truth boxes.

Why use FCN?

An FCN does not lose spatial information.

It lets object detection be solved in a neat per-pixel prediction fashion, like other dense prediction tasks such as semantic segmentation.

Architecture

Mapping Strategy

The ground-truth bounding boxes are defined as $\{B_i\}$, where $B_i=\left(x_0^{(i)}, y_0^{(i)}, x_1^{(i)}, y_1^{(i)}, c^{(i)}\right)$.
Here $\left(x_0^{(i)}, y_0^{(i)}\right)$ and $\left(x_1^{(i)}, y_1^{(i)}\right)$ denote the
coordinates of the left-top and right-bottom corners of the box, and $c^{(i)}$ is the class of the object in it.

Each location $(x, y)$ on the feature map can be mapped back onto the input image as
$\left(\left\lfloor\frac{s}{2}\right\rfloor + xs, \left\lfloor\frac{s}{2}\right\rfloor + ys\right)$, where $s$ is the total stride up to that feature map.
This point is near the center of the receptive field of the location $(x, y)$.
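
The mapping above is simple integer arithmetic. A minimal sketch follows; the function name `feature_to_image` is hypothetical and not from the paper.

```python
def feature_to_image(x, y, stride):
    """Map feature-map location (x, y) to input-image coordinates.

    Implements (floor(s/2) + x*s, floor(s/2) + y*s) for an integer stride s.
    """
    offset = stride // 2  # floor(s / 2)
    return (offset + x * stride, offset + y * stride)


# Location (3, 5) on a stride-8 feature map.
print(feature_to_image(3, 5, 8))  # -> (28, 44)
```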

The location $(x, y)$ is considered a positive sample if it falls into any ground-truth box, and
the class label $c^*$ of the location is then the class label of that box. Otherwise it is a negative sample
and $c^* = 0$ (background class).

If a location falls into multiple bounding boxes, it is considered an ambiguous sample, and the
bounding box with the minimal area is chosen as its regression target.

Besides the label for classification, each location also has a 4D real vector $\boldsymbol{t}^*=\left(l^*, t^*, r^*, b^*\right)$
as its regression targets. If the location is associated with the bounding box $B_i$, the targets are:

$$
\begin{aligned}
l^* &= x - x_0^{(i)}, \quad t^* = y - y_0^{(i)}, \\
r^* &= x_1^{(i)} - x, \quad b^* = y_1^{(i)} - y.
\end{aligned}
$$
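
The assignment rules and regression targets described above can be sketched in plain Python. This is an illustration, not the authors' code: the function name `regression_target` is hypothetical, and treating points exactly on a box edge as outside the box is an assumption.

```python
def regression_target(px, py, boxes):
    """Return (class_label, (l, t, r, b)) for the image-space point (px, py).

    boxes: list of (x0, y0, x1, y1, c) with class labels c >= 1.
    Returns (0, None) for a negative (background) location.
    """
    best = None
    best_area = float("inf")
    for x0, y0, x1, y1, c in boxes:
        # Distances from the point to the four sides of the box.
        l, t, r, b = px - x0, py - y0, x1 - px, y1 - py
        if min(l, t, r, b) <= 0:
            continue  # assumption: points on or outside the border are not positive
        area = (x1 - x0) * (y1 - y0)
        if area < best_area:  # ambiguous sample: keep the smallest box
            best_area = area
            best = (c, (l, t, r, b))
    return best if best is not None else (0, None)


boxes = [(0, 0, 100, 100, 1), (20, 20, 60, 60, 2)]
print(regression_target(30, 40, boxes))    # inside both boxes, smaller one wins
print(regression_target(150, 150, boxes))  # outside every box: background
```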

Compared to YOLOv1, FCOS takes advantage of all points in a ground-truth bounding box to predict bounding boxes,
and the low-quality detected bounding boxes are suppressed by the “center-ness” branch.

YOLOv1 matches a grid cell to a ground-truth target only when the center of the ground-truth box falls into that cell.

Network Output

The final layer predicts an 80D vector $\boldsymbol{p}$ of classification labels and a 4D vector
$\boldsymbol{t}=(l, t, r, b)$ of bounding box coordinates. Moreover, since the regression targets are
always positive, the model employs $\exp(x)$ to map any real number to $(0, \infty)$.

Why use FPN?

- The large stride of the final feature maps in a CNN can result in a relatively low best possible
  recall (BPR).
- Overlaps between ground-truth boxes can cause intractable ambiguity, which FPN largely resolves.

The paper directly limits the range of bounding box regression for each feature level. See the paper for
more details.
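
As a rough sketch of that per-level range limit: the paper assigns a location to a feature level according to the largest of its four regression targets. The thresholds below (0, 64, 128, 256, 512, ∞ for levels P3 to P7) are the values reported in the FCOS paper. The helper name `assign_level` and the handling of values that fall exactly on a threshold are assumptions made for illustration.

```python
# Regression ranges per FPN level, in input-image pixels.
LEVEL_RANGES = {
    "P3": (0, 64),
    "P4": (64, 128),
    "P5": (128, 256),
    "P6": (256, 512),
    "P7": (512, float("inf")),
}


def assign_level(l, t, r, b):
    """Pick the FPN level whose range contains max(l, t, r, b)."""
    m = max(l, t, r, b)
    for name, (lo, hi) in LEVEL_RANGES.items():
        if lo <= m <= hi:  # assumption: a tie on a threshold goes to the lower level
            return name
    return None


print(assign_level(10, 20, 30, 20))   # small object  -> P3
print(assign_level(100, 50, 20, 30))  # medium object -> P4
```

With this rule, two overlapping boxes of very different sizes are usually handled on different levels, which removes most of the ambiguity.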

trick: replace $\exp(x)$ with $\exp(s_i x)$, where $s_i$ is a trainable scalar for feature level $P_i$.
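
A small sketch of how the exponential mapping turns raw network outputs into a box. The function name `decode_box` is hypothetical, and `scale` stands in for $s_i$: in the paper it is a trainable scalar per level, while here it is a fixed float for illustration only.

```python
import math


def decode_box(px, py, raw, scale=1.0):
    """Decode raw outputs (any real numbers) into a box (x0, y0, x1, y1).

    exp(scale * v) is always positive, so the decoded box always
    contains the point (px, py).
    """
    l, t, r, b = (math.exp(scale * v) for v in raw)
    return (px - l, py - t, px + r, py + b)


print(decode_box(50, 50, (0.0, 0.0, 0.0, 0.0)))  # -> (49.0, 49.0, 51.0, 51.0)
print(decode_box(50, 50, (-2.0, 1.0, 3.0, -0.5), scale=1.2))
```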

Why use Center-ness?

It is observed that a lot of low-quality predicted bounding boxes are produced by locations far from the
center of an object.

When testing, the final score (used for ranking the detected bounding boxes) is computed by
multiplying the predicted center-ness with the corresponding classification score.

Thus the center-ness can down-weight the scores of bounding boxes far from the center of an object.

As a result, these low-quality bounding boxes are likely to be filtered out by
the final non-maximum suppression (NMS) process, which improves detection performance remarkably.
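
To make the down-weighting concrete, here is a toy ranking example. The two detections and their scores are made-up values, and `final_score` is a hypothetical helper name.

```python
def final_score(cls_score, centerness):
    """Score used for ranking at test time: classification score x center-ness."""
    return cls_score * centerness


# Hypothetical detections: the second has a higher class score but was
# predicted from a location far from the object center.
detections = [
    {"name": "near_center", "cls": 0.80, "ctr": 0.90},
    {"name": "far_from_center", "cls": 0.85, "ctr": 0.30},
]

ranked = sorted(
    detections,
    key=lambda d: final_score(d["cls"], d["ctr"]),
    reverse=True,
)
print([d["name"] for d in ranked])  # -> ['near_center', 'far_from_center']
```

Ranked by classification score alone, the off-center box would come first; after multiplying by center-ness it drops below the well-centered one, so NMS is more likely to discard it.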

The center-ness target (the square root covers the whole product):

$$
\text{centerness}^* = \sqrt{\frac{\min\left(l^*, r^*\right)}{\max\left(l^*, r^*\right)} \times \frac{\min\left(t^*, b^*\right)}{\max\left(t^*, b^*\right)}}
$$
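
A direct transcription of the center-ness target into Python, as a sketch. It assumes all four distances are positive, which holds for positive samples.

```python
import math


def centerness(l, t, r, b):
    """Center-ness target in (0, 1]; equals 1.0 at the exact box center."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


print(centerness(50, 50, 50, 50))  # box center -> 1.0
print(centerness(10, 20, 30, 20))  # off-center horizontally, below 1
```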

My thinking

- A pixel can be seen as a special anchor box whose width and height are both 0 (the anchor shrinks to a pixel).
  This special anchor box can also stand in for an anchor box of any shape, so there is no need to set hyper-parameters such as scale and aspect ratio.
- In dense object detection, such as crowd detection, the recall rate may be reduced,
  because there is only one such special box per pixel.
- Compared to YOLOv1, the mapping strategy is different: the anchor-free approach takes advantage of all pixels in a ground-truth bounding box, not only the cell that contains its center.
  In turn, this mapping strategy causes the low-quality bounding box problem, which the paper solves with center-ness.
