FCOS: Fully Convolutional One-Stage Object Detection

ICCV 2019

Anchor-Based or Anchor-Free?

Drawbacks of anchor-based detectors

  • Detection performance is sensitive to the sizes, aspect ratios, and number of anchor boxes.
    These hyper-parameters need to be carefully tuned in anchor-based detectors.
  • Because these hyper-parameters are kept fixed, detectors have difficulty handling object
    candidates with large shape variations, particularly small objects. Fixed anchors also
    hamper the generalization ability of detectors, as they need to be re-designed for
    new tasks.
  • To achieve a high recall rate, an anchor-based detector needs to place anchor boxes densely
    on the input image. (The essence of anchors is dense sampling.) This causes an imbalance
    between positives and negatives, because most of the samples are easy negatives.
  • Anchor boxes also involve complicated computation, such as IoU with the ground-truth boxes.

Why use FCN?

An FCN does not lose spatial information.

It lets object detection be solved in a neat per-pixel prediction fashion.

Architecture

[Figure: fcos.png]

Mapping Strategy

The ground-truth bounding boxes are defined as $\{B_i\}$, where
$B_i = \left(x_0^{(i)}, y_0^{(i)}, x_1^{(i)}, y_1^{(i)}, c^{(i)}\right)$.
Here $\left(x_0^{(i)}, y_0^{(i)}\right)$ and $\left(x_1^{(i)}, y_1^{(i)}\right)$ denote the
coordinates of the left-top and right-bottom corners of the box, and $c^{(i)}$ is its class.

Each location $(x, y)$ on the feature map can be mapped back onto the input image as
$\left(\left\lfloor\frac{s}{2}\right\rfloor + xs, \left\lfloor\frac{s}{2}\right\rfloor + ys\right)$,
where $s$ is the total stride up to that feature map. This point is near the center of the
receptive field of the location $(x, y)$.
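The mapping above can be sketched in a few lines of Python. This is only an illustration: the function name `location_to_image` is hypothetical, and integer feature-map indices and a single stride for both axes are assumed.

```python
from typing import Tuple


def location_to_image(x: int, y: int, s: int) -> Tuple[int, int]:
    """Map feature-map location (x, y) to input-image coordinates.

    Implements (floor(s / 2) + x * s, floor(s / 2) + y * s).
    """
    half = s // 2  # floor(s / 2)
    return half + x * s, half + y * s


# With stride 8, location (3, 2) maps to (4 + 24, 4 + 16) = (28, 20).
print(location_to_image(3, 2, 8))
```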

The location $(x, y)$ is considered a positive sample if it falls into any ground-truth box, and
the class label $c^{*}$ of the location is the class label of that box. Otherwise it is a negative
sample and $c^{*} = 0$ (background).

If a location falls into multiple bounding boxes, it is considered an ambiguous sample, and the
bounding box with the minimal area is chosen as its regression target.

Besides the label for classification, each location also has a 4D real vector
$\boldsymbol{t}^{*} = \left(l^{*}, t^{*}, r^{*}, b^{*}\right)$ as its regression targets:

$$
\begin{aligned}
l^{*} &= x - x_0^{(i)}, \quad t^{*} = y - y_0^{(i)}, \\
r^{*} &= x_1^{(i)} - x, \quad b^{*} = y_1^{(i)} - y.
\end{aligned}
$$
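The labeling rules of this section (positive if inside a box, smallest box wins on ambiguity, distances to the four sides as regression targets) can be put together as a rough sketch. The name `assign_target` and the `Box` tuple layout are hypothetical, and treating a point exactly on a box edge as outside is an assumption of this sketch.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float, int]  # (x0, y0, x1, y1, c), hypothetical layout
Target = Tuple[int, Optional[Tuple[float, float, float, float]]]


def assign_target(x: float, y: float, boxes: List[Box]) -> Target:
    """Return (c_star, (l, t, r, b)) for image location (x, y).

    Negative locations get c_star = 0 and no regression target.
    """
    best: Optional[Target] = None
    best_area = float("inf")
    for x0, y0, x1, y1, c in boxes:
        l, t, r, b = x - x0, y - y0, x1 - x, y1 - y
        if min(l, t, r, b) <= 0:
            continue  # outside this box (edge counted as outside: assumption)
        area = (x1 - x0) * (y1 - y0)
        if area < best_area:  # ambiguous sample: keep the smallest box
            best_area = area
            best = (c, (l, t, r, b))
    return best if best is not None else (0, None)


boxes = [(0, 0, 100, 100, 1), (40, 40, 60, 60, 2)]
print(assign_target(50, 50, boxes))    # inside both boxes, the smaller one is chosen
print(assign_target(200, 200, boxes))  # outside every box, negative sample
```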

Compared to YOLOv1, FCOS takes advantage of all points inside a ground-truth bounding box to predict
bounding boxes, and the low-quality detected boxes are suppressed by the "center-ness" branch.

YOLOv1 matches a grid cell to a ground-truth target by checking whether the center of the
ground-truth box falls in that cell.

[Figure: fcos_box.png]

Network Output

The final layer predicts an 80D vector $\boldsymbol{p}$ of classification labels (one entry per
MS-COCO category) and a 4D vector $\boldsymbol{t} = (l, t, r, b)$ of bounding box coordinates.
Moreover, since the regression targets are always positive, the model employs $\exp(x)$ to map any
real number to $(0, \infty)$.

Why use FPN?

  • The large stride of the final feature maps in a CNN can result in a relatively low best possible
    recall (BPR).
  • Overlaps in ground-truth boxes can cause intractable ambiguity. FPN largely resolves it, because
    overlapping objects usually differ in size and are assigned to different feature levels.

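The paper directly limits the range of bounding box regression for each feature level: a location
on level $i$ is treated as a negative sample if $\max(l^{*}, t^{*}, r^{*}, b^{*})$ falls outside
the range $[m_{i-1}, m_i]$ assigned to that level. Read the paper for more details.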
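A rough sketch of this per-level range limit is given below. The level names and thresholds (0, 64, 128, 256, 512, infinity for P3 to P7) are the values I recall from the paper and should be treated as assumptions to check against it; `LEVEL_RANGES` and `levels_for_target` are hypothetical names.

```python
import math
from typing import List

# (lower, upper) regression range per feature level.
# Assumed values: verify against the paper before relying on them.
LEVEL_RANGES = {
    "P3": (0, 64),
    "P4": (64, 128),
    "P5": (128, 256),
    "P6": (256, 512),
    "P7": (512, math.inf),
}


def levels_for_target(l: float, t: float, r: float, b: float) -> List[str]:
    """Return the levels on which a target with these side distances stays positive.

    A level keeps the sample only if max(l, t, r, b) lies inside its range;
    on every other level the location is treated as negative.
    """
    m = max(l, t, r, b)
    return [name for name, (lo, hi) in LEVEL_RANGES.items() if lo <= m <= hi]


print(levels_for_target(10, 10, 10, 10))  # small object, finest level
print(levels_for_target(600, 5, 5, 5))    # very large object, coarsest level
```

A value exactly on a threshold matches two adjacent levels in this sketch, because both range ends are inclusive.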

Trick: replace $\exp(x)$ with $\exp(s_i x)$, where $s_i$ is a trainable scalar for feature level
$P_i$, because the heads are shared across levels that regress different size ranges.

Why use Center-ness?

It is observed that a lot of low-quality predicted bounding boxes are produced by locations far
from the center of an object.

When testing, the final score (used for ranking the detected bounding boxes) is computed by
multiplying the predicted center-ness with the corresponding classification score.

Thus the center-ness can down-weight the scores of bounding boxes far from the center of an object.

As a result, with high probability, these low-quality bounding boxes might be filtered out by
the final non-maximum suppression (NMS) process, improving the detection performance remarkably.
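The effect of this re-scoring can be seen in a toy example. The detection names and numbers below are made up purely for illustration, and `final_score` is a hypothetical helper.

```python
def final_score(cls_score: float, centerness: float) -> float:
    """Ranking score used at test time: classification score times center-ness."""
    return cls_score * centerness


# (name, classification score, predicted center-ness): illustrative values only
detections = [
    ("far_from_center", 0.85, 0.20),
    ("near_center", 0.80, 0.90),
]

# Sort by final score, highest first, as NMS would consume them.
ranked = sorted(detections, key=lambda d: final_score(d[1], d[2]), reverse=True)
for name, cls_score, ctr in ranked:
    print(name, round(final_score(cls_score, ctr), 3))
```

Although the far-from-center box has the higher classification score, it ranks below the near-center box once center-ness is multiplied in.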

The centerness target:

$$
\text{centerness}^{*} = \sqrt{\frac{\min\left(l^{*}, r^{*}\right)}{\max\left(l^{*}, r^{*}\right)} \times \frac{\min\left(t^{*}, b^{*}\right)}{\max\left(t^{*}, b^{*}\right)}}
$$

[Figure: center-ness.png]
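The center-ness target can be computed directly from the four regression targets. The sketch below assumes a positive location, so all four distances are strictly greater than zero; `centerness_target` is a hypothetical name.

```python
import math


def centerness_target(l: float, t: float, r: float, b: float) -> float:
    """Center-ness target: 1 at the box center, decaying towards 0 near the edges.

    Assumes l, t, r, b > 0 (a location strictly inside the box).
    """
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


print(centerness_target(10, 10, 10, 10))  # box center
print(centerness_target(1, 5, 9, 5))      # close to the left edge
```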

My thoughts

  • A pixel can be seen as a special anchor box whose width and height are both 0.
    Such an anchor can also stand in for an anchor box of any shape, so there is no need to set
    hyper-parameters such as scale and aspect ratio.
  • In dense object detection, such as crowd detection, the recall rate may drop,
    because each pixel yields only one box.
  • Compared to YOLOv1, the mapping strategy differs: the anchor-free approach takes advantage of
    all pixels inside a ground-truth bounding box.
    In turn, this mapping strategy causes the low-quality bounding box problem, which the paper
    solves with center-ness.