Feature Pyramid Networks for Object Detection

Principles

Combines low-resolution, semantically strong features with high-resolution, semantically weak features
via a top-down pathway and lateral connections.
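One top-down merge step can be sketched in NumPy: upsample the coarser map 2x, project the lateral map to the same channel depth with a 1x1 convolution, and add them element-wise. This is a minimal sketch under assumed shapes; the random lateral weights are placeholders, not trained parameters from the paper.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling over the spatial dims of (C, H, W).
    return x.repeat(2, axis=1).repeat(2, axis=2)

def lateral_1x1(x, w):
    # A 1x1 convolution is a per-pixel linear map over channels.
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)
    return np.einsum('oc,chw->ohw', w, x)

rng = np.random.default_rng(0)
c5 = rng.standard_normal((256, 8, 8))     # coarse, semantically strong
c4 = rng.standard_normal((512, 16, 16))   # finer, semantically weaker
w_lat = rng.standard_normal((256, 512)) * 0.01  # placeholder lateral weights

# Element-wise merge of the top-down and lateral signals.
p4 = upsample2x(c5) + lateral_1x1(c4, w_lat)
print(p4.shape)  # (256, 16, 16)
```

Repeating this step down the backbone yields the full pyramid {P5, P4, P3, P2}, each level carrying high-level semantics at its own resolution.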

High-level features are highly abstract but have lost spatial location information.

Low-level features lack strong representational capacity.

Goal: improve robustness to scale variation.

Applicable situation: Small Object Detection

Not applicable situation: tasks that depend heavily on semantics, because merging with low-level
features dilutes the semantic information.

Architecture Detail

fpn1.png

fpn2.png

Why use 3x3 convolution?

To reduce the aliasing effect of upsampling.

The upsampled features are not spatially smooth (nearest-neighbor upsampling duplicates values in blocks), so a 3x3 convolution is applied after merging to filter and smooth the map.
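The smoothing effect can be illustrated with a toy example: after 2x nearest-neighbor upsampling the map consists of blocky 2x2 plateaus, and a 3x3 filter blends them into their neighbors. A plain box filter stands in here for the learned 3x3 convolution; this is an illustration, not the paper's actual layer.

```python
import numpy as np

def upsample2x_2d(x):
    # Nearest-neighbor 2x upsampling of a single-channel 2D map.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def conv3x3_box(x):
    # 3x3 box filter with edge padding; a stand-in for FPN's learned 3x3 conv.
    p = np.pad(x, 1, mode='edge')
    out = np.zeros_like(x, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += p[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / 9.0

x = np.arange(16, dtype=float).reshape(4, 4)
up = upsample2x_2d(x)      # blocky: each value duplicated into a 2x2 plateau
smooth = conv3x3_box(up)   # plateaus blended into neighboring values
print(np.unique(up).size, np.unique(smooth).size)
```

The smoothed map has many more distinct values than the blocky upsampled one, i.e. the hard plateau edges introduced by upsampling have been softened.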

Comparison with previous work

fpn3.png

Comparison with Image Pyramid

The feature pyramid does not sacrifice speed or memory, whereas an image pyramid requires running the network once per input scale.

Comparison with Single feature map

The feature pyramid is more robust to variation in scale.

Comparison with SSD

The feature pyramid also reuses lower (higher-resolution) feature maps, and its top-down pathway gives strong semantics at all scales; SSD builds its pyramid only from upper layers and forgoes the higher-resolution maps.

My thinking

Is there a more efficient way to combine the features than element-wise addition?