Feature Pyramid Networks for Object Detection
Principles
Combines low-resolution, semantically strong features with high-resolution, semantically weak features
via a top-down pathway and lateral connections.
High-level features are highly abstract but lose precise location information.
Low-level features keep location detail but lack strong representational capacity.
Goal: Improve robustness to scale variation
Applicable situation: Small object detection
Not applicable situation: Tasks that depend heavily on high-level semantics (because mixing in low-level
features dilutes the semantic information)
Architecture Detail


Why use a 3x3 convolution?
To reduce the aliasing effect of upsampling.
The upsampled features are not spatially smooth, so a 3x3 convolution is applied to the merged map for smoothing.
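The top-down merge step above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: it assumes nearest-neighbor 2x upsampling, that all backbone maps already share the same channel count `d`, and hypothetical helper names (`fpn_top_down`, `lat_ws`, `smooth_ws`).

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of an (H, W, C) map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def conv1x1(x, w):
    """1x1 convolution: a per-pixel linear map across channels. w: (Cin, Cout)."""
    return x @ w

def conv3x3(x, w):
    """3x3 convolution, zero padding, stride 1. w: (3, 3, Cin, Cout)."""
    h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += xp[i:i + h, j:j + wd] @ w[i, j]
    return out

def fpn_top_down(backbone_feats, lat_ws, smooth_ws):
    """Build pyramid maps P_l from backbone maps C_l (ordered finest to coarsest).

    Each step: upsample the coarser map, add the 1x1-projected lateral map,
    then smooth the sum with a 3x3 convolution (to reduce upsampling aliasing).
    """
    p = conv1x1(backbone_feats[-1], lat_ws[-1])       # start from the coarsest level
    pyramid = [conv3x3(p, smooth_ws[-1])]
    for c, lw, sw in zip(backbone_feats[-2::-1], lat_ws[-2::-1], smooth_ws[-2::-1]):
        p = upsample2x(p) + conv1x1(c, lw)            # top-down path + lateral connection
        pyramid.append(conv3x3(p, sw))                # 3x3 smoothing
    return pyramid[::-1]                              # return finest first, like the inputs
```

Each output map P_l has the same spatial size as its input C_l, so every pyramid level carries both high resolution and the semantics propagated down from the top.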
Comparison with previous work

Comparison with Image Pyramid
The feature pyramid does not sacrifice speed or memory, since it reuses feature maps the backbone already computes.
Comparison with Single feature map
The feature pyramid is more robust to variation in object scale.
Comparison with SSD
The feature pyramid also uses the lower, higher-resolution feature maps, and the top-down pathway gives it strong semantics at all scales.
My thinking
Is there a more efficient way to combine the feature maps?
