EfficientDet: Scalable and Efficient Object Detection
CVPR 2020
- Introduction
- BiFPN
- EfficientDet
Introduction
Trade off accuracy and efficiency
The large model sizes and expensive computation costs deter their deployment in many real-world
applications such as robotics and self-driving cars where model size and latency are highly constrained.
Is it possible to build a scalable detection architecture with both higher accuracy and better
efficiency across a wide spectrum of resource constraints (e.g., from 3B to 300B FLOPs)?
This paper aims to tackle this problem by systematically studying various design choices of detector architectures.
BiFPN
BiFPN: efficient bidirectional cross-scale connections and weighted feature fusion

Why design BiFPN?
- Conventional top-down FPN is inherently limited by the one-way information flow.
- NAS-FPN employs neural architecture search to find a better cross-scale feature network topology, but the search requires thousands of GPU hours, and the resulting network is irregular and difficult to interpret or modify.
- PAFPN achieves better accuracy than FPN and NAS-FPN, but with the cost of more parameters and computations.
BiFPN is improved on the basis of PAFPN
- Removing nodes that have only one input edge.
If a node has only one input edge and no feature fusion, it contributes little to a feature network whose purpose is fusing different features.
- Adding an extra edge from the original input to the output node when they are at the same level, in order to fuse more features without adding much cost (a shortcut?).
- Unlike PANet, which has only one top-down and one bottom-up path, this paper treats each bidirectional (top-down & bottom-up) path as one feature network layer and repeats the same layer multiple times to enable more high-level feature fusion.
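The bidirectional layer described above can be sketched as follows. This is a minimal structural sketch only: plain averaging stands in for the paper's learned weighted fusion and convolutions, `resize` is a crude nearest-neighbor stand-in for up/down-sampling, and all function names are mine, not the paper's code:

```python
import numpy as np

def resize(x, shape):
    """Stand-in for up/down-sampling: crude nearest-neighbor to `shape`."""
    r = np.zeros(shape)
    for i in range(shape[0]):
        for j in range(shape[1]):
            r[i, j] = x[i * x.shape[0] // shape[0], j * x.shape[1] // shape[1]]
    return r

def bifpn_layer(p):
    """One bidirectional layer over a dict {level: feature map}, e.g. P3..P7."""
    levels = sorted(p)
    # Top-down path: each intermediate node fuses its input with the
    # (resized) intermediate feature from the coarser level above.
    td = {levels[-1]: p[levels[-1]]}
    for l in reversed(levels[:-1]):
        td[l] = (p[l] + resize(td[l + 1], p[l].shape)) / 2
    # Bottom-up path: each output fuses the original input (the extra
    # same-level edge), the intermediate node, and the finer output below.
    out = {levels[0]: td[levels[0]]}  # bottom node: single-input node pruned
    for l in levels[1:]:
        out[l] = (p[l] + td[l] + resize(out[l - 1], p[l].shape)) / 3
    return out
```

Stacking this layer several times (feeding `out` back in as `p`) gives the repeated-layer design described in the last bullet.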
Weighted Feature Fusion
Different input features at different resolutions usually contribute to the output feature unequally.
To address this issue, the paper proposes adding a learnable weight for each input, letting the
network learn the importance of each input feature.
- Unbounded fusion: $O=\sum_{i} w_{i} \cdot I_{i}$, where $w_{i}$ is a learnable scalar weight.
Since the weights are unbounded, they can potentially cause training instability.
- Softmax-based fusion: $O=\sum_{i} \frac{e^{w_{i}}}{\sum_{j} e^{w_{j}}} \cdot I_{i}$
An intuitive idea is to apply a softmax over the weights, so that all weights are normalized to probabilities in the range 0 to 1. However, the extra softmax leads to a significant slowdown on GPU hardware.
- Fast normalized fusion: $O=\sum_{i} \frac{w_{i}}{\epsilon+\sum_{j} w_{j}} \cdot I_{i}$
Here $w_{i} \geq 0$ is ensured by applying a ReLU after each $w_{i}$, and $\epsilon=0.0001$ is a small value to avoid numerical instability.
Fusion example (node at level 6 of BiFPN):

$$
\begin{aligned}
P_{6}^{td} &= \operatorname{Conv}\left(\frac{w_{1} \cdot P_{6}^{in}+w_{2} \cdot \operatorname{Resize}\left(P_{7}^{in}\right)}{w_{1}+w_{2}+\epsilon}\right) \\
P_{6}^{out} &= \operatorname{Conv}\left(\frac{w_{1}^{\prime} \cdot P_{6}^{in}+w_{2}^{\prime} \cdot P_{6}^{td}+w_{3}^{\prime} \cdot \operatorname{Resize}\left(P_{5}^{out}\right)}{w_{1}^{\prime}+w_{2}^{\prime}+w_{3}^{\prime}+\epsilon}\right)
\end{aligned}
$$
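Fast normalized fusion is simple enough to sketch directly. A minimal NumPy version (function name and shapes are my own; inputs are assumed to be already resized to a common shape):

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature maps with learnable non-negative weights.

    features: list of arrays with identical shapes (already resized).
    weights:  raw learnable scalars; a ReLU keeps them non-negative,
              so normalization needs no softmax.
    """
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)  # ReLU: w_i >= 0
    norm = w / (w.sum() + eps)                             # weights end up in ~[0, 1]
    return sum(wi * f for wi, f in zip(norm, features))
```

With equal weights this reduces to (almost) a plain average, and a negative raw weight is clipped to zero, so its input is simply ignored.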
EfficientDet
This paper employs ImageNet-pretrained EfficientNets as the backbone network.

Compound Scaling
Previous works mostly scale up a baseline detector by employing a bigger backbone network, using
larger input images, or stacking more FPN layers. These methods are usually ineffective since they
focus on only a single or limited scaling dimension.
This paper proposes a new compound scaling method for object detection, which uses a simple compound coefficient to jointly scale up all dimensions of the backbone network, BiFPN network, class/box network, and resolution.
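As a sketch, the scaling rules driven by the compound coefficient $\phi$ look roughly like the following (the helper name is mine; the formulas follow the paper's scaling equations as I recall them, and the largest configurations in the paper deviate slightly from these nominal values, e.g. channel widths are rounded and depths are capped):

```python
def efficientdet_scaling(phi):
    """Nominal EfficientDet-D{phi} dimensions from compound coefficient phi.

    Hypothetical helper; returns (BiFPN width, BiFPN depth,
    class/box head depth, input resolution).
    """
    w_bifpn = round(64 * (1.35 ** phi))  # BiFPN channel width grows geometrically
    d_bifpn = 3 + phi                    # number of stacked BiFPN layers
    d_head = 3 + phi // 3                # depth of class/box prediction nets
    r_input = 512 + phi * 128            # input image resolution
    return w_bifpn, d_bifpn, d_head, r_input
```

For example, $\phi = 0$ gives the D0 baseline (64-channel BiFPN, 3 layers, 512x512 input), and increasing $\phi$ scales every dimension together rather than one at a time.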
EfficientDet can be regarded as the extension of EfficientNet in the field of object detection.
Read the paper for more detail.
My thinking
- If we want to determine whether a parameter is doing useful work, observing how it evolves over the course of training is a good way to find out.
