EfficientDet: Scalable and Efficient Object Detection

CVPR 2020

  • Introduction
  • BiFPN
  • EfficientDet
  • My thinking

    Introduction

  • Trading off accuracy and efficiency
    Large model sizes and expensive computation costs deter deployment in many real-world
    applications, such as robotics and self-driving cars, where model size and latency are highly constrained.
    Is it possible to build a scalable detection architecture with both higher accuracy and better
    efficiency across a wide spectrum of resource constraints (e.g., from 3B to 300B FLOPs)?
    This paper tackles this question by systematically studying the design choices of detector architectures.

BiFPN

BiFPN: efficient bidirectional cross-scale connections and weighted feature fusion

BiFPN.png

Why design BiFPN?

  • Conventional top-down FPN is inherently limited by the one-way information flow.
  • NAS-FPN employs neural architecture search to find a better cross-scale feature network topology, but the search requires thousands of GPU hours, and the resulting network is irregular and difficult to interpret or modify.
  • PAFPN (from PANet) achieves better accuracy than FPN and NAS-FPN, but at the cost of more parameters and computation.

BiFPN improves on PAFPN as follows:

  • Remove nodes that have only one input edge.
    If a node has only one input edge and no feature fusion, it contributes little to a
    feature network whose goal is to fuse different features.
  • Add an extra edge from the original input to the output node when they are at the same level, in order
    to fuse more features without adding much cost (similar to a skip connection).
  • Unlike PANet, which has only one top-down and one bottom-up path, this paper treats each bidirectional (top-down and bottom-up) path as one feature network layer, and repeats that layer multiple times to enable more high-level feature fusion.
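The node rules and repeated bidirectional layers above can be sketched in a few lines. This is a minimal illustration, not the real implementation: fusion here is plain averaging over 1-D stand-in features, whereas the actual BiFPN uses learned fusion weights and depthwise-separable convolutions on 2-D feature maps.

```python
import numpy as np

def resize_to(x, length):
    # Nearest-neighbor resize of a 1-D feature (stand-in for the
    # up/down-sampling between pyramid levels).
    idx = (np.arange(length) * len(x) / length).astype(int)
    return x[idx]

def bifpn_layer(feats):
    """One bidirectional layer over pyramid stand-ins P3..P7.

    feats[0] is the finest level (P3), feats[-1] the coarsest (P7).
    Fusion is plain averaging; the real BiFPN uses learned weights
    and depthwise-separable convolutions.
    """
    n = len(feats)
    # Top-down pass: coarse information flows to finer levels.
    td = [None] * n
    td[-1] = feats[-1]
    for i in range(n - 2, -1, -1):
        td[i] = (feats[i] + resize_to(td[i + 1], len(feats[i]))) / 2
    # Bottom-up pass: fine information flows back up, including the
    # extra same-level edge from the original input.
    out = [None] * n
    out[0] = td[0]
    for i in range(1, n):
        out[i] = (feats[i] + td[i] + resize_to(out[i - 1], len(feats[i]))) / 3
    return out

# Repeating the bidirectional layer stacks more cross-scale fusion:
feats = [np.ones(s) for s in (32, 16, 8, 4, 2)]  # P3..P7 stand-ins
for _ in range(3):                               # 3 BiFPN layers
    feats = bifpn_layer(feats)
```

Each call preserves the per-level shapes, so the layer can be repeated any number of times; this repetition count is one of the dimensions scaled by the compound coefficient later.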

Weighted Feature Fusion

Input features at different resolutions usually contribute to the output feature unequally.

To address this, the paper proposes to add an additional weight for each input and let the
network learn the importance of each input feature.

  • Unbounded fusion: $O = \sum_{i} w_{i} \cdot I_{i}$
    Since the weights are unbounded, they can cause training instability.
  • Softmax-based fusion: $O = \sum_{i} \frac{e^{w_{i}}}{\sum_{j} e^{w_{j}}} \cdot I_{i}$
    An intuitive idea is to apply softmax to the weights, so that all weights are normalized to
    probabilities in the range 0 to 1. However, the extra softmax leads to a significant slowdown
    on GPU hardware.
  • Fast normalized fusion: $O = \sum_{i} \frac{w_{i}}{\epsilon + \sum_{j} w_{j}} \cdot I_{i}$
    where $w_{i} \geq 0$ is ensured by applying a ReLU after each $w_{i}$, and $\epsilon$ is a small value to avoid numerical instability.

Fusion Example:

$$
\begin{aligned}
P_{6}^{td} &= \operatorname{Conv}\left(\frac{w_{1} \cdot P_{6}^{in} + w_{2} \cdot \operatorname{Resize}\left(P_{7}^{in}\right)}{w_{1} + w_{2} + \epsilon}\right) \\
P_{6}^{out} &= \operatorname{Conv}\left(\frac{w_{1}^{\prime} \cdot P_{6}^{in} + w_{2}^{\prime} \cdot P_{6}^{td} + w_{3}^{\prime} \cdot \operatorname{Resize}\left(P_{5}^{out}\right)}{w_{1}^{\prime} + w_{2}^{\prime} + w_{3}^{\prime} + \epsilon}\right)
\end{aligned}
$$
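The fast normalized fusion used in this example can be sketched in a few lines of NumPy. This is a standalone illustration (the function name is mine, and the real implementation applies it per BiFPN node with learned scalar weights followed by a convolution):

```python
import numpy as np

def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """Fuse equally-shaped feature maps with non-negative weights.

    inputs:  list of feature arrays with identical shapes
    weights: one raw scalar per input (learned parameters in practice)
    """
    w = np.maximum(weights, 0.0)   # ReLU keeps each w_i >= 0
    w = w / (w.sum() + eps)        # normalize to roughly [0, 1]
    return sum(wi * x for wi, x in zip(w, inputs))

p6_in = np.ones((8, 8))
p7_up = np.full((8, 8), 3.0)       # stands in for Resize(P7_in)
fused = fast_normalized_fusion([p6_in, p7_up], np.array([1.0, 1.0]))
# equal raw weights normalize to ~0.5 each, so fused is ~2.0 everywhere
```

Unlike softmax-based fusion, this avoids the exponentials entirely, which is where the GPU speedup comes from.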

EfficientDet

This paper employs ImageNet-pretrained EfficientNets as the backbone network.

EfficientDet.png

Compound Scaling

Previous works mostly scale up a baseline detector by employing a bigger backbone network, using
larger input images, or stacking more FPN layers. These methods are usually less effective because they
focus on only a single or limited scaling dimension.

This paper proposes a new compound scaling method for object detection, which uses a simple compound coefficient $\phi$ to jointly scale up all dimensions of the backbone network, BiFPN network, class/box network, and resolution.
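As a sketch, the paper's scaling formulas can be evaluated directly for a given compound coefficient (the published EfficientDet-D0..D7 configs round the BiFPN width to nearby hardware-friendly values, so this only reproduces the raw formulas):

```python
def efficientdet_scaling(phi):
    """Compound-scaled dimensions for a given coefficient phi.

    Formulas follow the paper: BiFPN width grows exponentially,
    BiFPN depth, head depth, and input resolution grow linearly.
    """
    return {
        "bifpn_width": 64 * (1.35 ** phi),    # W_bifpn
        "bifpn_depth": 3 + phi,               # D_bifpn (#layers)
        "head_depth": 3 + phi // 3,           # D_box = D_class
        "input_resolution": 512 + phi * 128,  # R_input
    }

for phi in range(4):
    print(phi, efficientdet_scaling(phi))
```

For example, phi = 0 gives a BiFPN width of 64, 3 BiFPN layers, and a 512-pixel input, while each increment of phi adds one BiFPN layer and 128 pixels of resolution.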

EfficientDet can be regarded as the extension of EfficientNet in the field of object detection.

Read the paper for more detail.

My thinking

  • To determine whether a parameter is doing useful work, a good approach is to observe how it changes over the course of training.