SSD: Single Shot MultiBox Detector
ECCV 2016
- The Principle of SSD
- Architecture
- How to train
- My thinking
The Principle of SSD
- Multi-Scale Feature Map (Feature Pyramid)

- Default Box (Prior Box)

SSD is essentially dense sampling of default boxes over multiple feature-map scales.
Prior box location: $(d^{cx}, d^{cy}, d^{w}, d^{h})$
Bounding box location: $(b^{cx}, b^{cy}, b^{w}, b^{h})$
Encode: $l^{cx}=(b^{cx}-d^{cx})/d^{w},\quad l^{cy}=(b^{cy}-d^{cy})/d^{h}$, $l^{w}=\log(b^{w}/d^{w}),\quad l^{h}=\log(b^{h}/d^{h})$
Decode: $b^{cx}=d^{w}l^{cx}+d^{cx},\quad b^{cy}=d^{h}l^{cy}+d^{cy}$, $b^{w}=d^{w}\exp(l^{w}),\quad b^{h}=d^{h}\exp(l^{h})$
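The encode/decode pair above can be sketched in a few lines of NumPy (a minimal illustration; boxes are assumed to be in $(cx, cy, w, h)$ form, and the function names are mine, not the paper's):

```python
import numpy as np

def encode(box, prior):
    """Encode a ground-truth box (cx, cy, w, h) as offsets relative to a prior box."""
    bcx, bcy, bw, bh = box
    dcx, dcy, dw, dh = prior
    return np.array([(bcx - dcx) / dw,
                     (bcy - dcy) / dh,
                     np.log(bw / dw),
                     np.log(bh / dh)])

def decode(offset, prior):
    """Invert encode(): recover the box from predicted offsets and the prior box."""
    lcx, lcy, lw, lh = offset
    dcx, dcy, dw, dh = prior
    return np.array([dcx + dw * lcx,
                     dcy + dh * lcy,
                     dw * np.exp(lw),
                     dh * np.exp(lh)])
```

By construction, `decode(encode(b, d), d)` recovers `b` exactly, which is a quick sanity check when implementing the transform.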
- Detect by Convolution

Replace the fully connected layers with convolution layers to reduce parameters.
Each detector feature map is decoded by a $3 \times 3 \times n_c$ convolution.
For example, $n_c = n_b \times (loc + c) = 3 \times (4 + 21) = 75$, where
$n_b$ is the number of prior boxes per location,
$n_c$ is the channel count of the detector feature map, and
$loc$ is the 4-dimensional prior box location offset ($c = 21$ is the class count for PASCAL VOC: 20 classes plus background).
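The channel arithmetic is simple enough to verify directly (the helper name here is illustrative, not from the paper):

```python
def detector_channels(num_priors, num_classes, loc_dims=4):
    """Output channels of a 3x3 detector conv: one (loc + class) bundle per prior box."""
    return num_priors * (loc_dims + num_classes)

# PASCAL VOC: 20 classes + background = 21 classes, 3 priors at this location
print(detector_channels(3, 21))  # 75
```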
Architecture

Base model is VGG16:
- Replace fc6 with a 3*3 convolution layer (Conv6)
- Replace fc7 with a 1*1 convolution layer (Conv7)
- Change Pool5 from 2*2, stride 2 to 3*3, stride 1
- Conv6 uses atrous (dilated) convolution
- Remove dropout and fc8, and add several extra convolution layers
Detector layers: Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, Conv11_2
Detector layers size: (38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)
Prior box scale increases linearly with layer depth: $s_{k}=s_{\min}+\frac{s_{\max}-s_{\min}}{m-1}(k-1),\ k \in[1, m]$
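The scale formula evaluates to a simple linear ramp. A quick sketch with the paper's $s_{\min}=0.2$, $s_{\max}=0.9$ (note the paper actually gives Conv4_3 a smaller special-case scale of 0.1; this only covers the linear rule):

```python
def prior_scales(m, s_min=0.2, s_max=0.9):
    """Linearly spaced prior box scales s_k for the m detector feature maps."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

print([round(s, 2) for s in prior_scales(6)])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```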
How to train
Matching Strategy
- Match each ground truth box to the prior box with the best Jaccard overlap.
- Match each remaining prior box to any ground truth with Jaccard overlap higher than a threshold (0.5).
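The two matching rules can be sketched with NumPy (boxes here are assumed to be corner-format `(xmin, ymin, xmax, ymax)`; the function names are illustrative):

```python
import numpy as np

def jaccard(boxes_a, boxes_b):
    """Pairwise IoU for boxes in (xmin, ymin, xmax, ymax) format."""
    lt = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])   # intersection top-left
    rb = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])   # intersection bottom-right
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def match(gt_boxes, priors, threshold=0.5):
    """Return, for each prior, the index of its matched ground truth (-1 = negative)."""
    iou = jaccard(gt_boxes, priors)            # shape (num_gt, num_priors)
    matches = np.full(priors.shape[0], -1)
    # rule 2: a prior matches the best ground truth whose overlap exceeds the threshold
    best_gt = iou.argmax(axis=0)
    keep = iou.max(axis=0) > threshold
    matches[keep] = best_gt[keep]
    # rule 1: every ground truth keeps its single best prior, regardless of the threshold
    matches[iou.argmax(axis=1)] = np.arange(gt_boxes.shape[0])
    return matches
```

Rule 1 is applied last so that each ground truth is guaranteed at least one positive prior even when no overlap clears the threshold.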
Hard Negative Mining
Most of the prior boxes are negatives. This introduces a significant imbalance between the
positive and negative training examples.
How to do it
Sort the negatives by their confidence loss and pick the top ones, so that the ratio of negatives to positives is at most 3:1.
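A minimal NumPy sketch of the selection step, assuming a per-prior confidence loss has already been computed (the function name is mine):

```python
import numpy as np

def hard_negative_mask(conf_loss, positive_mask, neg_ratio=3):
    """Keep all positives plus the highest-loss negatives, at most neg_ratio per positive."""
    num_pos = int(positive_mask.sum())
    num_neg = min(neg_ratio * num_pos, int((~positive_mask).sum()))
    neg_loss = np.where(positive_mask, -np.inf, conf_loss)  # exclude positives from ranking
    top_neg = np.argsort(neg_loss)[::-1][:num_neg]          # indices of hardest negatives
    mask = positive_mask.copy()
    mask[top_neg] = True
    return mask
```

Only the priors selected by this mask contribute to the confidence loss, which keeps training from being dominated by easy background boxes.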
Loss Func
The overall objective is a weighted sum of the localization loss and the confidence loss:
$$L(x, c, l, g)=\frac{1}{N}\left(L_{conf}(x, c)+\alpha L_{loc}(x, l, g)\right)$$
where $N$ is the number of matched prior boxes.
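A NumPy sketch of this objective, using smooth L1 for localization and softmax cross-entropy for confidence as in the paper (the function signatures are illustrative, and hard negative mining is assumed to have been applied to the inputs already):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 (Huber) loss, used for the localization term."""
    return np.where(np.abs(x) < 1, 0.5 * x**2, np.abs(x) - 0.5)

def ssd_loss(loc_pred, loc_target, cls_logits, cls_target, pos_mask, alpha=1.0):
    """L = (L_conf + alpha * L_loc) / N, with N = number of matched (positive) priors."""
    n = max(int(pos_mask.sum()), 1)
    # localization loss is computed over positive priors only
    l_loc = smooth_l1(loc_pred[pos_mask] - loc_target[pos_mask]).sum()
    # softmax cross-entropy confidence loss over the supplied priors
    logits = cls_logits - cls_logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    l_conf = -log_prob[np.arange(len(cls_target)), cls_target].sum()
    return (l_conf + alpha * l_loc) / n
```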
My thinking
- Balancing negatives and positives (prior boxes) is a good idea. (Focal loss handles this better.)
- The aspect ratios of the prior boxes are hand-designed, and they may not be suitable for all situations.
- Using multi-scale feature maps (a feature pyramid) from different layers for detection is a good idea. (Feature Pyramid Network does this better.)
- The model does not support multi-scale training: we cannot feed images of different sizes, so we cannot improve scale robustness that way.
Code
```python
def _built_net(self):
    """Construct the SSD net"""
    self.end_points = {}  # record the detection layers output
    self._images = tf.placeholder(tf.float32,
                                  shape=[None, self.ssd_params.img_shape[0],
                                         self.ssd_params.img_shape[1], 3])
    with tf.variable_scope("ssd_300_vgg"):
        # original vgg layers
        # block 1
        net = conv2d(self._images, 64, 3, scope="conv1_1")
        net = conv2d(net, 64, 3, scope="conv1_2")
        self.end_points["block1"] = net
        net = max_pool2d(net, 2, scope="pool1")
        # block 2
        net = conv2d(net, 128, 3, scope="conv2_1")
        net = conv2d(net, 128, 3, scope="conv2_2")
        self.end_points["block2"] = net
        net = max_pool2d(net, 2, scope="pool2")
        # block 3
        net = conv2d(net, 256, 3, scope="conv3_1")
        net = conv2d(net, 256, 3, scope="conv3_2")
        net = conv2d(net, 256, 3, scope="conv3_3")
        self.end_points["block3"] = net
        net = max_pool2d(net, 2, scope="pool3")
        # block 4
        net = conv2d(net, 512, 3, scope="conv4_1")
        net = conv2d(net, 512, 3, scope="conv4_2")
        net = conv2d(net, 512, 3, scope="conv4_3")
        self.end_points["block4"] = net
        net = max_pool2d(net, 2, scope="pool4")
        # block 5
        net = conv2d(net, 512, 3, scope="conv5_1")
        net = conv2d(net, 512, 3, scope="conv5_2")
        net = conv2d(net, 512, 3, scope="conv5_3")
        self.end_points["block5"] = net
        net = max_pool2d(net, 3, stride=1, scope="pool5")
        # additional SSD layers
        # block 6: use dilated conv
        net = conv2d(net, 1024, 3, dilation_rate=6, scope="conv6")
        self.end_points["block6"] = net
        # block 7
        net = conv2d(net, 1024, 1, scope="conv7")
        self.end_points["block7"] = net
        # block 8
        net = conv2d(net, 256, 1, scope="conv8_1x1")
        net = conv2d(pad2d(net, 1), 512, 3, stride=2, scope="conv8_3x3",
                     padding="valid")
        self.end_points["block8"] = net
        # block 9
        net = conv2d(net, 128, 1, scope="conv9_1x1")
        net = conv2d(pad2d(net, 1), 256, 3, stride=2, scope="conv9_3x3",
                     padding="valid")
        self.end_points["block9"] = net
        # block 10
        net = conv2d(net, 128, 1, scope="conv10_1x1")
        net = conv2d(net, 256, 3, scope="conv10_3x3", padding="valid")
        self.end_points["block10"] = net
        # block 11
        net = conv2d(net, 128, 1, scope="conv11_1x1")
        net = conv2d(net, 256, 3, scope="conv11_3x3", padding="valid")
        self.end_points["block11"] = net
        # class and location predictions
        predictions = []
        logits = []
        locations = []
        for i, layer in enumerate(self.ssd_params.feat_layers):
            cls, loc = ssd_multibox_layer(self.end_points[layer],
                                          self.ssd_params.num_classes,
                                          self.ssd_params.anchor_sizes[i],
                                          self.ssd_params.anchor_ratios[i],
                                          self.ssd_params.normalizations[i],
                                          scope=layer + "_box")
            predictions.append(tf.nn.softmax(cls))
            logits.append(cls)
            locations.append(loc)
        return predictions, logits, locations
```
