SSD: Single Shot MultiBox Detector
ECCV 2016
- The Principle of SSD
- Architecture
- How to train
- My thinking
The Principle of SSD
- Multi-Scale Feature Map (Feature Pyramid)

- Default Box (Prior Box)

SSD is essentially dense sampling of default boxes over multiple feature-map scales.
Prior box location: $(d^{cx}, d^{cy}, d^{w}, d^{h})$
Bounding box location: $(b^{cx}, b^{cy}, b^{w}, b^{h})$
Encode: $l^{cx}=(b^{cx}-d^{cx})/d^{w},\quad l^{cy}=(b^{cy}-d^{cy})/d^{h}$, $l^{w}=\log(b^{w}/d^{w}),\quad l^{h}=\log(b^{h}/d^{h})$
Decode: $b^{cx}=d^{w}l^{cx}+d^{cx},\quad b^{cy}=d^{h}l^{cy}+d^{cy}$, $b^{w}=d^{w}\exp(l^{w}),\quad b^{h}=d^{h}\exp(l^{h})$
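The encode/decode pair above can be sketched in a few lines of NumPy (a minimal illustration; boxes are assumed to be in $(cx, cy, w, h)$ form, and the function names are mine, not the paper's):

```python
import numpy as np

def encode(box, prior):
    """Encode a ground-truth box (cx, cy, w, h) as offsets relative to a prior box."""
    bcx, bcy, bw, bh = box
    dcx, dcy, dw, dh = prior
    return np.array([(bcx - dcx) / dw,
                     (bcy - dcy) / dh,
                     np.log(bw / dw),
                     np.log(bh / dh)])

def decode(offset, prior):
    """Invert encode(): recover the box from predicted offsets and the prior box."""
    lcx, lcy, lw, lh = offset
    dcx, dcy, dw, dh = prior
    return np.array([dcx + dw * lcx,
                     dcy + dh * lcy,
                     dw * np.exp(lw),
                     dh * np.exp(lh)])
```

By construction, `decode(encode(b, d), d)` recovers `b` exactly, which is a quick sanity check when implementing the transform.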
- Detect by Convolution

Replace the fully connected layers with convolution layers to reduce parameters.
Each detector feature map is decoded by a $3 \times 3 \times n_c$ convolution.
For example, $n_c = n_b \times (loc + c) = 3 \times (4 + 21) = 75$, where
$n_b$ is the number of prior boxes per location,
$n_c$ is the channel count of the detector feature map, and
$loc$ is the 4-dimensional prior box location offset ($c = 21$ is the class count for PASCAL VOC: 20 classes plus background).
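The channel arithmetic is simple enough to verify directly (the helper name here is illustrative, not from the paper):

```python
def detector_channels(num_priors, num_classes, loc_dims=4):
    """Output channels of a 3x3 detector conv: one (loc + class) bundle per prior box."""
    return num_priors * (loc_dims + num_classes)

# PASCAL VOC: 20 classes + background = 21 classes, 3 priors at this location
print(detector_channels(3, 21))  # 75
```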
Architecture

Base model is VGG16:
- Replace fc6 with a 3*3 convolution layer (Conv6)
- Replace fc7 with a 1*1 convolution layer (Conv7)
- Change Pool5 from 2*2, stride 2 to 3*3, stride 1
- Conv6 uses atrous (dilated) convolution
- Remove dropout and fc8, and add several extra convolution layers
Detector layers: Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, Conv11_2
Detector layers size: (38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)
Prior box scale increases linearly with layer depth: $s_{k}=s_{\min}+\frac{s_{\max}-s_{\min}}{m-1}(k-1),\ k \in[1, m]$
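The scale formula evaluates to a simple linear ramp. A quick sketch with the paper's $s_{\min}=0.2$, $s_{\max}=0.9$ (note the paper actually gives Conv4_3 a smaller special-case scale of 0.1; this only covers the linear rule):

```python
def prior_scales(m, s_min=0.2, s_max=0.9):
    """Linearly spaced prior box scales s_k for the m detector feature maps."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

print([round(s, 2) for s in prior_scales(6)])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```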
How to train
Matching Strategy
- Match each ground truth box to the prior box with the best Jaccard overlap.
- Match each remaining prior box to any ground truth with Jaccard overlap higher than a threshold (0.5).
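The two matching rules can be sketched with NumPy (boxes here are assumed to be corner-format `(xmin, ymin, xmax, ymax)`; the function names are illustrative):

```python
import numpy as np

def jaccard(boxes_a, boxes_b):
    """Pairwise IoU for boxes in (xmin, ymin, xmax, ymax) format."""
    lt = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])   # intersection top-left
    rb = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])   # intersection bottom-right
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def match(gt_boxes, priors, threshold=0.5):
    """Return, for each prior, the index of its matched ground truth (-1 = negative)."""
    iou = jaccard(gt_boxes, priors)            # shape (num_gt, num_priors)
    matches = np.full(priors.shape[0], -1)
    # rule 2: a prior matches the best ground truth whose overlap exceeds the threshold
    best_gt = iou.argmax(axis=0)
    keep = iou.max(axis=0) > threshold
    matches[keep] = best_gt[keep]
    # rule 1: every ground truth keeps its single best prior, regardless of the threshold
    matches[iou.argmax(axis=1)] = np.arange(gt_boxes.shape[0])
    return matches
```

Rule 1 is applied last so that each ground truth is guaranteed at least one positive prior even when no overlap clears the threshold.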
Hard Negative Mining
Most of the prior boxes are negatives. This introduces a significant imbalance between the
positive and negative training examples.
How to do it
Sort the negatives by their confidence loss and pick the top ones, so that the ratio of negatives to positives is at most 3:1.
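A minimal NumPy sketch of the selection step, assuming a per-prior confidence loss has already been computed (the function name is mine):

```python
import numpy as np

def hard_negative_mask(conf_loss, positive_mask, neg_ratio=3):
    """Keep all positives plus the highest-loss negatives, at most neg_ratio per positive."""
    num_pos = int(positive_mask.sum())
    num_neg = min(neg_ratio * num_pos, int((~positive_mask).sum()))
    neg_loss = np.where(positive_mask, -np.inf, conf_loss)  # exclude positives from ranking
    top_neg = np.argsort(neg_loss)[::-1][:num_neg]          # indices of hardest negatives
    mask = positive_mask.copy()
    mask[top_neg] = True
    return mask
```

Only the priors selected by this mask contribute to the confidence loss, which keeps training from being dominated by easy background boxes.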
Loss Func
The overall objective is a weighted sum of the localization loss and the confidence loss:
$$L(x, c, l, g)=\frac{1}{N}\left(L_{conf}(x, c)+\alpha L_{loc}(x, l, g)\right)$$
where $N$ is the number of matched prior boxes.
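A NumPy sketch of this objective, using smooth L1 for localization and softmax cross-entropy for confidence as in the paper (the function signatures are illustrative, and hard negative mining is assumed to have been applied to the inputs already):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 (Huber) loss, used for the localization term."""
    return np.where(np.abs(x) < 1, 0.5 * x**2, np.abs(x) - 0.5)

def ssd_loss(loc_pred, loc_target, cls_logits, cls_target, pos_mask, alpha=1.0):
    """L = (L_conf + alpha * L_loc) / N, with N = number of matched (positive) priors."""
    n = max(int(pos_mask.sum()), 1)
    # localization loss is computed over positive priors only
    l_loc = smooth_l1(loc_pred[pos_mask] - loc_target[pos_mask]).sum()
    # softmax cross-entropy confidence loss over the supplied priors
    logits = cls_logits - cls_logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    l_conf = -log_prob[np.arange(len(cls_target)), cls_target].sum()
    return (l_conf + alpha * l_loc) / n
```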
My thinking
- Balancing negatives and positives (prior boxes) is a good idea. (Focal loss handles this better.)
- The aspect ratios of the prior boxes are hand-designed, and they may not be suitable for all situations.
- Using multi-scale feature maps (a feature pyramid) from different layers for detection is a good idea. (Feature Pyramid Network does this better.)
- The model does not support multi-scale training: we cannot feed images of different sizes, so we cannot improve scale robustness that way.
Code
```python
def _built_net(self):
    """Construct the SSD net"""
    self.end_points = {}  # record the detection layers output
    self._images = tf.placeholder(tf.float32,
                                  shape=[None, self.ssd_params.img_shape[0],
                                         self.ssd_params.img_shape[1], 3])
    with tf.variable_scope("ssd_300_vgg"):
        # original vgg layers
        # block 1
        net = conv2d(self._images, 64, 3, scope="conv1_1")
        net = conv2d(net, 64, 3, scope="conv1_2")
        self.end_points["block1"] = net
        net = max_pool2d(net, 2, scope="pool1")
        # block 2
        net = conv2d(net, 128, 3, scope="conv2_1")
        net = conv2d(net, 128, 3, scope="conv2_2")
        self.end_points["block2"] = net
        net = max_pool2d(net, 2, scope="pool2")
        # block 3
        net = conv2d(net, 256, 3, scope="conv3_1")
        net = conv2d(net, 256, 3, scope="conv3_2")
        net = conv2d(net, 256, 3, scope="conv3_3")
        self.end_points["block3"] = net
        net = max_pool2d(net, 2, scope="pool3")
        # block 4
        net = conv2d(net, 512, 3, scope="conv4_1")
        net = conv2d(net, 512, 3, scope="conv4_2")
        net = conv2d(net, 512, 3, scope="conv4_3")
        self.end_points["block4"] = net
        net = max_pool2d(net, 2, scope="pool4")
        # block 5
        net = conv2d(net, 512, 3, scope="conv5_1")
        net = conv2d(net, 512, 3, scope="conv5_2")
        net = conv2d(net, 512, 3, scope="conv5_3")
        self.end_points["block5"] = net
        net = max_pool2d(net, 3, stride=1, scope="pool5")
        # additional SSD layers
        # block 6: use dilated conv
        net = conv2d(net, 1024, 3, dilation_rate=6, scope="conv6")
        self.end_points["block6"] = net
        # block 7
        net = conv2d(net, 1024, 1, scope="conv7")
        self.end_points["block7"] = net
        # block 8
        net = conv2d(net, 256, 1, scope="conv8_1x1")
        net = conv2d(pad2d(net, 1), 512, 3, stride=2, scope="conv8_3x3",
                     padding="valid")
        self.end_points["block8"] = net
        # block 9
        net = conv2d(net, 128, 1, scope="conv9_1x1")
        net = conv2d(pad2d(net, 1), 256, 3, stride=2, scope="conv9_3x3",
                     padding="valid")
        self.end_points["block9"] = net
        # block 10
        net = conv2d(net, 128, 1, scope="conv10_1x1")
        net = conv2d(net, 256, 3, scope="conv10_3x3", padding="valid")
        self.end_points["block10"] = net
        # block 11
        net = conv2d(net, 128, 1, scope="conv11_1x1")
        net = conv2d(net, 256, 3, scope="conv11_3x3", padding="valid")
        self.end_points["block11"] = net
        # class and location predictions
        predictions = []
        logits = []
        locations = []
        for i, layer in enumerate(self.ssd_params.feat_layers):
            cls, loc = ssd_multibox_layer(self.end_points[layer],
                                          self.ssd_params.num_classes,
                                          self.ssd_params.anchor_sizes[i],
                                          self.ssd_params.anchor_ratios[i],
                                          self.ssd_params.normalizations[i],
                                          scope=layer + "_box")
            predictions.append(tf.nn.softmax(cls))
            logits.append(cls)
            locations.append(loc)
        return predictions, logits, locations
```
