SSD: Single Shot MultiBox Detector

ECCV 2016

ssd1.png

  • Default Box (Prior Box)
    ssd2.jpg
    SSD essentially performs dense sampling at multiple scales.
    Prior Box Location: $(d^{cx}, d^{cy}, d^{w}, d^{h})$
    Bounding Box Location: $(b^{cx}, b^{cy}, b^{w}, b^{h})$
    encode
    $l^{cx} = (b^{cx} - d^{cx}) / d^{w}, \quad l^{cy} = (b^{cy} - d^{cy}) / d^{h}$
    $l^{w} = \log(b^{w} / d^{w}), \quad l^{h} = \log(b^{h} / d^{h})$
    decode
    $b^{cx} = d^{w} l^{cx} + d^{cx}, \quad b^{cy} = d^{h} l^{cy} + d^{cy}$
    $b^{w} = d^{w} \exp(l^{w}), \quad b^{h} = d^{h} \exp(l^{h})$
  • Detect by Convolution
    ssd5.png
    Replace the fc layers with convolution layers to reduce parameters.
    Use a $3 \times 3 \times n_c$ convolution layer to encode the detection feature map.
    For example, $n_c = n_b \times (loc + c) = 3 \times (4 + 21) = 75$.
    $n_b$ is the number of prior boxes per position, $n_c$ is the number of channels of the detector feature map, $loc$ is the prior box location offset (4 values), and $c$ is the number of classes (21, including background).
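The encode/decode transforms above can be sketched in NumPy. This is a minimal sketch; the function names `encode`/`decode` are mine, not from a particular SSD implementation:

```python
import numpy as np

def encode(prior, box):
    """Offsets of a box relative to a prior; both in (cx, cy, w, h) form."""
    dcx, dcy, dw, dh = prior
    bcx, bcy, bw, bh = box
    return np.array([(bcx - dcx) / dw,
                     (bcy - dcy) / dh,
                     np.log(bw / dw),
                     np.log(bh / dh)])

def decode(prior, offsets):
    """Invert encode() to recover the absolute box."""
    dcx, dcy, dw, dh = prior
    lcx, lcy, lw, lh = offsets
    return np.array([dw * lcx + dcx,
                     dh * lcy + dcy,
                     dw * np.exp(lw),
                     dh * np.exp(lh)])
```

Decoding the encoded offsets recovers the original box exactly, and a box identical to its prior encodes to all zeros.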

Architecture

ssd3.jpg

Base model is VGG16

  • Replace fc6 with a convolution layer (3×3)
  • Replace fc7 with a convolution layer (1×1)
  • Pool5: 2×2 with stride 2 → 3×3 with stride 1
  • Conv6 is an atrous (dilated) convolution
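Atrous convolution enlarges the receptive field without adding parameters: a kernel with dilation rate $d$ samples inputs spaced $d$ apart. A quick sanity check (the helper name is my own):

```python
def effective_kernel_size(k, dilation):
    # A k x k kernel with dilation rate d samples inputs spaced d apart,
    # so it spans k + (k - 1) * (d - 1) input positions per dimension.
    return k + (k - 1) * (dilation - 1)

print(effective_kernel_size(3, 6))  # conv6: 3x3 kernel, dilation rate 6
```

So conv6's 3×3 kernel with dilation 6 covers the same span as a 13×13 kernel, at a fraction of the parameter count.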

ssd4.png

  • Remove the dropout layers and fc8; add some extra convolution layers

Detector layers: Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, Conv11_2

Detector layers size: (38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)
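Combining these feature-map sizes with the per-layer default box counts from the SSD paper (4, 6, 6, 6, 4, 4) gives SSD300 its total of 8732 prior boxes:

```python
# feature-map sizes of the six detector layers (SSD300)
sizes = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)]
boxes_per_cell = [4, 6, 6, 6, 4, 4]  # per-layer counts from the SSD paper

total = sum(h * w * n for (h, w), n in zip(sizes, boxes_per_cell))
print(total)  # 8732
```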

Prior box scale increases linearly across the detector layers: $s_{k} = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 1}(k - 1), \quad k \in [1, m]$
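The scale formula is easy to evaluate directly. A minimal sketch using the paper's values $s_{\min} = 0.2$, $s_{\max} = 0.9$ (the function name is mine):

```python
def prior_scales(m, s_min=0.2, s_max=0.9):
    # s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1), for k = 1..m
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

print(prior_scales(6))  # evenly spaced scales from 0.2 up to 0.9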

How to train

Matching Strategy

  • Match each ground truth box to the prior box with the best Jaccard overlap.
  • Match each prior box to any ground truth with Jaccard overlap higher than a threshold (0.5).
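The two matching rules above can be sketched with NumPy. This is a simplified illustration (function names `jaccard` and `match_priors` are mine; boxes are in corner form):

```python
import numpy as np

def jaccard(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match_priors(priors, gts, threshold=0.5):
    """For each prior, index of its matched ground truth (-1 = negative)."""
    overlaps = np.array([[jaccard(p, g) for g in gts] for p in priors])
    matches = np.full(len(priors), -1)
    # rule 2: match any prior whose best overlap exceeds the threshold
    best_gt = overlaps.argmax(axis=1)
    best_overlap = overlaps.max(axis=1)
    matches[best_overlap > threshold] = best_gt[best_overlap > threshold]
    # rule 1: every ground truth keeps its best prior, regardless of threshold
    matches[overlaps.argmax(axis=0)] = np.arange(len(gts))
    return matches
```

Rule 1 runs last so each ground truth is guaranteed at least one positive prior even when no overlap clears the threshold.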

Hard Negative Mining

Most of the prior boxes are negatives. This introduces a significant imbalance between the
positive and negative training examples.

How it is done

Sort the negative prior boxes by their highest confidence loss and pick the top ones so that
the ratio of negatives to positives is at most 3:1.
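The selection step can be sketched in NumPy (a minimal sketch; the function name and mask convention are mine):

```python
import numpy as np

def hard_negative_mining(conf_loss, pos_mask, neg_pos_ratio=3):
    """Pick the negatives with the highest confidence loss, at most 3x the positives."""
    num_pos = int(pos_mask.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~pos_mask).sum()))
    neg_loss = np.where(pos_mask, -np.inf, conf_loss)  # exclude positives from the ranking
    keep = np.argsort(-neg_loss)[:num_neg]             # indices of the hardest negatives
    neg_mask = np.zeros_like(pos_mask)
    neg_mask[keep] = True
    return neg_mask
```

Only the priors flagged in `pos_mask` or the returned mask contribute to the confidence loss; the remaining easy negatives are ignored.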

Loss Func

The overall objective is a weighted sum of the localization loss and the confidence loss.

$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$
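A sketch of the combination, with $N$ the number of matched priors and the localization term using the paper's Smooth L1 loss (the helper names are mine; `conf_loss` is assumed to be per-prior cross-entropy already restricted to positives plus mined negatives):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def ssd_loss(conf_loss, loc_pred, loc_gt, pos_mask, alpha=1.0):
    """Overall loss (1/N)(L_conf + alpha * L_loc); 0 if no prior matched."""
    n = int(pos_mask.sum())
    if n == 0:
        return 0.0
    l_loc = smooth_l1(loc_pred[pos_mask] - loc_gt[pos_mask]).sum()
    l_conf = conf_loss.sum()
    return (l_conf + alpha * l_loc) / n
```

Note the guard for $N = 0$: with no matched priors the loss is defined as zero, matching the paper.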

My thinking

  • Balancing negatives and positives (prior boxes) is a good idea. (Focal loss handles this better.)
  • The aspect ratios of the prior boxes are hand-designed and may not be suitable for all situations.
  • Using multi-scale feature maps (a feature pyramid) from different layers for detection is a good idea. (Feature Pyramid Network does this better.)
  • The model does not support multi-scale training: we cannot feed in images of different sizes,
    so we cannot improve scale robustness that way.

Code

```python
def _built_net(self):
    """Construct the SSD net."""
    self.end_points = {}  # record the detection layers' outputs
    self._images = tf.placeholder(tf.float32, shape=[None, self.ssd_params.img_shape[0],
                                                     self.ssd_params.img_shape[1], 3])
    with tf.variable_scope("ssd_300_vgg"):
        # original vgg layers
        # block 1
        net = conv2d(self._images, 64, 3, scope="conv1_1")
        net = conv2d(net, 64, 3, scope="conv1_2")
        self.end_points["block1"] = net
        net = max_pool2d(net, 2, scope="pool1")
        # block 2
        net = conv2d(net, 128, 3, scope="conv2_1")
        net = conv2d(net, 128, 3, scope="conv2_2")
        self.end_points["block2"] = net
        net = max_pool2d(net, 2, scope="pool2")
        # block 3
        net = conv2d(net, 256, 3, scope="conv3_1")
        net = conv2d(net, 256, 3, scope="conv3_2")
        net = conv2d(net, 256, 3, scope="conv3_3")
        self.end_points["block3"] = net
        net = max_pool2d(net, 2, scope="pool3")
        # block 4
        net = conv2d(net, 512, 3, scope="conv4_1")
        net = conv2d(net, 512, 3, scope="conv4_2")
        net = conv2d(net, 512, 3, scope="conv4_3")
        self.end_points["block4"] = net
        net = max_pool2d(net, 2, scope="pool4")
        # block 5
        net = conv2d(net, 512, 3, scope="conv5_1")
        net = conv2d(net, 512, 3, scope="conv5_2")
        net = conv2d(net, 512, 3, scope="conv5_3")
        self.end_points["block5"] = net
        # pool5 changed to 3x3 with stride 1 (keeps spatial size)
        net = max_pool2d(net, 3, stride=1, scope="pool5")
        # additional SSD layers
        # block 6: use dilated (atrous) convolution
        net = conv2d(net, 1024, 3, dilation_rate=6, scope="conv6")
        self.end_points["block6"] = net
        # net = dropout(net, is_training=self.is_training)
        # block 7
        net = conv2d(net, 1024, 1, scope="conv7")
        self.end_points["block7"] = net
        # block 8
        net = conv2d(net, 256, 1, scope="conv8_1x1")
        net = conv2d(pad2d(net, 1), 512, 3, stride=2, scope="conv8_3x3",
                     padding="valid")
        self.end_points["block8"] = net
        # block 9
        net = conv2d(net, 128, 1, scope="conv9_1x1")
        net = conv2d(pad2d(net, 1), 256, 3, stride=2, scope="conv9_3x3",
                     padding="valid")
        self.end_points["block9"] = net
        # block 10
        net = conv2d(net, 128, 1, scope="conv10_1x1")
        net = conv2d(net, 256, 3, scope="conv10_3x3", padding="valid")
        self.end_points["block10"] = net
        # block 11
        net = conv2d(net, 128, 1, scope="conv11_1x1")
        net = conv2d(net, 256, 3, scope="conv11_3x3", padding="valid")
        self.end_points["block11"] = net
        # class and location predictions for each detector layer
        predictions = []
        logits = []
        locations = []
        for i, layer in enumerate(self.ssd_params.feat_layers):
            cls, loc = ssd_multibox_layer(self.end_points[layer], self.ssd_params.num_classes,
                                          self.ssd_params.anchor_sizes[i],
                                          self.ssd_params.anchor_ratios[i],
                                          self.ssd_params.normalizations[i],
                                          scope=layer + "_box")
            predictions.append(tf.nn.softmax(cls))
            logits.append(cls)
            locations.append(loc)
        return predictions, logits, locations
```