# Single Shot Multibox Detection
:label:`sec_ssd`

In :numref:`sec_bbox`--:numref:`sec_object-detection-dataset`, we introduced bounding boxes, anchor boxes, multiscale object detection, and the dataset for object detection. Now we are ready to use this background knowledge to design an object detection model: single shot multibox detection (SSD) :cite:`Liu.Anguelov.Erhan.ea.2016`. This model is simple, fast, and widely used. Although it is just one of many object detection models, some of the design principles and implementation details in this section also apply to other models.

## Model

:numref:`fig_ssd` provides an overview of the design of single-shot multibox detection. This model mainly consists of a base network followed by several multiscale feature map blocks. The base network extracts features from the input image, so it can use a deep CNN. For example, the original single-shot multibox detection paper adopts a VGG network truncated before the classification layer :cite:`Liu.Anguelov.Erhan.ea.2016`, while ResNet has also been commonly used. We can design the base network so that it outputs larger feature maps, which generate more anchor boxes for detecting smaller objects. Subsequently, each multiscale feature map block reduces (e.g., by half) the height and width of the feature maps from the previous block, and enables each unit of the feature maps to increase its receptive field on the input image.

Recall the design of multiscale object detection through layerwise representations of images by deep neural networks in :numref:`sec_multiscale-object-detection`. Since multiscale feature maps closer to the top of :numref:`fig_ssd` are smaller but have larger receptive fields, they are suitable for detecting fewer but larger objects.

In a nutshell, via its base network and several multiscale feature map blocks, single-shot multibox detection generates a varying number of anchor boxes with different sizes, and detects varying-size objects by predicting classes and offsets of these anchor boxes (and hence the bounding boxes). It is therefore a multiscale object detection model.

As a multiscale object detection model, single-shot multibox detection mainly consists of a base network followed by several multiscale feature map blocks.
:label:`fig_ssd`

In the following, we will describe the implementation details of different blocks in :numref:`fig_ssd`. To begin with, we discuss how to implement the class and bounding box prediction.

### Class Prediction Layer

Let the number of object classes be $q$. Then anchor boxes have $q+1$ classes, where class 0 is background. At some scale, suppose that the height and width of feature maps are $h$ and $w$, respectively. When $a$ anchor boxes are generated with each spatial position of these feature maps as their center, a total of $hwa$ anchor boxes need to be classified. This often makes classification with fully-connected layers infeasible because of the large number of parameters required. Recall how we used channels of convolutional layers to predict classes in :numref:`sec_nin`. Single-shot multibox detection uses the same technique to reduce model complexity.

Specifically, the class prediction layer uses a convolutional layer without altering width or height of feature maps. In this way, there can be a one-to-one correspondence between outputs and inputs at the same spatial dimensions (width and height) of feature maps. More concretely, channels of the output feature maps at any spatial position ($x$, $y$) represent class predictions for all the anchor boxes centered on ($x$, $y$) of the input feature maps. To produce valid predictions, there must be $a(q+1)$ output channels, where for the same spatial position the output channel with index $i(q+1) + j$ represents the prediction of the class $j$ ($0 \leq j \leq q$) for the anchor box $i$ ($0 \leq i < a$).
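The channel layout described above can be made concrete with a small sketch in plain Python. The helper names `channel_index` and `decode_channel` are illustrative and not part of the original code; they only restate the indexing rule $i(q+1) + j$ and its inverse.

```python
def channel_index(i, j, num_classes):
    """Output channel holding the score of class j for anchor box i.

    Class 0 is background, so each anchor box owns num_classes + 1 channels.
    """
    return i * (num_classes + 1) + j


def decode_channel(c, num_classes):
    """Inverse mapping: recover (anchor index, class index) from a channel."""
    return divmod(c, num_classes + 1)


num_anchors, num_classes = 5, 10  # a = 5, q = 10 (illustrative values)
total_channels = num_anchors * (num_classes + 1)
print(total_channels)                    # 55
print(channel_index(2, 3, num_classes))  # 25
print(decode_channel(25, num_classes))   # (2, 3)
```

Grouping the $q+1$ class scores of one anchor box into consecutive channels is what later allows the flattened predictions to be reshaped so that the last dimension has size $q+1$.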

Below we define such a class prediction layer, specifying $a$ and $q$ via arguments `num_anchors` and `num_classes`, respectively. This layer uses a $3\times3$ convolutional layer with a padding of 1. The width and height of the input and output of this convolutional layer remain unchanged.

```{.python .input}
%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, gluon, image, init, np, npx
from mxnet.gluon import nn

npx.set_np()

def cls_predictor(num_anchors, num_classes):
    return nn.Conv2D(num_anchors * (num_classes + 1), kernel_size=3,
                     padding=1)
```

```{.python .input}
#@tab pytorch
%matplotlib inline
from d2l import torch as d2l
import torch
import torchvision
from torch import nn
from torch.nn import functional as F

def cls_predictor(num_inputs, num_anchors, num_classes):
    return nn.Conv2d(num_inputs, num_anchors * (num_classes + 1),
                     kernel_size=3, padding=1)
```

### Bounding Box Prediction Layer

The design of the bounding box prediction layer is similar to that of the class prediction layer. The only difference lies in the number of outputs for each anchor box: here we need to predict four offsets rather than $q+1$ classes.

```{.python .input}
def bbox_predictor(num_anchors):
    return nn.Conv2D(num_anchors * 4, kernel_size=3, padding=1)
```

```{.python .input}
#@tab pytorch
def bbox_predictor(num_inputs, num_anchors):
    return nn.Conv2d(num_inputs, num_anchors * 4, kernel_size=3, padding=1)
```

### Concatenating Predictions for Multiple Scales

As we mentioned, single-shot multibox detection uses multiscale feature maps to generate anchor boxes and predict their classes and offsets. At different scales, the shapes of feature maps or the numbers of anchor boxes centered on the same unit may vary. Therefore, shapes of the prediction outputs at different scales may vary.

In the following example, we construct feature maps at two different scales, `Y1` and `Y2`, for the same minibatch, where the height and width of `Y2` are half of those of `Y1`. Let us take class prediction as an example. Suppose that 5 and 3 anchor boxes are generated for every unit in `Y1` and `Y2`, respectively. Suppose further that the number of object classes is 10. For feature maps `Y1` and `Y2` the numbers of channels in the class prediction outputs are $5\times(10+1)=55$ and $3\times(10+1)=33$, respectively, where each output has shape (batch size, number of channels, height, width).

```{.python .input}
def forward(x, block):
    block.initialize()
    return block(x)

Y1 = forward(np.zeros((2, 8, 20, 20)), cls_predictor(5, 10))
Y2 = forward(np.zeros((2, 16, 10, 10)), cls_predictor(3, 10))
Y1.shape, Y2.shape
```

```{.python .input}
#@tab pytorch
def forward(x, block):
    return block(x)

Y1 = forward(torch.zeros((2, 8, 20, 20)), cls_predictor(8, 5, 10))
Y2 = forward(torch.zeros((2, 16, 10, 10)), cls_predictor(16, 3, 10))
Y1.shape, Y2.shape
```

As we can see, except for the batch size dimension, the other three dimensions all have different sizes. To concatenate these two prediction outputs for more efficient computation, we will transform these tensors into a more consistent format.

Note that the channel dimension holds the predictions for anchor boxes with the same center. We first move this dimension to the innermost (last) position. Since the batch size remains the same for different scales, we can transform the prediction output into a two-dimensional tensor with shape (batch size, height $\times$ width $\times$ number of channels). Then we can concatenate such outputs at different scales along dimension 1.

```{.python .input}
def flatten_pred(pred):
    return npx.batch_flatten(pred.transpose(0, 2, 3, 1))

def concat_preds(preds):
    return np.concatenate([flatten_pred(p) for p in preds], axis=1)
```

```{.python .input}
#@tab pytorch
def flatten_pred(pred):
    return torch.flatten(pred.permute(0, 2, 3, 1), start_dim=1)

def concat_preds(preds):
    return torch.cat([flatten_pred(p) for p in preds], dim=1)
```

In this way, even though `Y1` and `Y2` have different sizes in channels, heights, and widths, we can still concatenate these two prediction outputs at two different scales for the same minibatch.

```{.python .input}
#@tab all
concat_preds([Y1, Y2]).shape
```
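As a quick sanity check on the shape arithmetic, the following framework-free sketch computes the length of dimension 1 after flattening and concatenation. The helpers `flattened_length` and `concat_length` are hypothetical names used only for this illustration; they take the (channels, height, width) of each prediction output from the example above.

```python
def flattened_length(channels, height, width):
    """Per-example length after moving channels last and flattening."""
    return height * width * channels


def concat_length(shapes):
    """Length along dimension 1 after concatenating flattened predictions.

    `shapes` holds one (channels, height, width) tuple per scale.
    """
    return sum(flattened_length(c, h, w) for c, h, w in shapes)


# Class prediction outputs for Y1 and Y2 from the example above
shapes = [(55, 20, 20), (33, 10, 10)]
print([flattened_length(*s) for s in shapes])  # [22000, 3300]
print(concat_length(shapes))                   # 25300
```

With a batch size of 2, the concatenated class predictions therefore have shape (2, 25300).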

### Downsampling Block

In order to detect objects at multiple scales, we define the following downsampling block `down_sample_blk` that halves the height and width of input feature maps. In fact, this block applies the design of VGG blocks in :numref:`subsec_vgg-blocks`. More concretely, each downsampling block consists of two $3\times3$ convolutional layers with padding of 1 followed by a $2\times2$ maximum pooling layer with stride of 2. As we know, $3\times3$ convolutional layers with padding of 1 do not change the shape of feature maps. However, the subsequent $2\times2$ maximum pooling reduces the height and width of input feature maps by half. For both input and output feature maps of this downsampling block, because $1\times 2+(3-1)+(3-1)=6$, each unit in the output has a $6\times6$ receptive field on the input. Therefore, the downsampling block enlarges the receptive field of each unit in its output feature maps.

```{.python .input}
def down_sample_blk(num_channels):
    blk = nn.Sequential()
    for _ in range(2):
        blk.add(nn.Conv2D(num_channels, kernel_size=3, padding=1),
                nn.BatchNorm(in_channels=num_channels),
                nn.Activation('relu'))
    blk.add(nn.MaxPool2D(2))
    return blk
```

```{.python .input}
#@tab pytorch
def down_sample_blk(in_channels, out_channels):
    blk = []
    for _ in range(2):
        blk.append(nn.Conv2d(in_channels, out_channels,
                             kernel_size=3, padding=1))
        blk.append(nn.BatchNorm2d(out_channels))
        blk.append(nn.ReLU())
        in_channels = out_channels
    blk.append(nn.MaxPool2d(2))
    return nn.Sequential(*blk)
```
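The receptive-field arithmetic $1\times 2+(3-1)+(3-1)=6$ can be reproduced with a short sketch. The function `receptive_field` below is a hypothetical helper, not part of the original code. It assumes the standard backward recurrence: a span of $r$ units in a layer's output covers $(r-1)\times\textrm{stride} + \textrm{kernel size}$ units of its input. Batch normalization and ReLU act elementwise, so they are left out.

```python
def receptive_field(layers):
    """Receptive field (per side) of one output unit on the block input.

    `layers` lists (kernel_size, stride) pairs in forward order. Walking
    backward from a single output unit, a span of r units in a layer's
    output covers (r - 1) * stride + kernel_size units of its input.
    """
    r = 1
    for kernel_size, stride in reversed(layers):
        r = (r - 1) * stride + kernel_size
    return r


# Two 3x3 convolutions (stride 1) followed by 2x2 max-pooling (stride 2)
down_sample_layers = [(3, 1), (3, 1), (2, 2)]
print(receptive_field(down_sample_layers))  # 6
```

Padding does not appear in the recurrence because it changes the output size, not the number of input units that each output unit sees.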

In the following example, our constructed downsampling block changes the number of input channels and halves the height and width of the input feature maps.

```{.python .input}
forward(np.zeros((2, 3, 20, 20)), down_sample_blk(10)).shape
```

```{.python .input}
#@tab pytorch
forward(torch.zeros((2, 3, 20, 20)), down_sample_blk(3, 10)).shape
```

### Base Network Block

The base network block is used to extract features from input images. For simplicity, we construct a small base network consisting of three downsampling blocks that double the number of channels at each block. Given a $256\times256$ input image, this base network block outputs $32 \times 32$ feature maps ($256/2^3=32$).

```{.python .input}
def base_net():
    blk = nn.Sequential()
    for num_filters in [16, 32, 64]:
        blk.add(down_sample_blk(num_filters))
    return blk

forward(np.zeros((2, 3, 256, 256)), base_net()).shape
```

```{.python .input}
#@tab pytorch
def base_net():
    blk = []
    num_filters = [3, 16, 32, 64]
    for i in range(len(num_filters) - 1):
        blk.append(down_sample_blk(num_filters[i], num_filters[i+1]))
    return nn.Sequential(*blk)

forward(torch.zeros((2, 3, 256, 256)), base_net()).shape
```

### The Complete Model

The complete single shot multibox detection model consists of five blocks. The feature maps produced by each block are used for both (i) generating anchor boxes and (ii) predicting classes and offsets of these anchor boxes. Among these five blocks, the first one is the base network block, the second to the fourth are downsampling blocks, and the last block uses global maximum pooling to reduce both the height and width to 1. Technically, the second to the fifth blocks are all multiscale feature map blocks in :numref:`fig_ssd`.

```{.python .input}
def get_blk(i):
    if i == 0:
        blk = base_net()
    elif i == 4:
        blk = nn.GlobalMaxPool2D()
    else:
        blk = down_sample_blk(128)
    return blk
```

```{.python .input}
#@tab pytorch
def get_blk(i):
    if i == 0:
        blk = base_net()
    elif i == 1:
        blk = down_sample_blk(64, 128)
    elif i == 4:
        blk = nn.AdaptiveMaxPool2d((1,1))
    else:
        blk = down_sample_blk(128, 128)
    return blk
```

Now we define the forward propagation for each block. Unlike in image classification tasks, outputs here include (i) CNN feature maps `Y`, (ii) anchor boxes generated using `Y` at the current scale, and (iii) classes and offsets predicted (based on `Y`) for these anchor boxes.

```{.python .input}
def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):
    Y = blk(X)
    anchors = d2l.multibox_prior(Y, sizes=size, ratios=ratio)
    cls_preds = cls_predictor(Y)
    bbox_preds = bbox_predictor(Y)
    return (Y, anchors, cls_preds, bbox_preds)
```

```{.python .input}
#@tab pytorch
def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):
    Y = blk(X)
    anchors = d2l.multibox_prior(Y, sizes=size, ratios=ratio)
    cls_preds = cls_predictor(Y)
    bbox_preds = bbox_predictor(Y)
    return (Y, anchors, cls_preds, bbox_preds)
```

Recall that in :numref:`fig_ssd` a multiscale feature map block that is closer to the top is for detecting larger objects; thus, it needs to generate larger anchor boxes. In the above forward propagation, at each multiscale feature map block we pass in a list of two scale values via the `sizes` argument of the invoked `multibox_prior` function (described in :numref:`sec_anchor`). In the following, the interval between 0.2 and 1.05 is split evenly into five sections to determine the smaller scale values at the five blocks: 0.2, 0.37, 0.54, 0.71, and 0.88. Then their larger scale values are given by $\sqrt{0.2 \times 0.37} = 0.272$, $\sqrt{0.37 \times 0.54} = 0.447$, and so on.

```{.python .input}
#@tab all
sizes = [[0.2, 0.272], [0.37, 0.447], [0.54, 0.619], [0.71, 0.79],
         [0.88, 0.961]]
ratios = [[1, 2, 0.5]] * 5
num_anchors = len(sizes[0]) + len(ratios[0]) - 1
```
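The scale values above can be derived programmatically instead of being typed by hand. The sketch below uses only the standard library; `make_sizes` and its parameter names are hypothetical and simply encode the rule stated in the text (evenly spaced smaller scales, with each larger scale being the geometric mean of two neighboring smaller scales).

```python
import math


def make_sizes(s_min=0.2, s_max=1.05, num_blocks=5):
    """Return [small, large] scale pairs, one per block.

    The small scales split [s_min, s_max] evenly; each large scale is the
    geometric mean of the current small scale and the next one.
    """
    step = (s_max - s_min) / num_blocks
    small = [s_min + k * step for k in range(num_blocks + 1)]
    return [[round(small[k], 3),
             round(math.sqrt(small[k] * small[k + 1]), 3)]
            for k in range(num_blocks)]


for pair in make_sizes():
    print(pair)
```

Rounded to three decimals, the printed pairs agree with the hard-coded `sizes` list. With two sizes and three ratios, `multibox_prior` generates $2 + 3 - 1 = 4$ anchor boxes per unit, which is the value of `num_anchors`.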

Now we can define the complete model TinySSD as follows.

```{.python .input}
class TinySSD(nn.Block):
    def __init__(self, num_classes, **kwargs):
        super(TinySSD, self).__init__(**kwargs)
        self.num_classes = num_classes
        for i in range(5):
            # Equivalent to the assignment statement `self.blk_i = get_blk(i)`
            setattr(self, f'blk_{i}', get_blk(i))
            setattr(self, f'cls_{i}', cls_predictor(num_anchors, num_classes))
            setattr(self, f'bbox_{i}', bbox_predictor(num_anchors))

    def forward(self, X):
        anchors, cls_preds, bbox_preds = [None] * 5, [None] * 5, [None] * 5
        for i in range(5):
            # Here `getattr(self, f'blk_{i}')` accesses `self.blk_i`
            X, anchors[i], cls_preds[i], bbox_preds[i] = blk_forward(
                X, getattr(self, f'blk_{i}'), sizes[i], ratios[i],
                getattr(self, f'cls_{i}'), getattr(self, f'bbox_{i}'))
        anchors = np.concatenate(anchors, axis=1)
        cls_preds = concat_preds(cls_preds)
        cls_preds = cls_preds.reshape(
            cls_preds.shape[0], -1, self.num_classes + 1)
        bbox_preds = concat_preds(bbox_preds)
        return anchors, cls_preds, bbox_preds
```

```{.python .input}
#@tab pytorch
class TinySSD(nn.Module):
    def __init__(self, num_classes, **kwargs):
        super(TinySSD, self).__init__(**kwargs)
        self.num_classes = num_classes
        idx_to_in_channels = [64, 128, 128, 128, 128]
        for i in range(5):
            # Equivalent to the assignment statement `self.blk_i = get_blk(i)`
            setattr(self, f'blk_{i}', get_blk(i))
            setattr(self, f'cls_{i}', cls_predictor(idx_to_in_channels[i],
                                                    num_anchors, num_classes))
            setattr(self, f'bbox_{i}', bbox_predictor(idx_to_in_channels[i],
                                                      num_anchors))

    def forward(self, X):
        anchors, cls_preds, bbox_preds = [None] * 5, [None] * 5, [None] * 5
        for i in range(5):
            # Here `getattr(self, f'blk_{i}')` accesses `self.blk_i`
            X, anchors[i], cls_preds[i], bbox_preds[i] = blk_forward(
                X, getattr(self, f'blk_{i}'), sizes[i], ratios[i],
                getattr(self, f'cls_{i}'), getattr(self, f'bbox_{i}'))
        anchors = torch.cat(anchors, dim=1)
        cls_preds = concat_preds(cls_preds)
        cls_preds = cls_preds.reshape(
            cls_preds.shape[0], -1, self.num_classes + 1)
        bbox_preds = concat_preds(bbox_preds)
        return anchors, cls_preds, bbox_preds
```

We create a model instance and use it to perform forward propagation on a minibatch of $256 \times 256$ images `X`.

As shown earlier in this section, the first block outputs $32 \times 32$ feature maps. Recall that the second to fourth downsampling blocks halve the height and width and the fifth block uses global pooling. Since 4 anchor boxes are generated for each unit along spatial dimensions of feature maps, at all the five scales a total of $(32^2 + 16^2 + 8^2 + 4^2 + 1)\times 4 = 5444$ anchor boxes are generated for each image.
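The anchor box count can be checked with a few lines of plain Python. The helper `count_anchors` is a hypothetical name introduced for this illustration; it hard-codes the block layout described in the text rather than inspecting an actual model.

```python
def count_anchors(image_size=256, num_anchors=4):
    """Total anchor boxes per image for the five-block model above.

    Assumes the layout described in the text: the base network halves the
    input three times, the next three blocks halve it once each, and the
    last block pools globally to 1 x 1.
    """
    side = image_size // 2 ** 3   # base network output: 32
    sides = [side]
    for _ in range(3):            # three downsampling blocks
        side //= 2
        sides.append(side)
    sides.append(1)               # global max-pooling
    return sides, sum(s * s for s in sides) * num_anchors


sides, total = count_anchors()
print(sides)  # [32, 16, 8, 4, 1]
print(total)  # 5444
```

The same total should appear as the second dimension of the anchor tensor returned by the model.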

```{.python .input}
net = TinySSD(num_classes=1)
net.initialize()
X = np.zeros((32, 3, 256, 256))
anchors, cls_preds, bbox_preds = net(X)

print('output anchors:', anchors.shape)
print('output class preds:', cls_preds.shape)
print('output bbox preds:', bbox_preds.shape)
```

```{.python .input}
#@tab pytorch
net = TinySSD(num_classes=1)
X = torch.zeros((32, 3, 256, 256))
anchors, cls_preds, bbox_preds = net(X)

print('output anchors:', anchors.shape)
print('output class preds:', cls_preds.shape)
print('output bbox preds:', bbox_preds.shape)
```

## Training

Now we will explain how to train the single shot multibox detection model for object detection.

### Reading the Dataset and Initializing the Model

To begin with, let us read the banana detection dataset described in :numref:`sec_object-detection-dataset`.

```{.python .input}
#@tab all
batch_size = 32
train_iter, _ = d2l.load_data_bananas(batch_size)
```

There is only one class in the banana detection dataset. After defining the model, we need to initialize its parameters and define the optimization algorithm.

```{.python .input}
device, net = d2l.try_gpu(), TinySSD(num_classes=1)
net.initialize(init=init.Xavier(), ctx=device)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.2, 'wd': 5e-4})
```

```{.python .input}
#@tab pytorch
device, net = d2l.try_gpu(), TinySSD(num_classes=1)
trainer = torch.optim.SGD(net.parameters(), lr=0.2, weight_decay=5e-4)
```

### Defining Loss and Evaluation Functions

Object detection has two types of losses. The first loss concerns classes of anchor boxes: its computation can simply reuse the cross-entropy loss function that we used for image classification. The second loss concerns offsets of positive (non-background) anchor boxes: this is a regression problem. For this regression problem, however, here we do not use the squared loss described in :numref:`subsec_normal_distribution_and_squared_loss`. Instead, we use the $L_1$ norm loss, the absolute value of the difference between the prediction and the ground-truth. The mask variable `bbox_masks` filters out negative anchor boxes and illegal (padded) anchor boxes in the loss calculation. In the end, we sum up the anchor box class loss and the anchor box offset loss to obtain the loss function for the model.

```{.python .input}
cls_loss = gluon.loss.SoftmaxCrossEntropyLoss()
bbox_loss = gluon.loss.L1Loss()

def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
    cls = cls_loss(cls_preds, cls_labels)
    bbox = bbox_loss(bbox_preds * bbox_masks, bbox_labels * bbox_masks)
    return cls + bbox
```

```{.python .input}
#@tab pytorch
cls_loss = nn.CrossEntropyLoss(reduction='none')
bbox_loss = nn.L1Loss(reduction='none')

def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
    batch_size, num_classes = cls_preds.shape[0], cls_preds.shape[2]
    cls = cls_loss(cls_preds.reshape(-1, num_classes),
                   cls_labels.reshape(-1)).reshape(batch_size, -1).mean(dim=1)
    bbox = bbox_loss(bbox_preds * bbox_masks,
                     bbox_labels * bbox_masks).mean(dim=1)
    return cls + bbox
```
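To see what the mask does, consider the following framework-free sketch of the masked $L_1$ loss for a single example. The function `masked_l1` and the toy numbers are illustrative assumptions. It mimics multiplying both predictions and labels by the mask and then averaging over all offsets, as the PyTorch variant above does with `.mean(dim=1)`.

```python
def masked_l1(preds, labels, masks):
    """Mean absolute error over all offsets, with masked entries zeroed.

    Masked-out offsets contribute 0 to the sum but still count in the
    denominator, because the mean is taken over every offset.
    """
    diffs = [abs(p * m - y * m) for p, y, m in zip(preds, labels, masks)]
    return sum(diffs) / len(diffs)


# Two anchor boxes with four offsets each; the second one is negative
# (background), so its mask entries are 0 and its errors are ignored.
preds = [0.5, -0.5, 1.0, 0.0, 3.0, 3.0, 3.0, 3.0]
labels = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
masks = [1, 1, 1, 1, 0, 0, 0, 0]
print(masked_l1(preds, labels, masks))  # 0.25
```

The large errors of the background anchor box do not affect the result, so only positive anchor boxes drive the offset regression.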

We can use accuracy to evaluate the classification results. Because we use the $L_1$ norm loss for the offsets, we use the *mean absolute error* to evaluate the predicted bounding boxes. These prediction results are obtained from the generated anchor boxes and the predicted offsets for them.

```{.python .input}
def cls_eval(cls_preds, cls_labels):
    # Because the class prediction results are on the final dimension,
    # `argmax` needs to specify this dimension
    return float((cls_preds.argmax(axis=-1).astype(
        cls_labels.dtype) == cls_labels).sum())

def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
    return float((np.abs((bbox_labels - bbox_preds) * bbox_masks)).sum())
```

```{.python .input}
#@tab pytorch
def cls_eval(cls_preds, cls_labels):
    # Because the class prediction results are on the final dimension,
    # `argmax` needs to specify this dimension
    return float((cls_preds.argmax(dim=-1).type(
        cls_labels.dtype) == cls_labels).sum())

def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
    return float((torch.abs((bbox_labels - bbox_preds) * bbox_masks)).sum())
```

### Training the Model

When training the model, we need to generate multiscale anchor boxes (`anchors`) and predict their classes (`cls_preds`) and offsets (`bbox_preds`) in the forward propagation. Then we label the classes (`cls_labels`) and offsets (`bbox_labels`) of such generated anchor boxes based on the label information `Y`. Finally, we calculate the loss function using the predicted and labeled values of the classes and offsets. For brevity, evaluation on the test dataset is omitted here.

```{.python .input}
num_epochs, timer = 20, d2l.Timer()
animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                        legend=['class error', 'bbox mae'])
for epoch in range(num_epochs):
    # Sum of training accuracy, no. of examples in sum of training accuracy,
    # Sum of absolute error, no. of examples in sum of absolute error
    metric = d2l.Accumulator(4)
    for features, target in train_iter:
        timer.start()
        X = features.as_in_ctx(device)
        Y = target.as_in_ctx(device)
        with autograd.record():
            # Generate multiscale anchor boxes and predict their classes and
            # offsets
            anchors, cls_preds, bbox_preds = net(X)
            # Label the classes and offsets of these anchor boxes
            bbox_labels, bbox_masks, cls_labels = d2l.multibox_target(anchors,
                                                                      Y)
            # Calculate the loss function using the predicted and labeled
            # values of the classes and offsets
            l = calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels,
                          bbox_masks)
        l.backward()
        trainer.step(batch_size)
        metric.add(cls_eval(cls_preds, cls_labels), cls_labels.size,
                   bbox_eval(bbox_preds, bbox_labels, bbox_masks),
                   bbox_labels.size)
    cls_err, bbox_mae = 1 - metric[0] / metric[1], metric[2] / metric[3]
    animator.add(epoch + 1, (cls_err, bbox_mae))
print(f'class err {cls_err:.2e}, bbox mae {bbox_mae:.2e}')
print(f'{len(train_iter._dataset) / timer.stop():.1f} examples/sec on '
      f'{str(device)}')
```

```{.python .input}
#@tab pytorch
num_epochs, timer = 20, d2l.Timer()
animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                        legend=['class error', 'bbox mae'])
net = net.to(device)
for epoch in range(num_epochs):
    # Sum of training accuracy, no. of examples in sum of training accuracy,
    # Sum of absolute error, no. of examples in sum of absolute error
    metric = d2l.Accumulator(4)
    net.train()
    for features, target in train_iter:
        timer.start()
        trainer.zero_grad()
        X, Y = features.to(device), target.to(device)
        # Generate multiscale anchor boxes and predict their classes and
        # offsets
        anchors, cls_preds, bbox_preds = net(X)
        # Label the classes and offsets of these anchor boxes
        bbox_labels, bbox_masks, cls_labels = d2l.multibox_target(anchors, Y)
        # Calculate the loss function using the predicted and labeled values
        # of the classes and offsets
        l = calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels,
                      bbox_masks)
        l.mean().backward()
        trainer.step()
        metric.add(cls_eval(cls_preds, cls_labels), cls_labels.numel(),
                   bbox_eval(bbox_preds, bbox_labels, bbox_masks),
                   bbox_labels.numel())
    cls_err, bbox_mae = 1 - metric[0] / metric[1], metric[2] / metric[3]
    animator.add(epoch + 1, (cls_err, bbox_mae))
print(f'class err {cls_err:.2e}, bbox mae {bbox_mae:.2e}')
print(f'{len(train_iter.dataset) / timer.stop():.1f} examples/sec on '
      f'{str(device)}')
```

## Prediction

During prediction, the goal is to detect all the objects of interest on the image. Below we read and resize a test image, converting it to a four-dimensional tensor that is required by convolutional layers.

```{.python .input}
img = image.imread('../img/banana.jpg')
feature = image.imresize(img, 256, 256).astype('float32')
X = np.expand_dims(feature.transpose(2, 0, 1), axis=0)
```

```{.python .input}
#@tab pytorch
X = torchvision.io.read_image('../img/banana.jpg').unsqueeze(0).float()
img = X.squeeze(0).permute(1, 2, 0).long()
```

Using the `multibox_detection` function below, the predicted bounding boxes are obtained from the anchor boxes and their predicted offsets. Then non-maximum suppression is used to remove similar predicted bounding boxes.

```{.python .input}
def predict(X):
    anchors, cls_preds, bbox_preds = net(X.as_in_ctx(device))
    cls_probs = npx.softmax(cls_preds).transpose(0, 2, 1)
    output = d2l.multibox_detection(cls_probs, bbox_preds, anchors)
    idx = [i for i, row in enumerate(output[0]) if row[0] != -1]
    return output[0, idx]

output = predict(X)
```

```{.python .input}
#@tab pytorch
def predict(X):
    net.eval()
    anchors, cls_preds, bbox_preds = net(X.to(device))
    cls_probs = F.softmax(cls_preds, dim=2).permute(0, 2, 1)
    output = d2l.multibox_detection(cls_probs, bbox_preds, anchors)
    idx = [i for i, row in enumerate(output[0]) if row[0] != -1]
    return output[0, idx]

output = predict(X)
```
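For intuition about the non-maximum suppression step, here is a minimal greedy sketch in plain Python. It is a generic illustration and not the implementation inside `d2l.multibox_detection`; the function names `iou` and `nms`, the corner-format boxes, and the threshold of 0.5 are all assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    it does not overlap too much with any box that was already kept."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    for k in order:
        if all(iou(boxes[k], boxes[j]) <= iou_threshold for j in keep):
            keep.append(k)
    return keep


boxes = [(0.10, 0.10, 0.50, 0.50), (0.12, 0.12, 0.52, 0.52),
         (0.60, 0.60, 0.90, 0.90)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps heavily with the higher-scoring first box and is suppressed, while the third box is far away and survives.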

Finally, we display all the predicted bounding boxes with confidence 0.9 or above as the output.

```{.python .input}
def display(img, output, threshold):
    d2l.set_figsize((5, 5))
    fig = d2l.plt.imshow(img.asnumpy())
    for row in output:
        score = float(row[1])
        if score < threshold:
            continue
        h, w = img.shape[0:2]
        bbox = [row[2:6] * np.array((w, h, w, h), ctx=row.ctx)]
        d2l.show_bboxes(fig.axes, bbox, '%.2f' % score, 'w')

display(img, output, threshold=0.9)
```

```{.python .input}
#@tab pytorch
def display(img, output, threshold):
    d2l.set_figsize((5, 5))
    fig = d2l.plt.imshow(img)
    for row in output:
        score = float(row[1])
        if score < threshold:
            continue
        h, w = img.shape[0:2]
        bbox = [row[2:6] * torch.tensor((w, h, w, h), device=row.device)]
        d2l.show_bboxes(fig.axes, bbox, '%.2f' % score, 'w')

display(img, output.cpu(), threshold=0.9)
```

## Summary

* Single shot multibox detection is a multiscale object detection model. Via its base network and several multiscale feature map blocks, single-shot multibox detection generates a varying number of anchor boxes with different sizes, and detects varying-size objects by predicting classes and offsets of these anchor boxes (thus the bounding boxes).
* When training the single-shot multibox detection model, the loss function is calculated based on the predicted and labeled values of the anchor box classes and offsets.
## Exercises

1. Can you improve the single-shot multibox detection by improving the loss function? For example, replace $L_1$ norm loss with smooth $L_1$ norm loss for the predicted offsets. This loss function uses a square function around zero for smoothness, which is controlled by the hyperparameter $\sigma$:

$$
f(x) =
    \begin{cases}
    (\sigma x)^2/2,& \text{if }|x| < 1/\sigma^2\\
    |x|-0.5/\sigma^2,& \text{otherwise}
    \end{cases}
$$

When $\sigma$ is very large, this loss is similar to the $L_1$ norm loss. When its value is smaller, the loss function is smoother.

```{.python .input}
sigmas = [10, 1, 0.5]
lines = ['-', '--', '-.']
x = np.arange(-2, 2, 0.1)
d2l.set_figsize()

for l, s in zip(lines, sigmas):
    y = npx.smooth_l1(x, scalar=s)
    d2l.plt.plot(x.asnumpy(), y.asnumpy(), l, label='sigma=%.1f' % s)
d2l.plt.legend();
```

```{.python .input}
#@tab pytorch
def smooth_l1(data, scalar):
    out = []
    for i in data:
        if abs(i) < 1 / (scalar ** 2):
            out.append(((scalar * i) ** 2) / 2)
        else:
            out.append(abs(i) - 0.5 / (scalar ** 2))
    return torch.tensor(out)

sigmas = [10, 1, 0.5]
lines = ['-', '--', '-.']
x = torch.arange(-2, 2, 0.1)
d2l.set_figsize()

for l, s in zip(lines, sigmas):
    y = smooth_l1(x, scalar=s)
    d2l.plt.plot(x, y, l, label='sigma=%.1f' % s)
d2l.plt.legend();
```

Besides, in the experiment we used cross-entropy loss for class prediction: denoting by $p_j$ the predicted probability for the ground-truth class $j$, the cross-entropy loss is $-\log p_j$. We can also use the focal loss :cite:`Lin.Goyal.Girshick.ea.2017`: given hyperparameters $\gamma > 0$ and $\alpha > 0$, this loss is defined as:

$$ - \alpha (1-p_j)^{\gamma} \log p_j.$$

As we can see, increasing $\gamma$ can effectively reduce the relative loss for well-classified examples (e.g., $p_j > 0.5$) so the training can focus more on those difficult examples that are misclassified.

```{.python .input}
def focal_loss(gamma, x):
    return -(1 - x) ** gamma * np.log(x)

x = np.arange(0.01, 1, 0.01)
for l, gamma in zip(lines, [0, 1, 5]):
    y = d2l.plt.plot(x.asnumpy(), focal_loss(gamma, x).asnumpy(), l,
                     label='gamma=%.1f' % gamma)
d2l.plt.legend();
```

```{.python .input}
#@tab pytorch
def focal_loss(gamma, x):
    return -(1 - x) ** gamma * torch.log(x)

x = torch.arange(0.01, 1, 0.01)
for l, gamma in zip(lines, [0, 1, 5]):
    y = d2l.plt.plot(x, focal_loss(gamma, x), l, label='gamma=%.1f' % gamma)
d2l.plt.legend();
```

2. Due to space limitations, we have omitted some implementation details of the single shot multibox detection model in this section. Can you further improve the model in the following aspects:
    1. When an object is much smaller than the image, the model could enlarge the input image.
    1. There are typically a vast number of negative anchor boxes. To make the class distribution more balanced, we could downsample negative anchor boxes.
    1. In the loss function, assign different weight hyperparameters to the class loss and the offset loss.
    1. Use other methods to evaluate the object detection model, such as those in the single shot multibox detection paper :cite:`Liu.Anguelov.Erhan.ea.2016`.

:begin_tab:`mxnet`
Discussions
:end_tab:

:begin_tab:`pytorch`
Discussions
:end_tab: