# Anchor Boxes
:label:`sec_anchor`

Object detection algorithms usually sample a large number of regions in the input image, determine whether these regions contain objects of interest, and adjust the boundaries of the regions so as to predict the ground-truth bounding boxes of the objects more accurately. Different models may adopt different region sampling schemes. Here we introduce one such method: it generates multiple bounding boxes with varying scales and aspect ratios centered on each pixel. These bounding boxes are called *anchor boxes*. We will design an object detection model based on anchor boxes in :numref:`sec_ssd`.

First, let us adjust the printing precision just to get more concise outputs.

```{.python .input}
%matplotlib inline
from d2l import mxnet as d2l
from mxnet import gluon, image, np, npx

np.set_printoptions(2)  # Simplify printing accuracy
npx.set_np()
```

```{.python .input}
#@tab pytorch
%matplotlib inline
from d2l import torch as d2l
import torch

torch.set_printoptions(2)  # Simplify printing accuracy
```

## Generating Multiple Anchor Boxes

Suppose that the input image has a height of $h$ and a width of $w$. We generate anchor boxes with different shapes centered on each pixel of the image. Let the scale be $s\in (0, 1]$ and the aspect ratio (ratio of width to height) be $r > 0$. Then the width and height of the anchor box are $ws\sqrt{r}$ and $hs/\sqrt{r}$, respectively. Note that when the center position is given, an anchor box with known width and height is determined.

To generate multiple anchor boxes with different shapes, let us set a series of scales $s_1,\ldots, s_n$ and a series of aspect ratios $r_1,\ldots, r_m$. When using all the combinations of these scales and aspect ratios with each pixel as the center, the input image will have a total of $whnm$ anchor boxes. Although these anchor boxes may cover all the ground-truth bounding boxes, the computational complexity easily becomes too high. In practice, we consider only those combinations containing $s_1$ or $r_1$:

$$(s_1, r_1), (s_1, r_2), \ldots, (s_1, r_m), (s_2, r_1), (s_3, r_1), \ldots, (s_n, r_1).$$

That is to say, the number of anchor boxes centered on the same pixel is $n+m-1$. For the entire input image, we will generate a total of $wh(n+m-1)$ anchor boxes.
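
As a quick illustration, the enumeration above can be written out directly. The following framework-free sketch (using the scale and ratio values we will adopt later in this section) lists the combinations containing $s_1$ or $r_1$ and checks that there are $n+m-1$ of them.

```python
# Enumerate the scale-ratio combinations that contain s_1 or r_1 and check
# that their number equals n + m - 1. Values are just an example.
sizes = [0.75, 0.5, 0.25]   # s_1, ..., s_n  (n = 3)
ratios = [1, 2, 0.5]        # r_1, ..., r_m  (m = 3)

combinations = [(sizes[0], r) for r in ratios]       # (s_1, r_1), ..., (s_1, r_m)
combinations += [(s, ratios[0]) for s in sizes[1:]]  # (s_2, r_1), ..., (s_n, r_1)

print(combinations)
print(len(combinations) == len(sizes) + len(ratios) - 1)  # True: n + m - 1 = 5
```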

The above method of generating anchor boxes is implemented in the following `multibox_prior` function. We specify the input image, a list of scales, and a list of aspect ratios; then this function will return all the anchor boxes.

```{.python .input}
#@save
def multibox_prior(data, sizes, ratios):
    """Generate anchor boxes with different shapes centered on each pixel."""
    in_height, in_width = data.shape[-2:]
    device, num_sizes, num_ratios = data.ctx, len(sizes), len(ratios)
    boxes_per_pixel = (num_sizes + num_ratios - 1)
    size_tensor = d2l.tensor(sizes, ctx=device)
    ratio_tensor = d2l.tensor(ratios, ctx=device)
    # Offsets are required to move the anchor to the center of a pixel. Since
    # a pixel has height=1 and width=1, we choose to offset our centers by 0.5
    offset_h, offset_w = 0.5, 0.5
    steps_h = 1.0 / in_height  # Scaled steps in y-axis
    steps_w = 1.0 / in_width  # Scaled steps in x-axis

    # Generate all center points for the anchor boxes
    center_h = (d2l.arange(in_height, ctx=device) + offset_h) * steps_h
    center_w = (d2l.arange(in_width, ctx=device) + offset_w) * steps_w
    shift_x, shift_y = d2l.meshgrid(center_w, center_h)
    shift_x, shift_y = shift_x.reshape(-1), shift_y.reshape(-1)

    # Generate `boxes_per_pixel` number of heights and widths that are later
    # used to create anchor box corner coordinates (xmin, ymin, xmax, ymax)
    w = np.concatenate((size_tensor * np.sqrt(ratio_tensor[0]),
                        sizes[0] * np.sqrt(ratio_tensor[1:]))) \
                        * in_height / in_width  # Handle rectangular inputs
    h = np.concatenate((size_tensor / np.sqrt(ratio_tensor[0]),
                        sizes[0] / np.sqrt(ratio_tensor[1:])))
    # Divide by 2 to get half height and half width
    anchor_manipulations = np.tile(np.stack((-w, -h, w, h)).T,
                                   (in_height * in_width, 1)) / 2
    # Each center point will have `boxes_per_pixel` number of anchor boxes, so
    # generate a grid of all anchor box centers with `boxes_per_pixel` repeats
    out_grid = d2l.stack([shift_x, shift_y, shift_x, shift_y],
                         axis=1).repeat(boxes_per_pixel, axis=0)
    output = out_grid + anchor_manipulations
    return np.expand_dims(output, axis=0)
```
```{.python .input}
#@tab pytorch
#@save
def multibox_prior(data, sizes, ratios):
    """Generate anchor boxes with different shapes centered on each pixel."""
    in_height, in_width = data.shape[-2:]
    device, num_sizes, num_ratios = data.device, len(sizes), len(ratios)
    boxes_per_pixel = (num_sizes + num_ratios - 1)
    size_tensor = d2l.tensor(sizes, device=device)
    ratio_tensor = d2l.tensor(ratios, device=device)
    # Offsets are required to move the anchor to the center of a pixel. Since
    # a pixel has height=1 and width=1, we choose to offset our centers by 0.5
    offset_h, offset_w = 0.5, 0.5
    steps_h = 1.0 / in_height  # Scaled steps in y axis
    steps_w = 1.0 / in_width  # Scaled steps in x axis

    # Generate all center points for the anchor boxes
    center_h = (torch.arange(in_height, device=device) + offset_h) * steps_h
    center_w = (torch.arange(in_width, device=device) + offset_w) * steps_w
    shift_y, shift_x = torch.meshgrid(center_h, center_w)
    shift_y, shift_x = shift_y.reshape(-1), shift_x.reshape(-1)

    # Generate `boxes_per_pixel` number of heights and widths that are later
    # used to create anchor box corner coordinates (xmin, ymin, xmax, ymax)
    w = torch.cat((size_tensor * torch.sqrt(ratio_tensor[0]),
                   sizes[0] * torch.sqrt(ratio_tensor[1:])))\
                   * in_height / in_width  # Handle rectangular inputs
    h = torch.cat((size_tensor / torch.sqrt(ratio_tensor[0]),
                   sizes[0] / torch.sqrt(ratio_tensor[1:])))
    # Divide by 2 to get half height and half width
    anchor_manipulations = torch.stack((-w, -h, w, h)).T.repeat(
                                        in_height * in_width, 1) / 2
    # Each center point will have `boxes_per_pixel` number of anchor boxes, so
    # generate a grid of all anchor box centers with `boxes_per_pixel` repeats
    out_grid = torch.stack([shift_x, shift_y, shift_x, shift_y],
                           dim=1).repeat_interleave(boxes_per_pixel, dim=0)
    output = out_grid + anchor_manipulations
    return output.unsqueeze(0)
```

We can see that the shape of the returned anchor box variable `Y` is (batch size, number of anchor boxes, 4).

```{.python .input}
img = image.imread('../img/catdog.jpg').asnumpy()
h, w = img.shape[:2]

print(h, w)
X = np.random.uniform(size=(1, 3, h, w))  # Construct input data
Y = multibox_prior(X, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
Y.shape
```

```{.python .input}
#@tab pytorch
img = d2l.plt.imread('../img/catdog.jpg')
h, w = img.shape[:2]

print(h, w)
X = torch.rand(size=(1, 3, h, w))  # Construct input data
Y = multibox_prior(X, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
Y.shape
```
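
As a sanity check, the number of anchor boxes reported above should equal $hw(n+m-1)$. The following framework-free sketch recomputes this count; the image size is only an example, since in the code above `h` and `w` come from `img.shape[:2]`.

```python
# Recompute the expected number of anchor boxes, h * w * (n + m - 1).
# The image size below is only an example; use the h, w printed above.
h, w = 561, 728
sizes = [0.75, 0.5, 0.25]  # n = 3
ratios = [1, 2, 0.5]       # m = 3
print(h * w * (len(sizes) + len(ratios) - 1))  # expected number of anchor boxes
```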

After changing the shape of the anchor box variable `Y` to (image height, image width, number of anchor boxes centered on the same pixel, 4), we can obtain all the anchor boxes centered on a specified pixel position. In the following, we access the first anchor box centered on (250, 250). It has four elements: the $(x, y)$-axis coordinates at the upper-left corner and the $(x, y)$-axis coordinates at the lower-right corner of the anchor box. The coordinate values of both axes are divided by the width and height of the image, respectively; thus, the range is between 0 and 1.

```{.python .input}
#@tab all
boxes = Y.reshape(h, w, 5, 4)
boxes[250, 250, 0, :]
```

In order to show all the anchor boxes centered on one pixel in the image, we define the following `show_bboxes` function to draw multiple bounding boxes on the image.

```{.python .input}
#@tab all
#@save
def show_bboxes(axes, bboxes, labels=None, colors=None):
    """Show bounding boxes."""
    def _make_list(obj, default_values=None):
        if obj is None:
            obj = default_values
        elif not isinstance(obj, (list, tuple)):
            obj = [obj]
        return obj
    labels = _make_list(labels)
    colors = _make_list(colors, ['b', 'g', 'r', 'm', 'c'])
    for i, bbox in enumerate(bboxes):
        color = colors[i % len(colors)]
        rect = d2l.bbox_to_rect(d2l.numpy(bbox), color)
        axes.add_patch(rect)
        if labels and len(labels) > i:
            text_color = 'k' if color == 'w' else 'w'
            axes.text(rect.xy[0], rect.xy[1], labels[i],
                      va='center', ha='center', fontsize=9, color=text_color,
                      bbox=dict(facecolor=color, lw=0))
```

As we just saw, the coordinate values of the $x$ and $y$ axes in the variable `boxes` have been divided by the width and height of the image, respectively. When drawing anchor boxes, we need to restore their original coordinate values; thus, we define the variable `bbox_scale` below. Now, we can draw all the anchor boxes centered on (250, 250) in the image. As you can see, the blue anchor box with a scale of 0.75 and an aspect ratio of 1 surrounds the dog in the image well.

```{.python .input}
#@tab all
d2l.set_figsize()
bbox_scale = d2l.tensor((w, h, w, h))
fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, boxes[250, 250, :, :] * bbox_scale,
            ['s=0.75, r=1', 's=0.5, r=1', 's=0.25, r=1', 's=0.75, r=2',
             's=0.75, r=0.5'])
```

## Intersection over Union (IoU)

We just mentioned that an anchor box "well" surrounds the dog in the image. If the ground-truth bounding box of the object is known, how can "well" here be quantified? Intuitively, we can measure the similarity between the anchor box and the ground-truth bounding box. We know that the *Jaccard index* can measure the similarity between two sets. Given sets $\mathcal{A}$ and $\mathcal{B}$, their Jaccard index is the size of their intersection divided by the size of their union:

$$J(\mathcal{A},\mathcal{B}) = \frac{\left|\mathcal{A} \cap \mathcal{B}\right|}{\left| \mathcal{A} \cup \mathcal{B}\right|}.$$

In fact, we can consider the pixel area of any bounding box as a set of pixels. In this way, we can measure the similarity of two bounding boxes by the Jaccard index of their pixel sets. For two bounding boxes, we usually refer to their Jaccard index as *intersection over union* (*IoU*), which is the ratio of their intersection area to their union area, as shown in :numref:`fig_iou`. The range of an IoU is between 0 and 1: 0 means that two bounding boxes do not overlap at all, while 1 indicates that the two bounding boxes are equal.

![IoU is the ratio of the intersection area to the union area of two bounding boxes.](/uploads/projects/d2l-ai-CN/img/iou.svg)
:label:`fig_iou`

For the remainder of this section, we will use IoU to measure the similarity between anchor boxes and ground-truth bounding boxes, and between different anchor boxes. Given two lists of anchor or bounding boxes, the following `box_iou` computes their pairwise IoU across these two lists.
```{.python .input}
#@save
def box_iou(boxes1, boxes2):
    """Compute pairwise IoU across two lists of anchor or bounding boxes."""
    box_area = lambda boxes: ((boxes[:, 2] - boxes[:, 0]) *
                              (boxes[:, 3] - boxes[:, 1]))
    # Shape of `boxes1`, `boxes2`, `areas1`, `areas2`: (no. of boxes1, 4),
    # (no. of boxes2, 4), (no. of boxes1,), (no. of boxes2,)
    areas1 = box_area(boxes1)
    areas2 = box_area(boxes2)
    # Shape of `inter_upperlefts`, `inter_lowerrights`, `inters`: (no. of
    # boxes1, no. of boxes2, 2)
    inter_upperlefts = np.maximum(boxes1[:, None, :2], boxes2[:, :2])
    inter_lowerrights = np.minimum(boxes1[:, None, 2:], boxes2[:, 2:])
    inters = (inter_lowerrights - inter_upperlefts).clip(min=0)
    # Shape of `inter_areas` and `union_areas`: (no. of boxes1, no. of boxes2)
    inter_areas = inters[:, :, 0] * inters[:, :, 1]
    union_areas = areas1[:, None] + areas2 - inter_areas
    return inter_areas / union_areas
```

```{.python .input}
#@tab pytorch
#@save
def box_iou(boxes1, boxes2):
    """Compute pairwise IoU across two lists of anchor or bounding boxes."""
    box_area = lambda boxes: ((boxes[:, 2] - boxes[:, 0]) *
                              (boxes[:, 3] - boxes[:, 1]))
    # Shape of `boxes1`, `boxes2`, `areas1`, `areas2`: (no. of boxes1, 4),
    # (no. of boxes2, 4), (no. of boxes1,), (no. of boxes2,)
    areas1 = box_area(boxes1)
    areas2 = box_area(boxes2)
    # Shape of `inter_upperlefts`, `inter_lowerrights`, `inters`: (no. of
    # boxes1, no. of boxes2, 2)
    inter_upperlefts = torch.max(boxes1[:, None, :2], boxes2[:, :2])
    inter_lowerrights = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])
    inters = (inter_lowerrights - inter_upperlefts).clamp(min=0)
    # Shape of `inter_areas` and `union_areas`: (no. of boxes1, no. of boxes2)
    inter_areas = inters[:, :, 0] * inters[:, :, 1]
    union_areas = areas1[:, None] + areas2 - inter_areas
    return inter_areas / union_areas
```
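
To make the computation concrete, here is a framework-free sanity check of the IoU for a single pair of corner-format boxes; the two boxes are made up for illustration.

```python
# Compute the IoU of two corner-format boxes (xmin, ymin, xmax, ymax) by hand.
def iou(box1, box2):
    # Intersection rectangle (clipped at zero if the boxes do not overlap)
    inter_w = max(0.0, min(box1[2], box2[2]) - max(box1[0], box2[0]))
    inter_h = max(0.0, min(box1[3], box2[3]) - max(box1[1], box2[1]))
    inter = inter_w * inter_h
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)

# Two partially overlapping 2x2 squares: intersection 1, union 7, IoU = 1/7
print(iou([0, 0, 2, 2], [1, 1, 3, 3]))  # 0.142857...
```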
## Labeling Anchor Boxes in Training Data
:label:`subsec_labeling-anchor-boxes`

In a training dataset, we consider each anchor box as a training example. In order to train an object detection model, we need *class* and *offset* labels for each anchor box, where the former is the class of the object relevant to the anchor box and the latter is the offset of the ground-truth bounding box relative to the anchor box. During the prediction, for each image we generate multiple anchor boxes, predict classes and offsets for all the anchor boxes, adjust their positions according to the predicted offsets to obtain the predicted bounding boxes, and finally only output those predicted bounding boxes that satisfy certain criteria.

As we know, an object detection training set comes with labels for locations of *ground-truth bounding boxes* and classes of their surrounded objects. To label any generated *anchor box*, we refer to the labeled location and class of its *assigned* ground-truth bounding box that is closest to the anchor box. In the following, we describe an algorithm for assigning the closest ground-truth bounding boxes to anchor boxes.

### Assigning Ground-Truth Bounding Boxes to Anchor Boxes

Given an image, suppose that the anchor boxes are $A_1, A_2, \ldots, A_{n_a}$ and the ground-truth bounding boxes are $B_1, B_2, \ldots, B_{n_b}$, where $n_a \geq n_b$. Let us define a matrix $\mathbf{X} \in \mathbb{R}^{n_a \times n_b}$, whose element $x_{ij}$ in the $i^\mathrm{th}$ row and $j^\mathrm{th}$ column is the IoU of the anchor box $A_i$ and the ground-truth bounding box $B_j$. The algorithm consists of the following steps:

1. Find the largest element in matrix $\mathbf{X}$ and denote its row and column indices as $i_1$ and $j_1$, respectively. Then the ground-truth bounding box $B_{j_1}$ is assigned to the anchor box $A_{i_1}$. This is quite intuitive because $A_{i_1}$ and $B_{j_1}$ are the closest among all the pairs of anchor boxes and ground-truth bounding boxes. After the first assignment, discard all the elements in the ${i_1}^\mathrm{th}$ row and the ${j_1}^\mathrm{th}$ column in matrix $\mathbf{X}$.
1. Find the largest of the remaining elements in matrix $\mathbf{X}$ and denote its row and column indices as $i_2$ and $j_2$, respectively. We assign ground-truth bounding box $B_{j_2}$ to anchor box $A_{i_2}$ and discard all the elements in the ${i_2}^\mathrm{th}$ row and the ${j_2}^\mathrm{th}$ column in matrix $\mathbf{X}$.
1. At this point, elements in two rows and two columns in matrix $\mathbf{X}$ have been discarded. We proceed until all elements in the $n_b$ columns in matrix $\mathbf{X}$ are discarded. At this time, we have assigned a ground-truth bounding box to each of $n_b$ anchor boxes.
1. Only traverse through the remaining $n_a - n_b$ anchor boxes. For example, given any anchor box $A_i$, find the ground-truth bounding box $B_j$ with the largest IoU with $A_i$ throughout the $i^\mathrm{th}$ row of matrix $\mathbf{X}$, and assign $B_j$ to $A_i$ only if this IoU is greater than a predefined threshold.

Let us illustrate the above algorithm using a concrete example. As shown in :numref:`fig_anchor_label` (left), assuming that the maximum value in matrix $\mathbf{X}$ is $x_{23}$, we assign the ground-truth bounding box $B_3$ to the anchor box $A_2$. Then, we discard all the elements in row 2 and column 3 of the matrix, find the largest $x_{71}$ in the remaining elements (shaded area), and assign the ground-truth bounding box $B_1$ to the anchor box $A_7$. Next, as shown in :numref:`fig_anchor_label` (middle), discard all the elements in row 7 and column 1 of the matrix, find the largest $x_{54}$ in the remaining elements (shaded area), and assign the ground-truth bounding box $B_4$ to the anchor box $A_5$. Finally, as shown in :numref:`fig_anchor_label` (right), discard all the elements in row 5 and column 4 of the matrix, find the largest $x_{92}$ in the remaining elements (shaded area), and assign the ground-truth bounding box $B_2$ to the anchor box $A_9$. After that, we only need to traverse through the remaining anchor boxes $A_1, A_3, A_4, A_6, A_8$ and determine whether to assign them ground-truth bounding boxes according to the threshold.

![Assigning ground-truth bounding boxes to anchor boxes.](/uploads/projects/d2l-ai-CN/img/anchor-label.svg)
:label:`fig_anchor_label`

This algorithm is implemented in the following `assign_anchor_to_bbox` function.
```{.python .input}
#@save
def assign_anchor_to_bbox(ground_truth, anchors, device, iou_threshold=0.5):
    """Assign closest ground-truth bounding boxes to anchor boxes."""
    num_anchors, num_gt_boxes = anchors.shape[0], ground_truth.shape[0]
    # Element x_ij in the i-th row and j-th column is the IoU of the anchor
    # box i and the ground-truth bounding box j
    jaccard = box_iou(anchors, ground_truth)
    # Initialize the tensor to hold the assigned ground-truth bounding box for
    # each anchor
    anchors_bbox_map = np.full((num_anchors,), -1, dtype=np.int32, ctx=device)
    # Assign ground-truth bounding boxes according to the threshold
    max_ious, indices = np.max(jaccard, axis=1), np.argmax(jaccard, axis=1)
    anc_i = np.nonzero(max_ious >= iou_threshold)[0]
    box_j = indices[max_ious >= iou_threshold]
    anchors_bbox_map[anc_i] = box_j
    col_discard = np.full((num_anchors,), -1)
    row_discard = np.full((num_gt_boxes,), -1)
    for _ in range(num_gt_boxes):
        max_idx = np.argmax(jaccard)  # Find the largest IoU
        box_idx = (max_idx % num_gt_boxes).astype('int32')
        anc_idx = (max_idx / num_gt_boxes).astype('int32')
        anchors_bbox_map[anc_idx] = box_idx
        jaccard[:, box_idx] = col_discard
        jaccard[anc_idx, :] = row_discard
    return anchors_bbox_map
```

```{.python .input}
#@tab pytorch
#@save
def assign_anchor_to_bbox(ground_truth, anchors, device, iou_threshold=0.5):
    """Assign closest ground-truth bounding boxes to anchor boxes."""
    num_anchors, num_gt_boxes = anchors.shape[0], ground_truth.shape[0]
    # Element x_ij in the i-th row and j-th column is the IoU of the anchor
    # box i and the ground-truth bounding box j
    jaccard = box_iou(anchors, ground_truth)
    # Initialize the tensor to hold the assigned ground-truth bounding box for
    # each anchor
    anchors_bbox_map = torch.full((num_anchors,), -1, dtype=torch.long,
                                  device=device)
    # Assign ground-truth bounding boxes according to the threshold
    max_ious, indices = torch.max(jaccard, dim=1)
    anc_i = torch.nonzero(max_ious >= iou_threshold).reshape(-1)
    box_j = indices[max_ious >= iou_threshold]
    anchors_bbox_map[anc_i] = box_j
    col_discard = torch.full((num_anchors,), -1)
    row_discard = torch.full((num_gt_boxes,), -1)
    for _ in range(num_gt_boxes):
        max_idx = torch.argmax(jaccard)  # Find the largest IoU
        box_idx = (max_idx % num_gt_boxes).long()
        anc_idx = (max_idx / num_gt_boxes).long()
        anchors_bbox_map[anc_idx] = box_idx
        jaccard[:, box_idx] = col_discard
        jaccard[anc_idx, :] = row_discard
    return anchors_bbox_map
```
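
To see the greedy assignment in action without any framework code, the following sketch runs the same procedure on a small hand-written IoU matrix (all values are made up for illustration).

```python
# Greedy assignment on a toy IoU matrix: rows are anchor boxes, columns are
# ground-truth bounding boxes. Values are made up for illustration.
iou_matrix = [
    [0.10, 0.65, 0.05],   # A_1
    [0.90, 0.20, 0.30],   # A_2
    [0.05, 0.15, 0.75],   # A_3
    [0.40, 0.10, 0.20],   # A_4
]
num_anchors, num_gt = len(iou_matrix), len(iou_matrix[0])
assignment = [-1] * num_anchors  # -1 means "no ground truth assigned yet"

# Steps 1-3: repeatedly pick the globally largest remaining IoU and discard
# its row and column
remaining = [row[:] for row in iou_matrix]
for _ in range(num_gt):
    i, j = max(((i, j) for i in range(num_anchors) for j in range(num_gt)),
               key=lambda ij: remaining[ij[0]][ij[1]])
    assignment[i] = j
    for jj in range(num_gt):
        remaining[i][jj] = -1.0
    for ii in range(num_anchors):
        remaining[ii][j] = -1.0

# Step 4: a remaining anchor gets a ground truth only if the largest IoU in
# its own row exceeds a threshold
iou_threshold = 0.5
for i in range(num_anchors):
    if assignment[i] == -1:
        j = max(range(num_gt), key=lambda j: iou_matrix[i][j])
        if iou_matrix[i][j] >= iou_threshold:
            assignment[i] = j

print(assignment)  # [1, 0, 2, -1]
```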
### Labeling Classes and Offsets

Now we can label the class and offset for each anchor box. Suppose that an anchor box $A$ is assigned a ground-truth bounding box $B$. On one hand, the class of the anchor box $A$ will be labeled as that of $B$. On the other hand, the offset of the anchor box $A$ will be labeled according to the relative position between the central coordinates of $B$ and $A$ together with the relative size between these two boxes. Given varying positions and sizes of different boxes in the dataset, we can apply transformations to those relative positions and sizes that may lead to more uniformly distributed offsets that are easier to fit.

Here we describe a common transformation. Given the central coordinates of $A$ and $B$ as $(x_a, y_a)$ and $(x_b, y_b)$, their widths as $w_a$ and $w_b$, and their heights as $h_a$ and $h_b$, respectively, we may label the offset of $A$ as

$$\left( \frac{ \frac{x_b - x_a}{w_a} - \mu_x }{\sigma_x},
\frac{ \frac{y_b - y_a}{h_a} - \mu_y }{\sigma_y},
\frac{ \log \frac{w_b}{w_a} - \mu_w }{\sigma_w},
\frac{ \log \frac{h_b}{h_a} - \mu_h }{\sigma_h}\right),$$

where default values of the constants are $\mu_x = \mu_y = \mu_w = \mu_h = 0, \sigma_x=\sigma_y=0.1$, and $\sigma_w=\sigma_h=0.2$. With these defaults, $1/\sigma_x = 1/\sigma_y = 10$ and $1/\sigma_w = 1/\sigma_h = 5$, which explains the constants 10 and 5 in the code. This transformation is implemented below in the `offset_boxes` function.

```{.python .input}
#@tab all
#@save
def offset_boxes(anchors, assigned_bb, eps=1e-6):
    """Transform for anchor box offsets."""
    c_anc = d2l.box_corner_to_center(anchors)
    c_assigned_bb = d2l.box_corner_to_center(assigned_bb)
    offset_xy = 10 * (c_assigned_bb[:, :2] - c_anc[:, :2]) / c_anc[:, 2:]
    offset_wh = 5 * d2l.log(eps + c_assigned_bb[:, 2:] / c_anc[:, 2:])
    offset = d2l.concat([offset_xy, offset_wh], axis=1)
    return offset
```
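
As a concrete check of this transformation, the following framework-free sketch computes the labeled offset for one anchor box and its assigned ground-truth bounding box in center format; the coordinates are made up for illustration and the small `eps` term is omitted.

```python
# Offset transformation for a single anchor/ground-truth pair with the
# default constants mu = 0, sigma_x = sigma_y = 0.1, sigma_w = sigma_h = 0.2
# (hence the factors 10 and 5). The coordinates are made up for illustration.
import math

# Center-format boxes: (center x, center y, width, height)
xa, ya, wa, ha = 0.30, 0.40, 0.20, 0.40   # anchor box A
xb, yb, wb, hb = 0.32, 0.35, 0.25, 0.50   # assigned ground-truth box B

offset = (10 * (xb - xa) / wa,
          10 * (yb - ya) / ha,
          5 * math.log(wb / wa),
          5 * math.log(hb / ha))
print([round(o, 3) for o in offset])  # [1.0, -1.25, 1.116, 1.116]
```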

If an anchor box is not assigned a ground-truth bounding box, we just label the class of the anchor box as "background". Anchor boxes whose classes are background are often referred to as *negative* anchor boxes, and the rest are called *positive* anchor boxes. We implement the following `multibox_target` function to label classes and offsets for anchor boxes (the `anchors` argument) using ground-truth bounding boxes (the `labels` argument). This function sets the background class to zero and increments the integer index of a new class by one.

```{.python .input}
#@save
def multibox_target(anchors, labels):
    """Label anchor boxes using ground-truth bounding boxes."""
    batch_size, anchors = labels.shape[0], anchors.squeeze(0)
    batch_offset, batch_mask, batch_class_labels = [], [], []
    device, num_anchors = anchors.ctx, anchors.shape[0]
    for i in range(batch_size):
        label = labels[i, :, :]
        anchors_bbox_map = assign_anchor_to_bbox(
            label[:, 1:], anchors, device)
        bbox_mask = np.tile((np.expand_dims((anchors_bbox_map >= 0),
                                            axis=-1)), (1, 4)).astype('int32')
        # Initialize class labels and assigned bounding box coordinates with
        # zeros
        class_labels = d2l.zeros(num_anchors, dtype=np.int32, ctx=device)
        assigned_bb = d2l.zeros((num_anchors, 4), dtype=np.float32,
                                ctx=device)
        # Label classes of anchor boxes using their assigned ground-truth
        # bounding boxes. If an anchor box is not assigned any, we label its
        # class as background (the value remains zero)
        indices_true = np.nonzero(anchors_bbox_map >= 0)[0]
        bb_idx = anchors_bbox_map[indices_true]
        class_labels[indices_true] = label[bb_idx, 0].astype('int32') + 1
        assigned_bb[indices_true] = label[bb_idx, 1:]
        # Offset transformation
        offset = offset_boxes(anchors, assigned_bb) * bbox_mask
        batch_offset.append(offset.reshape(-1))
        batch_mask.append(bbox_mask.reshape(-1))
        batch_class_labels.append(class_labels)
    bbox_offset = d2l.stack(batch_offset)
    bbox_mask = d2l.stack(batch_mask)
    class_labels = d2l.stack(batch_class_labels)
    return (bbox_offset, bbox_mask, class_labels)
```
```{.python .input}
#@tab pytorch
#@save
def multibox_target(anchors, labels):
    """Label anchor boxes using ground-truth bounding boxes."""
    batch_size, anchors = labels.shape[0], anchors.squeeze(0)
    batch_offset, batch_mask, batch_class_labels = [], [], []
    device, num_anchors = anchors.device, anchors.shape[0]
    for i in range(batch_size):
        label = labels[i, :, :]
        anchors_bbox_map = assign_anchor_to_bbox(
            label[:, 1:], anchors, device)
        bbox_mask = ((anchors_bbox_map >= 0).float().unsqueeze(-1)).repeat(
            1, 4)
        # Initialize class labels and assigned bounding box coordinates with
        # zeros
        class_labels = torch.zeros(num_anchors, dtype=torch.long,
                                   device=device)
        assigned_bb = torch.zeros((num_anchors, 4), dtype=torch.float32,
                                  device=device)
        # Label classes of anchor boxes using their assigned ground-truth
        # bounding boxes. If an anchor box is not assigned any, we label its
        # class as background (the value remains zero)
        indices_true = torch.nonzero(anchors_bbox_map >= 0)
        bb_idx = anchors_bbox_map[indices_true]
        class_labels[indices_true] = label[bb_idx, 0].long() + 1
        assigned_bb[indices_true] = label[bb_idx, 1:]
        # Offset transformation
        offset = offset_boxes(anchors, assigned_bb) * bbox_mask
        batch_offset.append(offset.reshape(-1))
        batch_mask.append(bbox_mask.reshape(-1))
        batch_class_labels.append(class_labels)
    bbox_offset = torch.stack(batch_offset)
    bbox_mask = torch.stack(batch_mask)
    class_labels = torch.stack(batch_class_labels)
    return (bbox_offset, bbox_mask, class_labels)
```

### An Example

Let us illustrate anchor box labeling via a concrete example. We define ground-truth bounding boxes for the dog and cat in the loaded image, where the first element is the class (0 for dog and 1 for cat) and the remaining four elements are the $(x, y)$-axis coordinates at the upper-left corner and the lower-right corner (range is between 0 and 1). We also construct five anchor boxes to be labeled using the coordinates of the upper-left corner and the lower-right corner: $A_0, \ldots, A_4$ (the index starts from 0). Then we plot these ground-truth bounding boxes and anchor boxes in the image.

```{.python .input}
#@tab all
ground_truth = d2l.tensor([[0, 0.1, 0.08, 0.52, 0.92],
                           [1, 0.55, 0.2, 0.9, 0.88]])
anchors = d2l.tensor([[0, 0.1, 0.2, 0.3], [0.15, 0.2, 0.4, 0.4],
                      [0.63, 0.05, 0.88, 0.98], [0.66, 0.45, 0.8, 0.8],
                      [0.57, 0.3, 0.92, 0.9]])

fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, ground_truth[:, 1:] * bbox_scale, ['dog', 'cat'], 'k')
show_bboxes(fig.axes, anchors * bbox_scale, ['0', '1', '2', '3', '4']);
```

Using the `multibox_target` function defined above, we can label classes and offsets of these anchor boxes based on the ground-truth bounding boxes for the dog and cat. In this example, indices of the background, dog, and cat classes are 0, 1, and 2, respectively. Below we add a dimension for examples of anchor boxes and ground-truth bounding boxes.

```{.python .input}
labels = multibox_target(np.expand_dims(anchors, axis=0),
                         np.expand_dims(ground_truth, axis=0))
```

```{.python .input}
#@tab pytorch
labels = multibox_target(anchors.unsqueeze(dim=0),
                         ground_truth.unsqueeze(dim=0))
```

There are three items in the returned result, all of which are in the tensor format. The third item contains the labeled classes of the input anchor boxes.

Let us analyze the returned class labels below based on anchor box and ground-truth bounding box positions in the image. First, among all the pairs of anchor boxes and ground-truth bounding boxes, the IoU of the anchor box $A_4$ and the ground-truth bounding box of the cat is the largest. Thus, the class of $A_4$ is labeled as the cat. Taking out pairs containing $A_4$ or the ground-truth bounding box of the cat, among the rest the pair of the anchor box $A_1$ and the ground-truth bounding box of the dog has the largest IoU. So the class of $A_1$ is labeled as the dog. Next, we need to traverse through the remaining three unlabeled anchor boxes: $A_0$, $A_2$, and $A_3$. For $A_0$, the class of the ground-truth bounding box with the largest IoU is the dog, but the IoU is below the predefined threshold (0.5), so the class is labeled as background; for $A_2$, the class of the ground-truth bounding box with the largest IoU is the cat and the IoU exceeds the threshold, so the class is labeled as the cat; for $A_3$, the class of the ground-truth bounding box with the largest IoU is the cat, but the value is below the threshold, so the class is labeled as background.

```{.python .input}
#@tab all
labels[2]
```

The second returned item is a mask variable of the shape (batch size, four times the number of anchor boxes). Every four elements in the mask variable correspond to the four offset values of each anchor box. Since we do not care about background detection, offsets of this negative class should not affect the objective function. Through elementwise multiplications, zeros in the mask variable will filter out negative class offsets before calculating the objective function.

```{.python .input}
#@tab all
labels[1]
```
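
To see what this masking accomplishes, here is a minimal framework-free sketch: multiplying by the mask zeroes out the background entries, so only positive anchor boxes would contribute to, say, an L1-style offset loss (all numbers below are made up for illustration).

```python
# One positive anchor box (mask 1) and one background anchor box (mask 0),
# four offset values each. Numbers are made up for illustration.
predicted_offsets = [0.2, -0.1, 0.3, 0.5,   # offsets of a positive anchor box
                     0.7, 0.4, -0.2, 0.1]   # offsets of a background anchor box
labeled_offsets = [1.0, -1.2, 0.8, 0.3,
                   0.0, 0.0, 0.0, 0.0]      # background offsets are labeled 0
mask = [1, 1, 1, 1, 0, 0, 0, 0]             # as in `labels[1]`

# Elementwise multiplication filters out the background contributions
loss = sum(abs(m * (p - l))
           for m, p, l in zip(mask, predicted_offsets, labeled_offsets))
print(loss)  # only the first four (positive) entries contribute
```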

The first returned item contains the four offset values labeled for each anchor box. Note that the offsets of negative-class anchor boxes are labeled as zeros.

```{.python .input}
#@tab all
labels[0]
```

## Predicting Bounding Boxes with Non-Maximum Suppression
:label:`subsec_predicting-bounding-boxes-nms`

During prediction, we generate multiple anchor boxes for the image and predict classes and offsets for each of them. A *predicted bounding box* is thus obtained according to an anchor box with its predicted offset. Below we implement the `offset_inverse` function that takes in anchors and offset predictions as inputs and applies inverse offset transformations to return the predicted bounding box coordinates.

```{.python .input}
#@tab all
#@save
def offset_inverse(anchors, offset_preds):
    """Predict bounding boxes based on anchor boxes with predicted offsets."""
    anc = d2l.box_corner_to_center(anchors)
    pred_bbox_xy = (offset_preds[:, :2] * anc[:, 2:] / 10) + anc[:, :2]
    pred_bbox_wh = d2l.exp(offset_preds[:, 2:] / 5) * anc[:, 2:]
    pred_bbox = d2l.concat((pred_bbox_xy, pred_bbox_wh), axis=1)
    predicted_bbox = d2l.box_center_to_corner(pred_bbox)
    return predicted_bbox
```
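
As a quick framework-free check that this inverse transformation undoes the labeling transformation (ignoring the `eps` term), the sketch below applies it to the offset computed in the earlier worked example and recovers the assigned box; all numbers are made up for illustration.

```python
# Invert the offset for a single anchor box in center format.
import math

xa, ya, wa, ha = 0.30, 0.40, 0.20, 0.40                        # anchor box
offset = (1.0, -1.25, 5 * math.log(1.25), 5 * math.log(1.25))  # labeled offset

xb = offset[0] * wa / 10 + xa
yb = offset[1] * ha / 10 + ya
wb = math.exp(offset[2] / 5) * wa
hb = math.exp(offset[3] / 5) * ha
print(tuple(round(v, 2) for v in (xb, yb, wb, hb)))  # (0.32, 0.35, 0.25, 0.5)
```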

When there are many anchor boxes, many similar (heavily overlapping) predicted bounding boxes may be output around the same object. To simplify the output, we can merge similar predicted bounding boxes that belong to the same object by using *non-maximum suppression* (NMS).

Here is how non-maximum suppression works. For a predicted bounding box $B$, the object detection model calculates the predicted likelihood for each class. Denoting by $p$ the largest predicted likelihood, the class corresponding to this probability is the predicted class for $B$. Specifically, we refer to $p$ as the *confidence* (score) of the predicted bounding box $B$. On the same image, all the predicted non-background bounding boxes are sorted by confidence in descending order to generate a list $L$. Then we manipulate the sorted list $L$ in the following steps:

1. Select the predicted bounding box $B_1$ with the highest confidence from $L$ as a basis and remove all non-basis predicted bounding boxes whose IoU with $B_1$ exceeds a predefined threshold $\epsilon$ from $L$. At this point, $L$ keeps the predicted bounding box with the highest confidence but drops others that are too similar to it. In a nutshell, those with *non-maximum* confidence scores are *suppressed*.
1. Select the predicted bounding box $B_2$ with the second highest confidence from $L$ as another basis and remove all non-basis predicted bounding boxes whose IoU with $B_2$ exceeds $\epsilon$ from $L$.
1. Repeat the above process until all the predicted bounding boxes in $L$ have been used as a basis. At this time, the IoU of any pair of predicted bounding boxes in $L$ is below the threshold $\epsilon$; thus, no pair of them is too similar to each other.
1. Output all the predicted bounding boxes in the list $L$.
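
Before looking at the framework implementation, here is a minimal framework-free sketch of these steps on toy corner-format boxes; the boxes and confidences are made up for illustration.

```python
# A toy version of non-maximum suppression.
def iou(b1, b2):
    iw = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    ih = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = iw * ih
    union = ((b1[2] - b1[0]) * (b1[3] - b1[1])
             + (b2[2] - b2[0]) * (b2[3] - b2[1]) - inter)
    return inter / union

def toy_nms(boxes, scores, iou_threshold):
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)  # basis: highest remaining confidence
        keep.append(i)
        # Suppress the remaining boxes that are too similar to the basis
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_threshold]
    return keep

boxes = [[0.10, 0.10, 0.50, 0.50],
         [0.12, 0.08, 0.52, 0.48],  # heavily overlaps the first box
         [0.60, 0.60, 0.90, 0.90]]
scores = [0.90, 0.80, 0.70]
print(toy_nms(boxes, scores, iou_threshold=0.5))  # [0, 2]
```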
The following `nms` function sorts confidence scores in descending order and returns their indices.

```{.python .input}
#@save
def nms(boxes, scores, iou_threshold):
    """Sort confidence scores of predicted bounding boxes."""
    B = scores.argsort()[::-1]
    keep = []  # Indices of predicted bounding boxes that will be kept
    while B.size > 0:
        i = B[0]
        keep.append(i)
        if B.size == 1: break
        iou = box_iou(boxes[i, :].reshape(-1, 4),
                      boxes[B[1:], :].reshape(-1, 4)).reshape(-1)
        inds = np.nonzero(iou <= iou_threshold)[0]
        B = B[inds + 1]
    return np.array(keep, dtype=np.int32, ctx=boxes.ctx)
```

```{.python .input}
#@tab pytorch
#@save
def nms(boxes, scores, iou_threshold):
    """Sort confidence scores of predicted bounding boxes."""
    B = torch.argsort(scores, dim=-1, descending=True)
    keep = []  # Indices of predicted bounding boxes that will be kept
    while B.numel() > 0:
        i = B[0]
        keep.append(i)
        if B.numel() == 1: break
        iou = box_iou(boxes[i, :].reshape(-1, 4),
                      boxes[B[1:], :].reshape(-1, 4)).reshape(-1)
        inds = torch.nonzero(iou <= iou_threshold).reshape(-1)
        B = B[inds + 1]
    return d2l.tensor(keep, device=boxes.device)
```

We define the following `multibox_detection` function to apply non-maximum suppression when predicting bounding boxes. Do not worry if you find the implementation a bit complicated: we will show how it works with a concrete example right after the implementation.
```{.python .input}
#@save
def multibox_detection(cls_probs, offset_preds, anchors, nms_threshold=0.5,
                       pos_threshold=0.009999999):
    """Predict bounding boxes using non-maximum suppression."""
    device, batch_size = cls_probs.ctx, cls_probs.shape[0]
    anchors = np.squeeze(anchors, axis=0)
    num_classes, num_anchors = cls_probs.shape[1], cls_probs.shape[2]
    out = []
    for i in range(batch_size):
        cls_prob, offset_pred = cls_probs[i], offset_preds[i].reshape(-1, 4)
        conf, class_id = np.max(cls_prob[1:], 0), np.argmax(cls_prob[1:], 0)
        predicted_bb = offset_inverse(anchors, offset_pred)
        keep = nms(predicted_bb, conf, nms_threshold)
        # Find all non-`keep` indices and set the class to background
        all_idx = np.arange(num_anchors, dtype=np.int32, ctx=device)
        combined = d2l.concat((keep, all_idx))
        unique, counts = np.unique(combined, return_counts=True)
        non_keep = unique[counts == 1]
        all_id_sorted = d2l.concat((keep, non_keep))
        class_id[non_keep] = -1
        class_id = class_id[all_id_sorted].astype('float32')
        conf, predicted_bb = conf[all_id_sorted], predicted_bb[all_id_sorted]
        # Here `pos_threshold` is a threshold for positive (non-background)
        # predictions
        below_min_idx = (conf < pos_threshold)
        class_id[below_min_idx] = -1
        conf[below_min_idx] = 1 - conf[below_min_idx]
        pred_info = d2l.concat((np.expand_dims(class_id, axis=1),
                                np.expand_dims(conf, axis=1),
                                predicted_bb), axis=1)
        out.append(pred_info)
    return d2l.stack(out)
```

```{.python .input}
#@tab pytorch
#@save
def multibox_detection(cls_probs, offset_preds, anchors, nms_threshold=0.5,
                       pos_threshold=0.009999999):
    """Predict bounding boxes using non-maximum suppression."""
    device, batch_size = cls_probs.device, cls_probs.shape[0]
    anchors = anchors.squeeze(0)
    num_classes, num_anchors = cls_probs.shape[1], cls_probs.shape[2]
    out = []
    for i in range(batch_size):
        cls_prob, offset_pred = cls_probs[i], offset_preds[i].reshape(-1, 4)
        conf, class_id = torch.max(cls_prob[1:], 0)
        predicted_bb = offset_inverse(anchors, offset_pred)
        keep = nms(predicted_bb, conf, nms_threshold)
        # Find all non-`keep` indices and set the class to background
        all_idx = torch.arange(num_anchors, dtype=torch.long, device=device)
        combined = torch.cat((keep, all_idx))
        uniques, counts = combined.unique(return_counts=True)
        non_keep = uniques[counts == 1]
        all_id_sorted = torch.cat((keep, non_keep))
        class_id[non_keep] = -1
        class_id = class_id[all_id_sorted]
        conf, predicted_bb = conf[all_id_sorted], predicted_bb[all_id_sorted]
        # Here `pos_threshold` is a threshold for positive (non-background)
        # predictions
        below_min_idx = (conf < pos_threshold)
        class_id[below_min_idx] = -1
        conf[below_min_idx] = 1 - conf[below_min_idx]
        pred_info = torch.cat((class_id.unsqueeze(1),
                               conf.unsqueeze(1),
                               predicted_bb), dim=1)
        out.append(pred_info)
    return d2l.stack(out)
```
Now let us apply the above implementations to a concrete example with four anchor boxes. For simplicity, we assume that the predicted offsets are all zeros. This means that the predicted bounding boxes are anchor boxes. For each class among the background, dog, and cat, we also define its predicted likelihood.

```{.python .input}
#@tab all
anchors = d2l.tensor([[0.1, 0.08, 0.52, 0.92], [0.08, 0.2, 0.56, 0.95],
                      [0.15, 0.3, 0.62, 0.91], [0.55, 0.2, 0.9, 0.88]])
offset_preds = d2l.tensor([0] * d2l.size(anchors))
cls_probs = d2l.tensor([[0] * 4,  # Predicted background likelihood
                        [0.9, 0.8, 0.7, 0.1],  # Predicted dog likelihood
                        [0.1, 0.2, 0.3, 0.9]])  # Predicted cat likelihood
```

We can plot these predicted bounding boxes with their confidence on the image.

```{.python .input}
#@tab all
fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, anchors * bbox_scale,
            ['dog=0.9', 'dog=0.8', 'dog=0.7', 'cat=0.9'])
```

Now we can invoke the `multibox_detection` function to perform non-maximum suppression, where the threshold is set to 0.5. Note that we add a dimension for examples in the tensor input.

We can see that the shape of the returned result is (batch size, number of anchor boxes, 6). The six elements in the innermost dimension give the output information for the same predicted bounding box. The first element is the predicted class index, which starts from 0 (0 is dog and 1 is cat). The value -1 indicates background or removal in non-maximum suppression. The second element is the confidence of the predicted bounding box. The remaining four elements are the $(x, y)$-axis coordinates of the upper-left corner and the lower-right corner of the predicted bounding box, respectively (range is between 0 and 1).

```{.python .input}
output = multibox_detection(np.expand_dims(cls_probs, axis=0),
                            np.expand_dims(offset_preds, axis=0),
                            np.expand_dims(anchors, axis=0),
                            nms_threshold=0.5)
output
```

```{.python .input}
#@tab pytorch
output = multibox_detection(cls_probs.unsqueeze(dim=0),
                            offset_preds.unsqueeze(dim=0),
                            anchors.unsqueeze(dim=0),
                            nms_threshold=0.5)
output
```

After removing those predicted bounding boxes of class -1, we can output the final predicted bounding boxes kept by non-maximum suppression.

```{.python .input}
#@tab all
fig = d2l.plt.imshow(img)
for i in d2l.numpy(output[0]):
    if i[0] == -1:
        continue
    label = ('dog=', 'cat=')[int(i[0])] + str(i[1])
    show_bboxes(fig.axes, [d2l.tensor(i[2:]) * bbox_scale], label)
```

In practice, we can remove predicted bounding boxes with lower confidence even before performing non-maximum suppression, thereby reducing computation in this algorithm. We may also post-process the output of non-maximum suppression, for example, by only keeping results with higher confidence in the final output.
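
As a minimal sketch of the first idea, assuming per-box confidences and predicted boxes like those produced inside `multibox_detection`, one could drop low-confidence boxes before running non-maximum suppression; the threshold and numbers below are made up for illustration.

```python
# Pre-filter by confidence before NMS to reduce the pairwise IoU work.
conf = [0.90, 0.05, 0.80, 0.02]
predicted_bb = [[0.10, 0.10, 0.50, 0.50],
                [0.11, 0.12, 0.52, 0.49],
                [0.60, 0.60, 0.90, 0.90],
                [0.58, 0.61, 0.88, 0.92]]

keep_before_nms = [i for i, c in enumerate(conf) if c >= 0.1]
conf = [conf[i] for i in keep_before_nms]
predicted_bb = [predicted_bb[i] for i in keep_before_nms]
print(keep_before_nms)  # [0, 2]: only these boxes enter non-maximum suppression
```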

## Summary

* We generate anchor boxes with different shapes centered on each pixel of the image.
* Intersection over union (IoU), also known as the Jaccard index, measures the similarity of two bounding boxes. It is the ratio of their intersection area to their union area.
* In a training set, we need two types of labels for each anchor box. One is the class of the object relevant to the anchor box and the other is the offset of the ground-truth bounding box relative to the anchor box.
* During prediction, we can use non-maximum suppression (NMS) to remove similar predicted bounding boxes, thereby simplifying the output.

## Exercises

1. Change values of `sizes` and `ratios` in the `multibox_prior` function. What are the changes to the generated anchor boxes?
1. Construct and visualize two bounding boxes with an IoU of 0.5. How do they overlap with each other?
1. Modify the variable `anchors` in :numref:`subsec_labeling-anchor-boxes` and :numref:`subsec_predicting-bounding-boxes-nms`. How do the results change?
1. Non-maximum suppression is a greedy algorithm that suppresses predicted bounding boxes by *removing* them. Is it possible that some of these removed ones are actually useful? How can this algorithm be modified to suppress *softly*? You may refer to Soft-NMS :cite:`Bodla.Singh.Chellappa.ea.2017`.
1. Rather than being hand-crafted, can non-maximum suppression be learned?

:begin_tab:`mxnet`
Discussions
:end_tab:

:begin_tab:`pytorch`
Discussions
:end_tab: