Implements Faster R-CNN.
The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each image, with values in the 0-1 range. Different images can have different sizes.
The behavior of the model changes depending on whether it is in training or evaluation mode.
During training, the model expects both the input tensors and a list of targets (one dictionary per image), containing:
- boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with x values between 0 and W and y values between 0 and H
- labels (Tensor[N]): the class label for each ground-truth box
The model returns a Dict[Tensor] during training, containing the classification and regression losses for both the RPN and the R-CNN.
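A minimal training-mode sketch, using the fasterrcnn_resnet50_fpn helper from torchvision for brevity; the images, boxes, and labels below are random placeholders, not a real dataset::

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# placeholder inputs and ground truth for a batch of two images;
# boxes are in [x0, y0, x1, y1] format, within each image's bounds
images = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
targets = [{'boxes': torch.tensor([[50.0, 60.0, 200.0, 220.0]]),
            'labels': torch.tensor([1])},
           {'boxes': torch.tensor([[30.0, 40.0, 150.0, 160.0]]),
            'labels': torch.tensor([1])}]

model = fasterrcnn_resnet50_fpn(pretrained=False, pretrained_backbone=False,
                                num_classes=2)
model.train()

# in training mode the forward pass returns the losses, e.g.
# loss_classifier, loss_box_reg, loss_objectness, loss_rpn_box_reg
loss_dict = model(images, targets)
total_loss = sum(loss for loss in loss_dict.values())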
During inference, the model requires only the input tensors, and returns the post-processed predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as follows:
- boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with x values between 0 and W and y values between 0 and H
- labels (Tensor[N]): the predicted labels for each image
- scores (Tensor[N]): the scores of each prediction
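For illustration, a minimal inference sketch showing how these fields can be accessed (again with the fasterrcnn_resnet50_fpn helper and random tensors standing in for real images)::

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(pretrained=False, pretrained_backbone=False,
                                num_classes=2)
model.eval()

x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
with torch.no_grad():
    predictions = model(x)

# one Dict per input image, with the fields documented above
boxes = predictions[0]['boxes']    # Tensor[N, 4]
labels = predictions[0]['labels']  # Tensor[N]
scores = predictions[0]['scores']  # Tensor[N]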
Arguments:
backbone (nn.Module): the network used to compute the features for the model.
It should contain an out_channels attribute, which indicates the number of output channels that each feature map has (and it should be the same for all feature maps). The backbone should return a single Tensor or an OrderedDict[Tensor].
num_classes (int): number of output classes of the model (including the background). If box_predictor is specified, num_classes should be None.
min_size (int): minimum size of the image to be rescaled before feeding it to the backbone
max_size (int): maximum size of the image to be rescaled before feeding it to the backbone
image_mean (Tuple[float, float, float]): mean values used for input normalization. They are generally the mean values of the dataset on which the backbone has been trained.
image_std (Tuple[float, float, float]): std values used for input normalization. They are generally the std values of the dataset on which the backbone has been trained.
rpn_anchor_generator (AnchorGenerator): module that generates the anchors for a set of feature maps.
rpn_head (nn.Module): module that computes the objectness and regression deltas from the RPN
rpn_pre_nms_top_n_train (int): number of proposals to keep before applying NMS during training
rpn_pre_nms_top_n_test (int): number of proposals to keep before applying NMS during testing
rpn_post_nms_top_n_train (int): number of proposals to keep after applying NMS during training
rpn_post_nms_top_n_test (int): number of proposals to keep after applying NMS during testing
rpn_nms_thresh (float): NMS threshold used for postprocessing the RPN proposals
rpn_fg_iou_thresh (float): minimum IoU between the anchor and the GT box so that they can be considered as positive during training of the RPN.
rpn_bg_iou_thresh (float): maximum IoU between the anchor and the GT box so that they can be considered as negative during training of the RPN.
rpn_batch_size_per_image (int): number of anchors that are sampled during training of the RPN for computing the loss
rpn_positive_fraction (float): proportion of positive anchors in a mini-batch during training of the RPN
box_roi_pool (MultiScaleRoIAlign): the module which crops and resizes the feature maps in the locations indicated by the bounding boxes
box_head (nn.Module): module that takes the cropped feature maps as input
box_predictor (nn.Module): module that takes the output of box_head and returns the classification logits and box regression deltas.
box_score_thresh (float): during inference, only return proposals with a classification score greater than box_score_thresh
box_nms_thresh (float): NMS threshold for the prediction head. Used during inference
box_detections_per_img (int): maximum number of detections per image, for all classes.
box_fg_iou_thresh (float): minimum IoU between the proposals and the GT box so that they can be considered as positive during training of the classification head
box_bg_iou_thresh (float): maximum IoU between the proposals and the GT box so that they can be considered as negative during training of the classification head
box_batch_size_per_image (int): number of proposals that are sampled during training of the classification head
box_positive_fraction (float): proportion of positive proposals in a mini-batch during training of the classification head
bbox_reg_weights (Tuple[float, float, float, float]): weights for the encoding/decoding of the bounding boxes
Example::
import torch
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
# load a pre-trained model for classification and return
# only the features
backbone = torchvision.models.mobilenet_v2(pretrained=True).features
# FasterRCNN needs to know the number of
# output channels in a backbone. For mobilenet_v2, it's 1280
# so we need to add it here
backbone.out_channels = 1280
# let's make the RPN generate 5 x 3 anchors per spatial
# location, with 5 different sizes and 3 different aspect
# ratios. We have a Tuple[Tuple[int]] because each feature
# map could potentially have different sizes and
# aspect ratios
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
# let's define what are the feature maps that we will
# use to perform the region of interest cropping, as well as
# the size of the crop after rescaling.
# if your backbone returns a Tensor, featmap_names is expected to
# be [0]. More generally, the backbone should return an
# OrderedDict[Tensor], and in featmap_names you can choose which
# feature maps to use.
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0],
                                                output_size=7,
                                                sampling_ratio=2)
# put the pieces together inside a FasterRCNN model
model = FasterRCNN(backbone,
                   num_classes=2,
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)
model.eval()
x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
predictions = model(x)
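The example above runs the assembled model in evaluation mode; continuing from the variables defined there, a sketch of the corresponding training call (with placeholder ground truth) could look like::

# switch the assembled model to training mode and pass placeholder
# targets to obtain the loss dictionary instead of detections
model.train()
targets = [{'boxes': torch.tensor([[50.0, 60.0, 200.0, 220.0]]),
            'labels': torch.tensor([1])},
           {'boxes': torch.tensor([[30.0, 40.0, 150.0, 160.0]]),
            'labels': torch.tensor([1])}]
loss_dict = model(x, targets)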