Implements Faster R-CNN.
The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each image, with values in the 0-1 range. Different images can have different sizes.
The behavior of the model changes depending on whether it is in training or evaluation mode.
During training, the model expects both the input tensors and a list of targets (one dictionary per image), containing:
- boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with x values between 0 and W and y values between 0 and H
- labels (Tensor[N]): the class label for each ground-truth box
The model returns a Dict[Tensor] during training, containing the classification and regression losses for both the RPN and the R-CNN.
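A minimal training-mode sketch, using the fasterrcnn_resnet50_fpn helper from torchvision for brevity; the images, boxes, and labels below are random placeholders, not a real dataset::

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# placeholder inputs and ground truth for a batch of two images;
# boxes are in [x0, y0, x1, y1] format, within each image's bounds
images = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
targets = [{'boxes': torch.tensor([[50.0, 60.0, 200.0, 220.0]]),
            'labels': torch.tensor([1])},
           {'boxes': torch.tensor([[30.0, 40.0, 150.0, 160.0]]),
            'labels': torch.tensor([1])}]

model = fasterrcnn_resnet50_fpn(pretrained=False, pretrained_backbone=False,
                                num_classes=2)
model.train()

# in training mode the forward pass returns the losses, e.g.
# loss_classifier, loss_box_reg, loss_objectness, loss_rpn_box_reg
loss_dict = model(images, targets)
total_loss = sum(loss for loss in loss_dict.values())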
During inference, the model requires only the input tensors, and returns the post-processed predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as follows:
- boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with x values between 0 and W and y values between 0 and H
- labels (Tensor[N]): the predicted labels for each image
- scores (Tensor[N]): the scores of each prediction
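For illustration, a minimal inference sketch showing how these fields can be accessed (again with the fasterrcnn_resnet50_fpn helper and random tensors standing in for real images)::

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(pretrained=False, pretrained_backbone=False,
                                num_classes=2)
model.eval()

x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
with torch.no_grad():
    predictions = model(x)

# one Dict per input image, with the fields documented above
boxes = predictions[0]['boxes']    # Tensor[N, 4]
labels = predictions[0]['labels']  # Tensor[N]
scores = predictions[0]['scores']  # Tensor[N]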
Arguments:
backbone (nn.Module): the network used to compute the features for the model.
It should contain an out_channels attribute, which indicates the number of output channels that each feature map has (and it should be the same for all feature maps). The backbone should return a single Tensor or an OrderedDict[Tensor].
num_classes (int): number of output classes of the model (including the background). If box_predictor is specified, num_classes should be None.
min_size (int): minimum size of the image to be rescaled before feeding it to the backbone
max_size (int): maximum size of the image to be rescaled before feeding it to the backbone
image_mean (Tuple[float, float, float]): mean values used for input normalization. They are generally the mean values of the dataset on which the backbone has been trained.
image_std (Tuple[float, float, float]): std values used for input normalization. They are generally the std values of the dataset on which the backbone has been trained.
rpn_anchor_generator (AnchorGenerator): module that generates the anchors for a set of feature maps.
rpn_head (nn.Module): module that computes the objectness and regression deltas from the RPN
rpn_pre_nms_top_n_train (int): number of proposals to keep before applying NMS during training
rpn_pre_nms_top_n_test (int): number of proposals to keep before applying NMS during testing
rpn_post_nms_top_n_train (int): number of proposals to keep after applying NMS during training
rpn_post_nms_top_n_test (int): number of proposals to keep after applying NMS during testing
rpn_nms_thresh (float): NMS threshold used for postprocessing the RPN proposals
rpn_fg_iou_thresh (float): minimum IoU between the anchor and the GT box so that they can be considered as positive during training of the RPN.
rpn_bg_iou_thresh (float): maximum IoU between the anchor and the GT box so that they can be considered as negative during training of the RPN.
rpn_batch_size_per_image (int): number of anchors that are sampled during training of the RPN for computing the loss
rpn_positive_fraction (float): proportion of positive anchors in a mini-batch during training of the RPN
box_roi_pool (MultiScaleRoIAlign): the module which crops and resizes the feature maps in the locations indicated by the bounding boxes
box_head (nn.Module): module that takes the cropped feature maps as input
box_predictor (nn.Module): module that takes the output of box_head and returns the classification logits and box regression deltas.
box_score_thresh (float): during inference, only return proposals with a classification score greater than box_score_thresh
box_nms_thresh (float): NMS threshold for the prediction head. Used during inference
box_detections_per_img (int): maximum number of detections per image, for all classes.
box_fg_iou_thresh (float): minimum IoU between the proposals and the GT box so that they can be considered as positive during training of the classification head
box_bg_iou_thresh (float): maximum IoU between the proposals and the GT box so that they can be considered as negative during training of the classification head
box_batch_size_per_image (int): number of proposals that are sampled during training of the classification head
box_positive_fraction (float): proportion of positive proposals in a mini-batch during training of the classification head
bbox_reg_weights (Tuple[float, float, float, float]): weights for the encoding/decoding of the bounding boxes
Example::
import torch
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
# load a pre-trained model for classification and return
# only the features
backbone = torchvision.models.mobilenet_v2(pretrained=True).features
# FasterRCNN needs to know the number of
# output channels in a backbone. For mobilenet_v2, it's 1280
# so we need to add it here
backbone.out_channels = 1280
# let's make the RPN generate 5 x 3 anchors per spatial
# location, with 5 different sizes and 3 different aspect
# ratios. We have a Tuple[Tuple[int]] because each feature
# map could potentially have different sizes and
# aspect ratios
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
# let's define what are the feature maps that we will
# use to perform the region of interest cropping, as well as
# the size of the crop after rescaling.
# if your backbone returns a Tensor, featmap_names is expected to
# be [0]. More generally, the backbone should return an
# OrderedDict[Tensor], and in featmap_names you can choose which
# feature maps to use.
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0],
                                                output_size=7,
                                                sampling_ratio=2)
# put the pieces together inside a FasterRCNN model
model = FasterRCNN(backbone,
                   num_classes=2,
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)
model.eval()
x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
predictions = model(x)
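The example above runs the assembled model in evaluation mode; continuing from the variables defined there, a sketch of the corresponding training call (with placeholder ground truth) could look like::

# switch the assembled model to training mode and pass placeholder
# targets to obtain the loss dictionary instead of detections
model.train()
targets = [{'boxes': torch.tensor([[50.0, 60.0, 200.0, 220.0]]),
            'labels': torch.tensor([1])},
           {'boxes': torch.tensor([[30.0, 40.0, 150.0, 160.0]]),
            'labels': torch.tensor([1])}]
loss_dict = model(x, targets)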