Rich feature hierarchies for accurate object detection and semantic segmentation
    Abstract
Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012—achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects, and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. Source code for the complete system is available at http://www.cs.berkeley.edu/˜rbg/rcnn.
Figure 1: Object detection system overview. Our system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs. R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. For comparison, [34] reports 35.1% mAP using the same region proposals, but with a spatial pyramid and bag-of-visual-words approach. The popular deformable part models perform at 33.4%.
1. Introduction
Features matter. The last decade of progress on various visual recognition tasks has been based considerably on the use of SIFT [27] and HOG [7]. But if we look at performance on the canonical visual recognition task, PASCAL VOC object detection [13], it is generally acknowledged that progress has been slow during 2010-2012, with small gains obtained by building ensemble systems and employing minor variants of successful methods.
SIFT and HOG are blockwise orientation histograms, a representation we could associate roughly with complex cells in V1, the first cortical area in the primate visual pathway. But we also know that recognition occurs several stages downstream, which suggests that there might be hierarchical, multi-stage processes for computing features that are even more informative for visual recognition.
Fukushima's "neocognitron" [17], a biologically inspired hierarchical and shift-invariant model for pattern recognition, was an early attempt at just such a process. The neocognitron, however, lacked a supervised training algorithm. Building on Rumelhart et al. [30], LeCun et al. [24] showed that stochastic gradient descent via backpropagation was effective for training convolutional neural networks (CNNs), a class of models that extend the neocognitron.
CNNs saw heavy use in the 1990s (e.g., [25]), but then fell out of fashion with the rise of support vector machines. In 2012, Krizhevsky et al. [23] rekindled interest in CNNs by showing substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [9, 10]. Their success resulted from training a large CNN on 1.2 million labeled images, together with a few twists on LeCun's CNN (e.g., max(x, 0) rectifying non-linearities and "dropout" regularization).
The significance of the ImageNet result was vigorously debated during the ILSVRC 2012 workshop. The central issue can be distilled to the following: To what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge?
We answer this question by bridging the gap between image classification and object detection. This paper is the first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC as compared to systems based on simpler HOG-like features. To achieve this result, we focused on two problems: localizing objects with a deep network and training a high-capacity model with only a small quantity of annotated detection data.
Unlike image classification, detection requires localizing (likely many) objects within an image. One approach frames localization as a regression problem. However, work from Szegedy et al. [33], concurrent with our own, indicates that this strategy may not fare well in practice (they report a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by our method). An alternative is to build a sliding-window detector. CNNs have been used in this way for at least two decades, typically on constrained object categories, such as faces [29, 35] and pedestrians [31]. In order to maintain high spatial resolution, these CNNs typically only have two convolutional and pooling layers. We also considered adopting a sliding-window approach. However, units high up in our network, which has five convolutional layers, have very large receptive fields (195 × 195 pixels) and strides (32 × 32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.
Instead, we solve the CNN localization problem by operating within the "recognition using regions" paradigm [19], which has been successful for both object detection [34] and semantic segmentation [5]. At test time, our method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. We use a simple technique (affine image warping) to compute a fixed-size CNN input from each region proposal, regardless of the region's shape. Figure 1 presents an overview of our method and highlights some of our results. Since our system combines region proposals with CNNs, we dub the method R-CNN: Regions with CNN features.
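To make the test-time flow concrete, here is a minimal sketch of the pipeline in Python. The proposal generator, warping routine, and feature extractor are caller-supplied functions (`propose`, `warp`, and `featurize` are hypothetical stand-ins, not the paper's actual code); only the per-class linear scoring is spelled out.

```python
import numpy as np

def rcnn_detect(image, propose, warp, featurize, svm_W, svm_b):
    """Sketch of R-CNN test time: ~2000 proposals -> warped crops ->
    4096-d CNN features -> class-specific linear SVM scores.

    propose(image)   -> list of (x1, y1, x2, y2) boxes
    warp(image, box) -> fixed-size CNN input (e.g., 227 x 227 x 3)
    featurize(crop)  -> 4096-d feature vector
    svm_W: (4096, N) weights and svm_b: (N,) biases, one per class.
    """
    boxes = propose(image)
    feats = np.stack([featurize(warp(image, b)) for b in boxes])  # (R, 4096)
    scores = feats @ svm_W + svm_b                                # (R, N)
    return boxes, scores
```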
A second challenge faced in detection is that labeled data is scarce and the amount currently available is insufficient for training a large CNN. The conventional solution to this problem is to use unsupervised pre-training, followed by supervised fine-tuning (e.g., [31]). The second principal contribution of this paper is to show that supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL), is an effective paradigm for learning high-capacity CNNs when data is scarce. In our experiments, fine-tuning for detection improves mAP performance by 8 percentage points. After fine-tuning, our system achieves a mAP of 54% on VOC 2010 compared to 33% for the highly-tuned, HOG-based deformable part model (DPM) [15, 18]. We also point readers to contemporaneous work by Donahue et al. [11], who show that Krizhevsky's CNN can be used (without fine-tuning) as a blackbox feature extractor, yielding excellent performance on several recognition tasks including scene classification, fine-grained sub-categorization, and domain adaptation.
Our system is also quite efficient. The only class-specific computations are a reasonably small matrix-vector product and greedy non-maximum suppression. This computational property follows from features that are shared across all categories and that are also two orders of magnitude lower-dimensional than previously used region features (cf. [34]).
Understanding the failure modes of our approach is also critical for improving it, and so we report results from the detection analysis tool of Hoiem et al. [21]. As an immediate consequence of this analysis, we demonstrate that a simple bounding box regression method significantly reduces mislocalizations, which are the dominant error mode.
Before developing technical details, we note that because R-CNN operates on regions it is natural to extend it to the task of semantic segmentation. With minor modifications, we also achieve competitive results on the PASCAL VOC segmentation task, with an average segmentation accuracy of 47.9% on the VOC 2011 test set.
2. Object detection with R-CNN
Our object detection system consists of three modules. The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class-specific linear SVMs. In this section, we present our design decisions for each module, describe their test-time usage, detail how their parameters are learned, and show results on PASCAL VOC 2010-12.
2.1. Module design
Region proposals. A variety of recent papers offer methods for generating category-independent region proposals. Examples include: objectness [1], selective search [34], category-independent object proposals [12], constrained parametric min-cuts (CPMC) [5], multi-scale combinatorial grouping [3], and Cireşan et al. [6], who detect mitotic cells by applying a CNN to regularly-spaced square crops, which are a special case of region proposals. While R-CNN is agnostic to the particular region proposal method, we use selective search to enable a controlled comparison with prior detection work (e.g., [34, 36]).
Feature extraction. We extract a 4096-dimensional feature vector from each region proposal using the Caffe [22] implementation of the CNN described by Krizhevsky et al. [23]. Features are computed by forward propagating a mean-subtracted 227 × 227 RGB image through five convolutional layers and two fully connected layers. We refer readers to [22, 23] for more network architecture details.
Figure 2: Warped training samples from VOC 2007 train.
In order to compute features for a region proposal, we must first convert the image data in that region into a form that is compatible with the CNN (its architecture requires inputs of a fixed 227 × 227 pixel size). Of the many possible transformations of our arbitrary-shaped regions, we opt for the simplest. Regardless of the size or aspect ratio of the candidate region, we warp all pixels in a tight bounding box around it to the required size. Prior to warping, we dilate the tight bounding box so that at the warped size there are exactly p pixels of warped image context around the original box (we use p = 16). Figure 2 shows a random sampling of warped training regions. The supplementary material discusses alternatives to warping.
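As an illustration, the following sketch implements this dilate-then-warp step, assuming OpenCV is available for resizing; clipping at the image border stands in for whatever padding the full system uses.

```python
import numpy as np
import cv2  # assumed available; any image-resize routine would do

def warp_proposal(image, box, out_size=227, pad=16):
    """Dilate a tight (x1, y1, x2, y2) box so the warped crop has exactly
    `pad` pixels of image context on each side, then anisotropically
    scale it to out_size x out_size (a sketch of the warping above)."""
    x1, y1, x2, y2 = box
    # The tight box must occupy (out_size - 2*pad) warped pixels, so each
    # side is dilated by pad * (box extent / (out_size - 2*pad)).
    scale_x = (x2 - x1) / (out_size - 2 * pad)
    scale_y = (y2 - y1) / (out_size - 2 * pad)
    x1 = int(round(x1 - pad * scale_x)); x2 = int(round(x2 + pad * scale_x))
    y1 = int(round(y1 - pad * scale_y)); y2 = int(round(y2 + pad * scale_y))
    # Clip to the image; a fuller version would pad past the border.
    h, w = image.shape[:2]
    crop = image[max(y1, 0):min(y2, h), max(x1, 0):min(x2, w)]
    return cv2.resize(crop, (out_size, out_size))
```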
2.2. Test-time detection
At test time, we run selective search on the test image to extract around 2000 region proposals (we use selective search's "fast mode" in all experiments). We warp each proposal and forward propagate it through the CNN in order to read off features from the desired layer. Then, for each class, we score each extracted feature vector using the SVM trained for that class. Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher scoring selected region larger than a learned threshold.
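The per-class suppression step can be written compactly; below is a minimal NumPy sketch of greedy NMS, with the learned per-class threshold passed in.

```python
import numpy as np

def nms(boxes, scores, iou_threshold):
    """Greedy non-maximum suppression for one class: keep the highest-
    scoring box, reject boxes overlapping it above iou_threshold, repeat.
    boxes: (R, 4) array of (x1, y1, x2, y2); scores: (R,)."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box with the remaining candidates.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_threshold]
    return keep
```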
Run-time analysis. Two properties make detection efficient. First, all CNN parameters are shared across all categories. Second, the feature vectors computed by the CNN are low-dimensional when compared to other common approaches, such as spatial pyramids with bag-of-visual-word encodings. The features used in the UVA detection system [34], for example, are two orders of magnitude larger than ours (360k vs. 4k-dimensional).
The result of such sharing is that the time spent computing region proposals and features (13s/image on a GPU or 53s/image on a CPU) is amortized over all classes. The only class-specific computations are dot products between features and SVM weights and non-maximum suppression. In practice, all dot products for an image are batched into a single matrix-matrix product. The feature matrix is typically 2000 × 4096 and the SVM weight matrix is 4096 × N, where N is the number of classes.
This analysis shows that R-CNN can scale to thousands of object classes without resorting to approximate techniques, such as hashing. Even if there were 100k classes, the resulting matrix multiplication takes only 10 seconds on a modern multi-core CPU. This efficiency is not merely the result of using region proposals and shared features. The UVA system, due to its high-dimensional features, would be two orders of magnitude slower while requiring 134GB of memory just to store 100k linear predictors, compared to just 1.5GB for our lower-dimensional features.
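A back-of-the-envelope check of these numbers, assuming float32 storage and a hypothetical N = 100k classes:

```python
import numpy as np

D, N = 4096, 100_000  # feature dimension, hypothetical number of classes

# Memory needed to store N linear predictors at 4 bytes per weight:
print(f"{D * N * 4 / 2**30:.2f} GiB")        # ~1.53 GiB for 4096-d features
print(f"{360_000 * N * 4 / 2**30:.1f} GiB")  # ~134 GiB at UVA's 360k dims

# Per-image scoring is a single matrix-matrix product (smaller N for demo):
feats = np.random.randn(2000, D).astype(np.float32)  # stand-in CNN features
W = np.random.randn(D, 1000).astype(np.float32)      # 1000 classes here
scores = feats @ W                                   # (2000, 1000)
```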
It is also interesting to contrast R-CNN with the recent work from Dean et al. on scalable detection using DPMs and hashing [8]. They report a mAP of around 16% on VOC 2007 at a run-time of 5 minutes per image when introducing 10k distractor classes. With our approach, 10k detectors can run in about a minute on a CPU, and because no approximations are made mAP would remain at 59% (Section 3.2).
2.3. Training
Supervised pre-training. We discriminatively pre-trained the CNN on a large auxiliary dataset (ILSVRC 2012) with image-level annotations (i.e., no bounding box labels). Pre-training was performed using the open source Caffe CNN library [22]. In brief, our CNN nearly matches the performance of Krizhevsky et al. [23], obtaining a top-1 error rate 2.2 percentage points higher on the ILSVRC 2012 validation set. This discrepancy is due to simplifications in the training process.
Domain-specific fine-tuning. To adapt our CNN to the new task (detection) and the new domain (warped VOC windows), we continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals from VOC. Aside from replacing the CNN's ImageNet-specific 1000-way classification layer with a randomly initialized 21-way classification layer (for the 20 VOC classes plus background), the CNN architecture is unchanged. We treat all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives for that box's class and the rest as negatives. We start SGD at a learning rate of 0.001 (1/10th of the initial pre-training rate), which allows fine-tuning to make progress while not clobbering the initialization. In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128. We bias the sampling towards positive windows because they are extremely rare compared to background.
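A sketch of this biased sampling in NumPy (index bookkeeping only; the SGD step itself lives in the training framework):

```python
import numpy as np

def sample_minibatch(pos_idx, bg_idx, rng, n_pos=32, n_bg=96):
    """Biased mini-batch sampling for fine-tuning: 32 positive windows
    (over all classes) and 96 background windows per batch of 128,
    since positives are rare relative to background."""
    pos = rng.choice(pos_idx, size=n_pos, replace=len(pos_idx) < n_pos)
    bg = rng.choice(bg_idx, size=n_bg, replace=len(bg_idx) < n_bg)
    return np.concatenate([pos, bg])

rng = np.random.default_rng(0)
batch = sample_minibatch(np.arange(50), np.arange(50, 5000), rng)  # 128 indices
```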
Object category classifiers. Consider training a binary classifier to detect cars. It's clear that an image region tightly enclosing a car should be a positive example. Similarly, it's clear that a background region, which has nothing to do with cars, should be a negative example. Less clear is how to label a region that partially overlaps a car. We resolve this issue with an IoU overlap threshold, below which regions are defined as negatives. The overlap threshold, 0.3, was selected by a grid search over {0, 0.1, …, 0.5} on a validation set. We found that selecting this threshold carefully is important. Setting it to 0.5, as in [34], decreased mAP by 5 points. Similarly, setting it to 0 decreased mAP by 4 points. Positive examples are defined simply to be the ground-truth bounding boxes for each class.
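For concreteness, here is how the IoU computation and the resulting labeling rule look in code (a sketch: ground-truth boxes are the positives, proposals below 0.3 IoU against every ground-truth box of the class are negatives, and everything in between is ignored):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def svm_label(proposal, gt_boxes, neg_threshold=0.3):
    """SVM training label for one class: -1 if the proposal's best IoU
    with that class's ground truth falls below 0.3, else ignored (0)."""
    best = max((iou(proposal, gt) for gt in gt_boxes), default=0.0)
    return -1 if best < neg_threshold else 0
```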
Table 1: Detection average precision (%) on VOC 2010 test. R-CNN is most directly comparable to UVA and Regionlets since all methods use selective search region proposals. Bounding box regression (BB) is described in Section 3.4. At publication time, SegDPM was the top performer on the PASCAL VOC leaderboard. †DPM and SegDPM use context rescoring not used by the other methods.
Once features are extracted and training labels are applied, we optimize one linear SVM per class. Since the training data is too large to fit in memory, we adopt the standard hard negative mining method [15, 32]. Hard negative mining converges quickly and in practice mAP stops increasing after only a single pass over all images.
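A sketch of one round-based variant of standard hard negative mining, using scikit-learn's LinearSVC as a stand-in for the paper's SVM solver (the initial subset size, C, margin threshold, and round count here are illustrative, not the paper's values):

```python
import numpy as np
from sklearn.svm import LinearSVC  # stand-in solver, not the paper's

def mine_hard_negatives(pos_feats, neg_feats, rounds=2):
    """Train on all positives plus a subset of negatives, then repeatedly
    add negatives the current model scores as margin violators and
    retrain. Converges quickly in practice."""
    rng = np.random.default_rng(0)
    n0 = min(1000, len(neg_feats))
    active = neg_feats[rng.choice(len(neg_feats), size=n0, replace=False)]
    svm = None
    for _ in range(rounds):
        X = np.vstack([pos_feats, active])
        y = np.concatenate([np.ones(len(pos_feats)), -np.ones(len(active))])
        svm = LinearSVC(C=0.001).fit(X, y)
        hard = neg_feats[svm.decision_function(neg_feats) > -1.0]  # violators
        active = np.unique(np.vstack([active, hard]), axis=0)
    return svm
```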
In supplementary material we discuss why the positive and negative examples are defined differently in fine-tuning versus SVM training. We also discuss why it's necessary to train detection classifiers rather than simply use outputs from the final layer (fc8) of the fine-tuned CNN.
2.4. Results on PASCAL VOC 2010-12
Following the PASCAL VOC best practices [13], we validated all design decisions and hyperparameters on the VOC 2007 dataset (Section 3.2). For final results on the VOC 2010-12 datasets, we fine-tuned the CNN on VOC 2012 train and optimized our detection SVMs on VOC 2012 trainval. We submitted test results to the evaluation server only once for each of the two major algorithm variants (with and without bounding box regression).
Table 1 shows complete results on VOC 2010. We compare our method against four strong baselines, including SegDPM [16], which combines DPM detectors with the output of a semantic segmentation system [4] and uses additional inter-detector context and image-classifier rescoring. The most germane comparison is to the UVA system from Uijlings et al. [34], since our systems use the same region proposal algorithm. To classify regions, their method builds a four-level spatial pyramid and populates it with densely sampled SIFT, Extended OpponentSIFT, and RGB-SIFT descriptors, each vector quantized with 4000-word codebooks. Classification is performed with a histogram intersection kernel SVM. Compared to their multi-feature, non-linear kernel SVM approach, we achieve a large improvement in mAP, from 35.1% to 53.7%, while also being much faster (Section 2.2). Our method achieves similar performance (53.3% mAP) on VOC 2011/12 test.
3. Visualization, ablation, and modes of error
3.1. Visualizing learned features
First-layer filters can be visualized directly and are easy to understand [23]. They capture oriented edges and opponent colors. Understanding the subsequent layers is more challenging. Zeiler and Fergus present a visually attractive deconvolutional approach in [37]. We propose a simple (and complementary) non-parametric method that directly shows what the network learned.
The idea is to single out a particular unit (feature) in the network and use it as if it were an object detector in its own right. That is, we compute the unit's activations on a large set of held-out region proposals (about 10 million), sort the proposals from highest to lowest activation, perform non-maximum suppression, and then display the top-scoring regions. Our method lets the selected unit "speak for itself" by showing exactly which inputs it fires on. We avoid averaging in order to see different visual modes and gain insight into the invariances computed by the unit.
We visualize units from layer pool5, which is the max-pooled output of the network's fifth and final convolutional layer. The pool5 feature map is 6 × 6 × 256 = 9216-dimensional. Ignoring boundary effects, each pool5 unit has a receptive field of 195 × 195 pixels in the original 227 × 227 pixel input. A central pool5 unit has a nearly global view, while one near the edge has a smaller, clipped support.
Each row in Figure 3 displays the top 16 activations for a pool5 unit from a CNN that we fine-tuned on VOC 2007 trainval. Six of the 256 functionally unique units are visualized (the supplementary material includes more). These units were selected to show a representative sample of what the network learns. In the second row, we see a unit that fires on dog faces and dot arrays. The unit corresponding to the third row is a red blob detector. There are also detectors for human faces and more abstract patterns such as text and triangular structures with windows. The network appears to learn a representation that combines a small number of class-tuned features together with a distributed representation of shape, texture, color, and material properties. The subsequent fully connected layer fc6 has the ability to model a large set of compositions of these rich features.
Figure 3: Top regions for six pool5 units. Receptive fields and activation values are drawn in white. Some units are aligned to concepts, such as people (row 1) or text (4). Other units capture texture and material properties, such as dot arrays (2) and specular reflections (6).
Table 2: Detection average precision (%) on VOC 2007 test. Rows 1-3 show R-CNN performance without fine-tuning. Rows 4-6 show results for the CNN pre-trained on ILSVRC 2012 and then fine-tuned (FT) on VOC 2007 trainval. Row 7 includes a simple bounding box regression (BB) stage that reduces localization errors (Section 3.4). Rows 8-10 present DPM methods as a strong baseline. The first uses only HOG, while the next two use different feature learning approaches to augment or replace HOG.
    3.2. Ablation studies
Performance layer-by-layer, without fine-tuning. To understand which layers are critical for detection performance, we analyzed results on the VOC 2007 dataset for each of the CNN's last three layers. Layer pool5 was briefly described in Section 3.1. The final two layers are summarized below. Layer fc6 is fully connected to pool5. To compute features, it multiplies a 4096 × 9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases. This intermediate vector is component-wise half-wave rectified (x ← max(0, x)). Layer fc7 is the final layer of the network. It is implemented by multiplying the features computed by fc6 by a 4096 × 4096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification.
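In code, these two layers amount to the following minimal NumPy rendering of the description above:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)  # half-wave rectification, x <- max(0, x)

def fc6_fc7(pool5, W6, b6, W7, b7):
    """fc6/fc7 as described: flatten pool5 (6 x 6 x 256 = 9216), multiply
    by a 4096 x 9216 weight matrix, add biases, rectify; then fc7 applies
    its 4096 x 4096 matrix the same way."""
    x = pool5.reshape(-1)        # 9216-d vector
    fc6 = relu(W6 @ x + b6)      # W6: (4096, 9216), b6: (4096,)
    fc7 = relu(W7 @ fc6 + b7)    # W7: (4096, 4096), b7: (4096,)
    return fc6, fc7
```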
We start by looking at results from the CNN without fine-tuning on PASCAL, i.e., all CNN parameters were pre-trained on ILSVRC 2012 only. Analyzing performance layer-by-layer (Table 2, rows 1-3) reveals that features from fc7 generalize worse than features from fc6. This means that 29%, or about 16.8 million, of the CNN's parameters can be removed without degrading mAP. More surprising is that removing both fc7 and fc6 produces quite good results even though pool5 features are computed using only 6% of the CNN's parameters. Much of the CNN's representational power comes from its convolutional layers, rather than from the much larger densely connected layers. This finding suggests potential utility in computing a dense feature map, in the sense of HOG, of an arbitrary-sized image by using only the convolutional layers of the CNN. This representation would enable experimentation with sliding-window detectors, including DPM, on top of pool5 features.
Performance layer-by-layer, with fine-tuning. We now look at results from our CNN after having fine-tuned its parameters on VOC 2007 trainval. The improvement is striking (Table 2, rows 4-6): fine-tuning increases mAP by 8.0 percentage points to 54.2%. The boost from fine-tuning is much larger for fc6 and fc7 than for pool5, which suggests that the pool5 features learned from ImageNet are general and that most of the improvement is gained from learning domain-specific non-linear classifiers on top of them.
Comparison to recent feature learning methods. Relatively few feature learning methods have been tried on PASCAL VOC detection. We look at two recent approaches that build on deformable part models. For reference, we also include results for the standard HOG-based DPM [18].
The first DPM feature learning method, DPM ST [26], augments HOG features with histograms of "sketch token" probabilities. Intuitively, a sketch token is a tight distribution of contours passing through the center of an image patch. Sketch token probabilities are computed at each pixel by a random forest that was trained to classify 35 × 35 pixel patches into one of 150 sketch tokens or background.
The second method, DPM HSC [28], replaces HOG with histograms of sparse codes (HSC). To compute an HSC, sparse code activations are solved for at each pixel using a learned dictionary of 100 7 × 7 pixel (grayscale) atoms. The resulting activations are rectified in three ways (full and both half-waves), spatially pooled, ℓ2 normalized, and then power transformed (x ← sign(x)|x|^α).
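The power transform at the end is the usual signed-magnitude compression; in NumPy (the α value here is illustrative; see [28] for the value actually used):

```python
import numpy as np

def power_transform(x, alpha=0.25):
    """x <- sign(x) * |x|**alpha: compress magnitudes while keeping sign."""
    return np.sign(x) * np.abs(x) ** alpha
```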
All R-CNN variants strongly outperform the three DPM baselines (Table 2, rows 8-10), including the two that use feature learning. Compared to the latest version of DPM, which uses only HOG features, our mAP is more than 20 percentage points higher: 54.2% vs. 33.7%—a 61% relative improvement. The combination of HOG and sketch tokens yields 2.5 mAP points over HOG alone, while HSC improves over HOG by 4 mAP points (when compared internally to their private DPM baselines—both use non-public implementations of DPM that underperform the open source version [18]). These methods achieve mAPs of 29.1% and 34.3%, respectively.
3.3. Detection error analysis
We applied the excellent detection analysis tool from Hoiem et al. [21] in order to reveal our method's error modes, understand how fine-tuning changes them, and to see how our error types compare with DPM. A full summary of the analysis tool is beyond the scope of this paper and we encourage readers to consult [21] to understand some finer details (such as "normalized AP"). Since the analysis is best absorbed in the context of the associated plots, we present the discussion within the captions of Figure 4 and Figure 5.
Figure 4: Distribution of top-ranked false positive (FP) types. Each plot shows the evolving distribution of FP types as more FPs are considered in order of decreasing score. Each FP is categorized into 1 of 4 types: Loc—poor localization (a detection with an IoU overlap with the correct class between 0.1 and 0.5, or a duplicate); Sim—confusion with a similar category; Oth—confusion with a dissimilar object category; BG—a FP that fired on background. Compared with DPM (see [21]), significantly more of our errors result from poor localization, rather than confusion with background or other object classes, indicating that the CNN features are much more discriminative than HOG. Loose localization likely results from our use of bottom-up region proposals and the positional invariance learned from pre-training the CNN for whole-image classification. Column three shows how our simple bounding box regression method fixes many localization errors.
3.4. Bounding box regression
Based on the error analysis, we implemented a simple method to reduce localization errors. Inspired by the bounding box regression employed in DPM [15], we train a linear regression model to predict a new detection window given the pool5 features for a selective search region proposal. Full details are given in the supplementary material. Results in Table 1, Table 2, and Figure 4 show that this simple approach fixes a large number of mislocalized detections, boosting mAP by 3 to 4 points.
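A sketch of what such a regressor can look like: ridge regression from pool5 features to scale-invariant box transforms. The (dx, dy, dw, dh) target parameterization and closed-form solve below are common practice; the paper's exact formulation and regularization are given in its supplementary material.

```python
import numpy as np

def fit_bbox_regressor(pool5_feats, proposals, gt_boxes, lam=1000.0):
    """Regress pool5 features (M, 9216) to box transforms mapping each
    (x1, y1, x2, y2) proposal toward its matched ground-truth box."""
    def to_cxcywh(b):
        w = b[:, 2] - b[:, 0]; h = b[:, 3] - b[:, 1]
        return np.stack([b[:, 0] + w / 2, b[:, 1] + h / 2, w, h], axis=1)
    p, g = to_cxcywh(proposals), to_cxcywh(gt_boxes)
    targets = np.stack([
        (g[:, 0] - p[:, 0]) / p[:, 2],   # dx: shift relative to width
        (g[:, 1] - p[:, 1]) / p[:, 3],   # dy: shift relative to height
        np.log(g[:, 2] / p[:, 2]),       # dw: log-space scale change
        np.log(g[:, 3] / p[:, 3]),       # dh
    ], axis=1)
    X = pool5_feats
    # Ridge regression in closed form: W = (X^T X + lam*I)^-1 X^T T
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ targets)
    return W  # (9216, 4); predict transforms as pool5_feats @ W
```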
4. Semantic segmentation
Region classification is a standard technique for semantic segmentation, allowing us to easily apply R-CNN to the PASCAL VOC segmentation challenge. To facilitate a direct comparison with the current leading semantic segmentation system (called O2P for "second-order pooling") [4], we work within their open source framework. O2P uses CPMC to generate 150 region proposals per image and then predicts the quality of each region, for each class, using support vector regression (SVR). The high performance of their approach is due to the quality of the CPMC regions and the powerful second-order pooling of multiple feature types (enriched variants of SIFT and LBP). We also note that Farabet et al. [14] recently demonstrated good results on several dense scene labeling datasets (not including PASCAL) using a CNN as a multi-scale per-pixel classifier.
Figure 5: Sensitivity to object characteristics. Each plot shows the mean (over classes) normalized AP (see [21]) for the highest and lowest performing subsets within six different object characteristics (occlusion, truncation, bounding box area, aspect ratio, viewpoint, part visibility). We show plots for our method (R-CNN) with and without fine-tuning (FT) and bounding box regression (BB) as well as for DPM voc-release5. Overall, fine-tuning does not reduce sensitivity (the difference between max and min), but does substantially improve both the highest and lowest performing subsets for nearly all characteristics. This indicates that fine-tuning does more than simply improve the lowest performing subsets for aspect ratio and bounding box area, as one might conjecture based on how we warp network inputs. Instead, fine-tuning improves robustness for all characteristics including occlusion, truncation, viewpoint, and part visibility.
CNN features for segmentation. We evaluate three strategies for computing features on CPMC regions, all of which begin by warping the rectangular window around the region to 227 × 227. The first strategy (full) ignores the region's shape and computes CNN features directly on the warped window, exactly as we did for detection. However, these features ignore the non-rectangular shape of the region. Two regions might have very similar bounding boxes while having very little overlap. Therefore, the second strategy (fg) computes CNN features only on a region's foreground mask. We replace the background with the mean input so that background regions are zero after mean subtraction. The third strategy (full+fg) simply concatenates the full and fg features; our experiments validate their complementarity.
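A sketch of the fg and full+fg strategies, assuming the region mask has already been warped alongside the image; `cnn_forward` is a caller-supplied feature extractor, not the paper's code.

```python
import numpy as np

def fg_features(warped_rgb, warped_mask, mean_image, cnn_forward):
    """The fg strategy: replace background pixels with the mean input so
    they become zero after mean subtraction, then run the CNN."""
    masked = np.where(warped_mask[..., None], warped_rgb, mean_image)
    return cnn_forward(masked - mean_image)

def full_fg_features(full_feat, fg_feat):
    """The full+fg strategy simply concatenates the two feature vectors."""
    return np.concatenate([full_feat, fg_feat])
```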
Table 3: Segmentation mean accuracy (%) on VOC 2011 validation. Column 1 presents O2P; columns 2-7 use our CNN pre-trained on ILSVRC 2012.
Results on VOC 2011. Table 3 shows a summary of our results on the VOC 2011 validation set compared with O2P. (See supplementary material for complete per-category results.) Within each feature computation strategy, layer fc6 always outperforms fc7 and the following discussion refers to the fc6 features. The fg strategy slightly outperforms full, indicating that the masked region shape provides a stronger signal, matching our intuition. However, full+fg achieves an average accuracy of 47.9%, our best result by a margin of 4.2% (also modestly outperforming O2P), indicating that the context provided by the full features is highly informative even given the fg features. Notably, training the 20 SVRs on our full+fg features takes an hour on a single core, compared to 10+ hours for training on O2P features.
In Table 4 we present results on the VOC 2011 test set, comparing our best-performing method, fc6 (full+fg), against two strong baselines. Our method achieves the highest segmentation accuracy for 11 out of 21 categories, and the highest overall segmentation accuracy of 47.9%, averaged across categories (but likely ties with the O2P result under any reasonable margin of error). Still better performance could likely be achieved by fine-tuning.
5. Conclusion
In recent years, object detection performance had stagnated. The best performing systems were complex ensembles combining multiple low-level image features with high-level context from object detectors and scene classifiers. This paper presents a simple and scalable object detection algorithm that gives a 30% relative improvement over the best previous results on PASCAL VOC 2012.
We achieved this performance through two insights. The first is to apply high-capacity convolutional neural networks to bottom-up region proposals in order to localize and segment objects. The second is a paradigm for training large CNNs when labeled training data is scarce. We show that it is highly effective to pre-train the network—with supervision—for an auxiliary task with abundant data (image classification) and then to fine-tune the network for the target task where data is scarce (detection). We conjecture that the "supervised pre-training/domain-specific fine-tuning" paradigm will be highly effective for a variety of data-scarce vision problems.
Table 4: Segmentation accuracy (%) on VOC 2011 test. We compare against two strong baselines: the "Regions and Parts" (R&P) method of [2] and the second-order pooling (O2P) method of [4]. Without any fine-tuning, our CNN achieves top segmentation performance, outperforming R&P and roughly matching O2P.
We conclude by noting that it is significant that we achieved these results by using a combination of classical tools from computer vision and deep learning (bottom-up region proposals and convolutional neural networks). Rather than opposing lines of scientific inquiry, the two are natural and inevitable partners.
Acknowledgments. This research was supported in part by the DARPA Mind's Eye and MSEE programs, by NSF awards IIS-0905647, IIS-1134072, and IIS-1212798, MURI N000014-10-1-0933, and by support from Toyota. The GPUs used in this research were generously donated by the NVIDIA Corporation.