原文地址:https://arxiv.org/pdf/1605.03170.pdf 作者:Eldar Insafutdinov, Bernt Schiele 项目地址:略

本文贡献:
We propose (1) improved body part detectors that generate effective bottom-up proposals for body parts; (2) novel image-conditioned pairwise terms that allow to assemble the proposals into a variable number of consistent body part configurations; and (3) an incremental optimization strategy that explores the search space more efficiently thus leading both to better performance and significant speed-up factors.
最终的方法:Deeper(采用残差网络),stronger(introduce novel image-conditioned pairwise terms),and faster(采用增量优化方法等)

在优化的时候,作者采用了 Brach and cut 方式 => 更加高效,快速!

关节点检测器

Stride. 作者发现 8 px 的步长得到的模型效果相对更好;所以需要调整 stride(原来的 stride 更大,需要更小的 stride)。恢复分辨率的常用方式:上采样、反卷积以及空洞卷积。通常来说,就是空洞卷积的方式效果更好,但是本文由于 GPU Memory 的限制,不能直接采用这种方式(???)所以作者采用了一种混合的方式:First, we remove the final classification as well as average pooling layer. Then, we decrease the stride of the first convolutional layers of the conv5 bank from 2 px to 1 px to prevent down-sampling. Next, we add holes to all 3x3 convolutions in conv5 to preserve their receptive field. This reduces the stride of the full CNN to 16 px. Finally, we add deconvolutional layers for 2x up-sampling and connect the final output to the output of the conv3 bank.(结果分辨率更高,并且 stride 更小,结果更加精细)

空洞卷积提高分辨率:主要就是降低卷积步长,比如原来 stride = 2。空洞卷积时就可以将 stride 设为 1,显然输出的结果是更加 dense 的;但是直接这样做的话,会使得原来 stride = 2 的网络“感受野”相对于后者来说,这样后者就损失了感受野。为了解决这个问题,就采用空洞的方式来增大感受野。(注意由于采用了全卷积的网络,虽然改变 stride 时会改变中间层的数据维度,但是由于都是卷积层,没有 fc 层,不会出现维度不匹配的情况。)空洞的目的:输出更加 dense 的 feature map 的同时,保证感受野。

感受野. 有研究者,认为感受野并不是那么重要,他们采用多分辨率也可以得到比较好的结果。由于 ResNet 很深,所以感受野非常大,无需讨论。We empirically find that re-scaling the original image such that an upright standing person is 340 px high leads to best performance.(如果是首先进行人体检测的话,可以将其作为一个先验知识,在对数据进行预处理时,进行 resize)
中级监督. 其优点:1、解决梯度消失问题;2、中间级的 score maps 作为后续级的输入将更好的编码关节点之间的空间关系。To address the second concern, we make a slightly different choice: we add part loss layers inside the conv4 bank of ResNet. We argue that it is not strictly necessary to use scoremaps as inputs for the subsequent stages.

Image-Conditioned Pairwise Terms

As discussed in Sec. 3, a large receptive field for the CNN-based part detectors allows to accurately predict the presence of a body part at a given location. However, it also contains enough evidence to reason about locations of other parts in the vicinity. __We draw on this insight and propose to also use deep networks to make pairwise part-to-part predictions.
image.png
image.png

模型

Our approach is inspired by the body part location refinement described in Sec. 3. In addition to predicting offsets for the current joint, we directly regress from the current location to the relative positions of all other joints. 对于 Score Map 上被标记为 positive 的点 2016ECCV_DeeperCut - 图3,其类别为 2016ECCV_DeeperCut - 图4 ;对于其他属于类别 2016ECCV_DeeperCut - 图5 的点,我们堆其他类别的点定义相对位置:2016ECCV_DeeperCut - 图6. We add an extra layer that predicts relative position 2016ECCV_DeeperCut - 图7 and train it with a smooth L1 loss function. We thus perform joint training of body part detectors (cross-entropy loss), location regression (L1 loss) and pairwise regression (L1 loss) by linearly combining all three loss functions. The targets t are normalized to have zero mean and unit variance over the training set.
image.png

这里和之前的 DeepCut 是非常类似的。

Incremental Optimization

之前采用的 DeepCut 方法在理论分析非常优雅,但是实际不太实用。
image.png