Multi-person Human Pose Estimation - 2017_CVPR_Towards Accurate Multi-person Pose Estimation in the Wild - "Deep Learning Notes"

Paper: Towards Accurate Multi-person Pose Estimation in the Wild. Blog reference: https://blog.csdn.net/bryant_meng/article/details/109066838

The paper takes a two-stage approach: In the first stage, we predict the location and scale of boxes which are likely to contain people; for this we use the Faster RCNN detector. In the second stage, we estimate the keypoints of the person potentially contained in each proposed bounding box. For each keypoint type we predict dense heatmaps and offsets using a fully convolutional ResNet. To combine these outputs we introduce a novel aggregation procedure to obtain highly localized keypoint predictions. We also use a novel form of keypoint-based Non-Maximum-Suppression (NMS), instead of the cruder box-level NMS, and a novel form of keypoint-based confidence score estimation, instead of box-level scoring.

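Before this work, many keypoint estimation papers addressed single-person pose estimation, where the dataset supplies the ground-truth (GT) person location. The authors propose a two-stage method instead: detect people first, then run single-person pose estimation on each detection.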

Person detector: the authors use Faster-RCNN with a ResNet backbone.

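Keypoint detection: the authors use a fully convolutional ResNet-101 to predict coarse keypoint heatmaps and offsets at the same time. With K keypoints per person, the network produces K heatmaps, one per keypoint. Each keypoint also gets a 2-D offset prediction, so every heatmap is paired with two offset maps (one for x, one for y), and the network outputs 3*K channels in total. Network details: the authors directly use the publicly available fully convolutional ResNet, whose last layer is global average pooling, and replace that last layer with a 1 x 1 convolution that yields 3 x K feature maps. Note: the heatmaps here are not Gaussian. The authors mark every position within a given range of the keypoint as positive (i.e. inside a disk of radius R around the keypoint), so predicting the heatmaps is in fact a binary classification problem.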
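The target layout described above (K binary disk heatmaps plus 2K offset maps) can be sketched in NumPy. This is only an illustration under assumptions: the function name `make_targets`, the crop size, and the radius value are hypothetical and not taken from the paper.

```python
import numpy as np

def make_targets(keypoints, height, width, radius):
    """Build binary disk heatmaps and 2-D offset maps for K keypoints.

    keypoints: array-like of shape (K, 2) holding (x, y) pixel positions.
    Returns heatmaps of shape (K, H, W) and offsets of shape (K, 2, H, W).
    """
    keypoints = np.asarray(keypoints, dtype=np.float64)
    ys, xs = np.mgrid[0:height, 0:width]            # pixel grid coordinates
    dx = keypoints[:, 0, None, None] - xs[None]     # keypoint x minus pixel x
    dy = keypoints[:, 1, None, None] - ys[None]     # keypoint y minus pixel y
    dist = np.sqrt(dx ** 2 + dy ** 2)
    heatmaps = (dist <= radius).astype(np.float64)  # 1 inside the disk, 0 outside
    offsets = np.stack([dx, dy], axis=1)            # vector from pixel to keypoint
    return heatmaps, offsets

# Two keypoints on a 16 x 16 crop; radius 3 is an arbitrary illustrative value.
heatmaps, offsets = make_targets([[5, 4], [10, 12]], height=16, width=16, radius=3)
targets = np.concatenate([heatmaps, offsets.reshape(-1, 16, 16)], axis=0)
print(targets.shape)  # 3 * K channels stacked along the first axis
```

Because the positive region is a hard disk rather than a Gaussian bump, the heatmap channels can be trained with a per-pixel binary classification loss.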

The inspiration comes from two-stage object detection.

Fusing keypoint heatmaps and offsets: the two are combined with the formula $f_k(x_i) = \frac{1}{\pi R^2} \sum_j G(x_j + F_k(x_j) - x_i)\, h_k(x_j)$, where $h_k$ is the heatmap of the k-th keypoint, $F_k(x_j)$ is the offset between the current position $x_j$ and the true keypoint position, and $G(\cdot)$ is a bilinear interpolation kernel. This is a form of Hough voting: each point j in the image crop grid casts a vote with its estimate for the position of every keypoint, with the vote being weighted by the probability that it is in the disk of influence of the corresponding keypoint. The normalizing factor equals the area of the disk and ensures that if the heatmaps and offsets were perfect, then $f_k(x_i)$ would be a unit-mass delta function centered at the position of the k-th keypoint.
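The Hough voting step can be sketched for a single keypoint as follows. It is a minimal NumPy illustration, assuming the bilinear kernel is realized by splatting each vote onto its four neighbouring pixels; the function name `hough_aggregate` and the synthetic inputs are hypothetical.

```python
import numpy as np

def hough_aggregate(heatmap, offset, radius):
    """Fuse one heatmap (H, W) with its offsets (2, H, W) into an activation map."""
    height, width = heatmap.shape
    out = np.zeros((height, width), dtype=np.float64)
    ys, xs = np.mgrid[0:height, 0:width]
    px = (xs + offset[0]).ravel()      # x position each pixel votes for
    py = (ys + offset[1]).ravel()      # y position each pixel votes for
    w = heatmap.ravel()                # vote weight = heatmap probability
    x0 = np.floor(px).astype(int)
    y0 = np.floor(py).astype(int)
    fx, fy = px - x0, py - y0          # fractional parts for bilinear weights
    corners = ((0, 0, (1 - fx) * (1 - fy)), (1, 0, fx * (1 - fy)),
               (0, 1, (1 - fx) * fy), (1, 1, fx * fy))
    for ddx, ddy, weight in corners:
        xi, yi = x0 + ddx, y0 + ddy
        ok = (xi >= 0) & (xi < width) & (yi >= 0) & (yi < height)
        # np.add.at accumulates correctly when several votes hit the same pixel.
        np.add.at(out, (yi[ok], xi[ok]), (w * weight)[ok])
    return out / (np.pi * radius ** 2)  # normalize by the disk area

# Synthetic "perfect" predictions for a keypoint at (x=5, y=4), radius 3.
height = width = 16
ys, xs = np.mgrid[0:height, 0:width]
dx, dy = 5.0 - xs, 4.0 - ys
heatmap = (np.sqrt(dx ** 2 + dy ** 2) <= 3).astype(np.float64)
offset = np.stack([dx, dy])
activation = hough_aggregate(heatmap, offset, radius=3)
peak = np.unravel_index(np.argmax(activation), activation.shape)
print(peak)  # (row, col) of the strongest response
```

With perfect inputs every vote lands on the keypoint, so the map collapses to a single spike whose mass is close to one; the small deviation comes from counting a discrete pixel disk against the continuous area.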

Where does the function G come from? The ideal …

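Person image cropping: to avoid distorting the person when the image is resized, the authors extend the predicted boxes to a fixed aspect ratio and only then resize the crop to the target size.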
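The box extension can be illustrated with a small helper. This is a sketch under assumptions: the name `extend_box_to_aspect`, the choice to grow the box about its centre, and the ratio used in the example are hypothetical, and clipping or padding at the image border is left out.

```python
def extend_box_to_aspect(x0, y0, x1, y1, aspect):
    """Grow a box about its centre until height / width equals `aspect`.

    Only one side is ever enlarged, so the original box is always contained
    in the result and the person is not squeezed by the later resize.
    """
    w, h = x1 - x0, y1 - y0
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    if h / w < aspect:   # box is too wide for the target ratio: extend height
        h = w * aspect
    else:                # box is too tall (or already exact): extend width
        w = h / aspect
    return cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0

# A square detection stretched to an illustrative 2:1 (height:width) ratio.
print(extend_box_to_aspect(0, 0, 100, 100, aspect=2.0))  # (0.0, -50.0, 100.0, 150.0)
```

Extending rather than shrinking keeps every detected pixel inside the crop, at the cost of including some extra background.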

Model training: there are two losses, one on the heatmaps and one on the offsets. The offset loss only counts predictions inside the designated region around each keypoint.

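The total loss is a weighted sum of the two: $L(\theta) = \lambda_h L_h(\theta) + \lambda_o L_o(\theta)$, where $L_h$ is the heatmap loss and $L_o$ is the offset loss.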
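A two-term loss of this kind can be sketched in NumPy. Several details here are assumptions for illustration rather than the paper's exact recipe: a per-pixel logistic loss for the heatmaps, a Huber penalty on the offset residual, the averaging used for normalization, and the placeholder weights `w_heatmap` and `w_offset`.

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber penalty on non-negative residual norms r."""
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

def pose_loss(logits, pred_offsets, target_heatmaps, target_offsets,
              w_heatmap=1.0, w_offset=1.0):
    """logits and target_heatmaps: (K, H, W); offsets: (K, 2, H, W)."""
    # Numerically stable binary cross-entropy computed from raw logits.
    bce = (np.maximum(logits, 0) - logits * target_heatmaps
           + np.log1p(np.exp(-np.abs(logits))))
    loss_heatmap = bce.mean()
    # Offset residual norm per pixel, counted only inside the positive disks.
    resid = np.linalg.norm(pred_offsets - target_offsets, axis=1)
    mask = target_heatmaps > 0
    loss_offset = huber(resid[mask]).sum() / max(int(mask.sum()), 1)
    return w_heatmap * loss_heatmap + w_offset * loss_offset

# Tiny example: one keypoint, a 2 x 2 positive region, untrained (zero) logits.
target_heatmaps = np.zeros((1, 6, 6))
target_heatmaps[0, 2:4, 2:4] = 1.0
target_offsets = np.zeros((1, 2, 6, 6))
logits = np.zeros((1, 6, 6))
print(pose_loss(logits, target_offsets, target_heatmaps, target_offsets))
```

Masking the offset term means that pixels far from a keypoint, where an offset has no useful meaning, never contribute gradient to the regression branch.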

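Pose Rescoring: pose scores can be used for NMS. Relying only on the box scores from the person detector works worse than the approach in this paper, which also incorporates the keypoint prediction scores.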
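One simple way to turn keypoint predictions into a pose-level score is to average the peak response of each keypoint's activation map. The sketch below assumes exactly that; the name `pose_score` and the averaging rule are illustrative assumptions, not a statement of the paper's exact formula.

```python
import numpy as np

def pose_score(activation_maps):
    """Mean over keypoints of each activation map's maximum value.

    activation_maps: array-like of shape (K, H, W).
    """
    maps = np.asarray(activation_maps, dtype=np.float64)
    peaks = maps.reshape(maps.shape[0], -1).max(axis=1)  # best response per keypoint
    return float(peaks.mean())

# Two keypoints with peak responses 0.8 and 0.4.
maps = np.zeros((2, 4, 4))
maps[0, 1, 1] = 0.8
maps[1, 2, 3] = 0.4
print(pose_score(maps))
```

A score built this way drops when some keypoints are weakly localized, even if the box itself was detected with high confidence.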

OKS-Based Non Maximum Suppression: person detection uses NMS as post-processing, suppressing heavily overlapping boxes based on IoU. On top of IoU-NMS, the authors additionally apply OKS-based NMS to the final pose estimates (measure overlap using the object keypoint similarity (OKS) for two candidate pose detections).
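A greedy OKS-based NMS can be sketched as below. The OKS form follows the COCO-style definition with all keypoints treated as visible; the per-keypoint constants `kappas`, the use of the kept pose's scale, and the threshold value are assumptions for illustration.

```python
import numpy as np

def oks(pose_a, pose_b, scale, kappas):
    """Object keypoint similarity between two (K, 2) poses.

    scale: object scale s; kappas: (K,) per-keypoint falloff constants.
    Visibility flags are ignored here (every keypoint counts).
    """
    a = np.asarray(pose_a, dtype=np.float64)
    b = np.asarray(pose_b, dtype=np.float64)
    d2 = np.sum((a - b) ** 2, axis=1)  # squared distance per keypoint
    k2 = np.asarray(kappas, dtype=np.float64) ** 2
    return float(np.mean(np.exp(-d2 / (2.0 * scale ** 2 * k2))))

def oks_nms(poses, scores, scales, kappas, threshold):
    """Keep poses in descending score order, dropping near-duplicates by OKS."""
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Suppress pose i if it is too similar to any already kept pose.
        if all(oks(poses[i], poses[j], scales[j], kappas) <= threshold for j in keep):
            keep.append(i)
    return keep

poses = [[[0, 0], [10, 0]],          # best-scoring pose
         [[1, 0], [10, 1]],          # near-duplicate of the first
         [[100, 100], [110, 100]]]   # a different person
keep = oks_nms(poses, scores=[0.9, 0.8, 0.7], scales=[10, 10, 10],
               kappas=[0.1, 0.1], threshold=0.5)
print(keep)  # indices of the surviving poses
```

Measuring overlap on keypoints rather than boxes lets two people with heavily overlapping boxes both survive, as long as their poses differ.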

Final results

Effect of person detector quality

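The quality of the person detector does have some effect. PS: to check how much the person detector affects keypoint detection, one can measure keypoint AP using ground-truth (GT) person boxes.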

Effect of backbone and input size

Choice of the OKS-NMS threshold