Self-supervised Keypoint Correspondences for Multi-Person Pose Estimation and Tracking in Videos

paper code(coming)

Original post: https://www.yuque.com/jinluzhang/researchblog/pn70wg

Summary🔖

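This paper, from ECCV 2020, proposes a method for 2D multi-person pose estimation and tracking in videos based on keypoint correspondences. It belongs to the pose tracking line of work, but the way it exploits video information is worth borrowing.

For training, the authors do not use video data. Instead, an image and a cropped version of it stand in for two adjacent frames; the network learns keypoint correspondences between them, which are used to recover missed keypoints and poses. The trained network is then applied to video.

The paper offers one way to fill in missed keypoints in video pose tracking and pose estimation: use keypoint correspondences.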

Motivation👓

Video datasets are hard to annotate.
Optical flow and person re-identification each have their own shortcomings for pose tracking.
Method💡

We therefore propose to learn a network that infers keypoint correspondences for multiple persons

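At inference time the method has two steps. First, a top-down 2D pose detector runs on each frame. Then, across the frame sequence, keypoint correspondences are used to fill in missing keypoints and to associate keypoints one-to-one between frames, which yields the tracking result.

Because video annotations are sparse, as noted in the motivation, the authors train on images instead: cropping an image simulates keypoints disappearing within a video clip, and the Keypoint Correspondence Network learns from these pairs.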
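To make the cropping idea concrete, here is a minimal sketch of how one image could be turned into a simulated frame pair. It is not the authors' code: the function name `make_training_pair`, the uniform sampling of the crop position, and the visibility mask are all assumptions made for illustration.

```python
import numpy as np

def make_training_pair(image, keypoints, crop_size, rng):
    """Cut a random window out of `image` to act as the simulated next frame.

    image:     (H, W, C) array.
    keypoints: (p, 2) array of (x, y) pixel coordinates in `image`.
    crop_size: (crop_h, crop_w).
    Returns the crop, the keypoints in crop coordinates, and a visibility mask.
    """
    h, w = image.shape[:2]
    ch, cw = crop_size
    # Top-left corner of the crop, sampled uniformly (an assumption).
    y0 = int(rng.integers(0, h - ch + 1))
    x0 = int(rng.integers(0, w - cw + 1))
    crop = image[y0:y0 + ch, x0:x0 + cw]
    # Shift keypoints into the crop's coordinate frame.
    shifted = keypoints - np.array([x0, y0])
    # Keypoints outside the crop play the role of keypoints that
    # disappear between two video frames.
    visible = ((shifted[:, 0] >= 0) & (shifted[:, 0] < cw) &
               (shifted[:, 1] >= 0) & (shifted[:, 1] < ch))
    return crop, shifted, visible
```

The original image with its keypoints acts as the first frame, and the crop acts as the second, so the correspondence targets come for free from the known crop offset.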

Keypoint Correspondence Network

The keypoint correspondence network has two stages, Siamese and Refinement. Its goal is to learn to match keypoints between two adjacent video frames.

Siamese Stage

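The pipeline is as follows:

The network has two branches. The first takes the original image together with the p detected pose keypoints; the second takes the cropped image. Input images are 256×256, and both branches extract features with a 17-layer GoogLeNet, producing a 32×64×64 feature map (32 channels, height and width 64).

The upper branch yields p keypoint regions of size 32×3×3. Each region is used as a convolution kernel over the lower branch's feature map, and a softmax over the result gives an affinity map. There are p affinity maps, each 64×64. The exact formula is given in the paper.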
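The affinity computation described above can be sketched in NumPy as a cross-correlation followed by a spatial softmax. This is an illustration under assumptions, not the paper's implementation: the function name `affinity_map`, the zero padding that keeps the output at the input's spatial size, and taking the softmax over all spatial positions are my choices.

```python
import numpy as np

def affinity_map(kernel, feat):
    """Cross-correlate one keypoint patch with a feature map, then softmax.

    kernel: (C, k, k) patch cut from the first branch's features.
    feat:   (C, H, W) feature map of the second branch.
    Returns an (H, W) map whose entries sum to 1.
    """
    c, k, _ = kernel.shape
    pad = k // 2
    # Zero-pad so the output keeps the spatial size of `feat` (an assumption).
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)))
    # All k x k windows of the padded map: shape (C, H, W, k, k).
    windows = np.lib.stride_tricks.sliding_window_view(
        padded, (k, k), axis=(1, 2))
    # Correlation score at every position, summed over channels and window.
    scores = np.einsum("chwij,cij->hw", windows, kernel)
    # Softmax over all positions, stabilised by subtracting the maximum.
    e = np.exp(scores - scores.max())
    return e / e.sum()
```

With a 32×3×3 patch and a 32×64×64 feature map this returns one 64×64 map; running it once per keypoint gives the p affinity maps.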

Refinement Stage

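Layers 3-17 of GoogLeNet take the concatenation of the two branches' feature maps F1 and F2 and the affinity maps, and output p maps of size 64×64. This step uses F1 and F2 to fine-tune the affinity maps and improve their quality.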

Training Loss

Training the network first requires constructing the self-supervision signal. The ground-truth map is generated as follows: [Figure 5: ground-truth map formula, defined with respect to the keypoint in image 2]

The loss is binary cross-entropy, computed between the predicted map and the ground-truth map.
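As a rough illustration of the training target and loss, the sketch below builds a binary ground-truth map and evaluates binary cross-entropy against it. The exact ground-truth definition is not reproduced in these notes, so the disk of radius `radius` around the keypoint is an assumption, as are the function names.

```python
import numpy as np

def ground_truth_map(keypoint, size, radius=1.0):
    """Binary map that is 1 within `radius` of the keypoint (assumed form).

    keypoint: (x, y) location in feature-map coordinates.
    size:     side length of the square map, e.g. 64.
    """
    x, y = keypoint
    ys, xs = np.mgrid[0:size, 0:size]
    return ((xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2).astype(np.float64)

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a predicted map and a target map."""
    # Clip to avoid log(0).
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))
```

A prediction that matches the target gives a loss near zero, and the loss grows as probability mass moves away from the keypoint location.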

Recover Missed Keypoints

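Given two adjacent frames, both are fed into the keypoint correspondence network, and the points in the feature map produced by the concatenation step are mapped back to the original image resolution.

This mapping has two problems:

The concatenated map has a lower resolution than the original image.
The current frame f may contain keypoints that are absent from frame f-1 (occluded or otherwise missing), so they have no counterpart in frame f-1.

Therefore, the paper encloses all mapped-back keypoints together with the originally estimated keypoints in a bounding box and re-estimates the pose within it.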
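The two steps above, mapping a low-resolution peak back to image pixels and boxing all keypoints for re-estimation, can be sketched as follows. The scale factor of 4 follows from the 256×256 input and 64×64 maps mentioned earlier; using the cell centre, the `margin` parameter, and the function names are assumptions.

```python
import numpy as np

def to_image_coords(affinity, scale=4.0):
    """Locate the peak of a low-resolution map and map it to image pixels.

    Returns the (x, y) position in the image and the affinity at the peak.
    """
    row, col = np.unravel_index(np.argmax(affinity), affinity.shape)
    # Use the centre of the feature-map cell (an assumption).
    xy = np.array([(col + 0.5) * scale, (row + 0.5) * scale])
    return xy, float(affinity[row, col])

def enclosing_box(points, margin=0.0):
    """Axis-aligned box around propagated and detected keypoints.

    points: (n, 2) array-like of (x, y) positions.
    Returns (x_min, y_min, x_max, y_max), enlarged by `margin` pixels.
    """
    points = np.asarray(points, dtype=float)
    x_min, y_min = points.min(axis=0) - margin
    x_max, y_max = points.max(axis=0) + margin
    return x_min, y_min, x_max, y_max
```

The resulting box would then be cropped and passed to the pose estimator again, which sidesteps the coarse resolution of the propagated locations.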

Track

Abstractly, pose tracking matches the Q persons detected in the previous frame against the P persons detected in the current frame by similarity. A keypoint from frame f-1 is considered for tracking only when its affinity value in the concatenated feature map on frame f [Figure 9] exceeds a threshold.
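Since the paper's similarity formula is not reproduced in these notes, the sketch below uses a stand-in: the fraction of confidently propagated keypoints that land near the detected pose, followed by a greedy assignment. The thresholds `tau`, `radius`, and `min_sim`, the greedy strategy, and both function names are assumptions, not the authors' method.

```python
import numpy as np

def pose_similarity(propagated, affinities, detected, tau=0.3, radius=10.0):
    """Stand-in similarity between a propagated pose and a detected pose.

    propagated: (k, 2) keypoints of a frame f-1 person mapped into frame f.
    affinities: (k,) affinity values of those propagated keypoints.
    detected:   (k, 2) keypoints of a frame f person.
    Only keypoints whose affinity exceeds `tau` are considered.
    """
    confident = affinities > tau
    if not confident.any():
        return 0.0
    dist = np.linalg.norm(propagated - detected, axis=1)
    return float(np.mean(dist[confident] <= radius))

def greedy_match(sim, min_sim=0.5):
    """Greedily pair previous-frame persons (rows) with current ones (columns).

    Returns a dict {current_index: previous_index}; unmatched current
    persons would start new tracks.
    """
    sim = np.array(sim, dtype=float)
    matches = {}
    while sim.size and sim.max() >= min_sim:
        q, p = np.unravel_index(np.argmax(sim), sim.shape)
        matches[int(p)] = int(q)
        # Remove the matched row and column from further consideration.
        sim[q, :] = -np.inf
        sim[:, p] = -np.inf
    return matches
```

For example, with similarity rows `[0.9, 0.1]`, `[0.2, 0.8]`, `[0.3, 0.4]` the two current-frame persons inherit the IDs of previous persons 0 and 1, and previous person 2 stays unmatched.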

Evaluation🧪

Dataset

Posetrack 2017 and Posetrack 2018
The datasets have 292 and 593 videos for training and 214 and 375 videos for evaluation, respectively.

Metrics

mAP and MOTA evaluation metrics; for both, higher is better.
$$\mathrm{MOTA} = 1 - \frac{\sum_t \left(FN_t + FP_t + IDSW_t\right)}{\sum_t GT_t}$$
$FN_t$: misses (missed detections), i.e. targets in frame t that have no hypothesis matched to them.
$FP_t$: false positives, i.e. hypotheses in frame t that have no tracked target matched to them.
$IDSW_t$: ID switches (mismatches), i.e. the number of times tracked targets change ID in frame t, which mostly happens under occlusion.
$GT_t$: number of ground-truth targets in frame t.
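As a small worked example, MOTA can be computed from per-frame counts of misses, false positives, ID switches, and ground-truth targets. This uses the standard MOTA definition; the function name and the list-based interface are illustrative choices.

```python
def mota(fn, fp, idsw, gt):
    """Multiple Object Tracking Accuracy from per-frame counts.

    fn, fp, idsw, gt: sequences with one entry per frame, holding the
    number of misses, false positives, ID switches, and ground-truth
    targets in that frame.
    """
    total_gt = sum(gt)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth targets")
    errors = sum(fn) + sum(fp) + sum(idsw)
    return 1.0 - errors / total_gt
```

With one miss in the first frame, one false positive in the second, and five targets per frame, `mota([1, 0], [0, 1], [0, 0], [5, 5])` gives 1 - 2/10. Note that MOTA can be negative when the errors outnumber the ground-truth targets.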

Setup

Top-Down pose estimation network

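The pose estimation pipeline uses Cascade R-CNN for person detection and extracts 384×288 person crops.
The detections are then fed to a two-stage estimation network. Both stages use GoogLeNet to produce heatmaps, and the second stage's heatmaps are taken as the result.

Training on MS-COCO: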

We train the pose estimation framework on the MS-COCO dataset [26] for 260 epochs with a base learning rate of 1e-3. The learning rate is reduced to after 200 epochs. During training we apply random flippings and rotations to input crops.

Fine-tuning on PoseTrack 2017:

We finetune the pose estimation framework on the PoseTrack2017 dataset [1] for 12 epochs.

Keypoint Correspondence framework

The Siamese module is trained first, then the refinement module:

Both modules are trained for 100 epochs with base learning rate of 1e-4 reduced to 1e-5 after 50 epochs.
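The quoted schedule amounts to a single step drop. A minimal sketch, assuming zero-based epoch indices and that the drop takes effect from epoch 50 onward (the function name is hypothetical):

```python
def step_lr(epoch, base_lr=1e-4, drop_epoch=50, dropped_lr=1e-5):
    """Learning rate for a given zero-based epoch under a one-step schedule."""
    return base_lr if epoch < drop_epoch else dropped_lr

# Learning rates for the 100 training epochs of one module.
schedule = [step_lr(e) for e in range(100)]
```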

One interesting point: fine-tuning on PoseTrack brings no performance gain, which suggests that the network already learns enough from a generic dataset.

Result

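The first experiment shows how the Keypoint Correspondence Network compares with other baselines on the tracking task.

The second experiment shows that recovering missed keypoints benefits both pose estimation and tracking.

The third experiment compares the method's pose estimation accuracy with other SOTA methods.

The last experiment compares the method with other SOTA methods when pose estimation and tracking are performed jointly.
MOTA scores reach 70.5 and 67.9 on the PoseTrack 2017 and 2018 test sets, respectively.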

Conclusion⭐️

Contribution

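A self-supervised framework for multi-frame, multi-person pose estimation (main contribution)
A method for recovering missed pose keypoints, and for associating the recovered keypoints with the detected ones
SOTA results on PoseTrack 2017 and 2018 while using only MS-COCO as additional data to train the keypoint correspondence network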

Rethink❓
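Training uses images for self-supervision; could the same scheme be applied directly to adjacent frames of a video dataset?
At evaluation time only frame f-1 is used for keypoint correspondence with the current frame; could this be extended to more frames, or to later frames?
For 2D pose estimation the authors use a top-down approach but not a current SOTA 2D detector; this may be one direction for improvement.
The self-supervision here is only with respect to the target task's dataset: the keypoint correspondence network is trained on a generic dataset such as MS-COCO. The benefit is that no training on the target task is needed, which avoids raising the data requirements. This approach is worth considering as a next step.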
Notes📝

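One oddity: in the third and fourth experiments, the mAP values on the same dataset clearly differ for this method, while other methods such as HRNet and FlowTrack report identical numbers. This is because the third experiment was run by the authors themselves, while the fourth was evaluated on the PoseTrack test server.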

Track📚
Using temporal information from adjacent frames
- Bertasius, G., Feichtenhofer, C., Tran, D., Shi, J., Torresani, L.: Learning temporal pose estimation from sparsely-labeled videos. In: NeurIPS (2019)
- Guo, H., Tang, T., Luo, G., Chen, R., Lu, Y., Wen, L.: Multi-domain pose network for multi-person pose estimation and tracking. In: CVPR (2018)
- Xiu, Y., Li, J., Wang, H., Fang, Y., Lu, C.: Pose Flow: Efficient online pose tracking. In: BMVC (2018)
Self-supervised temporal correspondence
- Wang, X., Jabri, A., Efros, A.A.: Learning correspondence from the cycle-consistency of time. In: CVPR (2019)
- Li, X., Liu, S., Mello, S.D., Wang, X., Kautz, J., Yang, M.H.: Joint-task self-supervised learning for temporal correspondence. In: NeurIPS (2019)
Correspondence learning
- Kim, S., Min, D., Ham, B., Jeon, S., Lin, S., Sohn, K.: Fully convolutional self-similarity for dense semantic correspondence. In: CVPR (2017)
- Han, K., Rezende, R.S., Ham, B., Wong, K., Cho, M., Schmid, C., Ponce, J.: SCNet: Learning semantic correspondence. In: ICCV (2017)
- Choy, C., Gwak, J., Savarese, S., Chandraker, M.: Universal correspondence network. In: NIPS (2016)

ref:

https://zhuanlan.zhihu.com/p/75776828
