基于视频序列 - CVPR2019_3D human pose estimation in video with temporal convolutions and semi-supervised training - 《深度学习笔记》

本文简介
相关工作
Temporal dilated convolutional model
Semi-supervised approach
- Trajectory model
Bone length L2 loss

论文原文：3D human pose estimation in video with temporal convolutions and semi-supervised training 作者信息：Dario Pavllo, Christoph Feichtenhofer, David Grangier, Michael Auli (facebook AI Research) 工程地址：https://github.com/facebookresearch/VideoPose3D

本文简介

本文方法取得的效果优于已有的监督学习方法以及半监督学习方法。半监督时，对于未标注的数据，通过估计3D信息之后映射回2D的方式。

3D估计实现方式，首先进行2D关节点的检测，然后在此基础上进行3D关节点的估计，但是这种方式有些问题：

While splitting up the problem arguably reduces the difficulty of the task, it is inherently ambiguous as multiple 3D poses can map to the same 2D keypoints. Previous work tackled this ambiguity by modeling temporal information with recurrent neural networks

前面的一些工作采用的RNNs等网络来建模这种时序信息，本文采用的时域的空洞卷积实现：Compared to approaches relying on RNNs, it provides higher accuracy, simplicity, as well as efficiency, both in terms of computational complexity as well as the number of parameters

作者灵感来源：Our method is inspired by cycle consistency in unsupervised machine translation, where round-trip translation into an intermediate language and back into the original language should be close to the identity function. Specifically, we predict 2D keypoints for an unlabeled video with an off the shelf 2D keypoint detector, predict 3D poses, and then map these back to 2D space.

Compared to previous semi-supervised approaches, we only require camera intrinsic parameters rather than ground-truth 2D annotations or multi-view imagery with extrinsic camera parameters.（也就是在半监督过程中需要获取相机的内部参数罗？？？那如果不知道相机内部参数，采用默认参数应该对结果有一定的影响）

Temporal dilated convolutional model

Our model is a fully convolutional architecture with residual connections that takes a sequence of 2D poses as input and transforms them through temporal convolutions.

Convolutional models enable parallelization over both the batch and the time dimension while RNNs cannot be parallelized over time.

不采用RNN，采用卷积的原因：1、卷积能够并行计算；2、不论序列有多长，卷积模型的梯度回传长度一定，有利于降低梯度消失的概率；3、卷积结构能够对时域上的感受野精确控制；4、dilated convolution 有利于建模长期依赖；

网络输入层输入 CVPR2019_3D human pose estimation in video with temporal convolutions and semi-supervised training - 图3 个关节点的在时域上坐标的连接结果，然后对其进行 kernel size 为的时域卷积，卷积输出通道数为。在这之后，采用的模块。每个模块首先进行 CVPR2019_3D human pose estimation in video with temporal convolutions and semi-supervised training - 图7 卷积，其 kernel size 为，dilation factor 为，紧接着的是 kernel size 为 1 的卷积。除了最好一层，所有的卷积后面都有 BN，Relu 以及 dropout 层。每一层都会按照指数级 CVPR2019_3D human pose estimation in video with temporal convolutions and semi-supervised training - 图10 增加感受野，但是参数增长速度是线性的。其中的是超参数，但是他们被设置以致于使得对于任何输出帧的感受野形成一种树状，能够覆盖所有的输入帧（如图1所示）最后一个输出包含所有输入序列的一个3D姿态，同时考虑了过去和未来的时间信息.为了评估这个方法在实时场景的应用，我们试验了因果卷积的方法（只对过去的信息进行卷积）。

卷积网络一般采用零填充来获得和输入大小一样的输出，然而在早期实验中发现这会导致边际效应，增加了损失值。所以，我们通过左右帧关节点的复制填充这个输入序列。

在图2显示了本文你体系结构的一个实例，其中感受野大小为243帧，B = 4 块。对于卷积层，我们设置 W = 3，C = 1024 输出通道，我们使用 dropout drop 率 p = 0.25。

Semi-supervised approach

因为获得实际的3D姿态预测的标注很困难，本文引进了一种半监督训练方法去提高在实际3D姿态标注有限情况下的姿态预测准确率。本文利用现有的2D姿态检测器和未标注的视频，将反向映射损失加入到监督损失函数中。去解决无标签数据的自动编码问题的关键思想是：将 3D 姿态预测作为编码器，把预测的姿态反向映射到 2D 姿态，基于此进行一个重建损失的计算。

Figure 3: Semi-supervised training with a 3D pose model that takes a sequence of possibly predicted 2D poses as input. We regress the 3D trajectory of the person and add a soft-constraint to match the mean bone lengths of the unlabeled predictions to the labeled ones. Everything is trained jointly. WMPJPE stands for “Weighted MPJPE”.

对于有标签的数据采用实际的3D姿态作为目标，训练一个有监督的损失。未标签的数据被用于去执行一个自动编码器损失，即将预测的3D姿态反向映射到2D，然后检查它输入的连续性。

Trajectory model

Due to the perspective projection, the 2D pose on the screen depends both on the trajectory (i.e. the global position of the human root joint) and the 3D pose (the position of all joints with respect to the root joint). Without the global position, the subject would always be reprojected at the center of the screen with a fixed scale. We therefore also regress the 3D trajectory of the person, so that the back-projection to 2D can be performed correctly. To this end, we optimize a second network which regresses the global trajectory in camera space. The latter is added to the pose before projecting it back to 2D. The two networks have the same architecture but do not share any weights as we observed that they affect each other negatively when trained in a multi-task fashion. As it becomes increasingly difficult to regress a precise trajectory if the subject is further away from the camera, we optimize a weighted mean per-joint position error (WMPJPE) loss function for the trajectory:

即，我们对每个样本给一个实际相机空间深度 CVPR2019_3D human pose estimation in video with temporal convolutions and semi-supervised training - 图14 的倒数作为权重。因为在远处物体中，相关的2D关节点都集中在一起，所以回归远处的物体也不是我们的目标。（越远的关节点预测的误差占比越小）