Learning Temporal Pose Estimation from Sparsely-Labeled Videos

Published 6 Jun 2019 in cs.CV | (1906.04016v3)

Abstract: Modern approaches for multi-person pose estimation in video require large amounts of dense annotations. However, labeling every frame in a video is costly and labor intensive. To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation. Given a pair of video frames (a labeled Frame A and an unlabeled Frame B), we train our model to predict human pose in Frame A using the features from Frame B by means of deformable convolutions to implicitly learn the pose warping between A and B. We demonstrate that we can leverage our trained PoseWarper for several applications. First, at inference time we can reverse the application direction of our network in order to propagate pose information from manually annotated frames to unlabeled frames. This makes it possible to generate pose annotations for the entire video given only a few manually-labeled frames. Compared to modern label propagation methods based on optical flow, our warping mechanism is much more compact (6M vs 39M parameters), and also more accurate (88.7% mAP vs 83.8% mAP). We also show that we can improve the accuracy of a pose estimator by training it on an augmented dataset obtained by adding our propagated poses to the original manual labels. Lastly, we can use our PoseWarper to aggregate temporal pose information from neighboring frames during inference. This allows our system to achieve state-of-the-art pose detection results on the PoseTrack2017 and PoseTrack2018 datasets. Code has been made available at: https://github.com/facebookresearch/PoseWarper.


Authors (5)

Citations (71)


Summary

The paper presents PoseWarper, a method that leverages sparsely labeled videos to propagate pose information using deformable convolutions.
Its warping mechanism is compact (6 million parameters versus 39 million for optical-flow-based label propagation) and more accurate, reaching 88.7% mAP compared to 83.8%.
The approach achieves state-of-the-art pose detection on the PoseTrack2017 and PoseTrack2018 datasets while reducing the amount of manual labeling required.

Summary of "Learning Temporal Pose Estimation from Sparsely-Labeled Videos"

The paper presents PoseWarper, a method for multi-person pose estimation in video that addresses the dependence of existing approaches on dense annotations. The key idea is to train on sparsely annotated videos, where only every k-th frame is labeled, which significantly reduces the labeling effort. During training, the network receives a labeled Frame A and an unlabeled Frame B and must predict the human pose in Frame A using the features from Frame B. To do so, it uses deformable convolutions to implicitly learn the pose warping between the two frames, which later allows pose information to be propagated from labeled frames to unlabeled ones.
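The training setup described above can be illustrated with a minimal sketch of how (Frame A, Frame B) pairs might be drawn from a sparsely labeled video. The function name `sample_training_pairs`, the `max_gap` window, and the uniform sampling of Frame B are illustrative assumptions, not details taken from the paper.

```python
import random

def sample_training_pairs(num_frames, k, max_gap, seed=0):
    """Pair each labeled frame A with a nearby unlabeled frame B.

    Assumptions for illustration only: frames 0, k, 2k, ... are the labeled
    ones, and B is drawn uniformly from within `max_gap` frames of A.
    """
    rng = random.Random(seed)
    pairs = []
    for a in range(0, num_frames, k):  # iterate over labeled frames only
        lo = max(0, a - max_gap)
        hi = min(num_frames - 1, a + max_gap)
        # Keep only unlabeled neighbours of A inside the window.
        candidates = [b for b in range(lo, hi + 1) if b % k != 0]
        if candidates:
            pairs.append((a, rng.choice(candidates)))
    return pairs

# A 30-frame video labeled every 7th frame, with B at most 3 frames from A.
pairs = sample_training_pairs(num_frames=30, k=7, max_gap=3)
```

Each returned tuple holds the index of a labeled frame followed by the index of an unlabeled neighbour; the loss would be computed against the labels of the first index only.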

Major Results

Through extensive empirical evaluation, the authors demonstrate the effectiveness of PoseWarper in three settings. First, at inference time the network's application direction can be reversed to propagate pose information from manually annotated frames to unlabeled ones, producing pose annotations for an entire video from only a few labeled frames. The warping mechanism is compact, with 6 million parameters compared to 39 million for label propagation based on optical flow, yet it is more accurate, recording $88.7\%$ mAP compared to $83.8\%$. Second, training a pose estimator on a dataset augmented with poses propagated by PoseWarper improves its accuracy. Third, using PoseWarper to aggregate temporal pose information from neighboring frames at inference improves pose detection, achieving state-of-the-art results on the PoseTrack2017 and PoseTrack2018 datasets.
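The temporal aggregation step can be pictured with a small sketch. It assumes the heatmaps of neighboring frames have already been warped onto the target frame, and it fuses them by plain averaging; the averaging rule and the helper names `aggregate_heatmaps` and `peak_location` are assumptions made for illustration and may differ from the aggregation used in the paper.

```python
def aggregate_heatmaps(warped_heatmaps):
    """Average same-sized heatmaps that were already warped onto the target frame."""
    n = len(warped_heatmaps)
    h, w = len(warped_heatmaps[0]), len(warped_heatmaps[0][0])
    return [[sum(hm[y][x] for hm in warped_heatmaps) / n for x in range(w)]
            for y in range(h)]

def peak_location(heatmap):
    """Return the (row, col) of the largest heatmap value."""
    cells = ((y, x) for y in range(len(heatmap)) for x in range(len(heatmap[0])))
    return max(cells, key=lambda p: heatmap[p[0]][p[1]])

# Toy data for one joint: two neighbours agree on (1, 2), one noisy frame votes for (0, 0).
neighbours = [
    [[0.1, 0.0, 0.0], [0.0, 0.0, 0.8]],
    [[0.0, 0.0, 0.0], [0.0, 0.1, 0.9]],
    [[0.9, 0.0, 0.0], [0.0, 0.0, 0.3]],
]
fused = aggregate_heatmaps(neighbours)
joint = peak_location(fused)
```

In this toy case the fused peak lands on the location supported by most frames, which is the intuition behind using neighboring frames to stabilize a single-frame prediction.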

Technical Contribution

The PoseWarper architecture consists of a backbone CNN, specifically a High-Resolution Network (HRNet-W48), combined with deformable convolutions for warping. The design is simple and efficient: dilated convolutions capture motion cues at several spatial scales, the network predicts motion offsets from those cues, and the deformable convolutions apply the offsets to warp pose heatmaps from one frame to another. The task is framed as a learning problem in which the network must spatially align features from one frame to another, across time, to estimate pose accurately.
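The core operation behind offset-based warping can be sketched in a few lines. The example below is a deliberate simplification: it reads each output location from a shifted, bilinearly interpolated position in Frame B's heatmap, which corresponds to a deformable convolution with a single kernel tap and no learned weights. The offsets are set by hand here, whereas the network predicts them; `bilinear_sample` and `warp_heatmap` are hypothetical helper names.

```python
import math

def bilinear_sample(heatmap, y, x):
    """Read `heatmap` at a fractional (y, x); positions outside the map count as 0."""
    h, w = len(heatmap), len(heatmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    # Blend the four integer neighbours, weighted by proximity.
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                value += wy * wx * heatmap[yy][xx]
    return value

def warp_heatmap(heatmap_b, offsets):
    """Warp Frame B's heatmap toward Frame A.

    offsets[y][x] = (dy, dx) gives the displacement from A's (y, x) to the
    position in B whose value should be read.
    """
    h, w = len(heatmap_b), len(heatmap_b[0])
    return [[bilinear_sample(heatmap_b, y + offsets[y][x][0], x + offsets[y][x][1])
             for x in range(w)] for y in range(h)]

# Toy example: a joint at (1, 1) in Frame B sits one pixel further right in Frame A.
heatmap_b = [[0.0, 0.0, 0.0, 0.0],
             [0.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0]]
offsets = [[(0.0, -1.0)] * 4 for _ in range(3)]  # every pixel reads from its left neighbour
warped = warp_heatmap(heatmap_b, offsets)

# A half-pixel offset spreads the peak over two neighbouring cells.
half = warp_heatmap(heatmap_b, [[(0.0, -0.5)] * 4 for _ in range(3)])
```

Because the sampling is bilinear, the output varies smoothly with the offsets, which is what lets offsets be learned by gradient descent in a real deformable convolution layer.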

Practical and Theoretical Implications

From a practical perspective, PoseWarper addresses a significant bottleneck in video-based pose estimation: it reduces the need for labor-intensive frame-by-frame annotation, which makes pose estimation systems more feasible to deploy in real-world applications. The methodology also fits the broader trend toward systems that depend less on labeled data. Theoretically, the work opens avenues for exploring self-supervised learning objectives that further reduce annotation requirements and for extending temporal information aggregation in machine learning models.

Future Directions

Future work building on PoseWarper may focus on improving label propagation between frames separated by large temporal gaps and on incorporating more robust self-supervised learning paradigms to reduce reliance on labeled frame pairs. Leveraging sparsely labeled data is a promising direction for other applications where data is abundant but labels are scarce. Additionally, extending the temporal aggregation of PoseWarper to handle heavier occlusions or appearance changes over longer intervals could improve its robustness and applicability across diverse video datasets.

To conclude, PoseWarper is a significant step forward in temporal pose estimation: its compact and effective design uses learned spatio-temporal deformations to advance the state of the art in video pose detection.
