3D human pose estimation in video with temporal convolutions and semi-supervised training (1811.11742v2)

Published 28 Nov 2018 in cs.CV

Abstract: In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce back-projection, a simple and effective semi-supervised training method that leverages unlabeled video data. We start with predicted 2D keypoints for unlabeled video, then estimate 3D poses and finally back-project to the input 2D keypoints. In the supervised setting, our fully-convolutional model outperforms the previous best result from the literature by 6 mm mean per-joint position error on Human3.6M, corresponding to an error reduction of 11%, and the model also shows significant improvements on HumanEva-I. Moreover, experiments with back-projection show that it comfortably outperforms previous state-of-the-art results in semi-supervised settings where labeled data is scarce. Code and models are available at https://github.com/facebookresearch/VideoPose3D

Citations (934)

Summary

  • The paper introduces a temporal convolutional model that processes up to 243-frame sequences to improve 3D human pose estimation accuracy.
  • It employs a semi-supervised back-projection method with a bone length consistency term to leverage unlabeled data and ensure anatomically plausible results.
  • Results on Human3.6M show an 11% reduction in MPJPE, underlining the method's effectiveness in dynamic action scenarios.

3D Human Pose Estimation in Video with Temporal Convolutions and Semi-Supervised Training

In "3D Human Pose Estimation in Video with Temporal Convolutions and Semi-Supervised Training", Pavllo et al. present a method for accurate 3D human pose estimation from video. This work builds on the established two-step approach of detecting 2D keypoints followed by lifting them to 3D, while leveraging the capabilities of fully convolutional networks (FCNs) for temporal modeling and introducing a novel semi-supervised training method known as back-projection.

Methodology

The approach centers on a temporal convolutional model. Temporal convolutions capture dependencies over time in keypoint sequences and, unlike recurrent neural networks (RNNs), can be parallelized across the time dimension, which lowers training cost and gives precise control over the temporal receptive field. Dilated convolutions let the receptive field grow exponentially with depth while the number of parameters grows only linearly, so long-range dependencies are captured efficiently. The architecture processes 2D keypoint sequences of up to 243 frames, providing extensive temporal context while remaining computationally efficient.
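To make the architecture concrete, the following is a minimal sketch of such a dilated temporal convolutional lifter in PyTorch. It is an illustration under assumptions rather than the authors' released implementation: the joint count (17, the Human3.6M convention), channel width, and dilation schedule (1, 3, 9, 27, 81, giving a 243-frame receptive field) are chosen to match the description above.

```python
# Minimal sketch of a dilated temporal convolutional lifter: 2D keypoints in,
# 3D poses out. Illustrative, not the authors' released implementation.
import torch
import torch.nn as nn

class TemporalLifter(nn.Module):
    def __init__(self, num_joints=17, channels=1024, dilations=(1, 3, 9, 27, 81)):
        super().__init__()
        self.dilations = dilations
        # Expand (x, y) per joint into a wide per-frame feature vector.
        self.expand = nn.Sequential(
            nn.Conv1d(num_joints * 2, channels, kernel_size=3, dilation=dilations[0]),
            nn.BatchNorm1d(channels), nn.ReLU(), nn.Dropout(0.25),
        )
        # Residual blocks whose kernel-3 convolutions use exponentially growing
        # dilation, so the temporal receptive field triples with each block.
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, dilation=d),
                nn.BatchNorm1d(channels), nn.ReLU(), nn.Dropout(0.25),
                nn.Conv1d(channels, channels, kernel_size=1),
                nn.BatchNorm1d(channels), nn.ReLU(), nn.Dropout(0.25),
            )
            for d in dilations[1:]
        ])
        self.head = nn.Conv1d(channels, num_joints * 3, kernel_size=1)

    def forward(self, kpts_2d):
        # kpts_2d: (batch, frames, joints, 2); convolutions run over the frame axis.
        b, t, j, _ = kpts_2d.shape
        x = kpts_2d.reshape(b, t, j * 2).permute(0, 2, 1)
        x = self.expand(x)  # valid (unpadded) convolution shrinks the frame axis
        for d, block in zip(self.dilations[1:], self.blocks):
            x = x[:, :, d:-d] + block(x)  # crop the skip to match the valid conv
        out = self.head(x)  # (batch, joints * 3, remaining frames)
        return out.permute(0, 2, 1).reshape(b, -1, j, 3)

# With a 243-frame window the receptive field collapses to a single 3D pose:
# TemporalLifter()(torch.randn(1, 243, 17, 2)).shape == (1, 1, 17, 3)
```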

The semi-supervised method, back-projection, enhances the model's performance in scenarios where labeled 3D data is limited. The technique uses unlabeled video data to create an auto-encoding problem where the 3D poses estimated from 2D keypoints are projected back to 2D and compared with the original input. A critical component of this method is a bone length consistency term, ensuring that the predicted 3D poses remain anatomically plausible during training.
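A rough sketch of this semi-supervised objective, again in PyTorch, is shown below. The names are placeholders rather than the paper's API: `model` stands for any 2D-to-3D lifter that returns one pose per input frame, `project_to_2d` for a perspective projection with known camera intrinsics (and a predicted root trajectory), and `BONES` for a list of parent-child joint pairs; the loss weighting and exact reprojection metric in the paper differ from this simplification.

```python
# Sketch of the back-projection objective on an unlabeled batch (illustrative
# simplification; `model`, `project_to_2d`, `camera`, and BONES are placeholders).
import torch

BONES = [(0, 1), (1, 2), (2, 3)]  # illustrative subset of the skeleton

def bone_lengths(pose_3d):
    # pose_3d: (batch, frames, joints, 3) -> (batch, frames, num_bones)
    return torch.stack(
        [(pose_3d[..., a, :] - pose_3d[..., b, :]).norm(dim=-1) for a, b in BONES],
        dim=-1,
    )

def back_projection_loss(model, project_to_2d, kpts_2d, camera, ref_bone_lengths):
    """Auto-encoding loss: 2D keypoints -> 3D poses -> re-projected 2D keypoints."""
    pred_3d = model(kpts_2d)                    # lift detector output to 3D
    reproj_2d = project_to_2d(pred_3d, camera)  # project back to image space
    reproj_loss = (reproj_2d - kpts_2d).abs().mean()
    # Bone-length consistency: mean predicted bone lengths should stay close to
    # reference lengths estimated from the labeled data (ref_bone_lengths).
    bone_loss = (bone_lengths(pred_3d).mean(dim=(0, 1)) - ref_bone_lengths).abs().mean()
    return reproj_loss + bone_loss
```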

Results

On the benchmark Human3.6M dataset, the fully convolutional model significantly improves performance over existing state-of-the-art methods, achieving a mean per-joint position error (MPJPE) of 46.8 mm under Protocol 1, which translates to a reduction of 6 mm (11%) compared to the best previous approach. This architecture excels particularly in dynamic actions, underscoring the advantages of temporal convolutions in handling sequential data.
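For reference, MPJPE is the mean Euclidean distance between predicted and ground-truth joints after aligning the root joints (Protocol 1 applies no further rigid alignment). A minimal helper with assumed tensor shapes, not tied to any particular codebase:

```python
# MPJPE: mean per-joint position error after root alignment (Protocol 1 style).
import torch

def mpjpe(pred, target, root_joint=0):
    # pred, target: (batch, frames, joints, 3), typically in millimetres
    pred = pred - pred[..., root_joint:root_joint + 1, :]
    target = target - target[..., root_joint:root_joint + 1, :]
    return (pred - target).norm(dim=-1).mean()
```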

The method also shows substantial improvements under semi-supervised settings, where the amount of labeled data is deliberately reduced. When labeled data is scarce, back-projection outperforms a strong supervised baseline, with improvements of up to 15 mm in MPJPE. These results demonstrate the method's robustness and ability to generalize in low-resource scenarios.

Across different 2D keypoint detectors, the model shows consistent improvements, with the best performance achieved using fine-tuned Mask R-CNN or Cascaded Pyramid Network (CPN) detectors rather than a fine-tuned stacked hourglass network, underscoring the critical role of accurate 2D keypoint estimation in driving 3D pose accuracy.

Implications and Future Directions

This research provides new insights into the efficacy of temporal convolutions for 3D pose estimation, highlighting their efficiency and precision compared to RNNs. The integration of a semi-supervised training strategy also addresses a critical challenge in AI and machine learning: the dependency on large quantities of labeled data. By leveraging back-projection and bone length consistency, the model ensures accurate and plausible 3D pose estimations even from minimal labeled datasets.

A potential future direction involves further integrating geometric constraints and leveraging multi-view video data to enhance the robustness and accuracy of the model in more complex environments. Additionally, the method's adaptability could be explored across different domains such as sports analytics, behavior analysis, and AR/VR applications, where real-time and accurate human pose estimation is critical.

In summary, the work by Pavllo et al. demonstrates significant advancements in 3D human pose estimation through the use of temporal convolutions and a practical semi-supervised training approach. These innovations not only set new benchmarks on standard datasets but also promise broader applicability in diverse and resource-constrained real-world scenarios. The presented methods reflect substantial progress in the translation of sequential data into coherent and realistic 3D pose estimations, laying the groundwork for further research and development in this domain.
