
Dual networks based 3D Multi-Person Pose Estimation from Monocular Video (2205.00748v3)

Published 2 May 2022 in cs.CV

Abstract: Monocular 3D human pose estimation has made progress in recent years. Most of the methods focus on single persons, which estimate the poses in the person-centric coordinates, i.e., the coordinates based on the center of the target person. Hence, these methods are inapplicable for multi-person 3D pose estimation, where the absolute coordinates (e.g., the camera coordinates) are required. Moreover, multi-person pose estimation is more challenging than single pose estimation, due to inter-person occlusion and close human interactions. Existing top-down multi-person methods rely on human detection (i.e., top-down approach), and thus suffer from the detection errors and cannot produce reliable pose estimation in multi-person scenes. Meanwhile, existing bottom-up methods that do not use human detection are not affected by detection errors, but since they process all persons in a scene at once, they are prone to errors, particularly for persons in small scales. To address all these challenges, we propose the integration of top-down and bottom-up approaches to exploit their strengths. Our top-down network estimates human joints from all persons instead of one in an image patch, making it robust to possible erroneous bounding boxes. Our bottom-up network incorporates human-detection based normalized heatmaps, allowing the network to be more robust in handling scale variations. Finally, the estimated 3D poses from the top-down and bottom-up networks are fed into our integration network for final 3D poses. To address the common gaps between training and testing data, we do optimization during the test time, by refining the estimated 3D human poses using high-order temporal constraint, re-projection loss, and bone length regularizations. Our evaluations demonstrate the effectiveness of the proposed method. Code and models are available: https://github.com/3dpose/3D-Multi-Person-Pose.

Citations (20)

Summary

  • The paper introduces a dual-network framework that integrates top-down and bottom-up methods to robustly estimate 3D poses from monocular video.
  • The approach employs high-resolution joint detection, normalized heatmaps, and test-time optimization to overcome occlusions and scale variations.
  • Evaluations on datasets like MuPoTS-3D and Human3.6M demonstrate significant accuracy improvements over existing state-of-the-art methods.

Dual Networks Based 3D Multi-Person Pose Estimation from Monocular Video

The paper "Dual networks based 3D Multi-Person Pose Estimation from Monocular Video" by Yu Cheng, Bo Wang, and Robby T. Tan presents a comprehensive framework for addressing the challenges of 3D multi-person pose estimation from monocular videos. Distinctively, the work integrates both the top-down and bottom-up approaches to leverage their strengths while mitigating their individual weaknesses. This has significant implications for real-world applications where precise human pose estimation is invaluable, such as surveillance, human-computer interaction, and sports analysis.

Problem Context

The problem of estimating 3D poses from monocular video is non-trivial, especially in multi-person scenarios due to potential inter-person occlusions and variations in scale. Traditional top-down approaches that rely on human detection suffer when detection errors occur, while bottom-up approaches are susceptible to inaccuracies when the subjects are at small scales. This paper addresses these challenges through an integrated dual-network design, a strategic test-time optimization, and a semi-supervised learning paradigm to cope with the scarcity of labeled 3D data.

Methodology

The proposed solution is a hybrid framework combining top-down and bottom-up methodologies through distinct networks. The top-down network estimates joints for all persons within an image patch rather than a single target, which compensates for inaccurate bounding boxes, and employs high-resolution representations for joint detection. The bottom-up network processes the full image to capture global context and uses human-detection-based normalized heatmaps to handle scale variations more robustly. The 3D poses estimated by the two networks are then combined by an integration network to produce the final, robust 3D pose estimates.
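To make the fusion idea concrete, here is a minimal NumPy sketch of one plausible way to combine per-joint estimates from two branches. The paper itself uses a learned integration network; the confidence-weighted averaging below, and all names (`fuse_poses`, `conf_td`, etc.), are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fuse_poses(pose_td, conf_td, pose_bu, conf_bu, eps=1e-8):
    """Fuse top-down and bottom-up 3D pose estimates joint by joint.

    pose_td, pose_bu: (J, 3) arrays of 3D joint coordinates.
    conf_td, conf_bu: (J,) per-joint confidence scores.
    Returns a (J, 3) confidence-weighted average of the two estimates.
    """
    w_td = conf_td[:, None]
    w_bu = conf_bu[:, None]
    return (w_td * pose_td + w_bu * pose_bu) / (w_td + w_bu + eps)

# Toy example: the top-down branch is trusted on joint 0,
# the bottom-up branch on joint 1.
pose_td = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]])
pose_bu = np.array([[3.0, 0.0, 2.0], [0.0, 3.0, 2.0]])
conf_td = np.array([1.0, 0.0])
conf_bu = np.array([0.0, 1.0])
fused = fuse_poses(pose_td, conf_td, pose_bu, conf_bu)
```

A learned integration network generalizes this: instead of a fixed weighted average, it predicts the fused pose from both estimates (and their confidences) end to end.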

Moreover, the paper introduces a test-time optimization technique that refines the estimated poses, reducing the gap between training and testing data. This is achieved through high-order temporal constraints, re-projection losses, and bone length regularization, all of which improve the model's generalization to unseen data. Additionally, a dual-person pose discriminator is employed to ensure plausible interactions between closely interacting persons, adding another layer of robustness to the results.
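The three test-time objectives can be sketched as simple NumPy losses. This is a minimal illustration under assumed conventions (camera-space joints, a pinhole camera with focal length `f` and principal point `c`, second-order finite differences for the temporal term); the paper's exact formulations and weights may differ.

```python
import numpy as np

def reprojection_loss(pose_3d, pose_2d, f, c):
    """Project (J, 3) camera-space joints with a pinhole model and
    compare against observed (J, 2) image-space joints."""
    proj = f * pose_3d[:, :2] / pose_3d[:, 2:3] + c
    return np.mean(np.sum((proj - pose_2d) ** 2, axis=1))

def bone_length_loss(pose_3d, bones, ref_lengths):
    """Penalize deviation of bone lengths from reference skeleton lengths.
    bones: list of (parent, child) joint index pairs."""
    lengths = np.array([np.linalg.norm(pose_3d[i] - pose_3d[j])
                        for i, j in bones])
    return np.mean((lengths - ref_lengths) ** 2)

def temporal_loss(poses, order=2):
    """High-order temporal smoothness: mean squared n-th finite
    difference over a (T, J, 3) pose sequence."""
    d = np.diff(poses, n=order, axis=0)
    return np.mean(d ** 2)

# Toy check: a static 2-joint pose, repeated over 5 frames.
pose = np.array([[0.0, 0.0, 2.0], [0.0, -0.5, 2.0]])
seq = np.tile(pose, (5, 1, 1))
smooth = temporal_loss(seq)                        # static sequence -> 0
bl = bone_length_loss(pose, [(0, 1)], np.array([0.5]))  # correct length -> 0
pose_2d = np.array([[500.0, 500.0], [500.0, 250.0]])
rp = reprojection_loss(pose, pose_2d, 1000.0, np.array([500.0, 500.0]))
```

At test time, a weighted sum of these terms would be minimized with respect to the estimated 3D joints, e.g. by gradient descent over a short temporal window.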

Results

The paper provides rigorous evaluations on datasets including MuPoTS-3D, JTA, Human3.6M, and 3DPW. The results demonstrate superior 3D pose estimation accuracy over existing state-of-the-art methods. In particular, the proposed system achieves substantial improvements in scenarios involving occlusions and scale variations, a testament to the effectiveness of the integrated approach. Metrics such as PCK, PCK_abs, and F1 scores show significant advances, with the dual-network framework outperforming both top-down and bottom-up methods individually.
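For reference, the PCK metric cited above measures the fraction of predicted joints that fall within a distance threshold of the ground truth; for 3D evaluation on MuPoTS-3D the conventional threshold is 150 mm. A minimal NumPy sketch (the function name and example data are illustrative):

```python
import numpy as np

def pck(pred, gt, threshold=0.15):
    """Percentage of Correct Keypoints: fraction of predicted joints
    whose Euclidean distance to ground truth is below `threshold`
    (here in meters, i.e. 0.15 = 150 mm for 3D PCK)."""
    dist = np.linalg.norm(pred - gt, axis=-1)
    return np.mean(dist < threshold)

# Toy example: joints 0 and 1 are within 150 mm, joint 2 is not.
pred = np.array([[0.0, 0.0, 2.0], [0.0, 1.0, 2.0], [1.0, 0.0, 2.0]])
gt   = np.array([[0.0, 0.1, 2.0], [0.0, 1.0, 2.0], [1.0, 0.5, 2.0]])
score = pck(pred, gt)
```

PCK_abs applies the same criterion in absolute camera coordinates, so it additionally penalizes errors in the estimated root depth.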

Implications and Future Work

The implications of this work are impactful for advancing the field of multi-person 3D pose estimation, particularly in simplifying the processing requirements for monocular video data. Practically, the adopted methodology can enhance applications ranging from motion capture in sports to improving interaction mechanisms in augmented and virtual reality platforms.

Theoretically, this work motivates future exploration of hybrid frameworks that adapt dynamically to the application context, making single-camera setups a more viable alternative to multi-camera systems. Future developments could improve computational efficiency and robustness to partial occlusions, possibly through novel data augmentation techniques and further refinement of the discriminator for natural interactions in diverse environments.

Overall, the systematic incorporation of dual networks, with validated improvements on realistic benchmarks, significantly advances the state of multi-person pose estimation in computer vision, setting a firm foundation for both theoretical extensions and practical implementations.