Multi-Person 3D Human Pose Estimation from Monocular Images (1909.10854v1)

Published 24 Sep 2019 in cs.CV

Abstract: Multi-person 3D human pose estimation from a single image is a challenging problem, especially for in-the-wild settings due to the lack of 3D annotated data. We propose HG-RCNN, a Mask-RCNN based network that also leverages the benefits of the Hourglass architecture for multi-person 3D Human Pose Estimation. A two-staged approach is presented that first estimates the 2D keypoints in every Region of Interest (RoI) and then lifts the estimated keypoints to 3D. Finally, the estimated 3D poses are placed in camera-coordinates using weak-perspective projection assumption and joint optimization of focal length and root translations. The result is a simple and modular network for multi-person 3D human pose estimation that does not require any multi-person 3D pose dataset. Despite its simple formulation, HG-RCNN achieves the state-of-the-art results on MuPoTS-3D while also approximating the 3D pose in the camera-coordinate system.

Summary

The paper "Multi-Person 3D Human Pose Estimation from Monocular Images" addresses the problem of predicting 3D human poses for multiple people from a single camera image, particularly in unconstrained (in-the-wild) settings. The proposed method, HG-RCNN, builds on the Mask-RCNN framework and incorporates elements of the Hourglass architecture. It does not require any multi-person 3D pose dataset, which sidesteps the scarcity of 3D annotated data for real-world scenes.

Key Contributions

Modular Architecture: The HG-RCNN architecture employs a two-stage process suitable for multi-person 3D pose estimation. Initially, 2D keypoints are predicted within Regions of Interest (RoI) detected by Mask-RCNN. These keypoints are then lifted into 3D space through a subsequent model. This modular approach allows the use of rich multi-person 2D pose datasets and single-person 3D pose datasets, circumventing the need for direct multi-person 3D annotated data.
Weak-Perspective Projection: The paper places the estimated 3D poses in camera coordinates under a weak-perspective projection assumption. The focal length and the root translations are optimized jointly, so that every person in the image is positioned in the same camera-coordinate system.
State-of-the-Art Results: Despite its simple formulation, HG-RCNN achieves state-of-the-art results on the MuPoTS-3D dataset while also approximating the 3D pose in the camera-coordinate system.
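
As a rough illustration of the weak-perspective placement step, the sketch below recovers a root translation for one person from a root-relative 3D pose and its 2D keypoints. It is a minimal closed-form approximation, not the paper's procedure: the focal length and principal point are assumed known here, whereas the paper jointly optimizes focal length and root translations. The function name `place_in_camera` and all numeric values are hypothetical.

```python
import numpy as np

def place_in_camera(pose3d_rel, kpts2d, focal, principal_point):
    """Estimate a root translation (tx, ty, tz) under weak perspective.

    pose3d_rel      : (J, 3) root-relative 3D joints, e.g. in mm
    kpts2d          : (J, 2) 2D keypoints in pixels
    focal           : focal length in pixels (assumed known in this sketch)
    principal_point : (2,) principal point in pixels (assumed known)
    """
    xy = np.asarray(pose3d_rel, dtype=float)[:, :2]
    uv = np.asarray(kpts2d, dtype=float) - np.asarray(principal_point, dtype=float)

    xy_mean = xy.mean(axis=0)
    uv_mean = uv.mean(axis=0)

    # Weak perspective: every joint shares one depth tz, so the image is a
    # scaled copy of the pose's x/y coordinates with scale focal / tz.
    spread3d = np.sqrt(((xy - xy_mean) ** 2).sum())
    spread2d = np.sqrt(((uv - uv_mean) ** 2).sum())
    tz = focal * spread3d / spread2d

    # Back-project the 2D centroid to depth tz to get the x/y translation.
    txy = uv_mean * tz / focal - xy_mean
    return np.array([txy[0], txy[1], tz])

# Synthetic check with a planar pose, for which weak perspective is exact.
pose = np.array([[0, 0, 0], [200, 0, 0], [0, 400, 0], [-200, 100, 0]], dtype=float)
true_t = np.array([100.0, -50.0, 3000.0])
f, c = 1000.0, np.array([500.0, 500.0])
uv = f * (pose[:, :2] + true_t[:2]) / true_t[2] + c
est_t = place_in_camera(pose, uv, f, c)
```

For poses with real depth variation the recovered translation is only approximate, which is one reason to refine it (and the focal length) by optimization rather than rely on a closed form.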

Numerical Results

MuPoTS-3D Benchmark: HG-RCNN achieves 71.3% 3D PCK, surpassing previous methods. Particularly notable is its handling of occlusion and clutter in real-world scenes, with improvements on sequences with substantial occlusion (e.g., TS18 and TS19).
Human3.6M Dataset: The method yields competitive results on the Human3.6M dataset, achieving 65.2 mm MPJPE, which suggests that it generalizes well to standard pose benchmarks without requiring task-specific fine-tuning.
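
The two metrics quoted above can be sketched as follows. MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and 3D PCK is the percentage of joints whose error falls below a distance threshold. The 150 mm default is the threshold commonly used with this benchmark and is an assumption here, since the summary does not state it. The function names and sample values are illustrative, and any root alignment is assumed to have been applied beforehand.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error, in the units of the inputs (e.g. mm)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pck3d(pred, gt, threshold=150.0):
    """Percentage of joints whose 3D error is below `threshold`."""
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float((dist < threshold).mean() * 100.0)

# Toy example: four joints with errors of 50, 100, 200 and 0 mm.
gt = np.zeros((4, 3))
pred = np.array([[30.0, 40.0, 0.0],
                 [0.0, 0.0, 100.0],
                 [0.0, 200.0, 0.0],
                 [0.0, 0.0, 0.0]])
err = mpjpe(pred, gt)   # 87.5
pck = pck3d(pred, gt)   # 75.0
```

Note that the two metrics move in opposite directions: lower MPJPE is better, while higher 3D PCK is better.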

Implications and Future Directions

This work narrows the gap between research on multi-person 3D pose estimation and its practical applications. Potential application areas include augmented reality (AR), virtual reality (VR), human-computer interaction (HCI), sports analytics, and surveillance.

Future research can enhance the structural constraints and bounding box consistency employed, addressing error sources such as unseen poses and inter-person occlusions for improved accuracy. Additionally, integrating temporal cues could further refine pose estimation in dynamic scenarios, paving the way for comprehensive human motion analysis systems.

HG-RCNN also offers a starting point for work on real-time scene parsing and human-action recognition pipelines, where modular and efficient methods are well suited to complex human-centric environments.
