Multi-person 3D Pose Estimation in Crowded Scenes Based on Multi-View Geometry (2007.10986v1)

Published 21 Jul 2020 in cs.CV

Abstract: Epipolar constraints are at the core of feature matching and depth estimation in current multi-person multi-camera 3D human pose estimation methods. Despite the satisfactory performance of this formulation in sparser crowd scenes, its effectiveness is frequently challenged under denser crowd circumstances mainly due to two sources of ambiguity. The first is the mismatch of human joints resulting from the simple cues provided by the Euclidean distances between joints and epipolar lines. The second is the lack of robustness from the naive formulation of the problem as a least squares minimization. In this paper, we depart from the multi-person 3D pose estimation formulation, and instead reformulate it as crowd pose estimation. Our method consists of two key components: a graph model for fast cross-view matching, and a maximum a posteriori (MAP) estimator for the reconstruction of the 3D human poses. We demonstrate the effectiveness and superiority of our proposed method on four benchmark datasets.


Summary

The paper presents a novel two-stage approach combining graph-based cross-view feet matching with MAP optimization to resolve joint ambiguities.
The paper validates its method on four benchmark datasets, achieving a mean per joint position error of 50.0 mm on the CMU Panoptic "Ultimatum" sequences.
The paper argues that more accurate pose estimation in densely crowded environments could benefit real-time surveillance and autonomous systems.

Multi-person 3D Pose Estimation in Crowded Scenes Based on Multi-View Geometry

The paper "Multi-person 3D Pose Estimation in Crowded Scenes Based on Multi-View Geometry" addresses the task of estimating 3D human poses in densely populated scenes observed by multiple cameras. Prevailing methods built on epipolar constraints break down in such crowds for two reasons: joints are mismatched across views because the Euclidean distance between a joint and an epipolar line is too weak a cue, and the usual least-squares formulation of the reconstruction is not robust. The work therefore reformulates the problem as "crowd pose estimation" and proposes a matching and reconstruction pipeline tailored to it.

Key Methodological Contributions

This paper introduces a two-fold approach consisting of: (1) a graph-based model for rapid cross-view feet matching and (2) a Maximum a Posteriori (MAP) estimator to reconstruct 3D human poses effectively.

Graph Model for Cross-View Matching:
- The authors prioritize matching feet across camera views. Because feet typically stay in contact with the ground, their positions can be compared on the ground plane rather than through the ambiguous epipolar constraints used for other joints. A homography rectifies the ground plane between views, which simplifies the matching.
- The matching problem is cast as a binary linear program and solved with the Jonker-Volgenant algorithm, with costs based on feet location, stride size, and stride direction. This gives an efficient and robust way to match individuals in dense crowds.
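
The matching step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the cost weights `w_len` and `w_dir`, the threshold `max_cost`, and the convention that each person is given as a (left foot, right foot) pair are all assumptions made for the example. It relies on `scipy.optimize.linear_sum_assignment`, which SciPy documents as a modified Jonker-Volgenant solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def to_ground(H, pts):
    """Map (N, 2) pixel points to ground-plane coordinates with a 3x3 homography H."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]


def match_feet(H_a, feet_a, H_b, feet_b, w_len=1.0, w_dir=1.0, max_cost=0.5):
    """Match people between two views from their feet.

    feet_a, feet_b: (N, 2, 2) arrays holding (left foot, right foot) pixel
    positions per person (hypothetical layout chosen for this sketch).
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    ga = to_ground(H_a, feet_a.reshape(-1, 2)).reshape(-1, 2, 2)
    gb = to_ground(H_b, feet_b.reshape(-1, 2)).reshape(-1, 2, 2)

    # Feet location: midpoint of the two feet on the ground plane.
    pos_a, pos_b = ga.mean(axis=1), gb.mean(axis=1)
    # Stride: vector from left foot to right foot.
    stride_a, stride_b = ga[:, 1] - ga[:, 0], gb[:, 1] - gb[:, 0]
    len_a = np.linalg.norm(stride_a, axis=1)
    len_b = np.linalg.norm(stride_b, axis=1)

    d_pos = np.linalg.norm(pos_a[:, None] - pos_b[None], axis=-1)
    d_len = np.abs(len_a[:, None] - len_b[None])
    cos = (stride_a @ stride_b.T) / (len_a[:, None] * len_b[None] + 1e-9)
    d_dir = 1.0 - cos

    cost = d_pos + w_len * d_len + w_dir * d_dir
    rows, cols = linear_sum_assignment(cost)  # one-to-one assignment
    keep = cost[rows, cols] < max_cost        # drop implausible pairs
    return [(int(r), int(c)) for r, c in zip(rows[keep], cols[keep])]
```

The threshold on the assigned cost is one simple way to leave a person unmatched when they are visible in only one view; the paper may handle this case differently.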

MAP Estimation for 3D Pose Reconstruction:
- To make triangulation more robust, the paper adopts a MAP optimization framework. The formulation accounts for the uncertainty in 2D joint detection and incorporates a bone-length prior to regularize the pose estimate.
- The refinement is iterative: it starts from a plain triangulation as initialization and then applies a trust region method to refine the MAP estimate, which improves accuracy in ambiguous scenes.
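
One way to picture this two-step reconstruction is the sketch below. It assumes Gaussian noise on the 2D detections and a Gaussian bone-length prior, so that the MAP estimate reduces to a weighted nonlinear least-squares problem; the paper's exact likelihood and prior may differ. The names `triangulate_dlt`, `project`, and `map_refine`, the noise scales `sigma_px` and `sigma_bone`, and the use of detection confidence as a residual weight are assumptions for illustration. The refinement uses SciPy's Trust Region Reflective solver (`method="trf"`).

```python
import numpy as np
from scipy.optimize import least_squares


def triangulate_dlt(P_list, x_list):
    """Linear (DLT) triangulation of one joint from two or more views."""
    A = []
    for P, (u, v) in zip(P_list, x_list):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    return X[:3] / X[3]


def project(P, X):
    """Project (J, 3) points with a 3x4 camera matrix P to (J, 2) pixels."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = Xh @ P.T
    return x[:, :2] / x[:, 2:3]


def map_refine(P_list, obs, conf, bones, bone_len, sigma_px=2.0, sigma_bone=0.02):
    """Refine a pose by minimizing the negative log posterior.

    obs: (V, J, 2) detections, conf: (V, J) confidences,
    bones: list of (i, j) joint index pairs, bone_len: (B,) expected lengths.
    """
    V, J, _ = obs.shape
    # Initialization: plain triangulation, joint by joint.
    X0 = np.array([triangulate_dlt(P_list, obs[:, j]) for j in range(J)])
    bi, bj = np.array(bones).T

    def residuals(flat):
        X = flat.reshape(J, 3)
        # Likelihood term: confidence-weighted reprojection error per view.
        r_img = [
            (conf[k][:, None] * (project(P_list[k], X) - obs[k]) / sigma_px).ravel()
            for k in range(V)
        ]
        # Prior term: deviation of each bone from its expected length.
        r_bone = (np.linalg.norm(X[bi] - X[bj], axis=1) - bone_len) / sigma_bone
        return np.concatenate(r_img + [r_bone])

    sol = least_squares(residuals, X0.ravel(), method="trf")
    return sol.x.reshape(J, 3)
```

With noisy or partly occluded detections, lowering the confidence of an unreliable joint reduces its pull on the solution, and the bone-length term keeps the skeleton plausible where the image evidence is weak.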

Experimental Evaluation

The proposed method is evaluated on four benchmark datasets: LOEWENPLATZ, Chariot Mk I, Wildtrack, and the CMU Panoptic Dataset. It is reported to outperform state-of-the-art algorithms in accuracy and precision, especially in scenes with heavy occlusion and dense crowding. Notably, the method achieves a mean per joint position error (MPJPE) of 50.0 mm on the CMU Panoptic Dataset's "Ultimatum" sequences.
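
For reference, MPJPE is the average Euclidean distance between predicted and ground-truth joint positions. A minimal sketch, assuming both poses are given as arrays of shape (..., J, 3) in the same units and coordinate frame (the function name is illustrative, and any alignment step used by a particular benchmark is not shown):

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per joint position error between arrays of shape (..., J, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```

If the inputs are in millimetres, the result is directly comparable to figures such as the 50.0 mm quoted above.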

Implications and Future Directions

Recasting multi-person 3D pose estimation as "crowd pose estimation" distinguishes this work and shifts the emphasis to constraints tailored to crowded scenes. Beyond the gains in computational efficiency and robustness, the approach lays a foundation for real-time applications in surveillance, autonomous driving, and interactive systems that monitor dense human activity.

Future research may explore integrating temporal information for dynamic scenes and adapting the method to uncalibrated camera systems for more flexible deployment. Better confidence estimates for detected keypoints could further refine the MAP optimization and make 3D pose estimation more robust in varied and challenging environments.
