Simultaneously Recovering Multi-Person Meshes and Multi-View Cameras with Human Semantics (2412.18785v1)

Published 25 Dec 2024 in cs.CV

Abstract: Dynamic multi-person mesh recovery has broad applications in sports broadcasting, virtual reality, and video games. However, current multi-view frameworks rely on a time-consuming camera calibration procedure. In this work, we focus on multi-person motion capture with uncalibrated cameras, which mainly faces two challenges: one is that inter-person interactions and occlusions introduce inherent ambiguities for both camera calibration and motion capture; the other is the lack of dense correspondences that could be used to constrain sparse camera geometries in a dynamic multi-person scene. Our key idea is to incorporate motion prior knowledge to simultaneously estimate camera parameters and human meshes from noisy human semantics. We first utilize human information from 2D images to initialize intrinsic and extrinsic parameters. Thus, the approach does not rely on any other calibration tools or background features. Then, a pose-geometry consistency is introduced to associate the detected humans from different views. Finally, a latent motion prior is proposed to refine the camera parameters and human motions. Experimental results show that accurate camera parameters and human motions can be obtained through a one-step reconstruction. The code is publicly available at https://github.com/boycehbz/DMMR.

Summary

  • The paper presents a framework that jointly recovers multi-person human meshes and calibrates uncalibrated camera parameters.
  • It leverages pose-geometry consistency and a latent motion prior to handle occlusions and noisy inputs in multi-view data.
  • The method achieves competitive accuracy on benchmark datasets, enabling practical applications in sports broadcasting, VR, and live events.

Overview of Simultaneously Recovering Multi-Person Meshes and Multi-View Cameras with Human Semantics

The paper addresses dynamic multi-person mesh recovery from uncalibrated multi-view video, a problem with practical applications in sports broadcasting, virtual reality, and video games. Conventional multi-view frameworks depend on pre-calibrated cameras, which limits their efficiency and applicability. This paper removes the calibration bottleneck by proposing a method that simultaneously recovers multiple human body meshes and estimates camera parameters without prior calibration.

Methodological Contributions

The authors identify two principal challenges in multi-person motion capture with uncalibrated cameras: inter-person interactions and occlusions that introduce ambiguities for both calibration and motion capture, and the absence of dense correspondences that would otherwise constrain sparse camera geometries in a dynamic scene. To tackle these, they propose a framework that incorporates motion prior knowledge and human semantics to jointly estimate camera parameters and human meshes from 2D images.

  1. Initialization and Estimation:
    • The process begins by using cues from upright standing humans in 2D images to estimate intrinsic camera parameters, avoiding conventional calibration targets such as checkerboards.
    • Initial extrinsic parameters are then obtained from per-view 3D poses, which are aligned across views based on pose similarity (see the alignment sketch after this list).
  2. Pose-Geometry Consistent Association:
    • To associate detected humans across different views, a cross-view pose-geometry consistency is introduced. It combines pose similarity with geometric constraints, enabling robust association despite occlusions and detection inaccuracies (see the association sketch after this list).
  3. Latent Motion Prior:
    • A variational autoencoder-based latent motion model is proposed to ensure temporal coherence and robustness to noisy inputs in the motion reconstruction process. This model is distinguished by its compact representation, which can be trained on short sequences and applied to longer ones.
    • A local linear constraint within the latent space ensures that the prior can effectively minimize motion artifacts such as jittering, thereby enhancing the smoothness and coherence of the output.
  4. Simultaneous Optimization:
    • Using the motion prior, a joint optimization strategy iteratively refines camera parameters and human meshes. A progressive optimization schedule handles the non-convex nature of the problem, yielding accurate mesh recovery and calibration from noisy human semantics (a simplified refinement loop is sketched after this list).
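
To make the extrinsic initialization in step 1 concrete, the sketch below shows how a relative rotation and translation between two views could be seeded by rigidly aligning the per-view 3D pose estimates of an already-associated person (a standard Kabsch/Procrustes alignment). The synthetic skeleton and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rigid_align(src, dst):
    """Kabsch alignment: find R, t minimizing ||(src @ R.T + t) - dst||^2."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_mean).T @ (dst - dst_mean)       # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t

# Toy check: the same 15-joint skeleton expressed in two camera frames.
rng = np.random.default_rng(0)
joints_view_a = rng.normal(size=(15, 3))
R_true = np.linalg.qr(rng.normal(size=(3, 3)))[0]
if np.linalg.det(R_true) < 0:                       # ensure a proper rotation
    R_true[:, 0] *= -1.0
t_true = np.array([0.3, -0.1, 4.0])
joints_view_b = joints_view_a @ R_true.T + t_true

R_est, t_est = rigid_align(joints_view_a, joints_view_b)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))  # True True
```

In the actual pipeline such an alignment only provides an initialization; noise in the per-view 3D poses is handled by the subsequent joint optimization.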
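
Step 2 can be illustrated as a two-view matching problem whose cost blends pose similarity with epipolar consistency. The sketch below assumes per-person 2D keypoints, root-centred 3D pose estimates, and a fundamental matrix F between the two views; the weights and data layout are placeholder assumptions rather than the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def epipolar_cost(kps_a, kps_b, F):
    """Mean distance of view-B keypoints to the epipolar lines induced by view A."""
    pa = np.hstack([kps_a, np.ones((len(kps_a), 1))])   # homogeneous (J, 3)
    pb = np.hstack([kps_b, np.ones((len(kps_b), 1))])
    lines_b = pa @ F.T                                   # l_b = F x_a, one line per joint
    num = np.abs(np.sum(pb * lines_b, axis=1))
    den = np.linalg.norm(lines_b[:, :2], axis=1) + 1e-8
    return float(np.mean(num / den))

def pose_cost(pose_a, pose_b):
    """Dissimilarity of two root-centred 3D poses (J, 3), scale-normalised internally."""
    a = pose_a / (np.linalg.norm(pose_a) + 1e-8)
    b = pose_b / (np.linalg.norm(pose_b) + 1e-8)
    return float(np.linalg.norm(a - b, axis=1).mean())

def associate(people_a, people_b, F, w_pose=1.0, w_geo=0.05):
    """Match detections between two views with a combined pose/geometry cost.
    people_*: lists of dicts holding 'kps2d' (J, 2) and 'pose3d' (J, 3)."""
    cost = np.zeros((len(people_a), len(people_b)))
    for i, pa in enumerate(people_a):
        for j, pb in enumerate(people_b):
            cost[i, j] = (w_pose * pose_cost(pa["pose3d"], pb["pose3d"])
                          + w_geo * epipolar_cost(pa["kps2d"], pb["kps2d"], F))
    rows, cols = linear_sum_assignment(cost)             # Hungarian matching
    return list(zip(rows, cols)), cost
```

The Hungarian step only covers a view pair; the paper's association additionally enforces consistency across all views and over time.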
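
Steps 3 and 4 can be read as optimizing low-dimensional motion codes together with camera parameters against 2D evidence, with a frozen decoder serving as the motion prior. The single-person, single-view PyTorch sketch below is a simplified stand-in: ToyMotionDecoder, the random detections, and the loss weights are hypothetical, and the real method alternates such refinement with the cross-view association over all cameras and people.

```python
import torch

class ToyMotionDecoder(torch.nn.Module):
    """Placeholder for a VAE motion decoder: latent code -> T frames of J 3D joints."""
    def __init__(self, latent_dim=32, frames=16, joints=15):
        super().__init__()
        self.net = torch.nn.Linear(latent_dim, frames * joints * 3)
        self.frames, self.joints = frames, joints

    def forward(self, z):
        return self.net(z).view(-1, self.frames, self.joints, 3)

def axis_angle_to_R(w):
    """Rotation matrix from an axis-angle vector via the matrix exponential."""
    wx, wy, wz = w[0], w[1], w[2]
    zero = torch.zeros_like(wx)
    skew = torch.stack([torch.stack([zero, -wz, wy]),
                        torch.stack([wz, zero, -wx]),
                        torch.stack([-wy, wx, zero])])
    return torch.matrix_exp(skew)

def project(points3d, K, R, t):
    """Pinhole projection of (..., 3) points with intrinsics K and extrinsics R, t."""
    cam = points3d @ R.T + t
    uv = cam[..., :2] / cam[..., 2:].clamp(min=1e-6)
    return uv @ K[:2, :2].T + K[:2, 2]

decoder = ToyMotionDecoder().eval()
for p in decoder.parameters():
    p.requires_grad_(False)                          # the prior stays frozen

T, J = decoder.frames, decoder.joints
K = torch.tensor([[1000.0, 0.0, 512.0], [0.0, 1000.0, 512.0], [0.0, 0.0, 1.0]])
obs_2d = torch.rand(T, J, 2) * 1024                  # placeholder 2D detections
conf = torch.ones(T, J, 1)                           # placeholder detection confidences

z = torch.zeros(1, 32, requires_grad=True)           # latent motion code
rvec = torch.zeros(3, requires_grad=True)            # camera rotation (axis-angle)
tvec = torch.tensor([0.0, 0.0, 5.0], requires_grad=True)  # camera translation

opt = torch.optim.Adam([z, rvec, tvec], lr=1e-2)
for step in range(200):
    opt.zero_grad()
    joints3d = decoder(z)[0]                         # (T, J, 3) decoded motion
    uv = project(joints3d, K, axis_angle_to_R(rvec), tvec)
    reproj = (conf * (uv - obs_2d) ** 2).mean()      # confidence-weighted reprojection
    prior = (z ** 2).mean()                          # stay near the prior's mode
    smooth = (joints3d[1:] - joints3d[:-1]).pow(2).mean()  # temporal smoothness
    loss = reproj + 1e-2 * prior + 1e-1 * smooth
    loss.backward()
    opt.step()
```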

Numerical Results and Validation

The effectiveness of the proposed method is demonstrated on several benchmark datasets, including Campus, Shelf, Panoptic, and MHHI. Compared with several state-of-the-art methods, the approach achieves competitive accuracy in multi-person 3D pose estimation (measured by PCP, percentage of correct parts), camera calibration, and human mesh recovery under challenging conditions such as occlusions and large-scale scenes.

Practical and Theoretical Implications

Practically, this method holds significant promise for real-world applications where pre-calibration of cameras is infeasible. Its ability to recover 3D human geometry and scene calibration concurrently opens avenues for its use in uncontrolled environments, such as in live events or situations where camera setup is not fixed.

From a theoretical perspective, this work contributes to advancing the understanding of how human semantics can be effectively leveraged for camera calibration. By bridging the gap between structure-from-motion techniques and semantic information extracted from scene objects, the paper sets a foundation for future research aimed at exploring more intricate spatiotemporal interactions in visual data.

In conclusion, while the framework substantially advances the state of the art in mesh recovery and camera calibration, further research could extend it to dynamic (moving) camera setups and incorporate more diverse motion priors to handle even more complex human motion.