CameraHMR: Aligning People with Perspective (2411.08128v1)

Published 12 Nov 2024 in cs.CV

Abstract: We address the challenge of accurate 3D human pose and shape estimation from monocular images. The key to accuracy and robustness lies in high-quality training data. Existing training datasets containing real images with pseudo ground truth (pGT) use SMPLify to fit SMPL to sparse 2D joint locations, assuming a simplified camera with default intrinsics. We make two contributions that improve pGT accuracy. First, to estimate camera intrinsics, we develop a field-of-view prediction model (HumanFoV) trained on a dataset of images containing people. We use the estimated intrinsics to enhance the 4D-Humans dataset by incorporating a full perspective camera model during SMPLify fitting. Second, 2D joints provide limited constraints on 3D body shape, resulting in average-looking bodies. To address this, we use the BEDLAM dataset to train a dense surface keypoint detector. We apply this detector to the 4D-Humans dataset and modify SMPLify to fit the detected keypoints, resulting in significantly more realistic body shapes. Finally, we upgrade the HMR2.0 architecture to include the estimated camera parameters. We iterate model training and SMPLify fitting initialized with the previously trained model. This leads to more accurate pGT and a new model, CameraHMR, with state-of-the-art accuracy. Code and pGT are available for research purposes.

Summary

The paper introduces CameraHMR, which integrates a full perspective camera model and the HumanFoV prediction to significantly improve 3D human pose and shape estimation.
The method upgrades the 4D-Humans dataset and employs dense surface keypoint detection to provide more accurate pseudo ground truth for training.
Results show state-of-the-art performance on key benchmarks, achieving superior 3D joint and body mesh accuracy across varied camera perspectives.

A Technical Summary of CameraHMR: Aligning People with Perspective

The paper "CameraHMR: Aligning People with Perspective" introduces a method for improving 3D human pose and shape (HPS) estimation from monocular images by correcting errors that stem from simplified camera models. The authors argue that the prevalent use of weak-perspective camera models in existing HPS methods is a significant source of error. To resolve this, they propose two key innovations: the incorporation of a full perspective camera model and the development of a field-of-view prediction model named HumanFoV.
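To make the distinction between the two camera models concrete, here is a minimal NumPy sketch. The function names, the scale value, and the focal length are illustrative assumptions and are not taken from the paper's code. It shows that a weak-perspective camera ignores per-point depth, while a pinhole (full perspective) camera does not.

```python
import numpy as np

def weak_perspective_project(points, scale, trans_xy):
    """Orthographic projection followed by a uniform scale and 2D translation.
    Per-point depth (the z column) is ignored entirely."""
    return scale * points[:, :2] + trans_xy

def full_perspective_project(points, focal, principal_point):
    """Pinhole projection: divide x, y by depth, then apply the focal
    length (in pixels) and add the principal point."""
    return focal * points[:, :2] / points[:, 2:3] + principal_point

# Two points with the same lateral offset but different depths
# (camera coordinates, in metres). All values are hypothetical.
pts = np.array([[0.5, 0.0, 2.0],
                [0.5, 0.0, 4.0]])
center = np.array([320.0, 240.0])

wp = weak_perspective_project(pts, scale=100.0, trans_xy=center)
fp = full_perspective_project(pts, focal=500.0, principal_point=center)
```

Under the weak-perspective model both points land on the same pixel, whereas the pinhole model places the nearer point farther from the image centre. This depth-dependent foreshortening is the effect that a full perspective camera can represent and a weak-perspective one cannot.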

Key Contributions

HumanFoV Model: The authors present HumanFoV, a regression model trained to predict the field of view (FoV) from images containing people. This model is an integral part of their proposed solution, allowing for more accurate estimation of camera intrinsics, essential for correct 3D reconstruction. By leveraging a dataset of 500,000 varied images, the HumanFoV model demonstrates robust performance across multiple human-centric benchmarks, successfully generalizing to diverse field-of-view settings.
4D-Humans Dataset Augmentation: The paper refits the 4D-Humans dataset, moving it from a weak-perspective to a full perspective camera representation. HumanFoV supplies the missing camera intrinsics, which increases the fidelity of the pseudo ground truth (pGT) data used to train new HPS models.
Dense Surface Keypoint Detection: To overcome the limitation of sparse 2D joints, the authors introduce a keypoint detector trained on the BEDLAM dataset to estimate 138 dense surface keypoints. This development contributes to significant improvements in 3D shape accuracy, particularly beneficial for non-standard body shapes and poses.
CameraHMR Model: The authors upgrade the HMR2.0 architecture to include the camera intrinsics predicted by HumanFoV and train it with a full perspective camera model. The resulting model, CameraHMR, achieves state-of-the-art (SOTA) results on multiple benchmarks.
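The contributions above rely on two standard pieces of camera geometry: converting a predicted field of view into pinhole intrinsics, and measuring how well projected 3D points match detected 2D keypoints. The sketch below illustrates both under stated assumptions. It assumes a vertical FoV, square pixels, and a principal point at the image centre, and the confidence-weighted squared error is a generic SMPLify-style data term rather than the authors' exact objective. All function names are hypothetical.

```python
import math
import numpy as np

def fov_to_focal_length(fov_deg, image_size_px):
    """Focal length in pixels for a FoV spanning image_size_px:
    f = (size / 2) / tan(fov / 2)."""
    return 0.5 * image_size_px / math.tan(math.radians(fov_deg) / 2.0)

def intrinsics_from_vfov(vfov_deg, width, height):
    """Build a 3x3 pinhole matrix K from a vertical FoV.
    Assumes square pixels and a centred principal point."""
    f = fov_to_focal_length(vfov_deg, height)
    return np.array([[f, 0.0, width / 2.0],
                     [0.0, f, height / 2.0],
                     [0.0, 0.0, 1.0]])

def reprojection_loss(points_3d, keypoints_2d, confidence, K):
    """Confidence-weighted mean squared pixel error between projected
    3D points (camera coordinates) and detected 2D keypoints."""
    proj = (K @ points_3d.T).T           # homogeneous image coordinates
    proj = proj[:, :2] / proj[:, 2:3]    # perspective divide
    sq_err = np.sum((proj - keypoints_2d) ** 2, axis=1)
    return float(np.sum(confidence * sq_err) / np.sum(confidence))
```

In a fitting loop, a term like `reprojection_loss` would be minimised over body pose, shape, and translation. Using many surface keypoints instead of a sparse joint set adds constraints on body shape, which is the motivation the summary gives for the dense detector.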

Results and Implications

CameraHMR sets a new SOTA in 3D human pose accuracy, 3D shape accuracy, and 2D alignment. The authors report gains on several benchmarks, including 3DPW, EMDB, and SPEC. The model estimates 3D joint and body mesh locations more accurately than models that use default or incorrect intrinsics. Importantly, it maintains good 2D alignment even under large FoV and varied camera perspectives.

Ablation analysis confirms the importance of using predicted camera intrinsics, showing marked improvement over default parameters, especially for datasets with diverse camera setups like SPEC-SYN. In evaluations on the SSP-3D dataset, enhanced shape accuracy is observed, highlighting the quality of the pseudo ground truth generated through the authors' enhanced training process.
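A small worked example suggests why a wrong focal length matters. For a pinhole camera, the pixel height of an object equals focal length times real height divided by depth, so an incorrect focal length translates directly into an incorrect depth. The numbers below, including the "default" focal length, are hypothetical and chosen only for illustration.

```python
def depth_from_apparent_height(focal_px, real_height_m, pixel_height):
    """Invert the pinhole relation pixel_height = focal_px * real_height_m / depth."""
    return focal_px * real_height_m / pixel_height

# Hypothetical setup: a 1.7 m person standing 3 m from a camera
# whose true focal length is 600 px.
true_focal = 600.0
assumed_default_focal = 5000.0   # illustrative "default" value, not from the source
person_height_m = 1.7
true_depth_m = 3.0
pixel_height = true_focal * person_height_m / true_depth_m

depth_with_true = depth_from_apparent_height(true_focal, person_height_m, pixel_height)
depth_with_default = depth_from_apparent_height(assumed_default_focal, person_height_m, pixel_height)
```

With the true focal length the recovered depth matches the real one, while the assumed default places the person many times farther away. A fitting procedure that trusts the default must then compensate elsewhere, for example by distorting pose or shape, to keep the 2D projection aligned.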

Speculation on Future Work

The implications of this work extend beyond achieving superior accuracy in HPS tasks. The approach opens avenues for more refined applications in virtual reality, cinematography, and human-computer interaction, where understanding subtle body dynamics in diverse settings is vital. A natural next step is exploring the integration of such perspective-aware models with real-time systems, potentially enhancing AR experiences or supporting automated motion capture applications.

Moreover, as machine learning models increasingly leverage diverse datasets, the trade-offs between computational complexity and model accuracy will likely continue to be an area of active research. Investigating efficient resource balancing to achieve real-time inference, possibly with edge devices, could drive the practical deployment of models like CameraHMR in consumer-grade technologies.

In conclusion, by combining FoV estimation, a full perspective camera model, and dense keypoint supervision, this paper advances the state of HPS estimation and lays a foundation for future work on perspective-aware modeling techniques.
