- The paper proposes a framework that integrates 2D keypoints and dense video features to boost the accuracy of 3D human motion reconstruction.
- The paper introduces a contact-aware trajectory estimation method that prevents foot sliding and minimizes temporal drift in global placement.
- The paper achieves real-time processing at 200 fps and outperforms state-of-the-art methods on benchmarks such as 3DPW and EMDB, with refined global trajectory accuracy.
An Analysis of WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion
The paper presents WHAM (World-grounded Humans with Accurate Motion), a novel framework that addresses current limitations in 3D human motion estimation from monocular video. The primary challenges identified are inaccurate global trajectory estimation and computational inefficiency in existing methods, largely due to their reliance on camera coordinates and assumptions such as a flat ground plane.
Core Contributions of WHAM
WHAM proposes several strategic improvements over existing methodologies:
- Feature Integration: WHAM combines 2D keypoints and dense image features to reconstruct more precise, better-aligned 3D motion. Integrating these signals over temporal sequences significantly improves accuracy: sparse 2D keypoint sequences are synthesized by projecting motion from the large-scale AMASS mocap dataset, while dense image features come from pre-trained models (a minimal fusion sketch follows this list).
- Global Trajectory Estimation: It introduces a novel contact-aware trajectory recovery mechanism that prevents foot sliding and places the human accurately in a global coordinate system. This is achieved by incorporating SLAM-based camera motion to disentangle human and camera motion, and then refining the trajectory based on foot-ground contact probability (see the trajectory-refinement sketch after this list).
- Efficiency and Online Inference: Unlike computationally intensive optimization pipelines, WHAM processes data in an online manner, reaching real-time execution at 200 fps. It uses recurrent networks to lift 2D keypoints to 3D motion and to decode the global trajectory, reducing the temporal inconsistencies that arise when per-frame methods are applied to video (illustrated in the streaming sketch below).
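The summary does not spell out the exact fusion operator, so the following is a minimal PyTorch sketch of the keypoint/image-feature fusion idea from the first bullet; the module names, feature widths, and the residual fusion rule are illustrative assumptions rather than WHAM's actual architecture.

```python
# Minimal sketch of fusing sparse 2D keypoints with dense image features.
# Module names and dimensions are illustrative, not WHAM's released code.
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Encodes a sequence of sparse 2D keypoints into per-frame motion features."""
    def __init__(self, n_joints=17, d_model=512):
        super().__init__()
        self.rnn = nn.GRU(n_joints * 2, d_model, batch_first=True)

    def forward(self, kp2d):                  # kp2d: (B, T, J, 2)
        B, T = kp2d.shape[:2]
        out, _ = self.rnn(kp2d.reshape(B, T, -1))
        return out                            # (B, T, d_model)

class FeatureIntegrator(nn.Module):
    """Fuses motion features with dense image features from a pre-trained backbone."""
    def __init__(self, d_model=512, d_img=2048):
        super().__init__()
        self.proj = nn.Linear(d_model + d_img, d_model)

    def forward(self, motion_feat, img_feat): # (B, T, d_model), (B, T, d_img)
        fused = torch.cat([motion_feat, img_feat], dim=-1)
        # Residual fusion: the image feature refines rather than replaces the motion feature.
        return motion_feat + torch.relu(self.proj(fused))

kp2d = torch.randn(1, 16, 17, 2)              # 16 frames of 17 COCO-style keypoints
img_feat = torch.randn(1, 16, 2048)           # e.g. pooled backbone features per frame
fused = FeatureIntegrator()(MotionEncoder()(kp2d), img_feat)
print(fused.shape)                            # torch.Size([1, 16, 512])
```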
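For the contact-aware trajectory recovery in the second bullet, the sketch below illustrates one simple way to damp foot sliding: when a foot joint is predicted to be in contact, its inter-frame displacement is subtracted from the root velocity before the velocities are integrated into a global trajectory. The threshold and averaging scheme are assumptions for illustration, not the paper's exact refinement rule.

```python
# Hedged sketch of contact-aware trajectory refinement: cancel root motion that
# would make contacting feet slide, then integrate velocities to world positions.
import numpy as np

def refine_trajectory(root_vel, foot_pos, contact_prob, thresh=0.5):
    """
    root_vel:     (T, 3) per-frame root velocity (displacement) in world coordinates
    foot_pos:     (T, F, 3) world-frame foot joint positions (e.g. toes and heels)
    contact_prob: (T, F) predicted probability that each foot joint touches the ground
    Returns refined root velocities and the integrated global trajectory.
    """
    refined = root_vel.copy()
    for t in range(1, len(root_vel)):
        in_contact = contact_prob[t] > thresh
        if in_contact.any():
            # Average sliding of contacting feet between frames t-1 and t.
            slide = (foot_pos[t, in_contact] - foot_pos[t - 1, in_contact]).mean(axis=0)
            # Cancel the sliding by subtracting it from the root velocity.
            refined[t] = root_vel[t] - slide
    trajectory = np.cumsum(refined, axis=0)   # integrate velocities to global positions
    return refined, trajectory

T, F = 120, 4
root_vel = 0.01 * np.random.randn(T, 3)
foot_pos = np.random.randn(T, F, 3)
contact_prob = np.random.rand(T, F)
_, traj = refine_trajectory(root_vel, foot_pos, contact_prob)
print(traj.shape)                             # (120, 3)
```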
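The third bullet's claim of online, real-time operation amounts to carrying recurrent state frame by frame instead of re-optimizing over a whole clip; the toy loop below illustrates that streaming pattern (the 512-dimensional feature size is an arbitrary assumption).

```python
# Illustrative streaming loop: recurrent state persists across frames, so each
# new frame is processed as it arrives rather than via offline batch optimization.
import torch
import torch.nn as nn

rnn = nn.GRUCell(input_size=512, hidden_size=512)
hidden = torch.zeros(1, 512)                  # persistent state across the stream

def process_frame(frame_feature, hidden):
    """One streaming step: update the hidden state from a single frame's feature."""
    return rnn(frame_feature, hidden)

for _ in range(30):                           # simulate a 30-frame stream
    frame_feature = torch.randn(1, 512)       # fused per-frame feature (see sketch above)
    hidden = process_frame(frame_feature, hidden)
print(hidden.shape)                           # torch.Size([1, 512])
```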
Performance and Results
WHAM outperforms current state-of-the-art methods on several benchmarks. It shows superior per-frame accuracy (MPJPE, PA-MPJPE, and PVE) across datasets such as 3DPW, RICH, and EMDB. Notably, WHAM's global trajectory estimation shows reduced root translation and orientation errors, indicating minimal temporal drift, which is critical for applications demanding stable long-term motion tracking.
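MPJPE, PA-MPJPE, and PVE are standard evaluation metrics rather than contributions of the paper; for reference, the sketch below implements the conventional definitions of MPJPE and PA-MPJPE (Procrustes-aligned MPJPE) in NumPy. This is not the authors' evaluation code.

```python
# Conventional per-frame joint error metrics on (J, 3) predicted and ground-truth joints.
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error (usually reported in millimetres)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment (similarity transform) of pred to gt."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation and scale from the SVD of the cross-covariance matrix.
    U, S, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)

pred = np.random.randn(24, 3)                 # e.g. 24 SMPL joints
gt = pred + 0.01 * np.random.randn(24, 3)
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))
```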
Furthermore, WHAM's design allows it to reuse feature extractors from existing networks such as SPIN and CLIFF, achieving improved results across different backbone configurations. Switching to vision transformer (ViT) features boosts performance further, as seen in the WHAM (ViT) and WHAM-B (ViT) configurations; a minimal backbone-agnostic extraction sketch follows.
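The backbone-swapping claim boils down to the motion network consuming only a fixed-width per-frame feature tensor, so any image encoder can sit in front of it. The sketch below uses torchvision's ResNet-50 as a stand-in for the pre-trained backbones mentioned above; the projection layer and feature width are assumptions for illustration.

```python
# Sketch of a swappable per-frame feature extractor: downstream modules only see
# a (B, T, D) tensor, so the image encoder (ResNet, HRNet, ViT, ...) is exchangeable.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FrameFeatureExtractor(nn.Module):
    def __init__(self, d_out=2048):
        super().__init__()
        backbone = resnet50(weights=None)          # swap in any image encoder here
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier
        self.proj = nn.Linear(2048, d_out)         # unify feature width across backbones

    def forward(self, frames):                     # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).flatten(1)  # (B*T, 2048)
        return self.proj(feats).view(B, T, -1)     # (B, T, d_out)

video = torch.randn(1, 8, 3, 224, 224)             # 8 RGB frames
print(FrameFeatureExtractor()(video).shape)        # torch.Size([1, 8, 2048])
```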
Theoretical and Practical Implications
Theoretically, WHAM demonstrates how 3D motion recovery frameworks can be enhanced by integrating multiple data domains: synthetic mocap data and real video features. This fusion, combined with camera motion from SLAM, offers a pathway toward resolving complex multi-body and environment interactions more seamlessly.
Practically, WHAM's advancements could be transformative for various applications, including AR/VR, robotics, and sports analytics, where real-time, precise human motion estimation is critical. WHAM’s scalability and speed pave the way for efficient deployment in edge computing environments, making real-time human motion capture feasible for consumer-grade devices.
Future Directions
WHAM sets the stage for further real-time motion capture applications, particularly in more complex environments with varying terrain and more diverse activities. Future work might refine contact estimation to cover body parts beyond the feet and integrate scene context for added robustness. Extending WHAM's utility also requires careful attention to occlusion, partial visibility, and interactions with other dynamic entities in real-world settings.
In summary, WHAM is a significant step toward overcoming existing challenges in 3D human motion recovery, providing a performant, efficient, and scalable solution adaptable to both academic and industry needs in various real-world scenarios.