- The paper introduces GVHMR, a novel gravity-aligned coordinate system that reduces learning ambiguity in mapping images to poses.
- It integrates a transformer with Rotary Positional Embedding to predict global trajectories and refine human orientation in 3D space.
- Experimental results demonstrate a 7-10 mm error reduction on world-grounded metrics, advancing the state of the art in motion recovery.
World-Grounded Human Motion Recovery via Gravity-View Coordinates
The paper "World-Grounded Human Motion Recovery via Gravity-View Coordinates" presents a significant contribution to the field of 3D human motion reconstruction. It introduces GVHMR, a novel methodology for recovering world-grounded human motion from monocular videos. This approach aims to address the challenge of world coordinate system ambiguity, which is particularly complex when sequences involve varied spatial reference frames.
Proposed Method
The authors propose a Gravity-View (GV) coordinate system defined by the world gravity direction and the camera view direction. Because this system is gravity-aligned and uniquely determined for each video frame, it reduces the ambiguity of learning the image-to-pose mapping. The GVHMR framework comprises several key components, each illustrated with a short code sketch after the list:
- Gravity-View (GV) Coordinate System: Defined per frame from gravity and the camera view direction, this gravity-aligned frame makes gravity-aware human orientation easier to estimate (first sketch below).
- Transformer Model with Rotary Positional Embedding (RoPE): The model predicts the global trajectory representation Γ_GV and the root velocity v_root, leveraging a transformer enhanced with RoPE to capture relative positional information (second sketch below).
- Global Orientation Recovery Algorithm: This algorithm transforms per-frame GV human poses into a consistent world coordinate system while preventing error accumulation along the gravity direction (third sketch below).
- Post-Processing for Motion Refinement: Stationary probabilities predicted for the hands and feet drive an inverse-kinematics refinement that reduces foot sliding and smooths the global trajectory (fourth sketch below).
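To make the GV construction concrete, here is a minimal NumPy sketch of one plausible way to build such a frame: the up axis opposes gravity, and the forward axis is the camera view direction projected onto the horizontal plane. The function name and axis conventions are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def gravity_view_frame(gravity, view_dir):
    """Build a gravity-aligned rotation matrix from the world gravity
    direction and the camera view direction (both 3-vectors).
    Assumes the view direction is not parallel to gravity."""
    up = -np.asarray(gravity, dtype=float)
    up /= np.linalg.norm(up)
    # Forward axis: view direction projected onto the horizontal
    # plane (orthogonal to gravity), then normalized.
    fwd = np.asarray(view_dir, dtype=float)
    fwd = fwd - np.dot(fwd, up) * up
    fwd /= np.linalg.norm(fwd)
    # Right axis completes a right-handed, gravity-aligned frame.
    right = np.cross(up, fwd)
    # Columns are the GV basis axes expressed in world coordinates.
    return np.stack([right, up, fwd], axis=1)
```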
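The rotary embedding itself is generic; the standard computation (sketched here, not taken from the paper's code) rotates pairs of query/key features by position-dependent angles so that dot-product attention scores depend only on relative offsets:

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Rotate feature pairs of queries/keys x (shape (T, D), D even)
    by position-dependent angles; positions has shape (T,)."""
    half = x.shape[1] // 2
    # Geometrically spaced rotation frequencies, one per feature pair.
    freqs = base ** (-np.arange(half) / half)
    angles = positions[:, None] * freqs[None, :]   # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Hypothetical usage inside attention: rotate q and k, never v.
# q = apply_rope(q, np.arange(T)); k = apply_rope(k, np.arange(T))
```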
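For the recovery step, here is a heavily simplified sketch of the idea, under the assumption that only the heading (rotation about the gravity axis) is chained across frames; function names and the exact composition are hypothetical, not the paper's algorithm verbatim:

```python
import numpy as np

def yaw_matrix(theta):
    """Rotation about the world up (anti-gravity) axis, here y."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def recover_world_orientations(R_gv, yaw_deltas):
    """Chain per-frame GV orientations (T, 3, 3) into one world frame.
    Only the heading change about the gravity axis (yaw_deltas, (T,))
    is accumulated, so drift cannot leak into the gravity direction."""
    headings = np.cumsum(yaw_deltas)
    return np.stack([yaw_matrix(h) @ R for h, R in zip(headings, R_gv)])
```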
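And as a simple stand-in for the paper's IK-based refinement, the crudest version of the stationary-contact idea pins a foot to its touchdown position while its predicted stationary probability stays high:

```python
import numpy as np

def pin_stationary_feet(foot_pos, stationary_prob, thresh=0.5):
    """Pin a foot trajectory (T, 3) to its touchdown position while
    its stationary probability (T,) exceeds a threshold."""
    out = np.array(foot_pos, dtype=float)
    anchor = None
    for t in range(len(out)):
        if stationary_prob[t] > thresh:
            if anchor is None:
                anchor = out[t].copy()   # first frame of the contact
            out[t] = anchor              # hold the foot in place
        else:
            anchor = None                # foot lifted; stop pinning
    return out
```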
Experimental Results
Extensive evaluations were conducted on three different datasets: 3DPW, RICH, and EMDB. Multiple benchmarks were used to evaluate both camera-space and world-grounded accuracy. The paper reports significant improvements over existing state-of-the-art methods, both autoregressive and optimization-based. Key performance metrics include:
- World-Grounded Metrics: WA-MPJPE100 and W-MPJPE100 measure joint error in a globally aligned space over 100-frame segments. GVHMR achieves errors lower by 7-10 mm on average compared to WHAM and other baselines.
- Camera-Space Metrics: GVHMR surpasses prior methods in PA-MPJPE, MPJPE, and PVE, achieving improvements of up to 6 mm in PA-MPJPE (a sketch of the Procrustes alignment follows this list).
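For reference, here is a minimal sketch of the Procrustes-aligned metric; the world-grounded variants apply the same error after aligning whole 100-frame trajectory segments. This is the standard metric definition, not code from the paper.

```python
import numpy as np

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE for one frame of (J, 3) joints: solve
    for the similarity transform (rotation, scale, translation) that
    best maps pred onto gt, then return the mean per-joint error."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the cross-covariance (Kabsch).
    U, S, Vt = np.linalg.svd(p.T @ g)
    if np.linalg.det(Vt.T @ U.T) < 0:   # fix an improper reflection
        Vt[-1] *= -1
        S[-1] *= -1
    R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()
```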
Implications and Future Work
The implications of this work are substantial for applications requiring high-quality and consistent global motion, such as text-to-motion generation and humanoid robot imitation learning. The introduction of the GV coordinate system can be expected to inspire further research into more accurate and efficient motion recovery methods that leverage environmental priors like gravity.
Future research could explore various extensions:
- Generalization to Various Camera Configurations: Investigating how the model adapts to different types of camera movement, including cluttered or dynamic environments.
- Integration with Scene Context: Enhancing motion recovery by integrating contextual information from the surrounding environment.
- Applications in Virtual and Augmented Reality: Utilizing the improved motion recovery systems for real-time applications in VR/AR scenarios where accurate global positioning is critical.
In conclusion, GVHMR represents a significant advance in regressing world-grounded human motion, providing a novel coordinate system that robustly handles gravity alignment. The RoPE-equipped transformer strengthens sequence modeling through relative position encoding, and the comprehensive evaluation establishes the method as a leading solution in this domain.