- The paper introduces GVHMR, a novel gravity-aligned coordinate system that reduces learning ambiguity in mapping images to poses.
- It integrates a transformer with Rotary Positional Embedding to predict global trajectories and refine human orientation in 3D space.
- Experimental results demonstrate a 7-10 mm error reduction on world-grounded metrics, advancing the state of the art in motion recovery.
World-Grounded Human Motion Recovery via Gravity-View Coordinates
The paper "World-Grounded Human Motion Recovery via Gravity-View Coordinates" presents a significant contribution to the field of 3D human motion reconstruction. It introduces GVHMR, a novel methodology for recovering world-grounded human motion from monocular videos. This approach aims to address the challenge of world coordinate system ambiguity, which is particularly complex when sequences involve varied spatial reference frames.
Proposed Method
The authors propose a Gravity-View (GV) coordinate system defined by the world gravity direction and the camera view direction. Because this system is gravity-aligned and uniquely determined for each video frame, it reduces the ambiguity of learning the image-to-pose mapping. The GVHMR framework comprises several key components, each illustrated with a short code sketch after the list:
- Gravity-View (GV) Coordinate System: Defined per frame from gravity and the camera view direction, this gravity-aligned frame makes gravity-aware human orientation easier to estimate (first sketch below).
- Transformer Model with Rotary Positional Embedding (RoPE): The model predicts the global trajectory representation Γ_GV and the root velocity v_root, leveraging a transformer enhanced with RoPE to capture relative positional information (second sketch below).
- Global Orientation Recovery Algorithm: This algorithm transforms per-frame GV human poses into a consistent world coordinate system while preventing error accumulation along the gravity direction (third sketch below).
- Post-Processing for Motion Refinement: Stationary probabilities predicted for the hands and feet drive an inverse-kinematics refinement that reduces foot sliding and smooths the global trajectory (fourth sketch below).
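To make the GV construction concrete, here is a minimal NumPy sketch of one plausible way to build such a frame: the up axis opposes gravity, and the forward axis is the camera view direction projected onto the horizontal plane. The function name and axis conventions are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def gravity_view_frame(gravity, view_dir):
    """Build a gravity-aligned rotation matrix from the world gravity
    direction and the camera view direction (both 3-vectors).
    Assumes the view direction is not parallel to gravity."""
    up = -np.asarray(gravity, dtype=float)
    up /= np.linalg.norm(up)
    # Forward axis: view direction projected onto the horizontal
    # plane (orthogonal to gravity), then normalized.
    fwd = np.asarray(view_dir, dtype=float)
    fwd = fwd - np.dot(fwd, up) * up
    fwd /= np.linalg.norm(fwd)
    # Right axis completes a right-handed, gravity-aligned frame.
    right = np.cross(up, fwd)
    # Columns are the GV basis axes expressed in world coordinates.
    return np.stack([right, up, fwd], axis=1)
```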
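The rotary embedding itself is generic; the standard computation (sketched here, not taken from the paper's code) rotates pairs of query/key features by position-dependent angles so that dot-product attention scores depend only on relative offsets:

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Rotate feature pairs of queries/keys x (shape (T, D), D even)
    by position-dependent angles; positions has shape (T,)."""
    half = x.shape[1] // 2
    # Geometrically spaced rotation frequencies, one per feature pair.
    freqs = base ** (-np.arange(half) / half)
    angles = positions[:, None] * freqs[None, :]   # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Hypothetical usage inside attention: rotate q and k, never v.
# q = apply_rope(q, np.arange(T)); k = apply_rope(k, np.arange(T))
```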
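For the recovery step, here is a heavily simplified sketch of the idea, under the assumption that only the heading (rotation about the gravity axis) is chained across frames; function names and the exact composition are hypothetical, not the paper's algorithm verbatim:

```python
import numpy as np

def yaw_matrix(theta):
    """Rotation about the world up (anti-gravity) axis, here y."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def recover_world_orientations(R_gv, yaw_deltas):
    """Chain per-frame GV orientations (T, 3, 3) into one world frame.
    Only the heading change about the gravity axis (yaw_deltas, (T,))
    is accumulated, so drift cannot leak into the gravity direction."""
    headings = np.cumsum(yaw_deltas)
    return np.stack([yaw_matrix(h) @ R for h, R in zip(headings, R_gv)])
```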
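And as a simple stand-in for the paper's IK-based refinement, the crudest version of the stationary-contact idea pins a foot to its touchdown position while its predicted stationary probability stays high:

```python
import numpy as np

def pin_stationary_feet(foot_pos, stationary_prob, thresh=0.5):
    """Pin a foot trajectory (T, 3) to its touchdown position while
    its stationary probability (T,) exceeds a threshold."""
    out = np.array(foot_pos, dtype=float)
    anchor = None
    for t in range(len(out)):
        if stationary_prob[t] > thresh:
            if anchor is None:
                anchor = out[t].copy()   # first frame of the contact
            out[t] = anchor              # hold the foot in place
        else:
            anchor = None                # foot lifted; stop pinning
    return out
```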
Experimental Results
Extensive evaluations were conducted on three different datasets: 3DPW, RICH, and EMDB. Multiple benchmarks were used to evaluate both camera-space and world-grounded accuracy. The paper reports significant improvements over existing state-of-the-art methods, both autoregressive and optimization-based. Key performance metrics include:
- World-Grounded Metrics: WA-MPJPE100 and W-MPJPE100 measure joint error in a globally aligned space over 100-frame segments. GVHMR achieves errors lower by 7-10 mm on average compared to WHAM and other baselines.
- Camera-Space Metrics: GVHMR surpasses prior methods in PA-MPJPE, MPJPE, and PVE, achieving improvements of up to 6 mm in PA-MPJPE (a sketch of the Procrustes alignment follows this list).
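For reference, here is a minimal sketch of the Procrustes-aligned metric; the world-grounded variants apply the same error after aligning whole 100-frame trajectory segments. This is the standard metric definition, not code from the paper.

```python
import numpy as np

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE for one frame of (J, 3) joints: solve
    for the similarity transform (rotation, scale, translation) that
    best maps pred onto gt, then return the mean per-joint error."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the cross-covariance (Kabsch).
    U, S, Vt = np.linalg.svd(p.T @ g)
    if np.linalg.det(Vt.T @ U.T) < 0:   # fix an improper reflection
        Vt[-1] *= -1
        S[-1] *= -1
    R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()
```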
Implications and Future Work
The implications of this work are substantial for applications requiring high-quality and consistent global motion, such as text-to-motion generation and humanoid robot imitation learning. The introduction of the GV coordinate system can be expected to inspire further research into more accurate and efficient motion recovery methods that leverage environmental priors like gravity.
Future research could explore various extensions:
- Generalization to Various Camera Configurations: Investigating how the model adapts to different types of camera movement, including cluttered or dynamic environments.
- Integration with Scene Context: Enhancing motion recovery by integrating contextual information from the surrounding environment.
- Applications in Virtual and Augmented Reality: Utilizing the improved motion recovery systems for real-time applications in VR/AR scenarios where accurate global positioning is critical.
In conclusion, GVHMR represents a significant advance in regressing world-grounded human motion, providing a novel coordinate system that robustly handles gravity alignment. The RoPE-equipped transformer strengthens sequence modeling through relative position encoding, and the comprehensive evaluation establishes the method as a leading solution in this domain.