- The paper introduces a novel diffusion-based approach that overcomes limitations of regression and optimization methods in 3D human motion reconstruction.
- It separates global trajectory inference and local body motion prediction, refining estimates via an iterative, score-guided sampling process.
- Extensive tests on public datasets show that the proposed method, RoHM, is more accurate than state-of-the-art techniques under noise and occlusion, and faster at inference than optimization-based approaches.
Overview of Robust Human Motion Reconstruction
Human motion capture and reconstruction have a profound impact on numerous fields such as virtual reality, animation, and robotics. However, obtaining accurate 3D human motion from monocular videos—videos captured from a single camera angle—remains challenging, particularly in scenarios with noise, occlusions, or both.
A Novel Motion Reconstruction Approach
A recently developed system named RoHM (Robust Human Motion Reconstruction) recovers human motion from monocular RGB(-D) videos. Traditional techniques fall into two groups. Some use neural networks to regress 3D motion directly, which can lead to a lack of global motion coherence. Others run complex optimization at test time, which is computationally expensive and can get trapped in local minima. RoHM sidesteps both problems by using the iterative, generative nature of diffusion models to infer complete, coherent motion from noisy and occluded input data.
Devised by researchers at ETH Zurich and Meta Reality Labs Research, RoHM is particularly adept at reconstructing smooth and plausible motions even when parts of the body are occluded or the initial motion data is heavily corrupted. It achieves consistency in global coordinates and handles multiple tasks—from denoising to spatial and temporal infilling—efficiently and flexibly.
Methodology
RoHM's methodology has four main components:
- Diffusion-Based Motion Models: The core of RoHM's framework is a pair of diffusion-based models that take noisy and incomplete input and output refined global trajectories and local body motions.
- Separation of Global and Local Dynamics: Recognizing the complexity of human motion, RoHM separates the reconstruction process into two distinct tasks: inferring global trajectory and predicting local body motion.
- Iterative Inference Scheme: To improve the reconstructed motion further, the system runs inference iteratively. The global and local models first produce initial predictions; each later iteration refines them, taking the results of the previous iteration as additional input.
- Score-Guided Sampling: In the final stages of inference, RoHM applies score-guided sampling. The guidance promotes physical plausibility by keeping visible joints close to the image evidence and by reducing foot sliding.
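The idea behind guided refinement can be illustrated with a toy example. The sketch below is not RoHM's implementation: the smoothing function `toy_denoise` stands in for a learned diffusion denoiser, and `foot_sliding_cost`, `guided_refine`, and all parameter values are hypothetical names and settings chosen for illustration. It only shows the general pattern of running denoising steps and, in the last few steps, nudging the sample down the gradient of a plausibility cost (here, foot sliding during ground contact).

```python
import numpy as np

def toy_denoise(x, strength=0.5):
    """Stand-in for a learned denoiser (assumption): temporal smoothing of a (T, D) sequence."""
    kernel = np.array([0.25, 0.5, 0.25])
    padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    smooth = sum(k * padded[i:i + len(x)] for i, k in enumerate(kernel))
    return (1 - strength) * x + strength * smooth

def foot_sliding_cost(foot_xy, contact):
    """Sum of squared horizontal foot displacements on frames flagged as in contact."""
    vel = np.diff(foot_xy, axis=0)                      # (T-1, 2) frame-to-frame motion
    return float(np.sum(contact[1:, None] * vel ** 2))

def foot_sliding_grad(foot_xy, contact):
    """Analytic gradient of foot_sliding_cost with respect to foot_xy."""
    vel = np.diff(foot_xy, axis=0) * contact[1:, None]
    grad = np.zeros_like(foot_xy)
    grad[1:] += 2 * vel                                 # contribution as the later frame
    grad[:-1] -= 2 * vel                                # contribution as the earlier frame
    return grad

def guided_refine(noisy_foot_xy, contact, steps=20, guided_steps=5, scale=0.2):
    """Run toy denoising steps; apply cost-gradient guidance only in the final steps."""
    x = noisy_foot_xy.copy()
    for i in range(steps):
        x = toy_denoise(x)
        if i >= steps - guided_steps:                   # guidance in the last few steps only
            x = x - scale * foot_sliding_grad(x, contact)
    return x

# Usage: a foot that should stay planted, observed with heavy noise.
rng = np.random.default_rng(0)
noisy = rng.normal(size=(30, 2))
contact = np.ones(30)                                   # foot in contact on every frame
refined = guided_refine(noisy, contact)
print(foot_sliding_cost(noisy, contact), foot_sliding_cost(refined, contact))
```

In this toy setting the guided result slides less than both the noisy input and an unguided run (`guided_steps=0`). A real system would replace the smoothing step with a trained denoiser and would combine several cost terms, such as agreement with detected 2D joints.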
Performance and Applications
Extensive testing on public datasets shows that RoHM outperforms state-of-the-art methods in both accuracy and realism. It is also significantly faster at inference than optimization-based approaches, while remaining flexible enough to accommodate various tasks.
Conclusion and Future Work
RoHM represents a step forward in 3D human motion reconstruction from monocular video. The current formulation does not support real-time online motion capture and omits detailed modeling of hand poses and facial expressions; extending it in these directions would broaden its applicability.
Given its robust handling of noise and occlusions, RoHM paves the way for more accurate and plausible virtual representations of human motion, which can expand possibilities within interactive technologies and beyond.