RoHM: Robust Human Motion Reconstruction via Diffusion (2401.08570v2)

Published 16 Jan 2024 in cs.CV

Abstract: We propose RoHM, an approach for robust 3D human motion reconstruction from monocular RGB(-D) videos in the presence of noise and occlusions. Most previous approaches either train neural networks to directly regress motion in 3D or learn data-driven motion priors and combine them with optimization at test time. The former do not recover globally coherent motion and fail under occlusions; the latter are time-consuming, prone to local minima, and require manual tuning. To overcome these shortcomings, we exploit the iterative, denoising nature of diffusion models. RoHM is a novel diffusion-based motion model that, conditioned on noisy and occluded input data, reconstructs complete, plausible motions in consistent global coordinates. Given the complexity of the problem -- requiring one to address different tasks (denoising and infilling) in different solution spaces (local and global motion) -- we decompose it into two sub-tasks and learn two models, one for global trajectory and one for local motion. To capture the correlations between the two, we then introduce a novel conditioning module, combining it with an iterative inference scheme. We apply RoHM to a variety of tasks -- from motion reconstruction and denoising to spatial and temporal infilling. Extensive experiments on three popular datasets show that our method outperforms state-of-the-art approaches qualitatively and quantitatively, while being faster at test time. The code is available at https://sanweiliti.github.io/ROHM/ROHM.html.

Summary

The paper introduces a novel diffusion-based approach that overcomes limitations of regression and optimization methods in 3D human motion reconstruction.
It separates global trajectory inference and local body motion prediction, refining estimates via an iterative, score-guided sampling process.
Extensive tests on public datasets demonstrate that RoHM achieves higher accuracy and faster inference than state-of-the-art techniques under challenging conditions.

Overview of Robust Human Motion Reconstruction

Human motion capture and reconstruction underpin applications in virtual reality, animation, and robotics. However, recovering accurate 3D human motion from monocular video (video captured by a single camera) remains challenging, particularly in the presence of noise, occlusions, or both.

A Novel Motion Reconstruction Approach

RoHM (Robust Human Motion Reconstruction) recovers human motion from monocular RGB(-D) videos. Previous approaches either train neural networks to regress 3D motion directly, which does not yield globally coherent motion and fails under occlusions, or combine learned motion priors with test-time optimization, which is slow, prone to local minima, and requires manual tuning. RoHM sidesteps these issues by using the iterative denoising process of diffusion models to infer complete, coherent motion from noisy and occluded input data.
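To make the idea of iterative denoising concrete, the following is a minimal sketch of a standard DDPM-style reverse sampling loop on a toy 1D trajectory with an occluded span. It is not RoHM's implementation: `fill_and_smooth` is a hypothetical stand-in for a trained network (it interpolates and smooths the observations), and the noise schedule, step count, and all function names are assumptions chosen for illustration.

```python
import math
import random


def fill_and_smooth(obs, visible, window=2):
    """Hypothetical stand-in for a trained denoiser: interpolate occluded
    frames from visible neighbours, then apply a moving average."""
    n = len(obs)
    vis = [i for i in range(n) if visible[i]]
    if not vis:
        raise ValueError("need at least one visible frame")
    filled = []
    for i in range(n):
        left = max((j for j in vis if j <= i), default=None)
        right = min((j for j in vis if j >= i), default=None)
        if left is None:
            filled.append(obs[right])
        elif right is None:
            filled.append(obs[left])
        elif left == right:  # frame i itself is visible
            filled.append(obs[i])
        else:
            w = (i - left) / (right - left)
            filled.append((1 - w) * obs[left] + w * obs[right])
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        out.append(sum(filled[lo:hi]) / (hi - lo))
    return out


def reverse_diffusion(obs, visible, steps=50, seed=0):
    """DDPM-style ancestral sampling conditioned on (obs, visible)."""
    rng = random.Random(seed)
    # Assumed linear noise schedule.
    betas = [1e-4 + (0.02 - 1e-4) * t / (steps - 1) for t in range(steps)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    # A trained model would also take the current sample x and the step t.
    x0_hat = fill_and_smooth(obs, visible)
    x = [rng.gauss(0.0, 1.0) for _ in obs]  # start from Gaussian noise
    for t in reversed(range(steps)):
        ab, beta = alpha_bars[t], betas[t]
        # Noise implied by the predicted clean motion.
        eps = [(xi - math.sqrt(ab) * x0) / math.sqrt(1.0 - ab)
               for xi, x0 in zip(x, x0_hat)]
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = [(xi - beta / math.sqrt(1.0 - ab) * e) / math.sqrt(1.0 - beta)
                for xi, e in zip(x, eps)]
        sigma = math.sqrt(beta) if t > 0 else 0.0  # no noise on the last step
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x


# Toy input: a noisy sine trajectory with frames 15..24 occluded.
clean = [math.sin(0.2 * i) for i in range(40)]
noise = random.Random(1)
obs = [c + noise.gauss(0.0, 0.05) for c in clean]
visible = [not (15 <= i < 25) for i in range(40)]
recon = reverse_diffusion(obs, visible)
```

Because the stand-in ignores the current sample, the loop simply converges onto its prediction. A trained denoiser would instead use the noisy sample and the step index at every iteration, which is where the gradual refinement comes from.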

Devised by researchers at ETH Zurich and Meta Reality Labs Research, RoHM is particularly adept at reconstructing smooth and plausible motions even when parts of the body are occluded or the initial motion data is heavily corrupted. It achieves consistency in global coordinates and handles multiple tasks—from denoising to spatial and temporal infilling—efficiently and flexibly.

Methodology

RoHM's methodology comprises four components:

Diffusion-Based Motion Models: The core of RoHM's framework is a pair of diffusion-based models that take noisy and incomplete input and output refined global trajectories and local body motions.
Separation of Global and Local Dynamics: Recognizing the complexity of human motion, RoHM separates the reconstruction process into two distinct tasks: inferring global trajectory and predicting local body motion.
Iterative Inference Scheme: To further improve the reconstruction, RoHM runs inference iteratively. The global and local models first produce initial predictions; later iterations refine them, with each model taking the results of the previous iteration as additional input.
Score-Guided Sampling: During the final stages of test-time sampling, RoHM applies score guidance. This encourages physical plausibility by matching image evidence for visible joints and reducing foot sliding.
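The guidance idea in the last item can be sketched as gradient steps on a cost that combines agreement with observations on visible frames and a foot-sliding penalty on contact frames. This is a toy illustration under stated assumptions, not the paper's formulation: the quadratic cost, its weights, the 1D foot coordinate, the contact flags, and the function names are all hypothetical, and the learned denoising step is replaced by an identity stand-in.

```python
def guidance_cost(x, obs, visible, contact, w_obs=1.0, w_slide=1.0):
    """Observation term on visible frames plus a foot-sliding term
    on consecutive contact frames (assumed quadratic form)."""
    fit = sum((xi - oi) ** 2 for xi, oi, v in zip(x, obs, visible) if v)
    slide = sum((x[i + 1] - x[i]) ** 2
                for i in range(len(x) - 1) if contact[i] and contact[i + 1])
    return w_obs * fit + w_slide * slide


def guidance_grad(x, obs, visible, contact, w_obs=1.0, w_slide=1.0):
    """Analytic gradient of guidance_cost with respect to x."""
    g = [0.0] * len(x)
    for i, (xi, oi, v) in enumerate(zip(x, obs, visible)):
        if v:
            g[i] += 2.0 * w_obs * (xi - oi)
    for i in range(len(x) - 1):
        if contact[i] and contact[i + 1]:
            d = 2.0 * w_slide * (x[i + 1] - x[i])
            g[i + 1] += d
            g[i] -= d
    return g


def guided_sampling(x_init, obs, visible, contact, steps=20, scale=0.1):
    """Apply a guidance step at each iteration of the sampling loop."""
    x = list(x_init)
    for _ in range(steps):
        # A trained model would produce a denoised estimate here;
        # this sketch uses the identity as a stand-in.
        g = guidance_grad(x, obs, visible, contact)
        x = [xi - scale * gi for xi, gi in zip(x, g)]
    return x


# Toy input: a foot coordinate over six frames. The foot is in contact
# during frames 0..3, and frames 2..3 are occluded.
obs = [0.0, 0.1, 0.0, 0.1, 1.0, 2.0]
visible = [True, True, False, False, True, True]
contact = [True, True, True, True, False, False]
x_init = [0.0, 0.3, -0.2, 0.4, 1.1, 1.9]

x_guided = guided_sampling(x_init, obs, visible, contact)
cost_before = guidance_cost(x_init, obs, visible, contact)
cost_after = guidance_cost(x_guided, obs, visible, contact)
```

With these weights the step size `scale=0.1` is small enough for the cost to decrease at every iteration. In a real sampler the guidance gradient would shift the model's denoised estimate rather than act on its own.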

Performance and Applications

Extensive experiments on public datasets show that RoHM outperforms state-of-the-art methods in both accuracy and realism. It is also significantly faster than optimization-based approaches at inference time, while remaining flexible enough to handle a variety of tasks.

Conclusion and Future Work

RoHM is a step forward in 3D human motion reconstruction from monocular video. The current formulation does not support real-time online motion capture and does not model hand poses or facial expressions in detail; advances in these areas could make RoHM more capable.

Given its robust handling of noise and occlusions, RoHM paves the way for more accurate and plausible virtual representations of human motion, which can expand possibilities within interactive technologies and beyond.
