- The paper proposes a CVAE-based approach that learns motion priors to robustly estimate 3D pose and shape from ambiguous inputs.
- It employs test-time optimization framed as a multi-objective maximum a posteriori problem, and the trained model achieves a contact prediction accuracy of 0.88 on benchmark datasets.
- Evaluations on diverse motions and the AMASS dataset demonstrate superior generalization and reduced ground penetration compared to baselines.
Overview of "HuMoR: 3D Human Motion Model for Robust Pose Estimation"
This paper introduces HuMoR, a generative 3D human motion model, formulated as a conditional variational autoencoder (CVAE), designed for robust temporal estimation of pose and shape. HuMoR targets the challenging problem of recovering plausible human motion sequences from noisy or occluded observations. The authors propose a flexible, optimization-based approach that leverages HuMoR as a motion prior, enabling robust estimation of plausible poses and shapes from ambiguous inputs. The system is evaluated on diverse motions and body shapes, across a range of input modalities including 3D keypoints and RGB(-D) videos, over extensive datasets.
Model Architecture
HuMoR employs a CVAE architecture to model the dynamics of 3D human motion, specifically learning a distribution over pose transitions between consecutive time steps. The CVAE comprises a prior network that learns a conditional prior over latent variables, an encoder that approximates the posterior, and a decoder that generates predictions. A latent variable vector governs the transition between states, with the decoder outputting both the change in state and person-ground contact probabilities, which help constrain pose estimation. This configuration allows HuMoR to handle a wide variety of motion sequences and body shapes.
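The three-network structure above can be sketched as follows. This is a minimal toy, not the paper's architecture: the state, latent, and contact dimensions are arbitrary placeholders, and random affine maps stand in for trained MLPs. It only illustrates the data flow of sampling a transition: draw a latent from the conditional prior, then decode it into a state change and contact probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the paper's state includes joints, root, etc.)
STATE_DIM, LATENT_DIM, CONTACT_DIM = 6, 4, 2

def linear(in_dim, out_dim):
    """Random affine map standing in for a trained MLP."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)

# Prior network: p(z_t | x_{t-1}) -> (mu, log-variance)
W_pr, b_pr = linear(STATE_DIM, 2 * LATENT_DIM)
# Decoder: (z_t, x_{t-1}) -> (state change, contact logits)
W_de, b_de = linear(LATENT_DIM + STATE_DIM, STATE_DIM + CONTACT_DIM)

def prior(x_prev):
    out = x_prev @ W_pr + b_pr
    return out[:LATENT_DIM], out[LATENT_DIM:]

def decode(z, x_prev):
    out = np.concatenate([z, x_prev]) @ W_de + b_de
    delta, contact_logits = out[:STATE_DIM], out[STATE_DIM:]
    x_t = x_prev + delta  # decoder predicts the *change* of state
    contacts = 1.0 / (1.0 + np.exp(-contact_logits))  # person-ground contact probs
    return x_t, contacts

def sample_step(x_prev):
    """Roll the generative model one step: sample z from the conditional prior, decode."""
    mu, logvar = prior(x_prev)
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=LATENT_DIM)  # reparameterization
    return decode(z, x_prev)

x0 = rng.normal(size=STATE_DIM)
x1, c1 = sample_step(x0)
```

At training time an encoder network would additionally approximate the posterior q(z_t | x_t, x_{t-1}), and the KL term in the loss would pull it toward the conditional prior rather than a fixed standard normal.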
Training and Evaluation
Training is conducted on the AMASS motion capture dataset, whose diverse range of human motions allows the CVAE to generalize beyond the training poses. The authors train the model on clean motion capture data, using multi-step sequences with scheduled sampling to improve long-term generation accuracy. The paper details the training regimen, strategies to avoid posterior collapse, and variance regularization, showing improved accuracy and generalization over baseline models. Evaluations on tasks such as future prediction and diverse sampling demonstrate the method's accuracy and its ability to generalize to unseen data, with contact prediction accuracy notably reaching 0.88.
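Scheduled sampling over multi-step sequences can be sketched as below. This is a generic illustration, not the paper's exact schedule: at each rollout step the model is fed its own prediction with some probability (otherwise the ground-truth state), and that probability is annealed from 0 (pure teacher forcing) toward 1 over training so the model learns to recover from its own errors. The function names and the linear annealing schedule are assumptions for illustration.

```python
import random

def rollout_with_scheduled_sampling(x0, gt_states, step_fn, p_model):
    """Multi-step training rollout: at each step, feed the model its own
    prediction with probability p_model, otherwise the ground-truth state."""
    x_in, preds = x0, []
    for x_gt in gt_states:
        x_pred = step_fn(x_in)
        preds.append(x_pred)
        x_in = x_pred if random.random() < p_model else x_gt
    return preds

def schedule(epoch, num_epochs):
    """Hypothetical linear anneal: teacher forcing early, own predictions later."""
    return min(1.0, epoch / max(1, num_epochs // 2))
```

With `p_model = 0` every step is conditioned on ground truth; with `p_model = 1` the rollout is fully autoregressive, matching how the model is used at generation time.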
Test-Time Optimization
The paper introduces a novel test-time optimization approach that uses HuMoR as a robust motion prior. Given partial or noisy observations, it jointly solves for the 3D attributes of motion, including body shape and contact points. The approach frames maximum a posteriori (MAP) estimation as a multi-objective optimization problem, encoding both fidelity to the observed data and motion plausibility constraints. Extensive quantitative results from fitting applications on the i3DB and PROX datasets show that HuMoR estimates plausible human motion and pose with fewer and less severe ground penetrations than baseline methods.
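The multi-objective structure of such an energy can be illustrated with a deliberately simplified toy: a 1D "height" trajectory is fit to noisy observations under a data term, a smoothness term standing in for the learned motion prior, and a penalty for dipping below the ground plane at height 0. The weights, the quadratic penalties, and plain gradient descent are all illustrative assumptions; the paper optimizes a far richer MAP objective over latent transitions, shape, and ground.

```python
import numpy as np

rng = np.random.default_rng(1)

def energy(x, y, w_prior=5.0, w_ground=10.0):
    """Toy multi-objective energy: data term + smoothness 'prior' + ground penalty."""
    e_data = np.sum((x - y) ** 2)                       # match observations
    e_prior = w_prior * np.sum(np.diff(x) ** 2)         # stand-in for motion prior
    e_ground = w_ground * np.sum(np.minimum(x, 0.0) ** 2)  # penalize penetration
    return e_data + e_prior + e_ground

def grad(x, y, w_prior=5.0, w_ground=10.0):
    """Analytic gradient of the toy energy."""
    g = 2 * (x - y)
    d = np.diff(x)
    g[:-1] -= 2 * w_prior * d
    g[1:] += 2 * w_prior * d
    g += 2 * w_ground * np.minimum(x, 0.0)
    return g

# Noisy observations of a trajectory that dips below the ground plane.
y = np.sin(np.linspace(0, np.pi, 30)) - 0.2 + 0.05 * rng.normal(size=30)
x = y.copy()
for _ in range(500):          # simple gradient descent on the joint objective
    x -= 0.01 * grad(x, y)
```

The competing terms trade off exactly as in the paper's formulation: the data term keeps the estimate near the observations, while the prior and contact/ground terms pull it toward physical plausibility.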
Implications and Future Directions
HuMoR demonstrates that combining generative models with optimization techniques can effectively address motion estimation challenges, especially in scenarios involving occlusions or noisy data. While current limitations include the assumptions of a static ground plane and a static camera, the combination of a CVAE prior with motion optimization lays a solid foundation for future work. Potential directions include handling dynamic cameras and uneven terrain, and developing learned approaches that improve speed and enable multi-hypothesis outputs, broadening applicability in real-world environments.
In summary, this work represents a promising advance in human motion capture, underscoring the adaptability and robustness gained by integrating deep generative models into motion estimation frameworks.