Physics-based Human Motion Estimation and Synthesis from Videos
The paper "Physics-based Human Motion Estimation and Synthesis from Videos" presents a framework for synthesizing human motion trained solely on noisy pose estimates obtained from monocular RGB videos, bypassing the need for motion capture data. The work rests on the idea that large-scale datasets derived from ordinary video recordings can be used to train models that generate physically plausible human motion, sidestepping the logistical and financial costs of traditional motion capture.
Key Contributions
The paper makes several contributions. Foremost is an optimization method that refines pose estimates by enforcing physics constraints, such as contact dynamics, and by encouraging temporal coherence in motion trajectories. Unlike existing methods that depend heavily on motion capture data, the proposed approach infers accurate 3D body trajectories directly from video, lowering the barrier to entry for realistic human motion synthesis.
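To make the refinement idea concrete, the toy sketch below smooths a noisy 1-D joint trajectory by minimizing a data-fidelity term plus a finite-difference acceleration penalty via gradient descent. This is only an illustrative stand-in for the paper's physics-based optimization; the function name, weights, and step size are invented for the example, and the real method involves full-body dynamics and contact constraints.

```python
import numpy as np

def refine_trajectory(x_obs, w_smooth=5.0, lr=0.01, steps=800):
    """Toy refinement: stay close to the observed trajectory while
    penalizing acceleration (second finite differences). The analytic
    gradient of E(x) = ||x - x_obs||^2 + w ||D x||^2 is
    2 (x - x_obs) + 2 w D^T D x, where D is the second-difference operator."""
    x = x_obs.copy()
    T = len(x)
    D = np.zeros((T - 2, T))
    for t in range(T - 2):
        D[t, t:t + 3] = [1.0, -2.0, 1.0]  # x[t] - 2 x[t+1] + x[t+2]
    for _ in range(steps):
        grad = 2.0 * (x - x_obs) + 2.0 * w_smooth * (D.T @ (D @ x))
        x -= lr * grad
    return x

rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 60)
noisy = np.sin(t) + 0.1 * rng.standard_normal(60)
refined = refine_trajectory(noisy)

def accel_norm(x):
    # Norm of second finite differences: a proxy for jerkiness.
    return np.linalg.norm(np.diff(x, n=2))
```

The same structure carries over to the full problem: the data term keeps the solution near the kinematic estimate, while physics-motivated penalties (here, only smoothness) pull it toward plausibility.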
Another key component is a smooth contact loss that allows pose estimates to be refined with standard gradient-based optimization, without external contact detectors or complex nonlinear programming tools. This makes learning-based motion synthesis models more adaptable and scalable, paving the way for their application in diverse real-world scenarios.
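A minimal sketch of what such a differentiable contact term might look like: a sigmoid of foot height serves as a soft contact indicator, and foot velocity is penalized in proportion to it, discouraging foot sliding without a hard contact detector. The function, threshold `h0`, and steepness `k` are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def smooth_contact_loss(foot_height, foot_velocity, h0=0.05, k=50.0):
    """Soft contact penalty (illustrative): contact_prob ~ 1 when the foot
    is near the ground (height < h0), ~ 0 when airborne. Sliding while in
    contact (nonzero velocity at low height) is penalized quadratically."""
    contact_prob = 1.0 / (1.0 + np.exp(k * (foot_height - h0)))
    return float(np.mean(contact_prob * foot_velocity ** 2))

# A foot moving at ground level should incur far more loss than the
# same motion high in the air.
sliding = smooth_contact_loss(np.full(10, 0.01), np.full(10, 0.3))
airborne = smooth_contact_loss(np.full(10, 0.50), np.full(10, 0.3))
```

Because every operation is smooth, the loss can be dropped into any gradient-based optimizer, which is the property the paper exploits.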
Results
Quantitative evaluation on the Human3.6M dataset shows that the physics-corrected motions outperform prior kinematic and physics-based models in both pose estimation accuracy and physical plausibility. For instance, the paper reports notable improvements in Mean Per Joint Position Error (MPJPE) and global root position error compared to models such as HMR, HMMR, and PhysCap. Notably, even without access to motion capture datasets, the corrected poses were sufficient to train generative models whose synthesis quality approached that of motion capture-based prediction models.
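For readers unfamiliar with the metric, MPJPE is simply the mean Euclidean distance between predicted and ground-truth joint positions, commonly computed after aligning both poses at the root joint. A minimal implementation (joint layout and units are illustrative):

```python
import numpy as np

def mpjpe(pred, gt, root_joint=0):
    """Mean Per Joint Position Error: average per-joint Euclidean distance
    after translating both skeletons so the root joint is at the origin.
    pred, gt: arrays of shape (num_joints, 3), here taken to be in mm."""
    pred_aligned = pred - pred[root_joint]
    gt_aligned = gt - gt[root_joint]
    return float(np.mean(np.linalg.norm(pred_aligned - gt_aligned, axis=-1)))

# Tiny 3-joint example: one joint displaced by a 30-40-50 triangle (50 mm).
gt = np.zeros((3, 3))
pred = gt.copy()
pred[1] = [30.0, 0.0, 40.0]
err = mpjpe(pred, gt)  # 50 mm error on 1 of 3 joints -> 50/3 mm
```

Variants such as PA-MPJPE additionally apply a Procrustes (rotation and scale) alignment before measuring distances.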
Implications
The implications of this research are both practical and theoretical. Practically, the ability to derive physically plausible motion data from widely available video sources opens up new avenues for creating more accessible, large-scale datasets that enhance the training of machine learning models in animation, robotics, and virtual reality domains. This democratization aligns with the ongoing need for more inclusive AI technologies, particularly in fields that require realistic and diverse human motion modeling, such as gaming, pedestrian simulations, and reinforcement learning environments.
Theoretically, by refining pose estimations through the imposition of physics constraints, the research contributes to a more profound understanding of the interplay between kinematic and dynamic factors in motion synthesis. The paper also invites speculation on future developments in AI, particularly in refining physical realism in synthesized motions and optimizing contact dynamics—aiding in the creation of lifelike simulations that are not just visually accurate but dynamically plausible as well.
Speculations on Future Directions
Moving forward, this line of work could extend to more complex settings, such as crowded scenes or interactions with objects, capturing nuanced aspects of human-human and human-object interaction. As deep learning models continue to evolve, integrating more sophisticated physics simulators with neural network architectures could yield an even higher degree of realism. Another direction is using this technology to augment sparsely annotated datasets, improving the effectiveness of unsupervised or semi-supervised learning methods for human pose estimation.
In summary, the paper delivers substantial advances in human motion synthesis, establishing video-derived data as a viable alternative to motion capture for training accurate, physics-informed models. As AI systems integrate ever more heterogeneous data sources, the insights from this research can prove valuable for building robust, flexible systems that simulate human motion in diverse environments.