
Action2Motion: Conditioned Generation of 3D Human Motions (2007.15240v1)

Published 30 Jul 2020 in cs.CV

Abstract: Action recognition is a relatively established task, where given an input sequence of human motion, the goal is to predict its action category. This paper, on the other hand, considers a relatively new problem, which could be thought of as an inverse of action recognition: given a prescribed action type, we aim to generate plausible human motion sequences in 3D. Importantly, the set of generated motions are expected to maintain its diversity to be able to explore the entire action-conditioned motion space; meanwhile, each sampled sequence faithfully resembles a natural human body articulation dynamics. Motivated by these objectives, we follow the physics law of human kinematics by adopting the Lie Algebra theory to represent the natural human motions; we also propose a temporal Variational Auto-Encoder (VAE) that encourages a diverse sampling of the motion space. A new 3D human motion dataset, HumanAct12, is also constructed. Empirical experiments over three distinct human motion datasets (including ours) demonstrate the effectiveness of our approach.

Citations (340)

Summary

  • The paper introduces a novel framework that conditions 3D motion generation on action types using a hybrid Lie Algebra representation and temporal VAE.
  • It reports significant improvements in generating visually natural motions with lower FID scores and higher recognition accuracy across multiple datasets.
  • The approach sets a new benchmark for 3D animation tools and training data creation for AI models in gaming, film, robotics, and more.

An Expert Overview of "Action2Motion: Conditioned Generation of 3D Human Motions"

The paper "Action2Motion: Conditioned Generation of 3D Human Motions" introduces a novel research problem in the field of computer graphics and machine learning: generating 3D human motion sequences conditioned on specified action categories. The objective is to produce plausible and diverse motion sequences that resemble real human articulations based on a given action type. Offering a unique perspective, this reverse problem of action recognition demands advanced methods that can capture the nuanced dynamics of human motion.

Methodological Advancements

The authors address the problem using a hybrid approach that combines the Lie Algebra representation with a temporal Variational Auto-Encoder (VAE). The use of the Lie Algebra is particularly noteworthy as it allows the representation of 3D human motions in a manner that disentangles the anatomical constraints from the dynamic movements of skeletal joints. By focusing on the manifold of rotation matrices (SO(3)), the Lie Algebra not only ensures natural articulation dynamics but also significantly reduces the complexity and dimensionality of the motion representation.
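The key operation behind such a representation is the exponential map, which sends a Lie algebra vector in so(3) (an axis-angle vector) to a rotation matrix in SO(3). As a minimal illustration, not the paper's actual implementation, this map can be computed with Rodrigues' formula:

```python
import numpy as np

def so3_exp(omega):
    """Map a Lie algebra vector omega in so(3), represented as a 3-vector
    (axis * angle), to a rotation matrix in SO(3) via Rodrigues' formula:
    R = I + sin(theta) * K + (1 - cos(theta)) * K @ K."""
    theta = np.linalg.norm(omega)
    if theta < 1e-8:
        return np.eye(3)  # near-zero rotation: return identity
    axis = omega / theta
    # K is the skew-symmetric cross-product matrix of the unit axis.
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
```

Parameterizing each joint by such a 3-vector, rather than by a full rotation matrix or by raw 3D joint coordinates, keeps the representation low-dimensional while guaranteeing that every decoded pose corresponds to a valid rotation.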

The temporal VAE is employed to model the dependency across time steps, where the latent variables guide the generation process conditioned on the action type. The novelty lies in integrating the Lie Algebra representation with the temporal VAE, creating a framework that facilitates diverse and realistic motion generation without prior conditions on initial poses.
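The generation-time behavior of such a model can be sketched as an autoregressive loop: at each step a latent code is sampled from the prior and decoded into the next pose, conditioned on the action label and the previous pose. The sketch below is hypothetical (the dimensions, the toy linear `decode` function, and the standard-normal prior are stand-ins for the learned networks), and omits training, where a posterior network and the reparameterization trick would be used:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS, LATENT_DIM, POSE_DIM = 12, 32, 72  # illustrative sizes

# Stand-in for a learned decoder: a single random linear layer with tanh.
W_dec = 0.1 * rng.standard_normal((POSE_DIM, LATENT_DIM + NUM_ACTIONS + POSE_DIM))

def decode(ctx):
    return np.tanh(W_dec @ ctx)

def generate_sequence(action_id, T):
    """Sample a T-step motion for the given action: at each step, draw
    z_t from the prior and decode the next pose conditioned on the
    action label and the previous pose. No initial pose is supplied."""
    action = np.zeros(NUM_ACTIONS)
    action[action_id] = 1.0
    pose = np.zeros(POSE_DIM)
    poses = []
    for _ in range(T):
        z_t = rng.standard_normal(LATENT_DIM)  # fresh latent per step
        pose = decode(np.concatenate([z_t, action, pose]))
        poses.append(pose)
    return np.stack(poses)
```

Because a fresh latent is drawn at every time step, repeated calls with the same action label yield different sequences, which is what gives the sampled motions their diversity within an action category.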

Numerical Results and Implications

Significant numerical results are presented, demonstrating the effectiveness of the approach across three datasets, including a newly curated HumanAct12 dataset. The paper reports quantitative metrics such as Fréchet Inception Distance (FID) and recognition accuracy, establishing that the proposed method generates motions that are not only visually natural but also distinguishable in terms of action categories. When compared to alternative methods such as Two-stage GANs and Conditional GRUs, the proposed methodology yields superior performance, particularly evident in its lower FID scores and higher recognition accuracy across the datasets.
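For context, FID fits a Gaussian to feature vectors of real and generated samples and measures the Fréchet distance between the two Gaussians. A minimal sketch of that computation (a generic FID over precomputed feature sets, not the paper's exact evaluation pipeline, which extracts features with a pretrained motion classifier):

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets:
    ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 * (C1 @ C2)^(1/2))."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2.0 * covmean))
```

Lower FID indicates that the distribution of generated motions is closer to that of real motions; recognition accuracy complements it by checking that a classifier can still identify the intended action in the generated sequences.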

The work highlights that the Lie Algebra representation accelerates training convergence significantly, requiring only a fraction of iterations compared to joint-coordinate-based representations. The ability to generate motions diverse in style yet coherent within the action constraints marks a practical advancement in the fields of animation and virtual reality.

Practical and Theoretical Implications

Practically, the paper opens avenues for more sophisticated and realistic 3D animation creation tools, better aligning with the needs of entertainment industries such as gaming and film. It also sets a new benchmark for generating training data for AI models that require rich, annotated human motion data, potentially benefiting fields like robotics and human-computer interaction.

Theoretically, this work pushes the boundaries of generative models by encapsulating the intricacies of human motion within a conditional generation framework. It lays foundational groundwork for future explorations into more complex interactions involving multiple agents or environments, suggesting a multitude of directions for future research. The authors suggest that extending the system to handle interactions between multiple individuals or with objects in the environment could be fruitful areas for exploration.

In conclusion, "Action2Motion" presents a nuanced and effective approach to the problem of 3D motion generation grounded in action categories. Through its innovative use of mathematical constructs and learning frameworks, it contributes significant findings to the fields of computer graphics and AI, providing both immediate practical benefits and intriguing theoretical insights for future exploration.