
MaskedMimic: Unified Physics-Based Character Control Through Masked Motion Inpainting (2409.14393v1)

Published 22 Sep 2024 in cs.AI and cs.RO

Abstract: Crafting a single, versatile physics-based controller that can breathe life into interactive characters across a wide spectrum of scenarios represents an exciting frontier in character animation. An ideal controller should support diverse control modalities, such as sparse target keyframes, text instructions, and scene information. While previous works have proposed physically simulated, scene-aware control models, these systems have predominantly focused on developing controllers that each specializes in a narrow set of tasks and control modalities. This work presents MaskedMimic, a novel approach that formulates physics-based character control as a general motion inpainting problem. Our key insight is to train a single unified model to synthesize motions from partial (masked) motion descriptions, such as masked keyframes, objects, text descriptions, or any combination thereof. This is achieved by leveraging motion tracking data and designing a scalable training method that can effectively utilize diverse motion descriptions to produce coherent animations. Through this process, our approach learns a physics-based controller that provides an intuitive control interface without requiring tedious reward engineering for all behaviors of interest. The resulting controller supports a wide range of control modalities and enables seamless transitions between disparate tasks. By unifying character control through motion inpainting, MaskedMimic creates versatile virtual characters. These characters can dynamically adapt to complex scenes and compose diverse motions on demand, enabling more interactive and immersive experiences.

MaskedMimic: Unified Physics-Based Character Control Through Masked Motion Inpainting

The paper "MaskedMimic: Unified Physics-Based Character Control Through Masked Motion Inpainting" introduces a novel approach that unifies physics-based character control by treating it as a general motion inpainting problem. The researchers propose a method that can dynamically adapt to a variety of control modalities, including target keyframes, text descriptions, object placements, and scene information. This work employs a single unified model to synthesize motion from partial motion descriptions, offering a significant step forward in the domain of physically simulated character animation.

Approach and Methodology

The core of MaskedMimic’s framework lies in training a versatile, physics-based controller capable of generating coherent animations from sparse and incomplete motion data. The training process is split into two stages. In the first stage, a fully-constrained controller ($\pi^{\text{FC}}$) is trained via reinforcement learning to imitate large libraries of reference motions across diverse and irregular environments. This controller observes a comprehensive set of future target poses. The second stage distills this fully-constrained controller into a partially-constrained controller ($\pi^{\text{PC}}$), trained through behavior cloning to function effectively from partial goals by leveraging a randomized masking function.
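The distillation stage can be pictured as a behavior-cloning loop in which the student only ever sees a randomly masked view of the teacher's goals. The sketch below is a minimal illustration in PyTorch; the function names (`random_mask`, `distill_step`), the per-constraint masking granularity, and the plain MSE imitation loss are assumptions for illustration rather than details from the paper (the actual student also carries a VAE objective, described below).

```python
import torch

def random_mask(goals: torch.Tensor, p_keep: float = 0.3):
    """Randomly hide goal entries; hidden entries are zeroed and flagged.

    `goals` is a (batch, num_constraints, dim) tensor of future target
    features (e.g., per-joint positions). The keep probability and the
    per-constraint granularity are illustrative choices.
    """
    keep = torch.rand(goals.shape[:2], device=goals.device) < p_keep
    masked = goals * keep.unsqueeze(-1)      # zero out hidden constraints
    return masked, keep                      # keep-flags tell the student what is observed

def distill_step(pi_fc, pi_pc, state, full_goals, optimizer):
    """One behavior-cloning step: the student imitates the teacher's action
    while seeing only a random subset of the teacher's goals."""
    with torch.no_grad():
        teacher_action = pi_fc(state, full_goals)       # teacher sees everything
    masked_goals, keep = random_mask(full_goals)
    student_action = pi_pc(state, masked_goals, keep)   # student sees a masked view
    loss = torch.nn.functional.mse_loss(student_action, teacher_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```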

Fully-Constrained Controller:

The fully-constrained controller is endowed with detailed motion tracking capabilities, allowing it to mimic a spectrum of reference motions from datasets such as AMASS, which include complex human behaviors. Reference motions are converted into sequences of target joint positions and orientations, which serve as the tracking goals and later become the inputs for motion inpainting. The training environment comprises different terrains, ensuring robustness and the ability to adapt to new and unseen environments.
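Concretely, tracking goals can be formed by sampling the reference clip at a few future offsets and rewarding the policy for staying close to them. The snippet below is a hedged sketch: the frame offsets, the exponential reward shape, and the array layouts are common conventions in motion-imitation RL, not values reported here.

```python
import numpy as np

def future_targets(ref_pos, ref_rot, t, offsets=(1, 5, 20)):
    """Gather target joint positions/orientations at a few future frames.
    ref_pos: (T, J, 3) positions; ref_rot: (T, J, 4) quaternions.
    The offsets are illustrative, not taken from the paper."""
    idx = np.minimum(t + np.asarray(offsets), len(ref_pos) - 1)
    return ref_pos[idx], ref_rot[idx]

def tracking_reward(sim_pos, tgt_pos, scale=2.0):
    """Exponentiated negative mean-squared pose error, a standard
    motion-imitation reward shape (the scale is an assumption)."""
    err = np.mean(np.sum((sim_pos - tgt_pos) ** 2, axis=-1))
    return float(np.exp(-scale * err))
```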

Partially-Constrained Controller:

The key innovation in MaskedMimic lies in the versatile, partially-constrained controller. This controller is trained using a variational autoencoder (VAE) approach, allowing it to generate multiple possible animations from a given partial constraint. At its core, the VAE consists of a prior, an encoder, and a decoder. The encoder learns a latent residual space describing the full motion, while the prior learns to predict this space from partial observations. Training the model under a variety of masked conditions ensures that it can seamlessly handle diverse input constraints and generate physically plausible movements.
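The prior/encoder/decoder split and the residual latent can be sketched as a small conditional VAE. Everything below (layer sizes, diagonal-Gaussian parameterization, the residual on the prior mean) is an assumed minimal realization of the description above, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MaskedMotionVAE(nn.Module):
    """Minimal sketch of a conditional VAE with a prior that conditions on
    masked goals and an encoder that outputs a residual over the prior."""

    def __init__(self, state_dim, goal_dim, latent_dim=64, action_dim=69):
        super().__init__()
        self.prior = nn.Sequential(nn.Linear(state_dim + goal_dim, 512),
                                   nn.ReLU(), nn.Linear(512, 2 * latent_dim))
        self.encoder = nn.Sequential(nn.Linear(state_dim + goal_dim, 512),
                                     nn.ReLU(), nn.Linear(512, 2 * latent_dim))
        self.decoder = nn.Sequential(nn.Linear(state_dim + latent_dim, 512),
                                     nn.ReLU(), nn.Linear(512, action_dim))

    def forward(self, state, masked_goals, full_goals):
        # Prior sees only the masked view; encoder sees the full motion.
        mu_p, logvar_p = self.prior(torch.cat([state, masked_goals], -1)).chunk(2, -1)
        mu_r, logvar_e = self.encoder(torch.cat([state, full_goals], -1)).chunk(2, -1)
        mu_e = mu_p + mu_r                 # encoder predicts a residual on the prior mean
        z = mu_e + torch.randn_like(mu_e) * torch.exp(0.5 * logvar_e)
        action = self.decoder(torch.cat([state, z], -1))
        # KL(q || p) between two diagonal Gaussians, keeping q close to the prior.
        kl = 0.5 * ((logvar_p - logvar_e)
                    + (torch.exp(logvar_e) + (mu_e - mu_p) ** 2) / torch.exp(logvar_p)
                    - 1).sum(-1).mean()
        return action, kl

    @torch.no_grad()
    def act(self, state, masked_goals):
        """Inference: sample from the prior alone; no full motion needed."""
        mu_p, logvar_p = self.prior(torch.cat([state, masked_goals], -1)).chunk(2, -1)
        z = mu_p + torch.randn_like(mu_p) * torch.exp(0.5 * logvar_p)
        return self.decoder(torch.cat([state, z], -1))
```

At inference time only `act` is used: the encoder, which requires the full motion, is discarded, and diverse animations arise from sampling the prior under the given partial constraints.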

Numerical Results and Claims

The paper provides comprehensive quantitative evaluations, illustrating the effectiveness and versatility of MaskedMimic. Several benchmarks were designed to evaluate the capabilities of both $\pi^{\text{FC}}$ and $\pi^{\text{PC}}$; a sketch of one common success-rate convention follows the list:

  1. Full-body Tracking: The models were assessed on their ability to track complete motion sequences from the AMASS dataset. MaskedMimic demonstrated superior generalization with a success rate of 99.2% on test motions, outperforming baseline models.
  2. VR Tracking: The model was tasked with generating full-body motion using inputs from head and hand sensors mimicking VR setups. MaskedMimic exceeded prior methods like PULSE and ASE, achieving a 98.1% success rate on test sequences.
  3. Irregular Terrains: The model's stability and adaptability were evaluated on diverse irregular terrains, showing consistent performance with success rates above 95% for both train and test datasets, highlighting the model's robustness.
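This summary does not spell out the success criterion behind the percentages above. A common convention in physics-based tracking work declares an episode failed once the average joint error exceeds a fixed threshold at any frame; the sketch below computes the rate that way. Both the 0.5 m threshold and the aggregation are assumptions, not details from the paper.

```python
import numpy as np

def success_rate(per_episode_errors, threshold_m=0.5):
    """Percentage of episodes whose per-frame mean joint error never
    exceeds the threshold. The 0.5 m cutoff is a convention from the
    motion-tracking literature, not stated in this summary."""
    ok = [float(np.max(errs)) < threshold_m for errs in per_episode_errors]
    return 100.0 * float(np.mean(ok))
```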

Practical and Theoretical Implications

The research presents several practical and theoretical implications. Practically, the ability to train a single versatile controller that can handle multiple control modalities reduces the complexity associated with developing distinct controllers for different tasks. This unification allows for more seamless character interactions within virtual environments, which is particularly beneficial for gaming, virtual reality, and digital human applications. Theoretically, this work pushes forward the understanding of using motion inpainting as a robust framework for character control, suggesting that motion quality can be maintained or even improved by learning from partial observations and inferring the complete motion path.

Future Directions

The authors acknowledge several limitations and propose areas for future research:

  1. Motion Quality: While the model generates diverse and robust motions, minor artifacts such as jittering need mitigation. Future work could include fine-tuning the model using discriminative rewards to smooth out these inconsistencies.
  2. Goal-Engineering: Automated techniques for goal-engineering could simplify the implementation of complex interactions, leveraging advancements in large language models (LLMs).
  3. New Capabilities: Future expansions could incorporate dynamic scene interactions, enabling characters to manipulate objects and engage in more intricate multi-agent interactions.

Conclusion

MaskedMimic represents a significant contribution to the field of character animation, unifying multiple control modalities under a single, versatile framework. The innovative combination of reinforcement learning for fully-observed goals and variational autoencoders for partial goals effectively addresses the motion inpainting problem, producing coherent and realistic animations. The methodologies and insights from this research pave the way for more advanced and flexible systems in physically simulated character control.

Authors (5)
  1. Chen Tessler
  2. Yunrong Guo
  3. Ofir Nabati
  4. Gal Chechik
  5. Xue Bin Peng