MotionCLIP: Exposing Human Motion Generation to CLIP Space (2203.08063v1)

Published 15 Mar 2022 in cs.CV and cs.GR

Abstract: We introduce MotionCLIP, a 3D human motion auto-encoder featuring a latent embedding that is disentangled, well behaved, and supports highly semantic textual descriptions. MotionCLIP gains its unique power by aligning its latent space with that of the Contrastive Language-Image Pre-training (CLIP) model. Aligning the human motion manifold to CLIP space implicitly infuses the extremely rich semantic knowledge of CLIP into the manifold. In particular, it helps continuity by placing semantically similar motions close to one another, and disentanglement, which is inherited from the CLIP-space structure. MotionCLIP comprises a transformer-based motion auto-encoder, trained to reconstruct motion while being aligned to its text label's position in CLIP-space. We further leverage CLIP's unique visual understanding and inject an even stronger signal through aligning motion to rendered frames in a self-supervised manner. We show that although CLIP has never seen the motion domain, MotionCLIP offers unprecedented text-to-motion abilities, allowing out-of-domain actions, disentangled editing, and abstract language specification. For example, the text prompt "couch" is decoded into a sitting down motion, due to lingual similarity, and the prompt "Spiderman" results in a web-swinging-like solution that is far from seen during training. In addition, we show how the introduced latent space can be leveraged for motion interpolation, editing and recognition.

Citations (273)

Summary

  • The paper introduces a novel transformer-based auto-encoder that maps 3D human motion to a semantically enriched CLIP-aligned latent space.
  • It demonstrates superior text-to-motion generation and style transfer by effectively handling both domain-specific and novel actions.
  • The approach leverages dual alignment of text and image embeddings, paving the way for more intuitive motion generation applications.

MotionCLIP: Exposing Human Motion Generation to CLIP Space

The paper "MotionCLIP: Exposing Human Motion Generation to CLIP Space" presents an innovative approach to generating and editing 3D human motion by aligning it with CLIP (Contrastive Language-Image Pre-training) space. This alignment provides semantically rich and disentangled latent representations, leveraging the expansive knowledge embedded within CLIP, a visual-textual embedding model trained on extensive datasets.

Methodology

MotionCLIP is built around a transformer-based auto-encoder trained to traverse the human motion manifold while adopting the semantic structure of CLIP. The encoder maps motion sequences into a CLIP-aligned latent space, and the decoder reconstructs motions from that space. Training combines a reconstruction loss with text- and image-alignment losses that pull each motion's latent code toward the CLIP embeddings of its text label and of synthetically rendered frames. This dual alignment extends MotionCLIP's reach beyond the vocabulary of conventional motion-capture datasets by infusing the semantic richness inherent in CLIP's embedding space.
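
The loss composition can be sketched as follows in PyTorch; the `motion_encoder` / `motion_decoder` modules, the loss weights, and the use of MSE for reconstruction are illustrative assumptions rather than the authors' exact implementation, while keeping CLIP frozen as the paper describes:

```python
# Minimal sketch of MotionCLIP-style training losses (shapes and weights assumed).
import torch
import torch.nn.functional as F

def motionclip_losses(motion, text_tokens, rendered_frames,
                      motion_encoder, motion_decoder, clip_model,
                      w_text=1.0, w_image=1.0):
    """Reconstruction plus CLIP text/image alignment for one batch."""
    z = motion_encoder(motion)                 # [B, 512] latent per sequence (assumed)
    recon = motion_decoder(z)                  # reconstructed motion sequence
    loss_recon = F.mse_loss(recon, motion)     # reconstruction term (form assumed)

    with torch.no_grad():                      # CLIP stays frozen during training
        z_text = clip_model.encode_text(text_tokens).float()
        z_image = clip_model.encode_image(rendered_frames).float()

    # Pull the motion latent toward the CLIP embeddings of its label and renders.
    loss_text = 1.0 - F.cosine_similarity(z, z_text, dim=-1).mean()
    loss_image = 1.0 - F.cosine_similarity(z, z_image, dim=-1).mean()

    return loss_recon + w_text * loss_text + w_image * loss_image
```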

Experimental Evaluation

The authors compare MotionCLIP against JL2P on text-to-motion tasks, demonstrating its superiority in generating both in-domain and novel motions. User studies indicated a preference for MotionCLIP when generating unseen sports motions and when modifying motion style from simple textual descriptions.

Key results include:

  • Text-to-Motion Generation: The model produces actions from explicit text labels, outperforming existing methods, and is notably robust when generating out-of-domain sports actions absent from the training data (see the inference sketch after this list).
  • Motion Style Representation: MotionCLIP handles motion style variations from textual prompts, matching dedicated style transfer methods such as that of Aberman et al., even though the latter relies on more detailed input.
  • Abstract Language Interpretation: The model illustrates advanced capabilities by interpreting abstract and culturally specific references, such as recognizing and emulating iconic movements associated with well-known figures and media.
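
To make the text-to-motion path concrete, the sketch below assumes the trained `motion_decoder` and frozen CLIP model from the training sketch above; because the latent space is aligned with CLIP, a prompt can be embedded with CLIP's text encoder and decoded directly, even if it never appears in the training data:

```python
# Sketch of text-to-motion inference through the CLIP-aligned latent space.
import clip   # OpenAI CLIP: https://github.com/openai/CLIP
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def text_to_motion(prompt, motion_decoder):
    tokens = clip.tokenize([prompt]).to(device)
    z = clip_model.encode_text(tokens).float()   # CLIP text embedding, [1, 512]
    return motion_decoder(z)                     # decode a motion from CLIP space

# e.g. text_to_motion("Spiderman", motion_decoder) should yield a
# web-swinging-like motion, per the paper's qualitative examples.
```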

Implications and Future Directions

MotionCLIP showcases the potential of aligning motion generation systems with large-scale semantic structures like CLIP. The findings carry significant implications for intuitive, semantically meaningful motion generation in virtual and physical environments, with potential benefits for industries that rely on motion synthesis and robotics.

The paper suggests future exploration of ways to overcome directional and stylistic limitations. Potential research directions include enhancing architectural capacity and exploring domain adaptation schemes to further refine the semantic and stylistic nuances of generated motions. Additionally, extending MotionCLIP's utility to other temporal sequences and tasks suggests a promising avenue for broadening its applicability.

By integrating the motion domain with a semantically enriched latent structure, MotionCLIP marks an advancement in leveraging pre-trained models for human motion generation, offering a flexible tool for researchers and practitioners in the field.