- The paper presents FLAME, a diffusion-based generative model that synthesizes and edits variable-length 3D human motions from text.
- It combines a transformer architecture with pre-trained language models to capture nuanced textual cues for precise motion synthesis.
- Results on HumanML3D, BABEL, and KIT datasets show improved motion-text alignment, diversity, and fidelity over previous methods.
The paper introduces FLAME, a diffusion-based generative model for text-based motion synthesis and editing. The system tackles the challenging task of producing 3D human motion from textual descriptions, with applications in gaming, animation, and robotics. FLAME's architecture combines transformer and diffusion models to generate variable-length, high-fidelity motion sequences aligned with natural-language prompts, and it supports motion editing without additional fine-tuning.
Model Design and Contributions
FLAME adapts diffusion models, previously most successful in image generation, to motion data. Its handling of spatio-temporal motion sequences rests on three components:
- Transformer-based Architecture: A transformer decoder handles the intrinsically temporal structure of motion data, with attention layers that process variable-length sequences efficiently.
- Integration of Free-form Language: Pre-trained language models (PLMs) capture the nuanced textual cues that drive the synthesis of complex motions.
- Flexible Editing Capabilities: Frame-wise and joint-wise edits to a reference motion are steered by classifier-free guidance, which sharpens alignment with the textual prompt (a minimal sketch of this guidance step follows the list).
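The guidance step mentioned in the last bullet is simple to state in code. Below is a minimal PyTorch sketch of how a guided noise estimate is typically formed from conditional and unconditional model passes; `denoiser`, `text_emb`, and `null_emb` are hypothetical stand-ins, not FLAME's actual interface.

```python
import torch

def guided_noise(denoiser, x_t, t, text_emb, null_emb, w=2.0):
    """Classifier-free guidance: blend text-conditioned and unconditional
    noise predictions. Larger w pushes samples toward the text prompt.
    `denoiser` and the embedding arguments are hypothetical stand-ins
    for FLAME's transformer denoiser and its (empty-)text embeddings."""
    eps_cond = denoiser(x_t, t, text_emb)    # prediction with the text prompt
    eps_uncond = denoiser(x_t, t, null_emb)  # prediction with an empty prompt
    # Move the unconditional prediction toward the conditional one
    # by the guidance weight w.
    return eps_uncond + w * (eps_cond - eps_uncond)
```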
The model learns a reverse diffusion process: starting from Gaussian noise, it gradually refines the noise into a plausible motion sequence conditioned on the text prompt. The diffusion time step and target motion length are supplied to the transformer as additional tokens that guide the denoising.
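To make the sampling procedure concrete, here is a minimal DDPM-style ancestral sampling loop; the linear beta schedule, tensor shapes, and the `denoiser(x, t, text_emb)` signature are illustrative assumptions, not the paper's exact configuration.

```python
import torch

@torch.no_grad()
def sample_motion(denoiser, text_emb, num_frames, feat_dim, T=1000, device="cpu"):
    """Reverse diffusion: start from Gaussian noise and iteratively denoise
    into a motion sequence, conditioned on a text embedding. The linear
    beta schedule and tensor shapes are illustrative assumptions."""
    betas = torch.linspace(1e-4, 0.02, T, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, num_frames, feat_dim, device=device)  # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = denoiser(x, t, text_emb)  # predicted noise at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise  # sample x_{t-1}
    return x  # final denoised motion, x_0
```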
Experimental Results
FLAME demonstrates strong text-to-motion generation across the HumanML3D, BABEL, and KIT datasets. It improves motion-text alignment (mCLIP score), fidelity (Fréchet distance, FD), motion diversity, and retrieval accuracy (R-Precision), markedly surpassing previous models in both qualitative and quantitative comparisons against ground-truth motions.
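For intuition about what R-Precision measures, the following NumPy sketch scores paired motion/text embeddings by cosine similarity and checks whether each motion retrieves its own caption in the top-k; the embedding arrays are placeholders, and this is a generic formulation rather than the paper's evaluation code.

```python
import numpy as np

def r_precision(motion_emb, text_emb, k=1):
    """Generic R-Precision: rank all candidate texts for each motion by
    cosine similarity and count how often the true pair lands in the top-k.
    `motion_emb` and `text_emb` are placeholder (N, D) arrays whose i-th
    rows form a ground-truth pair."""
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = m @ t.T                      # pairwise cosine similarities
    ranks = np.argsort(-sims, axis=1)   # best-matching text first
    true_idx = np.arange(len(m))
    hits = (ranks[:, :k] == true_idx[:, None]).any(axis=1)
    return hits.mean()                  # fraction retrieved in the top-k
```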
Implications and Future Directions
FLAME's ability to generate motion from natural language offers substantial advances for animation and simulation. Its unified framework also covers related tasks such as motion prediction and in-betweening, which could reduce pipeline complexity in industry applications. Its success in bridging text and motion further suggests broader adoption of diffusion models in cross-modal synthesis.
For future work, faster sampling could enable real-time applications, addressing the computational cost that is a core limitation of diffusion models. Parallel improvements in text and motion embeddings could advance holistic motion understanding and synthesis, forming the basis for more sophisticated human-machine interaction.
Overall, FLAME makes a significant contribution to motion synthesis, showcasing the versatility and efficacy of diffusion processes for generating human-like animation from natural-language inputs.