- The paper introduces a dual-branch diffusion-based framework that leverages textual instructions and audio input to precisely control avatar emotion and motion.
- The methodology integrates a variational autoencoder with a diffusion-based motion generator, enabling fine-grained and naturalistic animation synthesis.
- Experimental results demonstrate significant improvements in action unit alignment and lip-sync accuracy over existing methods, enhancing digital avatar realism.
An Overview of InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation
The paper "InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation" investigates the complexities inherent in the realistic and expressive generation of 2D avatars. This research primarily focuses on overcoming existing challenges in the domain, such as limited control over emotions and motion expressions in avatar animations, by introducing text as an intuitive medium for these controls.
Key Contributions
The authors introduce InstructAvatar, a novel framework that leverages textual instructions to guide both the emotional and physical expressions of avatars. The model produces fine-grained, high-quality output that surpasses traditional methods relying solely on audio cues or coarse emotion labels. Its architecture is notable for a dual-branch diffusion-based generator that processes input audio and textual instructions in parallel to predict avatar animations.
Methodology
InstructAvatar's innovation lies in its use of a natural language interface to provide flexible and precise control over avatar expressions and motions. The paper introduces a training dataset constructed through an automatic annotation pipeline that pairs text instructions with corresponding video examples.
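To make the pairing concrete, the following is a minimal sketch of what one record produced by such an annotation pipeline might contain; the class and field names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class AnnotatedClip:
    """One hypothetical training sample: a talking-face clip paired with its
    driving audio and an automatically generated text instruction."""
    frames: np.ndarray   # (T, H, W, 3) RGB video frames
    audio: np.ndarray    # (S,) mono waveform aligned with the frames
    instruction: str     # emotion or motion instruction in natural language


def build_sample(frames: np.ndarray, audio: np.ndarray, instruction: str) -> AnnotatedClip:
    """Package the outputs of the (assumed) annotation pipeline into one record."""
    assert frames.ndim == 4 and frames.shape[-1] == 3, "expected (T, H, W, 3) frames"
    return AnnotatedClip(frames=frames, audio=audio, instruction=instruction)
```

Batches of such records supply the three conditioning signals (frames, audio, instruction) used by the architecture described below.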
The authors designed a model architecture comprising a variational autoencoder (VAE) and a diffusion-based motion generator. The VAE disentangles motion from visual appearance, while the motion generator, built from conformer blocks with cross-attention, synthesizes the animation conditioned on both the audio input and the textual guidance.
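The sketch below illustrates this kind of dual-branch conditioning in PyTorch: a denoiser over VAE motion latents whose blocks cross-attend to audio features in one branch and text-instruction features in the other. It is a minimal sketch under assumed module names and dimensions, not the paper's implementation; a full conformer block would also contain convolution modules, which are omitted here for brevity.

```python
import torch
import torch.nn as nn


class DualBranchBlock(nn.Module):
    """One denoiser block: self-attention over motion tokens, then two
    cross-attention branches (audio, text), then a feed-forward layer.
    Convolution modules of a true conformer block are omitted in this sketch."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, x, audio_feats, text_feats):
        # x: (B, T, dim) noisy motion latents; audio/text feats: (B, L, dim)
        h = self.norms[0](x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.audio_attn(self.norms[1](x), audio_feats, audio_feats)[0]  # lip-sync branch
        x = x + self.text_attn(self.norms[2](x), text_feats, text_feats)[0]     # instruction branch
        return x + self.ff(self.norms[3](x))


class MotionDenoiser(nn.Module):
    """Stack of dual-branch blocks predicting the noise added to the
    VAE motion latents at a given diffusion timestep."""

    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.blocks = nn.ModuleList([DualBranchBlock(dim) for _ in range(depth)])
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_latents, t, audio_feats, text_feats):
        x = noisy_latents + self.time_mlp(t.float().view(-1, 1)).unsqueeze(1)
        for block in self.blocks:
            x = block(x, audio_feats, text_feats)
        return self.out(x)  # predicted noise, same shape as the motion latents
```

A training step would add noise to the VAE motion latents, run the denoiser, and regress the added noise; at inference, the denoised latents are decoded back into frames by the VAE decoder.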
Some specific methodological innovations, illustrated by a short sketch after this list, include:
- Emotion Label Extension: Augmenting simple emotion tags with templates to create natural phrases that accurately convey each emotion.
- Action Unit Extraction: Utilizing facial muscle movements (action units, or AUs) to provide detailed motion descriptions.
- MLLM Paraphrase: Leveraging multi-modal LLMs, such as GPT-4V, to convert AUs into readable text instructions while correcting any AUs that conflict with the visual evidence.
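As a rough illustration of the first and third steps, the sketch below expands a coarse emotion tag via templates and composes a prompt asking a multi-modal LLM to paraphrase detected AUs into a readable instruction. The templates, AU-name mapping, and prompt wording are invented for illustration, and the actual MLLM call is left abstract.

```python
import random

# Hypothetical templates for turning a coarse emotion label into a natural instruction.
EMOTION_TEMPLATES = [
    "talk with a {emotion} expression",
    "speak as if you are feeling {emotion}",
    "deliver the line in a {emotion} tone",
]

# Hypothetical mapping from action-unit codes to plain-language names.
AU_NAMES = {"AU06": "cheek raiser", "AU12": "lip corner puller", "AU04": "brow lowerer"}


def expand_emotion_label(label: str) -> str:
    """Emotion Label Extension: wrap a bare tag such as 'happy' in a template."""
    return random.choice(EMOTION_TEMPLATES).format(emotion=label)


def build_paraphrase_prompt(active_aus: list) -> str:
    """MLLM Paraphrase: compose a prompt asking a multi-modal LLM (e.g. GPT-4V)
    to turn detected action units into a readable motion instruction."""
    described = ", ".join(f"{au} ({AU_NAMES.get(au, 'unknown AU')})" for au in active_aus)
    return (
        "The following facial action units are active in the clip: "
        f"{described}. Describe the facial motion as a short, natural instruction, "
        "and flag any unit that does not match what you see in the attached frames."
    )


if __name__ == "__main__":
    print(expand_emotion_label("happy"))
    print(build_paraphrase_prompt(["AU06", "AU12"]))
```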
Experimental Results
InstructAvatar was benchmarked against leading methods such as MakeItTalk, EAT, StyleTalk, and DreamTalk. The results position InstructAvatar as superior in emotion control and lip-sync accuracy while maintaining high naturalness in the generated animations. The quantitative analysis shows clear gains over existing methods, particularly in fine-grained emotion control, reflected in higher AU (action unit) alignment scores and lower lip-sync error metrics.
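As an illustration of the kind of AU-based measurement involved, the snippet below computes an alignment score as the mean per-frame cosine similarity between predicted and reference action-unit intensity vectors; the paper's exact metric definition may differ, so treat this as an assumed formulation.

```python
import numpy as np


def au_alignment(pred_aus: np.ndarray, ref_aus: np.ndarray, eps: float = 1e-8) -> float:
    """Mean per-frame cosine similarity between predicted and reference
    AU intensity vectors, both shaped (num_frames, num_aus). Higher is better;
    this is an assumed formulation, not necessarily the paper's exact metric."""
    num = (pred_aus * ref_aus).sum(axis=1)
    den = np.linalg.norm(pred_aus, axis=1) * np.linalg.norm(ref_aus, axis=1) + eps
    return float((num / den).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred = rng.random((50, 17))  # e.g. 17 AUs tracked over 50 frames
    print(f"AU alignment: {au_alignment(pred, pred):.3f}")  # identical sequences -> ~1.0
```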
Subjective evaluations show that users rated InstructAvatar highly for emotion accuracy and natural appearance, reinforcing the model's effectiveness at synthesizing lifelike avatars.
Implications and Future Work
The implications of this paper are twofold. Practically, it enables the creation of more lifelike and responsive avatars in various applications, from entertainment and gaming to telecommunication and virtual reality. Theoretically, this work paves the way for further exploration of text-guided generative models, potentially extending beyond facial animations to other forms of dynamic content generation.
Future investigations could involve enlarging the dataset for better robustness across diverse domains and expanding the framework's applicability to more complex scenarios, like simultaneous emotion and motion control in high-dimensional character models.
In conclusion, InstructAvatar represents a significant step forward in the domain of avatar generation, allowing nuanced control over both facial emotions and movements through a single unified approach enabled by text instructions. This advancement not only demonstrates the practical viability of text-guided generative models but also invites further research and application development in the field of interactive digital avatars.