- The paper introduces INFP, a novel audio-driven framework for generating interactive and realistic head movements in dyadic conversations without explicit role switching.
- INFP utilizes a two-stage model with dynamic role adaptation and a hybrid facial representation to synthesize expressive verbal and non-verbal behaviors from dual-track audio.
- The framework enables socially intelligent virtual agents for real-time applications like virtual meetings and conversational AI, supported by the large-scale DyConv dataset.
Overview of the INFP Framework for Audio-Driven Interactive Head Generation
The paper "INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations" presents a novel audio-driven head generation framework that addresses the intricate requirements of synthesizing realistic and interactive head movements during dyadic conversations. Developed by a team from Bytedance, this framework is designed to dynamically generate lifelike and person-generic head videos, enabling realistic verbal and non-verbal communications in a virtual setting.
Technical Contributions
INFP distinguishes itself from prior models by naturally adapting roles and interactive states within dyadic conversations. Traditional models typically rely on rigid role assignments (speaker or listener) and require explicit role switching, which can produce unnatural transitions. INFP overcomes these limitations through the following key innovations:
- Dynamic Role Adaptation: The framework enables seamless transitions between speaking and listening states without explicit role assignments. This is achieved through a unified model architecture that processes dual-track dyadic audio inputs to dynamically construct verbal and non-verbal behaviors.
- Two-Stage Framework: INFP consists of a Motion-Based Head Imitation stage and an Audio-Guided Motion Generation stage. The first stage learns communicative behaviors from real-life videos, encoding them into a motion latent space; the second maps dual-track dyadic audio into this latent space using an interactive motion guider and a conditional diffusion transformer, generating interactive head motion in real time (a minimal sketch of this pipeline follows the list).
- Hybrid Facial Representation: To ensure disentangled and expressive motion encoding, the framework employs a novel facial representation that masks non-expressive regions while preserving detailed regions such as the eyes and lips. This isolates expressive characteristics while minimizing appearance entanglement (see the masking sketch after this list).
- DyConv Dataset: The paper introduces DyConv, a large-scale dataset of dyadic conversations, which supports the development and evaluation of the framework. DyConv surpasses existing datasets in scale, quality, and the richness of interactions, providing a robust benchmark for interactive head generation research.
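
The two-stage design can be pictured as a small encoder-guider-denoiser stack. The following PyTorch sketch is illustrative only: the module names (`MotionEncoder`, `InteractiveMotionGuider`, `MotionDenoiser`) and all dimensions are assumptions rather than the authors' implementation; it simply shows how dual-track audio could be fused into a guidance signal that conditions a transformer-based denoiser over motion latents.

```python
# Illustrative sketch only: module names and dimensions are assumptions, not the paper's code.
import torch
import torch.nn as nn

MOTION_DIM = 256   # assumed size of the motion latent space
AUDIO_DIM = 768    # assumed size of per-frame audio features from a speech encoder

class MotionEncoder(nn.Module):
    """Stage 1 (motion-based head imitation): compress face crops into motion latents."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * 64 * 64, 512), nn.ReLU(), nn.Linear(512, MOTION_DIM)
        )

    def forward(self, face_frames):                   # (B, T, 3*64*64) flattened face crops
        return self.net(face_frames)                  # (B, T, MOTION_DIM) motion latents

class InteractiveMotionGuider(nn.Module):
    """Stage 2: fuse the agent's own audio track with the partner's track into one guidance signal."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Linear(2 * AUDIO_DIM, MOTION_DIM)

    def forward(self, own_audio, partner_audio):      # each (B, T, AUDIO_DIM)
        return self.fuse(torch.cat([own_audio, partner_audio], dim=-1))

class MotionDenoiser(nn.Module):
    """Stage 2: transformer that denoises motion latents, conditioned on the audio guidance."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=MOTION_DIM, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.time_embed = nn.Linear(1, MOTION_DIM)    # toy diffusion-timestep embedding

    def forward(self, noisy_motion, guidance, t):     # noisy_motion: (B, T, MOTION_DIM), t: (B, 1)
        x = noisy_motion + guidance + self.time_embed(t).unsqueeze(1)
        return self.backbone(x)                       # predicted clean motion latents

# Toy forward pass: face video -> motion latents (stage 1); dual-track audio -> motion (stage 2).
B, T = 2, 50
motion_latents = MotionEncoder()(torch.randn(B, T, 3 * 64 * 64))
guider, denoiser = InteractiveMotionGuider(), MotionDenoiser()
guidance = guider(torch.randn(B, T, AUDIO_DIM), torch.randn(B, T, AUDIO_DIM))
motion = denoiser(torch.randn(B, T, MOTION_DIM), guidance, torch.rand(B, 1))
```

In this reading, the motion latent space is learned from video alone in stage 1, so stage 2 only needs to learn an audio-to-latent mapping rather than direct pixel synthesis, which is what keeps real-time generation plausible.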
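The hybrid facial representation hinges on masking non-expressive regions while keeping the eyes and lips intact. A minimal NumPy sketch of that masking idea, assuming landmark-derived bounding boxes for the expressive regions (the helper name `hybrid_face_mask` and all box coordinates are hypothetical), looks like this:

```python
# Illustrative masking sketch; the helper name and box coordinates are hypothetical.
import numpy as np

def hybrid_face_mask(frame, eye_boxes, mouth_box, fill_value=0):
    """Zero out non-expressive regions of a face crop while keeping eyes and lips intact.

    frame:      (H, W, 3) uint8 face crop
    eye_boxes:  list of (x0, y0, x1, y1) boxes around the eyes
    mouth_box:  single (x0, y0, x1, y1) box around the mouth/lips
    """
    masked = np.full_like(frame, fill_value)          # start from a fully masked frame
    for x0, y0, x1, y1 in list(eye_boxes) + [mouth_box]:
        masked[y0:y1, x0:x1] = frame[y0:y1, x0:x1]    # copy back the expressive regions
    return masked

# Example: keep two eye patches and the mouth of a 128x128 crop, mask everything else.
frame = np.random.randint(0, 255, (128, 128, 3), dtype=np.uint8)
masked = hybrid_face_mask(
    frame,
    eye_boxes=[(30, 40, 55, 60), (75, 40, 100, 60)],
    mouth_box=(45, 85, 85, 110),
)
```

Keeping only the expressive regions makes it harder for the motion code to leak identity or appearance details, which is consistent with the stated goal of minimizing appearance entanglement.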
Implications and Future Directions
The development of INFP holds substantial implications for the creation of socially intelligent agents capable of engaging in multi-turn interactions in real-time scenarios, such as virtual meetings and conversational AI systems. The ability to generate expressive and contextually appropriate behaviors enhances the naturalness of machine-human interactions, providing a more engaging experience for users.
Theoretically, the framework's approach to dynamic role adaptation without the need for explicit role switching heralds a shift in conversational AI modeling, emphasizing the importance of real-time adaptability and responsiveness. Practically, the lightweight yet powerful nature of the model allows for applications in bandwidth-constrained environments, ensuring wider applicability.
Potential future work includes extending the model to incorporate multimodal inputs, such as visual or textual cues, and expanding generation capabilities to include full-body gestures. Such advancements could further improve the human-likeness and contextual appropriateness of virtual agents.
Conclusion
The INFP framework marks a significant evolution in audio-driven interactive head generation, offering a sophisticated solution to the challenges of dynamic role adaptation and non-verbal expressiveness in dyadic interactions. By leveraging innovative modeling techniques and a robust dataset, the research provides valuable insights and practical tools for future advancements in virtual agent development and applied conversational AI.