DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations
The paper "DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations" introduces an innovative framework addressing the limitations of current methodologies in 3D talking head generation, which generally focus on either speaker-only or listener-only roles. This research introduces a novel task of multi-round dual-speaker interaction, aiming to synthesize 3D talking heads capable of dynamically switching between speaking and listening roles during conversations.
The DualTalk framework integrates the dynamic behaviors of both speakers and listeners to simulate realistic, coherent dialogue. It comprises four core modules: the Dual-Speaker Joint Encoder, the Cross-Modal Temporal Enhancer, the Dual-Speaker Interaction Module, and the Expressive Synthesis Module. The pipeline first captures multimodal features (audio and blendshape data) from both participants through separate encoders and projects them into a unified feature space. The temporal enhancer combines cross-modal attention with LSTM networks to align audio-visual cues over time and maintain conversational coherence. The interaction module then uses Transformers to model the dynamics between the two speakers, producing context-aware responses, and the final synthesis module refines the generated facial animation to convey nuanced expressions.
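As a concrete illustration, the following is a minimal PyTorch sketch of this four-module pipeline. The class and argument names, feature dimensions (e.g., 80-dimensional audio features and 52 blendshape coefficients), and layer counts are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the DualTalk pipeline described above.
# Module names, feature sizes, and layer counts are illustrative assumptions.
import torch
import torch.nn as nn


class DualTalkSketch(nn.Module):
    def __init__(self, audio_dim=80, blend_dim=52, hidden=128):
        super().__init__()
        # Dual-Speaker Joint Encoder: per-modality encoders projecting both
        # participants' features into a shared space of size `hidden`.
        self.audio_enc = nn.Linear(audio_dim, hidden)
        self.blend_enc = nn.Linear(blend_dim, hidden)

        # Cross-Modal Temporal Enhancer: cross-modal attention followed by an
        # LSTM to keep audio-visual cues aligned over time.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.temporal = nn.LSTM(hidden, hidden, batch_first=True)

        # Dual-Speaker Interaction Module: a Transformer encoder over the
        # concatenated streams of both participants.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.interaction = nn.TransformerEncoder(layer, num_layers=2)

        # Expressive Synthesis Module: maps interaction features back to
        # blendshape coefficients for the target avatar.
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, blend_dim))

    def forward(self, audio_self, blend_self, audio_other, blend_other):
        # Each input: (batch, time, feature_dim). Project into the shared space.
        f_self = self.audio_enc(audio_self) + self.blend_enc(blend_self)
        f_other = self.audio_enc(audio_other) + self.blend_enc(blend_other)

        # Attend from the target participant's stream to the partner's stream,
        # then smooth the fused features over time with the LSTM.
        fused, _ = self.cross_attn(f_self, f_other, f_other)
        fused, _ = self.temporal(fused)

        # Model dual-speaker dynamics jointly, then decode blendshapes for the
        # target participant's frames.
        joint = self.interaction(torch.cat([fused, f_other], dim=1))
        return self.head(joint[:, : fused.size(1)])


if __name__ == "__main__":
    B, T = 2, 100
    out = DualTalkSketch()(torch.randn(B, T, 80), torch.randn(B, T, 52),
                           torch.randn(B, T, 80), torch.randn(B, T, 52))
    print(out.shape)  # torch.Size([2, 100, 52])
```

In practice the encoders would be much richer (e.g., a pretrained speech encoder), but the sketch shows how both participants' streams can share one feature space before the interaction stage.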
A notable contribution of this paper is the creation of a large-scale dataset comprising 50 hours of multi-round conversations with over 1,000 unique characters, which facilitates the training of models to generate both speaking and listening behaviors. The dataset is characterized by dual-channel audio, enabling the isolation of each participant's voice, which is crucial for realistic dialogue synthesis. This dataset and the associated benchmark are pivotal in evaluating and advancing multi-round conversational capabilities in talking head models.
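Because the recordings are dual-channel, isolating each participant's voice reduces to splitting channels. The snippet below is a minimal illustration using the soundfile library; the file name and the mapping of channels 0 and 1 to the two participants are assumptions.

```python
# Minimal sketch: splitting a dual-channel conversation recording into one
# waveform per participant. The file name and the assumption that channel 0
# belongs to participant A (and channel 1 to participant B) are illustrative.
import soundfile as sf

audio, sample_rate = sf.read("conversation.wav")  # shape: (num_samples, 2)
speaker_a = audio[:, 0]  # isolated voice of participant A
speaker_b = audio[:, 1]  # isolated voice of participant B

sf.write("speaker_a.wav", speaker_a, sample_rate)
sf.write("speaker_b.wav", speaker_b, sample_rate)
```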
The experimental results demonstrate significant improvements in the naturalness and expressiveness of the 3D talking heads generated by DualTalk. Compared to existing methods, DualTalk achieves superior scores on metrics such as Fréchet Distance (FD), Mean Squared Error (MSE), and SID (a diversity measure), underscoring its effectiveness at realistic lip synchronization and responsive listener feedback.
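As a rough illustration of such evaluation metrics, the sketch below computes per-frame MSE and a Fréchet distance between Gaussian fits of generated and ground-truth blendshape sequences; the paper's exact evaluation protocol may differ.

```python
# Rough sketch of two evaluation metrics: per-frame MSE and a Fréchet distance
# between Gaussian fits of generated vs. ground-truth blendshape sequences.
import numpy as np
from scipy.linalg import sqrtm


def mse(pred, gt):
    # pred, gt: (num_frames, num_blendshapes)
    return float(np.mean((pred - gt) ** 2))


def frechet_distance(pred, gt):
    # Fit a Gaussian to each sequence and compare the two distributions.
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    cov_p = np.cov(pred, rowvar=False)
    cov_g = np.cov(gt, rowvar=False)
    covmean = sqrtm(cov_p @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_p - mu_g
    return float(diff @ diff + np.trace(cov_p + cov_g - 2.0 * covmean))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred, gt = rng.normal(size=(300, 52)), rng.normal(size=(300, 52))
    print(mse(pred, gt), frechet_distance(pred, gt))
```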
The implications of DualTalk extend both theoretically and practically. By modeling synchronized dual-speaker dynamics, it sets a new standard for interactive avatars and paves the way toward more lifelike virtual agents that improve emotional and cognitive engagement in applications such as remote collaboration, customer service, and education. Future research could extend DualTalk to diverse linguistic environments and integrate personality traits to further personalize conversational agents, making them more robust and adaptive across global applications.
In conclusion, DualTalk represents a significant advancement in the field of computer vision and human-computer interaction, offering profound insights and a valuable foundation for the future development of interactive 3D talking head technology.