- The paper introduces MultiTalk, a framework that generates synchronized multi-person conversational videos using novel audio cross-attention mechanisms and L-RoPE embeddings.
- It leverages DiT-based diffusion architectures with adaptive person localization and a two-stage training paradigm to ensure precise audio and visual binding.
- Evaluations show MultiTalk outperforms single-person and baseline models in metrics like Sync-C, FID, and FVD, advancing multi-modal video synthesis.
Audio-Driven Multi-Person Conversational Video Generation with MultiTalk
This paper presents MultiTalk, a framework for audio-driven multi-person conversational video generation that produces synchronized, instruction-following videos of multiple speakers from multi-stream audio and structured prompts. The work addresses critical gaps in prior research: existing audio-driven human animation approaches are limited to single-person scenarios and cannot disambiguate multiple simultaneous audio tracks across separate subjects.
Task Definition, Core Challenges, and Framework Design
The paper defines the task of audio-driven multi-person conversational video generation, which demands:
- Multi-stream audio input handling to support simultaneous speakers,
- Correct audio and person binding to prevent misattribution of speech/motion to the wrong character,
- Dynamic, adaptive person localization within generated frames to accommodate realistic conversational motion.
Building on DiT-based (Diffusion Transformer) video diffusion architectures with a 3D VAE backbone, MultiTalk introduces a dedicated audio cross-attention mechanism and leverages multi-modal conditioning with text, CLIP image features, and acoustic embeddings.
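A minimal sketch of how such an audio cross-attention layer could be inserted into a DiT block is shown below. The dimensions and module names (hidden_dim, audio_dim, AudioCrossAttention) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class AudioCrossAttention(nn.Module):
    """Residual cross-attention from video latents (queries) to acoustic embeddings (keys/values)."""
    def __init__(self, hidden_dim: int = 1024, audio_dim: int = 768, num_heads: int = 16):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=hidden_dim, kdim=audio_dim, vdim=audio_dim,
            num_heads=num_heads, batch_first=True,
        )

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N_video, hidden_dim) -- flattened 3D VAE latents
        # audio_tokens: (B, N_audio, audio_dim)  -- per-frame acoustic embeddings
        q = self.norm(video_tokens)
        out, _ = self.attn(query=q, key=audio_tokens, value=audio_tokens)
        return video_tokens + out  # residual injection of the audio condition
```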
Innovations: L-RoPE and Adaptive Audio Injection
A central technical contribution is the Label Rotary Position Embedding (L-RoPE), a positional encoding scheme that enables precise association of each audio stream with its corresponding subject's feature region in video latent space. By assigning structured, non-overlapping numerical label intervals to different persons and to their corresponding audio embeddings, the method activates localized attention maps that guide the diffusion model to bind each audio condition to its designated region.
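The following sketch illustrates the idea of rotating query/key features by a label-derived angle rather than a spatial position; the example label ranges and the rotate_half helper are assumptions for demonstration, and the paper's exact parameterization may differ.

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_label_rope(x: torch.Tensor, labels: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate features by an angle derived from a per-token label instead of its position.

    x:      (B, N, D) query or key features (D even)
    labels: (B, N) scalar label per token, e.g. person 0 -> labels in [0, 4],
            person 1 -> labels in [20, 24]; each audio stream reuses the matching range.
    """
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, device=x.device).float() / d))
    angles = labels.unsqueeze(-1).float() * inv_freq   # (B, N, D/2)
    angles = torch.cat((angles, angles), dim=-1)        # (B, N, D)
    return x * angles.cos() + rotate_half(x) * angles.sin()
```

Because the latent tokens of person k and audio stream k share a label range, their query/key rotations align, and cross-attention concentrates on the matching audio-person pair.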
Adaptive person localization is achieved by analyzing self-attention maps correlating input reference images and video latents, segmenting person regions dynamically rather than relying on fixed left/right splits, which do not generalize to unconstrained conversational layouts or significant spatial movements.
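A sketch of this localization step, under the assumption that reference-image tokens and video-latent tokens interact in a shared attention layer, could look as follows; the function name and interface are hypothetical.

```python
import torch

def localize_persons(attn_map: torch.Tensor, ref_person_ids: torch.Tensor) -> torch.Tensor:
    """Assign each video-latent token to the reference person it attends to most.

    attn_map:       (B, N_video, N_ref) attention weights from video tokens to reference tokens
    ref_person_ids: (N_ref,) person index (0, 1, ...) of each reference-image token
    returns:        (B, N_video) per-token person assignment
    """
    num_persons = int(ref_person_ids.max().item()) + 1
    # Aggregate attention mass per person, then take the argmax per video token.
    scores = torch.stack(
        [attn_map[..., ref_person_ids == p].sum(dim=-1) for p in range(num_persons)],
        dim=-1,
    )  # (B, N_video, num_persons)
    return scores.argmax(dim=-1)
```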
Training Paradigm and Model Robustness
Training is performed in two stages: a large single-person corpus is used for stage one, and a smaller dual-stream conversational corpus is used for specialized adaptation in stage two. Notable findings include:
- Partial parameter fine-tuning (i.e., only updating audio cross-attention and adapter layers) robustly preserves the instruction-following capabilities of the base model, whereas full-model fine-tuning severely degrades prompt following, particularly with limited compute and data; see the freezing sketch after this list. This is corroborated by visual and metric-based ablations.
- Multi-task training—jointly learning image-to-video and audio+image-to-video tasks—substantially enhances the model’s ability to execute complex, prompt-driven motions and behaviors. Excluding the I2V data impairs instruction adherence, highlighting the necessity of diverse multi-event data for generalization.
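A minimal sketch of the partial-parameter freezing described above: the base DiT weights are frozen and only audio cross-attention (and any adapter) modules remain trainable. The module-name matching is an assumption about how such layers might be named.

```python
import torch.nn as nn

def freeze_except_audio_layers(model: nn.Module) -> None:
    """Freeze the base model; keep only audio cross-attention and adapter parameters trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = ("audio_cross_attn" in name) or ("adapter" in name)

def trainable_parameters(model: nn.Module):
    # Only unfrozen parameters are handed to the optimizer, so the base model's
    # instruction-following behavior is left untouched.
    return (p for p in model.parameters() if p.requires_grad)
```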
An autoregressive inference procedure extends synthesis to long sequences by conditioning on several previous frames, overcoming the windowing constraints of latent video diffusion.
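The loop below sketches this autoregressive scheme: each diffusion window is conditioned on the tail frames of the previous one. `generate_window` stands in for a single denoising run and is a hypothetical interface, not the paper's API.

```python
import torch

def generate_long_video(generate_window, audio_chunks, window_len: int = 81, overlap: int = 5):
    frames = []
    context = None  # no motion context for the first window
    for audio in audio_chunks:
        clip = generate_window(audio=audio, context_frames=context, num_frames=window_len)
        # Drop the overlapping context frames on all but the first window to avoid duplicates.
        frames.append(clip if context is None else clip[overlap:])
        context = clip[-overlap:]  # condition the next window on the tail of this one
    return torch.cat(frames, dim=0)
```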
Quantitative and Qualitative Results
Comprehensive evaluation across talking head, talking body, and new dual-person conversational datasets demonstrates that MultiTalk achieves superior scores across Sync-C, Sync-D, E-FID, FID, and FVD metrics versus competing methods (e.g., AniPortrait, HALLO3, EchoMimic, Sonic, Fantasy Talking). Notably:
- MultiTalk matches or exceeds single-person models even when adapted for multi-person scenarios, indicating negligible degradation from multi-stream audio conditioning.
- Qualitative analysis shows improved instruction following, spatial and lip-sync accuracy, and artifact reduction compared to both single-stream and naïve multi-person baselines (e.g., video patch concatenation).
Ablation on L-RoPE configuration reveals minimal sensitivity to numerical label range choices, supporting general applicability.
Limitations and Implications
A gap persists in facial expressiveness when the model is driven by synthesized rather than real audio, which the authors attribute to the composition of the training data. Addressing this domain adaptation gap is a stated future direction.
Practical implications: MultiTalk enables realistic generative conversational scenes for multi-character movie synthesis, e-commerce avatars, and other multi-agent virtual interaction domains. The adaptive auto-binding of audio to video regions—without per-case manual annotation or partitioning—significantly advances deployability and scalability for in-the-wild multi-person scenarios.
Theoretical implications: The results demonstrate that structured label-based positional embeddings in attention layers can resolve the longstanding permutation-binding problem in multi-entity, multi-modal generative modeling. Furthermore, the strict restriction of trainable parameters during adaptation emerges as a practical paradigm to maintain model reliability under task transfer or data-constrained regimes.
Speculation on Future Directions
Future work is likely to extend:
- Robustness to synthetic-to-real audio domain gaps, potentially via adversarial or domain-invariant audio representation learning,
- Support for more than two speakers, with permutation-invariant or dynamically assigned label regions,
- Finer-grained gesture, gaze, and interaction modeling through large-scale, richly-labeled conversational datasets,
- End-to-end integration with real-time speech processing and prompting for interactive applications.
In sum, MultiTalk represents a significant advance in controllable multi-person audio-driven video synthesis with high fidelity, accurate cross-modal binding, and robust instruction-following for scalable real-world deployment.