- The paper introduces MultiTalk, a framework that generates synchronized multi-person conversational videos using novel audio cross-attention mechanisms and L-RoPE embeddings.
- It leverages DiT-based diffusion architectures with adaptive person localization and a two-stage training paradigm to ensure precise audio and visual binding.
- Evaluations show MultiTalk outperforms single-person and baseline models in metrics like Sync-C, FID, and FVD, advancing multi-modal video synthesis.
Audio-Driven Multi-Person Conversational Video Generation with MultiTalk
This paper presents MultiTalk, a framework for audio-driven multi-person conversational video generation that produces synchronized, instruction-following videos of multiple speakers from multi-stream audio and structured prompts. The work addresses critical gaps in prior research: existing audio-driven human animation approaches are limited to single-person scenarios and cannot disambiguate multiple simultaneous audio tracks across separate subjects.
Task Definition, Core Challenges, and Framework Design
The paper defines the task of audio-driven multi-person conversational video generation, which demands:
- Multi-stream audio input handling to support simultaneous speakers,
- Correct audio and person binding to prevent misattribution of speech/motion to the wrong character,
- Dynamic, adaptive person localization within generated frames to accommodate realistic conversational motion.
Building on DiT-based (Diffusion Transformer) video diffusion architectures with a 3D VAE backbone, MultiTalk introduces a dedicated audio cross-attention mechanism and leverages multi-modal conditioning with text, CLIP image features, and acoustic embeddings.
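A minimal sketch of how such an audio cross-attention layer could be inserted into a DiT block is shown below. The dimensions and module names (hidden_dim, audio_dim, AudioCrossAttention) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class AudioCrossAttention(nn.Module):
    """Residual cross-attention from video latents (queries) to acoustic embeddings (keys/values)."""
    def __init__(self, hidden_dim: int = 1024, audio_dim: int = 768, num_heads: int = 16):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=hidden_dim, kdim=audio_dim, vdim=audio_dim,
            num_heads=num_heads, batch_first=True,
        )

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N_video, hidden_dim) -- flattened 3D VAE latents
        # audio_tokens: (B, N_audio, audio_dim)  -- per-frame acoustic embeddings
        q = self.norm(video_tokens)
        out, _ = self.attn(query=q, key=audio_tokens, value=audio_tokens)
        return video_tokens + out  # residual injection of the audio condition
```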
Innovations: L-RoPE and Adaptive Audio Injection
A central technical contribution is the Label Rotary Position Embedding (L-RoPE), a positional encoding scheme that enables precise association of each audio stream with its corresponding subject's feature region in video latent space. By assigning structured, non-overlapping numerical label intervals to different persons and to their corresponding audio embeddings, the method activates localized attention maps that guide the diffusion model to bind each audio condition to its designated region.
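The following sketch illustrates the idea of rotating query/key features by a label-derived angle rather than a spatial position; the example label ranges and the rotate_half helper are assumptions for demonstration, and the paper's exact parameterization may differ.

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_label_rope(x: torch.Tensor, labels: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate features by an angle derived from a per-token label instead of its position.

    x:      (B, N, D) query or key features (D even)
    labels: (B, N) scalar label per token, e.g. person 0 -> labels in [0, 4],
            person 1 -> labels in [20, 24]; each audio stream reuses the matching range.
    """
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, device=x.device).float() / d))
    angles = labels.unsqueeze(-1).float() * inv_freq   # (B, N, D/2)
    angles = torch.cat((angles, angles), dim=-1)        # (B, N, D)
    return x * angles.cos() + rotate_half(x) * angles.sin()
```

Because the latent tokens of person k and audio stream k share a label range, their query/key rotations align, and cross-attention concentrates on the matching audio-person pair.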
Adaptive person localization is achieved by analyzing self-attention maps correlating input reference images and video latents, segmenting person regions dynamically rather than relying on fixed left/right splits, which do not generalize to unconstrained conversational layouts or significant spatial movements.
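A sketch of this localization step, under the assumption that reference-image tokens and video-latent tokens interact in a shared attention layer, could look as follows; the function name and interface are hypothetical.

```python
import torch

def localize_persons(attn_map: torch.Tensor, ref_person_ids: torch.Tensor) -> torch.Tensor:
    """Assign each video-latent token to the reference person it attends to most.

    attn_map:       (B, N_video, N_ref) attention weights from video tokens to reference tokens
    ref_person_ids: (N_ref,) person index (0, 1, ...) of each reference-image token
    returns:        (B, N_video) per-token person assignment
    """
    num_persons = int(ref_person_ids.max().item()) + 1
    # Aggregate attention mass per person, then take the argmax per video token.
    scores = torch.stack(
        [attn_map[..., ref_person_ids == p].sum(dim=-1) for p in range(num_persons)],
        dim=-1,
    )  # (B, N_video, num_persons)
    return scores.argmax(dim=-1)
```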
Training Paradigm and Model Robustness
Training is performed in two stages: a large single-person corpus is used for stage one, and a smaller dual-stream conversational corpus is used for specialized adaptation in stage two. Notable findings include:
- Partial parameter fine-tuning (i.e., only updating audio cross-attention and adapter layers) robustly preserves the instruction-following capabilities of the base model, whereas full-model fine-tuning severely degrades prompt following, particularly with limited compute and data; see the freezing sketch after this list. This is corroborated by visual and metric-based ablations.
- Multi-task training—jointly learning image-to-video and audio+image-to-video tasks—substantially enhances the model’s ability to execute complex, prompt-driven motions and behaviors. Excluding the I2V data impairs instruction adherence, highlighting the necessity of diverse multi-event data for generalization.
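A minimal sketch of the partial-parameter freezing described above: the base DiT weights are frozen and only audio cross-attention (and any adapter) modules remain trainable. The module-name matching is an assumption about how such layers might be named.

```python
import torch.nn as nn

def freeze_except_audio_layers(model: nn.Module) -> None:
    """Freeze the base model; keep only audio cross-attention and adapter parameters trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = ("audio_cross_attn" in name) or ("adapter" in name)

def trainable_parameters(model: nn.Module):
    # Only unfrozen parameters are handed to the optimizer, so the base model's
    # instruction-following behavior is left untouched.
    return (p for p in model.parameters() if p.requires_grad)
```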
An autoregressive inference procedure extends synthesis to long sequences by conditioning on several previous frames, overcoming the windowing constraints of latent video diffusion.
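The loop below sketches this autoregressive scheme: each diffusion window is conditioned on the tail frames of the previous one. `generate_window` stands in for a single denoising run and is a hypothetical interface, not the paper's API.

```python
import torch

def generate_long_video(generate_window, audio_chunks, window_len: int = 81, overlap: int = 5):
    frames = []
    context = None  # no motion context for the first window
    for audio in audio_chunks:
        clip = generate_window(audio=audio, context_frames=context, num_frames=window_len)
        # Drop the overlapping context frames on all but the first window to avoid duplicates.
        frames.append(clip if context is None else clip[overlap:])
        context = clip[-overlap:]  # condition the next window on the tail of this one
    return torch.cat(frames, dim=0)
```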
Quantitative and Qualitative Results
Comprehensive evaluation across talking head, talking body, and new dual-person conversational datasets demonstrates that MultiTalk achieves superior scores across Sync-C, Sync-D, E-FID, FID, and FVD metrics versus competing methods (e.g., AniPortrait, HALLO3, EchoMimic, Sonic, Fantasy Talking). Notably:
- MultiTalk matches or exceeds single-person models even when adapted for multi-person scenarios, indicating negligible degradation from multi-stream audio conditioning.
- Qualitative analysis shows improved instruction following, spatial and lip-sync accuracy, and artifact reduction compared to both single-stream and naïve multi-person baselines (e.g., video patch concatenation).
Ablation on L-RoPE configuration reveals minimal sensitivity to numerical label range choices, supporting general applicability.
Limitations and Implications
A gap persists in facial expressiveness when the model is driven by synthesized rather than real audio, which the authors attribute to the composition of the training data. Addressing this domain adaptation gap is a stated future direction.
Practical implications: MultiTalk enables realistic generative conversational scenes for multi-character movie synthesis, e-commerce avatars, and other multi-agent virtual interaction domains. The adaptive auto-binding of audio to video regions—without per-case manual annotation or partitioning—significantly advances deployability and scalability for in-the-wild multi-person scenarios.
Theoretical implications: The results demonstrate that structured label-based positional embeddings in attention layers can resolve the longstanding permutation-binding problem in multi-entity, multi-modal generative modeling. Furthermore, the strict restriction of trainable parameters during adaptation emerges as a practical paradigm to maintain model reliability under task transfer or data-constrained regimes.
Speculation on Future Directions
Future work is likely to extend:
- Robustness to synthetic-to-real audio domain gaps, potentially via adversarial or domain-invariant audio representation learning,
- Support for more than two speakers, with permutation-invariant or dynamically assigned label regions,
- Finer-grained gesture, gaze, and interaction modeling through large-scale, richly-labeled conversational datasets,
- End-to-end integration with real-time speech processing and prompting for interactive applications.
In sum, MultiTalk represents a significant advance in controllable multi-person audio-driven video synthesis with high fidelity, accurate cross-modal binding, and robust instruction-following for scalable real-world deployment.