Seeing Voices: Generating A-Roll Video from Audio with Mirage

This presentation explores Mirage, a 10-billion parameter foundation model that generates realistic A-roll video footage from audio alone. We examine how its unified Diffusion Transformer architecture achieves superior quality through asymmetric self-attention across modalities, enabling natural facial expressions, body movements, and emotional nuance synchronized with speech—all without domain-specific components.
Script
A-roll footage—the primary shots where someone speaks directly to camera—is the backbone of narrative video. The researchers behind Mirage asked: what if you could generate that footage from nothing but audio?
Mirage is a 10-billion parameter model that takes an audio clip and produces video of a person speaking those words with natural expressions and gestures. It infers everything—speaker appearance, emotional tone, even subtle eye movements—directly from acoustic properties.
The technical foundation makes this possible through an elegant design choice.
Mirage employs a Diffusion Transformer that concatenates audio, text, and video tokens into a single sequence. This homogeneous structure means you can add or remove modalities without redesigning the architecture—no specialized speech components, no separate image encoders.
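For readers who think in code, here is a minimal sketch of what "one concatenated sequence" can look like in practice. It is an illustration under assumed shapes and module names (JointDiTBlock, the token counts and dimension), not Mirage's actual implementation:

```python
# Minimal sketch (not the authors' code): joint attention over concatenated
# audio, text, and video tokens in a single Diffusion Transformer block.
# Module names, token counts, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class JointDiTBlock(nn.Module):
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        # Every token attends to every other token, regardless of modality.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

# Each modality is first projected into a shared token space, then concatenated.
dim = 1024
audio_tokens = torch.randn(1, 200, dim)  # e.g. audio latents
text_tokens  = torch.randn(1, 32,  dim)  # e.g. prompt embeddings
video_tokens = torch.randn(1, 512, dim)  # e.g. noised video latents being denoised

sequence = torch.cat([audio_tokens, text_tokens, video_tokens], dim=1)
out = JointDiTBlock(dim)(sequence)       # one homogeneous sequence, one attention pass
```

Because the block never branches on modality, adding or dropping an input stream only changes what gets concatenated, which is the flexibility the script highlights.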
Training uses Flow Matching in latent space to learn the path from noise to structured video. The authors built a carefully filtered dataset emphasizing high-quality narrative content, so the model learns from expressive, representative examples rather than arbitrary footage.
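The Flow Matching objective itself fits in a few lines. The sketch below assumes a rectified-flow style linear path from noise to data and a hypothetical model(x_t, t, cond) interface; the paper's exact schedule, conditioning, and loss weighting may differ:

```python
# Minimal sketch (an assumption about the training setup, not the released code):
# a flow-matching loss on video latents, conditioned on audio/text tokens.
import torch
import torch.nn.functional as F

def flow_matching_loss(model, video_latents, cond_tokens):
    """model(x_t, t, cond) predicts the velocity that carries noise toward data."""
    noise = torch.randn_like(video_latents)                  # x_0 ~ N(0, I)
    t = torch.rand(video_latents.shape[0], device=video_latents.device)
    t_exp = t.view(-1, *([1] * (video_latents.dim() - 1)))   # broadcast over latent dims
    x_t = (1.0 - t_exp) * noise + t_exp * video_latents      # linear interpolation path
    target_velocity = video_latents - noise                  # constant velocity along the path
    pred_velocity = model(x_t, t, cond_tokens)
    return F.mse_loss(pred_velocity, target_velocity)

# Toy usage with a stand-in "model" (a real DiT would take its place):
dummy_model = lambda x_t, t, cond: x_t * 0.0
latents = torch.randn(4, 16, 64)   # (batch, tokens, channels), illustrative shape
cond = torch.randn(4, 8, 64)
loss = flow_matching_loss(dummy_model, latents, cond)
```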
The results reveal what this unified approach can achieve.
Mirage excels at the hard problems: precise lip sync on challenging sounds like plosives, natural eye movements that feel human, and gestures that actually align with what's being said. It's not just moving pixels—it's capturing performance.
When given only audio with no reference image, Mirage makes intelligent guesses about the speaker and setting. Even if you feed it contradictory text and audio, the model prioritizes what it hears—aligning visuals with vocal tone and prosody.
Better semantic alignment between text prompts and audio does improve results, though the model shows robustness even when they conflict. The authors emphasize subjective quality—how natural the motion feels, how convincing the emotion reads—over simple pixel-level metrics.
Mirage shows that a single unified architecture can generate expressive, synchronized A-roll video from audio alone, opening new pathways for narrative video creation and virtual performance. Visit EmergentMind.com to learn more and create your own videos.