AudioGen-Omni: Unified Multimodal Audio Synthesis
- The paper presents AudioGen-Omni, a unified framework that synthesizes high-fidelity audio by merging text, video, and lyrical cues through a diffusion transformer architecture.
- It employs joint training with adaptive attention mechanisms, including AdaLN and PAAPI, to achieve precise semantic, acoustic, and temporal alignment across modalities.
- The approach demonstrates state-of-the-art performance on metrics such as FD, KL divergence, and Inception Score while ensuring low-latency, flexible audio generation.
AudioGen-Omni is a unified multimodal generative framework built on multimodal diffusion transformers (MMDiT), designed to synthesize high-fidelity audio, including speech and song, that is coherently and robustly synchronized with diverse video and text inputs. It sets a new benchmark for conditional audio generation by fusing large-scale video-text-audio corpora through a joint training paradigm with cross-modal attention, targeted positional alignment, and a flexible modality interface. The approach achieves state-of-the-art results across Text-to-Audio, Speech, and Song generation tasks, while delivering significant gains in semantic, acoustic, and temporal alignment.
1. Unified Multimodal Diffusion Transformer Architecture
AudioGen-Omni’s generative backbone is organized around Multimodal Diffusion Transformer (MMDiT) blocks, which generalize the diffusion transformer mechanisms seen in models such as SD3 and Flux. Each block implements adaptive joint attention by concatenating query, key, and value matrices from video, audio, text, and lyric/transcription modalities. Scaled dot-product attention is computed across these aggregated representations, enabling fine-grained cross-modal reasoning while preserving modality-specific nuances for partitioned output streams.
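To make the joint-attention idea concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: per-modality Q/K/V projections are concatenated along the sequence axis, attention is computed once over the joint sequence, and the output is partitioned back into per-modality streams. All module and variable names here are assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

class JointAttentionSketch(nn.Module):
    """Sketch of MMDiT-style joint attention over concatenated modality tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Separate projections per modality preserve modality-specific nuances.
        self.qkv = nn.ModuleDict({m: nn.Linear(dim, 3 * dim)
                                  for m in ("audio", "video", "text", "lyrics")})
        self.out = nn.ModuleDict({m: nn.Linear(dim, dim)
                                  for m in ("audio", "video", "text", "lyrics")})

    def forward(self, tokens: dict) -> dict:
        # tokens[m]: (batch, seq_len_m, dim)
        qs, ks, vs, lengths, order = [], [], [], [], []
        for m, x in tokens.items():
            q, k, v = self.qkv[m](x).chunk(3, dim=-1)
            qs.append(q); ks.append(k); vs.append(v)
            lengths.append(x.shape[1]); order.append(m)

        def split_heads(t):
            b, n, _ = t.shape
            return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        # Concatenate along the sequence axis and attend jointly across modalities.
        q = split_heads(torch.cat(qs, dim=1))
        k = split_heads(torch.cat(ks, dim=1))
        v = split_heads(torch.cat(vs, dim=1))
        attn = F.scaled_dot_product_attention(q, k, v)        # (b, heads, N, head_dim)
        attn = attn.transpose(1, 2).reshape(q.shape[0], -1, self.num_heads * self.head_dim)

        # Partition the joint output back into modality-specific streams.
        outs, start = {}, 0
        for m, n in zip(order, lengths):
            outs[m] = self.out[m](attn[:, start:start + n])
            start += n
        return outs
```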
A dedicated lyrics-transcription encoder provides modality-unified frame-level representations for both spoken and sung content. This encoder employs multilingual VoiceBPE tokenization (to handle grapheme and phoneme inputs, including non-Roman scripts converted to the phonemic domain), 768-dimensional learnable embeddings, and ConvNeXt-V2 blocks for local temporal modeling.
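A rough sketch of such an encoder is shown below: token embeddings followed by ConvNeXt-style 1D residual blocks for local temporal modeling. The tokenizer, vocabulary size, and block internals (e.g., the normalization used here) are assumptions, not the paper's exact configuration.

```python
import torch
from torch import nn

class LyricsEncoderSketch(nn.Module):
    """Illustrative frame-level lyrics/transcription encoder sketch."""

    def __init__(self, vocab_size: int, dim: int = 768, num_blocks: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # 768-dim learnable embeddings
        self.blocks = nn.ModuleList([
            nn.Sequential(
                # Depthwise 1D conv captures local temporal context across frames.
                nn.Conv1d(dim, dim, kernel_size=7, padding=3, groups=dim),
                nn.GroupNorm(1, dim),            # stand-in for the block's LayerNorm
                nn.Conv1d(dim, 4 * dim, 1),      # pointwise expansion
                nn.GELU(),
                nn.Conv1d(4 * dim, dim, 1),      # pointwise projection back
            )
            for _ in range(num_blocks)
        ])

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, frames) of VoiceBPE-style token indices.
        x = self.embed(token_ids).transpose(1, 2)    # (batch, dim, frames)
        for block in self.blocks:
            x = x + block(x)                          # residual ConvNeXt-style block
        return x.transpose(1, 2)                      # (batch, frames, dim)
```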
Global conditioning vectors are constructed by aggregating Fourier-encoded diffusion steps, audio duration information, and average-pooled video/text features, and shared across transformer layers. This structure ensures that high-temporal-resolution visual information is aligned precisely with generated audio events, supporting both semantic coherence and accurate lip-sync.
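A hypothetical sketch of how such a global conditioning vector could be assembled is given below; the encoding dimensions and the fusion projection are assumptions made for illustration.

```python
import math
import torch
from torch import nn

def fourier_features(t: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Sinusoidal (Fourier) encoding of a per-sample scalar, e.g. the
    diffusion/flow step or the target audio duration."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    angles = t[:, None].float() * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)    # (batch, dim)

class GlobalConditionSketch(nn.Module):
    """Fuse encoded step and duration with pooled video/text features into one
    global conditioning vector shared across transformer layers (e.g. for AdaLN)."""

    def __init__(self, feat_dim: int, cond_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(2 * 256 + 2 * feat_dim, cond_dim)

    def forward(self, step, duration, video_feats, text_feats):
        # video_feats / text_feats: (batch, seq_len, feat_dim)
        pooled = torch.cat([video_feats.mean(dim=1), text_feats.mean(dim=1)], dim=-1)
        fused = torch.cat([fourier_features(step),
                           fourier_features(duration),
                           pooled], dim=-1)
        return self.proj(fused)    # (batch, cond_dim), reused by every layer
```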
The model’s core conditional flow matching objective is defined by

$$\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,\,x_0,\,x_1}\Big[\big\lVert v_\theta(x_t, t, c) - u_t \big\rVert^2\Big],$$

where $x_t = (1 - t)\,x_0 + t\,x_1$ interpolates between Gaussian noise $x_0$ and audio samples $x_1$, and $u_t = x_1 - x_0$ is a fixed velocity field.
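A minimal training-loss sketch under the standard rectified-flow parameterization follows; `model` and its call signature are placeholders, not the paper's API.

```python
import torch

def cfm_loss(model, x1, cond):
    """Conditional flow matching loss for one batch.
    x1: clean audio latents (batch, ...); cond: conditioning inputs.
    Assumes the interpolant x_t = (1 - t) * x0 + t * x1."""
    x0 = torch.randn_like(x1)                          # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)      # uniform time steps
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))          # broadcast over feature dims
    xt = (1 - t_b) * x0 + t_b * x1                     # interpolant
    target_velocity = x1 - x0                          # fixed velocity field u_t
    pred = model(xt, t, cond)                          # predicted velocity v_theta
    return ((pred - target_velocity) ** 2).mean()
```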
2. Joint Training Paradigm and Dataset Integration
AudioGen-Omni applies a joint training paradigm that unfreezes all modalities, allowing parameters from audio, video, and text streams to be updated concurrently. Modality masking is used to handle missing inputs, ensuring robust training when not all modalities are present at every update step. This contrasts with text-frozen architectures, which restrict semantic sharing and limit cross-modal generalization.
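One simple way to realize such masking, shown as an illustrative sketch rather than the paper's exact recipe, is to zero the tokens of any absent modality and flag them in a joint key-padding mask so attention ignores them.

```python
import torch

def build_modality_mask(tokens: dict, present: dict) -> tuple[dict, torch.Tensor]:
    """Illustrative modality masking.
    tokens[m]: (batch, seq_len_m, dim); present[m]: (batch,) bool flags."""
    masked, pad_flags = {}, []
    for m, x in tokens.items():
        keep = present[m].view(-1, 1, 1).to(x.dtype)       # 1 if modality present
        masked[m] = x * keep                                # zero out missing inputs
        pad_flags.append(present[m][:, None].expand(-1, x.shape[1]))
    # True = attend, False = ignore; ordered as in the joint sequence.
    key_mask = torch.cat(pad_flags, dim=1)
    return masked, key_mask
```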
The corpus integrates large-scale datasets such as VGGSound, Panda-70M, InternVid, and various captioned speech and singing datasets, providing paired video, audio, and textual content. Training across these corpora yields modality-agnostic representations and conditional generation that remains effective even when only a subset of modalities is provided at inference.
The consequence is improved cross-modal adaptation: the model can conditionally synthesize audio from arbitrary combinations of video, text, and lyrics/transcriptions, supporting tasks such as text-to-audio, video-to-audio, and conditioned song or speech generation.
3. Cross-Modal Conditioning and Alignment
To enable precise cross-modal synchronization, AudioGen-Omni incorporates two critical innovations:
- AdaLN-based Joint Attention: Adaptive Layer Normalization (AdaLN) modulates scaling and shifting within transformer layers using global conditioning vectors, which encode temporal, acoustic, visual, and linguistic information.
- Phase-Aligned Anisotropic Positional Infusion (PAAPI): This mechanism applies rotary positional embedding (RoPE) selectively to modalities with temporal structure (e.g., video, audio latents, lyrics aligned at the frame level). This selective, phase-aligned infusion ensures fine-grained lip-sync and maintains robust temporal ordering across modalities, while isotropic textual inputs remain unaffected by unnecessary positional encodings.
These mechanisms jointly enable the model to synchronize audio events with speech/singing mouth movements and maintain semantic alignment between all conditioning channels.
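As an illustrative sketch of both ideas, the code below assumes a standard AdaLN parameterization (scale and shift regressed from the global conditioning vector) and a textbook rotary embedding applied only to temporally structured streams; the module and function names are not the paper's.

```python
import torch
from torch import nn

class AdaLNSketch(nn.Module):
    """AdaLN-style modulation: scale/shift come from the global condition."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale[:, None]) + shift[:, None]

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary positional embedding over the sequence axis of (batch, seq, dim)."""
    b, n, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, device=x.device) / half)
    angles = torch.arange(n, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def paapi_like_infusion(tokens: dict) -> dict:
    """Selective positional infusion in the spirit of PAAPI: RoPE is applied only
    to temporally structured modalities; isotropic text is left untouched."""
    temporal = {"audio", "video", "lyrics"}
    return {m: apply_rope(x) if m in temporal else x for m, x in tokens.items()}
```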
4. Performance Evaluation and Comparative Metrics
AudioGen-Omni achieves state-of-the-art outcomes on a portfolio of standard tasks and benchmarks. Representative quantitative results include:
| Metric | Value | Notes |
|---|---|---|
| Fréchet Distance (FD_PaSST) | 58.766 | Lower is better |
| Fréchet Distance (FD_PANNs) | 6.292 | Lower is better |
| KL divergence | 1.556 | Lower is better |
| Inception Score (IS) | 21.521 | Higher is better |
| IB-score | 29.261 | Higher is better (semantic alignment) |
| DeSync (temporal alignment, s) | 0.450 | Lower is better |
| Inference time for 8 s of audio (s) | 1.91 | Lower is better |
These results reflect closeness to the natural audio distribution, high audio fidelity, and strong semantic and temporal agreement with visual and textual conditions. Inference is also efficient: generating 8 seconds of audio requires only 1.91 seconds.
A plausible implication is that the combination of PAAPI and AdaLN-based attention yields measurable improvements not just in sample quality, but also in alignment-sensitive features such as lip-sync—critical in video-based speech and song applications.
5. Modality Flexibility, Generality, and Inference Efficiency
AudioGen-Omni is engineered to accept arbitrary combinations of text, video, and lyrical/transcription cues as conditions for audio synthesis. Its joint-attention framework and masking-based input strategy enable the model to generalize to unseen combinations or missing modality scenarios without specific retraining.
The architecture’s modality-agnostic attention and parameter sharing—guided by global conditioning—enable the synthesis of diverse audio outputs (ambient sound, speech, music, or song), maintaining high fidelity and synchronization. The low-latency inference (sub-2s for 8s of output) facilitates integration into latency-sensitive, real-time, or interactive systems.
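As a purely hypothetical usage sketch, the masking-based interface from Section 2 (the `build_modality_mask` helper above, which is not a public API) could be driven with only a subset of conditions at inference time:

```python
import torch

# Only video and lyrics are supplied; the absent text condition is flagged so
# the masking logic ignores it. Feature shapes are illustrative placeholders.
batch = 1
conditions = {
    "video":  torch.randn(batch, 200, 768),   # placeholder frame features
    "text":   torch.zeros(batch, 1, 768),     # absent modality (null placeholder)
    "lyrics": torch.randn(batch, 120, 768),   # frame-level lyric embeddings
}
present = {
    "video":  torch.tensor([True]),
    "text":   torch.tensor([False]),          # text condition missing
    "lyrics": torch.tensor([True]),
}
masked_conditions, key_mask = build_modality_mask(conditions, present)
# masked_conditions and key_mask would then feed the joint-attention blocks.
```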
6. Real-World Applications
AudioGen-Omni’s capabilities enable several practical applications:
- Audio for Video and Media Production: Generation of mood-appropriate, synchronized soundtracks, lip-synced speech or song, and context-aligned ambient sound for video, film, and gaming.
- Automated Dubbing and Voiceover: Generation of accurately-timed speech streams for video content, including automated translation and voice-style adaptation.
- Education and Accessibility: Synchronized narration and expressive speech/song creation for interactive learning, or context-aware audio description for visually impaired audiences.
- Advertising, Communication, and Social Media: Synthesis of expressive, highly-aligned audio for video-centric campaigns, live events, or immersive communication experiences.
This flexibility and consistency across multiple conditioning paradigms position AudioGen-Omni as a general-purpose, omni-modal solution for audio generation in multimedia systems.
7. Significance and Outlook
AudioGen-Omni marks a substantive advance in multimodal generation by demonstrating that tightly-coupled training across diverse modalities—leveraging MMDiT blocks, AdaLN-based joint attention, and phase-aware positional alignment—substantially elevates semantic, acoustic, and temporal quality in audio synthesis. Its empirical superiority, generality, and inference speed position it for broad adoption in research and industrial applications that require coherent, multimodal audio generation.
Future work may refine cross-modal alignment at even finer granularity, extend applicability to additional languages and musical genres, and integrate more fine-grained control signals (e.g., explicit emotional direction or style vectors). The present design shows that modality-unified training and efficient, structured cross-modal attention are core enabling factors for the next generation of omni-modal generative models.