ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment

Published 29 May 2026 in eess.AS, cs.AI, and cs.CL | (2605.30965v1)

Abstract: Recent advancements in text-guided audio generation have yielded promising results in diverse domains, including sound effects, speech, and music. However, jointly generating speech with environmental audio remains challenging due to the inherent disparities in their acoustic patterns and temporal dynamics. We propose ImmersiveTTS, an environment-aware text-to-speech (TTS) model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross-modal interactions. Our model builds on a multimodal diffusion transformer and fuses transcript-aligned speech latent with text-conditioned environmental context via joint attention. To enhance semantic consistency, we introduce a domain-specific representation alignment objective tailored to environment-aware TTS, leveraging complementary self-supervised representations from speech and audio encoders. Experimental results show that ImmersiveTTS achieves higher naturalness, intelligibility, and audio fidelity than existing approaches across objective metrics and human listening tests.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper introduces a dual-stream MM-DiT architecture that fuses linguistic and environmental features via cross-attention and a domain-specific REPA loss.
It demonstrates significant performance improvements, achieving lower WER and FAD with only 25 sampling steps compared to 200 in baseline models.
The method enables controllable TTS synthesis by balancing speech naturalness with coherent environmental integration.

ImmersiveTTS: Multimodal Diffusion for Environment-Aware Text-to-Speech with Domain-Specific Representation Alignment

Problem Setting and Motivation

Environment-aware text-to-speech (TTS) remains a challenging subdomain of neural generative modeling due to the necessity of synthesizing intelligible speech alongside semantically coherent and acoustically integrated environmental audio. Contemporary TTS systems typically segregate the modeling of speech from ambient context, resulting in poor cross-modal coherence and diminished realism when synthesizing immersive scenes. While recent multimodal and unified audio generation systems have attempted to bridge the gap, they frequently fall short in capturing fine-grained linguistic structure and in fusing environmental semantics with speech content.

ImmersiveTTS targets this problem through an explicitly cross-modal architecture, leveraging multimodal diffusion transformers (MM-DiT) and a dual-stream backbone to facilitate interaction between linguistic and environmental representations. Further, a domain-specific representation alignment (REPA) loss, driven by self-supervised speech- and audio-specialized encoders, is introduced to regularize learning and enforce semantic and acoustic consistency.

Methodology

Architecture

ImmersiveTTS is built upon a dual-stream MM-DiT backbone (Figure 1), adapted from prior image-text integration work, but here tailored for the complex interplay of speech and environmental context in audio synthesis. Content prompt-aligned linguistic features drive the speech stream, while Flan-T5 embeddings support fine-grained environmental sequencing in the environmental context stream, supplemented with CLAP for global environmental semantics. The two streams iteratively interact through joint cross-attention, enabling the speech generation process to adapt to, and integrate with, environmental cues at multiple levels of abstraction.

Figure 1: The ImmersiveTTS architecture employs a dual-stream MM-DiT backbone to explicitly enable conditioning and interaction between linguistic content and environmental context for environment-aware TTS.

Double-stream DiT blocks facilitate early fusion and cross-modal alignment (Figure 2). After a dedicated phase of double-stream joint modeling, the speech pathway continues to single-stream DiT stages for fine-grained refinement driven by self-attention.

Figure 2: Double-stream DiT blocks perform initial cross-modal integration between speech and environment streams.

Conditioning and Feature Extraction

The environmental stream fuses global semantic context (via CLAP) and local detail (via T5 token embeddings), assigning them separately to AdaLN and sequence input, respectively. The speech pathway conditions the audio latents on transcript-aligned speech features (obtained via a text encoder and monotonic alignment search) and a convolutional mapping that aligns feature domains. This design supports controllable, highly intelligible speech generation aligned with both prompt content and environmental scene descriptions.

Training Objectives: Flow Matching and Domain-Specific REPA

The generative process is formalized as a flow matching ODE between latent Gaussian noise and the data manifold, trained with a rectified flow MSE criterion. Critically, ImmersiveTTS introduces domain-specific representation alignment: the REPA loss draws on frozen, self-supervised teacher encoders—WavLM for speech and ATST-Frame for audio. The model’s speech-side intermediate features are projected (via separate MLPs) into each teacher’s representation space and temporally aligned to form the REPA loss, which directly regularizes synthesis toward both linguistic and environmental ground-truth structure.

Inference and Conditional Generation

Dual classifier-free guidance (CFG) is used at inference, with independently tunable guidance weights for content and environmental streams. This enables flexible controllability over generated attribute fidelity (as analyzed in Figures 4 and 5).

Figure 3: Objective evaluation for various environment guidance scales demonstrating the impact on intelligibility, audio quality, and text-audio alignment.

Figure 4: The impact of content guidance scale on generation trade-offs between WER, FAD, and CLAP.

Experimental Results

ImmersiveTTS is evaluated on AudioCaps and Seed-TTS benchmarks under subjective and objective criteria, including speech naturalness (SN-MOS), environment consistency (EC-MOS), overall naturalness (ON-MOS), word error rate (WER), Frechet audio distance (FAD), and CLAP alignment.

The following outcomes are emphasized:

ImmersiveTTS consistently outperforms VoiceLDM and VoiceDiT across SN-MOS, ON-MOS, WER, FAD, and CLAP, demonstrating superior intelligibility, audio fidelity, and speech-environment integration, with substantially fewer sampling steps (25 NFEs for ImmersiveTTS vs. 200 for baselines).
The dual-teacher REPA strategy (WavLM + ATST-Frame) yields the strongest trade-off between speech accuracy and environment fidelity, outperforming both single-teacher and unified-teacher alternatives in all reported metrics.
The architecture is robust to changes in sampling steps, maintaining high quality with drastically reduced inference cost (Figure 5).
Objective results on speaker identity preservation and mixing-based baselines indicate the unified modeling of ImmersiveTTS effectively preserves identity and coherence compared to post-hoc mixing pipelines.
Figure 5: Comparative performance as a function of sampling steps (NFEs); ImmersiveTTS achieves best efficiency/quality trade-off compared to diffusion-based baselines.

Additional Analyses

Systematic studies on the impact of representation alignment position corroborate the importance of applying REPA at intermediate MM-DiT layers, further enhancing stability and domain-specific guidance. Dual CFG analyses show that overly strong environmental or content weighting can degrade overall quality, suggesting the importance of balanced settings for both intelligibility and scene realism.

Implications and Future Directions

ImmersiveTTS substantiates the efficacy of explicitly cross-modal, transformer-based fusion approaches in environment-aware TTS. The combination of MM-DiT dual-stream modeling and domain-specific REPA sets a rigorous foundation for further advances, enabling:

More effective integration of uncontrollable or compositional environmental cues
Efficient sampling consistent with recent progress in flow-based generative models
Extensibility toward other multimodal tasks such as audio-visual speech synthesis and expressive prosody modeling

Remaining limitations include reliance on synthetic training mixtures and limited granular control of paralinguistic features (e.g., emotion or speaking style). Scaling to real-world, complex interaction scenarios with genuine, naturally recorded background contexts is a future avenue, alongside the integration of richer, controllable prosodic and emotional factors.

Conclusion

ImmersiveTTS achieves a new standard in unified, environment-aware TTS by leveraging MM-DiT dual-stream fusion and complementary domain alignment objectives, empirically establishing clear advances in both objective and subjective quality metrics. The architecture’s systematic fusion of transcript and environment conditioning, underpinned by self-supervised feature alignment, provides a robust architecture-level template for future scalable, multimodal, and highly controllable speech synthesis frameworks.

References

For the full technical details and evaluation, see "ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment" (2605.30965).

Markdown Report Issue