VoiceDiT: Environment-Aware Speech Synthesis
- VoiceDiT is a diffusion-based, multimodal model that synthesizes speech and environmental audio using dual conditioning from text and visual cues.
- It unifies high-accuracy text-to-speech alignment with flexible text-to-audio synthesis to robustly simulate realistic acoustic environments.
- It leverages a novel Dual-Condition Diffusion Transformer and dual dataset training to ensure both clear linguistic fidelity and environmental coherence.
VoiceDiT is a diffusion-based multimodal generative model that produces environment-aware speech and audio from text and visual prompts. It unifies high-accuracy text-to-speech (TTS) alignment with flexible text-to-audio (TTA) synthesis, generating speech that is temporally and phonetically aligned to the input transcript and naturally embedded in a target acoustic environment specified via text, audio, or image. VoiceDiT achieves this by combining three components: a novel Dual-Condition Diffusion Transformer (Dual-DiT) architecture, a large-scale, jointly curated corpus of synthetic and real-world speech-plus-environment data, and an auxiliary image-to-audio translation module that maps visual cues into the environmental audio latent space (Jung et al., 2024).
1. Motivation and Problem Scope
VoiceDiT addresses two longstanding challenges in generative speech and audio:
- Most TTS models assume noiseless studio signals and fail to generalize when target audio is embedded in real-world environments (e.g., reverberant public spaces, urban environments with noise, or wildlife backgrounds).
- Prior TTA systems achieve convincing environmental generation but cannot guarantee the generated speech is aligned with a specific transcript or maintain strong linguistic accuracy in noisy, multi-modal conditions.
VoiceDiT’s dual conditioning accepts flexible input, with text specifying the content and audio or images specifying the environment, while generating outputs that maintain tight speech–transcript alignment and are acoustically coherent with the specified environment (Jung et al., 2024).
2. Datasets and Preprocessing Pipeline
The VoiceDiT framework utilizes a two-stage data approach: large-scale pre-training on synthetic mixtures and fine-tuning on real-world aligned data.
- Pre-training (Synthetic): Combines LibriTTS-R (585 h of clean, transcribed multi-speaker speech) with a diverse noise bank (340K non-speech clips from WavCaps, including crowds, animals, and machines). Each utterance is mixed with random noise at an SNR drawn from Uniform[2, 10], and a room impulse response (RIR) is sometimes applied. This process yields approximately 600K synthetic samples with text, background event, and reverberation annotations.
- Fine-tuning (Real-world): Starts from AudioSet-speech (597K 10-s audio clips, weakly labeled and transcribed using Whisper Large-v3). Samples with a word error rate (WER) above 20% (as computed by ASR alignment) are filtered out, leaving 400K highly aligned, transcript-matched audio segments (Jung et al., 2024).
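The WER-based filtering step can be illustrated with a minimal sketch. It assumes each clip carries two transcripts to compare; the dictionary keys `reference` and `hypothesis`, the function names, and the lowercase whitespace tokenization are all hypothetical choices made for illustration, not details taken from the VoiceDiT pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def filter_by_wer(samples, max_wer=0.20):
    """Keep samples whose WER does not exceed max_wer (hypothetical dict keys)."""
    return [
        s for s in samples
        if word_error_rate(s["reference"], s["hypothesis"]) <= max_wer
    ]
```

With a threshold of 0.20, a clip whose transcripts differ in one word out of five is kept, while one differing in three words out of six is dropped.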
This dual-dataset approach ensures the model is exposed to both well-controlled synthetic combinations and challenging real, noisy scenarios.
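The synthetic mixing step of the pre-training stage can be sketched as follows. This is an illustration under stated assumptions: the SNR is taken to be in decibels (the text gives no unit), waveforms are plain Python lists of equal length, the optional room impulse response is omitted, and the function names are hypothetical.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power equals snr_db, then add."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in noise) / len(noise)
    if noise_power == 0.0:
        return list(speech)  # silent noise clip: nothing to mix in
    # Target noise power is speech_power / 10^(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def make_synthetic_sample(speech, noise, rng, snr_range=(2.0, 10.0)):
    """Draw an SNR uniformly from snr_range (assumed dB) and mix accordingly."""
    snr_db = rng.uniform(*snr_range)
    return mix_at_snr(speech, noise, snr_db), snr_db
```

Passing an explicit `random.Random` instance keeps the sampled SNR reproducible, which helps when the SNR value is stored as an annotation alongside the mixture.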
3. Dual-Condition Diffusion Transformer Architecture
The VoiceDiT core is the Dual-Condition Diffusion Transformer (Dual-DiT), which integrates separate content (speech) and environmental (style) conditioning:
- Content Condition (c_cont): Produced by a TTS module adapted from Glow-TTS, which encodes text into monotonic, time-aligned mel-feature latents. The process uses a text encoder, a duration predictor, upsampling to match time-aligned frames, and two 2D convolutional layers that map the result to the compact latent c_cont.
- Environment Condition (c_env): Provided as either a CLAP (Contrastive Language-Audio Pretraining) audio embedding from a reference audio clip, or obtained by mapping a CLIP image embedding into the CLAP space using the diffusion-based I2A-Translator.
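The duration-based upsampling in the content branch can be pictured as a length regulator: each token encoding is repeated for as many frames as its predicted duration. The sketch below shows only that expansion step; the function name and list-based representation are assumptions, and the text encoder, duration predictor, and convolutional layers are not modeled.

```python
def length_regulate(token_encodings, durations):
    """Repeat each token encoding durations[i] times to get frame-level features."""
    if len(token_encodings) != len(durations):
        raise ValueError("need exactly one duration per token encoding")
    frames = []
    for encoding, n_frames in zip(token_encodings, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the encoding so that frames do not share the same list object.
        frames.extend(list(encoding) for _ in range(n_frames))
    return frames
```

For example, two token encodings with durations 2 and 1 expand to three frames, so the output length always equals the sum of the durations.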
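The diffusion process operates in the latent space of a pretrained VAE, with a standard forward process:
$$q(z_t \mid z_{t-1}) = \mathcal{N}\big(z_t;\ \sqrt{1-\beta_t}\, z_{t-1},\ \beta_t I\big)$$
and in closed form for clean latent $z_0$, with $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$:
$$q(z_t \mid z_0) = \mathcal{N}\big(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1-\bar{\alpha}_t)\, I\big)$$
Reverse modeling is performed via parameterized mean and fixed variance, trained under reweighted denoising score matching:
$$\mathcal{L} = \mathbb{E}_{z_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}\Big[\big\lVert \epsilon - \epsilon_\theta(z_t, t, c_{\text{cont}}, c_{\text{env}}) \big\rVert_2^2\Big]$$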
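As a rough illustration of the standard DDPM machinery referred to here, the sketch below draws a noisy latent z_t directly from a clean latent z_0 using the cumulative product of (1 - beta_t), and evaluates a mean-squared noise-prediction loss. The linear beta schedule, its endpoint values, and all function names are assumptions for illustration; the schedule actually used by VoiceDiT is not stated in this text.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule, not from the source)."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(z0, t, alpha_bars, rng):
    """Sample z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    zt = [signal * z + sigma * e for z, e in zip(z0, eps)]
    return zt, eps


def noise_prediction_loss(eps_true, eps_pred):
    """Mean squared error between the injected noise and the predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred)) / len(eps_true)
```

In training, `eps_pred` would come from the conditioned denoising network; here it is left as a plain argument so the loss can be checked in isolation.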
The Dual-DiT block is structured as follows:
- Self-Attention Layer: multi-head self-attention over the noisy latent tokens
- Cross-Attention Layer (environmental condition): queries from the input, keys/values from c_env (either the CLAP embedding or the image-derived proxy), applied after self-attention and before the FFN
- Feed-Forward Layers and Residual/LayerNorm: Provide additional capacity and facilitate stable training
- Content Injection: c_cont is concatenated channel-wise to the denoising latent before entry to the first transformer block. Empirically, concatenation yields more robust transcript alignment than cross-attention for content.
This dual-stream conditioning allows separately controlling linguistic content and surrounding environmental style.
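The two injection routes can be contrasted in a small sketch: channel-wise concatenation for the content condition and scaled dot-product cross-attention for the environment condition. This is a simplification under assumptions: a single attention head, no learned projections, no residual connections or normalization, and hypothetical function names. It is not the Dual-DiT implementation.

```python
import math


def concat_channels(latent_frames, content_frames):
    """Channel-wise concatenation: output frame i is latent_i followed by content_i."""
    if len(latent_frames) != len(content_frames):
        raise ValueError("latent and content must have the same number of frames")
    return [list(z) + list(c) for z, c in zip(latent_frames, content_frames)]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (learned projections omitted)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs
```

Concatenation keeps a frame-by-frame correspondence between content and latent, whereas cross-attention lets every latent frame read from the same set of environment tokens, which matches the intuition that content is time-aligned and environment is not.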
4. Image-to-Audio Translation Module
Unlike prior TTA models, VoiceDiT incorporates an image-to-audio (I2A) diffusion translator, which maps CLIP-encoded image embeddings to the CLAP-style audio embedding space required for environmental conditioning. The translator is a transformer trained as a denoiser in the embedding diffusion space, conditioned on the CLIP image embedding.
Inference uses iterative denoising, starting from noise and conditioned on the image embedding, to yield an environmental embedding that matches the visual cue for environmental sound synthesis (Jung et al., 2024).
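The iterative denoising loop at inference time can be sketched with a deterministic DDIM-style sampler. Everything here is an assumption made for illustration: the sampler type, the choice of a denoiser that predicts the clean embedding, the linear schedule, and the stand-in `toy_denoiser`, which simply doubles the image embedding and is in no way the trained I2A-Translator.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear schedule."""
    step = (beta_end - beta_start) / (num_steps - 1)
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        running *= 1.0 - (beta_start + i * step)
        alpha_bars.append(running)
    return alpha_bars


def ddim_sample(denoiser, image_embedding, alpha_bars, dim, rng):
    """Start from Gaussian noise and denoise step by step, conditioned on the image."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(len(alpha_bars) - 1, -1, -1):
        x0_hat = denoiser(x, t, image_embedding)  # predicted clean embedding
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        # Noise implied by the current sample and the predicted clean embedding.
        eps_hat = [
            (xi - math.sqrt(a_t) * x0i) / math.sqrt(1.0 - a_t)
            for xi, x0i in zip(x, x0_hat)
        ]
        # Deterministic move to the previous noise level.
        x = [
            math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * ei
            for x0i, ei in zip(x0_hat, eps_hat)
        ]
    return x


def toy_denoiser(x_t, t, image_embedding):
    """Hypothetical stand-in for a trained network: ignores x_t and t."""
    return [2.0 * v for v in image_embedding]
```

Because the final step uses a cumulative alpha of 1, the sampler returns exactly the denoiser's last clean prediction, which makes the loop easy to sanity-check with a toy denoiser before plugging in a real model.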
5. Training Regimen and Inference Procedure
- Pre-training: 100K steps on synthetic LibriTTS-R + WavCaps, minimizing the combined loss using AdamW (lr=, batch=16, 8×A6000 GPUs).
- Fine-tuning: 20K steps on filtered AudioSet-speech, with only and TTS layers frozen, lr halved.
- Guided Sampling: Dual classifier-free guidance combines the content and environment guidance terms in a weighted sum, with separate guidance-weight settings for speech-in-environment mode and for pure TTA (Jung et al., 2024).
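One plausible way to combine two guidance signals in a weighted sum is sketched below. The exact expression and the guidance weights used by VoiceDiT are not reproduced in this text, so the formula, the argument names, and the weights `w_env` and `w_cont` are assumptions: each condition contributes its own direction relative to the unconditional noise estimate, scaled by its weight.

```python
def dual_cfg(eps_uncond, eps_env, eps_cont, w_env, w_cont):
    """Combine noise estimates: unconditional plus weighted per-condition directions.

    eps_uncond: estimate with both conditions dropped
    eps_env:    estimate with only the environment condition
    eps_cont:   estimate with only the content condition
    """
    return [
        u + w_env * (e - u) + w_cont * (c - u)
        for u, e, c in zip(eps_uncond, eps_env, eps_cont)
    ]
```

In this sketch, setting one weight to zero removes that condition's guidance entirely, and raising a weight pushes the sample further along that condition's direction.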
6. Experimental Results and Analysis
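VoiceDiT demonstrates state-of-the-art performance on both TTS-in-noise and TTA benchmarks. On the AC-Filtered test set (AudioCaps), it compares with VoiceLDM and ground truth (GT) as follows, where Nat., Intel., and Rel. MOS denote mean opinion scores for naturalness, intelligibility, and relevance: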
| Model | FAD ↓ | CLAP ↑ | WER % ↓ | Nat. MOS ↑ | Intel. MOS ↑ | Rel. MOS ↑ |
|---|---|---|---|---|---|---|
| GT | — | 0.40 | 17.47 | 4.24 ± 0.10 | 4.08 ± 0.11 | 4.26 ± 0.09 |
| VoiceLDM | 5.56 | 0.21 | 10.39 | 2.94 ± 0.11 | 3.35 ± 0.12 | 3.24 ± 0.11 |
| VoiceDiT | 4.60 | 0.22 | 7.09 | 3.41 ± 0.10 | 4.32 ± 0.08 | 3.86 ± 0.09 |
On zero-shot TTA (AudioCaps) and image-to-audio (VGGSound), VoiceDiT obtains:
- Text→Audio: FAD 3.55, KL 1.87, CLAP 0.45 (outperforms AudioLDM and VoiceLDM on all three metrics).
- Image→Audio: FAD 3.02, KL 2.73 (outperforms SpecVQGAN and Im2Wav).
Ablation studies reveal:
- Replacing the Dual-DiT with a U-Net increases FAD and reduces CLAP-score.
- Content injection by concatenation is essential; using cross-attention for content instead raises WER to above 90%, breaking linguistic accuracy.
- Including environmental cross-attention further improves FAD and WER versus pure concatenation.
7. Discussion, Limitations, and Future Directions
VoiceDiT’s architectural separation of content (via concatenation) and environment (via cross-attention) is empirically justified: concatenation is critical for preserving transcript alignment, while cross-attention on environment yields better environmental fidelity. The use of CLAP and CLIP enables direct environmental modulation from audio or image cues.
Limitations include:
- Domain gap between synthetic and real-world scenes, particularly in highly complex environmental contexts.
- Computational cost and sampling latency inherent to diffusion models, which currently preclude real-time deployment.
- Global nature of current environment-guidance embeddings (CLAP) does not encode fine spatial cues.
Proposed directions for improvement:
- Enhance environmental modeling via integration of scene graphs or semantic visual segmentation.
- Explore fast sampling methods such as one-step consistency models or flow matching as a path toward real-time deployment.
- Expand the content condition to include speaker or style tokens for multilingual and cross-style synthesis (Jung et al., 2024).
References
- "VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis" (Jung et al., 2024)