EzAudio: Open-Source Text-to-Audio
- EzAudio is an open-source text-to-audio framework that employs a transformer-based latent diffusion model optimized for efficient, high-fidelity sound synthesis.
- It features adaptations tailored to the audio domain, including rotary position embeddings, QK-normalized attention, and AdaLN-SOLA (adaptive layer normalization with a low-rank adjustment) to boost stability and efficiency.
- Its three-stage training pipeline leveraging synthetic and human-annotated captions ensures robust prompt alignment and state-of-the-art performance in both objective and subjective evaluations.
EzAudio is an open-source text-to-audio (T2A) generation framework that targets high-fidelity synthesis of natural, prompt-aligned sound effects from text descriptions. It introduces a transformer-based latent diffusion model (EzAudio-DiT) specifically optimized for 1D audio latent representations. EzAudio incorporates efficient architectural strategies, classifier-free guidance (CFG) rescaling, and a three-stage pretraining pipeline leveraging large-scale synthetic caption resources and human-annotated audio. Comprehensive objective and subjective evaluations demonstrate that EzAudio achieves state-of-the-art (SoTA) results among open-source models while maintaining parameter and memory efficiency (Hai et al., 17 Sep 2024).
1. EzAudio-DiT Architecture
EzAudio-DiT is a latent diffusion transformer directly operating on the output of a fully-convolutional VAE trained on raw waveforms at 24 kHz, mapping audio to latents of resolution 50 Hz and 128 channels. Its distinctive adaptations for the audio domain are:
- Rotary Position Embedding (RoPE): Rather than 2D spatial embeddings, EzAudio uses time-axis RoPE on queries and keys. For token index $m$ and dimension pair $(2i, 2i{+}1)$, the rotation is applied as:
  $$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},$$
  where $\theta_i = 10000^{-2i/d}$ is a fixed frequency.
- QK-Normed Attention: Prior to dot-product attention, query and key vectors are L2-normalized:
  $$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{\hat{Q}\hat{K}^{\top}}{\sqrt{d}}\right)V,$$
  with $\hat{Q} = Q / \lVert Q \rVert_2$ and $\hat{K} = K / \lVert K \rVert_2$ (normalized row-wise).
- Adaptive LayerNorm via SOLA: The vanilla block-specific AdaLN is replaced by "Single Orchestrated by Low-rank Adjustment" (SOLA). Each block $i$ modulates a shared base AdaLN using low-rank projections of the diffusion timestep embedding $t$, reducing parameter and memory costs by up to 30%:
  $$(\gamma_i, \beta_i) = \mathrm{AdaLN}_{\mathrm{shared}}(t) + B_i A_i\, t,$$
  with block-specific low-rank factors $A_i \in \mathbb{R}^{r \times d}$, $B_i \in \mathbb{R}^{2d \times r}$, and rank $r \ll d$.
- Long-Skip Connections & Post-Fusion Normalization: Input latents are injected into later transformer blocks, and a LayerNorm stabilizes the merged features, preserving fine waveform details across depth.
- No Down/Up Sampling: Unlike U-Net architectures, diffusion is performed at a fixed latent resolution, removing architectural complexity tied to hierarchical resolution scales.
- Parameter & Memory Efficiency: For instance, EzAudio-L (24 blocks, hidden dim 1024) uses 596M params and fits within 24 GB GPU RAM (batch 16), compared to >32 GB for a non-SOLA, non-skip variant with similar capacity.
2. Diffusion Objective and Training
EzAudio leverages the latent diffusion paradigm with velocity (v-) prediction and a zero-terminal-SNR noise schedule:
- Forward Diffusion: The clean latent $z_0$ is noised:
  $$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,$$
  where $\bar{\alpha}_t$ is the cumulative noise schedule, rescaled so that $\bar{\alpha}_T = 0$ (zero terminal SNR), and $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise.
- Reverse (Denoising) Process: The model predicts the "velocity" target:
  $$v_t = \sqrt{\bar{\alpha}_t}\, \epsilon - \sqrt{1 - \bar{\alpha}_t}\, z_0$$
  and minimizes the mean squared error
  $$\mathcal{L} = \mathbb{E}_{z_0, \epsilon, t}\left[\lVert \hat{v}_\theta(z_t, t, c) - v_t \rVert_2^2\right],$$
  where $c$ is the text conditioning.
Velocity prediction combined with the zero-terminal-SNR schedule yields stable training and improved sample quality for audio latents.
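As a worked illustration of this objective, the sketch below forms the v-prediction target and MSE loss for a batch of latents; the `alpha_bar` schedule tensor and the model call signature are placeholder assumptions rather than the paper's exact implementation.

```python
import torch


def v_prediction_loss(model, z0, t, text_cond, alpha_bar):
    """Velocity-prediction MSE for latent diffusion (illustrative sketch).

    z0:        clean latents, shape (batch, channels, frames)
    t:         integer timesteps, shape (batch,)
    alpha_bar: cumulative schedule tensor with alpha_bar[T-1] == 0 (zero terminal SNR)
    """
    a = alpha_bar[t].sqrt().view(-1, 1, 1)          # sqrt(alpha_bar_t)
    s = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1)  # sqrt(1 - alpha_bar_t)
    eps = torch.randn_like(z0)
    z_t = a * z0 + s * eps                          # forward diffusion
    v_target = a * eps - s * z0                     # velocity target
    v_pred = model(z_t, t, text_cond)               # hypothetical model signature
    return torch.mean((v_pred - v_target) ** 2)
```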
3. Classifier-Free Guidance and Rescaling
Standard classifier-free guidance interpolates model predictions conditioned ($\hat{v}_c$) and unconditioned ($\hat{v}_\emptyset$) on text:
$$\hat{v}_{\mathrm{cfg}} = \hat{v}_\emptyset + w\,(\hat{v}_c - \hat{v}_\emptyset),$$
with larger guidance scales $w$ increasing adherence but often degrading quality. EzAudio employs a rescaling technique to address this tradeoff:
- Rescaling: The guided prediction $\hat{v}_{\mathrm{cfg}}$ is scaled to match the standard deviation of the conditional prediction $\hat{v}_c$, then linearly blended with the unscaled result:
  $$\hat{v}_{\mathrm{rescaled}} = \hat{v}_{\mathrm{cfg}} \cdot \frac{\mathrm{std}(\hat{v}_c)}{\mathrm{std}(\hat{v}_{\mathrm{cfg}})}, \qquad \hat{v}_{\mathrm{final}} = \phi\, \hat{v}_{\mathrm{rescaled}} + (1 - \phi)\, \hat{v}_{\mathrm{cfg}}.$$
With a suitably chosen rescale factor $\phi$ at moderate-to-high guidance scales, improved prompt adherence is achieved while avoiding fidelity loss.
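A compact sketch of guidance with rescaling, following the formulas above; the per-sample standard-deviation reduction used here is an assumption about the exact axes.

```python
import torch


def cfg_with_rescale(v_cond, v_uncond, w, phi):
    """Classifier-free guidance with std-matching rescale (illustrative).

    v_cond, v_uncond: model outputs with and without text conditioning
    w:   guidance scale
    phi: blend factor between the rescaled and raw guided predictions
    """
    v_cfg = v_uncond + w * (v_cond - v_uncond)
    # Scale the guided prediction so its per-sample std matches the conditional one.
    reduce_dims = list(range(1, v_cfg.dim()))
    std_cond = v_cond.std(dim=reduce_dims, keepdim=True)
    std_cfg = v_cfg.std(dim=reduce_dims, keepdim=True)
    v_rescaled = v_cfg * (std_cond / std_cfg)
    return phi * v_rescaled + (1.0 - phi) * v_cfg
```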
4. Three-Stage Training with Synthetic Captions
EzAudio's training pipeline is structured into three stages to maximize data reuse and harness large, weakly supervised datasets:
- Masked Diffusion Modeling (Stage 1): Trains the model unconditionally on AudioSet (1.8M clips) by randomly masking contiguous latent spans and requiring reconstruction, facilitating robust representation learning.
- Synthetic Caption Alignment (Stage 2): Aligns latent audio segments with synthetic captions. Sources include Auto-ACD (Sun et al.; 1.5M LLM-refined captions), AS-Qwen-Caps (Chu et al.), and AS-SL-GPT4-Caps (GPT-4). Pairs are filtered via CLAP similarity (threshold 0.40) to remove low-quality alignments. Cross-attention layers are zero-initialized, and 10% of training samples use empty text conditioning to enable CFG.
- Human-Label Fine-Tuning (Stage 3): Refines the model on 48k AudioCaps human-annotated clips for 30k steps, enhancing prompt fidelity and realism.
This pipeline broadens coverage of acoustic events and linguistic styles and increases prompt adherence, especially for rare or compositional prompts.
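The Stage-2 filtering step reduces to a cosine-similarity threshold over paired CLAP embeddings. The sketch below assumes the audio and caption embeddings have already been extracted with a CLAP model and shows only the filtering logic.

```python
import torch
import torch.nn.functional as F


def filter_by_clap_similarity(audio_emb: torch.Tensor,
                              text_emb: torch.Tensor,
                              threshold: float = 0.40) -> torch.Tensor:
    """Keep (audio, caption) pairs whose CLAP cosine similarity meets the threshold.

    audio_emb, text_emb: (N, D) embeddings from a CLAP audio/text encoder
    (embedding extraction is outside this sketch). Returns a boolean mask over the N pairs.
    """
    sim = F.cosine_similarity(audio_emb, text_emb, dim=-1)
    return sim >= threshold
```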
5. Training Setup and Implementation
- VAE Details: A fully convolutional 1D VAE with snake activations compresses 24 kHz waveforms into a 50 Hz, 128-channel latent bottleneck.
- Model Variants: EzAudio-L (24 DiT blocks, hidden dim 1024, 596M params), EzAudio-XL (28 blocks, hidden dim 1152, 874M params).
- Optimizer: AdamW, with staged learning rates (1e-4, 5e-5, 1e-5) and batch size 128 throughout.
- Training: 8 × A100-40GB, ∼5 days. Stages: 100k steps (Stage 1), 50k (Stage 2), 30k (Stage 3).
- Sampling: Default 50–100 steps; ablation studies at CFG = 3; final evaluation uses 100 sampling steps with classifier-free guidance and rescaling applied.
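The staged setup can be summarized as a simple configuration list plus a per-stage optimizer factory; dataset names, step counts, and learning rates come from the text above, while everything else (names, structure) is illustrative.

```python
import torch

# Illustrative staged-training configuration mirroring the setup above.
STAGES = [
    {"name": "masked_diffusion",   "data": "AudioSet",           "steps": 100_000, "lr": 1e-4},
    {"name": "synthetic_captions", "data": "synthetic captions", "steps": 50_000,  "lr": 5e-5},
    {"name": "human_finetune",     "data": "AudioCaps",          "steps": 30_000,  "lr": 1e-5},
]


def make_optimizer(model: torch.nn.Module, stage: dict) -> torch.optim.Optimizer:
    # AdamW with the stage-specific learning rate; batch size 128 is handled by the data loader.
    return torch.optim.AdamW(model.parameters(), lr=stage["lr"])
```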
6. Evaluation and Ablation
6.1 Objective Metrics
- Frechet Distance (FD), KL Divergence, Inception Score (IS), and CLAP Score: Standard objective metrics (PANNs features for FD/IS). Lower FD/KL, higher IS/CLAP indicate better sample quality and alignment.
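For reference, the Fréchet distance over two embedding sets (here assumed to be PANNs features) is computed by fitting Gaussians and comparing their first two moments; the sketch below mirrors this standard formulation rather than the paper's exact evaluation script.

```python
import numpy as np
from scipy import linalg


def frechet_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two embedding sets.

    emb_ref, emb_gen: (N, D) arrays of embeddings for reference and generated audio
    (feature extraction is outside this sketch).
    """
    mu1, mu2 = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    sigma1 = np.cov(emb_ref, rowvar=False)
    sigma2 = np.cov(emb_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```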
6.2 Convergence and Ablations
- DiT Variants: A PixArt-style DiT (AdaLN-Single) shows fast initial convergence but becomes unstable; Stable-Audio-DiT (RoPE + QK-Norm) converges faster than CrossDiT but is less stable; EzAudio-DiT (AdaLN-SOLA + long skips) converges both quickly and stably to the best FD (≈16.0).
- Caption Filtering: Best trade-off at CLAP=0.40 (FD=15.46, KL=1.44, IS=10.11, CLAP=0.294).
- CFG Scaling/Rescaling: Without rescaling, increasing the guidance scale $w$ improves CLAP but degrades FD; with rescaling and rescale factors up to $0.75$ at larger guidance scales, high CLAP is maintained with minimal FD increase.
6.3 SoTA Comparison (AudioCaps Test)
| Model | Params | FD↓ | KL↓ | IS↑ | CLAP↑ |
|---|---|---|---|---|---|
| Tango | 866 M | 19.07 | 1.33 | 7.70 | 0.293 |
| Make-An-Audio-2 | 937 M | 16.16 | 1.42 | 9.93 | 0.284 |
| EzAudio-L | 596 M | 15.59 | 1.38 | 11.35 | 0.319 |
| EzAudio-XL | 874 M | 14.98 | 1.29 | 11.38 | 0.314 |
6.4 Subjective Evaluation
On 30 prompts with 12 expert listeners (5-point MOS), EzAudio-XL achieves overall quality (OVL) ≈ 4.2 and prompt relevance (REL) ≈ 4.3, substantially above open baselines and approaching real audio (OVL/REL ≈ 4.8).
7. Contributions, Open Problems, and Future Directions
EzAudio demonstrates:
- End-to-End 1D Latent Pipeline: Modeling waveform-derived latents directly removes the need for a separate spectrogram-plus-vocoder pipeline and preserves high temporal resolution.
- Efficient DiT Design: AdaLN-SOLA, long skips, RoPE, and QK-Norm jointly enable fast and stable convergence with lower resource utilization.
- Data-Efficient Pretraining: Multi-stage training exploits unlabeled audio and synthetic captions, maximizing coverage and sample diversity.
- CFG Rescaling: Automation of guidance rescaling eliminates manual tuning.
- Performance: Achieves SoTA objective/subjective results among open-source systems.
Persistent challenges include further reducing dependence on human annotation (e.g., via stronger self-supervised objectives), extending to controllable generation (e.g., time alignment, editing), supporting richer conditioning (voice/melody, cross-modal tasks), and exploring non-Gaussian latent priors to improve low-frequency synthesis. These represent natural directions for the evolution of T2A frameworks grounded in diffusion transformers (Hai et al., 17 Sep 2024).