
EzAudio: Open-Source Text-to-Audio

Updated 26 November 2025
  • EzAudio is an open-source text-to-audio framework that employs a transformer-based latent diffusion model optimized for efficient, high-fidelity sound synthesis.
  • It features novel adaptations for the audio domain, including rotary position embeddings, QK-normalized attention, and AdaLN-SOLA adaptive layer normalization, to improve stability and efficiency.
  • Its three-stage training pipeline leveraging synthetic and human-annotated captions ensures robust prompt alignment and state-of-the-art performance in both objective and subjective evaluations.

EzAudio is an open-source text-to-audio (T2A) generation framework that targets high-fidelity synthesis of natural, prompt-aligned sound effects from text descriptions. It introduces a transformer-based latent diffusion model (EzAudio-DiT) specifically optimized for 1D audio latent representations. EzAudio incorporates efficient architectural strategies, classifier-free guidance (CFG) rescaling, and a three-stage pretraining pipeline leveraging large-scale synthetic caption resources and human-annotated audio. Comprehensive objective and subjective evaluations demonstrate that EzAudio achieves state-of-the-art (SoTA) results among open-source models while maintaining parameter and memory efficiency (Hai et al., 17 Sep 2024).

1. EzAudio-DiT Architecture

EzAudio-DiT is a latent diffusion transformer that operates directly on the output of a fully-convolutional VAE trained on raw waveforms at 24 kHz, mapping audio to latents of resolution 50 Hz and 128 channels. Its distinctive adaptations for the audio domain, illustrated in the code sketch after this list, are:

  • Rotary Position Embedding (RoPE): Rather than 2D spatial embeddings, EzAudio uses time-axis RoPE on queries and keys. For token index $i$ and dimension pair $(2k, 2k{+}1)$, the rotation matrix is applied as:

$$\widetilde{Q}_{i,\,2k:2k+1} = \begin{bmatrix} \cos(\omega_k i) & -\sin(\omega_k i) \\ \sin(\omega_k i) & \cos(\omega_k i) \end{bmatrix} Q_{i,\,2k:2k+1}$$

where $\omega_k$ is a fixed frequency.

  • QK-Normed Attention: Prior to dot-product attention, query and key vectors are L2-normalized:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{\widehat{Q}\,\widehat{K}^{\top}}{\sqrt{d}}\right)V$$

with $\widehat{Q} = Q/\|Q\|$ and $\widehat{K} = K/\|K\|$.

  • Adaptive LayerNorm via SOLA: The vanilla block-specific AdaLN is replaced by "Single Orchestrated by Low-rank Adjustment" (SOLA). Each block modulates a base AdaLN using low-rank projections of the diffusion timestep embedding $e(t)$, reducing parameter and memory costs by up to 30%:

$$\gamma_i = \gamma_0 + U_i\, e(t), \qquad \beta_i = \beta_0 + V_i\, e(t)$$

with $U_i, V_i \in \mathbb{R}^{r \times d}$.

  • Long-Skip Connections & Post-Fusion Normalization: Input latents are injected into later transformer blocks, and a LayerNorm stabilizes the merged features, preserving fine waveform details across depth.
  • No Down/Up Sampling: Unlike U-Net architectures, diffusion is performed at fixed latent resolution, removing architectural complexity tied to hierarchical spatial scales.
  • Parameter & Memory Efficiency: For instance, EzAudio-L (24 blocks, hidden dim 1024) uses 596M params and fits within 24 GB GPU RAM (batch 16), compared to >32 GB for a non-SOLA, non-skip variant with similar capacity.
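
The sketch below (PyTorch) combines the pieces above into a single EzAudio-DiT-style block: time-axis RoPE, QK-normalized attention, SOLA-modulated adaptive LayerNorm, and long-skip fusion with post-fusion normalization. Module names, the rank r, the head count, the skip-fusion projection, and the use of learnable base modulation parameters (rather than a shared timestep MLP) are illustrative assumptions; this is a sketch of the ideas, not the released implementation.

```python
# Illustrative EzAudio-DiT-style block (not the released implementation).
# Latents are laid out as (batch, time, channels), matching the 1D VAE output.
import torch
import torch.nn as nn
import torch.nn.functional as F


def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Time-axis rotary position embedding on (B, H, T, Dh) queries/keys."""
    B, H, T, Dh = x.shape
    half = Dh // 2
    freqs = base ** (-torch.arange(half, device=x.device, dtype=x.dtype) / half)  # omega_k
    angles = torch.arange(T, device=x.device, dtype=x.dtype)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # each (T, half)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin           # 2x2 rotation per dimension pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


class QKNormRoPEAttention(nn.Module):
    """Self-attention with L2-normalized queries/keys and time-axis RoPE."""

    def __init__(self, d: int, n_heads: int):
        super().__init__()
        self.h, self.dh = n_heads, d // n_heads
        self.qkv = nn.Linear(d, 3 * d)
        self.proj = nn.Linear(d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.h, self.dh).transpose(1, 2) for t in (q, k, v))
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)   # QK-Norm
        q, k = rope(q), rope(k)                                 # RoPE on queries and keys
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, T, D))


class SOLAAdaLN(nn.Module):
    """Base modulation plus a per-block low-rank adjustment driven by e(t)."""

    def __init__(self, d: int, r: int = 32):
        super().__init__()
        self.norm = nn.LayerNorm(d, elementwise_affine=False)
        self.gamma0 = nn.Parameter(torch.ones(d))   # simplified base (learnable constants)
        self.beta0 = nn.Parameter(torch.zeros(d))
        self.down = nn.Linear(d, r, bias=False)     # low-rank factorization of U_i, V_i
        self.up_g = nn.Linear(r, d, bias=False)
        self.up_b = nn.Linear(r, d, bias=False)

    def forward(self, x: torch.Tensor, e_t: torch.Tensor) -> torch.Tensor:
        h = self.down(e_t)                          # e_t: (B, d) timestep embedding
        gamma = self.gamma0 + self.up_g(h)          # gamma_i = gamma_0 + U_i e(t)
        beta = self.beta0 + self.up_b(h)            # beta_i  = beta_0 + V_i e(t)
        return self.norm(x) * gamma[:, None, :] + beta[:, None, :]


class EzAudioStyleBlock(nn.Module):
    """Transformer block with SOLA AdaLN and long-skip post-fusion normalization."""

    def __init__(self, d: int = 1024, n_heads: int = 16, r: int = 32):
        super().__init__()
        self.ada1, self.attn = SOLAAdaLN(d, r), QKNormRoPEAttention(d, n_heads)
        self.ada2 = SOLAAdaLN(d, r)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.skip_proj = nn.Linear(2 * d, d)        # fuse long-skip features...
        self.skip_norm = nn.LayerNorm(d)            # ...then normalize the merged result

    def forward(self, x, e_t, long_skip=None):
        if long_skip is not None:                   # early-block features injected later
            x = self.skip_norm(self.skip_proj(torch.cat([x, long_skip], dim=-1)))
        x = x + self.attn(self.ada1(x, e_t))
        x = x + self.mlp(self.ada2(x, e_t))
        return x
```

A full model would stack many such blocks at fixed latent resolution, embed the diffusion timestep into $e(t)$, and inject the text condition through cross-attention layers (omitted here for brevity).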

2. Diffusion Objective and Training

EzAudio leverages the latent diffusion paradigm with velocity prediction and a zero-SNR noise schedule:

  • Forward Diffusion: Clean latent $z_0 \in \mathbb{R}^{L \times 128}$ is noised:

$$z_t = \alpha_t z_0 + \sigma_t \epsilon$$

where $\alpha_t = \sqrt{\bar{\alpha}_t}$, $\sigma_t = \sqrt{1 - \bar{\alpha}_t}$, and $\epsilon$ is Gaussian noise.

  • Reverse (Denoising) Process: The model predicts the "velocity" target:

$$v = \alpha_t \epsilon - \sigma_t z_0$$

and minimizes mean squared error:

$$\mathcal{L} = \mathbb{E}_{z_0,\,\epsilon,\,t}\,\bigl\|\hat{v}_\theta(z_t, t, y) - (\alpha_t \epsilon - \sigma_t z_0)\bigr\|^2$$

where $y$ is the text conditioning.

This velocity prediction with zero-SNR ensures stable training and improved sample quality for audio latents.
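
The objective can be stated compactly in code. The sketch below assumes a cosine $\bar{\alpha}_t$ schedule forced to zero terminal SNR, latents of shape (B, L, 128), and a generic `model(z_t, t, y)` callable; the paper's exact schedule and conditioning interface may differ.

```python
# Sketch of the velocity-prediction diffusion loss on 1D audio latents.
import torch


def make_zero_snr_schedule(num_steps: int = 1000) -> torch.Tensor:
    """Cosine-style alpha_bar with the terminal step forced to zero SNR (illustrative)."""
    t = torch.linspace(0.0, 1.0, num_steps + 1)
    alpha_bar = torch.cos(0.5 * torch.pi * t) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    alpha_bar[-1] = 0.0                              # zero terminal SNR
    return alpha_bar[1:]                             # alpha_bar for t = 1..num_steps


def v_prediction_loss(model, z0, y, alpha_bar):
    """MSE between predicted and true velocity, as in Section 2."""
    B = z0.shape[0]
    alpha_bar = alpha_bar.to(z0.device)
    t = torch.randint(0, alpha_bar.numel(), (B,), device=z0.device)
    a = alpha_bar[t].sqrt().view(B, 1, 1)            # alpha_t
    s = (1.0 - alpha_bar[t]).sqrt().view(B, 1, 1)    # sigma_t
    eps = torch.randn_like(z0)
    z_t = a * z0 + s * eps                           # forward diffusion
    v_target = a * eps - s * z0                      # velocity target
    v_pred = model(z_t, t, y)                        # y: text conditioning
    return ((v_pred - v_target) ** 2).mean()
```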

3. Classifier-Free Guidance and Rescaling

Standard classifier-free guidance combines the model's text-conditioned ($v_{\mathrm{pos}}$) and unconditional ($v_{\mathrm{neg}}$) predictions:

$$v_{\mathrm{cfg}} = v_{\mathrm{neg}} + w\,(v_{\mathrm{pos}} - v_{\mathrm{neg}})$$

with $w > 1$ increasing adherence but often degrading quality at large $w$. EzAudio employs a rescaling technique to address this tradeoff:

  • Rescaling: The guided prediction $v_{\mathrm{cfg}}$ is scaled to match the standard deviation of $v_{\mathrm{pos}}$, then linearly blended:

$$v_{\mathrm{re}} = v_{\mathrm{cfg}}\,\frac{\mathrm{std}(v_{\mathrm{pos}})}{\mathrm{std}(v_{\mathrm{cfg}})}$$

$$v'_{\mathrm{cfg}} = \phi\, v_{\mathrm{re}} + (1 - \phi)\, v_{\mathrm{cfg}}$$

With $\phi \approx 0.5$–$0.75$ at $w = 5$, improved prompt adherence is achieved while avoiding fidelity loss.
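
A minimal sketch of this rescaling; taking the standard deviation per sample over all latent dimensions is an assumption about the reduction axes.

```python
# Sketch of classifier-free guidance with std rescaling, as described above.
import torch


def rescaled_cfg(v_pos: torch.Tensor, v_neg: torch.Tensor,
                 w: float = 5.0, phi: float = 0.75, eps: float = 1e-8) -> torch.Tensor:
    """Classifier-free guidance followed by std rescaling and linear blending."""
    v_cfg = v_neg + w * (v_pos - v_neg)              # standard CFG
    dims = tuple(range(1, v_cfg.dim()))              # per-sample std over latent dims
    std_pos = v_pos.std(dim=dims, keepdim=True)
    std_cfg = v_cfg.std(dim=dims, keepdim=True)
    v_re = v_cfg * std_pos / (std_cfg + eps)         # match std of the conditioned prediction
    return phi * v_re + (1.0 - phi) * v_cfg          # linear blend
```

At inference, $v_{\mathrm{pos}}$ and $v_{\mathrm{neg}}$ come from two forward passes of the same network, one with the text embedding and one with the unconditional (empty-text) embedding.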

4. Three-Stage Training Pipeline and Synthetic Captions

EzAudio's training pipeline is structured into three stages to maximize data reuse and harness large, weakly supervised datasets:

  1. Masked Diffusion Modeling (Stage 1): Trains the model unconditionally on AudioSet (1.8M clips) by randomly masking contiguous latent spans and requiring reconstruction, facilitating robust representation learning.
  2. Synthetic Caption Alignment (Stage 2): Aligns latent audio segments with synthetic captions. Sources include Auto-ACD (Sun et al.; 1.5M LLM-refined captions), AS-Qwen-Caps (Chu et al.), and AS-SL-GPT4-Caps (GPT-4). Pairs are filtered via CLAP similarity (threshold 0.40) to remove low-quality alignments (a filtering sketch follows below). Cross-attention layers are zero-initialized, and 10% of training examples use empty (unconditional) text inputs to enable CFG.
  3. Human-Label Fine-Tuning (Stage 3): Refines the model on 48k AudioCaps human-annotated clips for 30k steps, enhancing prompt fidelity and realism.

This pipeline broadens coverage of acoustic events and linguistic styles and increases prompt adherence, especially for rare or compositional prompts.
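
A sketch of the Stage 2 CLAP-similarity filter at the 0.40 threshold; `embed_audio` and `embed_text` are hypothetical stand-ins for a CLAP audio/text encoder, not a specific library API.

```python
# Sketch of CLAP-similarity filtering of (audio, synthetic caption) pairs.
import torch
import torch.nn.functional as F


def filter_pairs(pairs, embed_audio, embed_text, threshold: float = 0.40):
    """Keep only pairs whose CLAP cosine similarity clears the threshold."""
    kept = []
    for audio, caption in pairs:
        a = F.normalize(embed_audio(audio), dim=-1)   # placeholder CLAP audio encoder
        t = F.normalize(embed_text(caption), dim=-1)  # placeholder CLAP text encoder
        if (a * t).sum(dim=-1).item() >= threshold:
            kept.append((audio, caption))
    return kept
```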

5. Training Setup and Implementation

  • VAE Details: A fully-convolutional 1D VAE with snake activations compresses 24 kHz waveforms into latents at 50 Hz with 128 channels.
  • Model Variants: EzAudio-L (24 DiT blocks, hidden dim 1024, 596M params), EzAudio-XL (28 blocks, hidden dim 1152, 874M params).
  • Optimizer: AdamW with staged learning rates (1e-4, 5e-5, and 1e-5 for Stages 1–3) and batch size 128 throughout.
  • Training: 8 × A100-40GB, ∼5 days. Stages: 100k steps (Stage 1), 50k (Stage 2), 30k (Stage 3).
  • Sampling: Default 50–100 steps; ablation studies at CFG = 3; final evaluation at 100 steps, $w = 5$, $\phi = 0.75$.
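
The sketch below shows how these inference settings fit together, assuming a deterministic DDIM-style update for velocity prediction, the illustrative zero-SNR schedule from Section 2, and the same generic `model(z, t, y)` interface; the released sampler may differ.

```python
# Sketch of deterministic v-prediction sampling with rescaled CFG.
import torch


@torch.no_grad()
def sample(model, alpha_bar, y_text, y_null, shape, steps=100, w=5.0, phi=0.75):
    """Denoise from pure noise using CFG with std rescaling (w, phi as in Section 3)."""
    device = alpha_bar.device
    ts = torch.linspace(alpha_bar.numel() - 1, 0, steps, device=device).long()
    z = torch.randn(shape, device=device)               # pure-noise start (zero terminal SNR)
    for i, t in enumerate(ts):
        a_t, s_t = alpha_bar[t].sqrt(), (1 - alpha_bar[t]).sqrt()
        v_pos, v_neg = model(z, t, y_text), model(z, t, y_null)
        v_cfg = v_neg + w * (v_pos - v_neg)              # CFG, then rescale as in Section 3
        v = phi * v_cfg * v_pos.std() / (v_cfg.std() + 1e-8) + (1 - phi) * v_cfg
        z0_hat = a_t * z - s_t * v                       # recover clean-latent estimate
        eps_hat = s_t * z + a_t * v                      # recover noise estimate
        if i + 1 < len(ts):
            t_prev = ts[i + 1]
            a_p, s_p = alpha_bar[t_prev].sqrt(), (1 - alpha_bar[t_prev]).sqrt()
            z = a_p * z0_hat + s_p * eps_hat             # DDIM step (eta = 0)
        else:
            z = z0_hat                                   # final latent estimate
    return z
```

The final latent is then passed to the VAE decoder to produce the 24 kHz waveform.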

6. Evaluation and Ablation

6.1 Objective Metrics

  • Fréchet Distance (FD), KL divergence, Inception Score (IS), and CLAP score: standard objective metrics, with FD and IS computed on PANNs features. Lower FD/KL and higher IS/CLAP indicate better sample quality and text alignment.

6.2 Convergence and Ablations

  • DiT Variants: PixArt-DiT (AdaLN-Single) shows fast initial convergence but becomes unstable; Stable-Audio-DiT (RoPE + QK-Norm) converges faster than CrossDiT but is less stable; EzAudio-DiT (AdaLN-SOLA + long skips) converges both quickly and stably to the best FD (≈16.0).
  • Caption Filtering: The best trade-off occurs at a CLAP filtering threshold of 0.40 (FD = 15.46, KL = 1.44, IS = 10.11, CLAP = 0.294).
  • CFG Scaling/Rescaling: Without rescaling, increasing $w$ improves CLAP but degrades FD; with rescaling and $\phi = 0.5$–$0.75$ at $w = 5$, high CLAP is maintained with minimal FD increase.

6.3 SoTA Comparison (AudioCaps Test)

Model            Params  FD↓    KL↓   IS↑    CLAP↑
Tango            866 M   19.07  1.33  7.70   0.293
Make-An-Audio-2  937 M   16.16  1.42  9.93   0.284
EzAudio-L        596 M   15.59  1.38  11.35  0.319
EzAudio-XL       874 M   14.98  1.29  11.38  0.314

6.4 Subjective Evaluation

On 30 prompts with 12 expert listeners (5-point MOS), EzAudio-XL achieves overall quality (OVL) ≈ 4.2 and prompt relevance (REL) ≈ 4.3, substantially above open baselines and approaching real audio (OVL/REL ≈ 4.8).

7. Contributions, Open Problems, and Future Directions

EzAudio demonstrates:

  • End-to-End 1D Latent Pipeline: Direct modeling of waveform latents avoids a mel-spectrogram plus vocoder pipeline and preserves high temporal resolution.
  • Efficient DiT Design: AdaLN-SOLA, long skips, RoPE, and QK-Norm jointly enable fast and stable convergence with lower resource utilization.
  • Data-Efficient Pretraining: Multi-stage training exploits unlabeled audio and synthetic captions, maximizing coverage and sample diversity.
  • CFG Rescaling: Guidance rescaling permits strong guidance ($w = 5$) without the usual fidelity loss, reducing the need to hand-tune guidance strength.
  • Performance: Achieves SoTA objective/subjective results among open-source systems.

Persistent challenges include further reducing dependence on human annotation (e.g., via stronger self-supervised objectives), extending to controllable generation (e.g., time alignment, editing), supporting richer conditioning (voice/melody, cross-modal tasks), and exploring non-Gaussian latent priors to improve low-frequency synthesis. These represent natural directions for the evolution of T2A frameworks grounded in diffusion transformers (Hai et al., 17 Sep 2024).

References

  1. Hai et al., "EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer," 17 Sep 2024.