
Make-An-Audio 2: Advanced Text-to-Audio Generation

Updated 26 November 2025
  • Make-An-Audio 2 is a latent diffusion-based text-to-audio generation model that uses temporal parsing and dual text encoders to improve semantic alignment and variable-length audio synthesis.
  • It replaces conventional 2D U-Nets with a temporal transformer-based denoiser, enhancing temporal consistency and reducing artifacts in generated audio.
  • The model leverages LLM-guided data augmentation to expand training samples, achieving superior objective and subjective performance on standard T2A benchmarks.

Make-An-Audio 2 is a latent diffusion-based text-to-audio (T2A) generation model designed to address shortcomings in semantic alignment and temporal consistency that affect prior T2A systems. Traditional approaches, including those relying on 2D spatial structures (e.g., 2D U-Nets), often suffer from misaligned semantics and poor handling of variable-length audio, with limited modeling of temporal information. Make-An-Audio 2 introduces temporal parsing, dual text encoders, a transformer-based denoiser focused on the temporal dimension, and LLM-powered data augmentation to achieve state-of-the-art results in both objective and subjective metrics on standard T2A benchmarks (Huang et al., 2023).

1. Latent Diffusion Framework for T2A Generation

Make-An-Audio 2 models audio synthesis as a latent-space denoising diffusion process over mel-spectrogram representations. Given an input mel-spectrogram $x \in \mathbb{R}^{C_a \times T}$, a 1D-convolutional variational autoencoder (VAE) encoder $E$ compresses $x$ into a latent code $z = E(x) \in \mathbb{R}^{d \times L}$, where $L$ scales with the audio duration.

During training, diffusion operates in this latent space over $T$ discrete steps, with a fixed noise schedule $\{\beta_t\} \subset (0,1)$:

  • Forward (Noising) Process: At step $t$,

$$q(z_t \mid z_{t-1}) = \mathcal{N}\!\left(\sqrt{1-\beta_t}\, z_{t-1},\; \beta_t I\right)$$

By induction, $q(z_t \mid z_0) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}_t}\, z_0,\; (1-\bar{\alpha}_t) I\right)$ with $\bar{\alpha}_t \equiv \prod_{s=1}^{t} (1-\beta_s)$.

  • Reverse (Denoising) Process: A neural network $\epsilon_\theta$ parameterizes the denoising step as

$$p_\theta(z_{t-1} \mid z_t, c) = \mathcal{N}\!\left(\mu_\theta(z_t, t, c),\; \Sigma_t\right)$$

where typically

$$\mu_\theta = \frac{1}{\sqrt{1-\beta_t}} \left(z_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(z_t, t, c)\right)$$

and $\Sigma_t = \beta_t I$.

  • Training Objective: The loss is a mean squared error in noise space:

$$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{z_0, \epsilon, t}\, \big\|\epsilon_\theta(z_t, t, c) - \epsilon\big\|_2^2$$

with $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$.

  • Classifier-Free Guidance: To control the tradeoff between generation diversity and conditioning faithfulness, classifier-free guidance is used:

$$\tilde{\epsilon}_\theta(z_t, t, c) = \epsilon_\theta(z_t, t, c_\varnothing) + s\left(\epsilon_\theta(z_t, t, c) - \epsilon_\theta(z_t, t, c_\varnothing)\right)$$

with $s \approx 4$ found to be the optimal guidance scale (a minimal code sketch of these steps follows this list).
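
The closed-form noising, the noise-regression objective, and a guided reverse step can be written compactly. The following is a minimal PyTorch sketch, not the authors' implementation; eps_model stands in for the conditional denoiser $\epsilon_\theta$, and its call signature is an assumption.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # fixed noise schedule {beta_t}
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # bar{alpha}_t = prod_s (1 - beta_s)

def noise(z0, t, eps):
    # Sample z_t ~ q(z_t | z_0) in closed form.
    a = alpha_bar[t].view(-1, 1, 1)              # broadcast over (d, L)
    return a.sqrt() * z0 + (1.0 - a).sqrt() * eps

def diffusion_loss(eps_model, z0, cond):
    # L_diff = E || eps_theta(z_t, t, c) - eps ||^2
    t = torch.randint(0, T, (z0.shape[0],))
    eps = torch.randn_like(z0)
    z_t = noise(z0, t, eps)
    return ((eps_model(z_t, t, cond) - eps) ** 2).mean()

@torch.no_grad()
def cfg_reverse_step(eps_model, z_t, t, cond, null_cond, s=4.0):
    # One reverse step p_theta(z_{t-1} | z_t, c) with classifier-free guidance.
    eps_c = eps_model(z_t, t, cond)              # conditional prediction
    eps_u = eps_model(z_t, t, null_cond)         # unconditional prediction (c_null)
    eps_hat = eps_u + s * (eps_c - eps_u)        # guided noise estimate, s ~ 4
    beta_t, a_bar = betas[t], alpha_bar[t]
    mean = (z_t - beta_t / (1.0 - a_bar).sqrt() * eps_hat) / (1.0 - beta_t).sqrt()
    if t == 0:
        return mean
    return mean + beta_t.sqrt() * torch.randn_like(z_t)   # Sigma_t = beta_t * I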

This architecture prioritizes efficient learning in the more tractable latent space and allows variable-length audio generation (Huang et al., 2023).

2. Temporal Transformer-Based Diffusion Denoiser

Make-An-Audio 2 replaces the 2D U-Net backbone of earlier methods with a feed-forward transformer acting solely along the temporal axis of the latent variable $z_t \in \mathbb{R}^{d \times L}$. This temporal modeling design includes:

  • An initial 1D convolution for channel mixing: $\hat{z} = \mathrm{Conv1D}(z_t)$,
  • Addition of the timestep embedding $\tau(t)$ and the conditional text embedding $c$ via FiLM or simple addition: $\hat{z} \leftarrow \hat{z} + W_t \tau(t) + W_c c$,
  • Temporal self-attention (“FFT block”): $Q = W^Q \hat{z},\ K = W^K \hat{z},\ V = W^V \hat{z}$, followed by scaled dot-product attention with residual connections,
  • Position-wise feedforward layers with GELU activation and layer normalization.

The model stacks $S = 8$ FFT blocks with $H = 8$ heads and hidden dimensionality $d = 576$. Attention cost scales as $O(L^2 \cdot d)$, supporting variable-length audio inputs. This structure improves the handling of true temporal relationships, avoiding artifacts from 2D “image-like” architectures that ignore audio's sequential structure (Huang et al., 2023).
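
A minimal PyTorch sketch of this backbone is given below. It assumes the latent channel count equals the hidden dimensionality $d$ and that the timestep and text embeddings are already $d$-dimensional vectors; these choices are illustrative assumptions, not the authors' exact implementation.

import torch
from torch import nn

class FFTBlock(nn.Module):
    # Temporal self-attention + position-wise feedforward, with residuals and layer norm.
    def __init__(self, d=576, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, h):                        # h: (B, L, d); attention runs over L only
        a, _ = self.attn(h, h, h)
        h = self.norm1(h + a)
        return self.norm2(h + self.ffn(h))

class TemporalDenoiser(nn.Module):
    def __init__(self, d=576, heads=8, blocks=8):
        super().__init__()
        self.mix = nn.Conv1d(d, d, kernel_size=7, padding=3)   # initial 1D channel mixing
        self.w_t = nn.Linear(d, d)               # projects timestep embedding tau(t)
        self.w_c = nn.Linear(d, d)               # projects fused text condition c
        self.blocks = nn.ModuleList([FFTBlock(d, heads) for _ in range(blocks)])
        self.out = nn.Linear(d, d)               # predicts the noise epsilon

    def forward(self, z_t, t_emb, c):            # z_t: (B, d, L); t_emb, c: (B, d)
        h = self.mix(z_t).transpose(1, 2)        # -> (B, L, d) so attention is over time
        h = h + self.w_t(t_emb)[:, None, :] + self.w_c(c)[:, None, :]
        for blk in self.blocks:
            h = blk(h)
        return self.out(h).transpose(1, 2)       # back to (B, d, L)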

3. Temporal Parsing and Dual Text Encoder Fusion

Key to Make-An-Audio 2's improvements in semantic and temporal alignment is temporal parsing with a dual text encoder system. Natural-language captions $y$ are transformed via an LLM (e.g., GPT-4) into structured representations:

$$y_s = \langle \text{event}_1\ \&\ \text{order}_1 \rangle\, @\, \cdots\, @\, \langle \text{event}_N\ \&\ \text{order}_N \rangle$$

This enables explicit modeling of event order (e.g., <car door slamming & start> @ <footsteps & mid> …). Extraction is performed by prompting the LLM with a fixed template.

This structured text $y_s$ is encoded by a fine-tuned T5 “temporal encoder”, $f_{\mathrm{temp}}(y_s) \mapsto c_{\mathrm{temp}}$. The original caption $y$ is also encoded via a frozen CLAP model, $f_{\mathrm{text}}(y) \mapsto c_{\mathrm{main}}$. The two embeddings are fused:

$$c = W_{\mathrm{out}}\, [\, c_{\mathrm{main}} \,\|\, c_{\mathrm{temp}}\, ]$$

with $W_{\mathrm{out}}$ a learned linear layer and $\|$ denoting vector concatenation. This dual-encoder scheme assigns temporal reasoning to the LLM+T5 stack while CLAP preserves fidelity to semantic style and detail, significantly enhancing alignment for complex, temporally structured prompts (Huang et al., 2023).
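
A minimal sketch of this fusion step follows; clap_encode and t5_encode are placeholders for the frozen CLAP text encoder and the fine-tuned T5 temporal encoder, and the embedding dimensions are illustrative assumptions.

import torch
from torch import nn

class DualTextFusion(nn.Module):
    def __init__(self, clap_dim=512, t5_dim=1024, cond_dim=576):
        super().__init__()
        self.w_out = nn.Linear(clap_dim + t5_dim, cond_dim)    # learned W_out

    def forward(self, c_main, c_temp):
        # c = W_out [c_main || c_temp]: concatenate the CLAP caption embedding
        # with the T5 encoding of the structured caption y_s, then project.
        return self.w_out(torch.cat([c_main, c_temp], dim=-1))

# Usage (placeholders): c = DualTextFusion()(clap_encode(y), t5_encode(y_s))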

4. Data Augmentation via LLM-Guided Synthesis

To address data scarcity for temporally-annotated T2A supervision, Make-An-Audio 2 synthesizes approximately 61k extra training samples using LLM-guided augmentation:

  1. From a database $\mathcal{D}$ of single-label audio clips $\{(a_i, \ell_i)\}$, $N \in \{2, 3\}$ samples are chosen and temporally concatenated (with optional overlap).
  2. Each event is labeled by temporal position (“start”, “mid”, “end”, “all”).
  3. The resulting structured caption $y_s$ is generated.
  4. An LLM rewrites $y_s$ into a coherent natural-language caption $y_{\mathrm{nl}}$.

Pseudocode for this augmentation:

for m in 1..M_aug:
    N ← UniformChoice({2, 3})                          # number of events to mix
    clips ← sample N distinct (a_k, l_k) from D        # single-label audio clips
    a_mix, {(e_k, o_k)} ← concatenate_and_mix(clips)   # events e_k with order tags o_k
    y_s ← build_structured({(e_k, o_k)})               # e.g., "<e_1 & start> @ <e_2 & end>"
    y_nl ← LLM(prompt, y_s)                            # rewrite into natural language
    save (a_mix, y_nl)
end

This strategy yields a large, high-quality, temporally diverse training set that supports robust learning for variable-length conditional generation (Huang et al., 2023).

5. Training Protocol and Loss Functions

The training corpus comprises 0.92M audio-text pairs (≈3.7k hours), pooled from AudioCaps, WavCaps, AudioSet, ESC-50, FSD50K, TUT, EpidemicSound, AdobeStock, UrbanSound, etc., augmented by the 61k LLM-built pairs and pseudo-prompts.

  • VAE Training: The loss combines the following terms (a minimal sketch follows this list):
    • $L_1$ reconstruction: $\mathcal{L}_{\mathrm{rec}} = \|x - D(E(x))\|_1$
    • a GAN realism penalty: $\mathcal{L}_{\mathrm{GAN}}$
    • a latent KL penalty: $\mathcal{L}_{\mathrm{KL}}$

$$\mathcal{L}_{\mathrm{VAE}} = \lambda_{\mathrm{rec}} \mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{GAN}} \mathcal{L}_{\mathrm{GAN}} + \lambda_{\mathrm{KL}} \mathcal{L}_{\mathrm{KL}}$$

  • Diffusion Training: Mean-squared noise regression loss $\mathcal{L}_{\mathrm{diff}}$, using either learnable schedules $\{\beta_t\}$ or cosine schedules, with classifier-free guidance at $s = 4$.
  • Text Encoder Training: CLAP is frozen, T5 is fine-tuned on structured captions, and $W_{\mathrm{out}}$ is trained end-to-end.
  • Optimization: AdamW optimizer, $\mathrm{lr} = 9.6 \times 10^{-5}$ for diffusion, batch size $32 \times 8$ GPUs, for $\sim 1.2$M total steps. The VAE is trained for $800$k steps ($\mathrm{lr} = 1.44 \times 10^{-4}$).
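
As an illustration of how the three VAE terms combine, here is a minimal sketch; the $\lambda$ weights and the exact GAN loss form are placeholder assumptions rather than the paper's reported settings.

import torch
import torch.nn.functional as F

def vae_loss(x, x_rec, mu, logvar, disc_fake_logits,
             lam_rec=1.0, lam_gan=0.5, lam_kl=1e-6):
    l_rec = (x - x_rec).abs().mean()                                # L1 reconstruction
    l_kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).mean()   # KL to N(0, I)
    l_gan = F.softplus(-disc_fake_logits).mean()                    # non-saturating generator term
    return lam_rec * l_rec + lam_gan * l_gan + lam_kl * l_kl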

A table summarizing the training configuration:

| Component | Architecture | Key Hyperparameters |
| --- | --- | --- |
| Denoiser | FFT transformer | 8 blocks, 8 heads, $d = 576$ |
| Conv1D | 1D convolution | kernel size = 7, padding = 3 |
| VAE optimizer | AdamW | $\mathrm{lr} = 1.44 \times 10^{-4}$ |
| Diffusion optimizer | AdamW | $\mathrm{lr} = 9.6 \times 10^{-5}$ |
| GPUs | – | $32 \times 8$ |

6. Experimental Evaluation and Comparative Results

Make-An-Audio 2 is evaluated on AudioCaps (test) and in zero-shot settings on Clotho and AudioCaps-short. Metrics include Fréchet Distance (FD), Inception Score (IS), KL divergence (KL), Fréchet Audio Distance (FAD), CLAP score, and Mean Opinion Scores for quality and faithfulness (MOS-Q/MOS-F).

  • On AudioCaps (100 DDIM steps, 937M parameters): FD = 11.75, IS = 11.16, KL = 1.32, FAD = 1.80, CLAP = 0.645, MOS-Q = 78.31, MOS-F = 75.63.
  • By comparison, the previous SOTA (TANGO, 1.21B params): FD = 26.13, IS = 8.23, KL = 1.37, FAD = 1.87, CLAP = 0.650, MOS-F = 72.76.
  • The predecessor Make-An-Audio: FD = 18.32, MOS-F = 65.45.

Zero-shot results on Clotho and short AudioCaps exhibit strong gains over all baselines, particularly in temporal and semantic metrics as well as on variable-length audio synthesis.

Ablations show:

  • Adding the structured T5 encoder to CLAP yields substantial FD/IS improvements.
  • Dual-encoder fusion outperforms a single T5 encoder on all alignment criteria.
  • Replacing the 2D U-Net with the temporal FFT backbone matches and then exceeds prior performance on variable-length and temporally complex data.

In summary, the innovations of Make-An-Audio 2—temporal parsing, dual-encoder fusion, temporal transformer-based diffusion, and augmented training data—are collectively responsible for advances in T2A generation quality and flexibility, establishing new state-of-the-art results for the field (Huang et al., 2023).

References

Huang et al. (2023). Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation. arXiv:2305.18474.