
Make-An-Audio 2: Advanced Text-to-Audio Generation

Updated 26 November 2025
  • Make-An-Audio 2 is a latent diffusion-based text-to-audio generation model that uses temporal parsing and dual text encoders to improve semantic alignment and variable-length audio synthesis.
  • It replaces conventional 2D U-Nets with a temporal transformer-based denoiser, enhancing temporal consistency and reducing artifacts in generated audio.
  • The model leverages LLM-guided data augmentation to expand training samples, achieving superior objective and subjective performance on standard T2A benchmarks.

Make-An-Audio 2 is a latent diffusion-based text-to-audio (T2A) generation model designed to address shortcomings in semantic alignment and temporal consistency that affect prior T2A systems. Traditional approaches, including those relying on 2D spatial structures (e.g., 2D U-Nets), often suffer from misaligned semantics and poor handling of variable-length audio, with limited modeling of temporal information. Make-An-Audio 2 introduces temporal parsing, dual text encoders, a transformer-based denoiser focused on the temporal dimension, and LLM-powered data augmentation to achieve state-of-the-art results in both objective and subjective metrics on standard T2A benchmarks (Huang et al., 2023).

1. Latent Diffusion Framework for T2A Generation

Make-An-Audio 2 models audio synthesis as a latent-space denoising diffusion process over mel-spectrogram representations. Given an input mel-spectrogram $x \in \mathbb{R}^{C_a \times T}$, a 1D-convolutional variational autoencoder (VAE) encoder $E$ compresses $x$ into a latent code $z = E(x) \in \mathbb{R}^{d \times L}$, where $L$ scales with the audio duration.

During training, diffusion operates in this latent space over $T$ discrete steps, with a fixed noise schedule $\{\beta_t\} \subset (0,1)$:

  • Forward (Noising) Process: At step $t$,

$$q(z_t \mid z_{t-1}) = \mathcal{N}\!\left(\sqrt{1-\beta_t}\, z_{t-1},\; \beta_t I\right)$$

By induction, $q(z_t \mid z_0) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}_t}\, z_0,\; (1-\bar{\alpha}_t) I\right)$ with $\bar{\alpha}_t \equiv \prod_{s=1}^{t} (1-\beta_s)$.

  • Reverse (Denoising) Process: A neural network $\epsilon_\theta$ parameterizes the denoising step as

$$p_\theta(z_{t-1} \mid z_t, c) = \mathcal{N}\!\left(\mu_\theta(z_t, t, c),\; \Sigma_t\right)$$

where typically

$$\mu_\theta = \frac{1}{\sqrt{1-\beta_t}} \left(z_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(z_t, t, c)\right)$$

and $\Sigma_t = \beta_t I$.

  • Training Objective: The loss is a mean squared error in noise space:

$$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{z_0, \epsilon, t}\, \big\|\epsilon_\theta(z_t, t, c) - \epsilon\big\|_2^2$$

with $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$.

  • Classifier-Free Guidance: To control the tradeoff between generation diversity and conditioning faithfulness, classifier-free guidance is used:

$$\tilde{\epsilon}_\theta(z_t, t, c) = \epsilon_\theta(z_t, t, c_\varnothing) + s\left(\epsilon_\theta(z_t, t, c) - \epsilon_\theta(z_t, t, c_\varnothing)\right)$$

with $s \approx 4$ found to be the optimal guidance scale (a minimal code sketch of these steps follows this list).
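
The closed-form noising, the noise-regression objective, and a guided reverse step can be written compactly. The following is a minimal PyTorch sketch, not the authors' implementation; eps_model stands in for the conditional denoiser $\epsilon_\theta$, and its call signature is an assumption.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # fixed noise schedule {beta_t}
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # bar{alpha}_t = prod_s (1 - beta_s)

def noise(z0, t, eps):
    # Sample z_t ~ q(z_t | z_0) in closed form.
    a = alpha_bar[t].view(-1, 1, 1)              # broadcast over (d, L)
    return a.sqrt() * z0 + (1.0 - a).sqrt() * eps

def diffusion_loss(eps_model, z0, cond):
    # L_diff = E || eps_theta(z_t, t, c) - eps ||^2
    t = torch.randint(0, T, (z0.shape[0],))
    eps = torch.randn_like(z0)
    z_t = noise(z0, t, eps)
    return ((eps_model(z_t, t, cond) - eps) ** 2).mean()

@torch.no_grad()
def cfg_reverse_step(eps_model, z_t, t, cond, null_cond, s=4.0):
    # One reverse step p_theta(z_{t-1} | z_t, c) with classifier-free guidance.
    eps_c = eps_model(z_t, t, cond)              # conditional prediction
    eps_u = eps_model(z_t, t, null_cond)         # unconditional prediction (c_null)
    eps_hat = eps_u + s * (eps_c - eps_u)        # guided noise estimate, s ~ 4
    beta_t, a_bar = betas[t], alpha_bar[t]
    mean = (z_t - beta_t / (1.0 - a_bar).sqrt() * eps_hat) / (1.0 - beta_t).sqrt()
    if t == 0:
        return mean
    return mean + beta_t.sqrt() * torch.randn_like(z_t)   # Sigma_t = beta_t * I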

This architecture prioritizes efficient learning in the more tractable latent space and allows variable-length audio generation (Huang et al., 2023).

2. Temporal Transformer-Based Diffusion Denoiser

Make-An-Audio 2 replaces the 2D U-Net backbone of earlier methods with a feed-forward transformer acting solely along the temporal axis of the latent variable $z_t \in \mathbb{R}^{d \times L}$. This temporal modeling design includes:

  • An initial 1D convolution for channel mixing: $\hat{z} = \mathrm{Conv1D}(z_t)$,
  • Addition of the timestep embedding $\tau(t)$ and the conditional text embedding $c$ via FiLM or simple addition: $\hat{z} \leftarrow \hat{z} + W_t \tau(t) + W_c c$,
  • Temporal self-attention (“FFT block”): $Q = W^Q \hat{z},\ K = W^K \hat{z},\ V = W^V \hat{z}$, followed by scaled dot-product attention with residual connections,
  • Position-wise feedforward layers with GELU activation and layer normalization.

The model stacks $S = 8$ FFT blocks with $H = 8$ heads and hidden dimensionality $d = 576$. Attention cost scales as $O(L^2 \cdot d)$, supporting variable-length audio inputs. This structure improves the handling of true temporal relationships, avoiding artifacts from 2D “image-like” architectures that ignore audio's sequential structure (Huang et al., 2023).
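
A minimal PyTorch sketch of this backbone is given below. It assumes the latent channel count equals the hidden dimensionality $d$ and that the timestep and text embeddings are already $d$-dimensional vectors; these choices are illustrative assumptions, not the authors' exact implementation.

import torch
from torch import nn

class FFTBlock(nn.Module):
    # Temporal self-attention + position-wise feedforward, with residuals and layer norm.
    def __init__(self, d=576, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, h):                        # h: (B, L, d); attention runs over L only
        a, _ = self.attn(h, h, h)
        h = self.norm1(h + a)
        return self.norm2(h + self.ffn(h))

class TemporalDenoiser(nn.Module):
    def __init__(self, d=576, heads=8, blocks=8):
        super().__init__()
        self.mix = nn.Conv1d(d, d, kernel_size=7, padding=3)   # initial 1D channel mixing
        self.w_t = nn.Linear(d, d)               # projects timestep embedding tau(t)
        self.w_c = nn.Linear(d, d)               # projects fused text condition c
        self.blocks = nn.ModuleList([FFTBlock(d, heads) for _ in range(blocks)])
        self.out = nn.Linear(d, d)               # predicts the noise epsilon

    def forward(self, z_t, t_emb, c):            # z_t: (B, d, L); t_emb, c: (B, d)
        h = self.mix(z_t).transpose(1, 2)        # -> (B, L, d) so attention is over time
        h = h + self.w_t(t_emb)[:, None, :] + self.w_c(c)[:, None, :]
        for blk in self.blocks:
            h = blk(h)
        return self.out(h).transpose(1, 2)       # back to (B, d, L)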

3. Temporal Parsing and Dual Text Encoder Fusion

Key to Make-An-Audio 2's improvements in semantic and temporal alignment is temporal parsing with a dual text encoder system. Natural-language captions $y$ are transformed via an LLM (e.g., GPT-4) into structured representations:

$$y_s = \langle \text{event}_1\ \&\ \text{order}_1 \rangle\, @\, \cdots\, @\, \langle \text{event}_N\ \&\ \text{order}_N \rangle$$

This enables explicit modeling of event order (e.g., <car door slamming & start> @ <footsteps & mid> …). Extraction is performed by prompting the LLM with a fixed template.

This structured text $y_s$ is encoded by a fine-tuned T5 “temporal encoder”, $f_{\mathrm{temp}}(y_s) \mapsto c_{\mathrm{temp}}$. The original caption $y$ is also encoded via a frozen CLAP model, $f_{\mathrm{text}}(y) \mapsto c_{\mathrm{main}}$. The two embeddings are fused:

$$c = W_{\mathrm{out}}\, [\, c_{\mathrm{main}} \,\|\, c_{\mathrm{temp}}\, ]$$

with $W_{\mathrm{out}}$ a learned linear layer and $\|$ denoting vector concatenation. This dual-encoder scheme assigns temporal reasoning to the LLM+T5 stack while CLAP preserves fidelity to semantic style and detail, significantly enhancing alignment for complex, temporally structured prompts (Huang et al., 2023).
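
A minimal sketch of this fusion step follows; clap_encode and t5_encode are placeholders for the frozen CLAP text encoder and the fine-tuned T5 temporal encoder, and the embedding dimensions are illustrative assumptions.

import torch
from torch import nn

class DualTextFusion(nn.Module):
    def __init__(self, clap_dim=512, t5_dim=1024, cond_dim=576):
        super().__init__()
        self.w_out = nn.Linear(clap_dim + t5_dim, cond_dim)    # learned W_out

    def forward(self, c_main, c_temp):
        # c = W_out [c_main || c_temp]: concatenate the CLAP caption embedding
        # with the T5 encoding of the structured caption y_s, then project.
        return self.w_out(torch.cat([c_main, c_temp], dim=-1))

# Usage (placeholders): c = DualTextFusion()(clap_encode(y), t5_encode(y_s))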

4. Data Augmentation via LLM-Guided Synthesis

To address data scarcity for temporally-annotated T2A supervision, Make-An-Audio 2 synthesizes approximately 61k extra training samples using LLM-guided augmentation:

  1. From a database $\mathcal{D}$ of single-label audio clips $\{(a_i, \ell_i)\}$, $N \in \{2, 3\}$ samples are chosen and temporally concatenated (with optional overlap).
  2. Each event is labeled by temporal position (“start”, “mid”, “end”, “all”).
  3. The resulting structured caption $y_s$ is generated.
  4. An LLM rewrites $y_s$ into a coherent natural-language caption $y_{\mathrm{nl}}$.

Pseudocode for this augmentation:

for m in 1..M_aug:
    N ← UniformChoice({2, 3})                          # number of events to mix
    clips ← sample N distinct (a_k, l_k) from D        # single-label audio clips
    a_mix, {(e_k, o_k)} ← concatenate_and_mix(clips)   # events e_k with order tags o_k
    y_s ← build_structured({(e_k, o_k)})               # e.g., "<e_1 & start> @ <e_2 & end>"
    y_nl ← LLM(prompt, y_s)                            # rewrite into natural language
    save (a_mix, y_nl)
end

This strategy yields a large, high-quality, temporally diverse training set that supports robust learning for variable-length conditional generation (Huang et al., 2023).

5. Training Protocol and Loss Functions

The training corpus comprises 0.92M audio-text pairs (≈3.7k hours), pooled from AudioCaps, WavCaps, AudioSet, ESC-50, FSD50K, TUT, EpidemicSound, AdobeStock, UrbanSound, etc., augmented by the 61k LLM-built pairs and pseudo-prompts.

  • VAE Training: The loss combines the following terms (a minimal sketch follows this list):
    • $L_1$ reconstruction: $\mathcal{L}_{\mathrm{rec}} = \|x - D(E(x))\|_1$
    • a GAN realism penalty: $\mathcal{L}_{\mathrm{GAN}}$
    • a latent KL penalty: $\mathcal{L}_{\mathrm{KL}}$

$$\mathcal{L}_{\mathrm{VAE}} = \lambda_{\mathrm{rec}} \mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{GAN}} \mathcal{L}_{\mathrm{GAN}} + \lambda_{\mathrm{KL}} \mathcal{L}_{\mathrm{KL}}$$

  • Diffusion Training: Mean-squared noise regression loss $\mathcal{L}_{\mathrm{diff}}$, using either learnable schedules $\{\beta_t\}$ or cosine schedules, with classifier-free guidance at $s = 4$.
  • Text Encoder Training: CLAP is frozen, T5 is fine-tuned on structured captions, and $W_{\mathrm{out}}$ is trained end-to-end.
  • Optimization: AdamW optimizer, $\mathrm{lr} = 9.6 \times 10^{-5}$ for diffusion, batch size $32 \times 8$ GPUs, for $\sim 1.2$M total steps. The VAE is trained for $800$k steps ($\mathrm{lr} = 1.44 \times 10^{-4}$).
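
As an illustration of how the three VAE terms combine, here is a minimal sketch; the $\lambda$ weights and the exact GAN loss form are placeholder assumptions rather than the paper's reported settings.

import torch
import torch.nn.functional as F

def vae_loss(x, x_rec, mu, logvar, disc_fake_logits,
             lam_rec=1.0, lam_gan=0.5, lam_kl=1e-6):
    l_rec = (x - x_rec).abs().mean()                                # L1 reconstruction
    l_kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).mean()   # KL to N(0, I)
    l_gan = F.softplus(-disc_fake_logits).mean()                    # non-saturating generator term
    return lam_rec * l_rec + lam_gan * l_gan + lam_kl * l_kl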

A table summarizing the training configuration:

| Component | Architecture | Key Hyperparameters |
| --- | --- | --- |
| Denoiser | FFT transformer | 8 blocks, 8 heads, $d = 576$ |
| Conv1D | 1D convolution | kernel size = 7, padding = 3 |
| VAE optimizer | AdamW | $\mathrm{lr} = 1.44 \times 10^{-4}$ |
| Diffusion optimizer | AdamW | $\mathrm{lr} = 9.6 \times 10^{-5}$ |
| GPUs | – | $32 \times 8$ |

6. Experimental Evaluation and Comparative Results

Make-An-Audio 2 is evaluated on AudioCaps (test) and in zero-shot settings on Clotho and AudioCaps-short. Metrics include Fréchet Distance (FD), Inception Score (IS), KL divergence (KL), Fréchet Audio Distance (FAD), CLAP score, and Mean Opinion Scores for quality and faithfulness (MOS-Q/MOS-F).

  • On AudioCaps (100 DDIM steps, 937M parameters): FD = 11.75, IS = 11.16, KL = 1.32, FAD = 1.80, CLAP = 0.645, MOS-Q = 78.31, MOS-F = 75.63.
  • By comparison, the previous SOTA (TANGO, 1.21B params): FD = 26.13, IS = 8.23, KL = 1.37, FAD = 1.87, CLAP = 0.650, MOS-F = 72.76.
  • The predecessor Make-An-Audio: FD = 18.32, MOS-F = 65.45.

Zero-shot results on Clotho and short AudioCaps exhibit strong gains over all baselines, particularly in temporal and semantic metrics as well as on variable-length audio synthesis.

Ablations show:

  • Adding the structured T5 encoder to CLAP yields substantial FD/IS improvements.
  • Dual-encoder fusion outperforms a single T5 encoder on all alignment criteria.
  • Replacing the 2D U-Net with the temporal FFT backbone matches and then exceeds prior performance on variable-length and temporally complex data.

In summary, the innovations of Make-An-Audio 2—temporal parsing, dual-encoder fusion, temporal transformer-based diffusion, and augmented training data—are collectively responsible for advances in T2A generation quality and flexibility, establishing new state-of-the-art results for the field (Huang et al., 2023).

References

Huang et al. (2023). Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation. arXiv:2305.18474.