Make-An-Audio 2: Advanced Text-to-Audio Generation
- Make-An-Audio 2 is a latent diffusion-based text-to-audio generation model that uses temporal parsing and dual text encoders to improve semantic alignment and variable-length audio synthesis.
- It replaces conventional 2D U-Nets with a temporal transformer-based denoiser, enhancing temporal consistency and reducing artifacts in generated audio.
- The model leverages LLM-guided data augmentation to expand training samples, achieving superior objective and subjective performance on standard T2A benchmarks.
Make-An-Audio 2 is a latent diffusion-based text-to-audio (T2A) generation model designed to address shortcomings in semantic alignment and temporal consistency that affect prior T2A systems. Traditional approaches, including those relying on 2D spatial structures (e.g., 2D U-Nets), often suffer from misaligned semantics and poor handling of variable-length audio, with limited modeling of temporal information. Make-An-Audio 2 introduces temporal parsing, dual text encoders, a transformer-based denoiser focused on the temporal dimension, and LLM-powered data augmentation to achieve state-of-the-art results in both objective and subjective metrics on standard T2A benchmarks (Huang et al., 2023).
1. Latent Diffusion Framework for T2A Generation
Make-An-Audio 2 models audio synthesis as a latent-space denoising diffusion process over mel-spectrogram representations. Given an input mel-spectrogram $x \in \mathbb{R}^{C_{\mathrm{mel}} \times T}$, a 1D-convolutional variational autoencoder (VAE) encoder compresses $x$ into a latent code $z_0 = E(x) \in \mathbb{R}^{C \times L}$, where the temporal length $L$ scales with audio duration.
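As a minimal sketch of this compression step (channel counts, strides, and the downsampling factor are illustrative assumptions, not the paper's values), a 1D-convolutional VAE encoder over mel-spectrograms might look like:

```python
import torch
import torch.nn as nn

class Conv1DVAEEncoder(nn.Module):
    """Sketch of a 1D-conv VAE encoder over mel-spectrograms.

    Treats mel bins as channels and convolves along time, so the latent
    length scales with audio duration. Sizes are illustrative assumptions.
    """
    def __init__(self, n_mels=80, latent_channels=20, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=7, stride=2, padding=3),
            nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, stride=2, padding=3),
            nn.GELU(),
        )
        # Separate heads predict the Gaussian posterior parameters.
        self.to_mu = nn.Conv1d(hidden, latent_channels, kernel_size=3, padding=1)
        self.to_logvar = nn.Conv1d(hidden, latent_channels, kernel_size=3, padding=1)

    def forward(self, mel):                      # mel: (B, n_mels, T)
        h = self.net(mel)                        # (B, hidden, ~T/4)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return z, mu, logvar                     # z: (B, latent_channels, ~T/4)
```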
During training, diffusion operates in this latent space over $T$ discrete steps, with a fixed noise schedule $\{\beta_t\}_{t=1}^{T}$:
- Forward (Noising) Process: At step $t$, $q(z_t \mid z_{t-1}) = \mathcal{N}\big(z_t;\ \sqrt{1-\beta_t}\,z_{t-1},\ \beta_t I\big)$. By induction, $q(z_t \mid z_0) = \mathcal{N}\big(z_t;\ \sqrt{\bar{\alpha}_t}\,z_0,\ (1-\bar{\alpha}_t)I\big)$ with $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$.
- Reverse (Denoising) Process: A neural network $\epsilon_\theta$ parameterizes the denoising step as $p_\theta(z_{t-1} \mid z_t, c) = \mathcal{N}\big(z_{t-1};\ \mu_\theta(z_t, t, c),\ \sigma_t^2 I\big)$, where typically $\mu_\theta(z_t, t, c) = \tfrac{1}{\sqrt{1-\beta_t}}\big(z_t - \tfrac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(z_t, t, c)\big)$ and $\sigma_t^2 = \beta_t$.
- Training Objective: The loss is a mean squared error in noise space, $\mathcal{L}_\theta = \mathbb{E}_{z_0,\epsilon,t}\big[\|\epsilon - \epsilon_\theta(z_t, t, c)\|_2^2\big]$, with $\epsilon \sim \mathcal{N}(0, I)$ and $z_t = \sqrt{\bar{\alpha}_t}\,z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$.
- Classifier-Free Guidance: To control the tradeoff between generation diversity and conditioning faithfulness, classifier-free guidance is used at sampling time: $\tilde{\epsilon}_\theta(z_t, t, c) = \epsilon_\theta(z_t, t, \varnothing) + s\,\big(\epsilon_\theta(z_t, t, c) - \epsilon_\theta(z_t, t, \varnothing)\big)$, where $s$ is the guidance scale and $\varnothing$ denotes the null (dropped) text condition.
This architecture prioritizes efficient learning in the more tractable latent space and allows variable-length audio generation (Huang et al., 2023).
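The noising, loss, and guidance equations above can be sketched compactly as follows; `eps_model`, the precomputed `alpha_bar` buffer, and the per-batch condition dropout are illustrative assumptions rather than the paper's implementation:

```python
import torch

def diffusion_training_loss(eps_model, z0, cond, alpha_bar, p_uncond=0.1):
    """MSE in noise space: sample t, noise z0, and regress the added noise.

    `alpha_bar` holds the cumulative products of (1 - beta_t). The text
    condition is occasionally replaced by a null embedding (per-batch here,
    for simplicity) so classifier-free guidance is possible at sampling time.
    """
    B = z0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (B,), device=z0.device)
    eps = torch.randn_like(z0)
    ab = alpha_bar[t].view(B, *([1] * (z0.dim() - 1)))
    z_t = ab.sqrt() * z0 + (1 - ab).sqrt() * eps          # forward noising
    if torch.rand(()) < p_uncond:                          # condition dropout
        cond = torch.zeros_like(cond)
    return torch.mean((eps - eps_model(z_t, t, cond)) ** 2)

def guided_eps(eps_model, z_t, t, cond, null_cond, scale):
    """Classifier-free guidance: mix conditional and unconditional predictions."""
    e_c = eps_model(z_t, t, cond)
    e_u = eps_model(z_t, t, null_cond)
    return e_u + scale * (e_c - e_u)
```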
2. Temporal Transformer-Based Diffusion Denoiser
Make-An-Audio 2 replaces the 2D U-Net backbone of earlier methods with a feed-forward transformer acting solely along the temporal axis of the latent variable $z$. This temporal modeling design includes:
- An initial 1D convolution for channel mixing across the latent feature dimension,
- Injection of the diffusion timestep embedding and the conditional text embedding $c$ via FiLM modulation or simple addition,
- Temporal self-attention (“FFT block”): layer normalization followed by scaled dot-product attention over the time axis, with residual connections,
- Position-wise feedforward layers with GELU activation and layer normalization.
The model stacks 8 FFT blocks with 8 attention heads (see the configuration table in Section 5). Attention cost scales quadratically with the latent sequence length, and the purely temporal design supports variable-length audio inputs. This structure improves handling of true temporal relationships, avoiding artifacts from 2D “image-like” architectures that ignore audio’s sequential structure (Huang et al., 2023).
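A hedged sketch of one such temporal block follows; the FiLM-style conditioning, layer ordering, and dimensions are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class TemporalFFTBlock(nn.Module):
    """Feed-forward transformer block that attends only along the time axis."""
    def __init__(self, dim=512, heads=8, cond_dim=1024, ff_mult=4):
        super().__init__()
        self.film = nn.Linear(cond_dim, 2 * dim)   # scale/shift from timestep + text cond
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, ff_mult * dim), nn.GELU(), nn.Linear(ff_mult * dim, dim)
        )

    def forward(self, h, cond):                    # h: (B, L, dim), cond: (B, cond_dim)
        scale, shift = self.film(cond).chunk(2, dim=-1)
        h = h * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)   # FiLM conditioning
        x = self.norm1(h)
        a, _ = self.attn(x, x, x)                  # temporal self-attention
        h = h + a                                  # residual connection
        return h + self.ff(self.norm2(h))          # position-wise feed-forward
```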
3. Temporal Parsing and Dual Text Encoder Fusion
Key to Make-An-Audio 2’s improvements in semantic and temporal alignment is temporal parsing with a dual text encoder system. Natural-language captions $y$ are transformed via an LLM (e.g., GPT-4) into structured representations $y_s$ consisting of <event & order> pairs. This enables explicit modeling of event order (e.g., “car door slamming& start@ footsteps& mid@ ...”). Extraction is performed by prompting the LLM with a fixed template.
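Assuming the “&”/“@” delimiters shown in the example above (the exact template and separators are treated here as an assumption), a structured caption can be parsed back into (event, order) pairs as follows:

```python
def parse_structured_caption(y_s: str):
    """Split a structured caption such as
    'car door slamming& start@ footsteps& mid' into (event, order) pairs."""
    pairs = []
    for chunk in y_s.split("@"):
        if "&" not in chunk:
            continue
        event, order = chunk.rsplit("&", 1)
        pairs.append((event.strip(), order.strip()))
    return pairs

# Example:
# parse_structured_caption("car door slamming& start@ footsteps& mid")
# -> [("car door slamming", "start"), ("footsteps", "mid")]
```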
This structured text is encoded by a fine-tuned T5 “temporal encoder” $f_{\mathrm{T5}}$. The original textual caption $y$ is also encoded via a frozen CLAP model $f_{\mathrm{CLAP}}$. The two embeddings are fused as $c = \mathrm{Linear}\big([\,f_{\mathrm{T5}}(y_s)\,;\,f_{\mathrm{CLAP}}(y)\,]\big)$, with $\mathrm{Linear}$ a learned linear layer and $[\cdot\,;\cdot]$ denoting vector concatenation. This dual-encoder scheme assigns temporal reasoning to the LLM+T5 stack while CLAP provides fidelity to semantic style and detail, significantly enhancing alignment for complex, temporally structured prompts (Huang et al., 2023).
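A minimal sketch of this fusion, assuming pooled (vector) embeddings from both encoders; the embedding sizes are illustrative, while the single learned linear layer over the concatenation follows the description above:

```python
import torch
import torch.nn as nn

class DualTextEncoderFusion(nn.Module):
    """Concatenate a fine-tuned T5 embedding of the structured caption with a
    frozen CLAP embedding of the original caption, then project linearly."""
    def __init__(self, t5_dim=1024, clap_dim=512, out_dim=1024):
        super().__init__()
        self.proj = nn.Linear(t5_dim + clap_dim, out_dim)

    def forward(self, t5_emb, clap_emb):           # (B, t5_dim), (B, clap_dim)
        fused = torch.cat([t5_emb, clap_emb], dim=-1)
        return self.proj(fused)                    # conditioning vector c
```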
4. Data Augmentation via LLM-Guided Synthesis
To address data scarcity for temporally-annotated T2A supervision, Make-An-Audio 2 synthesizes approximately 61k extra training samples using LLM-guided augmentation:
- From a database $\mathcal{D} = \{(a_i, l_i)\}$ of single-label audio clips, $N \in \{2, 3\}$ clips are chosen and temporally concatenated (with optional overlap).
- Each event is labeled by temporal position (“start”, “mid”, “end”, “all”).
- The resultant structured caption $y_s$ is generated from the event labels and their temporal positions.
- An LLM rewrites $y_s$ into a coherent natural-language caption $y_{\mathrm{nl}}$.
Pseudocode for this augmentation:
    for m in 1..M_aug:
        N ← UniformChoice({2, 3})
        pick N distinct (a_1, l_1), …, (a_N, l_N) from D
        (a_mix, {(e_k, o_k)}) ← concatenate_and_mix({(a_k, l_k)})
        y_s ← build_structured({(e_k, o_k)})
        y_nl ← LLM(prompt, y_s)
        save (a_mix, y_nl)
    end
This strategy yields a large, high-quality, temporally diverse training set that supports robust learning for variable-length conditional generation (Huang et al., 2023).
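The pseudocode above can be made concrete roughly as follows; `rewrite_with_llm`, the waveform database format, and the overlap heuristic are assumptions used only for illustration:

```python
import random
import numpy as np

def concatenate_and_mix(clips, sr=16000, overlap_s=0.5):
    """Concatenate single-event waveforms along time (with a small optional
    overlap) and record a coarse temporal position for each event."""
    mixed = np.zeros(0, dtype=np.float32)
    events = []
    for k, (wave, label) in enumerate(clips):
        offset = max(0, len(mixed) - int(overlap_s * sr)) if k else 0
        out = np.zeros(max(len(mixed), offset + len(wave)), dtype=np.float32)
        out[: len(mixed)] += mixed
        out[offset: offset + len(wave)] += wave
        mixed = out
        if len(clips) == 1:
            pos = "all"
        elif k == 0:
            pos = "start"
        elif k == len(clips) - 1:
            pos = "end"
        else:
            pos = "mid"
        events.append((label, pos))
    return mixed, events

def build_structured(events):
    """Render (event, order) pairs in the 'event& order@ ...' format."""
    return "@ ".join(f"{e}& {o}" for e, o in events)

def augment(db, n_aug, rewrite_with_llm):
    """db: list of (waveform, label) single-event clips."""
    out = []
    for _ in range(n_aug):
        n = random.choice([2, 3])
        clips = random.sample(db, n)
        a_mix, events = concatenate_and_mix(clips)
        y_s = build_structured(events)
        out.append((a_mix, rewrite_with_llm(y_s)))   # LLM produces a natural caption
    return out
```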
5. Training Protocol and Loss Functions
The training corpus comprises 0.92M audio-text pairs (≈3.7k hours), pooled from AudioCaps, WavCaps, AudioSet, ESC-50, FSD50K, TUT, EpidemicSound, AdobeStock, UrbanSound, etc., augmented by the 61k LLM-built pairs and pseudo-prompts.
- VAE Training: Loss combines
  - a reconstruction term between the input mel-spectrogram and its reconstruction,
  - an adversarial (GAN) realism term from a discriminator applied to reconstructions,
  - a KL penalty regularizing the latent distribution toward a standard Gaussian (a combined-loss sketch follows after this list).
- Diffusion Training: the mean-squared noise-regression loss $\mathcal{L}_\theta$ from Section 1, using either a learnable or a cosine noise schedule, with classifier-free guidance enabled by randomly dropping the text condition during training.
- Text Encoder Training: CLAP is frozen, T5 is fine-tuned on structured captions, and the linear fusion layer is trained end-to-end with the denoiser.
- Optimization: AdamW is used for both the VAE and the diffusion denoiser, with the diffusion model trained across multiple GPUs; the VAE is trained for 800k steps.
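A hedged sketch of how the three VAE loss terms listed above might be combined; the loss weights, the L1 reconstruction choice, and the discriminator interface are assumptions, not the paper's exact formulation:

```python
import torch

def vae_loss(x, x_hat, mu, logvar, disc, w_gan=0.5, w_kl=1e-6):
    """Reconstruction + adversarial realism + KL regularization."""
    rec = torch.mean(torch.abs(x - x_hat))                          # L1 reconstruction
    gan = -torch.mean(disc(x_hat))                                  # generator-side realism term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # KL to N(0, I)
    return rec + w_gan * gan + w_kl * kl
```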
A table summarizing the training configuration:
| Component | Architecture | Key Hyperparameters |
|---|---|---|
| Denoiser | FFT transformer | 8 blocks, 8 heads |
| Conv1D | 1D kernel | size=7, padding=3 |
| VAE optimizer | AdamW | — |
| Diffusion optimizer | AdamW | — |
| GPUs | — | — |
6. Experimental Evaluation and Comparative Results
Make-An-Audio 2 is evaluated on AudioCaps (test) and in zero-shot settings on Clotho and AudioCaps-short. Metrics include Fréchet Distance (FD), Inception Score (IS), KL divergence (KL), Fréchet Audio Distance (FAD), CLAP score, and Mean Opinion Scores for quality and faithfulness (MOS-Q/MOS-F).
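For reference, FD and FAD both reduce to a Fréchet distance between Gaussians fitted to embedding sets extracted by a pretrained audio encoder; a generic computation (not the paper's evaluation code) looks like this:

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_real, emb_fake):
    """Fréchet distance between Gaussians fitted to two embedding matrices
    of shape (num_samples, dim)."""
    mu_r, mu_f = emb_real.mean(axis=0), emb_fake.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_f = np.cov(emb_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):          # drop tiny imaginary numerical artifacts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```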
- On AudioCaps (100 DDIM steps, 937M parameters), Make-An-Audio 2 improves over compared systems on FD, IS, KL, FAD, and CLAP score, as well as on MOS-Q and MOS-F.
- The previous state of the art, TANGO (1.21B parameters), is outperformed on these objective metrics despite its larger parameter count.
- The predecessor Make-An-Audio is likewise surpassed across objective and subjective metrics.
Zero-shot results on Clotho and short AudioCaps exhibit strong gains over all baselines, particularly in temporal and semantic metrics as well as on variable-length audio synthesis.
Ablations show:
- Addition of structured T5 to CLAP yields substantial FD/IS improvements.
- Dual-encoder fusion outperforms single T5 on all alignment criteria.
- Replacing the 2D U-Net with the temporal FFT backbone recovers and then exceeds baseline performance on variable-length and temporally complex data.
In summary, the innovations of Make-An-Audio 2—temporal parsing, dual-encoder fusion, temporal transformer-based diffusion, and augmented training data—are collectively responsible for advances in T2A generation quality and flexibility, establishing new state-of-the-art results for the field (Huang et al., 2023).