
Fast Text-to-Audio Generation with Adversarial Post-Training (2505.08175v3)

Published 13 May 2025 in cs.SD, cs.AI, cs.LG, cs.MM, and eess.AS

Abstract: Text-to-audio systems, while increasingly performant, are slow at inference time, thus making their latency impractical for many creative applications. We present Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration algorithm for diffusion/flow models not based on distillation. While past adversarial post-training methods have struggled to compare against their expensive distillation counterparts, ARC post-training is a simple procedure that (1) extends a recent relativistic adversarial formulation to diffusion/flow post-training and (2) combines it with a novel contrastive discriminator objective to encourage better prompt adherence. We pair ARC post-training with a number of optimizations to Stable Audio Open and build a model capable of generating $\approx$12s of 44.1kHz stereo audio in $\approx$75ms on an H100, and $\approx$7s on a mobile edge-device, the fastest text-to-audio model to our knowledge.

Summary

  • The paper introduces ARC post-training that accelerates text-to-audio generation up to 100x by combining adversarial relativistic and contrastive losses.
  • It employs ping-pong sampling, enabling high-fidelity audio generation in only 8 steps while ensuring strong text-audio semantic alignment.
  • The approach supports real-world deployment through dynamic Int8 quantization, reducing memory usage and enabling interactive synthesis on edge devices.

This paper introduces Adversarial Relativistic-Contrastive (ARC) post-training, a novel method designed to accelerate text-to-audio (TTA) generation models, specifically those based on rectified flows, without relying on distillation or Classifier-Free Guidance (CFG). The primary motivation is to reduce the inference latency of current high-quality TTA systems, which can take seconds to minutes per generation, making them impractical for real-time creative applications like music production or sound design.

The core idea of ARC is to post-train a pre-trained rectified flow model using an adversarial framework that incorporates two key losses:

  1. Adversarial Relativistic Loss ($\mathcal{L}_R$): This loss extends a relativistic GAN formulation to the post-training of diffusion/flow models. It operates on pairs of real and generated samples that share the same text prompt. The discriminator ($D_{\bm\psi}$) is trained to maximize the difference between its output for the real sample and its output for the generated sample, while the generator ($G_{\bm\phi}$) is trained to minimize this difference. This encourages the generator to produce samples that are perceived as "more real" than the paired real samples in the discriminator's feature space. Unlike standard GANs, which judge realism independently, the relativistic approach uses relative comparisons within paired samples.
  2. Contrastive Loss ($\mathcal{L}_C$): This is a novel component designed to improve text-audio alignment, which is often challenging in adversarial generation without CFG. This loss is applied only to the discriminator. It trains the discriminator to produce a higher score for a real audio sample paired with its correct text prompt than for the same audio sample paired with a randomly shuffled, incorrect prompt from the batch. This encourages the discriminator to become sensitive to prompt adherence and semantic features, guiding the adversarial training towards text-conditional generation rather than just unconditional realism.
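The two losses above can be sketched numerically. This is a minimal illustration, not the paper's exact formulation: the softplus form of the relativistic loss, the function names, and the `lam` weighting argument are all assumptions for clarity.

```python
import numpy as np

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return np.logaddexp(0.0, z)

def relativistic_d_loss(d_real, d_fake):
    """Discriminator side of the relativistic loss (assumed softplus form):
    push the score of a real sample above its paired generated sample,
    where both share the same text prompt."""
    return softplus(-(d_real - d_fake)).mean()

def relativistic_g_loss(d_real, d_fake):
    """Generator side: make the generated sample look 'more real' than
    its paired real sample under the discriminator."""
    return softplus(d_real - d_fake).mean()

def contrastive_d_loss(d_matched, d_shuffled):
    """Contrastive term (applied to the discriminator only): score real
    audio with its correct prompt above the same audio paired with a
    randomly shuffled prompt from the batch."""
    return softplus(-(d_matched - d_shuffled)).mean()

def arc_d_loss(d_real, d_fake, d_matched, d_shuffled, lam=1.0):
    """Combined discriminator objective: L_R + lambda * L_C."""
    return (relativistic_d_loss(d_real, d_fake)
            + lam * contrastive_d_loss(d_matched, d_shuffled))
```

In this sketch, a well-separating discriminator (real scored above fake, matched prompt above shuffled) drives both discriminator terms toward zero, while the generator's loss pushes in the opposite direction on the same paired comparison.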

The overall training objective for ARC is a joint optimization: $\min_{\bm\phi}\max_{\bm\psi}\mathcal{L}_\text{ARC}(\bm\phi, \bm\psi) = \mathcal{L}_\text{R}(\bm\phi, \bm\psi) + \lambda \cdot \mathcal{L}_\text{C}(\bm\psi)$, where $\lambda$ balances the two losses. The generator and discriminator are initialized using the weights of a pre-trained rectified flow model, leveraging its initial training stability.

For inference, ARC employs Ping-Pong sampling. This method, adapted from consistency models, alternates between denoising a noisy sample using the few-step generator and re-noising it to a slightly lower noise level. This iterative refinement process allows the model to generate high-fidelity audio in a small number of steps ($N$), unlike standard ODE solvers which require many steps. A key practical benefit is that ARC avoids CFG during inference, reducing memory requirements significantly compared to methods that use it.
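The denoise/re-noise alternation can be sketched as follows. The linear sigma schedule, latent shape, and `generator(x, sigma, prompt_emb)` interface are hypothetical stand-ins; the paper's actual schedule and model API may differ.

```python
import numpy as np

def ping_pong_sample(generator, prompt_emb, n_steps=8, shape=(64, 215), seed=0):
    """Sketch of ping-pong sampling: alternate between a full denoise with
    the few-step generator and re-noising the clean estimate to the next,
    slightly lower noise level on a decreasing sigma schedule."""
    rng = np.random.default_rng(seed)
    sigmas = np.linspace(1.0, 0.0, n_steps + 1)  # assumed linear schedule
    x = sigmas[0] * rng.standard_normal(shape)   # start from pure noise
    for i in range(n_steps):
        # "Ping": generator predicts a clean sample from the current noisy one.
        x0 = generator(x, sigmas[i], prompt_emb)
        if i < n_steps - 1:
            # "Pong": re-noise the estimate to the next (lower) noise level.
            x = x0 + sigmas[i + 1] * rng.standard_normal(shape)
        else:
            x = x0  # final step: keep the clean estimate
    return x
```

Because each step uses a single conditional forward pass (no CFG), memory and latency scale with `n_steps` alone, which is what makes the 8-step configuration fast.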

The authors implemented ARC on a modified version of the Stable Audio Open (SAO) model (Evans et al., 19 Jul 2024). They used SAO's pre-trained autoencoder and a reduced-size Diffusion Transformer (DiT) (0.34B parameters compared to SAO's 1.06B DiT) for the latent generative model. The discriminator reuses parts of the pre-trained DiT and adds a lightweight convolutional head. The models were finetuned on a large dataset of Freesound samples.

Experimental results demonstrate that ARC achieves substantial acceleration. Using 8 sampling steps with ARC, the model generates approximately 12 seconds of 44.1kHz stereo audio in about 75ms on an H100 GPU, a 100x speedup over the original SAO (100 steps) and a 10x speedup over the base pre-trained rectified flow model (50 steps).

Crucially, the paper shows that ARC preserves generative diversity better than distillation-based methods like Presto (Lin et al., 14 Jan 2025). While Presto improves quality, it sacrifices diversity, making its outputs less varied for the same prompt. ARC achieves competitive objective metrics (FD$_{\text{openl3}}$, KL$_{\text{passt}}$, CLAP) and subjective quality/prompt adherence scores (MOS) while significantly boosting diversity (measured by a novel CLAP Conditional Diversity Score (CCDS) and subjective MOS). Ablation studies confirm that both the relativistic ($\mathcal{L}_R$) and contrastive ($\mathcal{L}_C$) components are necessary for balanced performance, with $\mathcal{L}_R$ alone leading to poor prompt adherence and $\mathcal{L}_{\text{LS}} + \mathcal{L}_C$ being less effective than ARC.

For real-world deployment on edge devices, the authors optimized the ARC model for mobile CPUs using Arm's KleidiAI and XNNPACK with dynamic Int8 quantization. This allowed the model to generate approximately 7 seconds of audio in about 6.6 seconds on a test mobile phone, reducing runtime RAM usage from 6.5GB to 3.6GB. This demonstrates the feasibility of running TTA models locally on consumer hardware, which is essential for many interactive creative applications.
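The core idea of dynamic Int8 quantization can be illustrated with a per-tensor symmetric scheme: the scale is derived from the tensor's observed range at runtime, so no quantization-aware training is required. This is an explanatory sketch, not the KleidiAI/XNNPACK implementation, which uses its own kernels and scaling strategy.

```python
import numpy as np

def quantize_dynamic_int8(w):
    """Per-tensor symmetric dynamic Int8 quantization sketch: the scale is
    chosen at runtime from the tensor's max magnitude, mapping values into
    the signed 8-bit range [-127, 127]."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from the Int8 values."""
    return q.astype(np.float32) * scale
```

Storing weights as Int8 plus one float scale per tensor is what roughly halves runtime memory relative to 16-bit weights, at the cost of a small, bounded rounding error per value.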

Beyond standard TTA, the ping-pong sampling inference procedure naturally supports audio-to-audio capabilities without additional training. By using an existing audio recording as the initial noisy sample, the model can perform style transfer (e.g., voice-to-audio synthesis or beat-aligned generation), showing versatility for creative workflows.
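The audio-to-audio initialization amounts to re-noising an existing recording's latent to an intermediate level and resuming sampling from there. A minimal sketch, assuming a hypothetical `audio_to_audio_init` helper and a `sigma` "strength" parameter controlling how much of the reference survives:

```python
import numpy as np

def audio_to_audio_init(reference_latent, sigma, seed=0):
    """Sketch of audio-to-audio style transfer setup: instead of starting
    ping-pong sampling from pure noise, re-noise an existing recording's
    latent to an intermediate noise level sigma. Lower sigma preserves more
    of the reference (e.g. its rhythm); higher sigma allows more deviation."""
    rng = np.random.default_rng(seed)
    return reference_latent + sigma * rng.standard_normal(reference_latent.shape)
```

The remaining ping-pong steps then denoise this partially noised latent under the new text prompt, which is how beat-aligned or voice-driven generation falls out without any additional training.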

Practical Implementation Considerations:

  • Computational Requirements: While ARC achieves high speedup during inference, the post-training phase requires significant resources (e.g., 8 H100 GPUs for 100k iterations).
  • Model Size: The model itself, even with optimizations, occupies several gigabytes of VRAM/RAM and disk space, which can still be a constraint for highly resource-limited edge devices or integration into small applications.
  • Training Stability: Adversarial training can be sensitive. The initialization from a pre-trained flow model is crucial for stability. Balancing $\mathcal{L}_R$ and $\mathcal{L}_C$ via the $\lambda$ parameter is also important.
  • Quantization: Dynamic Int8 quantization proved effective for on-device deployment, offering a good trade-off between speed, memory, and quality without requiring quantization-aware training.
  • Sampling Steps: The paper shows that 8 steps offer a good balance for their optimized model, outperforming 1-step and being significantly faster than the pre-trained model's typical 50+ steps. The optimal number of steps might depend on the specific model architecture and target hardware.

The work highlights that adversarial post-training, when formulated appropriately with components like the relativistic and contrastive losses in ARC, can be a competitive alternative to distillation for accelerating generative models, particularly in audio where preserving diversity and text alignment is key for creative use cases. The achieved low latency on consumer GPUs and feasibility on mobile CPUs are significant steps towards making high-quality TTA an interactive tool.

