
Improve adversarial post-training for audio generative models

Determine how adversarial post-training can be improved and effectively applied to text-conditional audio generation with Gaussian flow-based generative models (diffusion models and rectified flows).


Background

The paper aims to accelerate Gaussian flow-based text-to-audio models, noting that standard approaches rely on step distillation, which is resource-intensive and often inherits drawbacks from classifier-free guidance. Prior acceleration attempts in the image domain using adversarial post-training (e.g., UFOGen and APT) reported limited gains or required distillation-based initialization, and audio applications have not been adequately addressed.

Given these limitations, the authors identify a gap: while adversarial post-training avoids distillation and uses real data, how to improve it and apply it successfully to audio generation remains unresolved. This motivates their ARC approach, but they explicitly state that improving adversarial post-training and applying it to audio remains an open question.
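To make the contrast with step distillation concrete, below is a minimal, hypothetical PyTorch sketch of a single adversarial post-training step: a few-step generator (which would be initialized from a pretrained flow model) is updated against a discriminator that sees real audio latents, with no distillation teacher. All names (`generator`, `discriminator`, `adversarial_post_training_step`), shapes, and the toy MLP stand-ins are illustrative assumptions, not the paper's ARC method.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for a pretrained few-step generator and a discriminator
# operating on audio latents; shapes and architectures are illustrative only.
generator = nn.Sequential(nn.Linear(64 + 128, 256), nn.ReLU(), nn.Linear(256, 128))
discriminator = nn.Sequential(nn.Linear(128 + 64, 256), nn.ReLU(), nn.Linear(256, 1))

g_opt = torch.optim.AdamW(generator.parameters(), lr=1e-5)
d_opt = torch.optim.AdamW(discriminator.parameters(), lr=1e-5)

def adversarial_post_training_step(real_latents, text_emb):
    """One adversarial update against real data, with no teacher model."""
    noise = torch.randn_like(real_latents)
    # Generator maps noise + text conditioning to an audio latent in one pass.
    fake_latents = generator(torch.cat([text_emb, noise], dim=-1))

    # Discriminator update: separate real from generated latents (logistic GAN loss).
    d_real = discriminator(torch.cat([real_latents, text_emb], dim=-1))
    d_fake = discriminator(torch.cat([fake_latents.detach(), text_emb], dim=-1))
    d_loss = nn.functional.softplus(-d_real).mean() + nn.functional.softplus(d_fake).mean()
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: push generated latents toward the discriminator's "real" side.
    g_score = discriminator(torch.cat([fake_latents, text_emb], dim=-1))
    g_loss = nn.functional.softplus(-g_score).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()

# Toy usage: random tensors stand in for encoded real audio and text embeddings.
real = torch.randn(8, 128)
cond = torch.randn(8, 64)
print(adversarial_post_training_step(real, cond))
```

The point of the sketch is only that the training signal comes from a discriminator over real data rather than from matching a multi-step teacher, which is what distinguishes adversarial post-training from step distillation.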

References

How to improve adversarial post-training and apply it to audio remains an open question.

Fast Text-to-Audio Generation with Adversarial Post-Training (2505.08175 - Novack et al., 13 May 2025) in Section 1 (Introduction)