Stable Audio 3: Fast, Variable-Length Audio Generation

Stable Audio 3 introduces a suite of latent diffusion models that generate high-fidelity music and sound effects, with native support for variable-length synthesis and editing. By combining a novel semantic-acoustic autoencoder with flow matching, distillation, and adversarial training, the system achieves state-of-the-art quality with fast inference: 120 seconds of stereo audio in under a second on datacenter GPUs and in under 5 seconds on laptop CPUs.

Script

Stable Audio 3 compresses 44.1 kilohertz stereo audio by a factor of 4096, fitting minutes of sound into a compact 256-dimensional latent space that makes long-form generation tractable on everyday hardware.
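To make the compression figure concrete, here is a minimal arithmetic sketch. It assumes the factor of 4096 is a downsampling factor along the time axis and that each latent frame is a 256-dimensional vector; the script does not spell out either detail, and the function names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 44_100      # sample rate stated in the script
DOWNSAMPLING_FACTOR = 4096   # ASSUMPTION: applied along the time axis
LATENT_DIM = 256             # latent width stated in the script


def latent_frames(duration_s: float) -> int:
    """Latent frames needed to cover duration_s seconds of audio."""
    samples = duration_s * SAMPLE_RATE_HZ
    return math.ceil(samples / DOWNSAMPLING_FACTOR)


def latent_shape(duration_s: float) -> tuple[int, int]:
    """(frames, channels) of the latent sequence under the assumptions above."""
    return (latent_frames(duration_s), LATENT_DIM)
```

Under these assumptions the latent rate is roughly 10.8 frames per second, so a 120-second clip becomes a sequence of about 1,292 frames rather than millions of waveform samples.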

The diffusion transformer conditions generation through cross-attention to text embeddings, while duration and inpainting masks are wired directly into each layer via adaptive normalization, enabling seamless variable-length synthesis and surgical audio edits without architectural changes.
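As an illustration of conditioning through adaptive normalization, the sketch below applies a generic adaptive layer norm in NumPy. It is not the paper's implementation: the projection matrices w_scale and w_shift, the shapes, and the function names are all assumptions chosen for clarity.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each frame (last axis) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def adaptive_layer_norm(x, cond, w_scale, w_shift):
    """Modulate normalized activations with a conditioning signal.

    x:        (frames, dim) hidden states
    cond:     (cond_dim,) global signal such as a duration embedding, or
              (frames, cond_dim) per-frame signal such as an inpainting mask
    w_scale,
    w_shift:  (cond_dim, dim) hypothetical learned projections
    """
    scale = cond @ w_scale  # (dim,) or (frames, dim)
    shift = cond @ w_shift
    return layer_norm(x) * (1.0 + scale) + shift
```

With zero projections the layer reduces to a plain layer norm, which is why this style of conditioning can be added to every layer without changing the rest of the architecture.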

Training unfolds in three phases: flow matching learns denoising velocity fields, single-step distillation straightens those paths into direct clean estimates, and adversarial post-training refines perceptual quality and prompt alignment without the teacher, embedding guidance directly into the model weights.
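The first two phases can be sketched with the common linear-path (rectified flow) formulation of flow matching. This parameterization is an assumption, since the script does not state which convention the paper uses, and the function names are hypothetical.

```python
import numpy as np


def flow_matching_pair(x_clean, noise, t):
    """Noisy sample and target velocity on a straight path from data to noise.

    ASSUMPTION: x_t = (1 - t) * x_clean + t * noise, with t in [0, 1].
    """
    x_t = (1.0 - t) * x_clean + t * noise
    target_velocity = noise - x_clean  # d x_t / d t along the straight path
    return x_t, target_velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))


def one_step_clean_estimate(x_t, predicted_velocity, t):
    """Jump straight to a clean estimate: x_clean = x_t - t * v."""
    return x_t - t * predicted_velocity
```

If the predicted velocity is exact, the one-step estimate recovers the clean latent from any t, which is the property a single-step distilled model tries to approximate.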

Native variable-length support eliminates the padding waste of fixed-length systems, scaling inference cost linearly with requested duration and enabling efficient generation from 5 seconds to over 6 minutes on the same architecture.

The large model achieves a Fréchet Audio Distance (FAD) of 0.101 and a CLAP alignment score of 0.393 on 120-second music, outperforming all open baselines, while inpainting edits yield FAD scores as low as 0.046, demonstrating seamless integration of generated and original audio.

With 120 seconds of stereo audio generated in under half a second on datacenter GPUs and under 5 seconds on a laptop, Stable Audio 3 makes state-of-the-art audio generation practical for everyday creators. Explore the full paper and create your own lightning talks at EmergentMind.com.