The paper presents Stable Audio, a generative audio model that combines latent diffusion with a fully-convolutional variational autoencoder to produce high-quality, long-form audio quickly.
Stable Audio introduces text and timing embeddings to control the content and duration of the generated audio, allowing variable-length outputs up to the length of the training window.
Customized evaluation metrics assess the model's ability to generate full-bandwidth, long-form stereo signals, focusing on quality and adherence to textual prompts.
Empirical results show that Stable Audio excels at generating structured music and performs reasonably on sound effects, with room for improvement in stereo sound-effect generation.
In the realm of generative audio, recent advancements have been substantial, particularly with diffusion-based generative models that have made strides in image, video, and audio synthesis. While these models demonstrate impressive results, a notable challenge arises when working directly with raw audio: training and inference are computationally intensive. This challenge grows when the goal is high-fidelity, long-form audio at the standard 44.1 kHz sampling rate, as commonly desired in music and sound-effect production.
The research introduces Stable Audio, a model built on latent diffusion that operates in the compressed latent space of a fully-convolutional variational autoencoder (VAE). This design sidesteps the computational burden of manipulating raw audio directly, enabling much faster inference. Crucially, the model accommodates control over both the content and the duration of the generated output via text and timing embeddings. The timing embeddings are a novel contribution in generative audio, granting the ability to produce variable-length audio up to the trained window length, an improvement over previous works constrained to fixed-length outputs.
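To make the timing-conditioning idea concrete, here is a minimal sketch of how a diffusion model could be conditioned on a window's start time and the total intended duration alongside a text embedding. The function names, the sinusoidal embedding scheme, and the dimensions are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def sinusoidal_embedding(value, dim=64, max_period=10000.0):
    """Map a scalar (e.g. a time in seconds) to a fixed-size sinusoidal
    vector, so nearby values get similar embeddings."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = value * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

def build_conditioning(text_embedding, seconds_start, seconds_total, dim=64):
    """Concatenate text features with two timing embeddings, telling the
    model where the generated window sits within a longer piece
    (seconds_start) and how long the full audio should be (seconds_total)."""
    start_emb = sinusoidal_embedding(seconds_start, dim)
    total_emb = sinusoidal_embedding(seconds_total, dim)
    return np.concatenate([text_embedding, start_emb, total_emb])

# Hypothetical usage: a 512-d text embedding, a window starting at 0 s
# of a 95 s piece (the model's maximum trained window length).
text_emb = np.random.randn(512)
cond = build_conditioning(text_emb, seconds_start=0.0, seconds_total=95.0)
print(cond.shape)  # (640,)
```

Conditioning on `seconds_total` is what lets the model fill only part of its window and pad the rest, yielding variable-length outputs from a fixed-length architecture.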
The study pioneers customized evaluation metrics to assess the generation of long-form, full-band stereo signals. These include:
- a Fréchet distance computed on OpenL3 embeddings, measuring how plausible the generated audio sounds;
- a Kullback-Leibler divergence over PaSST audio-tagger predictions, measuring semantic correspondence with reference audio;
- a CLAP score, measuring how well the generated audio matches the text prompt.
These metrics, adapted from established ones, cater to the assessment of variable-length, long-form audio, providing a robust framework for measuring the plausibility, semantic reliability, and text-based accuracy of generative models.
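As one illustration of the Fréchet-distance family of metrics, each set of audio embeddings is fit with a Gaussian and the distance between the two Gaussians is computed. The sketch below is a generic, simplified version (diagonal covariances, to keep it dependency-free), not the paper's exact evaluation pipeline:

```python
import numpy as np

def frechet_distance_diag(emb_a, emb_b):
    """Fréchet distance between two embedding sets, each fit with a
    Gaussian whose covariance is simplified to its diagonal.  The full
    metric uses dense covariance matrices and a matrix square root."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    var_a, var_b = emb_a.var(axis=0), emb_b.var(axis=0)
    diff = mu_a - mu_b
    # ||mu_a - mu_b||^2 + sum(var_a + var_b - 2*sqrt(var_a * var_b))
    return float(diff @ diff + np.sum(var_a + var_b - 2.0 * np.sqrt(var_a * var_b)))

# Synthetic check: a set compared with itself scores ~0; a set drawn
# from a shifted distribution scores higher.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(2000, 8))
fake = rng.normal(0.5, 1.0, size=(2000, 8))
print(frechet_distance_diag(real, real))  # ~0 for identical sets
print(frechet_distance_diag(real, fake))  # larger for shifted sets
```

A lower distance means the generated embeddings are distributed more like the reference embeddings, which is why the metric serves as a proxy for overall plausibility.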
Empirical findings from the paper indicate Stable Audio's proficiency in generating high-quality, structured music—with identifiable sections like intro and outro—and stereo sound effects. In speed tests on an A100 GPU, the model rendered up to 95 seconds of 44.1kHz stereo audio in as little as 8 seconds. Quantitatively, Stable Audio showcased its competitive edge on public text-to-music benchmarks, leading in several audio quality metrics and text alignment for music generation. When qualitatively measured, it earned high mean opinion scores for audio quality and musical structure compared to other models. However, the study does acknowledge an area for potential refinement: enhancing the model's stereo sound generation for sound effects, as indicated by a modest stereo correctness score.
On the practical side, the researchers have made the model, the accompanying metrics, and demonstrations publicly available. Interested readers can explore Stable Audio's capabilities through the demo pages, which supports transparency and eases adoption across applications.
In conclusion, Stable Audio signifies a leap forward in generative audio synthesis, particularly in the domain of music and sound effects production. This model's introduction stands as a testament to the evolving synergy between efficiency, control, and quality within the field of generative AI.