
Fast Timing-Conditioned Latent Audio Diffusion

(2402.04825)
Published Feb 7, 2024 in cs.SD, cs.LG, and eess.AS

Abstract

Generating long-form 44.1kHz stereo audio from text prompts can be computationally demanding. Further, most previous works do not tackle that music and sound effects naturally vary in their duration. Our research focuses on the efficient generation of long-form, variable-length stereo music and sounds at 44.1kHz using text prompts with a generative model. Stable Audio is based on latent diffusion, with its latent defined by a fully-convolutional variational autoencoder. It is conditioned on text prompts as well as timing embeddings, allowing for fine control over both the content and length of the generated music and sounds. Stable Audio is capable of rendering stereo signals of up to 95 sec at 44.1kHz in 8 sec on an A100 GPU. Despite its compute efficiency and fast inference, it is one of the best in two public text-to-music and -audio benchmarks and, differently from state-of-the-art models, can generate music with structure and stereo sounds.

Figure: Stable Audio diagram showing pre-trained models, parameters learned in training, and signals of interest.

Overview

  • The paper presents Stable Audio, a generative audio model based on latent diffusion and a fully-convolutional variational autoencoder architecture to create high-quality, long-form audio quickly.

  • Stable Audio introduces text and timing embeddings to control the content and duration of the audio, allowing for variable-length outputs up to the training window length.

  • Customized evaluation metrics assess the model's ability to generate full-bandwidth, long-form stereo signals, focusing on quality and adherence to textual prompts.

  • Empirical results show that Stable Audio excels in generating structured music and has reasonable performance in sound effect production, with opportunities for improvement in stereo sound.

Introduction to Stable Audio

In the realm of generative audio, recent advancements have been substantial, particularly with diffusion-based generative models that have made strides in image, video, and audio synthesis. While these models demonstrate impressive results, a notable challenge arises when working directly with raw audio: the computational intensity of training and inference. The challenge grows further when the goal is high-fidelity, long-form audio at a standard sampling rate (44.1kHz), as commonly desired in music and sound effects production.

Overcoming Long-Form Audio Generation Challenges

The research introduces Stable Audio, a latent diffusion model whose latent space is defined by a fully-convolutional variational autoencoder (VAE). Working in this compressed latent space sidesteps much of the computational burden of operating on raw audio, enabling much faster inference. Crucially, the model accommodates control over both the content and duration of the generated output via text and timing embeddings. These timing embeddings are a novel addition in generative audio, granting the ability to produce variable-length audio up to the training window length, a marked improvement over previous works constrained to fixed-length outputs.
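To make the conditioning mechanism concrete, below is a minimal PyTorch sketch of timing conditioning in the spirit the paper describes: the start time and total duration (in seconds) are mapped to learned embeddings and concatenated with the text-prompt features that the diffusion model attends to. All names, dimensions, and the per-integer-second embedding tables are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Sketch: map (seconds_start, seconds_total) to learned embeddings and
    append them to the text-prompt features used for cross-attention."""
    def __init__(self, max_seconds: int = 512, dim: int = 768):
        super().__init__()
        # One learned embedding per integer second (an illustrative choice).
        self.start_emb = nn.Embedding(max_seconds, dim)
        self.total_emb = nn.Embedding(max_seconds, dim)

    def forward(self, seconds_start, seconds_total, text_features):
        # text_features: (batch, seq_len, dim) from a frozen text encoder.
        t_start = self.start_emb(seconds_start).unsqueeze(1)  # (batch, 1, dim)
        t_total = self.total_emb(seconds_total).unsqueeze(1)  # (batch, 1, dim)
        # The diffusion backbone can then attend jointly to prompt and timing.
        return torch.cat([text_features, t_start, t_total], dim=1)

# Example: condition a 30-second generation inside a 95-second training window.
cond = TimingConditioner()(torch.tensor([0]), torch.tensor([30]),
                           torch.zeros(1, 77, 768))
print(cond.shape)  # torch.Size([1, 79, 768])
```

At inference, varying seconds_total (and trimming the decoded waveform accordingly) is what allows variable-length outputs up to the training window.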

Benchmarking Performance against Existing Metrics

The study pioneers customized evaluation metrics to assess the generation of long-form full-band stereo signals. These include:

  1. A Fréchet Distance variant computed on OpenL3 embeddings, tailored to full-band (44.1kHz) stereo signals (see the sketch after this list).
  2. A modified Kullback-Leibler divergence metric for semantic evaluation of extended audio lengths.
  3. The CLAP score, designed to measure adherence of the generated stereo audio to textual prompts.
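As a concrete reference for the first metric, here is a minimal NumPy/SciPy sketch of the Fréchet distance between two sets of audio embeddings (e.g., OpenL3 embeddings extracted from reference and generated audio). How embeddings are obtained from full-band stereo signals follows the paper; this sketch only shows the distance computation itself.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_emb: np.ndarray, fake_emb: np.ndarray) -> float:
    """Fréchet distance between two embedding sets of shape (num_items, dim),
    e.g., OpenL3 embeddings of reference vs. generated audio."""
    mu_r, mu_f = real_emb.mean(axis=0), fake_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_f = np.cov(fake_emb, rowvar=False)
    # Matrix square root of the covariance product; small imaginary parts
    # can appear from numerical error and are discarded.
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```

Lower values indicate that the statistics of the generated embeddings are closer to those of the reference set.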

These metrics adapt established measures to variable-length, long-form, full-band stereo audio, providing a framework for assessing the plausibility, semantic consistency, and text adherence of generative models.
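For the third metric, the sketch below illustrates a CLAP-style text-adherence score: the average cosine similarity between text-prompt embeddings and generated-audio embeddings from a pretrained CLAP model. It assumes the laion_clap package and hypothetical generated files; exact method names, checkpoints, and preprocessing may differ from what the authors used.

```python
import numpy as np
import laion_clap

# Load a pretrained CLAP model (downloads a default checkpoint).
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()

prompts = ["calm piano with soft strings", "energetic techno with a driving bassline"]
generated_files = ["gen_0.wav", "gen_1.wav"]  # hypothetical generated audio

text_emb = model.get_text_embedding(prompts, use_tensor=False)
audio_emb = model.get_audio_embedding_from_filelist(x=generated_files, use_tensor=False)

def l2norm(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity per (prompt, audio) pair, averaged into a single score.
clap_score = float(np.mean(np.sum(l2norm(text_emb) * l2norm(audio_emb), axis=-1)))
print(f"CLAP score: {clap_score:.3f}")
```

Higher scores indicate that the generated audio sits closer to its prompt in the shared CLAP embedding space.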

Empirical Results and Model Capabilities

Empirical findings from the paper indicate Stable Audio's proficiency in generating high-quality, structured music—with identifiable sections like intro and outro—and stereo sound effects. In speed tests on an A100 GPU, the model rendered up to 95 seconds of 44.1kHz stereo audio in as little as 8 seconds. Quantitatively, Stable Audio showcased its competitive edge on public text-to-music benchmarks, leading in several audio quality metrics and text alignment for music generation. When qualitatively measured, it earned high mean opinion scores for audio quality and musical structure compared to other models. However, the study does acknowledge an area for potential refinement: enhancing the model's stereo sound generation for sound effects, as indicated by a modest stereo correctness score.

Online Resources and Model Accessibility

To amplify its practical impact, the researchers have made the model, the accompanying evaluation metrics, and demonstrations publicly accessible. Demo pages let interested readers explore Stable Audio's capabilities, supporting transparency and adoption across applications.

In conclusion, Stable Audio signifies a leap forward in generative audio synthesis, particularly in the domain of music and sound effects production. This model's introduction stands as a testament to the evolving synergy between efficiency, control, and quality within the field of generative AI.
