
Fast Timing-Conditioned Latent Audio Diffusion

(2402.04825)
Published Feb 7, 2024 in cs.SD, cs.LG, and eess.AS

Abstract

Generating long-form 44.1kHz stereo audio from text prompts can be computationally demanding. Further, most previous works do not tackle that music and sound effects naturally vary in their duration. Our research focuses on the efficient generation of long-form, variable-length stereo music and sounds at 44.1kHz using text prompts with a generative model. Stable Audio is based on latent diffusion, with its latent defined by a fully-convolutional variational autoencoder. It is conditioned on text prompts as well as timing embeddings, allowing for fine control over both the content and length of the generated music and sounds. Stable Audio is capable of rendering stereo signals of up to 95 sec at 44.1kHz in 8 sec on an A100 GPU. Despite its compute efficiency and fast inference, it is one of the best in two public text-to-music and -audio benchmarks and, differently from state-of-the-art models, can generate music with structure and stereo sounds.

Figure: Stable Audio diagram showing pre-trained models, parameters learned in training, and signals of interest.

Overview

  • The paper presents Stable Audio, a generative audio model based on latent diffusion and a fully-convolutional variational autoencoder architecture to create high-quality, long-form audio quickly.

  • Stable Audio introduces text and timing embeddings to control the content and duration of the audio, allowing for variable-length outputs up to the training window length.

  • Customized evaluation metrics assess the model's ability to generate full-bandwidth, long-form stereo signals, focusing on quality and adherence to textual prompts.

  • Empirical results show that Stable Audio excels in generating structured music and has reasonable performance in sound effect production, with opportunities for improvement in stereo sound.

Introduction to Stable Audio

In the realm of generative audio, recent advancements have been substantial, particularly with diffusion-based generative models that have made strides in image, video, and audio synthesis. While these models demonstrate impressive results, a notable challenge arises when working directly with raw audio: the computational intensity of training and inference. The challenge grows further when the goal is high-fidelity, long-form audio at a standard sampling rate (44.1kHz), as commonly desired in music and sound effects production.

Overcoming Long-Form Audio Generation Challenges

The research introduces Stable Audio, a latent diffusion model whose latent space is defined by a fully-convolutional variational autoencoder (VAE). Working in this compressed latent space sidesteps much of the computational burden of operating on raw audio, enabling much faster inference. Crucially, the model accommodates control over both the content and duration of the generated output via text and timing embeddings. These timing embeddings are a novel addition in generative audio, granting the ability to produce variable-length audio up to the training window length, a marked improvement over previous works constrained to fixed-length outputs.
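To make the conditioning mechanism concrete, below is a minimal PyTorch sketch of timing conditioning in the spirit the paper describes: the start time and total duration (in seconds) are mapped to learned embeddings and concatenated with the text-prompt features that the diffusion model attends to. All names, dimensions, and the per-integer-second embedding tables are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Sketch: map (seconds_start, seconds_total) to learned embeddings and
    append them to the text-prompt features used for cross-attention."""
    def __init__(self, max_seconds: int = 512, dim: int = 768):
        super().__init__()
        # One learned embedding per integer second (an illustrative choice).
        self.start_emb = nn.Embedding(max_seconds, dim)
        self.total_emb = nn.Embedding(max_seconds, dim)

    def forward(self, seconds_start, seconds_total, text_features):
        # text_features: (batch, seq_len, dim) from a frozen text encoder.
        t_start = self.start_emb(seconds_start).unsqueeze(1)  # (batch, 1, dim)
        t_total = self.total_emb(seconds_total).unsqueeze(1)  # (batch, 1, dim)
        # The diffusion backbone can then attend jointly to prompt and timing.
        return torch.cat([text_features, t_start, t_total], dim=1)

# Example: condition a 30-second generation inside a 95-second training window.
cond = TimingConditioner()(torch.tensor([0]), torch.tensor([30]),
                           torch.zeros(1, 77, 768))
print(cond.shape)  # torch.Size([1, 79, 768])
```

At inference, varying seconds_total (and trimming the decoded waveform accordingly) is what allows variable-length outputs up to the training window.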

Benchmarking Performance against Existing Metrics

The study pioneers customized evaluation metrics to assess the generation of long-form full-band stereo signals. These include:

  1. A Fréchet Distance variant computed on OpenL3 embeddings, tailored to full-band (44.1kHz) stereo signals (see the sketch after this list).
  2. A modified Kullback-Leibler divergence metric for semantic evaluation of extended audio lengths.
  3. The CLAP score, designed to measure adherence of the generated stereo audio to textual prompts.
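As a concrete reference for the first metric, here is a minimal NumPy/SciPy sketch of the Fréchet distance between two sets of audio embeddings (e.g., OpenL3 embeddings extracted from reference and generated audio). How embeddings are obtained from full-band stereo signals follows the paper; this sketch only shows the distance computation itself.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_emb: np.ndarray, fake_emb: np.ndarray) -> float:
    """Fréchet distance between two embedding sets of shape (num_items, dim),
    e.g., OpenL3 embeddings of reference vs. generated audio."""
    mu_r, mu_f = real_emb.mean(axis=0), fake_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_f = np.cov(fake_emb, rowvar=False)
    # Matrix square root of the covariance product; small imaginary parts
    # can appear from numerical error and are discarded.
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```

Lower values indicate that the statistics of the generated embeddings are closer to those of the reference set.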

These metrics adapt established measures to variable-length, long-form, full-band stereo audio, providing a framework for assessing the plausibility, semantic consistency, and text adherence of generative models.
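For the third metric, the sketch below illustrates a CLAP-style text-adherence score: the average cosine similarity between text-prompt embeddings and generated-audio embeddings from a pretrained CLAP model. It assumes the laion_clap package and hypothetical generated files; exact method names, checkpoints, and preprocessing may differ from what the authors used.

```python
import numpy as np
import laion_clap

# Load a pretrained CLAP model (downloads a default checkpoint).
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()

prompts = ["calm piano with soft strings", "energetic techno with a driving bassline"]
generated_files = ["gen_0.wav", "gen_1.wav"]  # hypothetical generated audio

text_emb = model.get_text_embedding(prompts, use_tensor=False)
audio_emb = model.get_audio_embedding_from_filelist(x=generated_files, use_tensor=False)

def l2norm(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity per (prompt, audio) pair, averaged into a single score.
clap_score = float(np.mean(np.sum(l2norm(text_emb) * l2norm(audio_emb), axis=-1)))
print(f"CLAP score: {clap_score:.3f}")
```

Higher scores indicate that the generated audio sits closer to its prompt in the shared CLAP embedding space.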

Empirical Results and Model Capabilities

Empirical findings from the paper indicate Stable Audio's proficiency in generating high-quality, structured music—with identifiable sections like intro and outro—and stereo sound effects. In speed tests on an A100 GPU, the model rendered up to 95 seconds of 44.1kHz stereo audio in as little as 8 seconds. Quantitatively, Stable Audio showcased its competitive edge on public text-to-music benchmarks, leading in several audio quality metrics and text alignment for music generation. When qualitatively measured, it earned high mean opinion scores for audio quality and musical structure compared to other models. However, the study does acknowledge an area for potential refinement: enhancing the model's stereo sound generation for sound effects, as indicated by a modest stereo correctness score.

Online Resources and Model Accessibility

To amplify its practical impact, the researchers have made the model, the accompanying evaluation metrics, and demonstrations publicly accessible. Demo pages let interested readers explore Stable Audio's capabilities, supporting transparency and adoption across applications.

In conclusion, Stable Audio signifies a leap forward in generative audio synthesis, particularly in the domain of music and sound effects production. This model's introduction stands as a testament to the evolving synergy between efficiency, control, and quality within the field of generative AI.
