- The paper introduces a novel text-to-audio generative model that combines a variational autoencoder with a transformer-based diffusion module.
- The methodology combines multi-resolution STFT reconstruction losses, adversarial losses, and efficient block-wise attention for training efficiency.
- Evaluation on AudioCaps and Song Describer shows strong FD_openl3 results while maintaining legal transparency through Creative Commons licensed training data.
 
 
Overview of "Stable Audio Open"
The paper "Stable Audio Open" presents an open-weights text-to-audio model developed by Stability AI, with its architecture and training process thoroughly described. This model addresses the current lack of openly accessible, high-quality text-to-audio generative models which is critical for advancing both artistic creation and academic research. The research goal was to create a state-of-the-art generative model trained on Creative Commons (CC) licensed data, ensuring legal transparency and usability.
Model Architecture
The architecture of Stable Audio Open comprises three core components: a variational autoencoder, T5-based text conditioning, and a transformer-based diffusion model. Key details include:
- Autoencoder: The autoencoder, with 156 million parameters, compresses raw waveforms through five convolutional blocks built from ResNet-like layers and dilated convolutions with Snake activation functions (see the sketch after this list). It handles variable-length stereo audio of up to 47 seconds at a 44.1 kHz sampling rate.
- Diffusion-Transformer (DiT): The DiT, with 1,057 million parameters, operates in the latent space of the autoencoder and is conditioned on T5-based text embeddings. Rotary positional embeddings and cross-attention layers inject the conditioning, enabling coherent, prompt-appropriate audio outputs.
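To make the autoencoder description concrete, below is a minimal PyTorch sketch of the Snake activation, x + sin²(αx)/α, inside a dilated convolutional block. The learnable per-channel α and the specific layer sizes are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation: x + (1/alpha) * sin^2(alpha * x).
    A learnable per-channel alpha is assumed here for illustration."""
    def __init__(self, channels: int, alpha_init: float = 1.0):
        super().__init__()
        self.alpha = nn.Parameter(alpha_init * torch.ones(1, channels, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time); eps avoids division by zero
        return x + torch.sin(self.alpha * x) ** 2 / (self.alpha + 1e-9)

# A minimal dilated convolutional block in the spirit of the described encoder
block = nn.Sequential(
    nn.Conv1d(64, 64, kernel_size=3, dilation=2, padding=2),
    Snake(64),
)
y = block(torch.randn(1, 64, 4410))  # 0.1 s of 44.1 kHz audio, 64 channels
```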
The architecture closely mirrors Stable Audio 2.0, with substitutions such as T5 text conditioning in place of CLAP, and the released model is sized to run on consumer-grade GPUs.
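For readers who want to try the released weights, the sketch below follows the usage pattern published with the stable-audio-tools library and the stable-audio-open-1.0 model card. Exact function names and arguments may differ between library versions, so treat this as an assumption-laden outline rather than authoritative API documentation.

```python
# Hedged sketch based on the stable-audio-tools usage pattern; verify against
# the current library documentation before relying on these exact calls.
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"
model, config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
model = model.to(device)

# Text prompt plus timing conditioning for the desired clip length
conditioning = [{"prompt": "birds chirping in a forest",
                 "seconds_start": 0, "seconds_total": 30}]

audio = generate_diffusion_cond(model, steps=100, cfg_scale=7,
                                conditioning=conditioning,
                                sample_size=config["sample_size"],
                                device=device)

audio = rearrange(audio, "b d n -> d (b n)")  # stereo: (2, samples)
audio = audio / audio.abs().max()             # peak-normalize to [-1, 1]
torchaudio.save("output.wav", audio.cpu(), config["sample_rate"])
```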
Training Methodology
The training data comprises 486,492 recordings sourced from Freesound and the Free Music Archive, totaling roughly 7,300 hours. Several curation steps were taken to ensure the dataset stays within the bounds of CC licensing. The autoencoder was trained with multi-resolution STFT reconstruction losses, adversarial loss terms with convolutional discriminators, and a weighted KL divergence loss.
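As an illustration of the autoencoder's reconstruction objective, here is a minimal multi-resolution STFT loss in PyTorch. The chosen FFT sizes, hop lengths, and the equal weighting of the spectral-convergence and log-magnitude terms are assumptions; the paper's exact configuration may differ.

```python
import torch
import torch.nn.functional as F

def stft_mag(x: torch.Tensor, n_fft: int, hop: int) -> torch.Tensor:
    """Magnitude STFT of a batch of mono waveforms, shape (batch, time)."""
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, window=window,
                      return_complex=True)
    return spec.abs()

def multi_resolution_stft_loss(pred: torch.Tensor, target: torch.Tensor,
                               resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Sum of spectral-convergence and log-magnitude L1 terms over several
    STFT resolutions (resolutions are illustrative, not the paper's)."""
    loss = 0.0
    for n_fft, hop in resolutions:
        p, t = stft_mag(pred, n_fft, hop), stft_mag(target, n_fft, hop)
        sc = torch.norm(t - p, p="fro") / (torch.norm(t, p="fro") + 1e-7)
        log_mag = F.l1_loss(torch.log(p + 1e-7), torch.log(t + 1e-7))
        loss = loss + sc + log_mag
    return loss

# Example: compare a reconstructed waveform against the original
loss = multi_resolution_stft_loss(torch.randn(2, 44100), torch.randn(2, 44100))
```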
For the Diffusion-Transformer, training used efficient block-wise attention and gradient checkpointing to reduce memory and computational load. Inference relies on classifier-free guidance together with the DPM-Solver++ sampler.
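The classifier-free guidance step can be summarized in a few lines: the diffusion model is evaluated with and without the text conditioning, and the two predictions are extrapolated by a guidance scale. The sketch below uses placeholder names (model, text_emb, empty_emb) and is not the paper's inference code.

```python
import torch

def cfg_denoise(model, x_t, t, text_emb, empty_emb, guidance_scale: float = 7.0):
    """One classifier-free-guidance evaluation. `model` stands in for the DiT,
    `text_emb` for the T5 text embedding, and `empty_emb` for an unconditional
    (empty-prompt) embedding; all names are placeholders."""
    cond = model(x_t, t, text_emb)     # prediction with text conditioning
    uncond = model(x_t, t, empty_emb)  # prediction without conditioning
    # Extrapolate away from the unconditional prediction
    return uncond + guidance_scale * (cond - uncond)
```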
Evaluation Metrics
The evaluation used established metrics: FD_openl3, KL_passt, and CLAP score. These metrics respectively measure generation plausibility, semantic correspondence, and adherence of generated audio to the given text prompt.
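To clarify what FD_openl3 measures, the sketch below computes a Fréchet distance between Gaussian statistics of two embedding sets, e.g. Openl3 embeddings of generated versus reference audio. This is the standard formulation, not the paper's evaluation harness.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Frechet distance between two embedding sets of shape (n_samples, dim),
    treating each set as a Gaussian with empirical mean and covariance."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical noise
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```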
Key Findings:
- AudioCaps Dataset: On AudioCaps, Stable Audio Open outperformed comparable open models, particularly on FD_openl3, demonstrating its proficiency at generating realistic sounds.
- Song Describer Dataset: Although the model does not surpass specialized music generation models, it is competitive with state-of-the-art open-access models such as MusicGen.
Additionally, the autoencoder achieves reconstruction quality comparable to Stable Audio 2.0 despite being trained exclusively on Creative Commons licensed data; this balance between performance and legal transparency is a significant achievement.
Implications and Future Directions
The release of Stable Audio Open marks a valuable contribution to the field of generative AI, offering an open-access, high-quality model that can serve as a foundation for further exploration and customization in audio synthesis. A few key implications and future speculations include:
- Practical Applications: The model's open weights and strong performance allow artists, researchers, and developers to use and fine-tune it for niche applications, advancing both creative and academic work.
- Legal and Ethical Considerations: Training exclusively on openly licensed data sets a precedent for legally transparent model releases and clarifies how model outputs can be shared and used.
- Future Developments: Enhancements that address current limitations, such as better handling of longer sequences, more complex prompts, and multilingual capabilities, are likely directions for broadening the model's versatility and application scope.
Limitations
The authors acknowledge limitations in handling complex prompts containing connectors and in generating intelligible speech, marking these as areas for refinement. In addition, music generation quality is constrained by the limited amount of high-quality music available under CC licenses.
Conclusion
In conclusion, the paper sets a precedent for responsibly releasing high-performing generative models, balancing openness, quality, and legal compliance. Future iterations may focus on overcoming the identified limitations and extending the model to other languages and more complex audio requirements. Stable Audio Open stands as a valuable resource for advancing AI-driven audio generation.