- The paper presents a two-stage auto-regressive model that generates audio from text, combining a learned discrete audio representation with a pre-trained T5 text encoder.
- It employs an audio-text mixing augmentation and classifier-free guidance to improve the model's handling of overlapping sound sources and its adherence to the input text.
- Empirical results show a lower Fréchet Audio Distance and better subjective ratings than the baselines, underlining its potential in multimedia applications.
Overview of "AudioGen: Textually Guided Audio Generation"
The paper "AudioGen: Textually Guided Audio Generation" presents a sophisticated approach to generating audio samples based on descriptive text captions. It introduces AudioGen, an auto-regressive model that exploits a learned discrete audio representation for effective text-to-audio generation. AudioGen addresses several inherent challenges in this domain, such as the ambiguity in differentiating sound objects due to overlapping sounds, the complexity introduced by real-world audio distortions, and the scarcity of paired audio-text datasets.
Core Contributions
- Model Architecture: AudioGen consists of two primary stages: a neural audio compression model that encodes raw audio into discrete tokens, and a Transformer-decoder language model that operates over these tokens, conditioned on the text, to generate audio token sequences. The text representation is derived from a pre-trained T5 encoder, which helps the model generalize to novel text concepts (a minimal pipeline sketch appears after this list).
- Augmentation Techniques: The paper introduces an audio-text mixing augmentation that blends pairs of audio samples and merges their captions, teaching the model to handle complex audio compositions and to better separate overlapping sources (see the second sketch below).
- Classifier-Free Guidance: The paper adopts classifier-free guidance, blending the model's conditional and unconditional predictions at sampling time with a guidance scale; this strengthens adherence to the input text and significantly improves how faithfully the generated audio matches the description (see the third sketch below).
- Multi-stream Audio Representation: To cope with the long token sequences that high-quality audio produces, a multi-stream modeling strategy represents the audio with several parallel token streams at a lower frame rate, shortening the sequences the Transformer must process without compromising audio quality or overall bitrate.
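The sketch below illustrates how the two stages described above might fit together, assuming a SoundStream-style compression model has already turned the raw waveform into discrete tokens. The class `AudioTokenLM`, the layer sizes, and the way the T5 embeddings feed the decoder's cross-attention are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the two-stage text-to-audio pipeline. The compression model
# is assumed to exist elsewhere; here its output is a stand-in tensor of tokens.
import torch
import torch.nn as nn
from transformers import T5EncoderModel, T5Tokenizer

class AudioTokenLM(nn.Module):
    """Transformer decoder over discrete audio tokens, conditioned on T5 text embeddings."""
    def __init__(self, vocab_size=1024, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.text_proj = nn.Linear(768, d_model)   # T5-base hidden size -> d_model
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, audio_tokens, text_hidden):
        x = self.embed(audio_tokens)                                   # (B, T, d_model)
        causal = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf")), diagonal=1)
        memory = self.text_proj(text_hidden)                           # cross-attention memory
        h = self.decoder(x, memory, tgt_mask=causal)
        return self.head(h)                                            # next-token logits

# Text conditioning from a frozen, pre-trained T5 encoder.
tok = T5Tokenizer.from_pretrained("t5-base")
t5 = T5EncoderModel.from_pretrained("t5-base").eval()
caption = ["a dog barks while a siren wails in the distance"]
with torch.no_grad():
    text_hidden = t5(**tok(caption, return_tensors="pt")).last_hidden_state

lm = AudioTokenLM()
audio_tokens = torch.randint(0, 1024, (1, 250))   # stand-in for compressed audio tokens
logits = lm(audio_tokens, text_hidden)            # (1, 250, 1024): next-token predictions
```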
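The mixing augmentation itself is simple to picture: two (waveform, caption) pairs become one training example. In the sketch below, the random gain range and the joining of captions with "and" are illustrative choices rather than the paper's exact recipe.

```python
# Hedged sketch of the audio-text mixing augmentation.
import torch

def mix_audio_text(wav_a, caption_a, wav_b, caption_b, low=0.3, high=0.7):
    """Mix two mono waveforms of equal length and merge their captions."""
    gain = torch.empty(1).uniform_(low, high)           # relative weight of source A (assumed range)
    mixed = gain * wav_a + (1.0 - gain) * wav_b
    mixed = mixed / mixed.abs().max().clamp(min=1e-8)   # renormalize to avoid clipping
    merged_caption = f"{caption_a} and {caption_b}"     # assumed caption-joining scheme
    return mixed, merged_caption

wav_a, wav_b = torch.randn(16000), torch.randn(16000)   # 1 s placeholders at 16 kHz
mixed, caption = mix_audio_text(wav_a, "a dog barking", wav_b, "rain falling on a roof")
print(caption)   # "a dog barking and rain falling on a roof"
```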
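Classifier-free guidance is usually realized by occasionally dropping the text condition during training and, at sampling time, blending the conditional and unconditional predictions. The helper below sketches that sampling step, reusing `AudioTokenLM` from the first sketch; the guidance scale of 3.0, the empty-caption "null" condition, and the blending of raw logits are assumptions based on common practice rather than the paper's exact settings.

```python
# Hedged sketch of classifier-free guidance at sampling time.
import torch
import torch.nn.functional as F

def guided_next_token(lm, audio_tokens, text_hidden, null_hidden, scale=3.0):
    """Sample the next audio token, pushing the prediction toward the text condition."""
    logits_cond = lm(audio_tokens, text_hidden)[:, -1]    # conditioned on the caption
    logits_uncond = lm(audio_tokens, null_hidden)[:, -1]  # conditioned on an empty caption
    logits = logits_uncond + scale * (logits_cond - logits_uncond)
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)        # (B, 1) next-token ids
```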
Empirical Evaluation
The experimental section provides a comprehensive evaluation using objective metrics such as Fréchet Audio Distance (FAD) and KL-divergence, alongside subjective human assessments of audio quality and relevance to the text. AudioGen outperforms the DiffSound baseline across all evaluated metrics, and ablations with and without the mixing augmentation show clear gains in both text adherence and overall sound quality.
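For reference, FAD is the Fréchet distance between Gaussians fitted to embeddings of reference and generated audio (VGGish embeddings in the original formulation); lower values indicate that the generated distribution lies closer to the real one. A minimal sketch of the computation, taking pre-computed embedding matrices as input, is shown below.

```python
# Sketch of the Fréchet Audio Distance between two sets of audio embeddings.
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_ref, emb_gen):
    """emb_ref, emb_gen: (num_clips, embedding_dim) arrays of audio embeddings."""
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can produce a complex sqrtm
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

rng = np.random.default_rng(0)
fad = frechet_audio_distance(rng.normal(size=(200, 128)), rng.normal(size=(200, 128)))
print(f"FAD: {fad:.3f}")
```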
Audio Continuation and Compositionality
AudioGen also extends to audio continuation tasks, generating subsequent audio segments either conditioned on text or unconditionally. Empirical analyses show that a brief audio prompt combined with textual input markedly improves the model's ability to produce coherent continuations consistent with the given description (a short sketch follows).
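One way to picture continuation, under the same assumptions as the earlier sketches: encode the audio prompt into tokens, let the decoder extend the token sequence (here with the classifier-free-guidance helper from above), and map the result back to a waveform. `audio_encoder` and `audio_decoder` are hypothetical stand-ins for the learned compression model's encoder and decoder.

```python
# Hedged sketch of text-conditioned audio continuation from a short prompt.
import torch

@torch.no_grad()
def continue_audio(lm, audio_encoder, audio_decoder, prompt_wav, text_hidden,
                   null_hidden, new_tokens=500, scale=3.0):
    tokens = audio_encoder(prompt_wav)              # (1, T_prompt) discrete tokens from the prompt
    for _ in range(new_tokens):                     # autoregressively extend the sequence
        nxt = guided_next_token(lm, tokens, text_hidden, null_hidden, scale)
        tokens = torch.cat([tokens, nxt], dim=1)
    return audio_decoder(tokens)                    # decode the full token sequence to a waveform
```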
Limitations and Future Directions
Despite its strengths, AudioGen must contend with the long token sequences inherent in high-resolution audio generation, which affects inference time and scalability. Future work might explore more efficient sequence modeling techniques, additional augmentation strategies to strengthen compositional abilities, and broader dataset coverage across demographics to mitigate biases inherent in the current datasets.
Conclusion
The development of AudioGen marks a noteworthy advancement in text-guided audio generation, providing a robust framework capable of producing high-quality, diverse audio outputs from text descriptions. Its methodology paves the way for future research into more nuanced and scalable text-to-audio systems, widening the applicability of AI-driven content creation tools in various digital soundscapes, including movies, video games, and other multimedia applications.