- The paper presents a two-stage auto-regressive model that generates audio from text, combining a learned discrete audio representation with a pre-trained T5 text encoder.
- It employs an audio-text mixing augmentation and classifier-free guidance to improve the model's handling of overlapping sound sources and its adherence to the input text.
- Empirical results show a lower Fréchet Audio Distance and better subjective ratings than the baselines, underlining its potential in multimedia applications.
Overview of "AudioGen: Textually Guided Audio Generation"
The paper "AudioGen: Textually Guided Audio Generation" presents a sophisticated approach to generating audio samples based on descriptive text captions. It introduces AudioGen, an auto-regressive model that exploits a learned discrete audio representation for effective text-to-audio generation. AudioGen addresses several inherent challenges in this domain, such as the ambiguity in differentiating sound objects due to overlapping sounds, the complexity introduced by real-world audio distortions, and the scarcity of paired audio-text datasets.
Core Contributions
- Model Architecture: AudioGen consists of two primary stages: a neural audio compression model that encodes raw audio into discrete tokens, and a Transformer-decoder language model that operates over these tokens, conditioned on the text, to generate audio token sequences. The text representation is derived from a pre-trained T5 encoder, which helps the model generalize to novel text concepts (a minimal pipeline sketch appears after this list).
- Augmentation Techniques: The paper introduces an audio-text mixing augmentation that blends pairs of audio samples and merges their captions, teaching the model to handle complex audio compositions and to better separate overlapping sources (see the second sketch below).
- Classifier-Free Guidance: The paper adopts classifier-free guidance, blending the model's conditional and unconditional predictions at sampling time with a guidance scale; this strengthens adherence to the input text and significantly improves how faithfully the generated audio matches the description (see the third sketch below).
- Multi-stream Audio Representation: To cope with the long token sequences that high-quality audio produces, a multi-stream modeling strategy represents the audio with several parallel token streams at a lower frame rate, shortening the sequences the Transformer must process without compromising audio quality or overall bitrate.
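The sketch below illustrates how the two stages described above might fit together, assuming a SoundStream-style compression model has already turned the raw waveform into discrete tokens. The class `AudioTokenLM`, the layer sizes, and the way the T5 embeddings feed the decoder's cross-attention are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the two-stage text-to-audio pipeline. The compression model
# is assumed to exist elsewhere; here its output is a stand-in tensor of tokens.
import torch
import torch.nn as nn
from transformers import T5EncoderModel, T5Tokenizer

class AudioTokenLM(nn.Module):
    """Transformer decoder over discrete audio tokens, conditioned on T5 text embeddings."""
    def __init__(self, vocab_size=1024, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.text_proj = nn.Linear(768, d_model)   # T5-base hidden size -> d_model
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, audio_tokens, text_hidden):
        x = self.embed(audio_tokens)                                   # (B, T, d_model)
        causal = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf")), diagonal=1)
        memory = self.text_proj(text_hidden)                           # cross-attention memory
        h = self.decoder(x, memory, tgt_mask=causal)
        return self.head(h)                                            # next-token logits

# Text conditioning from a frozen, pre-trained T5 encoder.
tok = T5Tokenizer.from_pretrained("t5-base")
t5 = T5EncoderModel.from_pretrained("t5-base").eval()
caption = ["a dog barks while a siren wails in the distance"]
with torch.no_grad():
    text_hidden = t5(**tok(caption, return_tensors="pt")).last_hidden_state

lm = AudioTokenLM()
audio_tokens = torch.randint(0, 1024, (1, 250))   # stand-in for compressed audio tokens
logits = lm(audio_tokens, text_hidden)            # (1, 250, 1024): next-token predictions
```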
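The mixing augmentation itself is simple to picture: two (waveform, caption) pairs become one training example. In the sketch below, the random gain range and the joining of captions with "and" are illustrative choices rather than the paper's exact recipe.

```python
# Hedged sketch of the audio-text mixing augmentation.
import torch

def mix_audio_text(wav_a, caption_a, wav_b, caption_b, low=0.3, high=0.7):
    """Mix two mono waveforms of equal length and merge their captions."""
    gain = torch.empty(1).uniform_(low, high)           # relative weight of source A (assumed range)
    mixed = gain * wav_a + (1.0 - gain) * wav_b
    mixed = mixed / mixed.abs().max().clamp(min=1e-8)   # renormalize to avoid clipping
    merged_caption = f"{caption_a} and {caption_b}"     # assumed caption-joining scheme
    return mixed, merged_caption

wav_a, wav_b = torch.randn(16000), torch.randn(16000)   # 1 s placeholders at 16 kHz
mixed, caption = mix_audio_text(wav_a, "a dog barking", wav_b, "rain falling on a roof")
print(caption)   # "a dog barking and rain falling on a roof"
```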
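Classifier-free guidance is usually realized by occasionally dropping the text condition during training and, at sampling time, blending the conditional and unconditional predictions. The helper below sketches that sampling step, reusing `AudioTokenLM` from the first sketch; the guidance scale of 3.0, the empty-caption "null" condition, and the blending of raw logits are assumptions based on common practice rather than the paper's exact settings.

```python
# Hedged sketch of classifier-free guidance at sampling time.
import torch
import torch.nn.functional as F

def guided_next_token(lm, audio_tokens, text_hidden, null_hidden, scale=3.0):
    """Sample the next audio token, pushing the prediction toward the text condition."""
    logits_cond = lm(audio_tokens, text_hidden)[:, -1]    # conditioned on the caption
    logits_uncond = lm(audio_tokens, null_hidden)[:, -1]  # conditioned on an empty caption
    logits = logits_uncond + scale * (logits_cond - logits_uncond)
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)        # (B, 1) next-token ids
```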
Empirical Evaluation
The experimental section provides a comprehensive evaluation using objective metrics such as Fréchet Audio Distance (FAD) and KL-divergence, alongside subjective human assessments of audio quality and relevance to the text. AudioGen outperforms the DiffSound baseline across all evaluated metrics, and ablations with and without the mixing augmentation show clear gains in both text adherence and overall sound quality.
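For reference, FAD is the Fréchet distance between Gaussians fitted to embeddings of reference and generated audio (VGGish embeddings in the original formulation); lower values indicate that the generated distribution lies closer to the real one. A minimal sketch of the computation, taking pre-computed embedding matrices as input, is shown below.

```python
# Sketch of the Fréchet Audio Distance between two sets of audio embeddings.
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_ref, emb_gen):
    """emb_ref, emb_gen: (num_clips, embedding_dim) arrays of audio embeddings."""
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can produce a complex sqrtm
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

rng = np.random.default_rng(0)
fad = frechet_audio_distance(rng.normal(size=(200, 128)), rng.normal(size=(200, 128)))
print(f"FAD: {fad:.3f}")
```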
Audio Continuation and Compositionality
AudioGen also extends to audio continuation tasks, generating subsequent audio segments either conditioned on text or unconditionally. Empirical analyses show that a brief audio prompt combined with textual input markedly improves the model's ability to produce coherent continuations consistent with the given description (a short sketch follows).
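One way to picture continuation, under the same assumptions as the earlier sketches: encode the audio prompt into tokens, let the decoder extend the token sequence (here with the classifier-free-guidance helper from above), and map the result back to a waveform. `audio_encoder` and `audio_decoder` are hypothetical stand-ins for the learned compression model's encoder and decoder.

```python
# Hedged sketch of text-conditioned audio continuation from a short prompt.
import torch

@torch.no_grad()
def continue_audio(lm, audio_encoder, audio_decoder, prompt_wav, text_hidden,
                   null_hidden, new_tokens=500, scale=3.0):
    tokens = audio_encoder(prompt_wav)              # (1, T_prompt) discrete tokens from the prompt
    for _ in range(new_tokens):                     # autoregressively extend the sequence
        nxt = guided_next_token(lm, tokens, text_hidden, null_hidden, scale)
        tokens = torch.cat([tokens, nxt], dim=1)
    return audio_decoder(tokens)                    # decode the full token sequence to a waveform
```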
Limitations and Future Directions
Despite its strengths, AudioGen must contend with the long token sequences inherent in high-resolution audio generation, which affects inference time and scalability. Future work might explore more efficient sequence modeling techniques, additional augmentation strategies to strengthen compositional abilities, and broader dataset coverage across demographics to mitigate biases inherent in the current datasets.
Conclusion
The development of AudioGen marks a noteworthy advancement in text-guided audio generation, providing a robust framework capable of producing high-quality, diverse audio outputs from text descriptions. Its methodology paves the way for future research into more nuanced and scalable text-to-audio systems, widening the applicability of AI-driven content creation tools in various digital soundscapes, including movies, video games, and other multimedia applications.