Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation (2401.01044v1)
Abstract: Recent advancements in diffusion models and large language models (LLMs) have significantly propelled the field of AI-generated content (AIGC). Text-to-Audio (TTA), a burgeoning AIGC application that generates audio from natural language prompts, is attracting increasing attention. However, existing TTA studies often struggle with generation quality and text-audio alignment, especially for complex textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I) diffusion models, we introduce Auffusion, a TTA system that adapts T2I model frameworks to the TTA task, effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. Furthermore, prior T2I studies recognize the significant impact of the text encoder choice on cross-modal alignment, such as fine-grained details and object bindings, yet a comparable evaluation is lacking in previous TTA work. Through comprehensive ablation studies and innovative cross-attention map visualizations, we provide insightful assessments of text-audio alignment in TTA. Our findings reveal Auffusion's superior capability in generating audio that accurately matches textual descriptions, which is further demonstrated in several related tasks, such as audio style transfer, inpainting, and other manipulations. Our implementation and demos are available at https://auffusion.github.io.
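To make the T2I-to-TTA adaptation concrete, the sketch below illustrates the kind of classifier-free-guidance sampling loop such latent diffusion pipelines typically run over mel-spectrogram latents, which a VAE decoder and vocoder would then turn into a waveform. It is a minimal, self-contained illustration under stated assumptions: `TinyUNet`, the tensor shapes, and the schedule are hypothetical placeholders, not the released Auffusion implementation or its API.

```python
# Minimal sketch of classifier-free-guidance sampling for text-to-audio,
# in the spirit of adapting a T2I latent diffusion pipeline to TTA.
# All module names and shapes are illustrative placeholders, NOT the
# released Auffusion code: a real system would plug in a pretrained text
# encoder, cross-attention UNet, VAE decoder, and a neural vocoder.

import torch
import torch.nn as nn


class TinyUNet(nn.Module):
    """Placeholder noise predictor eps_theta(x_t, t, text_emb); stands in
    for a pretrained cross-attention UNet over mel-spectrogram latents."""

    def __init__(self, channels: int = 8, text_dim: int = 32):
        super().__init__()
        self.proj_text = nn.Linear(text_dim, channels)
        self.net = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x_t, t, text_emb):
        # Inject a crude text conditioning signal and predict noise.
        cond = self.proj_text(text_emb).mean(dim=1)[:, :, None, None]
        return self.net(x_t + cond)


@torch.no_grad()
def sample_mel_latent(unet, text_emb, null_emb, steps=50, guidance=3.0,
                      latent_shape=(1, 8, 16, 64)):
    """DDPM-style ancestral sampling with classifier-free guidance.
    Returns a denoised latent; a full pipeline would decode it to a
    mel spectrogram and vocode it to audio."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(latent_shape)
    for t in reversed(range(steps)):
        t_batch = torch.full((latent_shape[0],), t)
        # Two forward passes: text-conditioned and unconditional (empty prompt).
        eps_cond = unet(x, t_batch, text_emb)
        eps_uncond = unet(x, t_batch, null_emb)
        # Classifier-free guidance: push toward the text-conditioned direction.
        eps = eps_uncond + guidance * (eps_cond - eps_uncond)

        # Standard DDPM posterior mean; add noise except at the final step.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x


if __name__ == "__main__":
    unet = TinyUNet()
    text_emb = torch.randn(1, 10, 32)  # stand-in for encoded prompt tokens
    null_emb = torch.zeros(1, 10, 32)  # stand-in for the empty-prompt embedding
    latent = sample_mel_latent(unet, text_emb, null_emb)
    print(latent.shape)  # torch.Size([1, 8, 16, 64])
```

The two-pass structure (conditional versus unconditional noise prediction) is also what makes cross-attention maps between prompt tokens and the spectrogram latent available for the kind of text-audio alignment visualization the abstract describes.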
Authors: Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li