Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation (2401.01044v1)

Published 2 Jan 2024 in cs.SD, cs.AI, cs.CL, and eess.AS

Abstract: Recent advancements in diffusion models and LLMs have significantly propelled the field of AI-generated content (AIGC). Text-to-Audio (TTA), a burgeoning AIGC application that generates audio from natural language prompts, is attracting increasing attention. However, existing TTA studies often struggle with generation quality and text-audio alignment, especially for complex textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I) diffusion models, we introduce Auffusion, a TTA system that adapts T2I model frameworks to the TTA task, effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. Furthermore, previous T2I studies recognize the significant impact of encoder choice on cross-modal alignment, such as fine-grained details and object bindings, whereas comparable evaluation is lacking in prior TTA work. Through comprehensive ablation studies and innovative cross-attention map visualizations, we provide insightful assessments of text-audio alignment in TTA. Our findings reveal Auffusion's superior capability in generating audio that accurately matches textual descriptions, which is further demonstrated in several related tasks, such as audio style transfer, inpainting, and other manipulations. Our implementation and demos are available at https://auffusion.github.io.

Authors (4)
  1. Jinlong Xue (9 papers)
  2. Yayue Deng (9 papers)
  3. Yingming Gao (15 papers)
  4. Ya Li (79 papers)
Citations (17)