WavJourney: Compositional Audio Creation with Large Language Models (2307.14335v2)

Published 26 Jul 2023 in cs.SD, cs.AI, cs.MM, and eess.AS

Abstract: Despite breakthroughs in audio generation models, their capabilities are often confined to domain-specific conditions such as speech transcriptions and audio captions. However, real-world audio creation aims to generate harmonious audio containing various elements such as speech, music, and sound effects with controllable conditions, which is challenging to address using existing audio generation systems. We present WavJourney, a novel framework that leverages LLMs to connect various audio models for audio creation. WavJourney allows users to create storytelling audio content with diverse audio elements simply from textual descriptions. Specifically, given a text instruction, WavJourney first prompts LLMs to generate an audio script that serves as a structured semantic representation of audio elements. The audio script is then converted into a computer program, where each line of the program calls a task-specific audio generation model or computational operation function. The computer program is then executed to obtain a compositional and interpretable solution for audio creation. Experimental results suggest that WavJourney is capable of synthesizing realistic audio aligned with textually-described semantic, spatial and temporal conditions, achieving state-of-the-art results on text-to-audio generation benchmarks. Additionally, we introduce a new multi-genre story benchmark. Subjective evaluations demonstrate the potential of WavJourney in crafting engaging storytelling audio content from text. We further demonstrate that WavJourney can facilitate human-machine co-creation in multi-round dialogues. To foster future research, the code and synthesized audio are available at: https://audio-agi.github.io/WavJourney_demopage/.
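The abstract describes a three-stage pipeline: an LLM drafts a structured audio script from the text instruction, the script is converted into a program whose lines call task-specific audio models or computational operations, and the program is executed to produce the final mix. Below is a minimal Python sketch of that flow; the script schema, function names (llm_generate_audio_script, generate_speech, render), and placeholder waveforms are illustrative assumptions, not the paper's actual interface or backend models.

```python
# Sketch of WavJourney-style compositional audio creation.
# All names and the script schema are hypothetical placeholders.

import numpy as np

SAMPLE_RATE = 16_000

# Step 1: an LLM turns a text instruction into a structured audio script.
# Here we hard-code the kind of script an LLM might return.
def llm_generate_audio_script(instruction: str) -> list[dict]:
    return [
        {"type": "music",  "text": "calm piano intro",               "start": 0.0, "length": 4.0},
        {"type": "speech", "text": "Welcome to the night forest.",   "start": 1.0, "length": 2.5},
        {"type": "sfx",    "text": "owl hooting in the distance",    "start": 3.0, "length": 1.5},
    ]

# Step 2: task-specific generators, standing in for TTS, text-to-music,
# and text-to-audio models. Real systems would call pretrained models here;
# these stubs just return silent waveforms of the requested length.
def generate_speech(text: str, length: float) -> np.ndarray:
    return np.zeros(int(length * SAMPLE_RATE))

def generate_music(text: str, length: float) -> np.ndarray:
    return np.zeros(int(length * SAMPLE_RATE))

def generate_sfx(text: str, length: float) -> np.ndarray:
    return np.zeros(int(length * SAMPLE_RATE))

GENERATORS = {"speech": generate_speech, "music": generate_music, "sfx": generate_sfx}

# Step 3: "compile" the script into model calls and mix the results on a timeline,
# which is what executing the generated program amounts to.
def render(script: list[dict]) -> np.ndarray:
    total = max(item["start"] + item["length"] for item in script)
    mix = np.zeros(int(total * SAMPLE_RATE))
    for item in script:
        wav = GENERATORS[item["type"]](item["text"], item["length"])
        offset = int(item["start"] * SAMPLE_RATE)
        mix[offset:offset + len(wav)] += wav  # overlay at the scripted start time
    return mix

if __name__ == "__main__":
    script = llm_generate_audio_script("A calm night-forest scene with narration.")
    audio = render(script)
    print(f"Rendered {len(audio) / SAMPLE_RATE:.1f} s of audio from {len(script)} script items.")
```

Because the intermediate script is explicit and each line maps to a single model call, the pipeline remains interpretable and individual elements can be regenerated or edited in multi-round dialogue, as the paper highlights.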

Authors (11)
  1. Xubo Liu (66 papers)
  2. Zhongkai Zhu (3 papers)
  3. Haohe Liu (59 papers)
  4. Yi Yuan (54 papers)
  5. Meng Cui (8 papers)
  6. Qiushi Huang (23 papers)
  7. Jinhua Liang (15 papers)
  8. Yin Cao (24 papers)
  9. Qiuqiang Kong (86 papers)
  10. Mark D. Plumbley (114 papers)
  11. Wenwu Wang (148 papers)
Citations (19)