WavCraft: Audio Editing and Generation with Large Language Models (2403.09527v3)

Published 14 Mar 2024 in eess.AS

Abstract: We introduce WavCraft, a collective system that leverages LLMs to connect diverse task-specific models for audio content creation and editing. Specifically, WavCraft describes the content of raw audio materials in natural language and prompts the LLM conditioned on audio descriptions and user requests. WavCraft leverages the in-context learning ability of the LLM to decompose users' instructions into several tasks and tackle each task collaboratively with a particular module. Through task decomposition along with a set of task-specific models, WavCraft follows the input instruction to create or edit audio content with more details and rationales, facilitating user control. In addition, WavCraft is able to cooperate with users via dialogue interaction and even produce audio content without explicit user commands. Experiments demonstrate that WavCraft yields better performance than existing methods, especially when adjusting the local regions of audio clips. Moreover, WavCraft can follow complex instructions to edit and create audio content on top of input recordings, facilitating audio producers in a broader range of applications. Our implementation and demos are available at https://github.com/JinhuaLiang/WavCraft.

WavCraft: Unveiling a New Horizon in Audio Content Creation and Editing via Natural Language Prompts

Introduction to WavCraft

WavCraft emerges as a cohesive system that integrates LLMs with an array of task-specific models tailored for audio content creation and editing. This approach stands out for its ability to interpret and process raw sound materials through natural language descriptions, paving the way for a new paradigm in audio manipulation. By leveraging the intrinsic in-context learning capabilities of LLMs, WavCraft decomposes complex user instructions into manageable tasks, each addressed collaboratively with specialized audio modules. Such decomposition not only refines the process of creating or editing audio content but also enhances user control through detailed task execution.

Audio Analysis and Task Decomposition

At the heart of WavCraft's operation lies the audio analysis module, which is tasked with translating the essence of input audio clips into natural language descriptors. This process, crucial for understanding the content within audio files, allows the system to respond appropriately to users' commands by generating relevant instructions that are then passed on to an audio programmer module. That module utilizes LLMs to dissect user instructions into basic tasks, each of which is tackled using a suite of expert models designed for specific audio operations. This structured approach to task decomposition underpins WavCraft's versatility in audio content manipulation.
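
To make the flow concrete, here is a minimal sketch of this analyze-then-decompose loop. All names (`caption_audio`, `plan_tasks`, `run_task`) and the task schema are illustrative assumptions for exposition, not WavCraft's actual API:

```python
# Minimal sketch of an analyze-then-decompose loop in the spirit of
# WavCraft. Function names and the task schema are hypothetical.

def caption_audio(path: str) -> str:
    """Audio analysis module: describe the clip in natural language.
    A real system would invoke an audio captioning model here."""
    return "A dog barks twice while rain falls in the background."

def plan_tasks(description: str, instruction: str) -> list[dict]:
    """Audio programmer module: an LLM would decompose the user
    instruction into elementary tasks, conditioned on the caption.
    Hard-coded here to keep the sketch self-contained and runnable."""
    return [
        {"op": "separate", "target": "dog barking"},
        {"op": "generate", "prompt": "heavy thunderstorm", "duration": 10.0},
        {"op": "mix", "inputs": ["separated.wav", "generated.wav"]},
    ]

def run_task(task: dict) -> str:
    """Dispatch one elementary task to the matching expert model."""
    print(f"running expert model for: {task['op']}")
    return f"{task['op']}.wav"

description = caption_audio("input.wav")
plan = plan_tasks(description, "Replace the rain with a thunderstorm.")
outputs = [run_task(task) for task in plan]
```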

Expert Models and Modular Approach

WavCraft's strength lies in its ensemble of audio generation and transformation models, rendering it adept at performing a wide array of audio tasks. From text-to-audio conversion to source separation and beyond, the system employs models such as AudioGen and MusicGen for generating high-fidelity audio content. Additional functionalities such as super-resolution enhancement, audio infilling, and DSP operations further augment WavCraft's capabilities. This modular construction offers substantial flexibility, allowing for the incorporation or substitution of expert models as desired.
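
One way to picture this pluggability is a simple registry pattern, sketched below under the assumption that each expert model exposes a uniform callable interface; the registry and the stub functions are illustrative, not part of WavCraft's published code:

```python
# Sketch of a pluggable expert-model registry. Experts can be added
# or swapped without touching the dispatcher. Hypothetical design.
from typing import Callable

EXPERTS: dict[str, Callable[..., str]] = {}

def register(name: str):
    """Decorator that registers an expert model under a task name."""
    def wrap(fn: Callable[..., str]) -> Callable[..., str]:
        EXPERTS[name] = fn
        return fn
    return wrap

@register("text_to_audio")
def audiogen_stub(prompt: str) -> str:
    # Stand-in for a text-to-audio model such as AudioGen.
    return f"audio generated for: {prompt}"

@register("text_to_music")
def musicgen_stub(prompt: str) -> str:
    # Stand-in for a text-to-music model such as MusicGen.
    return f"music generated for: {prompt}"

print(EXPERTS["text_to_audio"]("glass shattering"))
```

Swapping AudioGen for a newer text-to-audio model then amounts to registering a different callable under the same task name.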

Advanced Features and Future Prospects

WavCraft distinguishes itself through several advanced features that underscore its potential to revolutionize audio content creation:

  • Modular Operations: By breaking down complex instructions into elementary tasks, WavCraft can handle intricate editing applications in an explainable manner, enhancing transparency and ease of use; a toy decomposition is sketched after this list.
  • Controllable Editing: The system's profound understanding of user requests enables it to edit targeted audio attributes meticulously while preserving the integrity of the remaining content.
  • Human-AI Co-Creation: WavCraft's design facilitates interactive content creation, allowing for multi-round refinement with users. This co-creative process benefits from the system's ability to maintain consistency throughout the generated audio content.
  • Audio Scriptwriting: Perhaps most intriguingly, WavCraft exhibits the capacity to autonomously generate audio content following high-level outlines, demonstrating a form of creativity hitherto unseen in audio manipulation tools.
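
As a concrete illustration of the modular, controllable editing described above, here is a toy decomposition of a local-edit request ("make the siren in the last 3 seconds louder, leave everything else unchanged"). The task schema and operation names are hypothetical, chosen only to show the shape of such a plan:

```python
# Toy decomposition of a local edit. The schema is hypothetical,
# not WavCraft's actual intermediate format.
import json

plan = [
    {"op": "crop",     "start": -3.0, "end": None},  # isolate the final 3 s
    {"op": "separate", "target": "siren"},           # pull out the siren source
    {"op": "gain",     "db": 6.0},                   # boost only that source
    {"op": "remix"},                                 # recombine the sources
    {"op": "splice",   "at": -3.0},                  # paste the region back
]
print(json.dumps(plan, indent=2))
```

Because each step touches only one attribute of one region, the rest of the recording passes through untouched, which is what the "controllable editing" property amounts to in practice.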

Limitations and Areas for Improvement

Despite its impressive capabilities, WavCraft is not without its limitations. The performance of audio analysis models, critical for accurately interpreting audio content, currently restricts the system's effectiveness. Moreover, the inference speed, owing to the need to consult multiple expert models for complex tasks, could benefit from optimization to ensure smoother interaction and usability in practical applications.

Conclusion

WavCraft represents a significant stride forward in the field of artificial intelligence-generated content (AIGC), offering a sophisticated tool for audio content creation and editing through natural language prompts. Its ability to interpret user instructions and raw audio content, decompose tasks, and utilize expert models for specific operations positions it as a versatile and powerful tool in audio production. As research in this field continues to advance, the potential applications and improvements of systems like WavCraft promise to further expand the boundaries of what is possible in audio content creation.

References (34)
  1. MusicLM: Generating Music From Text, January 2023. URL http://arxiv.org/abs/2301.11325. arXiv:2301.11325 [cs, eess].
  2. Flamingo: A Visual Language Model for Few-Shot Learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022.
  3. AudioLM: a Language Modeling Approach to Audio Generation, September 2022. URL http://arxiv.org/abs/2209.03143. arXiv:2209.03143 [cs, eess].
  4. Simple and Controllable Music Generation, November 2023. URL http://arxiv.org/abs/2306.05284. arXiv:2306.05284 [cs, eess].
  5. Pengi: An Audio Language Model for Audio Tasks, May 2023. arXiv:2305.11834.
  6. Recursive Visual Programming, December 2023. URL http://arxiv.org/abs/2312.02249. arXiv:2312.02249 [cs].
  7. Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  776–780, New Orleans, LA, March 2017. IEEE. ISBN 978-1-5090-4117-6. doi: 10.1109/ICASSP.2017.7952261.
  8. Listen, Think, and Understand, May 2023. arXiv:2305.10790.
  9. Visual Programming: Compositional Visual Reasoning Without Training. pp.  14953–14962, 2023. URL https://openaccess.thecvf.com/content/CVPR2023/html/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.html.
  10. AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head, April 2023. URL http://arxiv.org/abs/2304.12995. arXiv:2304.12995 [cs, eess].
  11. International Telecommunication Union. ITU-R BS.1770-4: Algorithms to measure audio programme loudness and true-peak audio level, 2020. URL https://www.itu.int/rec/R-REC-BS.1770.
  12. Efficient Training of Audio Transformers with Patchout, March 2022. URL http://arxiv.org/abs/2110.05069. arXiv:2110.05069 [cs, eess].
  13. AudioGen: Textually Guided Audio Generation. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=CYK7RfcOzQ4.
  14. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, January 2023. arXiv:2301.12597.
  15. Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities, November 2023. URL http://arxiv.org/abs/2312.00249. arXiv:2312.00249 [eess].
  16. AudioSR: Versatile Audio Super-resolution at Scale, September 2023a. URL https://arxiv.org/abs/2309.07314v1.
  17. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp.  21450–21474. PMLR, July 2023b. URL https://proceedings.mlr.press/v202/liu23f.html.
  18. LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents, November 2023c. URL http://arxiv.org/abs/2311.05437. arXiv:2311.05437 [cs].
  19. Separate Anything You Describe, August 2023d. URL http://arxiv.org/abs/2308.05037. arXiv:2308.05037 [cs, eess].
  20. WavJourney: Compositional Audio Creation with Large Language Models, July 2023e. URL http://arxiv.org/abs/2307.14335. arXiv:2307.14335 [cs, eess].
  21. WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research, March 2023. arXiv:2303.17395.
  22. OpenAI. GPT-4 Technical Report, March 2023.
  23. Communicative Agents for Software Development, December 2023. URL http://arxiv.org/abs/2307.07924. arXiv:2307.07924 [cs].
  24. Toolformer: Language Models Can Teach Themselves to Use Tools, February 2023. URL http://arxiv.org/abs/2302.04761. arXiv:2302.04761 [cs].
  25. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace, March 2023. URL http://arxiv.org/abs/2303.17580. arXiv:2303.17580 [cs].
  26. ViperGPT: Visual Inference via Python Execution for Reasoning, March 2023. URL http://arxiv.org/abs/2303.08128. arXiv:2303.08128 [cs].
  27. Audiobox: Unified Audio Generation with Natural Language Prompts.
  28. Llama 2: Open Foundation and Fine-Tuned Chat Models, July 2023. arXiv:2307.09288.
  29. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers, January 2023a. URL http://arxiv.org/abs/2301.02111. arXiv:2301.02111 [cs, eess].
  30. AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models, April 2023b. URL http://arxiv.org/abs/2304.00830. arXiv:2304.00830 [cs, eess].
  31. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, January 2023a. URL http://arxiv.org/abs/2201.11903. arXiv:2201.11903 [cs].
  32. Larger language models do in-context learning differently, March 2023b. URL http://arxiv.org/abs/2303.03846. arXiv:2303.03846 [cs].
  33. Torchaudio: Building Blocks for Audio and Speech Processing. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6982–6986, May 2022. doi: 10.1109/ICASSP43922.2022.9747236. URL https://ieeexplore.ieee.org/document/9747236. ISSN: 2379-190X.
  34. Explainability for Large Language Models: A Survey. ACM Transactions on Intelligent Systems and Technology, January 2024. ISSN 2157-6904. doi: 10.1145/3639372. URL https://dl.acm.org/doi/10.1145/3639372. Just Accepted.
Authors (10)
  1. Jinhua Liang (15 papers)
  2. Huan Zhang (171 papers)
  3. Haohe Liu (59 papers)
  4. Yin Cao (24 papers)
  5. Qiuqiang Kong (86 papers)
  6. Xubo Liu (66 papers)
  7. Wenwu Wang (148 papers)
  8. Mark D. Plumbley (114 papers)
  9. Huy Phan (75 papers)
  10. Emmanouil Benetos (89 papers)
Citations (3)