WavCraft: Audio Editing and Generation with Large Language Models (2403.09527v3)
Abstract: We introduce WavCraft, a collective system that leverages LLMs to connect diverse task-specific models for audio content creation and editing. Specifically, WavCraft describes the content of raw audio materials in natural language and prompts the LLM conditioned on audio descriptions and user requests. WavCraft leverages the in-context learning ability of the LLM to decompose users' instructions into several tasks and tackle each task collaboratively with the corresponding module. Through task decomposition along with a set of task-specific models, WavCraft follows the input instruction to create or edit audio content with more details and rationales, facilitating user control. In addition, WavCraft is able to cooperate with users via dialogue interaction and even produce audio content without explicit user commands. Experiments demonstrate that WavCraft yields better performance than existing methods, especially when adjusting the local regions of audio clips. Moreover, WavCraft can follow complex instructions to edit and create audio content on top of input recordings, facilitating audio producers in a broader range of applications. Our implementation and demos are available at https://github.com/JinhuaLiang/WavCraft.
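The flow described in the abstract (describe the audio in text, prompt the LLM with the descriptions and the user request, decompose the request into tasks, and hand each task to a module) can be illustrated with a minimal sketch. This is not the WavCraft implementation: the `Task` type, the `stub_llm` function, the plan format of one `module: argument` pair per line, and the three module names are all hypothetical stand-ins chosen for illustration, and the "modules" only return strings instead of processing audio.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    module: str    # hypothetical module name, e.g. "generate" or "mix"
    argument: str  # free-text argument passed to that module


def build_prompt(audio_descriptions: List[str], request: str) -> str:
    """Condition the prompt on text descriptions of the audio plus the request."""
    described = "\n".join(
        f"- clip {i}: {d}" for i, d in enumerate(audio_descriptions)
    )
    return f"Audio materials:\n{described}\nUser request: {request}\nPlan:"


def stub_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption): returns a fixed plan."""
    return "separate: dog barking\ngenerate: light rain\nmix: rain under speech"


def parse_plan(plan: str) -> List[Task]:
    """Split the plan text into tasks, one 'module: argument' pair per line."""
    tasks = []
    for line in plan.splitlines():
        if ":" not in line:
            continue  # skip lines that do not follow the assumed format
        module, argument = line.split(":", 1)
        tasks.append(Task(module.strip(), argument.strip()))
    return tasks


# Hypothetical task-specific modules; each just reports what it would do.
MODULES: Dict[str, Callable[[str], str]] = {
    "separate": lambda arg: f"[separated: {arg}]",
    "generate": lambda arg: f"[generated: {arg}]",
    "mix": lambda arg: f"[mixed: {arg}]",
}


def run(audio_descriptions: List[str], request: str,
        llm: Callable[[str], str] = stub_llm) -> List[str]:
    """Prompt the LLM, decompose its plan into tasks, dispatch each task."""
    prompt = build_prompt(audio_descriptions, request)
    tasks = parse_plan(llm(prompt))
    return [MODULES[t.module](t.argument) for t in tasks]
```

Calling `run(["a man speaking while a dog barks"], "remove the dog and add rain")` walks the stubbed plan through the three placeholder modules in order. In a real system the `llm` argument would wrap an actual model call and each module would operate on waveforms.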
Authors:
- Jinhua Liang
- Huan Zhang
- Haohe Liu
- Yin Cao
- Qiuqiang Kong
- Xubo Liu
- Wenwu Wang
- Mark D. Plumbley
- Huy Phan
- Emmanouil Benetos