Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation (2407.09886v2)
Abstract: In this work, we introduce Speech-Copilot, a modular framework for instruction-oriented speech-processing tasks that minimizes human effort in toolset construction. Unlike end-to-end methods using large audio-LLMs, Speech-Copilot builds speech processing-specific toolsets by analyzing pre-collected task instructions and breaking tasks into manageable sub-tasks. It features a flexible agent based on LLMs that performs tasks through program generation. Our approach achieves state-of-the-art performance on the Dynamic-SUPERB benchmark, demonstrating its effectiveness across diverse speech-processing tasks. Key contributions include: 1) developing an innovative framework for speech processing-specific toolset construction, 2) establishing a high-performing agent based on LLMs, and 3) offering a new perspective on addressing challenging instruction-oriented speech-processing tasks. Because it requires none of the additional training that end-to-end approaches demand, our method provides a flexible and extensible solution for a wide range of speech-processing applications.
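The workflow the abstract describes, in which an LLM agent handles an instruction by composing calls to a pre-built speech toolset as a short program, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the module names (`transcribe`, `classify_emotion`), the placeholder audio representation, and the dictionary-based dispatcher are all assumptions made for the example.

```python
# Hedged sketch of a Speech-Copilot-style pipeline: a toolset of speech
# modules plus an executor that runs the tool-call sequence an LLM agent
# might generate for an instruction. All names here are illustrative.

def transcribe(audio: dict) -> str:
    """Placeholder speech-to-text module (e.g., a Whisper-style ASR tool)."""
    return audio.get("text", "")

def classify_emotion(audio: dict) -> str:
    """Placeholder emotion-recognition module (e.g., an emotion2vec-style tool)."""
    return audio.get("emotion", "neutral")

# The toolset built offline by decomposing pre-collected task instructions
# into sub-tasks, each covered by one module.
TOOLSET = {
    "transcribe": transcribe,
    "classify_emotion": classify_emotion,
}

def run_generated_program(program: list[str], audio: dict) -> dict:
    """Execute a generated program, i.e., an ordered list of tool names,
    collecting each tool's output under its name."""
    results = {}
    for tool_name in program:
        results[tool_name] = TOOLSET[tool_name](audio)
    return results

# Example: "transcribe the clip and report the speaker's emotion" decomposes
# into a two-step program over the toolset.
audio_clip = {"text": "hello world", "emotion": "happy"}
print(run_generated_program(["transcribe", "classify_emotion"], audio_clip))
```

In the framework itself, the program would be Python code emitted by the LLM rather than a fixed list of tool names, but the division of labor is the same: modules handle perception, and the agent handles composition.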
- “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
- “Emergent abilities of large language models,” TMLR, 2022.
- “Large language models are zero-shot reasoners,” Advances in Neural Information Processing Systems, vol. 35, pp. 22199–22213, 2022.
- “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” in International Conference on Machine Learning. PMLR, 2022, pp. 9118–9147.
- “Understanding the planning of llm agents: A survey,” 2024.
- “On the planning abilities of large language models-a critical investigation,” Advances in Neural Information Processing Systems, vol. 36, pp. 75993–76005, 2023.
- “Large language models can self-improve,” arXiv preprint arXiv:2210.11610, 2022.
- Liangming Pan et al., “Automatically correcting large language models: Surveying the landscape of diverse automated correction strategies,” Transactions of the Association for Computational Linguistics, vol. 12, pp. 484–506, 2024.
- “SelfCheck: Using LLMs to zero-shot check their own step-by-step reasoning,” arXiv preprint arXiv:2308.00436, 2023.
- “AI-augmented predictions: LLM assistants improve human forecasting accuracy,” arXiv preprint arXiv:2402.07862, 2024.
- “Can large language model agents simulate human trust behaviors?,” 2024.
- “Prioritizing safeguarding over autonomy: Risks of LLM agents for science,” 2024.
- Timo Schick et al., “Toolformer: Language models can teach themselves to use tools,” arXiv preprint arXiv:2302.04761, 2023.
- “ViperGPT: Visual inference via Python execution for reasoning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11888–11898.
- “AudioGPT: Understanding and generating speech, music, sound, and talking head,” 2023.
- “AnyTool: Self-reflective, hierarchical agents for large-scale API calls,” 2024.
- “Large language models as tool makers,” arXiv preprint arXiv:2305.17126, 2023.
- “CRAFT: Customizing LLMs by creating and retrieving from specialized toolsets,” 2024.
- “CREATOR: Tool creation for disentangling abstract and concrete reasoning of large language models,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 6922–6939.
- “Dynamic-SUPERB: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12136–12140.
- Shijue Huang et al., “Planning, creation, usage: Benchmarking LLMs for comprehensive tool utilization in real-world complex scenarios,” arXiv preprint arXiv:2401.17167, 2024.
- “StableToolBench: Towards stable large-scale benchmarking on tool learning of large language models,” arXiv preprint arXiv:2403.07714, 2024.
- “API-Bank: A comprehensive benchmark for tool-augmented LLMs,” in The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- “EasyTool: Enhancing LLM-based agents with concise tool instruction,” arXiv preprint arXiv:2401.06201, 2024.
- “RoTBench: A multi-level benchmark for evaluating the robustness of large language models in tool learning,” arXiv preprint arXiv:2401.08326, 2024.
- “Learning to use tools via cooperative and interactive agents,” arXiv preprint arXiv:2403.03031, 2024.
- “Gorilla: Large language model connected with massive APIs,” arXiv preprint arXiv:2305.15334, 2023.
- “ToolLLM: Facilitating large language models to master 16000+ real-world APIs,” arXiv preprint arXiv:2307.16789, 2023.
- “Toolformer: Language models can teach themselves to use tools,” Advances in Neural Information Processing Systems, vol. 36, 2024.
- “Can large language models be an alternative to human evaluations?,” arXiv preprint arXiv:2305.01937, 2023.
- “A closer look into automatic evaluation using large language models,” arXiv preprint arXiv:2310.05657, 2023.
- “Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models,” arXiv preprint arXiv:2311.07919, 2023.
- “SALMONN: Towards generic hearing abilities for large language models,” arXiv preprint arXiv:2310.13289, 2023.
- “Joint audio and speech understanding,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023.
- Shujie Hu et al., “WavLLM: Towards robust and adaptive speech large language model,” arXiv preprint arXiv:2404.00656, 2024.
- “DeSTA: Enhancing speech language models through descriptive speech-text alignment,” arXiv preprint arXiv:2406.18871, 2024.
- Zhifeng Kong et al., “Audio Flamingo: A novel audio language model with few-shot learning and dialogue abilities,” arXiv preprint arXiv:2402.01831, 2024.
- “Listen, think, and understand,” arXiv preprint arXiv:2305.10790, 2023.
- OpenAI, “ChatGPT: Optimizing language models for dialogue,” 2022, accessed on October 10, 2023.
- “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
- “Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization,” in Proc. INTERSPEECH 2023, 2023, pp. 396–400.
- “Investigating zero-shot generalizability on mandarin-english code-switched asr and speech-to-text translation of recent foundation models with self-supervision and weak supervision,” 2023.
- “LyricWhiz: Robust multilingual zero-shot lyrics transcription by whispering to ChatGPT,” arXiv preprint arXiv:2306.17103, 2023.
- “Do prompts really prompt? Exploring the prompt understanding capability of Whisper,” 2024.
- OpenAI, “GPT-4 technical report,” 2023.
- “emotion2vec: Self-supervised pre-training for speech emotion representation,” arXiv preprint arXiv:2312.15185, 2023.
- “Brouhaha: Multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–7.
- “CommonAccent: Exploring large acoustic pretrained models for accent classification based on Common Voice,” Interspeech 2023, 2023.
- Christopher John Bayron, “Autochord: Automatic chord recognition library and chord visualization app.”
- “TitaNet: Neural model for speaker representation with 1D depth-wise separable convolutions and global context,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 8102–8106.
- “NeMo: A toolkit for building AI applications using neural modules,” arXiv preprint arXiv:1909.09577, 2019.
- “Powerset multi-class cross entropy loss for neural speaker diarization,” in Proc. INTERSPEECH 2023, 2023.
- “PhonologyBench: Evaluating phonological skills of large language models,” arXiv preprint arXiv:2404.02456, 2024.
- “Understanding sounds, missing the questions: The challenge of object hallucination in large audio-language models,” arXiv preprint arXiv:2406.08402, 2024.
- Long Ouyang et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
- “The zero resource speech challenge 2021: Spoken language modelling,” arXiv preprint arXiv:2104.14700, 2021.
- “Zero resource code-switched speech benchmark using speech utterance pairs for multiple spoken languages,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 10006–10010.