
Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation (2407.09886v2)

Published 13 Jul 2024 in eess.AS, cs.CL, and cs.SD

Abstract: In this work, we introduce Speech-Copilot, a modular framework for instruction-oriented speech-processing tasks that minimizes human effort in toolset construction. Unlike end-to-end methods built on large audio-LLMs, Speech-Copilot constructs speech-processing-specific toolsets by analyzing pre-collected task instructions and breaking tasks into manageable sub-tasks. It features a flexible LLM-based agent that performs tasks through program generation. Our approach achieves state-of-the-art performance on the Dynamic-SUPERB benchmark, demonstrating its effectiveness across diverse speech-processing tasks. Key contributions include: 1) an innovative framework for constructing speech-processing-specific toolsets, 2) a high-performing LLM-based agent, and 3) a new perspective on addressing challenging instruction-oriented speech-processing tasks. Without the additional training required by end-to-end approaches, our method provides a flexible and extendable solution for a wide range of speech-processing applications.
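The two-stage design described in the abstract — an offline toolset built from decomposed sub-tasks, then an LLM agent that answers an instruction by generating and executing a short program over those tools — can be sketched as follows. All names here (the stub modules, `generate_program`, `run_agent`) are illustrative assumptions, not the authors' actual API; in the real system an LLM writes the program from the instruction and the toolset documentation, whereas this sketch hard-codes one "generated" program.

```python
from typing import Callable, Dict

# Stage 1 (offline): a toy toolset of speech modules. Real modules would wrap
# models such as an ASR system or an emotion classifier; these are stubs.
def transcribe(audio: str) -> str:
    return f"<transcript of {audio}>"

def classify_emotion(audio: str) -> str:
    return "neutral"

TOOLSET: Dict[str, Callable[[str], str]] = {
    "transcribe": transcribe,
    "classify_emotion": classify_emotion,
}

# Stage 2 (online): the agent. In the paper an LLM generates this program
# from the task instruction; here it is hard-coded for illustration.
def generate_program(instruction: str) -> str:
    return (
        "text = transcribe(audio)\n"
        "emotion = classify_emotion(audio)\n"
        "result = f'{text} (speaker sounds {emotion})'\n"
    )

def run_agent(instruction: str, audio: str) -> str:
    program = generate_program(instruction)
    scope = {"audio": audio, **TOOLSET}
    exec(program, scope)  # run the generated program against the toolset
    return scope["result"]

print(run_agent("Transcribe the clip and note the emotion.", "clip.wav"))
# → <transcript of clip.wav> (speaker sounds neutral)
```

Because the agent composes tools in generated code rather than calling one end-to-end model, extending coverage to a new task only requires adding a module to the toolset, which is the flexibility the abstract claims.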
