
Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data (2409.20007v1)

Published 30 Sep 2024 in eess.AS, cs.CL, and cs.SD

Abstract: Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of LLMs by incorporating pre-trained speech models. However, these SLMs often undergo extensive speech instruction-tuning to bridge the gap between speech and text modalities. This requires significant annotation efforts and risks catastrophic forgetting of the original language capabilities. In this work, we present a simple yet effective automatic process for creating speech-text pair data that carefully injects speech paralinguistic understanding abilities into SLMs while preserving the inherent language capabilities of the text-based LLM. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data, achieving impressive performance on Dynamic-SUPERB and AIR-Bench-Chat benchmarks. Furthermore, our model exhibits the ability to follow complex instructions derived from LLMs, such as specific output formatting and chain-of-thought reasoning. Our approach not only enhances the versatility and effectiveness of SLMs but also reduces reliance on extensive annotated datasets, paving the way for more efficient and capable speech understanding systems.
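
The abstract only describes the data-creation process at a high level. As a loose illustration of what such a pipeline could look like, the sketch below builds a speech-text training pair by describing an utterance in text (transcript plus paralinguistic tags) and letting the frozen text LLM generate the target response, so no human-written speech instructions are needed. The helper names `describe_speech` and `text_llm`, the tag set, and the prompt format are illustrative assumptions, not the paper's actual code or API.

```python
from dataclasses import dataclass

@dataclass
class SpeechTextPair:
    audio_path: str
    target_response: str

def describe_speech(audio_path: str) -> dict:
    # Assumed helper: in a real pipeline, off-the-shelf ASR and
    # paralinguistic taggers would fill these fields in from the audio.
    return {
        "transcript": "I can't believe we actually won!",
        "gender": "female",
        "emotion": "excited",
    }

def text_llm(prompt: str) -> str:
    # Assumed helper: query the frozen text-based LLM backbone.
    return f"[LLM response to: {prompt}]"

def make_pair(audio_path: str) -> SpeechTextPair:
    meta = describe_speech(audio_path)
    # Serialize the transcript plus paralinguistic tags into plain text so
    # the *text* LLM can produce the training target itself; reusing the
    # backbone's own outputs as targets is one way to inject paralinguistic
    # understanding while preserving the LLM's language capabilities.
    tags = ", ".join(f"{k}: {v}" for k, v in meta.items() if k != "transcript")
    prompt = f'{meta["transcript"]} ({tags})'
    return SpeechTextPair(audio_path, text_llm(prompt))

pair = make_pair("example.wav")
print(pair.target_response)
```

Because the targets come from the backbone LLM itself, the resulting pairs stay on the LLM's own output distribution, which is consistent with the abstract's claim of avoiding catastrophic forgetting of the original language capabilities.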

Authors (8)
  1. Ke-Han Lu (16 papers)
  2. Zhehuai Chen (39 papers)
  3. Szu-Wei Fu (45 papers)
  4. Chao-Han Huck Yang (89 papers)
  5. Jagadeesh Balam (39 papers)
  6. Boris Ginsburg (111 papers)
  7. Yu-Chiang Frank Wang (88 papers)
  8. Hung-yi Lee (325 papers)
Citations (1)