
DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment (2406.18871v1)

Published 27 Jun 2024 in eess.AS and cs.CL

Abstract: Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities of large language models (LLMs). In this paper, we propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities, enabling SLMs to interpret and generate comprehensive natural language descriptions, thereby facilitating the capability to understand both linguistic and non-linguistic features in speech. Enhanced with the proposed approach, our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks. Moreover, we discover that the aligned model exhibits a zero-shot instruction-following capability without explicit speech instruction tuning. These findings highlight the potential to reshape instruction-following SLMs by incorporating rich, descriptive speech captions.

Authors (7)
  1. Ke-Han Lu (16 papers)
  2. Zhehuai Chen (39 papers)
  3. Szu-Wei Fu (46 papers)
  4. He Huang (97 papers)
  5. Boris Ginsburg (111 papers)
  6. Yu-Chiang Frank Wang (88 papers)
  7. Hung-yi Lee (327 papers)
Citations (5)