
Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective (2412.17048v1)

Published 22 Dec 2024 in eess.AS, cs.CL, and cs.SD

Abstract: Although text-based LLMs exhibit human-level writing ability and remarkable intelligence, speech LLMs (SLMs) still struggle to generate semantically coherent outputs. There are several potential reasons for this performance degradation: (A) speech tokens mainly provide phonetic rather than semantic information, (B) speech sequences are much longer than text sequences, and (C) paralinguistic information, such as prosody, introduces additional complexity and variability. In this paper, we explore the influence of these three factors separately by transitioning the modality from text to speech in an evolving manner. Our findings reveal that the three factors differ in impact: factor A has a relatively minor effect, factor B more noticeably affects syntactic and semantic modeling, and factor C exerts the most significant impact, particularly on basic lexical modeling. Based on these findings, we provide insights into the unique challenges of training SLMs and highlight pathways toward more effective end-to-end SLMs.

Authors (7)
  1. Hankun Wang (12 papers)
  2. Haoran Wang (142 papers)
  3. Yiwei Guo (30 papers)
  4. Zhihan Li (18 papers)
  5. Chenpeng Du (28 papers)
  6. Xie Chen (166 papers)
  7. Kai Yu (202 papers)