Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM (2409.17353v3)

Published 25 Sep 2024 in cs.CL

Abstract: Current speech-based LLMs are predominantly trained on extensive ASR and TTS datasets, excelling in tasks related to these domains. However, their ability to handle direct speech-to-speech conversations remains notably constrained. These models often rely on an ASR-to-TTS chain-of-thought pipeline, converting speech into text for processing before generating audio responses, which introduces latency and loses audio features. We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities. Our approach reduces latency and improves the model's native understanding of speech, paving the way for more efficient and natural real-time audio interactions. We also release a large-scale synthetic conversational dataset to facilitate further research.
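
The abstract does not spell out how the ASR chain of thought is internalized, but one common reading of "implicit chain of thought" is a curriculum that gradually removes the intermediate transcript tokens from the training target, so the model learns to respond directly from speech. The sketch below is a minimal, hypothetical illustration of that schedule; the function names, special tokens, and the linear removal schedule are assumptions for illustration, not the paper's released code.

```python
# Hypothetical sketch: internalizing an ASR "chain of thought" by gradually
# removing transcript tokens from the supervision target across epochs.
# Token names (<asr>, <resp>) and the linear schedule are illustrative
# assumptions, not the method as released by the authors.

def build_target(transcript_tokens, response_tokens, epoch, total_epochs):
    """Keep a shrinking prefix of the ASR transcript; by the final epoch the
    transcript is fully removed and only the response is supervised."""
    removal_fraction = min(1.0, epoch / max(1, total_epochs - 1))
    keep = int(len(transcript_tokens) * (1.0 - removal_fraction))
    return transcript_tokens[:keep] + response_tokens


if __name__ == "__main__":
    transcript = ["<asr>", "what", "is", "the", "weather", "</asr>"]
    response = ["<resp>", "it", "is", "sunny", "today", "</resp>"]
    for epoch in range(4):
        target = build_target(transcript, response, epoch, total_epochs=4)
        print(f"epoch {epoch}: {' '.join(target)}")
```

Under this kind of schedule the early epochs still expose the model to the explicit ASR transcript, while later epochs force it to map speech input straight to the spoken response, which is consistent with the latency reduction the abstract describes.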

Authors (3)
  1. Robin Shing-Hei Yuen (1 paper)
  2. Timothy Tin-Long Tse (1 paper)
  3. Jian Zhu (59 papers)