NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction (2506.00975v4)

Published 1 Jun 2025 in cs.CL, cs.AI, cs.SD, and eess.AS

Abstract: Inspired by the impressive capabilities of GPT-4o, there is growing interest in enabling speech LLMs (SLMs) to engage in natural, fluid spoken interactions with humans. Recent advancements have led to the development of several SLMs that demonstrate promising results in this area. However, current approaches have yet to fully exploit dual-channel speech data, which inherently captures the structure and dynamics of human conversation. In this work, we systematically explore the use of dual-channel speech data in the context of modern LLMs, and introduce a novel generative modeling paradigm, Next-Token-Pair Prediction (NTPP), to enable speaker-independent dual-channel spoken dialogue learning using decoder-only architectures for the first time. We evaluate our approach on standard benchmarks, and empirical results show that our proposed method, NTPP, significantly improves the conversational abilities of SLMs in terms of turn-taking prediction, response coherence, and naturalness. Moreover, compared to existing methods, NTPP achieves substantially lower inference latency, highlighting its practical efficiency for real-time applications.
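To make the core idea concrete, here is a minimal, illustrative sketch of how Next-Token-Pair Prediction training targets could be built from dual-channel data: at each time step the model jointly predicts a *pair* of tokens, one per speaker channel, given all previous pairs. The function and variable names below are assumptions for illustration, not from the paper's code.

```python
# Hedged sketch: constructing NTPP-style training examples.
# Two synchronized token streams (one per speaker channel) are zipped
# into time-step pairs; the model's target at step t is the whole pair,
# so turn-taking (e.g. one channel emitting silence tokens) is learned
# jointly rather than per speaker. All names here are illustrative.

def make_ntpp_examples(channel_a, channel_b):
    """Build (context_pairs, target_pair) examples from two
    equal-length dual-channel token streams."""
    assert len(channel_a) == len(channel_b)
    pairs = list(zip(channel_a, channel_b))  # one token pair per time step
    examples = []
    for t in range(1, len(pairs)):
        context = pairs[:t]   # all pairs observed so far
        target = pairs[t]     # next token pair, predicted jointly
        examples.append((context, target))
    return examples

# Toy data: speaker A talks while speaker B starts overlapping
# (0 stands in for a silence token).
a = [5, 6, 7, 0]
b = [0, 0, 9, 9]
examples = make_ntpp_examples(a, b)
print(examples[0])  # ([(5, 0)], (6, 0))
```

In a decoder-only architecture, each context of pairs would be embedded and fed autoregressively, with the loss applied to both channels' tokens at every step; this sketch only shows the target construction, not the model itself.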

Authors (9)
  1. Qichao Wang (11 papers)
  2. Ziqiao Meng (12 papers)
  3. Wenqian Cui (7 papers)
  4. Yifei Zhang (167 papers)
  5. Pengcheng Wu (25 papers)
  6. Bingzhe Wu (58 papers)
  7. Irwin King (170 papers)
  8. Liang Chen (360 papers)
  9. Peilin Zhao (127 papers)