
Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models (2505.17446v2)

Published 23 May 2025 in cs.CL, cs.SD, and eess.AS

Abstract: The purpose of speech tokenization is to transform a speech signal into a sequence of discrete representations, serving as the foundation for speech language models (SLMs). While there are many options for speech tokenization, their effect on the performance of SLMs remains unclear. This paper investigates two key aspects of speech tokenization: the segmentation width and the cluster size of discrete units. First, we segment speech signals into fixed or variable widths and pool the representations. We then train K-means models with multiple cluster sizes. Through evaluation on zero-shot spoken language understanding benchmarks, we find a positive effect of moderately coarse segmentation and a larger cluster size. Notably, among the best-performing models, the most efficient one achieves a 50% reduction in training data and a 70% decrease in training runtime. Our analysis highlights the importance of combining multiple tokens to enhance fine-grained spoken language understanding.
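
To make the pipeline concrete, the sketch below illustrates the general idea of segment-pooled K-means speech tokenization: frame-level features are mean-pooled over fixed-width segments, then quantized into discrete unit IDs. This is a minimal illustration, not the authors' implementation; the random features, segment width, and cluster size are placeholders, and the paper additionally studies variable-width segmentation and multiple vocabulary sizes.

```python
# Minimal sketch of fixed-width segment pooling + K-means tokenization.
# Feature source, segment width, and cluster size are illustrative only.
import numpy as np
from sklearn.cluster import KMeans


def pool_segments(features: np.ndarray, width: int) -> np.ndarray:
    """Mean-pool frame-level features of shape (T, D) into segments of `width` frames."""
    n_segments = len(features) // width
    trimmed = features[: n_segments * width]
    return trimmed.reshape(n_segments, width, -1).mean(axis=1)


def train_tokenizer(pooled: np.ndarray, n_clusters: int) -> KMeans:
    """Fit a K-means model; its cluster indices serve as the discrete token vocabulary."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(pooled)


def tokenize(km: KMeans, features: np.ndarray, width: int) -> np.ndarray:
    """Map an utterance's frame features to a sequence of discrete token IDs."""
    return km.predict(pool_segments(features, width))


if __name__ == "__main__":
    # Stand-in for self-supervised speech features (e.g. 20 ms frames, 768-dim).
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(1000, 768)).astype(np.float32)
    km = train_tokenizer(pool_segments(feats, width=4), n_clusters=512)
    print(tokenize(km, feats, width=4)[:10])
```

Coarser segmentation (larger `width`) shortens the token sequence, which is what drives the reported reductions in training data and runtime, while a larger cluster count preserves more acoustic detail per token.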

Authors (3)
  1. Shunsuke Kando (4 papers)
  2. Yusuke Miyao (35 papers)
  3. Shinnosuke Takamichi (70 papers)