Spoken Language Modeling with Duration-Penalized Self-Supervised Units (2505.23494v1)

Published 29 May 2025 in cs.CL and eess.AS

Abstract: Spoken language models (SLMs) operate on acoustic units obtained by discretizing self-supervised speech representations. Although the characteristics of these units directly affect performance, the interaction between codebook size and unit coarseness (i.e., duration) remains unexplored. We investigate SLM performance as we vary codebook size and unit coarseness using the simple duration-penalized dynamic programming (DPDP) method. New analyses are performed across different linguistic levels. At the phone and word levels, coarseness provides little benefit, as long as the codebook size is chosen appropriately. However, when producing whole sentences in a resynthesis task, SLMs perform better with coarser units. In lexical and syntactic language modeling tasks, coarser units also give higher accuracies at lower bitrates. We therefore show that coarser units aren't always better, but that DPDP is a simple and efficient way to obtain coarser units for the tasks where they are beneficial.
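The abstract names DPDP but does not spell it out, so the following is only a minimal sketch of one way a duration-penalized dynamic program over a fixed codebook could look. It assumes squared Euclidean distance between frames and codes, and a constant penalty `lam` added per segment; the paper's exact penalty and cost may differ. The function name `dpdp_segment` and its parameters are hypothetical and chosen for illustration.

```python
import numpy as np


def dpdp_segment(features, codebook, lam):
    """Split `features` (T x D) into contiguous units, one code per unit.

    Assumed cost (illustrative, not taken from the paper): the sum of
    squared distances between each frame and its unit's code, plus `lam`
    for every unit. A larger `lam` therefore yields fewer, longer units.
    Returns a list of (code_index, duration_in_frames) pairs.
    """
    T = features.shape[0]
    K = codebook.shape[0]
    # Squared distance from every frame to every code: shape (T, K).
    dist = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # Prefix sums, so the cost of frames s..t-1 under each code is cum[t] - cum[s].
    cum = np.vstack([np.zeros((1, K)), np.cumsum(dist, axis=0)])

    best = np.full(T + 1, np.inf)  # best[t]: minimum cost of the first t frames
    best[0] = 0.0
    back = [None] * (T + 1)        # back[t]: (segment start, code) of the last unit
    for t in range(1, T + 1):
        for s in range(t):
            seg_cost = cum[t] - cum[s]       # cost of frames s..t-1 for each code
            k = int(np.argmin(seg_cost))     # cheapest code for this candidate unit
            total = best[s] + seg_cost[k] + lam
            if total < best[t]:
                best[t] = total
                back[t] = (s, k)

    # Walk the back-pointers from the end to recover the units in order.
    units = []
    t = T
    while t > 0:
        s, k = back[t]
        units.append((k, t - s))
        t = s
    return units[::-1]
```

In this sketch, raising `lam` trades reconstruction accuracy for coarser units, which is the coarseness knob the abstract refers to; codebook size is controlled separately through the number of rows in `codebook`. The double loop is quadratic in the number of frames, so a practical version would likely cap the maximum unit length.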

Authors

Nicol Visser
Herman Kamper
