Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach (2410.00025v2)
Abstract: Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible. Generating speech through a pipeline that operates at the text level typically loses nuances, intonations, and non-verbal vocalizations. Modeling directly from speech opens up the path to more natural and expressive systems. On the other hand, speech-only systems require up to three orders of magnitude more data to catch up to their text-based counterparts in terms of their semantic abilities. We show that fine-tuning speech representation models on phoneme classification leads to more context-invariant representations, and language models trained on these units achieve lexical comprehension comparable to that of models trained on a hundred times more data.