TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling (2504.07053v2)

Published 9 Apr 2025 in cs.CL, cs.SD, and eess.AS

Abstract: Recent efforts target spoken language models (SLMs) that not only listen but also speak, for more natural human-LLM interaction. Joint speech-text modeling is a promising direction to achieve this. However, the effectiveness of recent speech tokens for joint modeling remains underexplored. To address this, we introduce Text-Aligned Speech Tokenization and Embedding (TASTE), a method that directly addresses the modality gap by aligning speech tokens with the corresponding text transcription during the tokenization stage. We propose a method that achieves this through an attention-based aggregation mechanism, with speech reconstruction as the training objective. We conduct extensive experiments and show that TASTE preserves essential paralinguistic information while dramatically reducing the token sequence length. With TASTE, we perform straightforward joint spoken language modeling by applying Low-Rank Adaptation to the pre-trained text LLM. Experimental results show that TASTE-based SLMs perform comparably to previous work on SALMon and StoryCloze, while significantly outperforming other pre-trained SLMs on speech continuation in both subjective and objective evaluations. To our knowledge, TASTE is the first end-to-end approach that uses a reconstruction objective to automatically learn a text-aligned speech tokenization and embedding suitable for spoken language modeling. Our demo, code, and model are available at https://mtkresearch.github.io/TASTE-SpokenLM.github.io.
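The abstract's core mechanism, attention-based aggregation that aligns speech tokens with the transcription, can be pictured with a short sketch. The following is a minimal illustration, not the authors' released code: a cross-attention module in which one query per text token pools a much longer sequence of speech-encoder frames, so the resulting speech embeddings are aligned one-to-one with the text tokens. All module names, dimensions, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the authors' implementation) of
# text-aligned aggregation: one cross-attention query per text token
# pools a longer sequence of speech frames, yielding a speech-token
# sequence whose length matches the transcription.
from typing import Optional

import torch
import torch.nn as nn

class TextAlignedAggregator(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(
        self,
        text_queries: torch.Tensor,   # (B, T_text, d): one query per text token
        speech_frames: torch.Tensor,  # (B, T_speech, d), with T_speech >> T_text
        frame_padding_mask: Optional[torch.Tensor] = None,  # (B, T_speech), True = pad
    ) -> torch.Tensor:
        # Each text-token query attends over (and pools) the acoustic frames,
        # producing one text-aligned speech embedding per text token.
        pooled, _ = self.cross_attn(
            query=text_queries,
            key=speech_frames,
            value=speech_frames,
            key_padding_mask=frame_padding_mask,
        )
        return self.norm(pooled)

# Toy usage: 6 text tokens summarize 200 speech frames.
aggregator = TextAlignedAggregator()
taste_tokens = aggregator(torch.randn(1, 6, 512), torch.randn(1, 200, 512))
print(taste_tokens.shape)  # torch.Size([1, 6, 512])
```

Because the speech embeddings share the text sequence length, the joint-modeling step reduces to adapting a pre-trained text LLM. A hedged sketch of that step using the Hugging Face peft library follows; the base model name and LoRA hyperparameters are assumptions, not the paper's configuration.

```python
# Hedged sketch of Low-Rank Adaptation with peft; the model name and
# hyperparameters are illustrative, not taken from the paper.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
slm = get_peft_model(base, config)
slm.print_trainable_parameters()  # only the low-rank adapters are trained
```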

Authors (5)
  1. Liang-Hsuan Tseng (9 papers)
  2. Yi-Chang Chen (14 papers)
  3. Kuan-Yi Lee (3 papers)
  4. Da-shan Shiu (27 papers)
  5. Hung-yi Lee (327 papers)
