Text-Speech Language Models with Improved Cross-Modal Transfer by Aligning Abstraction Levels (2503.06211v1)
Abstract: Text-Speech Language Models (TSLMs) -- language models trained to jointly process and generate text and speech -- aim to enable cross-modal knowledge transfer to overcome the scaling limitations of unimodal speech LMs. The predominant approach to TSLM training expands the vocabulary of a pre-trained text LM by appending new embeddings and linear projections for speech, followed by fine-tuning on speech data. We hypothesize that this method limits cross-modal transfer by neglecting feature compositionality, preventing text-learned functions from being fully leveraged at appropriate abstraction levels. To address this, we propose augmenting vocabulary expansion with modules that better align abstraction levels across layers. Our models, SmolTolk, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute. Representation analyses and improved multimodal performance suggest our method enhances cross-modal transfer.
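The abstract refers to the predominant TSLM recipe: expanding a pre-trained text LM's vocabulary with new embeddings and output projections for speech tokens before fine-tuning on speech data. Below is a minimal PyTorch sketch of that expansion step only, under illustrative assumptions (function name, vocabulary sizes, and initialization are not from the paper); the abstraction-alignment modules the paper proposes are not detailed in the abstract and are not shown here.

```python
import torch
import torch.nn as nn

def expand_vocab_for_speech(embed: nn.Embedding, lm_head: nn.Linear, num_speech_tokens: int):
    """Append speech-token rows to a text LM's input embedding and output projection.

    This is a generic sketch of vocabulary expansion, not the paper's full method.
    """
    d_model = embed.embedding_dim
    old_vocab = embed.num_embeddings
    new_vocab = old_vocab + num_speech_tokens

    # New input embedding: copy the text rows, randomly initialize the speech rows.
    new_embed = nn.Embedding(new_vocab, d_model)
    with torch.no_grad():
        new_embed.weight[:old_vocab] = embed.weight
        nn.init.normal_(new_embed.weight[old_vocab:], std=0.02)

    # New output projection (LM head), extended the same way.
    new_head = nn.Linear(d_model, new_vocab, bias=False)
    with torch.no_grad():
        new_head.weight[:old_vocab] = lm_head.weight
        nn.init.normal_(new_head.weight[old_vocab:], std=0.02)

    return new_embed, new_head

# Usage with illustrative sizes: a text LM with a 32k vocabulary gains 1024 speech-token slots.
text_embed = nn.Embedding(32_000, 768)
text_head = nn.Linear(768, 32_000, bias=False)
speech_embed, speech_head = expand_vocab_for_speech(text_embed, text_head, num_speech_tokens=1024)
```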