Fully exploiting LLM capabilities for ASR beyond the current prompt-concatenation paradigm
Determine effective methodologies to fully exploit large language model capabilities for automatic speech recognition beyond the prevailing paradigm that prepends speech encoder outputs as soft prompts to decoder-only large language models and trains solely on transcription text.
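To make the prevailing paradigm concrete, here is a minimal NumPy sketch of the prompt-concatenation setup described above: speech encoder outputs are projected into the LLM's embedding space and prepended as soft prompts to the text-token embeddings. All dimensions and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative dimensions (assumptions, not from the paper)
T_speech, d_enc = 50, 256   # speech encoder output: 50 frames, 256-dim features
N_text, d_llm = 10, 512     # 10 text tokens, LLM embedding size 512

rng = np.random.default_rng(0)
speech_feats = rng.normal(size=(T_speech, d_enc))  # frozen speech encoder outputs
proj = rng.normal(size=(d_enc, d_llm))             # learned projection (adapter)
text_embeds = rng.normal(size=(N_text, d_llm))     # LLM embeddings of the transcript tokens

# Soft-prompt concatenation: projected speech frames are prepended
# to the text embeddings before they enter the decoder-only LLM.
soft_prompt = speech_feats @ proj                          # (T_speech, d_llm)
llm_input = np.concatenate([soft_prompt, text_embeds], axis=0)

print(llm_input.shape)  # (60, 512)
```

In this paradigm the training loss is computed only on the transcription-text positions; the open question is how to exploit the LLM beyond this setup.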
References
While many studies focus on optimizing speech encoder training to improve LLM-based ASR performance, how to fully exploit LLM capabilities beyond the current paradigm remains an open challenge.
— Speech LLMs are Contextual Reasoning Transcribers
(2604.00610 - Deng et al., 1 Apr 2026) in Introduction (Section 1)