Fully exploiting LLM capabilities for ASR beyond the current prompt-concatenation paradigm

Determine effective methodologies that fully exploit large language model capabilities for automatic speech recognition, going beyond the prevailing paradigm in which speech encoder outputs are prepended as soft prompts to a decoder-only LLM and the system is trained solely on transcription text.

Background

Most current LLM-based ASR systems feed continuous speech encoder outputs as a prefix prompt to a decoder-only LLM and train the system only on transcription text. Because ASR is largely a content-preserving mapping from speech to text, this setup underutilizes the LLM’s generative and reasoning abilities compared to tasks involving semantic transformation.
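To make this setup concrete, below is a minimal PyTorch sketch of the prompt-concatenation recipe. The GRU encoder, two-layer Transformer stack, projection layer, and all dimensions are illustrative stand-ins, not any cited system's implementation; real systems typically plug in a pretrained speech encoder and a pretrained decoder-only LLM.

```python
import torch
import torch.nn as nn

class PrefixPromptASR(nn.Module):
    """Prompt-concatenation ASR: speech features become a soft prefix
    for a decoder-only LM; loss is computed on transcript tokens only."""

    def __init__(self, speech_dim=512, lm_dim=768, vocab_size=32000):
        super().__init__()
        # Stand-ins for a pretrained speech encoder and decoder-only LLM.
        self.speech_encoder = nn.GRU(speech_dim, speech_dim, batch_first=True)
        self.proj = nn.Linear(speech_dim, lm_dim)      # speech -> LM embedding space
        self.embed = nn.Embedding(vocab_size, lm_dim)  # LM token embeddings
        self.lm = nn.TransformerEncoder(               # causal decoder-stack stand-in
            nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, speech, transcript_ids):
        # 1) Encode speech and project into the LM's embedding space.
        enc, _ = self.speech_encoder(speech)           # (B, T_s, D_speech)
        prefix = self.proj(enc)                        # (B, T_s, D_lm)

        # 2) Prepend the soft prompt to the transcript token embeddings.
        text = self.embed(transcript_ids)              # (B, T_t, D_lm)
        inputs = torch.cat([prefix, text], dim=1)      # (B, T_s + T_t, D_lm)

        # 3) Causal (autoregressive) attention mask over the full sequence.
        L = inputs.size(1)
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        hidden = self.lm(inputs, mask=mask)
        logits = self.lm_head(hidden)

        # 4) Next-token loss on transcript positions only: the speech
        #    prefix is never a prediction target, so the training signal
        #    is text-only, exactly as the background paragraph describes.
        T_s = prefix.size(1)
        pred = logits[:, T_s - 1 : -1, :]              # positions predicting transcript tokens
        loss = nn.functional.cross_entropy(
            pred.reshape(-1, pred.size(-1)), transcript_ids.reshape(-1)
        )
        return loss

# Usage: 2 utterances, 50 speech frames, 10 transcript tokens.
model = PrefixPromptASR()
loss = model(torch.randn(2, 50, 512), torch.randint(0, 32000, (2, 10)))
loss.backward()
```

Because the loss is computed only at transcript positions, the speech prefix conditions the LM's predictions but is never itself a training target, which is the narrow supervision signal that the open question above seeks to move past.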

Although many studies optimize the speech encoder within this paradigm, the question of how to go beyond this setup to better leverage LLM knowledge and reasoning in ASR remains unresolved, motivating the need for new architectures or training strategies.

References

While many studies focus on optimizing speech encoder training to improve LLM-based ASR performance, how to fully exploit LLM capabilities beyond the current paradigm remains an open challenge.

Speech LLMs are Contextual Reasoning Transcribers (2604.00610 - Deng et al., 1 Apr 2026), in Introduction (Section 1)