
In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties (2505.14887v1)

Published 20 May 2025 in cs.CL, eess.AS, and cs.SD

Abstract: Human listeners readily adjust to unfamiliar speakers and language varieties through exposure, but do these adaptation benefits extend to state-of-the-art spoken LLMs? We introduce a scalable framework that allows for in-context learning (ICL) in Phi-4 Multimodal using interleaved task prompts and audio-text pairs, and find that as few as 12 example utterances (~50 seconds) at inference time reduce word error rates by a relative 19.7% (1.2 pp.) on average across diverse English corpora. These improvements are most pronounced in low-resource varieties, when the context and target speaker match, and when more examples are provided--though scaling our procedure yields diminishing marginal returns to context length. Overall, we find that our novel ICL adaptation scheme (1) reveals a similar performance profile to human listeners, and (2) demonstrates consistent improvements to automatic speech recognition (ASR) robustness across diverse speakers and language backgrounds. While adaptation succeeds broadly, significant gaps remain for certain varieties, revealing where current models still fall short of human flexibility. We release our prompts and code on GitHub.

Summary


The paper "In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties" investigates whether in-context learning (ICL) can improve speech recognition in state-of-the-art spoken language models (SLMs). Using the Phi-4 Multimodal (Phi-4-MM) model, the authors demonstrate that adaptation benefits resembling those observed in human listeners enhance automatic speech recognition (ASR) robustness across diverse speakers and language backgrounds, although gaps persist for certain varieties.

Framework and Methodology

The proposed framework enables the Phi-4-MM model to adapt rapidly to new speakers and language varieties through ICL by interleaving audio-text pairs into the prompt at inference time. Just 12 example utterances (approximately 50 seconds of audio) reduce word error rates (WERs) by a relative 19.7% (1.2 percentage points) on average across diverse English corpora. This approach avoids traditional adaptation methods such as continued pre-training or supervised fine-tuning, which demand computationally intensive training on large quantities of data, offering a scalable, lower-cost alternative for improving ASR.
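To make the interleaving concrete, below is a minimal sketch of how such a few-shot ASR prompt could be assembled for Phi-4-MM with Hugging Face transformers. This is an illustration rather than the authors' released code: the checkpoint name, the `<|user|>`/`<|assistant|>`/`<|audio_N|>` placeholder syntax, and the instruction wording follow the public Phi-4-multimodal-instruct model card and may differ from the paper's exact prompts (available on GitHub).

```python
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Phi-4-multimodal-instruct"  # assumed HF checkpoint name
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

def transcribe_with_icl(shot_paths, shot_texts, target_path):
    """Build an interleaved few-shot prompt -- k (audio, transcript) turns
    followed by the target clip -- and decode the model's continuation."""
    audios, turns = [], []
    for i, (path, text) in enumerate(zip(shot_paths, shot_texts), start=1):
        audios.append(sf.read(path))  # (waveform, sample_rate) tuple
        turns.append(
            f"<|user|><|audio_{i}|>Transcribe the audio clip into text."
            f"<|end|><|assistant|>{text}<|end|>"
        )
    # The target clip gets the same instruction but no transcript.
    audios.append(sf.read(target_path))
    turns.append(
        f"<|user|><|audio_{len(audios)}|>Transcribe the audio clip into text."
        f"<|end|><|assistant|>"
    )
    inputs = processor(text="".join(turns), audios=audios,
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    new_tokens = out[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

Deciding which utterances to pass as `shot_paths` is exactly the context-selection question the experiments probe next.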

The experimental setup leverages four English speech corpora: L2-ARCTIC, CMU ARCTIC, the Hispanic-English Corpus (HEC), and the Speech Accent Archive (SAA). These datasets cover a wide range of accents, speaker demographics, and speech contexts, enabling rigorous testing of the adaptation mechanism. Context selection samples examples either from the same speaker as the target utterance or from a different speaker of the same variety, allowing the authors to probe whether adaptation depends on speaker identity or only on the broader language variety.
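The two context-selection conditions can be sketched as follows; the record fields (`speaker`, `variety`, `id`) are illustrative assumptions, not the paper's actual data schema.

```python
import random

def sample_context(corpus, target, k=12, match_speaker=True):
    """Pick k in-context utterances: either from the target speaker itself
    (speaker-matched) or from other speakers of the same language variety."""
    if match_speaker:
        pool = [u for u in corpus
                if u["speaker"] == target["speaker"] and u["id"] != target["id"]]
    else:
        pool = [u for u in corpus
                if u["variety"] == target["variety"]
                and u["speaker"] != target["speaker"]]
    return random.sample(pool, k)
```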

Numerical Results and Implications

The paper presents robust numerical evidence that ICL improves ASR. Across the corpora, gains are most pronounced for low-resource varieties and underrepresented speakers; for instance, the average relative WER reduction reached 27.2% for Spanish heritage speakers in the HEC dataset. However, adaptation efficacy plateaued beyond roughly ten examples, indicating diminishing returns as context length grows.
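For reference, relative WER reduction is computed as (WER_baseline - WER_ICL) / WER_baseline. A quick sketch with the jiwer library, using made-up transcripts rather than the paper's data:

```python
from jiwer import wer  # pip install jiwer

refs = ["she had your dark suit in greasy wash water all year"]
baseline_hyps = ["she had her dark suit in greasy wash water all year"]
icl_hyps = ["she had your dark suit in greasy wash water all year"]

wer_base = wer(refs, baseline_hyps)
wer_icl = wer(refs, icl_hyps)
rel_reduction = (wer_base - wer_icl) / wer_base
print(f"baseline WER={wer_base:.3f}, ICL WER={wer_icl:.3f}, "
      f"relative reduction={100 * rel_reduction:.1f}%")
```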

Several observations also underscore the model's sensitivity to prompt design, with systematic task framing yielding small but consistent gains. Labeling clips as 'non-native' in zero-shot conditions and including a "Transcription:" marker in few-shot prompts slightly enhanced performance. This parallels behavioral patterns in human speech perception studies and suggests that human-like adaptation mechanisms are accessible in SLMs.
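For illustration, the two framings might look like the templates below; these paraphrase the paper's description, and the exact wording is in the authors' GitHub release.

```python
# Paraphrased prompt framings (illustrative; see the authors' repo for exact text).
ZERO_SHOT_INSTRUCTION = (
    "Transcribe the following audio clip from a non-native English speaker."
)
FEW_SHOT_TURN = "Transcription: {transcript}"  # marker prefixed to each example
```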

Future Directions and Limitations

The investigation charts a promising path towards more equitable speech technology: low-cost adaptation at inference time can narrow existing performance disparities. However, extrapolation to spontaneous speech, non-English languages, and real-time applications remains unexplored. Importantly, the reliance on accurately transcribed context limits practical deployment in settings where labeled data is scarce.

The study concludes that while current models like Phi-4-MM exhibit emergent ASR robustness that can be unlocked through prompt engineering at inference time, extending these findings to other SLM frameworks and task types would further validate the proposed methodology and its implications for wide-scale deployment.
