In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties
The paper "In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties" investigates how in-context learning (ICL) improves speech recognition in state-of-the-art spoken language models (SLMs). Using the Phi-4 Multimodal (Phi-4-MM) model, the authors show that ICL confers human-like adaptation benefits, making automatic speech recognition (ASR) more robust across diverse speakers and language backgrounds, although gaps persist for certain varieties.
Framework and Methodology
The proposed framework enables the Phi-4-MM model to adapt rapidly to new speakers and language varieties through ICL by supplying audio-text example pairs at inference time. Notably, just 12 exemplar utterances (approximately 50 seconds of audio) reduce word error rates (WERs) by an average of 19.7% relative across varied English corpora. This sidesteps traditional adaptation methods such as continued pre-training or supervised fine-tuning, which demand far more data and compute, suggesting a scalable, lower-cost route to ASR improvement.
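To make the mechanism concrete, here is a minimal Python sketch of how such a few-shot prompt might be assembled from audio-text pairs. The placeholder tags (`<|audio_1|>`, `<|audio_target|>`) and the `Exemplar` structure are illustrative assumptions, not Phi-4-MM's documented chat template.

```python
# Minimal sketch of few-shot ICL prompt assembly for a speech LM.
# Audio placeholder tags and prompt wording are assumptions for
# illustration, not the model's actual template.

from dataclasses import dataclass
from typing import List

@dataclass
class Exemplar:
    audio_path: str   # path to a context utterance
    transcript: str   # its ground-truth transcription

def build_icl_prompt(exemplars: List[Exemplar],
                     target_tag: str = "<|audio_target|>") -> str:
    """Interleave k audio-text pairs, then append the target utterance tag.

    With k = 12 exemplars (~50 s of audio), the paper reports an average
    19.7% relative WER reduction on its English corpora.
    """
    parts = []
    for i, ex in enumerate(exemplars, start=1):
        # Each exemplar contributes an audio placeholder plus its transcript.
        parts.append(f"<|audio_{i}|>\nTranscription: {ex.transcript}")
    # The model is asked to transcribe only the final, unlabeled utterance.
    parts.append(f"{target_tag}\nTranscription:")
    return "\n\n".join(parts)

if __name__ == "__main__":
    shots = [Exemplar(f"spk01_utt{i:02d}.wav", f"example transcript {i}")
             for i in range(12)]
    print(build_icl_prompt(shots))
```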
The experimental setup leverages four English speech corpora: L2-ARCTIC, CMU-Arctic, the Hispanic-English Corpus (HEC), and the Speech Accent Archive (SAA). These datasets cover a wide range of accents, speaker demographics, and speech contexts, enabling rigorous testing of the adaptation mechanism. Context selection compares sampling in-context examples from the same speaker versus from a different speaker of the same language variety, isolating whether the model adapts to the individual speaker, the variety, or both.
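As a sketch of this sampling procedure, the snippet below implements the two conditions. The metadata field names (`speaker`, `variety`, `id`) are assumptions about how corpus entries might be keyed, not the paper's actual schema.

```python
# Sketch of the two context-selection conditions: exemplars drawn from
# the same speaker, or from a different speaker of the same variety.

import random
from typing import Dict, List

def sample_context(utterances: List[Dict], target: Dict, k: int = 12,
                   condition: str = "same_speaker", seed: int = 0) -> List[Dict]:
    """Pick up to k context utterances for a target under one condition."""
    rng = random.Random(seed)
    if condition == "same_speaker":
        # Same speaker as the target, excluding the target utterance itself.
        pool = [u for u in utterances
                if u["speaker"] == target["speaker"] and u["id"] != target["id"]]
    elif condition == "same_variety":
        # Same language variety, but a different speaker.
        pool = [u for u in utterances
                if u["variety"] == target["variety"]
                and u["speaker"] != target["speaker"]]
    else:
        raise ValueError(f"unknown condition: {condition}")
    return rng.sample(pool, min(k, len(pool)))
```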
Numerical Results and Implications
The paper presents robust numerical evidence for improved ASR through ICL. Across the corpora, gains are largest for low-resource varieties and underrepresented speakers; for instance, Spanish heritage speakers in the HEC dataset saw an average relative WER reduction of 27.2%. Gains plateaued beyond roughly ten in-context examples, however, indicating diminishing returns from longer contexts.
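For clarity, the relative-reduction metric behind these figures works as follows. The baseline and adapted WER values in this example are placeholders chosen only so the arithmetic reproduces the quoted 27.2%; they are not numbers from the paper.

```python
# Worked example of the relative WER reduction metric quoted above.

def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Relative reduction = (baseline - adapted) / baseline."""
    return (baseline_wer - adapted_wer) / baseline_wer

# e.g. a drop from 25.0% to 18.2% WER is a 27.2% relative reduction:
print(f"{relative_wer_reduction(0.250, 0.182):.1%}")  # -> 27.2%
```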
The results also underscore the model's sensitivity to prompt design, with systematic task framing yielding small but consistent gains. Labeling clips as 'non-native' in the zero-shot condition and prefixing transcripts with a "Transcription:" marker in few-shot prompts both modestly improved performance, paralleling effects observed in human speech perception studies and suggesting that analogous adaptation mechanisms are accessible in SLMs.
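A hedged sketch of these two framing manipulations is shown below. The exact instruction wording is an assumption, since the paper reports only that such cues produce small gains, not the literal prompt text.

```python
# Sketch of the two prompt-framing manipulations discussed above.
# Instruction wording is illustrative, not the paper's exact prompts.

def zero_shot_instruction(mark_non_native: bool = True) -> str:
    # Zero-shot condition: optionally flag the speaker as non-native.
    speaker_note = "a non-native English speaker" if mark_non_native else "a speaker"
    return f"Transcribe the following audio clip from {speaker_note}."

def few_shot_line(transcript: str, use_marker: bool = True) -> str:
    # Few-shot condition: optionally prefix each exemplar transcript
    # with a "Transcription:" marker.
    return f"Transcription: {transcript}" if use_marker else transcript
```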
Future Directions and Limitations
The work charts a promising path toward more equitable speech technology by using low-cost adaptation to narrow existing performance disparities. However, generalization to spontaneous speech, non-English languages, and real-time applications remains unexplored, and the reliance on accurately transcribed context limits practical deployment where labeled data is scarce.
The study concludes that while current models such as Phi-4-MM exhibit ASR robustness that can be unlocked through inference-time prompting alone, extending these findings to other SLM frameworks and task types would further validate the methodology and its implications for wide-scale deployment.