Speech In-Context Learning (SICL)
- Speech In-Context Learning (SICL) is a test-time adaptation paradigm that uses a small set of labeled audio examples to adjust model predictions without gradient updates.
- It relies on prompt construction that integrates task instructions with interleaved audio and transcript cues to achieve robust adaptation across different domains and speakers.
- Advanced example selection methods like Semantic KNN, acoustic re-ranking, and Bayesian scoring significantly reduce word error rates in various ASR scenarios.
Speech In-Context Learning (SICL) is a test-time adaptation paradigm for speech models in which a small sequence of labeled audio examples (audio–text or audio–label pairs) is presented as a prompt. Without any gradient update, the model adapts its predictions according to these in-context exemplars, enabling robust on-the-fly adaptation to new speakers, domains, or previously unseen tasks. SICL operationalizes a modality-general extension of in-context learning, originally demonstrated in text LLMs (e.g., GPT-3), to the speech and multimodal domains. The defining characteristic of SICL is the model's ability to integrate both acoustic and label (typically, transcript) evidence from a prompt—interleaved with instructions, demonstration examples, and the target input—entirely at inference, thus mimicking aspects of rapid human perceptual adaptation.
1. Formal Definition and Mathematical Framework
In automatic speech recognition (ASR) SICL, the model is conditioned on a set of $k$ in-context examples $\mathcal{C} = \{(x_i, t_i)\}_{i=1}^{k}$, where $x_i$ is an audio waveform and $t_i$ is its corresponding transcript. Given a new test audio $x^*$ and the context $\mathcal{C}$, the model estimates the conditional probability:

$$P(t^* \mid x^*, \mathcal{C}) = P(t^* \mid \mathcal{P}),$$

where $\mathcal{P}$ is the full prompt sequence consisting of explicit task instructions, interleaved audio–text shots, and the final audio to transcribe. The model typically factorizes this probability autoregressively over the text tokens of $t^*$ after encoding all elements of $\mathcal{P}$, including any special tokens (e.g., "<|user|>", "<|assistant|>", "<|audio_N|>", etc.) that mark roles and modalities. This paradigm shifts the output distribution from the standard zero-shot $P(t^* \mid x^*)$ to a context-adapted one $P(t^* \mid x^*, \mathcal{C})$.
Scaling behavior for SICL indicates that the relative word error rate improvement rises rapidly over the first few in-context examples but exhibits diminishing returns as more samples are provided, following a saturating curve whose parameters were estimated empirically (Roll et al., 20 May 2025).
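As an illustration of how such a diminishing-returns curve can be estimated empirically, the sketch below fits a saturating-exponential form to synthetic improvement data; the functional form, the data points, and the parameter names are assumptions made for illustration, not the fit reported by Roll et al.

```python
# Minimal sketch: fit a saturating (diminishing-returns) curve to
# relative WER improvement vs. number of in-context shots.
# The data and the functional form a * (1 - exp(-k / b)) are illustrative
# assumptions, not the empirical fit from the cited paper.
import numpy as np
from scipy.optimize import curve_fit

def saturating(k, a, b):
    """Relative improvement that rises quickly, then plateaus."""
    return a * (1.0 - np.exp(-k / b))

shots = np.array([0, 1, 2, 4, 8, 12])                            # number of in-context examples
rel_improvement = np.array([0.0, 8.0, 12.5, 16.0, 18.5, 19.5])   # % relative WER reduction (synthetic)

(a_hat, b_hat), _ = curve_fit(saturating, shots, rel_improvement, p0=[20.0, 2.0])
print(f"asymptotic gain ~ {a_hat:.1f}%, characteristic number of shots ~ {b_hat:.1f}")
```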
2. Prompt Construction and Inference Procedure
SICL constructs a single, contiguous input sequence combining:
- Initial instructions (e.g., “transcribe audio from a non-native speaker”), modulated by the target domain.
- For each in-context demonstration, one block:
  <|user|><|audio_i|><|end|><|assistant|>Transcription: t_i<|end|>
- The target test audio instance, similarly wrapped with prompt text and delimiters.
The entire context (up to the model’s maximum window, e.g., 128k tokens in Phi-4-MM) is fed into a multimodal decoder, which autoregressively generates the transcription. Key prompt parameters include the “speaker setting” (same speaker vs. same dialect/variety, for example) and explicit markers (such as “Transcription:” or task-specific role directives) (Roll et al., 20 May 2025).
Prompt formatting significantly affects SICL performance; the addition of explicit output markers like “Transcription:” stabilizes early-shot adaptation and improves WER, particularly in low-resource domains. Instructions explicitly mentioning “non-native English speaker” further produce small but reliable WER reductions (0.1–0.3 percentage points) (Roll et al., 20 May 2025).
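A minimal sketch of this prompt assembly is given below; the delimiter tokens and the "Transcription:" marker follow the template described above, while the exact token strings expected by any particular model (e.g., Phi-4-MM) should be taken from that model's own chat template rather than from this illustration.

```python
# Minimal sketch of SICL prompt assembly for ASR.
# Delimiters (<|user|>, <|assistant|>, <|end|>, <|audio_i|>) mirror the template
# above; a real model's chat template may use different token strings.
from typing import List, Tuple

def build_sicl_prompt(
    instruction: str,                  # e.g. "Transcribe audio from a non-native speaker."
    demos: List[Tuple[str, str]],      # (audio placeholder id, reference transcript) pairs
    test_audio_id: str,                # placeholder for the audio to transcribe
) -> str:
    parts = [instruction]
    for audio_id, transcript in demos:
        parts.append(f"<|user|><|{audio_id}|><|end|>")
        parts.append(f"<|assistant|>Transcription: {transcript}<|end|>")
    # The target utterance is wrapped the same way; generation continues after the marker.
    parts.append(f"<|user|><|{test_audio_id}|><|end|>")
    parts.append("<|assistant|>Transcription:")
    return "\n".join(parts)

print(build_sicl_prompt(
    "Transcribe audio from a non-native English speaker.",
    demos=[("audio_1", "the weather is nice today"), ("audio_2", "please call me tomorrow")],
    test_audio_id="audio_3",
))
```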
3. Example Selection Strategies
The effectiveness of SICL critically depends on how in-context examples are selected. Random sampling from a large candidate pool provides weak adaptation signals, especially for out-of-domain or rare settings. Multiple data-driven retrieval strategies have been developed:
- Semantic KNN (TICL): Retrieve examples whose transcripts, embedded by a frozen text encoder, are nearest in $\ell_2$-normalized embedding space to a pseudo-transcript of the test utterance (obtained via a strong ASR pseudo-labeler). This method achieves up to an 84.7% relative WER reduction compared to zero-shot on accented English with state-of-the-art LMMs (Zheng et al., 16 Sep 2025).
- Acoustic re-ranking (TICL+): After semantic retrieval, rerank the top-$k$ candidates by acoustic similarity using fixed-length audio embeddings (a combined retrieval-plus-re-ranking sketch follows this list). For children's speech, this hybrid approach yields up to 53.3% relative WER reduction over zero-shot and up to 37.6% over semantic-only selection (Zheng et al., 20 Dec 2025).
- Bayesian Example Selection (ByCS): Evaluate candidates by their inverse inference likelihood: for each in-context candidate, treat the test utterance as "context" and decode the candidate's audio, scoring by the similarity between the candidate's predicted label and its true reference (a sketch of this scoring appears after the following paragraph). The top-$k$ candidates under this Bayesian proxy yield an average 10.25% relative WER reduction over standard acoustic KNN methods in low-resource dialectal ASR (Wang et al., 23 Apr 2024).
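As referenced above, the following sketch combines the semantic-KNN retrieval and acoustic re-ranking stages; the sentence-transformers encoder, the cosine-similarity scoring, and the audio_embed callable are illustrative assumptions rather than the exact components used in the cited papers.

```python
# Minimal sketch of two-stage in-context example selection:
#   1) semantic KNN over transcripts (pseudo-transcript for the test utterance),
#   2) acoustic re-ranking of the semantic shortlist with fixed-length audio embeddings.
# The text encoder and the audio_embed callable are illustrative choices.
import numpy as np
from sentence_transformers import SentenceTransformer

text_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any frozen text encoder works

def select_examples(pseudo_transcript, candidates, audio_embed, test_audio,
                    k_semantic=20, k_final=4):
    """candidates: list of dicts with keys 'transcript' and 'audio'."""
    # Stage 1: semantic KNN in normalized text-embedding space (cosine similarity).
    cand_texts = [c["transcript"] for c in candidates]
    cand_vecs = text_encoder.encode(cand_texts, normalize_embeddings=True)
    query_vec = text_encoder.encode([pseudo_transcript], normalize_embeddings=True)[0]
    sem_scores = cand_vecs @ query_vec
    top_sem = np.argsort(-sem_scores)[:k_semantic]

    # Stage 2: re-rank the semantic shortlist by acoustic similarity to the test audio.
    test_vec = audio_embed(test_audio)
    test_vec = test_vec / np.linalg.norm(test_vec)

    def acoustic_score(idx):
        v = audio_embed(candidates[idx]["audio"])
        return float(v @ test_vec / np.linalg.norm(v))

    reranked = sorted(top_sem, key=acoustic_score, reverse=True)
    return [candidates[i] for i in reranked[:k_final]]
```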
A principled selection strategy is indispensable, since random or semantically mismatched examples can degrade performance, especially in domains with phonological or lexical mismatch to the target (Wang et al., 2023, Zheng et al., 16 Sep 2025).
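The inverse-inference scoring referenced above can be sketched as follows; the transcribe_with_context callable is a stand-in for any SICL-capable model, the test utterance's pseudo-label is assumed to come from a first-pass decode, and the character-level similarity is an illustrative proxy rather than the exact metric of Wang et al. (23 Apr 2024).

```python
# Minimal sketch of Bayesian Example Selection (ByCS)-style scoring:
# for each candidate (audio, transcript) pair, use the *test* utterance (with its
# pseudo-label) as the single in-context example, decode the candidate's audio,
# and score how close the decoded text is to the candidate's reference transcript.
# transcribe_with_context is a stand-in for any SICL-capable model;
# SequenceMatcher similarity is an illustrative proxy metric.
from difflib import SequenceMatcher

def bycs_select(test_audio, pseudo_transcript, candidates, transcribe_with_context, k=4):
    """candidates: list of dicts with keys 'audio' and 'transcript'."""
    scored = []
    for cand in candidates:
        # Inverse inference: condition on the test utterance and decode the candidate.
        hyp = transcribe_with_context(
            context=[(test_audio, pseudo_transcript)],
            target_audio=cand["audio"],
        )
        sim = SequenceMatcher(None, hyp, cand["transcript"]).ratio()
        scored.append((sim, cand))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [cand for _, cand in scored[:k]]
```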
4. Model Architectures and Modalities
SICL has been instantiated in a variety of contemporary speech and speech-text architectures, including:
- Causal decoder-only LMMs with modality adapters: E.g., Phi-4-MM, which combines a convolutional log-Mel frontend, Conformer encoder, two-layer projector, and LoRA-based adapters with a frozen LLM backbone, supporting a 128k token context (Roll et al., 20 May 2025).
- Encoder-decoder Whisper models: SICL is realized via concatenation of demonstration audio and prefix injection of their transcripts, leveraging text or speech KNN for example selection (Wang et al., 2023); see the sketch after this list.
- Multitask and instruction-following models with explicit context slots: SALM augments a frozen LLama-based text LLM with a trainable audio encoder, modality adapter, and LoRA layers, supporting both ASR and automatic speech translation (AST) tasks (Chen et al., 2023).
- Document-level architectures (SICL-AED): AEDs with document-level decoder self-attention and utterance-level encoder cross-attention efficiently operate over long-form contexts, with in-context adaptation for long-form ASR, speaker adaptation, and contextual biasing (Yen et al., 29 Sep 2024).
- Textless speech LMs: GSLM-style models, trained on discrete speech units, can be warmup-prompt tuned for classification SICL, demonstrating few-shot performance rivaling SVMs on the same prompt (Hsu et al., 2023).
- Meta-trained speech-LMs for in-context personalization: Q-Former-based extensions to instruction-tuned encoder–decoder LLMs can meta-learn to personalize emotion recognition via in-context support sets of labeled emotional utterances from new speakers, achieving leading accuracy on a large speaker-diverse benchmark (Ihori et al., 10 Sep 2025).
- Multimodal LLMs with vision backbones: ChatGPT-4o, by prompting with spectrogram images and natural-language explanations, can perform pathological speech detection via SICL, reaching high accuracy with only a handful of in-context examples per class (Amiri et al., 31 Mar 2025).
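For the Whisper-style instantiation above, the sketch below approximates concatenation of demonstration audio with prefix injection of their transcripts, assuming the Hugging Face transformers Whisper prompting interface (get_prompt_ids / prompt_ids); it is an illustration under those assumptions, not the exact recipe of Wang et al. (2023).

```python
# Minimal sketch: Whisper-style SICL by concatenating demonstration audio before the
# test audio and injecting the demonstrations' transcripts as a text prefix (prompt).
# Assumes the Hugging Face transformers Whisper prompting interface; Whisper's
# 30-second input window limits how much demonstration audio fits in practice.
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def sicl_transcribe(demos, test_audio, sampling_rate=16000):
    """demos: list of (waveform, transcript) pairs; all audio mono at `sampling_rate`."""
    concat_audio = np.concatenate([wav for wav, _ in demos] + [test_audio])
    prompt_text = " ".join(t for _, t in demos)  # demo transcripts injected as a prefix/prompt
    inputs = processor(concat_audio, sampling_rate=sampling_rate, return_tensors="pt")
    prompt_ids = processor.get_prompt_ids(prompt_text, return_tensors="pt")
    pred_ids = model.generate(input_features=inputs.input_features, prompt_ids=prompt_ids)
    return processor.batch_decode(pred_ids, skip_special_tokens=True)[0]
```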
5. Empirical Outcomes and Scaling Laws
SICL robustly reduces WER and enhances coverage in many challenging scenarios:
- Across four English ASR corpora, 1–12 in-context examples cut mean WER by 19.7% (≈1.2 pp); in low-resource settings, WER for native accent groups falls by up to 44.5% (2.1% → 1.2%), and other low-resource varieties (Spanish-heritage, L2-ARCTIC Hindi/Korean/Spanish) also benefit substantially (Roll et al., 20 May 2025).
- Performance gains rise steeply over the first few in-context shots but plateau beyond roughly ten shots for most settings, reflecting diminishing returns as context length increases (Roll et al., 20 May 2025, Zheng et al., 16 Sep 2025).
- Speaker-specific context outperforms variety-level context in the low-shot regime (up to roughly six demonstrations), but with more demonstrations, variety-level adaptation can match or even slightly surpass same-speaker gains, reflecting a transition from speaker adaptation to variety adaptation in the underlying model (Roll et al., 20 May 2025).
- For children's speech, combination of text embedding retrieval and acoustic reranking produces consistent, strong WER reductions across MyST, OGI Kids’, ENNI, and RSR datasets, demonstrating robustness to intra-group variability (Zheng et al., 20 Dec 2025).
- Instruction-following speech LLMs (COSMIC) achieve up to 25.8% WER reduction for one-shot cross-domain adaptation, and can halve bias-word error rates in ASR contextual biasing with explicit prompt directives (Pan et al., 2023).
- For endangered languages and unseen domains, SICL enables LLMs to match or surpass transformer LMs trained natively on those languages, using as few as 50–150 in-context examples without weight updates. Probability-based prompt scoring outperforms instruction selection, particularly for language modeling and ASR in low-resource settings (Li et al., 26 May 2025).
- In task-agnostic spoken language understanding (SLU), models fine-tuned with randomized class definitions outperform standard and symbol-based techniques, yielding up to 143% relative improvement in cross-task five-shot settings (Agrawal et al., 12 May 2025).
6. Limitations and Open Problems
Despite pronounced gains, current SICL methods exhibit several open challenges:
- Model Generality: Most studies target a single (often open-source) SLM, such as Phi-4-MM or Whisper, leaving questions about generalization to proprietary or alternative architectures (Roll et al., 20 May 2025).
- Domain Coverage: Predominant benchmarks employ read speech; spontaneous, conversational, or noisy field audio are largely unexplored (Roll et al., 20 May 2025).
- Practical Deployment: SICL presumes availability of high-quality, labeled in-context examples at inference. For real-world deployment, efficient on-the-fly context harvesting (including automatic pseudo-labeling or semi-supervised propagation) is a necessary extension (Wang et al., 2023).
- Prompt Length and Model Capacity: The context length remains bounded by model-specific buffer limits (e.g., 128k tokens in Phi-4-MM, or fixed-length in encoder-decoders), and accuracy may plateau or degrade if context exceeds effective capacity (Roll et al., 20 May 2025, Zheng et al., 16 Sep 2025).
- Bias and Equity: Even with extensive in-context adaptation, models trained on imbalanced pretraining data exhibit persistent WER gaps for non-native and under-resourced varieties (Roll et al., 20 May 2025).
- Robustness to Adversarial Contexts: Unrelated, adversarial, or mismatched in-context examples can mislead the prediction, especially in closely related dialect settings (Wang et al., 2023).
7. Broader Implications and Extensions
SICL enables speech models to rapidly adapt to diverse domains and tasks, removing the need for costly parameter updates or explicit retraining per condition. The paradigm generalizes across:
- Multilingual and low-resource speech recognition: In-context priming allows LLMs to learn previously unseen languages, and to rival (or beat) transformer LMs trained directly on those languages’ text, all while preserving base multilingual capabilities (Li et al., 26 May 2025).
- Personalized and contextual spoken language understanding: Task-agnostic meta-tuning or randomized label fine-tuning unlocks robust cross-task and cross-domain generalization in speech LLMs (Agrawal et al., 12 May 2025, Ihori et al., 10 Sep 2025).
- Hybrid modalities: SICL has been extended to vision (spectrogram-based), textless speech LMs, semantic retrieval-augmented systems, and long-form streaming ASR (Zheng et al., 20 Dec 2025, Amiri et al., 31 Mar 2025, Hsu et al., 2023, Yen et al., 29 Sep 2024).
- Interpretable and explainable models: SICL-enabled LLMs can generate chain-of-thought rationales and support visual saliency analyses for their decisions in medical and clinical settings (Amiri et al., 31 Mar 2025).
Taken together, SICL operationalizes a class of speech model architectures capable of human-like, flexible, and efficient adaptation based purely on in-context evidence—enabling robust performance across speakers, languages, and downstream tasks without parameter modification (Roll et al., 20 May 2025, Zheng et al., 16 Sep 2025, Wang et al., 2023, Chen et al., 2023, Zheng et al., 20 Dec 2025, Pan et al., 2023, Yen et al., 29 Sep 2024, Agrawal et al., 12 May 2025, Li et al., 26 May 2025, Hsu et al., 2023, Ihori et al., 10 Sep 2025, Wang et al., 23 Apr 2024, Amiri et al., 31 Mar 2025).