Which feature space best guides in-context example selection for zero-shot ASR?
Identify the most effective feature representation for retrieving context examples in the LLM-based zero-shot ASR setting of Omnilingual ASR. Doing so requires a rigorous comparison of selection based on textual similarity, semantic embeddings, and audio embeddings, to determine which yields the highest transcription accuracy on utterances from unseen languages.
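To make the comparison concrete, the following is a minimal sketch of similarity-based context selection, assuming each candidate example and the target utterance have already been mapped to a vector in one chosen feature space (textual, semantic, or audio). The function names `cosine_similarity` and `select_context_examples` and the toy vectors are hypothetical illustrations, not the retrieval procedure used in Omnilingual ASR.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_context_examples(
    query_vec: Sequence[float],
    pool_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k pool items most similar to the query, best first.

    Assumption: all vectors come from the same feature space, so the
    similarities are comparable.
    """
    ranked = sorted(
        range(len(pool_vecs)),
        key=lambda i: cosine_similarity(query_vec, pool_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors standing in for embeddings of the target utterance and the pool.
query = [1.0, 0.0]
pool = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0], [0.9, 0.0]]
selected = select_context_examples(query, pool, k=2)
print(selected)
```

Under this framing, comparing feature spaces amounts to swapping the function that produces the vectors, holding `k` and the candidate pool fixed, and measuring transcription accuracy on the resulting prompts.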
References
An open question is which features to use when selecting context examples—textual, semantic, or audio similarity.
— Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
(2511.09690 - team et al., 12 Nov 2025) in Section 6.6 (Constructing Context Examples for Zero-Shot ASR)