
Which feature space best guides in-context example selection for zero-shot ASR?

Identify the most effective feature representation for retrieving in-context examples in the LLM-based zero-shot ASR setting of Omnilingual ASR: rigorously compare selection based on textual similarity, semantic embeddings, and audio embeddings, and determine which yields the highest transcription accuracy on unseen-language utterances.


Background

The zero-shot LLM-ASR model relies on a few in-context speech–text examples at inference time, making the construction of these examples critical for performance.

Multiple retrieval strategies are plausible—acoustic similarity (e.g., wav2vec 2.0 embeddings), semantic similarity (e.g., SONAR embeddings), or text-based similarity—and their relative effectiveness is not obvious a priori. The authors explicitly pose this as an open question before presenting comparative experiments.
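To make the comparison concrete, the sketch below shows generic nearest-neighbor retrieval over a pool of candidate example embeddings; the feature space is whatever encoder produced the vectors (e.g., wav2vec 2.0 for acoustic similarity, SONAR for semantic similarity, or a text embedder). This is a minimal illustration, not the paper's implementation: the function name `retrieve_context_examples`, the pool size, and the random data are assumptions.

```python
# Minimal sketch of k-NN context-example retrieval by embedding similarity.
# Assumes embeddings have already been computed with some encoder
# (e.g., wav2vec 2.0 or SONAR); the names and data here are illustrative.
import numpy as np

def retrieve_context_examples(query_emb: np.ndarray,
                              pool_embs: np.ndarray,
                              k: int = 4) -> np.ndarray:
    """Return indices of the k pool items most cosine-similar to the query."""
    # Normalize so that dot products equal cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q
    # Indices of the top-k most similar candidates, best first.
    return np.argsort(-sims)[:k]

# Usage: embed the target utterance and a pool of speech-text examples in the
# chosen feature space, then pass the selected k examples to the LLM as context.
rng = np.random.default_rng(0)
pool = rng.standard_normal((100, 512))   # 100 candidate examples, 512-d embeddings
query = rng.standard_normal(512)         # unseen-language utterance embedding
print(retrieve_context_examples(query, pool, k=4))
```

Under this framing, the open question reduces to which encoder produces the embedding space in which nearest neighbors make the most useful context examples.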

References

An open question is which features to use when selecting context examples—textual, semantic, or audio similarity.

Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages (arXiv:2511.09690, Omnilingual ASR team, 12 Nov 2025), Section 6.6 (Constructing Context Examples for Zero-Shot ASR)