Zero-Shot Document Retrieval
- Zero-shot document retrieval is the task of ranking documents for new queries without any relevance-labeled training data from the target domain.
- It leverages dense retrievers, cross-encoders, and hybrid pipelines augmented with synthetic queries to achieve state-of-the-art performance.
- Advances in self-learning, representation sharpening, and dynamic indexing drive robust generalization across diverse modalities and languages.
Zero-shot document retrieval is the task of ranking documents by relevance to queries from domains, languages, or problem settings for which the retrieval model has never seen labeled (relevant/non-relevant) examples. Rather than relying on annotated data or explicit supervised fine-tuning, zero-shot retrieval exploits the generalization capabilities of large pre-trained models—be they dense retrievers, LLMs, or hybrid approaches—often in conjunction with auxiliary techniques such as synthetic query generation, pseudo-document creation, prompt-driven representation extraction, or corpus-specific adaptation. Over the past several years, a broad array of methodologies has emerged, with demonstrated state-of-the-art performance across diverse benchmarks, domains, and modalities.
1. Core Principles and Problem Formulation
Zero-shot document retrieval is defined by two key constraints: absence of relevance-labeled data in the target domain, and the need for models to generalize to unseen distributions of queries, documents, or both. The retrieval model may have been trained in a supervised or self-supervised fashion on source data, but, at the point of deployment, must operate in a truly unsupervised manner on new corpora and queries (Yu et al., 2022, Rosa et al., 2022, Arora et al., 2023).
Canonical formulations include both dense and sparse retrieval. In dense retrieval, bi-encoders map queries and documents independently to a shared vector space; scoring is typically via dot-product or cosine similarity, s(q, d) = f_Q(q) · f_D(d), where f_Q and f_D are the query and document encoders (Yu et al., 2022, Shi et al., 2021).
Cross-encoder rerankers, in contrast, utilize early query–document interaction by feeding concatenated query-document pairs through a transformer, enabling fine-grained token alignment and robust generalization at the cost of higher inference latency (Rosa et al., 2022, MacAvaney et al., 2019).
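The architectural contrast above can be sketched in a few lines. This is a toy illustration, not a real retriever: L2-normalized bag-of-words vectors stand in for learned dense encoders, and a token-overlap bonus stands in for the transformer's joint query–document attention.

```python
from collections import Counter
import math

def embed(text):
    # Toy stand-in for a learned encoder: L2-normalized bag-of-words vector.
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {w: v / norm for w, v in counts.items()}

def bi_encoder_score(query, doc):
    # Bi-encoder: query and document are embedded independently, then
    # scored by dot product (cosine similarity here, since normalized).
    q, d = embed(query), embed(doc)
    return sum(q[w] * d.get(w, 0.0) for w in q)

def cross_encoder_score(query, doc):
    # Sketch of a cross-encoder: the pair is scored jointly. A real
    # cross-encoder feeds "query [SEP] doc" through a transformer; here a
    # crude token-overlap bonus mimics early query-document interaction.
    overlap = set(query.lower().split()) & set(doc.lower().split())
    return bi_encoder_score(query, doc) + 0.1 * len(overlap)

docs = ["dense retrieval with bi-encoders", "sparse lexical matching with BM25"]
query = "bi-encoders for dense retrieval"
ranked = sorted(docs, key=lambda d: bi_encoder_score(query, d), reverse=True)
```

The key operational difference is that bi-encoder document vectors can be precomputed and indexed, while the cross-encoder must run once per query–document pair at query time, which is why cross-encoders are usually deployed only as rerankers.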
In all variants, the key measure of success is ranking effectiveness (nDCG, recall@k, MRR, etc.) on queries and corpora drawn from distributions unseen during any supervised training.
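For concreteness, nDCG@k (the most common of these metrics in the benchmarks cited below) can be computed directly from graded relevance labels in system-ranked order:

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: graded relevance discounted by log2(rank+1).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking,
    # so a perfect ordering scores 1.0 regardless of label scale.
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of documents in the order the system returned them.
system_ranking = [3, 0, 2, 1]
score = ndcg_at_k(system_ranking, 4)
```

Recall@k and MRR are computed analogously from the ranked list; nDCG is preferred when relevance is graded rather than binary.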
2. Methodological Taxonomy
2.1 Dense and Cross-Encoder Approaches
Zero-shot dense retrieval is epitomized by models such as DPR, Contriever, and COCO-DR, which are trained in a contrastive fashion and applied unchanged to new domains (Yu et al., 2022, Shi et al., 2021). Cross-encoder rerankers, particularly large architectures (T5-3B, monoT5-3B), exhibit superior zero-shot performance due to parameter scale and early self-attention between query and document tokens. Distilled models (MiniLM) are competitive under tight latency constraints, but their cross-domain generalization is notably reduced (Rosa et al., 2022).
Multilingual settings leverage pre-trained transformers such as mBERT, with dense or cross-encoder architectures, producing robust improvements over lexical baselines on non-English collections even in full zero-shot transfer (MacAvaney et al., 2019, Shi et al., 2021).
2.2 Generation-Augmented and Hybrid Pipelines
Generation-augmented retrieval (GAR), retrieval-augmented generation (RAG), and hybrid methods (such as SL-HyDE, LameR, and GAR-meets-RAG) combine LLMs with traditional retrieval, employing techniques such as pseudo-document generation, query augmentation, or iterative recurrent pipelines to successively improve recall and precision (Li et al., 2024, Shen et al., 2023, Arora et al., 2023).
The SL-HyDE framework, for example, leverages LLMs to generate hypothetical, query-conditioned pseudo-documents, which are then used for contrastive bi-encoder retriever self-learning. The generator and retriever mutually reinforce each other via a bootstrapped, self-labeling loop, with both components refined solely using unlabeled corpus data (Li et al., 2024).
LameR employs a two-stage approach: initial retrieval with BM25, followed by candidate-prompted answer synthesis via an LLM, after which the query is augmented with generated answers and re-submitted to BM25. This pipeline achieves strong zero-shot performance, surpassing both BM25 and supervised dense models in nDCG@10 on TREC DL benchmarks (Shen et al., 2023).
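The two-stage pattern can be sketched as follows. This is a schematic, not LameR itself: simple token overlap stands in for BM25, and a placeholder function that echoes top candidates stands in for the LLM answer-synthesis step, which in the real pipeline prompts an LLM with the query plus retrieved candidates.

```python
def lexical_score(query, doc):
    # Stand-in for BM25: raw token overlap (a real system would use BM25
    # with IDF weighting and length normalization).
    d_tokens = set(doc.lower().split())
    return sum(1 for t in query.lower().split() if t in d_tokens)

def synthesize_answers(query, candidates):
    # Placeholder for the LLM step: LameR prompts an LLM with the query and
    # top candidates to generate several likely answers; here we simply
    # echo the top candidates so the sketch stays deterministic.
    return candidates[:2]

def lamer_retrieve(query, corpus, k=3):
    # Stage 1: initial lexical retrieval over the whole corpus.
    first_pass = sorted(corpus, key=lambda d: lexical_score(query, d), reverse=True)
    # Stage 2: augment the query with synthesized answers and re-retrieve,
    # so answer vocabulary absent from the short query can match documents.
    answers = synthesize_answers(query, first_pass)
    augmented = query + " " + " ".join(answers)
    return sorted(corpus, key=lambda d: lexical_score(augmented, d), reverse=True)[:k]

corpus = [
    "paris is the capital of france",
    "the eiffel tower is in paris",
    "berlin is the capital of germany",
]
results = lamer_retrieve("capital of france", corpus)
```

The augmentation step is what lifts recall: generated answers inject terms the user's query lacks, which the second BM25 pass can then match lexically.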
2.3 Representation Enhancement and Data-Centric Techniques
Recent advances focus on representation sharpening, document linking, referral augmentation, and synthetic query generation to boost zero-shot discriminative power.
Representation sharpening, as formulated in (Ashok et al., 7 Nov 2025), constructs synthetic, document-specific contrastive queries—crafted to differentiate a document from its nearest embedding neighbors—using LLMs. These are then used, at inference or index time, to shift document representations, yielding +6.9 nDCG@10 improvement on BEIR, with significant gains across low-resource and reasoning-intensive tasks.
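The index-time shift can be illustrated with plain vector arithmetic. This is a simplified interpolation sketch, not the paper's exact update rule: the document embedding is moved toward the mean embedding of its synthetic contrastive queries, then re-normalized.

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sharpen(doc_vec, contrastive_query_vecs, alpha=0.5):
    # Shift the document embedding toward the mean of its document-specific
    # contrastive query embeddings (an interpolation sketch; the published
    # method's exact update may differ).
    mean_q = [sum(col) / len(contrastive_query_vecs)
              for col in zip(*contrastive_query_vecs)]
    shifted = [d + alpha * q for d, q in zip(doc_vec, mean_q)]
    return l2_normalize(shifted)

# Two near-duplicate documents; a contrastive query that only doc_a answers.
doc_a, doc_b = [1.0, 0.1], [1.0, 0.0]
query = [0.0, 1.0]
sharpened_a = sharpen(doc_a, [query])
```

After sharpening, doc_a scores higher against the distinguishing query while doc_b is untouched, which is exactly the discriminative separation the technique targets.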
Universal Document Linking (Hwang et al., 2024) builds upon the principle of merging similar documents—selected via entropy-based similarity models and NER-driven thresholds—to enhance synthetic query generation. Concatenation of linked documents produces harder pseudo-queries, yielding superior recall across ten multilingual datasets compared to all baselines.
Referral Augmented Retrieval (Tang et al., 2023) constructs augmented document representations by concatenating or aggregating text from inbound citations, hyperlinks, or referring documents, providing up-to-date, multi-perspective context. This approach gives consistent performance gains (up to +37 absolute points Recall@10) with no model retraining or expensive document expansion steps.
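The core operation is simple enough to show directly: append referring text to the document representation at index time, so vocabulary used *about* a document (but absent from it) becomes matchable. A minimal sketch with a toy overlap scorer:

```python
def referral_augment(doc_text, referring_snippets):
    # Referral augmentation: concatenate text from documents that cite or
    # link to this one. No retraining or generative expansion is needed.
    return doc_text + " " + " ".join(referring_snippets)

def overlap_score(query, doc):
    # Toy lexical scorer standing in for any retriever.
    d_tokens = set(doc.lower().split())
    return sum(1 for t in query.lower().split() if t in d_tokens)

doc = "transformer architecture with self attention"
referrals = ["the bert paper builds on this architecture"]
augmented = referral_augment(doc, referrals)

# "bert" appears only in the referring text, not in the document itself,
# so the augmented representation matches queries the original cannot.
query = "bert architecture"
```

Because the augmentation is pure concatenation at indexing time, refreshing the index as new citations arrive is cheap.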
Synthetic query generation, using instruction-conditioned or LM-driven generators, is widely effective in zero-shot dense retrievers. For instance, ZeroGR (Sun et al., 12 Oct 2025) combines LM-based docid generation with instruction-tuned pseudo-queries and a reverse-annealing decoding strategy, outperforming all dense retrieval and generative baselines on the MAIR and BEIR benchmarks.
3. Multimodal and Domain-Specific Zero-Shot Retrieval
Zero-shot document retrieval extends beyond pure text corpora to medical, scientific, legal, and especially multimodal (text+image) collections.
PREMIR (Choi et al., 23 Aug 2025) and SERVAL (Nguyen et al., 18 Sep 2025) exemplify approaches for multimodal retrieval. PREMIR uses a multimodal LLM to generate token-level pre-questions from raw page images, visual subcomponents, and extracted text, then performs retrieval and clustering wholly in a zero-shot fashion. SERVAL leverages a vision-LLM to produce detailed document captions, which are then embedded with standard text encoders; this generate-and-encode pipeline achieves state-of-the-art nDCG@5 on ViDoRe-v2 and robustly scales to large, multilingual collections.
Specialized pipelines for domains such as legal filings or clinical trials exploit metadata-rich structures and knowledge-base-guided entity replacement, as in Trial2Vec (Wang et al., 2022), which decomposes long documents into key attributes and context, learning hierarchical contrastive representations by exploiting biomedical knowledge graphs (UMLS).
4. Analysis of Zero-Shot Generalization and Performance
A defining challenge in zero-shot retrieval is domain shift between source and target distributions. Performance correlates strongly with alignment/uniformity in dense representation space (Yu et al., 2022), and with the ability of cross-encoders and rerankers to model query–document interactions at scale (Rosa et al., 2022). Continuous contrastive pretraining on target corpus text (COCO-DR) and robust optimization across diverse query clusters (iDRO) produce substantial gains on BEIR, with COCO-DR Base matching or surpassing much larger models (Yu et al., 2022).
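The alignment/uniformity diagnostics mentioned above (in the sense of Wang and Isola's contrastive-representation metrics, which is the framing COCO-DR adopts) can be computed directly from embeddings; this sketch assumes unit-normalized vectors:

```python
import math
from itertools import combinations

def alignment(pos_pairs):
    # Mean squared distance between embeddings of positive (query, doc)
    # pairs; lower is better (alignment metric with alpha = 2).
    return sum(sum((a - b) ** 2 for a, b in zip(x, y))
               for x, y in pos_pairs) / len(pos_pairs)

def uniformity(embeddings, t=2.0):
    # Log of the mean Gaussian potential over all embedding pairs; lower
    # (more negative) means embeddings spread more evenly on the sphere.
    pairs = list(combinations(embeddings, 2))
    pot = sum(math.exp(-t * sum((a - b) ** 2 for a, b in zip(x, y)))
              for x, y in pairs)
    return math.log(pot / len(pairs))

# Well-spread unit vectors vs. a collapsed (degenerate) embedding space.
spread = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
collapsed = [(1.0, 0.0)] * 4
```

A retriever whose target-domain embeddings collapse (poor uniformity) or whose positives drift apart (poor alignment) is a candidate for continuous contrastive pretraining on the target corpus.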
Empirical findings include:
- Large cross-encoders (billions of parameters, e.g. monoT5-3B) achieve up to +10 nDCG@10 over 60M-parameter models in zero-shot settings, while marginal gains on in-domain test sets plateau much earlier (Rosa et al., 2022).
- Supervised models with high in-domain MRR may underperform in zero-shot transfer; in-domain metrics are unreliable proxies for generalization capacity.
- Synthetic and pseudo-labeling techniques (UDL, HyDE, LameR) are additive with zero-shot dense retrievers, and readily plug into indexing or re-ranking pipelines (Hwang et al., 2024, Shen et al., 2023).
- Multilingual cross-encoder models, fine-tuned solely on English, can transfer directly to Arabic, Mandarin, and Spanish retrieval, with statistically significant gains over BM25 (MacAvaney et al., 2019, Shi et al., 2021).
A summary comparison of recent retrieval strategies on standard zero-shot benchmarks is organized in the table below:
| Approach | Core Technique | Reported Zero-Shot Result (metric noted per row) |
|---|---|---|
| COCO-DR (Yu et al., 2022) | Contrastive + iDRO | 0.462 (BERT-Base); higher for Large |
| LameR (Shen et al., 2023) | LLM-augmented BM25 | 69.1 (DL19, nDCG@10) |
| SL-HyDE (Li et al., 2024) | Pseudo-docs + self-learning | +7.2% nDCG@10 (medical CMIRB) |
| PromptReps (Zhuang et al., 2024) | LLM-prompted dense+sparse | 50.07 (Llama3-8B, BEIR avg) |
| ZeroGR (Sun et al., 12 Oct 2025) | Inst.-tuned Gen. Retrieval | 48.1 (BEIR nDCG@10), 41.1 (MAIR Acc@1) |
| Representation Sharpening (Ashok et al., 7 Nov 2025) | Doc-specific contrastive | +6.9 abs. nDCG@10 BEIR |
| REFERRAL (Tang et al., 2023) | Ref.-augmented context | up to +37 Recall@10 (ACL) |
| UDL (Hwang et al., 2024) | Linked doc. join + QGen | +3–5 nDCG@10 across datasets |
5. Modalities, Languages, and Practical Considerations
Zero-shot retrieval methods support diverse modalities (text, images, tables), domains (biomedicine, law, science), and languages. Techniques such as pseudo-TOC-guided hierarchical retrieval (DocsRay (Jeong et al., 31 Jul 2025)), cross-lingual transfer via mBERT (Shi et al., 2021), and multi-phase prompt-based reranking (ASRank (Abdallah et al., 25 Jan 2025)) enable plug-and-play adaptation to new document types and non-English corpora, with near-human accuracy in long-document and multimodal tasks.
Practical considerations for deployment include:
- Latency/performance tradeoff: Large cross-encoders are optimal for zero-shot effectiveness but require reranking cascades. Bi-encoders and prompt-based methods balance performance and speed for large-scale use (Rosa et al., 2022, Zhuang et al., 2024).
- Cost and scalability: Indexing-time augmentation (RAR, UDL) and offline compute of enhanced document representations support rapid, up-to-date index refresh without retraining or heavy inference.
- Hyperparameter robustness: Most recent frameworks (COCO-DR, UDL, SL-HyDE) use only a small number of tunable parameters, and exhibit stable performance across corpora sizes and domains.
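The latency/effectiveness tradeoff in the first bullet is typically resolved with a retrieve-then-rerank cascade: a cheap scorer over the whole corpus, an expensive one over a shortlist. A minimal sketch with toy scorers standing in for a bi-encoder (or BM25) first stage and a cross-encoder second stage:

```python
def cheap_score(query, doc):
    # Fast first-stage scorer (stand-in for a bi-encoder or BM25); runs
    # over the entire corpus, so it must be inexpensive per document.
    d_tokens = set(doc.lower().split())
    return sum(1 for t in query.lower().split() if t in d_tokens)

def expensive_score(query, doc):
    # Slow, higher-quality scorer (stand-in for a large cross-encoder);
    # this toy version reuses the cheap score and breaks ties by
    # preferring shorter documents.
    return (cheap_score(query, doc), -len(doc.split()))

def cascade(query, corpus, first_k=10, final_k=3):
    # Stage 1: score the whole corpus cheaply, keep the top first_k.
    shortlist = sorted(corpus, key=lambda d: cheap_score(query, d),
                       reverse=True)[:first_k]
    # Stage 2: rerank only the shortlist with the expensive scorer, so its
    # per-pair cost is paid first_k times, not once per corpus document.
    return sorted(shortlist, key=lambda d: expensive_score(query, d),
                  reverse=True)[:final_k]

corpus = [
    "zero shot retrieval",
    "zero shot retrieval with large language models",
    "image classification",
]
results = cascade("zero shot retrieval", corpus)
```

Tuning first_k trades recall (larger shortlists catch more relevant documents) against reranking latency, which scales linearly in first_k.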
6. Limitations and Future Directions
Zero-shot retrieval remains limited by domain shift, semantic drift in LLM-generated augmentations, and the memory/computation demands of large inference-time rerankers. A plausible implication is that as LLMs and domain-specific embedding models continue to scale, and as document enrichment and self-labeling strategies become more robust, zero-shot retrieval will further close the gap to in-domain supervised systems, especially in settings where training data is inherently scarce.
Emerging directions include:
- Contrastive docid and query generation: Replacing static identifiers with semantically meaningful, LM-generated docids for flexible retrieval in heterogeneous formats (Sun et al., 12 Oct 2025).
- End-to-end multimodal fusion: Unifying image, text, and layout cues for robust performance on complex documents (Choi et al., 23 Aug 2025, Nguyen et al., 18 Sep 2025, Jeong et al., 31 Jul 2025).
- Adaptive and continual indexing: Incorporating dynamic document arrival and ongoing domain drift, with online updating of linking and enrichment mechanisms (Hwang et al., 2024, Sun et al., 12 Oct 2025).
- Few-shot and instruction-driven extension: Leveraging minimal user-labeled data or natural-language instructions to further tune zero-shot systems in a data-efficient manner (Zhuang et al., 2023, Sun et al., 12 Oct 2025).
Zero-shot document retrieval now encompasses a sophisticated methodological spectrum, integrating advances in unsupervised/self-supervised representation learning, prompt engineering, generative modeling, and domain adaptation. As demonstrated across recent benchmarks and deployments, this paradigm enables effective, task-agnostic retrieval at scale, delivering strong performance in diverse, data-sparse, and rapidly evolving information environments.