
Zero-shot Document Ranking Overview

Updated 1 December 2025
  • Zero-shot document ranking is a technique using large language models without fine-tuning to order documents based on query relevance.
  • It employs diverse paradigms—pointwise, pairwise, listwise, and setwise—to balance computational efficiency with strong empirical performance.
  • Innovations like prompt optimization, hybrid retrieval pipelines, and tournament methods enhance ranking accuracy on benchmarks such as BEIR and TREC DL.

Zero-shot document ranking refers to the task of ordering a set of documents for a given query using models that have not been explicitly fine-tuned for relevance ranking on any supervised or in-domain dataset. Recent advancements exploit LLMs in a zero-shot regime, achieving strong effectiveness by leveraging prompt-based strategies, generative modeling, or emergent comparative abilities. This wide-ranging research area spans multiple ranking paradigms (pointwise, pairwise, listwise, setwise, and tournament-inspired), prompt optimization, and hybridized retrieval pipelines, with extensive empirical validation across benchmarks such as BEIR, TREC Deep Learning, and MS MARCO.

1. Core Paradigms in Zero-Shot Document Ranking

Methods are typically categorized according to how they operationalize the ranking signal from the LLM:

  • Pointwise: Each candidate document is scored independently with respect to the query. Classical query-likelihood models (QLMs) prompt the LLM to estimate $P(q \mid d)$ by scoring the likelihood of generating each query token given the document and a prompt (e.g., "Generate a question that is most relevant to the given article’s title and abstract") (Zhuang et al., 2023). Pointwise models offer efficiency (scaling linearly in the number of candidates), and can make use of output logits for scalar relevance scores (e.g., log-likelihoods or normalized probabilities for "yes"/"no") (Li et al., 13 Jun 2025).
  • Pairwise: The LLM is given (query, document$_1$, document$_2$) and returns a preference or relative relevance judgment, often via a forced-choice prompt ("Which passage is more relevant to the query?"). All-pairs approaches require $O(n^2)$ calls, but can be made more practical through sorting algorithms that require $O(n \log n)$ (heapsort) or $O(n)$ (insertion sort) comparisons (Zhuang et al., 2023).
  • Listwise: The full candidate list (or manageable windows thereof) is presented to the LLM, which outputs a permutation or a sorted subset. Pipelines such as Listwise Reranking with an LLM (LRL) directly generate the identifier ordering in a single sequence, optimizing $\prod_{k=1}^m P(\mathrm{id}_{i_k} \mid q, D, \mathrm{id}_{i_1}, \ldots, \mathrm{id}_{i_{k-1}})$ (Ma et al., 2023). Constraints on input length are addressed by sliding-window or progressive listwise schemes.
  • Setwise: Building on pairwise and listwise, setwise approaches present small groups ($c \geq 3$) and prompt the LLM to select the most relevant among them, integrating selection into $c$-ary sorts. Compared to pairwise, setwise reduces the number of LLM calls by up to 60–70% while retaining or improving effectiveness (Zhuang et al., 2023, Podolak et al., 9 Apr 2025); a minimal selection sketch appears after this list.
  • Hybrid and Tournament: Methods such as TourRank partition candidates into groups per stage and ensemble over multiple randomized tournaments (re-initialized groupings), accumulating per-document advancement points. This yields state-of-the-art effectiveness at modest, parallelizable cost and robustly mitigates input order and context length constraints (Chen et al., 17 Jun 2024).
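
To make the setwise paradigm concrete, the following is a minimal sketch of top-k reranking via repeated c-way selection. The callable `llm_pick_best(query, docs) -> index` is a hypothetical stand-in for a setwise prompt, not any cited work's API; heap-based variants reduce the call count further, as noted in the table below.

```python
from typing import Callable, List

def setwise_topk(query: str,
                 docs: List[str],
                 llm_pick_best: Callable[[str, List[str]], int],
                 k: int = 10,
                 c: int = 4) -> List[str]:
    """Return the top-k documents by repeatedly asking the LLM to pick the most
    relevant document out of groups of size c (a selection-sort-style variant;
    heap-based variants need roughly O(n + k log_c n) calls)."""
    remaining = list(docs)
    ranked: List[str] = []
    while remaining and len(ranked) < k:
        # Tournament over the remaining pool: winners of each c-sized group
        # compete again until a single overall winner remains.
        pool = remaining
        while len(pool) > 1:
            winners = []
            for i in range(0, len(pool), c):
                group = pool[i:i + c]
                winners.append(group[llm_pick_best(query, group)]
                               if len(group) > 1 else group[0])
            pool = winners
        best = pool[0]
        ranked.append(best)
        remaining.remove(best)
    return ranked
```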

The following table summarizes the complexity and effectiveness trade-offs across core paradigms:

| Method Type | LLM Calls per Query | Typical nDCG@10 | Core Reference |
|---|---|---|---|
| Pointwise | $O(n)$ | 0.42–0.66 | Zhuang et al., 2023; Li et al., 13 Jun 2025 |
| Pairwise | $O(n^2)$ | 0.66–0.68 | Zhuang et al., 2023 |
| Setwise Heap | $O(n + k\log_c n)$ | 0.67 | Zhuang et al., 2023; Podolak et al., 9 Apr 2025 |
| Listwise LRL | $O(n)$ windows | 0.66–0.68 | Ma et al., 2023 |
| TourRank | $O(kn)$ (parallel) | 0.69–0.71 | Chen et al., 17 Jun 2024 |

2. Query-Likelihood Models and Prompt-based Scoring

Query-likelihood models (QLMs) adapted for LLMs rank documents by directly estimating the probability of a query given a document:

$$S_{\mathrm{QLM}}(q, d) = \frac{1}{|q|} \sum_{t=1}^{|q|} \log P(q_t \mid p, d, q_{<t})$$

where $p$ is a short template prompt, $d$ the document, and $q_{<t}$ the previously generated query tokens. Exact prompt engineering is crucial, with dataset- and model-specific templates empirically shown to affect ranking accuracy (Zhuang et al., 2023). QLM scores are often interpolated with zero-shot retriever outputs (e.g., BM25, HyDE) for the final ranking, increasing nDCG@10 by up to 2 points with negligible computational overhead.
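
As a concrete illustration, the following is a minimal sketch of LLM-based query-likelihood scoring with Hugging Face transformers; the model name, prompt template, and interpolation weight are illustrative choices, not the exact setup of the cited work.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # illustrative; any causal LM can stand in here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def qlm_score(query: str, document: str) -> float:
    """Average log P(q_t | prompt, d, q_<t) over the query tokens."""
    prompt = (f"Document: {document}\n"
              "Generate a question that is most relevant to this document.\n"
              "Question:")
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    query_ids = tokenizer(" " + query, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, query_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits                      # [1, L, vocab]
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)  # position i predicts token i+1
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1]
    return token_lp[:, n_prompt - 1:].mean().item()           # average over query tokens only

# Optional interpolation with a first-stage retriever score (weights are illustrative):
# final_score = 0.5 * bm25_score + 0.5 * qlm_score(query, document)
```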

Prompt optimization extends to discrete search over prompt tokens (Co-Prompt), combining a generator's priors with a discriminator's re-ranking metric via beam search, consistently outperforming manual or reinforcement-learning–based prompt selection (Cho et al., 2023).
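
A schematic sketch of such a discrete beam search follows; `generator_next_tokens` (token proposals from a generator's prior) and `rerank_quality` (a discriminator-side metric such as dev-set nDCG) are hypothetical callables, not the paper's actual interface.

```python
def co_prompt_beam_search(generator_next_tokens, rerank_quality,
                          beam_width: int = 4, max_len: int = 8) -> str:
    """Beam search over prompt tokens: expand each beam with generator proposals,
    keep the prompts whose downstream reranking quality is highest."""
    beams = [""]  # start from an empty prompt
    for _ in range(max_len):
        candidates = []
        for prefix in beams:
            for token in generator_next_tokens(prefix, top_n=beam_width):
                candidates.append(prefix + token)
        # Retain the prompts that yield the best reranking metric on a small dev set.
        beams = sorted(candidates, key=rerank_quality, reverse=True)[:beam_width]
    return max(beams, key=rerank_quality)
```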

3. Comparative and Anchored Strategies: Pairwise, Setwise, and Reference-based Approaches

Pairwise and setwise: Comparative prompting methods explicitly consider cross-document distinctions. Pairwise ranking is based on $P(d_i \succ d_j \mid q)$, but becomes computationally expensive. Setwise prompting, as formalized in the setwise comparator $\mathrm{SetComp}_c$, asks which document is best among a group of $c$ candidates. Setwise insertion further incorporates prior ranking information (e.g., BM25 ordering), biases the LLM via the prompt to favor the highest-priority item, and employs block-binary search for efficient top-$k$ extraction, reducing LLM calls by ~31% and latency by ~23% with a slight effectiveness gain (Podolak et al., 9 Apr 2025).

Reference-based/anchor methods: RefRank and its variants sidestep exhaustive pairwise comparisons by selecting a single anchor document (or a handful of them) to serve as a reference. Each candidate is compared against the anchor(s), and final ranks are aggregated from the comparative scores. Multi-anchor (ensemble) schemes (e.g., $m = 5$ anchors) nearly match or surpass full pairwise sort accuracy at a fraction of the computational cost (Li et al., 13 Jun 2025).
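
A hedged sketch of anchor-based reranking in this spirit: each candidate is compared only against a few anchors instead of all other candidates. `llm_prefers(query, doc_a, doc_b) -> float` in [0, 1] is a hypothetical pairwise-preference callable.

```python
from typing import Callable, List

def anchor_rerank(query: str, docs: List[str], anchors: List[str],
                  llm_prefers: Callable[[str, str, str], float]) -> List[str]:
    """Rank candidates by their average preference over a small set of anchor documents;
    this costs O(m * n) LLM calls for m anchors rather than O(n^2)."""
    scores = []
    for d in docs:
        s = sum(llm_prefers(query, d, a) for a in anchors) / len(anchors)
        scores.append(s)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order]
```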

Global-Consistent Comparative Pointwise (GCCP) approaches construct a query-focused summary over top candidates as the anchor, leveraging unsupervised spectral multi-document summarization. Pointwise contrastive scores $\Delta s(d_i, q \mid a) = s(d_i, q) - s(a, q)$ or direct comparative prompts yield global consistency while retaining linear computational scaling. Post-aggregation with global context (PAGC) linearly combines these contrastive scores with standard pointwise scores, achieving near pairwise-level effectiveness with minimal cost increase (Long et al., 12 Jun 2025).
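
A minimal sketch of the PAGC combination described above, assuming the pointwise scores and the anchor-summary score have already been computed; the mixing weight is illustrative.

```python
from typing import List

def pagc_scores(s_pointwise: List[float], anchor_score: float,
                weight: float = 0.5) -> List[float]:
    """Linearly combine contrastive scores (relative to the anchor summary)
    with standard pointwise scores."""
    contrastive = [s - anchor_score for s in s_pointwise]  # Δs(d_i, q | a)
    return [weight * c + (1 - weight) * s for c, s in zip(contrastive, s_pointwise)]
```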

4. Embedding-based and Explicit Representation Methods

PromptReps elicits both dense and sparse representations from general LLMs without further training, using tailored prompts for each to guide the LLM in selecting key representation tokens:

  • Dense representations: The last hidden state $h_{\mathrm{last}}$ from the document prompt is normalized to yield $e_{\text{dense}}$.
  • Sparse representations: Raw next-token logits are post-processed (activation, log-saturation, filtering) to yield a sparse bag-of-words vector $s$.

Hybrid indices (ANN for dense, an inverted index for sparse) are then searched. The min–max-normalized final score is a convex combination of dense and sparse similarities, typically with $\alpha = 0.5$. Hybrid PromptReps with large LLMs (Llama3-8B+) reaches nDCG@10 = 46.2 on BEIR (dense-only 16–22, sparse-only 32–35) and 50.1 when further combined with BM25, surpassing unsupervised state-of-the-art retrievers trained with large-scale paired data (Zhuang et al., 29 Apr 2024).
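
A minimal sketch of this score combination, assuming per-document dense and sparse similarity lists are already available; the weight $\alpha = 0.5$ follows the description above.

```python
from typing import List

def hybrid_scores(dense_sims: List[float], sparse_sims: List[float],
                  alpha: float = 0.5) -> List[float]:
    """Min-max normalize each similarity list, then take a convex combination."""
    def minmax(xs: List[float]) -> List[float]:
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    d, s = minmax(dense_sims), minmax(sparse_sims)
    return [alpha * di + (1 - alpha) * si for di, si in zip(d, s)]
```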

Recent work also establishes the theoretical foundation of learnable late-interaction models as universal approximators of continuous scoring functions, enhancing zero-shot transfer, reducing storage, and lowering inference cost versus ColBERT and cross-encoders (Ji et al., 25 Jun 2024).

5. Practical and Architectural Innovations

Tournament-inspired and robust strategies

TourRank employs a multi-stage, tournament-style grouping where group advances are determined by LLM-prompted selection. Points across multiple randomized tournament repetitions are aggregated, stabilizing rankings against input-order and context-limit bias. With $r = 10$ repetitions, TourRank-10 achieves nDCG@10 = 71.63 (DL19), exceeding supervised monoT5-3B and other zero-shot baselines with lower or fully parallelizable wall-clock latency (Chen et al., 17 Jun 2024).
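
A hedged sketch of a tournament of this kind: documents are grouped per stage, the LLM selects which group members advance, and advancement points accumulate over several randomized repetitions. `llm_select(query, group, n_advance) -> list of indices` and the group/stage sizes are illustrative, not the paper's exact configuration.

```python
import random
from collections import defaultdict
from typing import Callable, List

def tourrank(query: str, docs: List[str],
             llm_select: Callable[[str, List[str], int], List[int]],
             repetitions: int = 10, group_size: int = 10,
             advance: int = 5, stages: int = 3) -> List[int]:
    """Return document indices sorted by accumulated advancement points."""
    points = defaultdict(int)
    for _ in range(repetitions):
        survivors = list(range(len(docs)))
        random.shuffle(survivors)  # random groupings mitigate input-order bias
        for _ in range(stages):
            next_round = []
            for i in range(0, len(survivors), group_size):
                group = survivors[i:i + group_size]
                chosen = llm_select(query, [docs[j] for j in group],
                                    min(advance, len(group)))
                next_round.extend(group[c] for c in chosen)
            for j in next_round:
                points[j] += 1  # a point for every stage a document survives
            survivors = next_round
            if len(survivors) <= advance:
                break
    return sorted(range(len(docs)), key=lambda j: points[j], reverse=True)
```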

Temporal and event-prediction scenarios

AutoCast++ applies zero-shot document ranking to event forecasting by combining LLM-based graded-relevance scoring with a temporal reweighting function grounded in human forecaster behavior. The normalized relevance score is multiplied by a recency gain function precomputed from crowd-forecast logs, promoting temporally salient context. This yields substantial end-to-end accuracy improvements on event forecasting benchmarks (e.g., +48% MCQ accuracy vs. static pipelines) (Yan et al., 2023).

Generative anchor and answer-scent cues

ASRank introduces an "answer scent," a query-conditioned natural-language semantic target synthesized by a large LLM and used to guide answer generation from each candidate. Reranking is based on the log-likelihood of each document’s ability to generate the answer scent, with final scores integrating the document’s retrieval prior. This approach achieves major gains in Top-1 open-domain QA retrieval (NQ: 22.1% $\to$ 47.3% for BM25, 19.2% $\to$ 46.5% for MSS; BEIR nDCG@10: 48.39 vs. 45.78 for monoT5) while maintaining efficient query-time computation (Abdallah et al., 25 Jan 2025).
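
A hedged sketch of answer-scent-style scoring, with `generate_scent`, `loglik`, and the mixing weight `beta` as hypothetical stand-ins for the LLM calls and prior integration described above.

```python
from typing import Callable, List

def asrank_scores(query: str, docs: List[str], retrieval_scores: List[float],
                  generate_scent: Callable[[str], str],
                  loglik: Callable[[str, str, str], float],
                  beta: float = 0.5) -> List[float]:
    """Score each document by how likely it is to generate the query's answer scent,
    mixed with the document's first-stage retrieval prior."""
    scent = generate_scent(query)  # short answer-like semantic target for the query
    return [beta * loglik(scent, d, query) + (1 - beta) * r
            for d, r in zip(docs, retrieval_scores)]
```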

6. Task Adaptation, Domain Transfer, and Zero-Shot Limits

Task Arithmetic leverages model-weight arithmetic to adapt cross-encoder or LLM-based rankers for zero-shot use in new domains. Given a base IR model $\Theta_T$ and a domain-adapted LM $\Theta_D$ (and their shared pre-trained weights $\Theta_0$), one computes the task vector $\tau_D = \Theta_D - \Theta_0$ and synthesizes new parameters $\Theta' = \Theta_T + \alpha \tau_D$. This process enables training-free, modular adaptation across scientific, biomedical, multilingual, and legal domains, yielding up to 18% relative gain in nDCG@10 versus task-agnostic baselines (Braga et al., 1 May 2025).
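
A minimal sketch of this weight arithmetic over PyTorch state dicts, assuming all three checkpoints share an architecture and parameter names; the scaling factor $\alpha$ is illustrative.

```python
import torch
from typing import Dict

def task_arithmetic(theta_ranker: Dict[str, torch.Tensor],
                    theta_domain: Dict[str, torch.Tensor],
                    theta_pretrained: Dict[str, torch.Tensor],
                    alpha: float = 0.3) -> Dict[str, torch.Tensor]:
    """Θ' = Θ_T + α (Θ_D - Θ_0), computed parameter by parameter."""
    adapted = {}
    for name, w in theta_ranker.items():
        tau = theta_domain[name] - theta_pretrained[name]  # domain task vector τ_D
        adapted[name] = w + alpha * tau
    return adapted

# Usage sketch:
# ranker.load_state_dict(task_arithmetic(ranker.state_dict(),
#                                        domain_lm.state_dict(),
#                                        base_lm.state_dict()))
```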

Comprehensive evaluation of long-document ranking models under zero-shot transfer (MS MARCO $\to$ TREC DL, Robust04, FarRelevant) reveals that chunk-aggregation approaches (MaxP, PARADE) outperform pure first-chunk scoring on collections lacking positional bias. However, only modest gains ($\leq 5\%$) are attributable to true long-context modeling unless test distributions are explicitly adversarial to positional priors, underscoring the importance of robust aggregation strategies in zero-shot settings (Boytsov et al., 2022).
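
A minimal sketch of MaxP-style chunk aggregation: the long document is split into (possibly overlapping) chunks, each chunk is scored against the query by any of the rankers above, and the maximum chunk score is taken as the document score. Chunking by whitespace tokens and the chunk/stride sizes are illustrative simplifications.

```python
from typing import Callable

def maxp_score(query: str, document: str,
               chunk_scorer: Callable[[str, str], float],
               chunk_size: int = 512, stride: int = 256) -> float:
    """Document score = max over per-chunk query-relevance scores."""
    tokens = document.split()
    chunks = [" ".join(tokens[i:i + chunk_size])
              for i in range(0, max(len(tokens), 1), stride)]
    return max(chunk_scorer(query, c) for c in chunks)
```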

7. Frontiers: Visual and Multi-Modal Zero-Shot Retrieval

In the document-image domain, SERVAL establishes a generate-and-encode zero-shot baseline: a vision-LLM generates a rich description for each image, which is then embedded via standard (multilingual) text encoders for retrieval. Without any contrastive text–image training, SERVAL achieves nDCG@5 = 63.4% on ViDoRe-v2 (surpassing ColNomic-7B) and nDCG@10 = 72.1% on MIRACL-VISION, demonstrating the generality of zero-shot pipelines in high-dimensional multi-modal settings (Nguyen et al., 18 Sep 2025).
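
A hedged sketch of a generate-and-encode pipeline of this kind, using sentence-transformers for the text-encoding step; `describe_image` (the VLM captioning call) and the encoder checkpoint are illustrative assumptions, not SERVAL's actual components.

```python
from typing import Callable, List
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")  # illustrative choice

def rank_images(query: str, image_paths: List[str],
                describe_image: Callable[[str], str]) -> List[str]:
    """Generate a textual description per image, embed descriptions and query,
    and rank images by cosine similarity."""
    descriptions = [describe_image(p) for p in image_paths]   # VLM captioning step
    doc_emb = encoder.encode(descriptions, convert_to_tensor=True)
    q_emb = encoder.encode(query, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, doc_emb)[0]
    order = sims.argsort(descending=True).tolist()
    return [image_paths[i] for i in order]
```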
