Zero-Shot Listwise Doc Reranking
- The paper shows that zero-shot listwise reranking optimizes global metrics like nDCG@10 by modeling inter-document dependencies.
- It employs various architectures such as decoder-only LLMs and encoder-decoder models to encode query-document and document–document interactions.
- Efficiency gains are achieved through innovations like token compression and tournament sort, reducing latency and improving scaling.
Zero-shot listwise document reranking is the paradigm in which an LLM or related machine learning system is given a user query and a set of candidate documents (returned by a first-stage retriever such as BM25 or a dense retriever), and outputs a single ranked permutation of the candidates. The "zero-shot" designation indicates that the reranker is not fine-tuned on any task-specific human relevance annotations for the evaluation queries but instead leverages pretrained or distilled knowledge and either prompt-based or synthetic listwise supervision. This approach dominates contemporary research in information retrieval due to its capacity to directly model mutual dependencies among candidates, globally optimize ranking metrics such as nDCG@10, and flexibly encode query-document and document–document interactions via modern LLMs (Zhang et al., 2023, Pradeep et al., 2023, Pradeep et al., 2023, Ma et al., 2023, Ke et al., 24 Apr 2026). Recent advances in prompt design, data distillation, inference algorithms, and architectural efficiency have enabled open-source LLMs and compact encoder–decoder models to rival and occasionally outperform proprietary engines even in fully zero-shot scenarios.
1. Listwise Reranking: Formalization and Zero-Shot Setting
A listwise reranker is a model that, given a query $q$ and a candidate set $\{d_1, \dots, d_n\}$, generates a permutation of the candidate indices that optimizes a ranking metric. Unlike pointwise rerankers, which score each $d_i$ in isolation, listwise rerankers attend to all candidates jointly and output a sequence of identifiers (e.g., [2] > [1] > [3]). In zero-shot listwise reranking, the reranker is not adapted on human-annotated relevance labels for the evaluation domain. Models are either prompted to completion, distilled from outputs of stronger LLMs, or supervised on synthetic listwise data but without in-domain relevance supervision (Zhang et al., 2023, Pradeep et al., 2023).
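The listwise loss for sequence generation is the standard language-modeling objective:

$$\mathcal{L} = -\sum_{t=1}^{|y|} \log P_\theta(y_t \mid x, y_{<t}),$$

where $x$ is the prompt containing the query and candidates, and $y$ encodes the "gold" permutation. At inference, deterministic or greedy decoding produces the final order.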
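As a numerical illustration, the following minimal sketch computes a sequence-generation loss under the assumption that it is the usual token-level negative log-likelihood with teacher forcing. The function name and the probabilities are hypothetical and do not come from any cited system.

```python
import math

def sequence_nll(step_probs):
    """Negative log-likelihood of a target sequence.

    step_probs[t] is the probability the model assigns to the t-th gold
    token, given the prompt and all previous gold tokens (teacher forcing).
    """
    return -sum(math.log(p) for p in step_probs)

# Made-up probabilities for the gold tokens of a three-document ordering.
gold_token_probs = [0.9, 0.6, 0.8]
loss = sequence_nll(gold_token_probs)
print(f"loss = {loss:.4f}")
```

The loss is zero only when every gold token receives probability 1, and it grows as the model spreads probability mass over other orderings.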
2. Model Architectures, Prompting, and Inference Algorithms
Architectures
Leading approaches utilize:
- Decoder-only LLMs, e.g., Code-LLaMA-Instruct, Vicuna-v1.5, and the Mistral-derived Zephyr-β. These generate ranking permutations via instruction prompts (Zhang et al., 2023, Pradeep et al., 2023, Pradeep et al., 2023).
- Encoder–decoder models under Fusion-in-Decoder (FiD) (Yoon et al., 2024, Tamber et al., 2023): pass each candidate through a shared encoder; the decoder generates the permutation or scores.
- Embedding-based models, e.g., E²Rank, which re-encode the query plus candidates in a pseudo-relevance feedback–style prompt and rerank via dot-product similarities (Liu et al., 26 Oct 2025).
- Token-compressed approaches, e.g., ResRank, project each passage to a single embedding, combine with residual LLM states, and score via cosine similarity (Ke et al., 24 Apr 2026).
- Diffusion LLMs (dLLMs), e.g., DiffuRank, use masked denoising and either parallel slotwise scoring or full permutation generation with assignment algorithms (Liu et al., 13 Feb 2026).
Prompting and Output
Common prompt patterns enumerate the candidates with indices and explicitly instruct the model to return a ranking. Strict formatting is critical for reliable zero-shot inference, especially in sliding-window inference (Zhang et al., 2023, Pradeep et al., 2023).
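A minimal sketch of this pattern is given below. The prompt wording, the function names, and the fallback policy for malformed output are illustrative assumptions, not the exact templates used by the cited systems.

```python
import re

def build_listwise_prompt(query, passages):
    """Enumerate passages as [1], [2], ... and ask for an ordering."""
    lines = [f"Rank the {len(passages)} passages below by relevance to the query.",
             f"Query: {query}"]
    for i, text in enumerate(passages, start=1):
        lines.append(f"[{i}] {text}")
    lines.append("Answer only with identifiers in descending relevance, "
                 "for example: [2] > [1] > [3]")
    return "\n".join(lines)

def parse_ranking(response, n):
    """Turn model text into a full permutation of 0-based indices.

    Duplicates and out-of-range identifiers are dropped; identifiers the
    model omitted are appended in their original input order.
    """
    seen = []
    for token in re.findall(r"\d+", response):
        idx = int(token) - 1
        if 0 <= idx < n and idx not in seen:
            seen.append(idx)
    missing = [i for i in range(n) if i not in seen]
    return seen + missing

prompt = build_listwise_prompt("what is bm25", ["passage a", "passage b", "passage c"])
order = parse_ranking("[2] > [1] > [3]", 3)
```

The defensive parsing step is one way to keep inference reliable when a model repeats or skips identifiers, so that every call still yields a valid permutation.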
For candidate lists that exceed the context window, a sliding window over fixed-size blocks with overlap is standard (e.g., window size 20 and stride 10 for top-100 lists).
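The sliding-window procedure can be sketched as follows. This is a simplified illustration: `rerank_window` is a hypothetical callable standing in for one LLM call, the defaults of 20 and 10 are commonly used values assumed here, and the back-to-front direction is the usual choice so that strong candidates can move toward the top.

```python
def sliding_window_rerank(docs, rerank_window, window_size=20, stride=10):
    """Rerank a long list with overlapping windows, moving back to front.

    rerank_window takes a list of documents and returns the same documents
    in the model's preferred order (one model call per window).
    """
    ranked = list(docs)
    end = len(ranked)
    while True:
        start = max(0, end - window_size)
        ranked[start:end] = rerank_window(ranked[start:end])
        if start == 0:
            break
        end -= stride  # the overlap carries each window's best items forward
    return ranked

# Toy stand-in for the model: higher numbers are "more relevant".
def toy_reranker(window):
    return sorted(window, reverse=True)

result = sliding_window_rerank(list(range(100)), toy_reranker)
print(result[:10])
```

With these settings a single pass over 100 candidates issues nine window calls. Only the leading positions are reliably ordered afterwards, because a document can advance at most one window per step.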
3. Training Data, Listwise Supervision, and Synthetic Labels
Zero-shot listwise reranking fundamentally depends on high-quality listwise supervision—even if only available in pseudo form. Pointwise IR datasets such as MS MARCO, which only mark a few positives per query with binary labels (e.g., 1 positive per query), are insufficient and cause extreme false negatives and diminished ranking fidelity in listwise setups. Effective listwise rerankers must be trained or distilled using candidate lists annotated with exhaustive or high-quality permutations, preferably with graded (not binary) labels (Zhang et al., 2023).
Practical listwise supervision sources:
- Pseudo-labels/distillation: Top-K lists are reordered by a strong teacher (e.g., GPT-3.5 or GPT-4) or advanced non-LLM rerankers (e.g., co.rerank, Contriever+ft) to serve as "silver" ground-truth (Zhang et al., 2023, Pradeep et al., 2023, Pradeep et al., 2023).
- Data augmentation: Randomize orderings and list lengths at train time to improve robustness and prevent order bias (Pradeep et al., 2023).
- Synthetic listwise examples: For encoder–decoder frameworks or embedding-based models, negatives are randomly sampled per query and combined with known positives to train under a listwise or pairwise loss (e.g., ListT5; Yoon et al., 2024).
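The distillation and augmentation steps above can be combined into a single example builder. The sketch below is illustrative only: the function name, the dictionary layout, and the `[i] > [j]` target format are assumptions rather than the pipeline of any cited paper.

```python
import random

def make_training_example(query, teacher_ranked, rng, min_len=2):
    """Build one shuffled listwise example from a teacher ranking.

    teacher_ranked is the teacher's ordering, best first. A random subset
    is kept, presented in shuffled order, and the target is the sequence of
    input identifiers that restores the teacher order.
    """
    k = rng.randint(min_len, len(teacher_ranked))
    kept = sorted(rng.sample(range(len(teacher_ranked)), k))  # teacher ranks kept
    order = kept[:]
    rng.shuffle(order)  # presentation order shown to the student
    passages = [teacher_ranked[r] for r in order]
    # 1-based input position of each kept document, listed best first.
    target_ids = [order.index(r) + 1 for r in kept]
    return {"query": query,
            "passages": passages,
            "target": " > ".join(f"[{i}]" for i in target_ids)}

rng = random.Random(0)
example = make_training_example("q", ["a", "b", "c", "d", "e"], rng)
```

Because the input order is resampled for every example, the student cannot rely on the position of a passage in the prompt as a relevance cue.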
High-quality, fully graded, human-annotated listwise datasets remain an open need for ground-truth evaluation and supervised training (Zhang et al., 2023).
4. Evaluation, Benchmarks, and Quantitative Comparisons
Zero-shot listwise rerankers are evaluated on public benchmarks such as TREC Deep Learning Tracks (DL19, DL20), BEIR, and reasoning datasets (e.g., BRIGHT, R2MED). Metrics include nDCG@10 and MAP@100.
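For reference, nDCG@k can be computed as in the sketch below. It assumes the exponential-gain formulation (gain $2^g - 1$ with a $\log_2$ position discount); some evaluation tools use linear gain instead, so absolute values may differ slightly between toolkits. The function names are illustrative.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded labels."""
    return sum((2 ** g - 1) / math.log2(rank + 2)
               for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, k=10, judged_gains=None):
    """nDCG@k for one query.

    ranked_gains: graded labels of the returned documents, in ranked order.
    judged_gains: labels of all judged documents for the query; defaults to
    the returned list when the full judgment pool is not available.
    """
    pool = ranked_gains if judged_gains is None else judged_gains
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

score = ndcg_at_k([0, 3, 2, 0, 1])
print(f"nDCG@10 = {score:.4f}")
```

A perfectly ordered list scores 1.0, and placing a highly relevant document lower in the list is penalized by the logarithmic discount.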
Key empirical findings:
| Model | DL19 nDCG@10 | DL20 nDCG@10 | BEIR nDCG@10 (avg) |
|---|---|---|---|
| BM25 (first-stage) | 0.5058 | 0.4796 | 0.43–0.50 |
| monoBERT/monoT5 (pointwise) | 0.59–0.60 | 0.60–0.61 | 0.49–0.53 |
| RankGPT-3.5 (listwise) | 0.658 | 0.629 | 0.49–0.53 |
| RankGPT-4 (listwise) | 0.757 | 0.710 | 0.54 |
| Code-LLaMA-Instruct 34B | 0.743 | 0.687 | n/a |
| RankVicuna (7B, GPT-3.5 teacher) | 0.668 | 0.655 | 0.49–0.53 |
| RankZephyr (7B, GPT-4 teacher) | 0.782 | 0.816 | 0.533–0.826 |
| ListT5-3B (FiD) | – | – | 0.53 |
| E²Rank-8B | – | – | 0.54 |
| ResRank (4B, token comp.) | – | – | 0.544 |
| DiffuRank (Perm-Assign 8B) | – | – | ~0.49 |
Listwise LLM reranking consistently improves over pointwise LLM or cross-encoder baselines. State-of-the-art open-source rerankers (e.g., RankZephyr, ResRank, E²Rank, ListT5) approach or exceed GPT-4 performance at massively reduced model size and latency (Pradeep et al., 2023, Ke et al., 24 Apr 2026, Liu et al., 26 Oct 2025, Yoon et al., 2024).
5. Efficiency, Architectural Innovations, and Scaling Properties
Zero-shot listwise reranking with naïve decoder-only LLMs is limited by:
- Context window: Only a limited number of full passages fit in a single prompt. Sliding-window schemes cover top-100 lists but risk "trapping" relevant passages within windows (Zhang et al., 2023, Pradeep et al., 2023).
- Autoregressive decoding: Token-by-token sequence generation induces superlinear latency and high resource consumption.
Recent techniques to address efficiency/scale:
- Token compression: ResRank compresses each passage to a single embedding vector, allowing full listwise context and order-sensitive re-ranking in a single pass (O(n) tokens, zero generation) (Ke et al., 24 Apr 2026).
- Encoder–decoder with tournament sort: ListT5 fuses candidate encodings via FiD and applies tournament sort (O(n + k log n)) for scalable top-k extraction without sliding windows (Yoon et al., 2024).
- Embedding-based scoring: E²Rank reranks via a single dot product between a "PRF-style" query embedding (combining the query and the candidate set) and cached document embeddings (Liu et al., 26 Oct 2025).
- Diffusion LLMs: DiffuRank provides fully parallel, permutation-compatible decoding, eliminating left-to-right constraints and offering efficient optimization (Liu et al., 13 Feb 2026).
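The idea behind tournament-style top-k extraction can be sketched as follows. This is a deliberately simplified illustration, not ListT5's implementation: `pick_best` is a hypothetical stand-in for one listwise model call over a small group, and the sketch reruns the whole tournament for each output instead of caching intermediate winners, so it does not achieve the O(n + k log n) cost quoted above.

```python
def tournament_top_k(items, pick_best, k, group_size=5):
    """Select the top-k items using repeated m-ary tournaments.

    pick_best takes a small list and returns its single best element.
    Items are assumed to be distinct.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    remaining = list(items)
    winners = []
    while remaining and len(winners) < k:
        round_items = remaining
        # Each round keeps one winner per group until a single item is left.
        while len(round_items) > 1:
            round_items = [pick_best(round_items[i:i + group_size])
                           for i in range(0, len(round_items), group_size)]
        best = round_items[0]
        winners.append(best)
        remaining.remove(best)
    return winners

# Toy comparator: larger numbers win.
top3 = tournament_top_k([5, 1, 9, 3, 7, 2, 8], max, k=3, group_size=3)
print(top3)
```

Each model call sees only `group_size` candidates, which is why this family of methods avoids both long prompts and sliding windows.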
Empirical latency reductions reach up to 5× over sliding-window LLM approaches (E²Rank-8B: 3.40 s/query vs. RankQwen3-8B: 16.93 s/query for top-100 rerank (Liu et al., 26 Oct 2025)).
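The latency advantage of embedding-based scoring comes from reducing reranking to one dot product per candidate against cached document vectors. A minimal sketch with made-up vectors follows; the function name is hypothetical, and the step that produces the list-aware query embedding is out of scope here.

```python
def rerank_by_dot_product(query_vec, doc_vecs):
    """Order documents by dot-product similarity to the query embedding.

    query_vec stands for an embedding that already encodes the query
    together with the candidate list; doc_vecs are precomputed and cached.
    """
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy two-dimensional embeddings.
order, scores = rerank_by_dot_product([1.0, 0.0],
                                      [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]])
print(order)
```

No tokens are generated at ranking time, so the cost is dominated by the single forward pass that builds the query embedding.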
6. Prompt Optimization, Robustness, and Ensemble Techniques
Prompt structure significantly impacts reranking effectiveness, especially in the zero-shot setting. Methods for improving prompt robustness and mitigating positional or formatting bias include:
- Instruction tuning/data augmentation: Training over shuffled candidate orderings and variable list sizes yields outputs robust to permutation and windowing changes (Pradeep et al., 2023).
- Constrained prompt optimization (Co-Prompt): Algorithmically searches the prompt space using a beam search with a listwise metric as a discriminator, directly optimizing for ranking effectiveness without model updates (Cho et al., 2023).
- Positional and score injection: InsertRank injects BM25 scores directly as numerical features in the prompt, enabling the LLM to leverage lexical scoring for complex queries (Seetharaman et al., 17 Jun 2025).
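Score injection can be illustrated with a small formatting helper that attaches first-stage scores to each enumerated candidate. The layout below is an assumption for illustration only and is not InsertRank's actual prompt template.

```python
def format_candidates_with_scores(passages, bm25_scores):
    """Render candidates as '[i] (BM25 score: s) text', one per line."""
    if len(passages) != len(bm25_scores):
        raise ValueError("one score is required per passage")
    return "\n".join(
        f"[{i}] (BM25 score: {score:.2f}) {text}"
        for i, (text, score) in enumerate(zip(passages, bm25_scores), start=1)
    )

block = format_candidates_with_scores(["alpha", "beta"], [12.3, 7.1])
print(block)
```

The resulting block would replace the plain enumerated list in a listwise prompt, which exposes the lexical signal to the model as an additional feature.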
Listwise rerankers generally outperform pointwise counterparts in promoting harder, mid-rank documents and in maintaining resilience to input order perturbation (Zhang et al., 2023, Pradeep et al., 2023, Seetharaman et al., 17 Jun 2025).
7. Open Challenges and Future Directions
Open research problems and recommended strategic directions include:
- High-quality, graded listwise annotation: Absence of human-annotated listwise datasets is a critical bottleneck; synthetic listwise supervision from strong teachers is currently paramount (Zhang et al., 2023).
- Efficiency at deployment: Non-autoregressive reranking (as in ResRank, E²Rank, ListT5 tournament sort) is essential for industrial utility; "lost in the middle" effects from prompt length must be alleviated (Ke et al., 24 Apr 2026, Liu et al., 26 Oct 2025, Yoon et al., 2024).
- Domain adaptation and robustness: Out-of-domain adaptation for listwise rerankers (e.g., BEIR, BRIGHT) is still challenging; domain-adaptive pseudo-labeling and few-shot prompting are potential paths (Zhang et al., 2023, Liu et al., 26 Oct 2025).
- Integrated architectures: End-to-end retrieval–reranker joint training (e.g., ResRank dual-stage objectives) and integration of chain-of-thought/reasoning signals hold promise for more advanced relevance modeling (Ke et al., 24 Apr 2026, Seetharaman et al., 17 Jun 2025).
- Interpretable and human-aligned reranking: Learnable prompt optimization (e.g., Co-Prompt) and interpretable scoring, as well as explicit chain-of-thought explanations, are crucial for transparency (Cho et al., 2023).
High-fidelity, zero-shot listwise reranking has become the dominant paradigm in neural document ranking, due to its empirical effectiveness, architectural flexibility, and ability to generalize with minimal or no in-domain labels. State-of-the-art open models now match or exceed proprietary LLM performance, provided that careful attention is paid to listwise supervision sources, efficient architectural design, and robust prompt engineering.
Key references: (Zhang et al., 2023, Pradeep et al., 2023, Pradeep et al., 2023, Liu et al., 26 Oct 2025, Ke et al., 24 Apr 2026, Yoon et al., 2024, Ma et al., 2023, Cho et al., 2023, Liu et al., 13 Feb 2026, Seetharaman et al., 17 Jun 2025, Tamber et al., 2023).