Setwise Prompting for Efficient Document Ranking

Updated 1 December 2025

Setwise Prompting is a zero-shot document ranking method that iteratively compares small, overlapping candidate sets using LLMs to efficiently extract top-k relevance.
It adapts sorting algorithms such as heapsort and insertion to setwise comparisons, reducing the number of LLM calls and prompt tokens.
Empirical results demonstrate that setwise methods lower latency and token consumption while preserving or slightly improving ranking accuracy compared to pointwise and pairwise approaches.

Setwise prompting is a methodology for zero-shot document ranking using LLMs that sits between pointwise, pairwise, and listwise prompting paradigms in both operational mechanics and computational-accuracy trade-offs. Unlike pointwise (which evaluates each document in isolation), pairwise (which exhaustively or approximately compares all pairs), or listwise (which attempts to jointly rank subsets), the Setwise approach iteratively examines small, overlapping sets of candidates, aiming to efficiently extract the top-k most relevant documents relative to a query, with minimal model inference overhead while preserving or exceeding ranking effectiveness (Zhuang et al., 2023, Podolak et al., 9 Apr 2025).

1. Theoretical Motivation and Limitations of Existing Paradigms

Pointwise prompting processes each $(\mathrm{query},\,\mathrm{document})$ pair separately, requiring $O(n)$ LLM calls for $n$ documents. While batching is possible, this approach fails to capture cross-document relevance judgements: the model must “score” each document independently, relying on post-hoc normalization which can result in poor relative calibration.

Pairwise prompting compares document pairs directly, either via all-pairs comparison ($O(n^2)$) or by employing a sorting algorithm (e.g., heapsort with $O(k\,\log n)$ calls to identify the top-$k$). This increases effectiveness, because the LLM directly discriminates between documents in each comparison, but it is expensive in both sequential calls and token usage and therefore incurs high latency.

Listwise prompting presents the LLM with windows of $s$ candidates and asks for a total ordering of each window. This tends to generate brittle completions (format deviations, inconsistency), is sensitive to context window length, and requires $O(r \cdot n/s)$ calls, where $r$ is the number of sliding-window passes over the candidate list.

Setwise prompting addresses the efficiency-effectiveness tension by selecting the best document out of a small set $S$ of size $c > 2$ in each prompt. This multi-way comparison reduces the required sorting depth in heapsort from $O(\log_2 n)$ (pairwise) to $O(\log_c n)$, and the number of comparisons per bubble-sort pass from $O(n)$ to $O(n/(c-1))$, decreasing the total number of LLM calls and overall prompt token consumption (Zhuang et al., 2023, Podolak et al., 9 Apr 2025).

2. Mathematical Formulation and Setwise Sorting Algorithms

For a query $Q$ and a candidate set $S = \{d_1, \ldots, d_c\}$ , the LLM’s soft selection probability for each candidate is: $P(i\,|\, Q, S) = \frac{\exp \left(g_\theta(Q, d_i)\right)} {\sum_{j=1}^c \exp \left(g_\theta(Q, d_j)\right)}$ where $g_\theta(Q, d)$ denotes the compatibility (e.g., logit or likelihood) assigned by the model.
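The selection probability above is a softmax over the per-document scores. A minimal sketch, assuming the scores $g_\theta(Q, d_i)$ are already available as plain floats (the values below are made up for illustration):

```python
import math
from typing import List, Sequence


def selection_probabilities(scores: Sequence[float]) -> List[float]:
    """Softmax over the per-document scores g(Q, d_1), ..., g(Q, d_c)."""
    peak = max(scores)  # subtract the maximum so exp() cannot overflow
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical scores for a set of c = 3 documents.
probs = selection_probabilities([2.0, 0.5, -1.0])
winner = max(range(len(probs)), key=probs.__getitem__)  # setwise argmax
print(probs, winner)
```

The setwise sorting procedures only need `winner`; the full distribution matters when logits are accessible and more than one candidate should be ordered per call.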

The ranking objective is to induce a permutation $\pi$ that places the most relevant documents in the top-$k$ positions by simulating a sorting procedure based on setwise argmax comparisons. For setwise heapsort (using $c$-ary heaps), the number of LLM inferences is $O(n + k \log_c n)$; for setwise bubble sort it is $O(k \cdot n/(c-1))$, since each of the $k$ passes costs $O(n/(c-1))$ calls.
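To make the bubble-sort variant concrete, here is a small sketch under stated assumptions: `pick_best` is a hypothetical stand-in for one LLM call that returns the position of the most relevant document in the window, and the toy documents are plain relevance scores. Each pass slides a window of up to $c$ documents from the bottom of the list to the top, carrying the winner along, so one pass costs roughly $(n-1)/(c-1)$ calls and $k$ passes yield the top-$k$.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def setwise_bubble_top_k(
    docs: Sequence[T],
    k: int,
    c: int,
    pick_best: Callable[[List[T]], int],
) -> Tuple[List[T], int]:
    """Return (top-k documents, number of setwise calls)."""
    if c < 2:
        raise ValueError("a setwise comparison needs at least two documents")
    ranked = list(docs)
    calls = 0
    for top in range(min(k, len(ranked))):
        end = len(ranked)  # the current window is ranked[start:end]
        while end - top > 1:
            start = max(top, end - c)
            best = pick_best(ranked[start:end])
            calls += 1
            # Move the winner to the front of the window so it joins the next one.
            ranked[start], ranked[start + best] = ranked[start + best], ranked[start]
            end = start + 1
    return ranked[:k], calls


def argmax(window: List[float]) -> int:
    """Toy comparator: documents are plain relevance scores."""
    return max(range(len(window)), key=window.__getitem__)


top, calls = setwise_bubble_top_k([3, 1, 4, 1, 5, 9, 2, 6], k=3, c=3, pick_best=argmax)
print(top, calls)
```

With a real model, `pick_best` would build the setwise prompt and parse the returned index; the sorting logic stays the same.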

Empirical Example

With Flan-T5-large, reranking 100 documents ( $n = 100$ , $k=10$ , $c=3$ ):

- Setwise Heapsort: $\sim$125 calls, $\sim$40k prompt tokens, latency $\approx$ 8s
- Pairwise Heapsort: $\sim$230 calls, $\sim$105k prompt tokens, latency $\approx$ 16s
- Setwise Bubblesort: $\sim$460 calls, $\sim$148k prompt tokens, latency $\approx$ 29s

The canonical setwise prompt is: “You are given the query: {Query} And the following {c} documents: (1) {Doc₁} ... (c) {Doc_c} Please select the document (1–c) that is most relevant to the query. Respond only with the index.”
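A sketch of how this prompt might be assembled and its reply checked. The function names `build_setwise_prompt` and `parse_choice` are illustrative, not taken from any library, and the line layout of the prompt is an assumption; only the wording follows the template above.

```python
import re
from typing import Optional, Sequence


def build_setwise_prompt(query: str, docs: Sequence[str]) -> str:
    """Fill the setwise template for c = len(docs) documents."""
    c = len(docs)
    lines = [f"You are given the query: {query}", f"And the following {c} documents:"]
    lines += [f"({i}) {doc}" for i, doc in enumerate(docs, start=1)]
    lines.append(
        f"Please select the document (1–{c}) that is most relevant to the query. "
        "Respond only with the index."
    )
    return "\n".join(lines)


def parse_choice(reply: str, c: int) -> Optional[int]:
    """Return the chosen 0-based position, or None if the reply is unusable."""
    match = re.search(r"\d+", reply)
    if match and 1 <= int(match.group()) <= c:
        return int(match.group()) - 1
    return None


prompt = build_setwise_prompt("what is setwise ranking", ["doc a", "doc b", "doc c"])
print(prompt)
```

Returning `None` for out-of-range or non-numeric replies lets the caller decide how to handle a malformed answer instead of silently picking a document.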

Setwise Heapsort Pseudocode

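def sift_down(D, i, c, pick_best):
    while c * i + 1 < len(D):  # node i still has at least one child
        group = [i] + list(range(c * i + 1, min(c * i + c + 1, len(D))))
        best = group[pick_best([D[j] for j in group])]  # one setwise LLM call; returns a position in the list
        if best == i:
            return  # the parent already beats all of its children
        D[i], D[best] = D[best], D[i]
        i = best

def build_heap(D, c, pick_best):
    for i in range((len(D) - 2) // c, -1, -1):  # last parent down to the root (0-indexed)
        sift_down(D, i, c, pick_best)

def extract_top_k_heap(D, k, c, pick_best):
    top = []
    for _ in range(min(k, len(D))):
        D[0], D[-1] = D[-1], D[0]  # move the current best (the root) to the end
        top.append(D.pop())
        sift_down(D, 0, c, pick_best)
    return top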

(Zhuang et al., 2023, Podolak et al., 9 Apr 2025)

3. Setwise Insertion: Prior-Aware Efficiency Extension

Setwise Insertion further improves computational efficiency by incorporating prior ranking knowledge (e.g., BM25 or initial Setwise Heapsort output), restricting LLM comparisons mostly to candidates likeliest to perturb the top- $k$ .

Algorithm Steps:

1. Initialize buffer $S$ with the current top-$k$ (sorted, e.g., by BM25 or Setwise Heapsort).
2. For each new candidate batch $C = \{c_i, \ldots, c_{i+c-1}\}$, compare $\{s_\text{min}\} \cup C$ in a setwise call, where $s_\text{min}$ is the minimum in $S$.
   - If the winner is from $C$, insert into $S$ at the correct position using $O(\log_c k)$ setwise calls (and remove $s_\text{min}$).
   - Otherwise, discard the batch.
3. Repeat through all candidates.
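The steps above can be sketched as follows. Several parts are assumptions rather than the published procedure: `pick_best` stands in for one setwise LLM call, `insert` is a caller-supplied routine that places a document at its position in the sorted buffer (in practice it would itself spend setwise calls), and re-checking the rest of a batch against the new minimum after an insertion is this sketch's own choice, since the steps do not say what happens to the remaining batch members.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def setwise_insertion(
    buffer: Sequence[T],
    candidates: Sequence[T],
    c: int,
    pick_best: Callable[[List[T]], int],
    insert: Callable[[List[T], T], None],
) -> Tuple[List[T], int]:
    """Return (updated top-k buffer, number of scan calls)."""
    top_k = list(buffer)  # sorted best-first, so top_k[-1] is s_min
    if not top_k:
        raise ValueError("the buffer must hold at least one document")
    calls = 0
    for start in range(0, len(candidates), c):
        batch = list(candidates[start:start + c])
        while batch:
            winner = pick_best([top_k[-1]] + batch)  # s_min sits at position 0
            calls += 1
            if winner == 0:  # s_min won: discard what is left of the batch
                break
            top_k.pop()  # drop s_min
            insert(top_k, batch.pop(winner - 1))
    return top_k, calls


def argmax(window: List[float]) -> int:
    """Toy comparator: documents are plain relevance scores."""
    return max(range(len(window)), key=window.__getitem__)


def insert_desc(top_k: List[float], doc: float) -> None:
    """Toy insertion that keeps the buffer sorted best-first."""
    pos = 0
    while pos < len(top_k) and top_k[pos] >= doc:
        pos += 1
    top_k.insert(pos, doc)


result, calls = setwise_insertion([9, 7, 5], [1, 8, 6, 2, 10, 3, 4], 3, argmax, insert_desc)
print(result, calls)
```

When the prior is good, most batches end after a single call because $s_\text{min}$ wins immediately, which is where the savings come from.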

Complexity:

For $n$ total documents, $k$ results, set size $c$ , and $a$ actual insertions,

$T = O(k \log_c k) + O\left( \frac{n}{c} \right) + O\left( a\, \frac{k}{c} \right)$

In practical settings, when the prior ranking is strong ( $a \ll n$ ), the insertion cost is dominated by the initial sort and candidate scanning (Podolak et al., 9 Apr 2025).
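As a rough illustration of how the insertion count $a$ drives the cost, the snippet below evaluates only the leading terms of $T$ and of the setwise heapsort bound $O(n + k \log_c n)$. The constants hidden by the $O$-notation are dropped, so the numbers indicate a trend, not measured call counts; the function names are illustrative.

```python
import math


def insertion_calls(n: int, k: int, c: int, a: int) -> float:
    """Leading terms of T: initial sort + candidate scan + insertions."""
    return k * math.log(k, c) + n / c + a * k / c


def heapsort_calls(n: int, k: int, c: int) -> float:
    """Leading terms of setwise heapsort: heap construction + k extractions."""
    return n + k * math.log(n, c)


n, k, c = 100, 10, 3
for a in (0, 10, 40):
    print(a, round(insertion_calls(n, k, c, a), 1), round(heapsort_calls(n, k, c), 1))
```

Under these simplified estimates the insertion variant is cheaper for small $a$ and loses its advantage once $a$ grows large, which is the qualitative behavior described above.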

Prompt Bias Technique:

Each setwise prompt presents the document favored by the prior ranking (e.g., $s_\text{min}$, which the prior places above the incoming candidates) as “Document A” and directs the LLM: “if uncertain, pick A,” thus reducing hallucinations and unnecessary exchanges.
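One way such a prior-biased prompt could be built is sketched below. Apart from the “Document A” label and the “if uncertain, pick A” instruction, the wording is hypothetical, and falling back to A when the reply cannot be parsed is an assumption of this sketch, chosen because it matches the spirit of the bias.

```python
import string
from typing import Sequence


def build_biased_prompt(query: str, prior_best: str, challengers: Sequence[str]) -> str:
    """Put the prior-favored document first so that it is labeled 'A'."""
    docs = [prior_best, *challengers]
    labels = string.ascii_uppercase[:len(docs)]
    lines = [f"You are given the query: {query}", f"And the following {len(docs)} documents:"]
    lines += [f"Document {label}: {doc}" for label, doc in zip(labels, docs)]
    lines.append(
        "Select the document that is most relevant to the query. "
        "If uncertain, pick A. Respond only with the letter."
    )
    return "\n".join(lines)


def parse_letter(reply: str, n_docs: int) -> int:
    """Map the reply to a 0-based position; unusable replies fall back to A."""
    answer = reply.strip().strip(".()").upper()
    if len(answer) == 1 and answer in string.ascii_uppercase[:n_docs]:
        return string.ascii_uppercase.index(answer)
    return 0


prompt = build_biased_prompt("what is setwise ranking", "current s_min", ["cand 1", "cand 2"])
print(prompt)
```

Because position 0 always means "keep the current buffer", a malformed reply leaves the ranking unchanged rather than triggering an insertion.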

4. Empirical Evaluation: Effectiveness, Efficiency, and Robustness

Setwise Reranking and Setwise Insertion have been extensively evaluated on TREC DL 2019/2020 and BEIR, with models including Flan-T5 (large/xl/xxl), Llama2-Chat-7B, Vicuna-13B, and Gemma2-9B-IT.

Performance Overview (mean over five models, both datasets):

Method	Inferences/query	Latency (s)	NDCG@10
Setwise Heapsort (no prior)	126.2	9.41	0.642
Setwise Insertion (with prior)	96.6	6.27	0.653
Relative reduction/gain	-23%	-31%	+1.7%

Flan-T5-large (TREC DL 2019):
- Heapsort (no prior): $0.669 \pm 0.0002$
- Insertion (with prior): $0.671 \pm 0.0001$

Cost (GPT-4, TREC DL):

- setwise.heapsort: $\approx \$1.28$ per query ($0.674$ nDCG)
- pairwise.heapsort: $\approx \$3.39$ per query ($0.680$ nDCG) (Zhuang et al., 2023, Podolak et al., 9 Apr 2025)

Setwise approaches maintain robustness to initial ranking order: when initial candidate lists are inverted or randomized, setwise methods retain effectiveness comparable to pairwise heapsort, whereas listwise and pairwise-bubble sorts degrade more sharply (Zhuang et al., 2023).

5. Trade-Offs, Limitations, and Failure Modes

Setwise Insertion yields the largest efficiency gains when only a small number of candidates outside the initial top-k must be inserted (i.e., when the prior ranking is good), as the insertion overhead ($O(k/c)$ per inserted document) is amortized over few events. If many insertions occur (weak initial ranker, large $a$), or $k$ is very small, the overhead can approach or exceed that of simple setwise sorting.

Setwise and its variants are constrained by model and prompt limitations:

- Must select $c$ so the prompt (query + $c$ docs + template) fits the LLM’s context window.
- Reranking requires strictly sequential sorting passes; batching works best across queries, not within-document lists.
- Models lacking access to logits must use “max-compare” (select winner in batch), which can prematurely discard close candidates.
- Hallucinations or inconsistent index outputs can still occur if the prompt format is not strictly followed.
- Excessively large $c$ values cause document truncation and may reduce retrieval quality (Zhuang et al., 2023, Podolak et al., 9 Apr 2025).

There is a practical break-even: when the insertion overhead (many qualifying new candidates for top-k) surpasses that of a full setwise sort, Setwise Heapsort is preferable.

6. Implementation Recipes and Deployment Considerations

Recommended settings:

- Set size $c \in [3, 5]$ to balance prompt size and efficiency.
- For $k$ (evaluation cutoff, e.g. $k=10$):
  - First, sort the top-k using Setwise Heapsort ($O(k \log_c k)$);
  - Then run Setwise Insertion over remaining $n-k$ documents in batches of $c$.
- Use “max-compare + prior bias” for models without direct logit access; “sort-compare” if logits are available to prune multiple candidates.
- Scan all $n-k$ remaining candidates to avoid omissions.
- Batch setwise calls across multiple queries to decrease GPU latency.
- Monitor the empirical insertion rate $a$ and revert to plain Setwise Heapsort when insertions grow too frequent (Podolak et al., 9 Apr 2025).

For full code and data, see: https://github.com/ielab/LLM-rankers

7. Extensions and Research Directions

Setwise prompting provides knobs for efficiency-quality trade-offs via $c$ and is compatible with dynamic adaptation (varying $c$ based on available tokens), hybrid retrieve+rank pipelines, prompt-tuning/self-supervised prompt evolution (e.g., PromptBreeder), and passage-level ranking in complex QA settings (Zhuang et al., 2023).
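Varying the set size with the available tokens could look like the sketch below. The helper `max_set_size` and all token figures are hypothetical; the only idea taken from the text is that the query, the template, and the $c$ documents must fit in the context window together.

```python
from typing import Optional


def max_set_size(
    context_window: int,
    query_tokens: int,
    template_tokens: int,
    doc_tokens: int,
    c_min: int = 2,
    c_max: int = 5,
) -> Optional[int]:
    """Largest c in [c_min, c_max] whose prompt fits, or None if none does."""
    room = context_window - query_tokens - template_tokens
    c = min(c_max, room // doc_tokens)
    return c if c >= c_min else None


# Hypothetical budgets: a short-context and a long-context model.
print(max_set_size(512, query_tokens=20, template_tokens=40, doc_tokens=128))
print(max_set_size(4096, query_tokens=20, template_tokens=40, doc_tokens=128))
```

A `None` result signals that documents would have to be truncated further before any setwise comparison fits.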

Further plausible exploration includes:
- Dynamic adjustment of set size per comparison;
- Integration with dense retrieval for joint retrieve+rank;
- Prompt evolution for enhanced setwise comparison robustness.

The reported experiments can be reproduced with Python 3.9, PyTorch, Hugging Face Transformers, and Pyserini for BM25 retrieval (GPU: RTX A6000); the methods were validated across multiple LLM architectures and two large-scale zero-shot reranking benchmarks (Zhuang et al., 2023, Podolak et al., 9 Apr 2025).

References

1. Zhuang et al. (2023). A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models.
2. Podolak et al. (2025). Beyond Reproducibility: Advancing Zero-shot LLM Reranking Efficiency with Setwise Insertion.
