
InsertRank: LLM Listwise BM25 Reranker

Updated 22 July 2025
  • The paper introduces a novel LLM reranker that integrates BM25 lexical signals into listwise reranking to enhance document relevance evaluation.
  • It leverages chain-of-thought reasoning, combining traditional BM25 scores with semantic content to improve retrieval robustness across varied queries.
  • Experimental results on BRIGHT and R2MED benchmarks demonstrate statistically significant improvements across multiple LLM families.

InsertRank is a method for improving LLM-based listwise document reranking by combining the LLM’s reasoning ability with explicit lexical signals from traditional retrieval models such as BM25. InsertRank integrates BM25 scores into the LLM’s input during listwise reranking, guiding the model’s judgment and consistently increasing retrieval effectiveness on complex, reasoning-intensive queries. The approach has demonstrated improvements across diverse LLM families and a range of specialized retrieval benchmarks, notably BRIGHT and R2MED, by improving both the grounding and robustness of LLM-based ranking outputs (Seetharaman et al., 17 Jun 2025).

1. Methodology: Injecting Lexical Signals into LLM Listwise Reranking

InsertRank augments standard listwise LLM reranking by incorporating BM25 scores as an explicit numerical signal during reranking inference. After retrieving an initial ranked list $(D_1, D_2, \ldots, D_n)$ for a query $q$ using BM25, the framework includes each document’s BM25 score $b_i$ alongside its textual content in the reranker’s prompt:

$$(q, D_1, b_1, D_2, b_2, \ldots, D_n, b_n) \rightarrow M \rightarrow (r_1, r_2, \ldots, r_n)$$

Here, $M$ denotes the LLM-based reranker and $r_1, \ldots, r_n$ the output reranked indices. The reranker input is formatted as a prompt that presents each document together with its BM25 score, e.g.,

“You are also given the BM25 scores from a lexical retriever. <query> <{doc1, BM25 score: s1}, {doc2, BM25 score: s2}, ..., {docn, BM25 score: sn}>”

This setup enables the model to reason about both the semantic content and the classical term-based relevance indicators.
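As a concrete illustration, the sketch below assembles such a prompt from a BM25-retrieved candidate list. The `Candidate` container, template wording, and answer-format instruction are assumptions for illustration; only the score-injection pattern follows the paper.

```python
# Minimal sketch of InsertRank-style prompt construction. The Candidate
# container, template wording, and bracketed-answer instruction are
# illustrative assumptions, not the paper's verbatim prompt.
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    bm25_score: float  # first-stage lexical score b_i

def build_insertrank_prompt(query: str, candidates: list[Candidate]) -> str:
    """Pair each candidate document with its BM25 score in a listwise prompt."""
    lines = [
        "You are a relevance ranker. You are also given the BM25 scores "
        "from a lexical retriever.",
        f"Query: {query}",
        "Documents:",
    ]
    for i, cand in enumerate(candidates, start=1):
        lines.append(f"[{i}] (BM25 score: {cand.bm25_score:.2f}) {cand.text}")
    lines.append(
        "Rank the documents from most to least relevant to the query, "
        "answering with identifiers only, e.g. [2] > [1] > [3]."
    )
    return "\n".join(lines)
```

A single call to the chosen LLM with this prompt then yields the listwise permutation $r_1, \ldots, r_n$ described above.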

2. Reasoning Capabilities and Prompt Design

InsertRank leverages the LLM’s chain-of-thought reasoning, guiding it to balance the semantic evaluation of document content with the relevance signals from BM25. The injection of BM25 scores acts as an anchor, informing the LLM about traditional measures of document-query match, potentially counteracting semantic drift and reducing hallucinated or irrelevant rankings.

The explicit presentation of scores in the prompt has two effects:

  • It introduces a clear decision “hint” for the LLM, augmenting its internal comprehension with the strengths of a term-matching signal.
  • It encourages the model to exploit associations between high BM25 scores and expected relevance, making the reranker more robust in domains or queries where semantic similarity alone may not suffice.
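Assuming the bracketed listwise answer format common to RankGPT-style rerankers (the paper does not prescribe a verbatim output syntax), mapping the model’s response back onto the candidate list might look like:

```python
import re

def parse_listwise_ranking(response: str, num_candidates: int) -> list[int]:
    """Extract a permutation such as '[3] > [1] > [2]' from the model output.
    Returns zero-based indices r_1, ..., r_n; candidates the model omits
    fall back to their original BM25 order."""
    order: list[int] = []
    for token in re.findall(r"\[(\d+)\]", response):
        idx = int(token) - 1
        if 0 <= idx < num_candidates and idx not in order:
            order.append(idx)
    # Append any candidates the model left out, preserving BM25 order.
    order.extend(i for i in range(num_candidates) if i not in order)
    return order
```

Defaulting omitted candidates to their lexical order mirrors the BM25-as-anchor role discussed above.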

3. Experimental Results and Benchmarks

InsertRank has been extensively evaluated on two rigorous reasoning-focused retrieval datasets:

  • BRIGHT: A benchmark with approximately 1 million documents and 1,300 complex, multi-domain queries, each requiring reasoning over diverse document types.
  • R2MED: A medical retrieval benchmark encompassing eight tasks, reflecting the challenges of specialized clinical language and domain reasoning.

Performance is measured using NDCG@10. With BM25 scores injected into the prompt, InsertRank with DeepSeek-R1 achieves average scores of 37.5 on BRIGHT and 51.1 on R2MED, consistently outperforming standard listwise reranking without BM25 scores across all tested LLM families. Reported relative gains include 3.2% for Gemini 2.0 Flash, 16.3% for Gemini 2.5 Flash, and roughly 0.8% for GPT-4o and DeepSeek-R1.
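For reference, NDCG@10 rewards placing highly relevant documents within the top ten positions. A minimal implementation of the linear-gain variant (one of several common NDCG formulations; the paper does not state which it uses):

```python
import math

def ndcg_at_10(ranked_rels: list[float]) -> float:
    """NDCG@10 for a single query. ranked_rels[i] is the graded relevance
    of the document the reranker placed at (zero-based) position i."""
    def dcg(rels: list[float]) -> float:
        return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(rels[:10]))
    ideal = dcg(sorted(ranked_rels, reverse=True))  # best possible ordering
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0
```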

4. Influence of Score Normalization and Input Order

Ablation studies explore the effect of score normalization and document reordering:

  • Score Normalization: Scaling BM25 scores to the [0,1] range slightly degrades performance (e.g., a 0.58% reduction on BRIGHT), while rescaling to [0,100] offers a marginal improvement (0.5–0.8% gain). This suggests that a wider dynamic range in numerical cues is more interpretable or effective for LLMs in reranking prompts; a rescaling sketch follows this list.
  • Document Order: InsertRank is robust to the order in which documents are presented in the prompt, particularly on BRIGHT, with gains persisting even after shuffling the candidate order. For R2MED, however, maintaining the original BM25 order yields better results, indicating some sensitivity to document placement in prompts, particularly in specialized, highly technical domains.
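The rescaling in the first ablation can be reproduced with simple min-max scaling, sketched below. The paper does not specify its exact normalization function, so treat the formula as an assumption:

```python
def rescale_scores(scores: list[float], lo: float = 0.0, hi: float = 100.0) -> list[float]:
    """Min-max rescale raw BM25 scores into [lo, hi] before prompt injection.
    The [0, 100] default reflects the ablation's better-performing range."""
    s_min, s_max = min(scores), max(scores)
    if s_max == s_min:  # all candidates tied; emit the midpoint to avoid /0
        return [(lo + hi) / 2.0] * len(scores)
    return [lo + (s - s_min) * (hi - lo) / (s_max - s_min) for s in scores]
```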

5. Cross-Model Effectiveness

A significant attribute of InsertRank is its consistent improvement across different LLM architectures and families:

  • For GPT-4o and DeepSeek-R1, injecting BM25 scores produces statistically significant, albeit smaller, performance increases.
  • For Gemini models, notably 2.5 Flash, score injection yields substantial gains, supporting the model-agnostic effectiveness of the approach.

This cross-model consistency demonstrates InsertRank’s generalizability and value as a plug-in enhancement to varied LLM-based retrieval systems.

6. Practical Implications and Future Directions

The empirical results underscore several practical and theoretical insights:

  • Robustness: By integrating explicit, low-cost retrieval signals (BM25), InsertRank anchors LLM judgments, resulting in more consistent and reliable reranking, especially in challenging reasoning-heavy contexts.
  • Generalizability: InsertRank benefits all major LLM families tested, making it a widely applicable technique for production and research reranking pipelines.
  • Prompt Engineering Considerations: The effectiveness of InsertRank depends partly on proper normalization of numerical cues and careful prompt organization; a salient, sufficiently wide-ranging BM25 signal and deliberate document ordering are recurring themes.

Directions for future exploration include extending this approach by injecting other retriever-derived signals or metadata, performing in-depth analyses of which cues are most beneficial in various domains, and investigating richer prompt structures to further harness both lexical and semantic reasoning in LLM reranking (Seetharaman et al., 17 Jun 2025).

7. Relation to Broader Developments in Listwise LLM Reranking

InsertRank relates closely to the ongoing movement in the LLM ranking community to integrate model interpretability, external retrieval signals, and prompt-based reasoning to overcome the limitations of purely neural or semantic ranking. The approach sits within a larger set of methods that address position bias, context window limitations, and the need for robust, reproducible performance across domains and LLM architectures. Its improvements are incremental but statistically significant, reinforcing a broader consensus that hybrid methods—bridging classical information retrieval and LLM-based semantics—offer the most reliable path to state-of-the-art retrieval effectiveness in complex query settings (Seetharaman et al., 17 Jun 2025).
