
InsertRank: LLMs can reason over BM25 scores to Improve Listwise Reranking (2506.14086v1)

Published 17 Jun 2025 in cs.IR, cs.AI, and cs.CL

Abstract: LLMs have demonstrated significant strides across various information retrieval tasks, particularly as rerankers, owing to their strong generalization and knowledge-transfer capabilities acquired from extensive pretraining. In parallel, the rise of LLM-based chat interfaces has raised user expectations, encouraging users to pose more complex queries that necessitate retrieval by "reasoning" over documents rather than through simple keyword matching or semantic similarity. While some recent efforts have exploited the reasoning abilities of LLMs for reranking such queries, considerable potential for improvement remains. In that regard, we introduce InsertRank, an LLM-based reranker that leverages lexical signals like BM25 scores during reranking to further improve retrieval performance. InsertRank demonstrates improved retrieval effectiveness on BRIGHT, a reasoning benchmark spanning 12 diverse domains, and R2MED, a specialized medical reasoning retrieval benchmark spanning 8 different tasks. We conduct an exhaustive evaluation and several ablation studies and demonstrate that InsertRank consistently improves retrieval effectiveness across multiple families of LLMs, including GPT, Gemini, and DeepSeek models. With DeepSeek-R1, InsertRank achieves a score of 37.5 on the BRIGHT benchmark and 51.1 on the R2MED benchmark, surpassing previous methods.

Summary

  • The paper introduces a novel approach that integrates BM25 scores as lexical signals to enhance LLM-based listwise reranking without fine-tuning.
  • The method achieves significant performance gains, reaching an NDCG@10 of 37.5 on the BRIGHT benchmark and consistent gains on R2MED.
  • The study demonstrates that leveraging existing retrieval signals can reduce LLM reasoning errors, such as hallucinations and overthinking, in complex queries.

An Analysis of an LLM-Based Method for Enhanced Retrieval Effectiveness via Listwise Reranking

The paper "InsertRank: LLMs can reason over BM25 scores to Improve Listwise Reranking" introduces a novel approach in the domain of information retrieval, specifically addressing the application of LLMs to listwise reranking by leveraging BM25 scores. The authors focus on the potential of LLMs to enhance retrieval effectiveness for reasoning-centric queries, a timely topic given the rising complexity of user queries facilitated by advanced LLM-driven interfaces.

Core Contributions

The paper makes several contributions to the field:

  1. Integration of Lexical Signals: The authors propose a method that incorporates BM25 scores as lexical signals during the reranking process, helping LLMs reason more effectively over document lists; a minimal prompt-construction sketch follows this list. This integration is positioned as a key improvement over state-of-the-art methods that often rely solely on semantic matching.
  2. Evaluation Across Benchmarks: The proposed strategy is rigorously evaluated on two reasoning-centric benchmarks, BRIGHT and R2MED. BRIGHT covers reasoning-intensive retrieval across 12 diverse domains, whereas R2MED tests retrieval in complex medical scenarios across 8 tasks.
  3. Zero-Shot Effectiveness: A notable aspect of this approach is its zero-shot nature. The retrieval system does not require fine-tuning, thereby presenting an efficient avenue for enhancing ranking performance without additional training overhead.
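To make the core idea concrete, here is a minimal sketch of how BM25 scores might be interleaved into a listwise reranking prompt. It is illustrative only, assuming a generic prompt format: the wording, the "[i]" identifier scheme, and the score placement are assumptions, not the authors' implementation.

```python
# Minimal sketch of BM25-informed listwise prompting, in the spirit of
# InsertRank. Prompt wording and identifier format are assumptions.

def build_listwise_prompt(query: str, docs_with_scores: list[tuple[str, float]]) -> str:
    """Interleave each candidate passage with its BM25 score so the LLM can
    reason over lexical evidence alongside the passage text itself."""
    lines = [
        f"Query: {query}",
        "Rank the following passages by relevance to the query.",
        "Each passage is annotated with its BM25 retrieval score.",
        "",
    ]
    for i, (doc, score) in enumerate(docs_with_scores, start=1):
        lines.append(f"[{i}] (BM25: {score:.2f}) {doc}")
    lines.append("")
    lines.append("Output the identifiers in descending relevance, e.g. [2] > [1] > [3].")
    return "\n".join(lines)
```

The returned string would be sent to the reranking LLM as a single prompt, which is what makes the approach zero-shot: no fine-tuning is required, only prompt construction.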

Strong Numerical Results

On the BRIGHT benchmark, integrating BM25 scores yields an average NDCG@10 of 37.5, surpassing the strongest competing methods. Similarly, on R2MED the approach reaches 51.1, showing consistent gains and underscoring the robustness of using BM25 scores in a zero-shot listwise setting.
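For reference, NDCG@10 is the standard ranking metric used here; a textbook implementation (the standard definition, not code from the paper) looks like this:

```python
import math

def ndcg_at_k(relevances: list[int], k: int = 10) -> float:
    """NDCG@k for one query: DCG of the predicted ordering divided by the DCG
    of the ideal (relevance-sorted) ordering. `relevances` holds the graded
    relevance of each document in the predicted rank order."""
    def dcg(rels: list[int]) -> float:
        return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```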

Implications and Analyses

While previous works have leaned heavily on fine-tuning and dense retrieval methods, this paper's insights stress leveraging retriever-generated scores to improve LLM reasoning. These lexical signals help guide the LLM and mitigate known failure modes such as hallucinations, incorrect reasoning, and brittleness. Furthermore, the examination of BM25 score normalization suggests a grounding effect, helping the model attend to contextual information without falling into reasoning pitfalls such as overthinking.
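The paper ablates the scale at which BM25 scores are presented; the sketch below assumes a simple min-max scheme for how such normalization might be done, which is not necessarily the authors' exact scheme.

```python
def normalize_bm25(scores: list[float], scale: float = 100.0) -> list[float]:
    """Min-max normalize raw BM25 scores onto [0, scale] before inserting them
    into the reranking prompt. The min-max scheme and the default scale here
    are assumptions; the paper varies the scale as an ablation."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # all candidates tied: map everything to the top of the range
        return [scale] * len(scores)
    return [scale * (s - lo) / (hi - lo) for s in scores]
```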

The paper also considers token economy in its evaluation methodology, highlighting the intrinsic efficiency of listwise approaches, which place all candidate documents in a single LLM prompt. The ablation studies, which include document shuffling and score normalization, reveal how document order and lexical score scaling affect reasoning accuracy.
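A shuffling ablation of this kind can be sketched as follows; the fixed seed is an assumption for reproducibility, not a detail from the paper.

```python
import random

def shuffle_with_scores(docs_with_scores: list[tuple[str, float]], seed: int = 0):
    """Positional-bias ablation: randomize candidate order while keeping each
    passage paired with its BM25 score, so any change in reranking quality can
    be attributed to position rather than to the lexical signal."""
    rng = random.Random(seed)
    shuffled = list(docs_with_scores)
    rng.shuffle(shuffled)
    return shuffled
```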

Future Directions

This research paves the way for exploring additional metadata or signals that could bolster LLMs' retrieval capabilities, potentially enriching document representation. It suggests a trajectory toward exploiting further lexical and syntactic cues, particularly in hybrid models combining dense and sparse retrieval methods. Moreover, it raises the question of whether such metadata can be applied beyond retrieval tasks, in areas like distillation and deployment optimization.

In summary, the paper provides a compelling reason to integrate existing retrieval signals with LLM reasoning capabilities, bolstering retrieval effectiveness while maintaining computational simplicity. As LLM capabilities evolve, this integration is poised to become increasingly integral to the retrieval landscape, fostering better-informed and more strategic approaches to document ranking.
