REARANK: Reasoning Re-ranking Agent via Reinforcement Learning (2505.20046v1)

Published 26 May 2025 in cs.IR and cs.CL

Abstract: We present REARANK, an LLM-based listwise reasoning reranking agent. REARANK explicitly reasons before reranking, significantly improving both performance and interpretability. Leveraging reinforcement learning and data augmentation, REARANK achieves substantial improvements over baseline models across popular information retrieval benchmarks, notably requiring only 179 annotated samples. Built on top of Qwen2.5-7B, our REARANK-7B demonstrates performance comparable to GPT-4 on both in-domain and out-of-domain benchmarks and even surpasses GPT-4 on reasoning-intensive BRIGHT benchmarks. These results underscore the effectiveness of our approach and highlight how reinforcement learning can enhance LLM reasoning capabilities in reranking.

Summary

  • The paper introduces Rearank, a novel reasoning agent that applies reinforcement learning to perform listwise document reranking in information retrieval systems.
  • It leverages a sliding window strategy and a normalized relative improvement reward, significantly enhancing NDCG@10 performance even with scarce annotated queries.
  • Experiments reveal Rearank-7B outperforms larger models on reasoning-intensive benchmarks, demonstrating superior generalization through effective data augmentation.

The paper "REARANK: Reasoning Re-ranking Agent via Reinforcement Learning" introduces Rearank, a novel LLM-based agent designed for listwise document reranking in information retrieval systems. The core idea is to enhance reranking performance and interpretability by explicitly incorporating reasoning capabilities into the LLM agent, trained effectively through reinforcement learning (RL) despite limited annotated data.

Modern information retrieval often uses a two-stage process: initial retrieval (e.g., BM25) followed by reranking. LLMs have shown promise for reranking, particularly as agents that output final decisions rather than internal scores. However, adapting LLMs for this task faces challenges: they aren't inherently optimized for ranking, training requires scarce labeled data, their decision processes lack transparency, and state-of-the-art models are often large and computationally expensive.
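
As a minimal illustration of this retrieve-then-rerank setup (not the paper's code), the sketch below uses the `rank_bm25` package as a stand-in first stage and a placeholder `rerank_with_llm` function for the second stage:

```python
# Minimal retrieve-then-rerank sketch (illustrative, not the paper's code).
# rerank_with_llm is a hypothetical stand-in for an LLM-based listwise
# reranker such as Rearank; only the BM25 stage uses a real library.
from rank_bm25 import BM25Okapi

corpus = [
    "BM25 is a bag-of-words retrieval function.",
    "Reinforcement learning optimizes a policy from reward signals.",
    "Listwise reranking reorders a candidate list for a query.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def retrieve(query: str, k: int = 100) -> list[str]:
    """Stage 1: cheap lexical retrieval of top-k candidate passages."""
    return bm25.get_top_n(query.lower().split(), corpus, n=k)

def rerank_with_llm(query: str, passages: list[str]) -> list[str]:
    """Stage 2: placeholder for an LLM listwise reranker."""
    return passages  # a real agent would reorder by reasoned relevance

candidates = retrieve("how does listwise reranking work", k=3)
final_ranking = rerank_with_llm("how does listwise reranking work", candidates)
```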

Rearank addresses these challenges by formulating listwise reranking within an RL framework. Given a query $q$ and an initial list of $n$ passages $P = (p_1, \ldots, p_n)$, the goal is to find the permutation $\sigma$ that maximizes a ranking quality score (Equation 1). Due to LLM context length limits, Rearank employs a sliding window approach, where the LLM agent reorders a window of $w$ passages at a time. Iterating this window across the initial list (typically with overlap) allows reranking the entire list using approximately $O(2n/w)$ LLM calls, improving efficiency compared to processing passages individually.
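
To make the sliding-window pass concrete, here is a minimal sketch, assuming a hypothetical `rerank_window` call that stands in for one LLM agent invocation; the window size and stride are illustrative rather than the paper's exact settings.

```python
# Sliding-window listwise reranking sketch (illustrative, not the paper's code).
# rerank_window(query, passages) stands in for one LLM agent call that
# returns the same passages reordered by relevance.
from typing import Callable

def sliding_window_rerank(
    query: str,
    passages: list[str],
    rerank_window: Callable[[str, list[str]], list[str]],
    window: int = 20,
    stride: int = 10,
) -> list[str]:
    ranked = list(passages)
    # Walk from the tail of the list toward the head so strong candidates
    # can bubble up across overlapping windows.
    start = max(len(ranked) - window, 0)
    while True:
        end = start + window
        ranked[start:end] = rerank_window(query, ranked[start:end])
        if start == 0:
            break
        start = max(start - stride, 0)
    return ranked
```

With a stride of half the window size this makes roughly $2n/w$ agent calls, matching the cost estimate above.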

In the RL setup, the LLM acts as the policy $\pi_\theta$. The state is the current ranking of passages together with the query, the action is the reordering the LLM produces for a window of passages, and the reward signal guides learning. The paper uses Group Relative Policy Optimization (GRPO), where the model samples multiple output sequences (reasoning + ranking) for each input and computes advantages from a rule-based reward. The training objective (Equation 3) combines a token-level loss based on these advantages with a KL penalty for stability.
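
A rough sketch of the group-relative advantage computation at the heart of a GRPO-style update is shown below; the group size and reward values are invented, and the full objective additionally includes the token-level policy-gradient term and KL penalty mentioned above.

```python
# Group-relative advantage computation in the spirit of GRPO (illustrative).
# For one query/window, the policy samples G candidate outputs (reasoning +
# ranking); each gets a scalar rule-based reward, and advantages are the
# rewards standardized within the group.
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. G = 4 sampled rankings for the same window, scored by the reward function
sampled_rewards = [0.42, 0.55, 0.31, 0.60]   # hypothetical values
advantages = group_advantages(sampled_rewards)
# Each sampled sequence's tokens are then reinforced in proportion to its
# advantage, with the KL term keeping the policy close to the reference model.
```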

A key aspect of Rearank's implementation is the design of the reward function and data augmentation strategy. The reward is a composite signal (Equation 6). The primary ranking reward ($r_\text{rank}$) is based on Normalized Discounted Cumulative Gain (NDCG@10). To make the reward stable and effective across different sets of candidate passages (which can have varying maximum possible NDCG@10), they use a relative improvement score normalized by the ideally achievable score for that specific candidate set (Equation 5). This min-max normalization of the improvement score helps reduce variance. Additional format rewards encourage the agent to produce outputs with the required <think> and <answer> tags and the correct ranking list format.
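
The following is a hedged sketch of how such a reward could be computed, with an NDCG@10 helper, a relative improvement normalized by the best score reachable with the given candidate set, and a small format bonus; the weights, clipping, and exact composition are assumptions rather than a reproduction of the paper's Equation 6.

```python
# Hedged sketch of a normalized relative-improvement ranking reward
# (one plausible reading of Equations 5-6, not the paper's exact form).
import math

def dcg_at_k(rels: list[int], k: int = 10) -> float:
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(ranked_rels: list[int], all_query_rels: list[int], k: int = 10) -> float:
    # Normalize by the ideal DCG over *all* judged labels for the query,
    # so a candidate set missing relevant passages cannot reach 1.0.
    ideal = dcg_at_k(sorted(all_query_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0

def rank_reward(initial_rels, reranked_rels, all_query_rels, k: int = 10) -> float:
    """Min-max normalized improvement: 0 = no better than the initial order,
    1 = the best ordering achievable with this candidate set."""
    ndcg_init = ndcg_at_k(initial_rels, all_query_rels, k)
    ndcg_new = ndcg_at_k(reranked_rels, all_query_rels, k)
    ndcg_best = ndcg_at_k(sorted(initial_rels, reverse=True), all_query_rels, k)
    denom = max(ndcg_best - ndcg_init, 1e-6)
    return min(max((ndcg_new - ndcg_init) / denom, 0.0), 1.0)

def total_reward(rank_r: float, format_ok: bool, fmt_bonus: float = 0.1) -> float:
    # Assumed composition: ranking reward plus a small bonus for well-formed
    # <think>/<answer> output; the paper's exact weighting may differ.
    return rank_r + (fmt_bonus if format_ok else 0.0)
```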

To overcome the scarcity of high-quality listwise ranking data, the authors introduce a data augmentation technique called Initial State Expansion. Starting with a small set of annotated queries (179 from MSMARCO-V2), they generate multiple diverse sets of candidate passages (e.g., 20 passages sampled 50 times per query from BM25 top 100). These varied initial rankings serve as diverse training instances, allowing the model to learn robustly from a wider range of scenarios using the same underlying relevance judgments. This generated 12k training instances from only 179 queries.
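
A minimal sketch of this Initial State Expansion idea, assuming hypothetical data structures and the illustrative sampling numbers quoted above:

```python
# Initial State Expansion sketch (illustrative; data structures are hypothetical).
# For each annotated query, repeatedly sample a small candidate set from the
# BM25 top-100 so one set of relevance judgments yields many training instances.
import random

def expand_initial_states(
    queries: dict[str, list[str]],   # query id -> BM25 top-100 passage ids
    samples_per_query: int = 50,
    set_size: int = 20,
    seed: int = 0,
) -> list[dict]:
    rng = random.Random(seed)
    instances = []
    for qid, top100 in queries.items():
        for _ in range(samples_per_query):
            candidates = rng.sample(top100, k=min(set_size, len(top100)))
            instances.append({"query_id": qid, "candidates": candidates})
    return instances

# With 179 queries the exact instance count depends on the sampling and
# filtering recipe; the paper reports roughly 12k training instances.
```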

Experiments were conducted on in-domain (TREC-DL19/20), out-of-domain (BEIR), and reasoning-intensive (BRIGHT) benchmarks, using NDCG@10 as the metric. Rearank-7B (built on Qwen2.5-7B) was compared against various baselines, including zero-shot RankQwen/RankGPT, SFT-trained RankZephyr, larger reasoning-focused Qwen3 models, and a concurrent RL approach (Rank-R1).
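
For reference, one standard formulation of the NDCG@10 metric (not quoted from the paper), where $\mathrm{rel}_i$ is the graded relevance of the passage at rank $i$:

$$
\mathrm{DCG@10} = \sum_{i=1}^{10} \frac{\mathrm{rel}_i}{\log_2(i+1)},
\qquad
\mathrm{NDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}},
$$

where $\mathrm{IDCG@10}$ is the DCG@10 of the ideal ordering; the exponential-gain variant with $2^{\mathrm{rel}_i}-1$ in the numerator is also common.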

Key results showed that Rearank-7B, trained on only 179 annotated queries, achieved performance comparable to or surpassing much larger models like GPT-4 and Qwen3 on various benchmarks, particularly excelling on the reasoning-intensive BRIGHT dataset where it outperformed GPT-4. It demonstrated significant improvements over the Qwen2.5-7B baseline (6.5% on in-domain, 4.5% on OOD, 2.7% on BRIGHT). Compared to RankZephyr-7B (trained on 105k SFT instances), Rearank-7B showed comparable in-domain performance and better OOD performance, suggesting improved generalization from the RL approach. Against the concurrent Setwise Rank-R1-7B (trained on 72k RL instances), Rearank-7B performed better with substantially less training data, highlighting the effectiveness of the listwise strategy and data augmentation.

Ablation studies confirmed the importance of the RL training and the specific reward design. Applying the reasoning prompt alone to the base model yielded minimal gains. Filtering low-quality candidate sets was crucial. The proposed normalized relative improvement reward function outperformed raw or difference-based NDCG rewards. A direct SFT baseline trained on the same small dataset showed poor reasoning transfer compared to the RL-trained Rearank.

Analysis revealed that the RL training shapes the reasoning pattern, encouraging the agent to strategically identify relevant information and use concise comparisons. While larger Qwen3 models showed only marginal improvement from enabling reasoning, reasoning was crucial for Rearank's performance gains. The paper also explored the transferability of the learned reasoning, showing some improvements on mathematical reasoning tasks. They did not find a strong correlation between reasoning length and performance in their experiments.

Rearank demonstrates that RL can effectively train LLMs for reasoning-based listwise reranking with minimal annotated data by leveraging a novel data augmentation method and a carefully designed reward function. Its compact size and listwise strategy contribute to improved inference efficiency.

Limitations noted include the lack of formal evaluation on the faithfulness of generated explanations and the reliance on the quality of initial candidates provided by the first-stage retrieval system.
