ReasonRank: Reranking LLM for Reasoning Tasks

Updated 12 August 2025
  • ReasonRank is a listwise passage reranker trained on automatically synthesized, cross-domain, reasoning-intensive data whose gold labels and explicit reasoning chains are generated by a strong reasoning LLM.
  • The model leverages a two-stage post-training protocol—with supervised fine-tuning and reinforcement learning—to jointly optimize stepwise reasoning and ranking accuracy.
  • Extensive evaluation on benchmarks like BRIGHT, R2MED, and BEIR demonstrates significant gains in NDCG@10 and reduced latency, establishing state-of-the-art performance.

ReasonRank is an LLM-based listwise passage reranker explicitly designed to excel in reasoning-intensive ranking scenarios. Its core innovations are the synthesis of domain-diverse, reasoning-rich training data, labeled and explained by a high-capacity reasoning LLM, and a two-stage post-training protocol that jointly optimizes for strong stepwise reasoning and effective ranking. Extensive experiments, especially on the BRIGHT benchmark suite, establish that ReasonRank leads the field in reasoning-intensive passage reranking in both accuracy and latency. The following sections detail the mechanism, training approach, evaluation, and broader methodological impact of ReasonRank.

1. Reasoning-Intensive Training Data Synthesis

The foundational component of ReasonRank is its fully automated framework for synthesizing domain-spanning, reasoning-intensive training data. Training queries and candidate passage lists are constructed from heterogeneous sources:

  • Complex QA Queries: Sourced from six StackExchange sites (Biology, Earth Science, Economics, Robotics, StackOverflow, Sustainable Living), paired with high-quality “gold” answers and passages collected from relevant domains.
  • Coding Tasks: Derived from Leetcode, with candidate passages including both code snippets and explanatory documentation.
  • Mathematics Tasks: Sourced from problem–solution pairs in a STEM corpus (math problems) and from theorem statements on ProofWiki (math theorems).
  • Web Search: Drawn from MSMARCO search logs to ensure coverage of standard search scenarios.

Once queries and passages are assembled, DeepSeek-R1—a sophisticated large reasoning model—is deployed to generate both gold label assignments (pointwise and listwise) and explicit reasoning chains. For each query, DeepSeek-R1 receives the passage batch and gold answer, then:

  • Selects positive (relevant) and hard negative passages,
  • Constructs an ordered gold ranking (listwise label) for the passage set,
  • Produces a free-form reasoning explanation supporting the ranking decisions.
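
The exact prompt and output schema used with DeepSeek-R1 are not reproduced here; the following Python sketch, with an illustrative JSON schema and hypothetical key names, shows one plausible way a labeler reply could be parsed into the three artifacts listed above.

```python
import json

def parse_labeler_reply(reply: str) -> dict:
    """Parse a (hypothetical) JSON reply from the reasoning LLM into the three
    artifacts described above: pointwise labels, a listwise gold ranking, and a
    free-form reasoning chain. The key names are illustrative, not the paper's."""
    data = json.loads(reply)
    pointwise = {pid: 1 for pid in data["positives"]}
    pointwise.update({pid: 0 for pid in data["hard_negatives"]})
    return {
        "pointwise_labels": pointwise,          # positive vs. hard-negative passages
        "listwise_ranking": data["ranking"],    # ordered gold ranking of the batch
        "reasoning_chain": data["reasoning"],   # explanation supporting the ranking
    }

# Toy usage with a mocked labeler reply.
reply = json.dumps({
    "positives": ["p3", "p1"],
    "hard_negatives": ["p2"],
    "ranking": ["p3", "p1", "p2"],
    "reasoning": "p3 states the mechanism asked about; p1 is partially relevant.",
})
print(parse_labeler_reply(reply)["listwise_ranking"])
```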

A self-consistency data filtering mechanism is applied post-generation: the listwise ranking is scored (e.g., via NDCG@10) relative to the pointwise-labeled reference, and only examples meeting a minimum score threshold (α) are retained. This process ensures high reasoning and ranking fidelity in the training corpus.
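
The paper describes the agreement check (the listwise ranking scored by NDCG@10 against the pointwise labels, thresholded at α) but not a reference implementation; a minimal Python sketch of such a filter, using hypothetical field names (listwise_ranking, pointwise_labels) and an illustrative α, could look like this:

```python
import math

def ndcg_at_10(ranked_ids, relevance):
    """NDCG@10 of a ranked id list against pointwise relevance labels."""
    gains = [relevance.get(pid, 0) for pid in ranked_ids[:10]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:10]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def self_consistency_filter(examples, alpha=0.8):
    """Keep only examples whose listwise ranking agrees with the pointwise
    labels, i.e. NDCG@10 >= alpha (alpha is a placeholder; the paper sets it
    via grid search)."""
    return [ex for ex in examples
            if ndcg_at_10(ex["listwise_ranking"], ex["pointwise_labels"]) >= alpha]

# Toy usage: the first example is self-consistent, the second is not.
examples = [
    {"listwise_ranking": ["p3", "p1", "p2"], "pointwise_labels": {"p3": 1, "p1": 1, "p2": 0}},
    {"listwise_ranking": ["p2", "p3", "p1"], "pointwise_labels": {"p3": 1, "p1": 1, "p2": 0}},
]
print(len(self_consistency_filter(examples)))  # -> 1
```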

2. Two-Stage Post-Training Protocol

ReasonRank's model training is a two-phase process engineered to maximize both reasoning ability and ranking accuracy:

a. Cold-Start Supervised Fine-Tuning (SFT):

  • The LLM backbone (e.g., Qwen2.5-7B-Instruct, Qwen2.5-32B-Instruct) is initially trained with the generated reasoning-intensive corpus.
  • Inputs are (query, candidate passage list); outputs are formatted text containing:
    • Reasoning chain (within <think> … </think> tags)
    • Final passage ranking (within <answer> … </answer> tags)
  • The loss is standard autoregressive cross-entropy:

$$\mathcal{L} = -\sum_{i=1}^{|y|} \log P_\theta\left(y_i \mid x, y_{<i}\right)$$

b. Reinforcement Learning (RL) with Multi-View Ranking Reward:

  • To refine ranking under the sequential, windowed listwise processing regime, RL is used with Grouped Relative Policy Optimization (GRPO).
  • The multi-view ranking reward,

$$R^m = \text{NDCG}@10 + \phi \cdot \text{Recall}@10 + \gamma \cdot \mathrm{RBO},$$

accounts jointly for ranking quality (NDCG), coverage (Recall), and list similarity at varying depths (Rank-Biased Overlap, RBO):

$$\mathrm{RBO} = (1-p) \sum_{d=1}^{|y^{\text{list}}|} p^{d-1} \, \frac{|y'_{1:d} \cap y^{\text{list}}_{1:d}|}{d}$$

Hyperparameters φ, γ, and p are optimized via validation.

  • Format checking is implemented: the RL reward is masked to zero if the output formatting (the <think> and <answer> sections) is invalid.
  • The GRPO objective for each token sequence is:

$$J_{\mathrm{GRPO}}(\theta) = -\frac{1}{|G|}\sum_{i=1}^{|G|} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\left\{ r_{i,t}(\theta)\, \hat{A}_{i,t},\ \mathrm{clip}\left(r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_{i,t} \right\} - \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)$$

where $\hat{A}_{i,t}$ is the normalized advantage, $r_{i,t}(\theta)$ is the token probability ratio, $\varepsilon$ is a clipping constant, and $\pi_{\mathrm{ref}}$ is the reference model.
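
To make the multi-view reward concrete, the following minimal Python sketch computes R^m from a generated output, including the format check that masks the reward to zero when the <think>/<answer> structure is malformed. The whitespace-separated id format inside <answer> and the values of φ, γ, and p are assumptions for illustration; the paper tunes these hyperparameters on validation data.

```python
import math
import re

def ndcg_at_10(pred_ids, relevance):
    """Same NDCG@10 helper as in the filtering sketch above."""
    gains = [relevance.get(pid, 0) for pid in pred_ids[:10]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:10]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_10(pred_ids, relevance):
    """Fraction of relevant passages recovered in the top 10."""
    relevant = {pid for pid, rel in relevance.items() if rel > 0}
    return len(set(pred_ids[:10]) & relevant) / len(relevant) if relevant else 0.0

def rbo(pred_ids, gold_list, p=0.9):
    """Rank-Biased Overlap between the predicted and gold listwise rankings."""
    total = 0.0
    for d in range(1, len(gold_list) + 1):
        overlap = len(set(pred_ids[:d]) & set(gold_list[:d]))
        total += (p ** (d - 1)) * overlap / d
    return (1 - p) * total

def multi_view_reward(output_text, relevance, gold_list, phi=0.5, gamma=0.5, p=0.9):
    """R^m = NDCG@10 + phi * Recall@10 + gamma * RBO, masked to zero when the
    <think>/<answer> format is invalid. phi, gamma, p are placeholder values."""
    match = re.search(r"<think>.+</think>\s*<answer>(.+)</answer>", output_text, re.S)
    if match is None:
        return 0.0  # format check failed: reward masked to zero
    pred_ids = match.group(1).split()  # assumes whitespace-separated passage ids
    return (ndcg_at_10(pred_ids, relevance)
            + phi * recall_at_10(pred_ids, relevance)
            + gamma * rbo(pred_ids, gold_list, p))

# Toy usage: a well-formed output earns a positive reward, a malformed one earns 0.
out = "<think>p2 answers the query directly.</think> <answer>p2 p1 p3</answer>"
print(multi_view_reward(out, {"p2": 1, "p1": 1, "p3": 0}, ["p2", "p1", "p3"]))
print(multi_view_reward("no tags here", {"p2": 1}, ["p2"]))
```

In GRPO training, a reward of this kind would be computed per sampled completion and normalized within each group to obtain the advantages $\hat{A}_{i,t}$.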
3. Evaluation and Benchmark Performance

ReasonRank is extensively evaluated on a spectrum of both traditional and reasoning-focused information retrieval (IR) benchmarks. Key findings include:

  • BRIGHT Benchmark (12 datasets spanning economics, code, math, etc.): ReasonRank (32B) achieves an average NDCG@10 of 40.6, surpassing all prior models and establishing state-of-the-art performance. Major gains (up to +5 NDCG@10) are observed over prior listwise reasoning rerankers (e.g., Rank-R1, Rank1, Rank-K).
  • R2MED (medical QA): Robust gains confirm generalization to medical contexts.
  • BEIR (conventional IR): Strong performance is retained, demonstrating that reasoning-centric training does not sacrifice mainstream ranking ability.
  • Latency: By reasoning over sliding windows instead of scoring each passage pointwise, ReasonRank achieves 2–2.7× reduced latency compared with pointwise rerankers such as Rank1.

Ablations show that omitting either the reasoning-intensive training data or the RL phase results in significant performance drops, confirming the essential role of both components.

4. Technical Details and Implementation

  • Self-Consistency Filtering Hyperparameters: The NDCG@10 threshold α is set via grid search, ensuring high agreement across repeated DeepSeek-R1 ranking inferences.
  • Model Backbone: Both Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct are used, with parameter-efficient LoRA fine-tuning and DeepSpeed optimization. Training sequence lengths are capped (e.g., ≤3072 tokens) to accommodate long reasoning chains.
  • Sliding Window Strategy: Ranking is performed over windows (stride and window size chosen based on context-length constraints); each window outputs its own reasoning/ranking pair, and these are merged into a global ranking (a toy sketch of this windowed pass follows the summary table below).
  • RL Details: The RL process is run for additional epochs after SFT. KL regularization toward a reference model is critical for stability, and clipping is used to bound policy updates.

5. Methodological Impact and Future Directions

ReasonRank demonstrates that high-quality, fully automated, cross-domain reasoning-intensive data, combined with staged SFT+RL training, enables LLM-based rerankers to dominate in contexts requiring nontrivial inference and evidence synthesis. The multi-view RL reward, which incorporates holistic ranking quality over sequentially generated windows, is more effective than traditional single-metric approaches (e.g., raw NDCG rewards).

Planned future work includes:

  • Data Diversity Expansion: Mixing reasoning-light and reasoning-heavy data for dynamic behavior optimization.
  • Backbone Scaling and Generalization: Testing on newer LLM platforms (e.g., Llama 3.1, Qwen3) and flexible adaptation to even larger context sizes, allowing direct full-list ranking.
  • Broader Application: Transfer of ReasonRank’s synthesis and reasoning capabilities to QA, dialogue, legal, and clinical retrieval settings, where interpretability is strictly required.

6. Summary Table: ReasonRank Workflow

| Stage | Input | Output |
|-------|-------|--------|
| Data synthesis | Multi-domain queries, passages | DeepSeek-R1 generated: (a) pointwise labels, (b) ordered listwise labels, (c) reasoning chains |
| Self-consistency filtering | Gold lists, labels | High-quality, consistent training examples |
| Supervised fine-tuning (SFT) | (query, passage list) | Model trained to produce <think>reasoning</think> and <answer>ranking</answer> tokens |
| Reinforcement learning (RL) | SFT-initialized model | Improved ranking via multi-view reward over sequential/overlapping windows |
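
As a companion to the sliding-window strategy described in Section 4, here is a minimal sketch of a generic back-to-front windowed pass in the style of listwise rerankers such as RankGPT. The rerank_window callable, window size, and stride are placeholders, and ReasonRank's actual window merging may differ in detail.

```python
from typing import Callable, List

def sliding_window_rerank(passages: List[str],
                          rerank_window: Callable[[List[str]], List[int]],
                          window_size: int = 20,
                          stride: int = 10) -> List[str]:
    """Back-to-front sliding-window pass: each window is reordered by the listwise
    reranker (rerank_window returns a permutation of window indices), and the
    overlap between consecutive windows lets strong passages bubble to the top."""
    ranked = list(passages)
    start = max(len(ranked) - window_size, 0)
    while True:
        end = min(start + window_size, len(ranked))
        window = ranked[start:end]
        order = rerank_window(window)                # model's ranking of this window
        ranked[start:end] = [window[i] for i in order]
        if start == 0:
            break
        start = max(start - stride, 0)
    return ranked

# Toy usage with a stand-in "reranker" that prefers longer passages.
def by_length(window: List[str]) -> List[int]:
    return sorted(range(len(window)), key=lambda i: -len(window[i]))

candidates = [f"passage {'x' * i}" for i in range(30)]
print(sliding_window_rerank(candidates, by_length, window_size=10, stride=5)[:3])
```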

7. Conclusion

ReasonRank achieves state-of-the-art passage ranking performance in reasoning-intensive settings by systematizing high-quality data synthesis, enforcing rigorous internal reasoning through structured SFT and RL objectives, and leveraging LLMs for explicit, interpretable chain-of-thought explanations. Its dual emphasis on generalization across domains and practical efficiency (lower latency) positions it as a reference system for research and deployment in high-stakes or complex IR environments. Future directions focus on expanding ReasonRank’s domain reach, leveraging emerging LLM architectures, and enabling direct, full-context ranking without windowing constraints (Liu et al., 9 Aug 2025).

References (1)
