
ReasonIR: Training Retrievers for Reasoning Tasks (2504.20595v1)

Published 29 Apr 2025 in cs.AI, cs.CL, cs.IR, and cs.LG

Abstract: We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning tasks, in part because existing training datasets focus on short factual queries tied to documents that straightforwardly answer them. We develop a synthetic data generation pipeline that, for each document, creates a challenging and relevant query along with a plausibly related but ultimately unhelpful hard negative. By training on a mixture of our synthetic data and existing public data, ReasonIR-8B achieves a new state-of-the-art of 29.9 nDCG@10 without a reranker and 36.9 nDCG@10 with a reranker on BRIGHT, a widely-used reasoning-intensive information retrieval (IR) benchmark. When applied to RAG tasks, ReasonIR-8B improves MMLU and GPQA performance by 6.4% and 22.6% respectively, relative to the closed-book baseline, outperforming other retrievers and search engines. In addition, ReasonIR-8B uses test-time compute more effectively: on BRIGHT, its performance consistently increases with longer and more information-rich rewritten queries, and it continues to outperform other retrievers when combined with an LLM reranker. Our training recipe is general and can easily be extended to future LLMs; to this end, we open-source our code, data, and model.

Authors (11)
  1. Rulin Shao (20 papers)
  2. Rui Qiao (47 papers)
  3. Varsha Kishore (8 papers)
  4. Niklas Muennighoff (56 papers)
  5. Xi Victoria Lin (39 papers)
  6. Daniela Rus (181 papers)
  7. Bryan Kian Hsiang Low (77 papers)
  8. Sewon Min (45 papers)
  9. Wen-tau Yih (84 papers)
  10. Pang Wei Koh (64 papers)
  11. Luke Zettlemoyer (225 papers)

Summary

Existing retrievers often struggle with tasks that require complex reasoning because their training data typically consists of short, factual queries paired with documents that provide direct answers. This contrasts with reasoning-intensive tasks, where relevant documents might offer background knowledge, methodologies, or examples rather than a simple factual answer. The paper "ReasonIR: Training Retrievers for Reasoning Tasks" (Shao et al., 29 Apr 2025) introduces ReasonIR-8B, a bi-encoder retriever specifically designed for these reasoning-intensive tasks. The core innovation lies in ReasonIR-Synthesizer, a pipeline for generating synthetic training data that focuses on challenging, reasoning-intensive queries and difficult negative examples.

The ReasonIR-Synthesizer pipeline generates two main types of synthetic data to address the shortcomings of existing datasets:

  1. Varied-Length (VL) Data: This data is designed to improve the retriever's ability to handle longer and more complex queries, extending its effective context length. It involves generating long queries (300-2000 words) paired with relevant positive documents, following a distillation idea.
  2. Hard Query (HQ) Data: This data trains the retriever on queries that require reasoning beyond simple keyword or semantic matching. It is generated from "reasoning-worthy" seed documents (like those from BRIGHT (Su et al., 16 Jul 2024)) using a "human-like brainstorm guideline" to create challenging, self-contained queries.

A crucial component for both VL and HQ data generation is multi-turn hard negative generation. Unlike traditional hard negative mining, which relies on existing retrievers (and these perform poorly on reasoning queries), this approach directly synthesizes documents that are superficially relevant but ultimately unhelpful. The hard negative is generated in a separate turn after the query and positive document, making it more challenging and more specific to the synthetic query. Analysis shows that this synthetic data is significantly more challenging and covers a wider range of query lengths than public datasets like MS MARCO or Natural Questions.

ReasonIR-8B is trained by fine-tuning Llama3.1-8B on a mixture of public datasets (like MS MARCO, Natural Questions, HotpotQA) and the synthetic data generated by ReasonIR-Synthesizer. The model is adapted to use a bi-directional attention mask, and contrastive training is used with a loss objective:

\ell(q) = -\log \frac{\exp\bigl(\cos(h(q), h(d^+)) / \tau\bigr)}{\sum_{d_j \in \{d^+\} \cup D^-} \exp\bigl(\cos(h(q), h(d_j)) / \tau\bigr)}

where $h$ is the retriever's encoding function, $q$ is the query, $d^+$ is a positive document, $D^-$ is the set of negative documents (in-batch and synthetic hard negatives), and $\tau$ is a temperature parameter (set to 0.02). Techniques like GradCache and cross-device negatives are employed to enable the large batch sizes needed for robust contrastive training.
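The loss can be sketched in plain NumPy for a single query (a toy version; actual training operates on large batches with GradCache and cross-device negatives, and $h$ is the fine-tuned LLM encoder rather than raw vectors):

```python
import numpy as np

def info_nce_loss(q, d_pos, d_negs, tau=0.02):
    """Contrastive loss over temperature-scaled cosine similarities."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Similarity of the query to the positive (index 0) and to each negative.
    sims = np.array([cos(q, d_pos)] + [cos(q, n) for n in d_negs]) / tau
    sims -= sims.max()  # numerical stability before exponentiating
    probs = np.exp(sims) / np.exp(sims).sum()
    return -np.log(probs[0])  # negative log-likelihood of the positive
```

When the query embedding is close to the positive and far from the negatives, the loss approaches zero; hard negatives that score almost as high as the positive are exactly what keep this loss informative.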

For improving test-time performance, the paper explores two main strategies:

  1. Query Rewriting: Using an LLM to expand the original query into a longer, more detailed, and informative "Rewritten-query". ReasonIR-8B is shown to consistently benefit from longer rewritten queries, unlike other retrievers which plateau or worsen.
  2. ReasonIR-Rerank: A simple yet effective LLM reranking method. It addresses the issue of ties in scores produced by naive LLM rerankers by interpolating the reranker's scores with the base retriever's scores (either ReasonIR-8B or BM25). Using Qwen2.5-32B-Instruct in a zero-shot setting, this method achieves high performance without requiring additional training or generating lengthy reasoning traces like some prior LLM rerankers.
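The tie-breaking interpolation in point 2 can be sketched as a weighted blend of the coarse LLM reranker scores with the base retriever's continuous scores. The weight `w` and the score scales below are illustrative assumptions, not values from the paper:

```python
def rerank_with_tiebreak(candidates, w=0.1):
    """candidates: list of (doc_id, llm_score, retriever_score) tuples.
    LLM rerankers often emit coarse (near-integer) scores with many ties;
    blending in the retriever's continuous score breaks those ties while
    leaving clearly separated LLM scores in their original order."""
    return sorted(candidates, key=lambda c: c[1] + w * c[2], reverse=True)
```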

The paper evaluates ReasonIR-8B on both information retrieval (IR) and retrieval-augmented generation (RAG) benchmarks. On the BRIGHT (Su et al., 16 Jul 2024) IR benchmark, ReasonIR-8B achieves state-of-the-art results: 24.4 nDCG@10 with original queries and 29.9 nDCG@10 with GPT-4-rewritten queries. Adding the ReasonIR-Rerank method (with Qwen2.5-32B-Instruct) further improves performance to 36.9 nDCG@10. Notably, ReasonIR-8B with query rewriting outperforms several LLM reranker baselines while requiring far less test-time compute (estimated at over 200x fewer FLOPs than a 32B-parameter LLM reranker). A hybrid approach combining ReasonIR-8B and BM25 scores also improves performance, suggesting that dense and sparse retrieval are complementary on these tasks.
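Combining dense and sparse scores requires putting them on a common scale, since cosine similarities lie roughly in [-1, 1] while BM25 scores are unbounded. A minimal sketch using min-max normalization and an unweighted sum (the paper's exact combination scheme may differ):

```python
import numpy as np

def hybrid_scores(dense_scores, sparse_scores):
    """Blend dense (e.g. cosine) and sparse (e.g. BM25) scores for the same
    candidate list by normalizing each to [0, 1] and summing."""
    def minmax(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return minmax(dense_scores) + minmax(sparse_scores)
```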

In RAG tasks using MMLU (Hendrycks et al., 2020) and GPQA (Rein et al., 2023) with a filtered MassiveDS (Shao et al., 9 Jul 2024) datastore, ReasonIR-8B also performs strongly. It improves MMLU by 6.4% and GPQA by 22.6% relative to closed-book baselines and outperforms other retriever and search-engine baselines. Using the reader model itself (Llama3.1-8B-Instruct for MMLU, Qwen2.5-7B-Instruct for GPQA) for query rewriting generally helps on MMLU and helps the search-engine baseline on GPQA, though it can sometimes hurt dense retrievers on GPQA, likely because the smaller model produces lower-quality rewrites for complex questions.

Ablation studies confirm the importance of the synthetic data mix. Training on a combination of public data, varied-length data, and hard query data yields the best performance, demonstrating a synergy between improving length generalization and handling reasoning-intensive queries. Including "easy" synthetic queries did not provide benefits on BRIGHT, suggesting that the difficulty and reasoning focus of the synthetic data are key.

For practical implementation, training involves fine-tuning an LLM (Llama3.1-8B) as a bi-encoder, which requires significant compute (multiple GPUs for large batch sizes, using techniques like GradCache and cross-device negatives). Deploying ReasonIR-8B involves encoding queries and documents into vector embeddings and performing cosine-similarity search, which is computationally efficient compared to cross-encoder reranking. The proposed ReasonIR-Rerank method can be integrated as a post-processing step over the top-k retrieved documents using an LLM API (like Qwen2.5-32B-Instruct), adding some latency and cost but improving ranking quality. The open-sourced code, data, and model facilitate further research and application.
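The deployment-side search reduces to a cosine-similarity lookup over precomputed document embeddings. A brute-force NumPy sketch (production systems would typically use an approximate nearest-neighbor index such as FAISS instead):

```python
import numpy as np

def retrieve_top_k(query_emb, doc_embs, k=5):
    """Brute-force cosine-similarity search.
    query_emb: (d,) query vector; doc_embs: (n, d) document embedding matrix."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                 # cosine similarity of each document to q
    top = np.argsort(-scores)[:k]  # indices of the k highest-scoring docs
    return top, scores[top]
```

Because both sides are L2-normalized up front, the similarity is a single matrix-vector product, which is what makes bi-encoder retrieval cheap relative to running a cross-encoder over every candidate pair.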
