ReasonIR: Models for Reasoning-Intensive Retrieval

Updated 6 February 2026
  • ReasonIR is a family of retrieval models and training protocols designed for complex, reasoning-intensive IR tasks beyond surface-level lexical matching.
  • The flagship model, ReasonIR-8B, leverages a bi-encoder architecture with bidirectional attention and contrastive learning, achieving superior nDCG@10 scores on advanced benchmarks.
  • By integrating synthetic data generation and long-context modeling, ReasonIR effectively handles long queries and multi-step reasoning, setting a new standard for retrieval-augmented generation.

ReasonIR is a family of retrieval models and training protocols targeting reasoning-intensive information retrieval (IR) and retrieval-augmented generation (RAG). Unlike conventional document retrievers focused on short, factual queries or surface-level lexical matching, ReasonIR is explicitly designed to select and rank documents that support higher-order reasoning over complex, information-rich inputs, including in multi-step and temporally grounded settings. The ReasonIR paradigm is embodied in its flagship model, ReasonIR-8B, which integrates novel synthetic data generation, a tailored bi-encoder architecture, and evaluation on state-of-the-art reasoning and temporal benchmarks (Shao et al., 29 Apr 2025, Abdallah et al., 14 Jan 2026).

1. Model Architecture and Training Protocol

ReasonIR-8B is based on Llama3.1-8B (approximately 8 billion non-embedding parameters), fine-tuned for bi-encoder retrieval. The causal attention mask of standard LLMs is replaced with bidirectional attention, so that tokens within each encoded input (query or document) can attend in both directions. The final hidden state representations are average-pooled and projected via a linear layer; query and document embeddings are then compared via cosine similarity. All embeddings are $\ell_2$-normalized.

The model supports effective context lengths of up to 2,048 tokens, which is critical for handling long, rewritten queries and detailed documents at test time. This context length is leveraged extensively during retrieval for complex reasoning tasks.
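The encoding path described above (mean pooling over final hidden states, a linear projection, $\ell_2$ normalization, cosine scoring) can be sketched as follows. The dimensions and the random projection matrix are illustrative stand-ins, not the released model's sizes or weights:

```python
import numpy as np

# Minimal sketch of a ReasonIR-style bi-encoder scoring path. The projection
# matrix W is a hypothetical stand-in for the trained linear layer.

HIDDEN, EMB = 256, 128
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(HIDDEN, EMB))  # hypothetical projection

def embed(hidden_states: np.ndarray) -> np.ndarray:
    """hidden_states: (seq_len, HIDDEN) final-layer states from the encoder."""
    pooled = hidden_states.mean(axis=0)           # average pooling
    projected = pooled @ W                        # linear projection
    return projected / np.linalg.norm(projected)  # L2 normalization

def score(query_states: np.ndarray, doc_states: np.ndarray) -> float:
    # After normalization, cosine similarity reduces to a dot product.
    return float(embed(query_states) @ embed(doc_states))

q = rng.normal(size=(32, HIDDEN))   # stand-in for encoded query tokens
d = rng.normal(size=(80, HIDDEN))   # stand-in for encoded document tokens
print(round(score(q, q), 4))        # identical inputs -> similarity 1.0
```

Because queries and documents are encoded independently, document embeddings can be precomputed and indexed, which is what makes the bi-encoder design efficient at retrieval time.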

Contrastive learning is employed as the principal training objective: for each query $q$, positive document $d^+$, and a negative set $D^-$ (including in-batch and hard negatives), the loss is

$$\ell(q) = -\log \frac{\exp\bigl(\cos(h(q),h(d^+))/\tau\bigr)}{\exp\bigl(\cos(h(q),h(d^+))/\tau\bigr) + \sum_{d^-\in D^-} \exp\bigl(\cos(h(q),h(d^-))/\tau\bigr)},$$

with temperature $\tau = 0.02$ (Shao et al., 29 Apr 2025).

2. Synthetic Data Generation and Training Set Composition

The ReasonIR-Synthesizer pipeline generates two key classes of synthetic examples in addition to leveraging public datasets:

  • Varied-Length (VL) Examples: For each task instruction, Llama3.1-70B generates a “query” (300–2,000 words) and a corresponding positive document. Hard negatives are then synthesized by prompting the LLM to construct superficially similar yet semantically irrelevant documents conditioned on the query and positive.
  • Hard-Query (HQ) Examples: The pipeline samples high-quality, reasoning-worthy seed documents (filtered with the FineWeb-Edu classifier) from scientific and technical corpora. LLMs generate challenging, self-contained queries requiring higher-order reasoning—not mere lexical overlap. A second prompt conditions on the query and positive to produce plausible yet unhelpful hard negatives.
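The second-stage hard-negative synthesis shared by both pipelines can be sketched as a conditioned prompt. The prompt wording and the `generate` callable below are placeholders, not the actual ReasonIR-Synthesizer prompts:

```python
# Hedged sketch of hard-negative synthesis: condition an LLM on the query and
# the positive document to produce a superficially similar but unhelpful text.
# HARD_NEG_PROMPT is an assumed template, not the released prompt.

HARD_NEG_PROMPT = (
    "Given the query and a relevant document, write a document that looks "
    "similar on the surface but does not actually answer the query.\n"
    "Query: {query}\nRelevant document: {positive}\nMisleading document:"
)

def synthesize_hard_negative(generate, query: str, positive: str) -> str:
    """`generate` is any text-completion callable (e.g., an LLM API wrapper)."""
    return generate(HARD_NEG_PROMPT.format(query=query, positive=positive))

# Usage with a stub generator in place of a real LLM call:
stub = lambda prompt: "A plausible but unhelpful passage..."
print(synthesize_hard_negative(stub, "Why do tides occur?", "Tides arise from..."))
```

Conditioning on both the query and the positive is the key design choice: it steers the LLM toward negatives that share vocabulary and topic with the positive while withholding the reasoning-relevant content.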

The full training mixture consists of:

  • 1,383,877 public examples (MS MARCO, NQ, HotpotQA, etc.; ~80% of training steps),
  • 244,970 Varied-Length synthetic examples (~14%),
  • 100,521 Hard-Query synthetic examples (~6%).
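The quoted percentages can be checked against the raw counts (shares here are over example counts; the document states the public share over training steps, which lines up closely):

```python
# Quick arithmetic check of the training-mixture proportions.
counts = {"public": 1_383_877, "varied_length": 244_970, "hard_query": 100_521}
total = sum(counts.values())
shares = {k: round(100 * v / total, 1) for k, v in counts.items()}
print(total, shares)  # 1,729,368 examples; ~80.0 / 14.2 / 5.8 percent
```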

Ablation studies demonstrate that only the combined inclusion of public, VL, and HQ examples yields maximal retrieval accuracy on reasoning-centric benchmarks; using only public+VL or public+HQ gives modest gains (Shao et al., 29 Apr 2025).

3. Evaluation Metrics, Benchmarks, and Empirical Results

ReasonIR models are primarily evaluated on the BRIGHT benchmark for reasoning-intensive IR. The core metric is nDCG@10,

$$\mathrm{nDCG}@10 = \frac{\sum_{i=1}^{10} \frac{2^{\mathrm{rel}_i}-1}{\log_2(i+1)}}{\mathrm{IDCG}@10},$$

where $\mathrm{rel}_i$ is the graded relevance label of the document at rank $i$.
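A direct implementation of this formula, with graded labels assumed on a 0–2 scale for illustration:

```python
import numpy as np

# nDCG@k with exponential gain 2^rel - 1 and log2(i+1) discount, as above.

def dcg_at_k(rels, k=10):
    rels = np.asarray(rels, dtype=float)[:k]
    # Discounts are log2(i+1) for ranks i = 1..k.
    return float(((2 ** rels - 1) / np.log2(np.arange(2, len(rels) + 2))).sum())

def ndcg_at_k(ranked_rels, k=10):
    # IDCG@k: DCG of the ideal (relevance-sorted) ranking.
    ideal = dcg_at_k(sorted(ranked_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0

# Relevance labels of the top-10 retrieved documents (hypothetical example):
print(round(ndcg_at_k([2, 0, 1, 0, 0, 2, 0, 0, 1, 0]), 3))
```

Note that BRIGHT scores are conventionally reported as nDCG@10 multiplied by 100 (e.g., 29.9 rather than 0.299).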

On BRIGHT, ReasonIR-8B, without reranking, achieves nDCG@10 of 29.9 (GPT-4 rewritten queries), surpassing BM25 (26.5) and strong dense retrievers (GRIT-LM-7B at 23.4). Combined with zero-shot Qwen2.5-32B reranking, ReasonIR-Rerank attains 36.9 nDCG@10, establishing a new overall state of the art (Shao et al., 29 Apr 2025). Comparable systems such as ReasonRank achieve 40.6 under listwise reasoning-augmented ranking (Liu et al., 9 Aug 2025).

ReasonIR-8B also delivers substantial gains in RAG setups: applied to MMLU and GPQA (using MassiveDS and Llama3.1-8B or Qwen2.5-7B), it yields relative performance improvements of +5.5% on MMLU and +22.7% on GPQA over strong closed-book LLM baselines (Shao et al., 29 Apr 2025).

Evaluation on the TEMPO benchmark further reveals ReasonIR’s strengths and limitations under temporal reasoning regimes. On Task 1 (overall retrieval), ReasonIR attains nDCG@10=27.2, Temporal Coverage@10=72.4% (best among all tested models), and Temporal Precision@10=57.4%. However, complex classes requiring synthesis across multiple periods (e.g., TCP – Trend/Cross-Period) remain challenging (Abdallah et al., 14 Jan 2026).

4. Distinctive Capabilities and Test-Time Properties

ReasonIR-8B demonstrates strong utilization of longer input queries at test time. Unlike other retrievers, ReasonIR-8B’s nDCG@10 continues to improve as the length of rewritten queries increases, up to at least 1,024 tokens; competing methods plateau or degrade beyond 256 tokens.

When combined with large LLM rerankers (e.g., QwenRerank or ReasonRank), the retrieval pipeline achieves additive improvements, with reranked nDCG@10 exceeding prior best methods by 7.5+ points (Shao et al., 29 Apr 2025, Liu et al., 9 Aug 2025). Simple tie-breaking interpolation between ReasonIR and reranker scores outperforms more compute-intensive rerankers with significantly reduced inference cost.
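The tie-breaking interpolation can be sketched as a weighted sum of the two score lists. The weighting scheme below (a small $\alpha$ on retriever scores, so the reranker dominates and retriever scores mainly break its ties) is an assumption for illustration:

```python
# Hedged sketch of retriever/reranker score interpolation with tie-breaking.
# alpha is a hypothetical weight, small enough that retriever scores only
# matter when reranker scores are (near-)tied.

def interpolate(retriever_scores, reranker_scores, alpha=0.1):
    """Both inputs: dict doc_id -> score. Returns doc_ids, best first."""
    combined = {
        doc: alpha * retriever_scores.get(doc, 0.0) + reranker_scores.get(doc, 0.0)
        for doc in reranker_scores
    }
    return sorted(combined, key=combined.get, reverse=True)

retr = {"d1": 0.72, "d2": 0.65, "d3": 0.70}
rerank = {"d1": 3.0, "d2": 3.0, "d3": 1.0}   # reranker ties d1 and d2
print(interpolate(retr, rerank))              # retriever breaks the tie: d1 first
```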

5. Advances in Reasoning-Intensive and Temporal Retrieval

Recent work establishes that surface-level matching is insufficient for reasoning-heavy information needs. ReasonIR benchmarks and ablation analyses highlight several factors:

  • Step-wise Planning and Multi-hop Retrieval: On TEMPO, introducing decomposed “Query+Step” or “Query+All” inputs dramatically improves retrieval of multi-period, temporally grounded evidence. ReasonIR’s architecture is well suited for such step-aware ranking protocols (Abdallah et al., 14 Jan 2026).
  • Temporal Metadata Conditioning: Explicit normalization of temporal intent and anchors in queries can boost ReasonIR’s nDCG@10 by 8 points. This finding motivates future integration of structured temporal representations within the ReasonIR pipeline.
  • Domain Adaptation: Significant variance in per-domain performance (e.g., finance, legal, history) suggests that incorporation of domain-specific temporal ontologies or fine-tuning on specialized corpora can further enhance ReasonIR’s generality (Abdallah et al., 14 Jan 2026).
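The "Query+Step" input construction mentioned above can be sketched as a simple formatting step, issuing one retrieval query per decomposed reasoning step. The formatting is assumed for illustration, not TEMPO's exact protocol:

```python
# Hedged sketch of step-aware query construction: append each decomposed
# reasoning step to the base query so step-specific evidence can be retrieved.

def step_queries(query: str, steps: list[str]) -> list[str]:
    return [f"{query}\nStep: {s}" for s in steps]

qs = step_queries(
    "How did interest rates change between 2008 and 2015?",
    ["Find rates in 2008", "Find rates in 2015", "Compare the two periods"],
)
print(len(qs))  # one retrieval query per reasoning step
```

A long-context retriever benefits here because each expanded query carries both the original information need and the step-specific focus.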

The introduction of temporal metrics such as Temporal Coverage@k (TC@k) and Temporal Precision@k (TP@k) highlights new evaluation axes and encourages development of retrievers that rank temporally relevant evidence early and comprehensively.
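Illustrative implementations of these two metrics follow. The exact TEMPO definitions may differ; here TC@k is assumed to be the fraction of gold time periods represented in the top-k results, and TP@k the fraction of top-k results tagged with a relevant period:

```python
# Assumed formulations of Temporal Coverage@k and Temporal Precision@k,
# operating on the time-period tag of each retrieved document.

def tc_at_k(ranked_periods, gold_periods, k=10):
    covered = set(ranked_periods[:k]) & set(gold_periods)
    return len(covered) / len(gold_periods) if gold_periods else 0.0

def tp_at_k(ranked_periods, gold_periods, k=10):
    top = ranked_periods[:k]
    return sum(p in set(gold_periods) for p in top) / len(top) if top else 0.0

# Each retrieved doc tagged with its (hypothetical) time period:
ranked = ["1990s", "2000s", "1990s", "2010s", "1980s"]
gold = {"1990s", "2000s", "2010s", "2020s"}
print(tc_at_k(ranked, gold, k=5), tp_at_k(ranked, gold, k=5))
```

Under these definitions the two metrics pull in different directions: coverage rewards ranking evidence from many required periods early, while precision penalizes padding the top ranks with temporally irrelevant documents.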

6. Practical Implications and Future Directions

The ReasonIR pipeline, combining synthetic data generation, long-context bi-encoder modeling, and contrastive learning with hard negatives, is model-agnostic and directly extensible to future backbone LLMs with larger parameter scales or improved attention mechanisms. The open-sourced code, data, and model checkpoints provide reproducible recipes for adapting and scaling ReasonIR methodology (Shao et al., 29 Apr 2025).

A plausible implication is that next-generation retrieval frameworks will increasingly require joint modeling of structured metadata, explicit step- or plan-aware representations, and integration with downstream RAG systems in a confidence- or temporally-aware fashion. Systematic evaluation on benchmarks like TEMPO and BRIGHT foregrounds the dual importance of reasoning accuracy and temporal completeness, setting the stage for continuous progress in reasoning-intensive retrieval.

7. Comparison to Reasoning-Intensive Rerankers

Reranking approaches such as ReasonRank (Liu et al., 9 Aug 2025) employ automated synthesis of reasoning-rich training data, explicit supervised fine-tuning on listwise rationales, and reinforcement learning with multi-view ranking rewards combining NDCG, Recall, and RBO. ReasonRank achieves SOTA on BRIGHT (40.6 leaderboard score), R2MED, and demonstrates strong generalization, all with substantially lower latency than pointwise baselines. Integration of chain-of-thought rationales and consistency filtering in training further distinguishes this generation of rerankers.

Comparatively, ReasonIR-8B’s retriever-centric design, with its focus on bi-encoder efficiency and step-aware scaling, complements reranking strategies; the combination yields superior recall and downstream QA accuracy. The interaction between retriever and reranker quality will be a central axis of further research in reasoning-intensive IR (Shao et al., 29 Apr 2025, Liu et al., 9 Aug 2025).
