MARAG-R1: Agentic Multi-tool RAG Framework
- MARAG-R1 is a dynamic multi-tool retrieval-augmented generation framework that integrates agentic reasoning with diverse retrieval methods.
- It employs supervised fine-tuning followed by reinforcement learning to optimize multi-step tool selection and iterative evidence synthesis.
- Empirical results demonstrate that MARAG-R1 outperforms traditional single-retriever models in factual accuracy on corpus-level and multi-hop reasoning benchmarks.
MARAG-R1 (Multi-tool Agentic Retrieval-Augmented Generation-R1) is a reinforcement-learned retrieval-augmented generation (RAG) framework that enables LLMs to transcend traditional single-retriever approaches by dynamically orchestrating multiple retrieval tools. The framework is designed to systematically interleave agentic reasoning and retrieval, facilitating broad and precise corpus-level evidence acquisition and synthesis. MARAG-R1 demonstrates substantial improvements in factuality and coverage across several reasoning benchmarks, achieving state-of-the-art results.
1. Motivation and Problem Addressed
LLMs, although proficient at reasoning and text generation, are fundamentally restricted by the static nature of their pretraining data, resulting in factual errors and poor adaptability to new information. Conventional RAG systems attempt to alleviate this by grounding LLM predictions in external data. However, these systems predominantly rely on a single retriever with a fixed top-$k$ selection, limiting the scope of accessible information. This single-retriever paradigm is the primary bottleneck in corpus-level question answering (QA) and multi-hop reasoning, particularly when tasks require multi-step synthesis, aggregation, and extensive context integration.
MARAG-R1 is constructed to address these limitations by enabling dynamic, multi-tool coordination during retrieval, moving beyond static pipeline architectures and facilitating deeper, adaptive interaction between reasoning and evidence gathering.
2. Framework Architecture and Retrieval Tools
MARAG-R1’s architecture centers on a suite of four complementary retrieval tools:
| Tool Name | Purpose | Operation |
|---|---|---|
| Semantic Retriever | Context-sensitive dense retrieval | Embedding-based similarity search |
| Keyword Retriever | Precision on factual expressions/entities | Token/term-based search |
| Document Filter | Efficient evidence narrowing | Metadata, logical, or attribute-based filtering |
| Aggregation Tool | Set/statistical aggregation & synthesis | Counting, ranking, min/max, sorting |
The agent plans and executes sequences of retrieval and tool-use actions, integrating evidence through iterative cycles of reasoning and synthesis. This dynamic, agentic workflow stands in contrast to static retrieval setups: tool invocation is an explicit, learnable decision at each reasoning step.
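For concreteness, the following is a minimal sketch of how such a tool suite could be exposed to the agent as a dispatchable registry; all names (ToolCall, semantic_retrieve, execute, etc.) are illustrative assumptions, not the authors' actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCall:
    tool: str   # which retrieval tool the policy chose
    args: dict  # tool-specific arguments (query, filters, operation, ...)

def semantic_retrieve(args: dict) -> list[str]:
    """Embedding-based similarity search (dense retrieval). Stub."""
    return []

def keyword_retrieve(args: dict) -> list[str]:
    """Token/term-based search for exact entities and expressions. Stub."""
    return []

def document_filter(args: dict) -> list[str]:
    """Narrow a candidate set via metadata or attribute predicates. Stub."""
    return []

def aggregate(args: dict) -> list[str]:
    """Set/statistical operations: counting, ranking, min/max, sorting. Stub."""
    return []

TOOLS: dict[str, Callable[[dict], list[str]]] = {
    "semantic": semantic_retrieve,
    "keyword": keyword_retrieve,
    "filter": document_filter,
    "aggregate": aggregate,
}

def execute(call: ToolCall) -> list[str]:
    # Tool invocation is an explicit action chosen by the learned policy.
    return TOOLS[call.tool](call.args)
```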
A reasoning trajectory is formally represented as $\tau = (s_1, \dots, s_T)$, where each step $s_t = (r_t, a_t, d_t)$ comprises intermediate reasoning ($r_t$), tool selection and arguments ($a_t$), and the resulting documents ($d_t$).
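Under the same illustrative naming (reusing ToolCall from the sketch above), the trajectory $\tau$ maps naturally onto a simple record type:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    reasoning: str        # r_t: intermediate reasoning text
    action: ToolCall      # a_t: tool selection plus arguments
    documents: list[str]  # d_t: documents returned by the tool

@dataclass
class Trajectory:
    question: str
    steps: list[Step] = field(default_factory=list)  # s_1, ..., s_T
    answer: str = ""      # final answer emitted after the last step
```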
3. Training Methodology: Supervised Fine-Tuning and Reinforcement Learning
MARAG-R1 adopts a two-stage training protocol:
- Supervised Fine-Tuning (SFT): The agent is trained on expert or machine-generated trajectories curated by advanced LLMs and filtered via rejection sampling for quality. The objective is the standard next-token prediction loss over the serialized trajectory:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t} \log \pi_\theta(y_t \mid y_{<t}, x),$$

where $x$ is the input question and $y_1, \dots, y_T$ are the tokens of the target trajectory.
This produces a policy capable of basic multi-tool use and multi-step reasoning (referred to as MARAG-CS).
- Reinforcement Learning (RL): The initial policy is refined to optimize multi-tool coordination via policy-gradient methods with a REINFORCE Leave-One-Out (RLOO) baseline. For $k$ trajectories $\tau_1, \dots, \tau_k$ sampled per question, each trajectory is baselined against the mean reward of the remaining $k-1$ samples:

$$A_i = R(\tau_i) - \frac{1}{k-1} \sum_{j \neq i} R(\tau_j), \qquad \nabla_\theta J(\theta) = \frac{1}{k} \sum_{i=1}^{k} A_i \, \nabla_\theta \log \pi_\theta(\tau_i)$$
Reward Design:
- Answer Reward ($R_{\text{ans}}$): token-level F1 between the predicted and gold answers.
- Document Coverage Reward ($R_{\text{doc}}$): measures the overlap between retrieved and gold-support documents.
- Tool Exploration Reward ($R_{\text{tool}}$): encourages sufficient exploration without redundant calls.

The composite reward combines these terms with weighting coefficients:

$$R(\tau) = \alpha \, R_{\text{ans}} + \beta \, R_{\text{doc}} + \gamma \, R_{\text{tool}}.$$
RL enables the agent to learn nuanced tool-use strategies, improving efficiency and the quality of reasoning-retrieval interaction beyond imitation learning.
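A minimal sketch of the reward and baseline computations follows; the token-level F1 is standard, while the composite weights and helper names (token_f1, composite_reward, rloo_advantages) are illustrative assumptions rather than values taken from the paper.

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between predicted and gold answer strings (R_ans)."""
    pred_toks, gold_toks = pred.split(), gold.split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def composite_reward(r_ans: float, r_doc: float, r_tool: float,
                     alpha: float = 1.0, beta: float = 0.5,
                     gamma: float = 0.1) -> float:
    """Weighted sum of the three reward terms; coefficients are illustrative."""
    return alpha * r_ans + beta * r_doc + gamma * r_tool

def rloo_advantages(rewards: list[float]) -> list[float]:
    """REINFORCE Leave-One-Out: baseline each of the k >= 2 samples against
    the mean reward of the other k-1 samples for the same question."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

For example, `rloo_advantages([0.2, 0.8])` yields `[-0.6, 0.6]`: the better-rewarded trajectory is reinforced while the worse one is suppressed, without training a separate value network.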
4. Dynamic Multi-Retriever Coordination and Reasoning-Retrieval Interleaving
The core innovation of MARAG-R1 is explicit, multi-tool agentic retrieval coordination. At each reasoning step, tool invocation is a decision made by the learned policy, adapting to the current knowledge state and task requirements. The agent is able to:
- Sequence and combine different retrieval mechanisms (semantic, keyword, filter, aggregation).
- Iteratively update its context as retrievals accumulate and evidence is synthesized.
- Alternate between deepening the retrieval context and advancing reasoning steps, until sufficient information is gathered for final output.
This agentic process allows MARAG-R1 to move beyond static top-$k$ constraints and leverage compositionally diverse retrieval strategies, outperforming iterative and graph-based retrieval models in both breadth and precision.
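To make the interleaving concrete, here is a minimal sketch of the control loop, reusing the hypothetical ToolCall/execute names from Section 2; the policy stub llm_step stands in for the trained model and is an assumption, not the authors' interface.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PolicyOutput:
    reasoning: str
    action: Optional[ToolCall] = None   # tool call to run next, if any
    final_answer: Optional[str] = None  # set when the agent decides to stop

def llm_step(question: str, context: list[str],
             force_answer: bool = False) -> PolicyOutput:
    """Stub: a real system would decode the next reasoning step,
    tool call, or final answer from the trained LLM policy here."""
    if force_answer or context:
        return PolicyOutput(reasoning="", final_answer="(answer)")
    return PolicyOutput(reasoning="retrieve supporting facts",
                        action=ToolCall("semantic", {"query": question}))

def answer_question(question: str, max_steps: int = 8) -> str:
    context: list[str] = []                   # accumulated evidence
    for _ in range(max_steps):
        step = llm_step(question, context)    # policy decides the next move
        if step.final_answer is not None:
            return step.final_answer          # enough evidence gathered
        context.extend(execute(step.action))  # run the chosen tool, add docs
    # Budget exhausted: answer from whatever evidence was gathered.
    return llm_step(question, context, force_answer=True).final_answer
```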
5. Empirical Evaluation and Results
MARAG-R1’s empirical evaluation utilizes the following datasets and metrics:
- GlobalQA: Assesses four corpus-level QA types (TopK, Count, Sort, MinMax), demanding aggregation, statistical, and sorting skills across large document corpora.
- 2WikiMultiHopQA and HotpotQA: Benchmark multi-hop reasoning with cross-document evidence integration.
Metrics include token-level answer accuracy (F1) and document-evidence coverage (D-F1@20).
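Assuming D-F1@20 denotes document-level F1 computed over the top 20 retrieved documents against the gold supporting set (a plausible reading of the metric name, not a definition taken from the paper), a short sketch:

```python
def doc_f1_at_k(retrieved: list[str], gold: set[str], k: int = 20) -> float:
    """Document-level F1 over the top-k retrieved document IDs."""
    top = set(retrieved[:k])
    hits = len(top & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)
```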
| Model | Answer F1 (14B backbone) | D-F1@20 (14B backbone) |
|---|---|---|
| StandardRAG | ~1.5 | ~8.1 |
| IRCoT | ~0.1 | ~8.8 |
| HyperGraphRAG | ~0.1 | - |
| Search-R1 | ~2.9 | ~9.2 |
| ReCall | 14.25 | 20.00 |
| MARAG-CS (SFT-only) | 28.92 | 39.83 |
| MARAG-R1 | 31.22 | 42.11 |
- MARAG-R1 outperforms single-retriever, multi-hop, graph-based, and RL-based agentic retrieval baselines by wide margins.
- Ablation studies indicate substantial performance loss when any reward component or retrieval tool (especially aggregation/keyword) is omitted.
- Transfer results show MARAG-R1 generalizes robustly, attaining the best F1 and EM scores on 2WikiMultiHopQA (26.93 F1) and HotpotQA (39.16 F1).
- Qualitative analyses reveal the agent’s ability to preserve evidence granularity and provenance, correctly synthesize multi-source information, and perform global set/statistical operations.
6. Broader Implications and Future Directions
MARAG-R1 marks a paradigm shift in retrieval-augmented LLM methodology, moving from static, single-retriever pipelines to adaptive, agentic, multi-tool coordination. Notable outcomes and implications include:
- Significant expansion in evidence coverage and factual accuracy, observed consistently on corpus-level and multi-hop reasoning benchmarks.
- The framework demonstrates that retrieval selection and composition should be dynamic and reasoning-sensitive, not fixed at preprocessing.
- Interleaved, iterative retrieval encourages a process akin to human research: updating hypotheses and context as new evidence emerges.
- A plausible implication is that future RAG systems will increasingly integrate retrieval as a fundamental component of the reasoning cycle rather than a static one-shot module.
7. References to Related Work
MARAG-R1 builds upon and extends several prior lines of research:
- [GlobalQA, HyperGraphRAG, GraphRAG]: Corpus-level QA, graph-based retrieval.
- [IRCoT, ReAct, Search-R1, ReCall]: Agentic, iterative retrieval, RL-based coordination.
- [2WikiMultiHopQA, HotpotQA]: Multi-hop QA evaluation and generalization.
- [RLOO]: Baseline for reinforcement learning optimization.
- [BGE, Qwen3]: Embedding and model backbones.
These connections situate MARAG-R1 as a unifying advance in dynamic retrieval-augmented reasoning. Its empirical results, technical innovations, and agentic design point toward a future of reasoning-centric, tool-augmented LLM architectures.