DeepRAG Framework
- DeepRAG is a retrieval-augmented generation framework that couples reasoning with retrieval through a formal Markov Decision Process formulation and hierarchical query decomposition.
- It employs iterative atomic query decomposition and imitation training to optimize the balance between retrieval actions and parametric reasoning.
- Empirical results show notable improvements on benchmarks like HotpotQA, 2WikiMultihopQA, and MedHopQA, demonstrating enhanced answer accuracy and reduced retrieval costs.
DeepRAG denotes a class of frameworks and methodologies designed to tightly couple reasoning and retrieval within Retrieval-Augmented Generation (RAG) systems. Aimed at overcoming factual hallucinations and inefficient retrieval in LLMs, DeepRAG advances a paradigm where step-wise decomposition of queries is jointly optimized with context-sensitive retrieval, adaptive policy learning, and, in some implementations, domain-specific supervision and embedding. The DeepRAG paradigm is particularly characterized by explicit Markov Decision Process (MDP) formulations, policy-driven retrieval-vs-parametric reasoning, and hierarchical decomposition, supporting significant efficiency and accuracy improvements across open-domain and domain-specific QA tasks (Guan et al., 3 Feb 2025, Ji et al., 31 May 2025).
1. Markov Decision Process Formulation and Adaptive Retrieval
DeepRAG formalizes retrieval-augmented reasoning as a Markov Decision Process $(\mathcal{S}, \mathcal{A}, P, R)$, where:
- A state $s_t$ at time $t$ encodes the original question $x$ and a history of pairs $\{(q_i, r_i)\}_{i<t}$, with each $q_i$ an atomic subquery and $r_i$ the corresponding response (either retrieved context and answer, or direct answer).
- The action $a_t$ consists of the termination decision (continue decomposing or emit the final answer) and the retrieval-vs-parametric decision (consult external documents or rely on internal knowledge).
- Transitions evolve by adding new subqueries and responses, determined either via retrieval (external documents) or parametric model generation (internal knowledge).
- The reward is concentrated on terminal transitions, penalizing unnecessary retrievals and incentivizing correct answers: $R(s_T) = \mathbb{1}[\hat{y} = y] - \lambda\, c$, with $\mathbb{1}[\hat{y} = y] = 1$ if the final answer $\hat{y}$ is correct and $0$ otherwise, $c$ the count of retrieval actions, and $\lambda > 0$ a retrieval-cost weight.
By formulating RAG as an MDP, DeepRAG enables fine-grained, query- and step-wise adaptation between retrieval and parametric reasoning—eschewing static retrieval strategies (Guan et al., 3 Feb 2025).
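The MDP above can be sketched in a few lines of Python. This is a hypothetical illustration, not the paper's implementation: `retrieve_fn` and `generate_fn` are stand-ins for the retriever and the LLM's parametric generation, and the linear cost penalty `lam` is an assumed reward shaping consistent with the description above.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the DeepRAG MDP; `retrieve_fn` and `generate_fn`
# are stand-ins for the retriever and the LLM's parametric generation.

@dataclass
class State:
    question: str
    history: list = field(default_factory=list)  # [(subquery, response), ...]
    retrievals: int = 0                          # c: count of retrieval actions

def step(state: State, subquery: str, use_retrieval: bool,
         retrieve_fn, generate_fn) -> State:
    """Transition: append one atomic (subquery, response) pair."""
    if use_retrieval:
        response = retrieve_fn(subquery)   # external documents
        state.retrievals += 1
    else:
        response = generate_fn(subquery)   # internal (parametric) knowledge
    state.history.append((subquery, response))
    return state

def terminal_reward(answer: str, gold: str, retrieval_count: int,
                    lam: float = 0.1) -> float:
    """Terminal reward: correctness minus an assumed linear retrieval penalty."""
    return (1.0 if answer == gold else 0.0) - lam * retrieval_count
```

The per-step choice between `retrieve_fn` and `generate_fn` is exactly the policy decision the framework learns.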
2. Iterative Atomic Decomposition and Imitation Training
DeepRAG achieves efficient reasoning by decomposing complex input queries into sequences of atomic subqueries, each targeted and minimal in scope. The atomicity reduces extraneous retrieval and narrows search to requisite knowledge units. Atomic decomposition is constructed via a beam/binary-tree search exploring all combinations of retrieval/parametric steps per subquery, with supervision distilled from minimal-retrieval trajectories that yield correct final answers.
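The decomposition search can be sketched as exhaustive enumeration over retrieve/parametric assignments, i.e. the leaves of the binary decision tree. Here `answer_fn` is a hypothetical oracle standing in for the LLM-plus-retriever rollout; in practice beam search prunes this space.

```python
from itertools import product

# Toy sketch of DeepRAG's decomposition search: enumerate every
# retrieve/parametric assignment over n atomic subqueries and keep the
# correct trajectory with the fewest retrievals. `answer_fn` is a
# hypothetical oracle standing in for the LLM-plus-retriever rollout.

def minimal_retrieval_trajectory(n_subqueries, answer_fn, gold):
    best = None  # (retrieval_count, decisions)
    for decisions in product([False, True], repeat=n_subqueries):
        if answer_fn(decisions) != gold:
            continue  # supervision comes only from correct final answers
        cost = sum(decisions)
        if best is None or cost < best[0]:
            best = (cost, decisions)
    return best
```

Returning `None` when no assignment answers correctly mirrors the fact that only minimal-retrieval trajectories with correct final answers are distilled into supervision.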
Imitation learning forms Stage I of DeepRAG’s training, where the LLM is fine-tuned to reproduce the optimal sequence of subqueries and retrieval decisions extracted during tree search. The loss is the trajectory negative log-likelihood with retrieval tokens marginalized out for parametric-only branches, $\mathcal{L}(\theta) = -\sum_t m_t \log P_\theta(o_t \mid x, o_{<t})$, where the mask $m_t$ zeroes out retrieved-document tokens for parametric reasoning (Guan et al., 3 Feb 2025).
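A toy illustration of this masked objective, using placeholder token probabilities rather than real model outputs:

```python
import math

# Toy illustration of the Stage-I imitation objective: negative
# log-likelihood over target tokens, with retrieved-document tokens
# masked out (mask == 0) on parametric-only branches. The probabilities
# are placeholder values, not real model outputs.

def masked_nll(token_probs, mask):
    """Mean NLL over the tokens whose mask is 1."""
    terms = [-math.log(p) for p, m in zip(token_probs, mask) if m]
    return sum(terms) / len(terms)
```

Masking the retrieved-context tokens keeps parametric branches from being trained to reproduce documents the model never consulted.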
3. Chain-of-Calibration Optimization and Policy Learning
Beyond imitation, DeepRAG enforces preference for minimal yet sufficient retrieval through a second-stage fine-tuning, termed “Chain of Calibration.” Here, for each subquery $q_t$, the model is presented with pairs of continuations—one with retrieval, one without—and trained to prefer continuations consistent with minimal-retrieval but correct-answer trajectories. This is operationalized via a DPO-style loss: $\mathcal{L} = -\log \sigma\big(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\big)$, where $y_w$, $y_l$ denote “winner” and “loser” response sequences by retrieval status (Guan et al., 3 Feb 2025).
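Assuming the standard DPO objective with inverse-temperature `beta`, the calibration step can be sketched as follows; the log-probabilities are placeholders for policy and reference-model scores of the winner and loser continuations.

```python
import math

# Sketch of the Chain-of-Calibration step using the standard DPO
# objective. Inputs are placeholder log-probabilities of the "winner"
# (minimal-retrieval, correct) and "loser" continuations under the
# trained policy and a frozen reference model.

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_w) - (logp_l - ref_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference the margin is zero and the loss is $\log 2$; the loss shrinks as the policy shifts probability mass toward the winner.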
Imitation and calibration stages collectively ensure that the LLM’s retrieval-vs-parametric reasoning decisions align with the underlying knowledge boundaries, balancing retrieval cost against empirical answer accuracy.
4. Domain-Aware Supervision and Specialized Instantiations
DeepRAG variants extend the core framework with domain- and task-specific supervision:
- In biomedical multi-hop QA for MedHopQA, question decomposition is performed by a hierarchical reasoning module (DeepSeek R1), emitting claims annotated by hierarchical depth. Subqueries derived from claims are processed as MDP actions, guided by concept-level rewards computed via UMLS ontology overlap: $r_{\mathrm{concept}} = \frac{|C_{\mathrm{pred}} \cap C_{\mathrm{gold}}|}{|C_{\mathrm{gold}}|}$, where $C_{\mathrm{pred}}$ is the set of concepts linked in the retrieved/generated text and $C_{\mathrm{gold}}$ the gold-standard label set. The overall trajectory reward accumulates classic RAG-Gym metrics (sufficiency, utility, redundancy) and concept-level scores; the full training objective combines these reward terms with grid-searched weighting hyperparameters (Ji et al., 31 May 2025).
- Custom language-specific embedding modules (e.g., for Hindi) further illustrate specialization, involving tailored tokenization and transformer architectures optimized by contrastive learning over domain corpora (23% gain over multilingual baselines is reported for Hindi) (M, 11 Mar 2025).
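The concept-level reward described above reduces to a set-overlap computation. The linking step (text to UMLS concept identifiers) is assumed and not shown, and normalizing by the gold set is one plausible choice of denominator:

```python
# Sketch of the concept-level reward: overlap between concepts linked in
# the generated/retrieved text and the gold-standard set. Concept linking
# itself (text -> UMLS concept IDs) is assumed and not shown.

def concept_reward(predicted_concepts, gold_concepts):
    """Fraction of gold concepts recovered (0.0 if the gold set is empty)."""
    pred, gold = set(predicted_concepts), set(gold_concepts)
    if not gold:
        return 0.0
    return len(pred & gold) / len(gold)
```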
5. Empirical Performance and Comparative Analysis
DeepRAG demonstrates pronounced gains in both retrieval efficiency and answer correctness compared to standard RAG and recent adaptive retrieval frameworks:
- On HotpotQA and 2WikiMultihopQA, full DeepRAG outperforms the best baseline in both EM and F1, while reducing mean retrievals per query to $1.09$ (vs. $2.46$–$6.26$ for baselines) (Guan et al., 3 Feb 2025).
- Biomedical DeepRAG achieves strong EM and concept-level accuracy on MedHopQA, with ablation studies confirming that hierarchical reasoning, process supervision, and concept rewards each contribute significantly to overall gains (Ji et al., 31 May 2025).
| Method | HotpotQA EM/F1 | 2WikiMHQA EM/F1 | Avg EM/F1 (all evaluated datasets) |
|---|---|---|---|
| Best Baseline | 35.2/41.43 | 35.2/42.85 | 40.68/46.37 |
| DeepRAG-Imi | 35.1/46.59 | 47.2/52.33 | 45.38/49.75 |
| DeepRAG (full) | 40.7/51.54 | 48.1/53.25 | 47.67/51.35 |
Retrieval cost and answer accuracy are jointly calibrated: DeepRAG attains a Matthews Correlation Coefficient of 0.45 for knowledge-boundary decisions, substantially above prior adaptive RAG frameworks (Guan et al., 3 Feb 2025, Ji et al., 31 May 2025).
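For reference, the knowledge-boundary metric just cited, the Matthews correlation between the model's retrieve/skip decisions and oracle labels, can be computed as:

```python
import math

# Sketch: Matthews Correlation Coefficient over binary retrieve/skip
# decisions versus oracle knowledge-boundary labels (True = retrieval
# genuinely needed).

def mcc(preds, labels):
    tp = sum(1 for p, l in zip(preds, labels) if p and l)
    tn = sum(1 for p, l in zip(preds, labels) if not p and not l)
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if not p and l)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

Unlike raw accuracy, MCC stays informative when retrieval is needed for only a small fraction of subqueries, which is why it suits knowledge-boundary evaluation.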
6. Limitations, Extensions, and Future Directions
DeepRAG’s decomposition via binary-tree search incurs exponential cost in worst-case multi-hop queries, motivating future work on heuristic or Monte Carlo Tree Search-based decomposition. Current instantiations are limited in retrieval corpus diversity (e.g., Wikipedia), and on-policy RL-based policy improvement remains a potential augmentation. Domain adaptations (e.g., biomedical, Hindi/NLP) highlight the flexibility of DeepRAG but also the complexities of task-specific contrastive supervision and reward engineering.
Across its instantiations, DeepRAG establishes a template for tightly integrated, policy- and task-aware retrieval and reasoning, grounded in formal MDPs and imitation-calibration pipelines, and empirically validated for both general and domain-specific QA (Guan et al., 3 Feb 2025, Ji et al., 31 May 2025, M, 11 Mar 2025).