- The paper introduces DeepRAG, a retrieval-augmented framework that models multi-hop question answering as a Markov Decision Process and decomposes each question into a sequence of dependent subqueries.
- It employs binary tree search to balance internal (parametric) reasoning against external retrieval, reducing retrieval overhead while improving answer accuracy by roughly 22%.
- The approach combines imitation learning with atomic-decision calibration to fine-tune when retrieval is triggered, minimizing hallucinations and ensuring precise answers.
The paper presents a detailed framework for integrating retrieval-augmented generation with step-by-step reasoning in LLMs. It formulates multi-hop question answering as a Markov Decision Process (MDP) that decomposes the original query into a sequence of dependent subqueries, with the model dynamically deciding at each step whether to rely on internal (parametric) knowledge or to retrieve external information. This decision process rests on two sub-decisions per step: a binary termination decision (whether to stop and produce the final answer) and an atomic decision (whether to retrieve for the current subquery).
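A minimal sketch of this decision loop is given below; the `should_terminate`, `generate_subquery`, `can_answer_internally`, `answer_from_parameters`, and `answer_with_documents` helpers are hypothetical stand-ins for the paper's prompting steps, not its actual API.

```python
# Minimal sketch of the MDP-style decision loop. The `model` and
# `retriever` methods used here are hypothetical stand-ins for the
# paper's prompting steps, not its actual API.
from dataclasses import dataclass, field

@dataclass
class State:
    question: str                              # original question x
    steps: list = field(default_factory=list)  # [(q_i, r_i), ...] so far

def solve(question, model, retriever, max_steps=8):
    state = State(question)
    for _ in range(max_steps):
        # Termination decision: stop and emit the final answer?
        if model.should_terminate(state):
            break
        q_i = model.generate_subquery(state)
        # Atomic decision: parametric knowledge vs. external retrieval.
        if model.can_answer_internally(state, q_i):
            r_i = model.answer_from_parameters(state, q_i)
        else:
            docs = retriever.search(q_i)
            r_i = model.answer_with_documents(state, q_i, docs)
        state.steps.append((q_i, r_i))
    return model.final_answer(state)
```

Each iteration appends (qi, ri) to the state, matching the MDP view in which the state is the original question plus the subquery-answer history so far.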
The methodology is structured around three primary components:
- Binary Tree Search for Data Synthesis:
For each training question, the framework explores a binary tree in which every subquery node branches into two atomic choices: answering from parametric knowledge or retrieving external documents first. The correct trajectory with the lowest retrieval cost is retained as supervision (see the sketch after this list). Notation:
- x: Input question
- qi: Subquery generated at the i-th iteration
- ri: Intermediate answer to qi (possibly enriched with retrieved documents)
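The synthesis step can be sketched as a uniform-cost search over this binary tree: because retrieval cost never decreases along a branch, the first correct terminal popped from a min-heap is the minimal-retrieval-cost trajectory. All helper names below are illustrative, not the paper's API.

```python
# Sketch: uniform-cost search over the binary decision tree, where each
# subquery branches into "answer directly" (cost 0) vs. "retrieve then
# answer" (cost +1). Helper names on `model` and `retriever` are
# illustrative stand-ins for the paper's prompting steps.
import heapq
import itertools

def answer_matches(prediction, gold):
    # Illustrative exact-match check; the paper's criterion may differ.
    return prediction.strip().lower() == gold.strip().lower()

def synthesize_trajectory(question, model, retriever, gold_answer, max_depth=6):
    tie = itertools.count()  # tiebreaker so the heap never compares steps
    frontier = [(0, next(tie), [])]  # (retrieval_cost, tie, steps)
    while frontier:
        cost, _, steps = heapq.heappop(frontier)
        if model.should_terminate(question, steps):
            answer = model.final_answer(question, steps)
            if answer_matches(answer, gold_answer):
                return cost, steps  # cheapest correct trajectory
            continue
        if len(steps) >= max_depth:
            continue
        q_i = model.generate_subquery(question, steps)
        # Branch 1: parametric answer, retrieval cost unchanged.
        r_direct = model.answer_from_parameters(question, steps, q_i)
        heapq.heappush(frontier, (cost, next(tie),
                                  steps + [(q_i, r_direct, "direct")]))
        # Branch 2: retrieve first, retrieval cost + 1.
        docs = retriever.search(q_i)
        r_ret = model.answer_with_documents(question, steps, q_i, docs)
        heapq.heappush(frontier, (cost + 1, next(tie),
                                  steps + [(q_i, r_ret, "retrieve")]))
    return None  # no trajectory reproduced the gold answer
```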
- Imitation Learning:
Leveraging the trajectories synthesized by the binary tree search, the approach fine-tunes the model via imitation learning. This stage teaches the model to mimic the query-decomposition and retrieval-decision pattern that minimizes retrieval cost while preserving answer correctness. The training objective masks retrieved text out of the loss, so the model is not trained to reproduce retrieved documents and instead learns to generate precise subqueries and corresponding intermediate answers (a sketch of the masking follows below).
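The masking idea can be illustrated with a short PyTorch sketch; the tensor layout and the use of -100 as the ignore index are assumptions made for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def masked_imitation_loss(logits, input_ids, retrieved_mask):
    """Cross-entropy over trajectory tokens, with retrieved-document
    tokens masked out so the model is not trained to reproduce them.

    logits:         (batch, seq_len, vocab) next-token predictions
    input_ids:      (batch, seq_len) token ids of the full trajectory
    retrieved_mask: (batch, seq_len) bool, True where the token came
                    from a retrieved document, not the model's output
    """
    # Standard causal shift: predict token t+1 from position t.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    shift_mask = retrieved_mask[:, 1:]
    # ignore_index=-100 drops masked positions from the loss.
    shift_labels[shift_mask] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```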
- Chain of Calibration:
The final stage calibrates the model's atomic retrieval decisions through preference pairs over direct versus retrieval-based answers. Symbols:
- yw: Generated snippet for a direct (parametric) answer
- yl: Generated snippet for a retrieval-based answer
- πθ: The policy model's probability distribution
- πref: The reference model's probabilities
- β: Hyperparameter controlling the strength of the penalty
- This formulation enables the model to optimize its retrieval mechanism by explicitly calibrating its internal confidence relative to external evidence.
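The listed symbols match the standard Direct Preference Optimization (DPO) loss; a plausible reconstruction of the calibration objective, stated here as an assumption since the equation itself is not reproduced above, is:

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Minimizing this loss pushes πθ to rank the preferred snippet yw above yl relative to the reference model, with β scaling the penalty.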
In extensive experiments across multiple open-domain QA benchmarks—including datasets that test multi-hop reasoning, temporal sensitivity, and robustness to knowledge sparsity—the proposed framework demonstrates a strong performance advantage. In particular, DeepRAG improves answer accuracy by 21.99% on average over baselines such as chain-of-thought (CoT) prompting, iterative retrieval methods, and confidence-based retrieval strategies. The paper's retrieval-efficiency analysis further shows that DeepRAG's average number of retrieval operations is lower than that of methods relying on exhaustive iterative retrieval. This indicates an effective balance: the model leverages its internal knowledge when possible and opts for retrieval only when necessary.
Additional ablation studies further validate the efficacy of each component:
- Imitation Learning Variants: Alternative data synthesis strategies (e.g., selecting maximum-cost or random paths through the tree) yield lower performance than the minimal-retrieval-cost strategy.
- Chain of Calibration Alternatives: Techniques based on all-node preferences or sentence-level calibration lead to over-reliance on internal knowledge and degraded performance, reinforcing the need for the proposed atomic-decision calibration.
Overall, the paper provides a comprehensive and technically detailed framework that integrates reasoning and adaptive retrieval in a principled manner. By modeling the retrieval-augmented generation process as an MDP and introducing strategic calibration of knowledge boundaries, the method offers a robust solution to minimize hallucinations and redundant retrievals while optimizing answer accuracy in LLMs.