Dynamic RAG: Adaptive Retrieval in LLMs
- Dynamic RAG is a framework where large language models actively retrieve external data during inference using feedback-driven, iterative queries.
- It enhances accuracy and efficiency by interleaving retrieval with generation based on uncertainty signals and multi-hop reasoning.
- Dynamic RAG employs techniques like dynamic k-selection, reinforcement learning, and multi-armed bandit strategies to balance retrieval cost with performance.
Dynamic Retrieval-Augmented Generation (Dynamic RAG) refers to a class of architectures and methodologies in which LLMs interact with external knowledge sources in a non-static, query- or context-adaptive manner during inference and/or generation. Unlike traditional RAG pipelines, which retrieve a fixed set of documents per input query (typically via a one-shot similarity search), Dynamic RAG systems adaptively decide what, when, and how much to retrieve, often interleaving retrieval with generation and utilizing explicit feedback to optimize both pipeline components.
1. Foundations and Motivation
The foundational goal of Retrieval-Augmented Generation is to overcome the knowledge limitations and hallucination tendencies of LLMs by grounding their outputs in external, dynamically retrieved information. In standard RAG, a retriever—usually a bi-encoder—fetches a set of relevant contexts for a given query, which are then concatenated (or otherwise incorporated) into the generation context for the LLM. However, this static paradigm faces several well-documented limitations (Gupta et al., 3 Oct 2024):
- Inflexibility: One-shot retrieval at query time assumes the query fully specifies all information needs, which is often false in multi-hop, ambiguous, or evolving scenarios.
- Long-context inefficiency: As the number or length of retrieved passages grows, softmax attention becomes diluted (“attention dilution”), leading to degraded focus and increased entropy in the model’s information distribution (Wang et al., 7 Aug 2025).
- Missed Relevance: Important but indirect or dynamically relevant evidence is often missed because static scoring cannot adapt to intermediate findings (Hei et al., 11 Jun 2024).
Dynamic RAG methodologies address these challenges by introducing mechanisms for iterative, state-aware, or feedback-driven retrieval aimed at (a) maximizing answer accuracy and factual grounding, (b) minimizing unnecessary context, (c) improving robustness to ambiguity and multi-step reasoning, and (d) enhancing efficiency by limiting LLM calls and context expansion (Sun et al., 12 May 2025, Tang et al., 2 Dec 2024, Su et al., 7 Jun 2025).
2. Key Architectural Patterns in Dynamic RAG
2.1 Interleaved Retrieval and Generation
Dynamic RAG decomposes the classic retrieve-then-generate pipeline into more granular, often iterative steps:
- Trigger Policies: The system computes, at each generation step $t$, an uncertainty or information-salience signal (e.g., predictive entropy over next-token probabilities), invoking retrieval when this surpasses a learned or hand-tuned threshold (Su et al., 7 Jun 2025); a minimal sketch of this trigger-driven loop follows the list.
- State-Dependent Queries: Instead of statically encoding the original query, the retrieval query at time $t$ can be a function of the generation history or internal hidden state, e.g., $q_t = f(q, y_{<t}, s_t)$ (He et al., 28 Apr 2025).
- Context Update: Retrieved documents are injected into the context via concatenation or as special tokens, and decoding continues until end-of-sequence or another retrieval is triggered.
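The trigger-policy loop sketched above can be made concrete with a short example. The following Python is a minimal illustration under assumed interfaces, not any specific published system: `generate_step`, `predictive_entropy`, `reformulate_query`, and `retrieve` stand in for model- and index-specific components, and `tau` is the retrieval threshold.

```python
def dynamic_rag_decode(query, generate_step, predictive_entropy,
                       reformulate_query, retrieve, tau=2.5, max_tokens=256):
    """Interleave generation with retrieval triggered by predictive entropy.

    All callables are placeholders for model/index-specific components:
      generate_step(query, context, generated) -> next token (str) or None at EOS
      predictive_entropy(query, context, generated) -> float (nats)
      reformulate_query(query, generated) -> state-dependent retrieval query
      retrieve(retrieval_query) -> list of passage strings
    """
    context, generated = [], []
    for _ in range(max_tokens):
        # Trigger policy: retrieve when next-token uncertainty exceeds tau.
        if predictive_entropy(query, context, generated) > tau:
            # State-dependent query: condition on the generation so far.
            q_t = reformulate_query(query, generated)
            context.extend(retrieve(q_t))   # context update via concatenation
        token = generate_step(query, context, generated)
        if token is None:                   # end-of-sequence
            break
        generated.append(token)
    return "".join(generated), context
```

In practice, `predictive_entropy` would wrap the decoder's next-token distribution, and the threshold is either tuned on validation data or learned jointly with the policy.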
2.2 Dynamic Relevance Mining
Two-stage retrieval, as typified by DR-RAG, seeks to capture both “static-relevant” and “dynamic-relevant” evidence. The second stage uses query reformulation based on initially retrieved evidence to surface documents whose relevance emerges only given prior findings (Hei et al., 11 Jun 2024).
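A schematic of this two-stage pattern is sketched below; `retrieve` and `reformulate` are assumed interfaces to a dense retriever and a query-rewriting step, and the final merging step is deliberately simplified relative to the published DR-RAG pipeline (which adds a compact relevance classifier).

```python
def two_stage_retrieve(query, retrieve, reformulate, k1=5, k2=5):
    """Two-stage 'dynamic relevance' retrieval sketch.

    Stage 1 fetches statically relevant passages; stage 2 reformulates the
    query conditioned on each stage-1 passage so that evidence whose
    relevance emerges only given prior findings can surface.
      retrieve(query, k) -> list of passage strings
      reformulate(query, passage) -> new query string
    """
    stage1 = retrieve(query, k1)
    dynamic = []
    for passage in stage1:
        follow_up = reformulate(query, passage)  # condition on stage-1 evidence
        dynamic.extend(retrieve(follow_up, k2))
    # Deduplicate while preserving order; final ranking is system-specific.
    seen, merged = set(), []
    for p in stage1 + dynamic:
        if p not in seen:
            seen.add(p)
            merged.append(p)
    return merged
```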
2.3 Dynamic k-Selection and Ordering
DynamicRAG formalizes the reranker as an RL agent that outputs both a permutation over retrieved documents and a dynamic stopping point, thereby determining the optimal $k$ per query (Sun et al., 12 May 2025). The agent is trained on a composite reward signal derived from the quality of the final generated answer, encompassing exact match, semantic similarity, fluency, and LLM-based validation.
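A minimal interface for such a dynamic-$k$ reranker is sketched below; `score_action` is an assumed stand-in for the learned policy (in DynamicRAG the policy is itself an LLM), and the greedy rollout is only one way to decode its actions.

```python
STOP = -1  # sentinel action meaning "stop selecting documents"

def rerank_with_dynamic_k(docs, score_action, max_k=10):
    """Greedy rollout of a reranking policy that emits document indices
    until it chooses STOP, yielding both an ordering and a per-query k.

    score_action(selected_indices, candidate_index_or_STOP, docs) -> float
    is a placeholder for the learned policy's action score.
    """
    selected = []
    remaining = set(range(len(docs)))
    while remaining and len(selected) < max_k:
        actions = list(remaining) + [STOP]
        best = max(actions, key=lambda a: score_action(selected, a, docs))
        if best == STOP:
            break
        selected.append(best)
        remaining.remove(best)
    return [docs[i] for i in selected]  # ordered, dynamically sized context
```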
2.4 Multi-Armed Bandit Retrieval Selection
MBA-RAG operationalizes adaptive method selection through a multi-armed bandit (MAB) algorithm, where each arm represents a retrieval strategy (e.g., zero-step, one-shot, multi-step retrieval), and the agent optimizes a cost-sensitive reward that trades off answer accuracy against retrieval cost (Tang et al., 2 Dec 2024).
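An illustrative epsilon-greedy variant of this idea is sketched below; the arm names, reward shaping, and `cost_weight` are assumptions for exposition rather than MBA-RAG's exact formulation.

```python
import random

class RetrievalBandit:
    """Epsilon-greedy bandit over retrieval strategies (illustrative only).

    Each arm is a retrieval strategy; the reward trades answer quality
    against retrieval cost, e.g. reward = accuracy - cost_weight * cost.
    """
    def __init__(self, arms, epsilon=0.1):
        self.arms = list(arms)                     # e.g. ["none", "single", "multi_step"]
        self.epsilon = epsilon
        self.counts = {a: 0 for a in self.arms}
        self.values = {a: 0.0 for a in self.arms}  # running mean reward per arm

    def select(self):
        if random.random() < self.epsilon:
            return random.choice(self.arms)        # explore
        return max(self.arms, key=lambda a: self.values[a])  # exploit

    def update(self, arm, accuracy, cost, cost_weight=0.1):
        reward = accuracy - cost_weight * cost     # cost-sensitive reward
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Hypothetical usage: pick an arm, run that retrieval strategy, then feed
# back the observed answer accuracy and retrieval cost.
bandit = RetrievalBandit(["none", "single", "multi_step"])
arm = bandit.select()
bandit.update(arm, accuracy=1.0, cost={"none": 0, "single": 1, "multi_step": 3}[arm])
```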
2.5 Graph- and Plan-Based Modularization
Plan*RAG and DyG-RAG decompose complex queries into explicit DAGs of sub-questions (Plan*RAG) or temporally ordered event-entity graphs (DyG-RAG), enabling fine-grained, parallelizable, and causally faithful multi-hop retrieval and answer synthesis (Verma et al., 28 Oct 2024, Sun et al., 16 Jul 2025).
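At its core, the plan-based pattern reduces to answering sub-questions in dependency order. Below is a minimal sketch, with `decompose`, `retrieve`, and `answer_subq` as assumed placeholders for an LLM planner, a retriever, and an answer module; the published systems add parallel execution and graph-specific structure on top of this.

```python
from graphlib import TopologicalSorter

def answer_with_plan_dag(question, decompose, retrieve, answer_subq):
    """Plan-then-retrieve sketch: decompose a question into a DAG of
    sub-questions, then answer them in topological order so each
    sub-answer can condition later retrieval (illustrative only).

      decompose(question) -> dict {sub_q: set(prerequisite sub_qs)}
      retrieve(sub_q, prior_answers) -> list of passages
      answer_subq(sub_q, passages, prior_answers) -> answer string
    """
    dag = decompose(question)               # {node: {dependencies}}
    answers = {}
    for sub_q in TopologicalSorter(dag).static_order():
        passages = retrieve(sub_q, answers)  # retrieval conditioned on earlier answers
        answers[sub_q] = answer_subq(sub_q, passages, answers)
    return answers
```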
3. Formal Modeling and Algorithmic Instantiations
Let $q$ be the query, $y_{<t}$ the generated tokens so far, $s_t$ the model state, and $\mathcal{R}$ the retrieval oracle:
- Adaptive Retrieval Policy: retrieve at step $t$ if and only if $u_t > \tau$,
where $u_t$ is, for example, the entropy of the model's predictive distribution, $u_t = H\big[p_\theta(\cdot \mid q, y_{<t})\big]$ (Su et al., 7 Jun 2025).
- Dynamic k-Selection: The RL-trained policy outputs a sequence of document indices terminated by a STOP token, defining both the ordering and $k$. The joint reward combines answer-level signals, e.g., $R = \lambda_1 r_{\text{EM}} + \lambda_2 r_{\text{sem}} + \lambda_3 r_{\text{fluency}} + \lambda_4 r_{\text{LLM}}$ (Sun et al., 12 May 2025).
- Iterative Multi-Hop Retrieval (CDF-RAG): At each hop $h$, for a refined query $q_h$, retrieve semantic and causal paths, aggregate them into the evidence context (e.g., $C_h = C_{h-1} \cup \mathcal{R}(q_h)$), condition generation on $C_h$, and validate the output with causal and hallucination metrics (Khatibi et al., 17 Apr 2025).
- Entropy-Invariant Attention Scaling (BEE-RAG): Incorporate a balancing factor $\beta$, adaptively determined via moment matching (Intrinsic Multi-Importance Inference) or parameter-efficient fine-tuning, into the softmax attention, e.g., $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\beta\, QK^{\top} / \sqrt{d}\right)V$, to enforce context-length-invariant entropy (Wang et al., 7 Aug 2025).
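To illustrate the entropy-balancing idea (not BEE-RAG's actual estimator), the sketch below chooses $\beta$ by a simple bisection, standing in for moment matching, so that the attention entropy matches a target regardless of how many retrieved tokens are in the context; the target could be, for instance, the entropy measured at a reference context length.

```python
import numpy as np

def attention_entropy(scores):
    """Shannon entropy (nats) of a softmax attention distribution."""
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def balanced_attention_weights(q, K, target_entropy, d):
    """Scale attention logits by a balancing factor beta chosen so that the
    entropy of softmax(beta * q K^T / sqrt(d)) stays near `target_entropy`.
    The bisection search is only a transparent stand-in for the adaptive
    estimation described in the paper.
    """
    logits = K @ q / np.sqrt(d)          # (n,) raw attention logits
    lo, hi = 1e-3, 1e3
    for _ in range(60):                  # bisection on beta
        beta = np.sqrt(lo * hi)
        # Entropy decreases as beta grows (the distribution sharpens).
        if attention_entropy(beta * logits) > target_entropy:
            lo = beta
        else:
            hi = beta
    p = np.exp(beta * logits - (beta * logits).max())
    return p / p.sum()
```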
4. Training, Optimization, and Feedback Mechanisms
Dynamic RAG controllers are typically trained using one or more of the following paradigms:
- Supervised Behavioral Cloning: Imitating expert selection, ordering, or refinement moves observed from a static reranker, robust pathway, or workflow (Sun et al., 12 May 2025).
- Direct Preference Optimization (DPO): Sampling alternative document selection/ordering trajectories, comparing their induced generation rewards, and optimizing the policy to prefer higher-reward paths (Sun et al., 12 May 2025, Leng et al., 7 Oct 2025).
- Reinforcement Learning: Using PPO or bandit algorithms to adapt retrieval policies or query-refinement modules based on downstream reward signals combining coverage, context, causal depth, hallucination rates, and end-task metrics (Khatibi et al., 17 Apr 2025, Tang et al., 2 Dec 2024).
- Contrastive Retraining: The R³ framework “closes the loop” between retriever and generator, labeling positive/negative retrievals using on-policy LLM outputs, and applying a contrastive loss on retriever encodings guided by answer quality (Zhou et al., 28 Oct 2025).
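A minimal numpy sketch of such a contrastive objective is given below; the positive/negative labeling rule, the quality threshold, and the InfoNCE-style form are illustrative assumptions rather than the published R³ loss.

```python
import numpy as np

def retriever_contrastive_loss(q_emb, doc_embs, answer_quality,
                               temperature=0.05, quality_threshold=0.5):
    """InfoNCE-style loss sketch for closing the retriever-generator loop.

    Documents whose inclusion led to high-quality on-policy answers are
    treated as positives, the rest as in-batch negatives (assumed rule).
      q_emb:          (d,)   query embedding
      doc_embs:       (n, d) embeddings of retrieved documents
      answer_quality: (n,)   answer-quality score attributed to each document
    """
    # Cosine similarities between the query and each candidate document.
    q = q_emb / np.linalg.norm(q_emb)
    D = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = D @ q / temperature

    positives = answer_quality >= quality_threshold
    if not positives.any():
        return 0.0  # nothing to reinforce for this query
    # Stable log-sum-exp over all candidates (the InfoNCE denominator).
    log_denom = np.log(np.exp(sims - sims.max()).sum()) + sims.max()
    # Average negative log-likelihood of each positive against all candidates.
    return float(np.mean(log_denom - sims[positives]))
```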
5. Empirical Results, Trade-Offs, and Applications
Dynamic RAG systems achieve substantial empirical gains across QA, fact verification, and complex reasoning tasks:
- Answer Quality: State-of-the-art dynamic RAG methods achieve 5–16 percentage point absolute improvements in EM/F1 across diverse benchmarks over static RAG and prior adaptive approaches (see CDF-RAG, DR-RAG, MBA-RAG, DynamicRAG, Plan*RAG, BEE-RAG) (Khatibi et al., 17 Apr 2025, Hei et al., 11 Jun 2024, Tang et al., 2 Dec 2024, Sun et al., 12 May 2025, Verma et al., 28 Oct 2024, Wang et al., 7 Aug 2025).
- Efficiency: DR-RAG and similar two-stage methods reduce average LLM compute per sample by over 70% by consolidating all retrieval into a single LLM call (Hei et al., 11 Jun 2024). Pruning and process-level policy optimization—e.g., DecEx-RAG—yield 5–6× faster data construction for policy optimization (Leng et al., 7 Oct 2025).
- Robustness and Explainability: Causal graph validation (CDF-RAG), self-reflection agents in multi-modal domains (mRAG), and entropy-balanced attention (BEE-RAG) contribute to improved causal consistency, lowered hallucination, and context-length invariance (Khatibi et al., 17 Apr 2025, Hu et al., 29 May 2025, Wang et al., 7 Aug 2025).
- Adaptivity: Query-adaptive selection of $k$, iterative refinement, and bandit-based arm selection ensure that complex ("multi-hop" or ambiguous) queries receive richer evidence, while simple/single-hop queries incur minimal retrieval cost (Tang et al., 2 Dec 2024, Sun et al., 12 May 2025).
Table: Selected Empirical Performance
| System (Dataset) | Reported Gain vs. Prior Baselines |
|---|---|
| CDF-RAG (MedMCQA) | +16 pp Accuracy |
| DR-RAG (HotpotQA) | +6 EM, +9 F1 |
| MBA-RAG (avg 6 datasets) | +4.4 EM, +4.8 F1, –20% retrieval steps |
| DynamicRAG (NQ) | EM 48.4 vs. GPT-4o 36.1 |
| DecEx-RAG (6 datasets) | +6.3 / +5.7 absolute EM/F1 improvement |
| BEE-RAG (Qwen-7B) | +2–3 pp EM/F1 over best zero-shot RAG |
6. Limitations and Open Research Directions
Despite their improved flexibility and performance, Dynamic RAG methods face significant technical challenges and open problems:
- Retrieval Cost and Latency: Iterative, token-level, or multi-hop retrieval-based architectures may introduce inference latency due to repeated or per-step searches (He et al., 28 Apr 2025, Su et al., 7 Jun 2025).
- Query Formulation Accuracy: RL-driven or LLM-generated queries can fail to retrieve necessary evidence unless carefully tuned or guided by explicit uncertainty heuristics or chain-of-thought tracing (Su et al., 7 Jun 2025).
- Credit Assignment and Feedback: Causal attribution of downstream errors or improvements to specific retrieval events is difficult, and naive rewards may induce reward hacking or local optima (Leng et al., 7 Oct 2025).
- Context Length Management: Even with entropy balancing or compressed embedding approaches (e.g., DRAG), context-window and memory limits remain, especially in high-$k$ or document-rich settings (Shapkin et al., 2023, Wang et al., 7 Aug 2025).
- End-to-End Differentiability: Many agentic dynamic RAG approaches remain pipeline-based, with limited end-to-end differentiable training of retrieval and generation modules (He et al., 28 Apr 2025, Su et al., 7 Jun 2025).
Future work aims to:
- Jointly train retrieval and generation with RL on process- or answer-level rewards;
- Integrate meta-learning for domain or user-adaptive retrieval policies;
- Develop scalable, privacy-preserving dynamic indices;
- Further automate chain-of-thought-guided query refinement and retrieval scheduling;
- Extend dynamic RAG to multi-modal and agentic settings with strong self-reflection and evidence validation (Gupta et al., 3 Oct 2024, Hu et al., 29 May 2025, Su et al., 7 Jun 2025).
7. Specialized Extensions
Dynamic RAG architectures extend beyond text QA to event-centric temporal reasoning (DyG-RAG), multi-turn dialogue (DH-RAG), and multi-modal VQA or reference-guided image generation (ImageRAG, mRAG). In each, adaptivity—whether by dynamic entity/event selection, context pruning, self-reflection, or plan-based decomposition—yields state-of-the-art robustness, accuracy, and efficiency in real-world, knowledge-intensive scenarios (Sun et al., 16 Jul 2025, Zhang et al., 19 Feb 2025, Hu et al., 29 May 2025, Shalev-Arkushin et al., 13 Feb 2025).
Dynamic RAG methods represent the dominant trend in the evolution of RAG, advancing from rigid, batch-mode retrieval to closed-loop, uncertainty-driven, agentic, parallelized, and context-aware knowledge grounding. This shift is marked by demonstrable gains across efficiency and accuracy metrics, explicit modeling of when/what/how much to retrieve, and a growing methodological diversity suited for complex, high-stakes information retrieval in large-scale language generation systems.