
DynamicRAG: Adaptive Retrieval-Augmented Generation Framework

Updated 10 November 2025
  • DynamicRAG is a retrieval-augmented generation framework that dynamically selects and orders documents based on query context and LLM feedback.
  • It employs a reinforcement learning-based reranker to optimize document selection, balancing contextual informativeness with prompt brevity.
  • Empirical results show improved efficiency with fewer LLM calls and state-of-the-art performance across diverse benchmarks in QA and multi-hop tasks.

DynamicRAG refers to a class of retrieval-augmented generation (RAG) frameworks in which the set, ordering, or integration of retrieved documents is determined dynamically based on query context and often optimized via direct feedback from LLMs. The principal aim is to maximize the final generation quality—measured through multifaceted metrics aggregating factual correctness, textual overlap, and even LLM-based preference scores—by adaptively balancing the trade-off between contextual informativeness and distractive noise. Unlike traditional static reranking, which selects or reorders a fixed number of passages based on external retriever scores or heuristic policies, DynamicRAG incorporates reinforcement-learning-driven agents that consult the downstream generator’s actual output in a closed loop, with the reranker’s decisions modulated by rewards computed over the LLM’s response. This strategy has demonstrated state-of-the-art performance on diverse knowledge-intensive benchmarks, strong efficiency gains, and robust sample efficiency.

1. System Architecture: Components and Flow

DynamicRAG consists of three main modules coupled sequentially:

  1. Retriever (frozen): An off-the-shelf retriever (typically a dual-encoder, e.g. Contriever-MS MARCO) that returns the top-$N$ candidate documents $D$ for a user query $q$ from an external corpus $C$.
  2. Dynamic Reranker (trainable): A policy network $\pi_{\theta_r}$ (usually instantiated as an LLM, e.g. LLaMA2) which, given $(q, D)$, outputs a (possibly variable-sized) subset $\hat{D} \subset D$, an explicit ordering over $\hat{D}$, and implicitly decides the number of documents $k = |\hat{D}|$ to be passed to the generator.
  3. Generator (trainable): An LLM $\pi_{\theta_g}$ that produces the final answer $\hat{y}$, conditioning on $(q, \hat{D})$.

Inference workflow:

  1. $D = \text{Retriever}(q, C)$
  2. $\hat{D} = \pi_{\theta_r}(q, D)$, where $|\hat{D}| = k$ is determined dynamically per query
  3. $\hat{y} = \arg\max_y p(y \mid q, \hat{D})$ via $\pi_{\theta_g}$

This architecture decouples retrieval (frozen) from both reranking (dynamic, trainable) and generation (trainable), enabling fine-grained control and learning over document selection conditioned on actual output quality.
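
To make the decoupling concrete, the following sketch traces one query through the three modules. It is a minimal illustration, not the authors' released implementation; the `retrieve`, `select`, and `generate` methods and the prompt format are assumed interfaces.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    doc_id: str
    text: str


def dynamic_rag_infer(query: str, retriever, reranker, generator,
                      top_n: int = 40) -> str:
    """One query -> one answer, using exactly two LLM calls (reranker + generator)."""
    # 1. D = Retriever(q, C): the frozen retriever returns a fixed top-N candidate pool.
    candidates: List[Document] = retriever.retrieve(query, k=top_n)

    # 2. D_hat = pi_theta_r(q, D): the reranker picks and orders a variable-sized
    #    subset, implicitly choosing k via its STOP action.
    selected: List[Document] = reranker.select(query, candidates)

    # 3. y_hat = argmax_y p(y | q, D_hat): one generator call on the pruned context.
    context = "\n\n".join(d.text for d in selected)
    prompt = f"{context}\n\nQuestion: {query}\nAnswer:"
    return generator.generate(prompt)
```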

2. Reranker as a Reinforcement Learning Agent

The reranking module is cast as a partially observable Markov decision process (POMDP), with states represented implicitly by the tuple $(q, D, h_{t-1})$ (query, candidate list, selection history), and the action space comprising the indices of candidate passages plus a special $\texttt{STOP}$ token. At each step $t$, $\pi_{\theta_r}(a_t \mid q, D, h_{t-1})$ defines a distribution over the unselected document IDs and $\texttt{STOP}$.
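
A minimal sketch of this action loop is shown below. It assumes a hypothetical `policy(query, docs, history)` call returning one logit per candidate document plus a final logit for STOP; it is illustrative rather than the paper's code.

```python
import torch


def rollout_reranker(policy, query, docs, max_steps=20):
    """Sample a selection trajectory (a_1, ..., a_T) from pi_theta_r.

    Each step conditions on the query, the candidate list, and the selection
    history h_{t-1}; the action space is the unselected document indices
    plus a STOP action (index len(docs)).
    """
    stop_id = len(docs)
    history = []                                   # h_{t-1}: indices picked so far
    remaining = set(range(len(docs)))
    for _ in range(max_steps):
        logits = policy(query, docs, history)      # shape [len(docs) + 1]
        mask = torch.full_like(logits, float("-inf"))
        for i in remaining:                        # only unselected docs are legal
            mask[i] = 0.0
        mask[stop_id] = 0.0                        # STOP is always legal
        action = torch.distributions.Categorical(logits=logits + mask).sample().item()
        if action == stop_id:
            break
        history.append(action)
        remaining.discard(action)
    return history                                 # ordered indices of D_hat; k = len(history)
```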

Training Regimen

Behavioral Cloning (BC):

Initially, the agent is bootstrapped by supervised imitation of expert trajectories $\tau = (a_1, \dots, a_T)$ provided by a strong reference model (MonoT5). The BC objective is:

$$\mathcal{J}_{BC}(\theta_r) = \mathbb{E}_{(G, i, \tau)\sim \mathcal{D}_e}\Big[\sum_{t=1}^{T}\log\pi_{\theta_r}(a_t \mid G, i, h_{t-1})\Big]$$
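
In code, this objective reduces to a summed cross-entropy over the expert's actions. The sketch below reuses the hypothetical `policy(query, docs, history)` scoring interface from above and is only an illustration of the loss.

```python
import torch
import torch.nn.functional as F


def behavioral_cloning_loss(policy, query, docs, expert_trajectory):
    """Negative log-likelihood of an expert (MonoT5-derived) trajectory.

    Minimizing this is equivalent to maximizing
    sum_t log pi_theta_r(a_t | q, D, h_{t-1}).
    """
    history, nll = [], 0.0
    for a_t in expert_trajectory:                  # expert actions, ending with STOP
        logits = policy(query, docs, history)      # scores over candidate docs + STOP
        nll = nll + F.cross_entropy(logits.unsqueeze(0), torch.tensor([a_t]))
        history.append(a_t)
    return nll
```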

Reinforcement Learning (Direct Preference Optimization, DPO):

After pretraining, $\pi_{\theta_r}$ is further fine-tuned using DPO, with feedback from the generator $\pi_{\theta_g}$ serving as the environment:

  • For each sampled trajectory $\tau$, after generating output $\hat{y}$, assign the scalar reward:

$$r(G, i, \tau) = \alpha \cdot \mathrm{EM} + \beta \cdot \mathrm{SS} + \gamma \cdot \mathrm{TF} + \lambda \cdot \mathrm{LP} + \delta \cdot \mathrm{LLM\text{-}Eval}$$

  • $\mathrm{EM}$: exact match between the ground truth $y_{gt}$ and the model output $\hat{y}$
  • $\mathrm{SS}$: BERTScore
  • $\mathrm{TF}$: ROUGE
  • $\mathrm{LP}$: length penalty $1/(1 + |\hat{y}|)$
  • $\mathrm{LLM\text{-}Eval}$: learned LLM-based evaluator

All coefficients are set to $0.2$ in reported experiments.
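
As a concrete reading of this reward, the sketch below combines the five terms with equal weights of 0.2. The `bertscore_fn`, `rouge_fn`, and `llm_eval_fn` callables are assumed stand-ins for BERTScore, ROUGE, and the LLM-based evaluator, each returning a score in [0, 1]; measuring $|\hat{y}|$ in whitespace tokens is likewise an assumption.

```python
def trajectory_reward(pred, gold, bertscore_fn, rouge_fn, llm_eval_fn,
                      alpha=0.2, beta=0.2, gamma=0.2, lam=0.2, delta=0.2):
    """Scalar reward r(G, i, tau) assigned to one generated answer."""
    em = float(pred.strip().lower() == gold.strip().lower())   # EM: exact match
    ss = bertscore_fn(pred, gold)                              # SS: BERTScore
    tf = rouge_fn(pred, gold)                                  # TF: ROUGE
    lp = 1.0 / (1.0 + len(pred.split()))                       # LP: length penalty (token count assumed)
    llm = llm_eval_fn(pred, gold)                              # LLM-Eval: LLM judge
    return alpha * em + beta * ss + gamma * tf + lam * lp + delta * llm
```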

  • DPO step: Sample $N$ trajectories, select the best and worst $(\tau^+, \tau^-)$ according to $r(\cdot)$, and optimize

$$\mathcal{J}_{DPO}(\theta_r) = \mathbb{E}_{(\tau^+,\tau^-)}\left[ \log \sigma\Big( \beta_{\text{DPO}} \big[\log\pi_{\theta_r}(\tau^+) - \log\pi_{\theta_r}(\tau^-)\big] \Big) \right]$$

where $\sigma(\cdot)$ is the sigmoid function and $\beta_{\text{DPO}}$ is a temperature/scaling hyperparameter.
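
This pairwise objective translates directly into a softplus over the scaled log-probability margin, since $-\log\sigma(x) = \mathrm{softplus}(-x)$. The sketch below takes the summed trajectory log-probabilities as inputs; the default value of $\beta_{\text{DPO}}$ is illustrative, not taken from the paper.

```python
import torch.nn.functional as F


def dpo_loss(logp_pos, logp_neg, beta_dpo=0.1):
    """Loss for one (tau+, tau-) preference pair under the current reranker.

    logp_pos and logp_neg are scalar tensors holding log pi_theta_r(tau+)
    and log pi_theta_r(tau-), i.e. the summed per-step log-probabilities
    of each trajectory. Minimizing softplus(-beta * margin) is equivalent
    to maximizing log sigma(beta * margin).
    """
    return F.softplus(-beta_dpo * (logp_pos - logp_neg))
```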

3. Dynamic Selection of Document Count ($k$)

By incorporating a $\texttt{STOP}$ token in its action space, $\pi_{\theta_r}$ can end the selection process at any step $t$, thus choosing a variable number $k$ of supporting passages per query. Empirically, before RL, supervised models over-select ($k$ often peaks at 14–15 on NQ/HotpotQA); after RL (with the explicit length penalty), the distribution of $k$ shifts left (often peaking near 12–14), reflecting learned avoidance of prompt bloat and recognition that excess context can degrade answer quality.

4. Empirical Evaluation and Ablations

Benchmarks: Seven datasets spanning open-domain QA (Natural Questions, TriviaQA, ASQA, 2WikimQA), multi-hop (HotpotQA), fact verification (FEVER), and long-form QA (ELI5).

Baselines: Non-retrieval LLMs (GPT-3.5-Turbo, GPT-4, GPT-4o), RAG systems (IRCoT, ReAct, Reward-RAG, FLARE, RA-DIT), static rerankers (MonoT5, Vanilla-RAG, Self-RAG, RankRAG), and supervised fine-tuning.

Key Results:

| Dataset  | SOTA Baseline | Baseline EM/Accuracy/ROUGE | DynamicRAG (LLaMA3-8B) |
|----------|---------------|----------------------------|------------------------|
| NQ       | RankRAG       | 50.6 (470k ex.)            | 48.4 (150k ex.)        |
| HotpotQA | RankRAG       | 35.3                       | 36.7                   |
| ASQA     | ChatQA-1.5    | 46.8                       | 56.3                   |
| FEVER    | -             | -                          | 91.4 (Acc)             |
| ELI5     | -             | -                          | 24.6 (ROUGE-L)         |
  • Reranker recall: R@20 of 86.8 (NQ) and 67.2 (HotpotQA), matching or surpassing RankRAG with far less supervised data (20K vs. 470K samples).
  • Inference cost: Only two LLM calls per query (one by reranker, one by generator), compared to 20 for RankRAG—∼17× throughput gain.

Ablation Insights:

  • Removing retrieval or the reranker sharply reduces EM (by 13 to 25 points).
  • Skipping RL drops EM by 7.2 points.
  • Omitting the exact match reward term degrades QA benchmarks most severely; other reward terms (BERTScore, ROUGE, length penalty, LLM-Eval) are also essential for peak performance.

5. Efficiency, Robustness, and Generalization

DynamicRAG achieves substantial computational and sample efficiency:

  • 150K training examples suffice vs. 470K for previous SOTA.
  • 2 LLM forward passes at inference versus 20+ in iterative baselines.

The framework is robust to the choice of retriever backbone: tested with DPR, Contriever, and MonoT5, it consistently outperforms vanilla RAG. Model-scaling ablations reveal that larger rerankers and sharing parameters between the reranker and generator maximize accuracy gains.

The learned reranker is specialized to generation-targeted rewards, not simply IR metrics, and adapts dynamically to the information requirements of each query (e.g., more passages for multi-hop or ambiguous queries, fewer when context is confidently sufficient).

6. Limitations and Future Directions

Limitations:

  • Supervised signals for reward construction require gold answers, limiting out-of-the-box deployment to domains lacking reference data.
  • DPO-based RL demands sampling/batching multiple trajectories, which can stress memory/compute during fine-tuning.
  • Current DynamicRAG formulation uses per-generation scalar rewards; it does not model stepwise (per-selection) rewards or hierarchical stopping.

Proposed directions for extension:

  • Incorporate external critic models for richer, more informed reward shaping.
  • Enable truly online learning: adapt on continual deployment feedback, not just reference answers.
  • Extend to modalities and tasks beyond QA (e.g., summarization, dialog), with possible reward redefinition and per-step adaptation.
  • Explore observation-based RL (per-step feedback) to allow finer-grained policy adjustments.

7. Comparative Position and Impact

DynamicRAG distinguishes itself from static or purely supervised RAG frameworks by:

  • Framing document selection and ordering as an RL policy, tightly coupled to multi-metric, LLM-feedback-based generation rewards.
  • Empirically achieving or exceeding state-of-the-art results on diverse benchmarks with notably fewer training data.
  • Nearly eliminating LLM prompt bloat and token overhead, while increasing computational and sample efficiency.

Given these attributes, DynamicRAG provides a principled and practically validated architecture for retrieval-augmented generation systems that require dynamic context selection and explainable, reward-driven adaptivity (Sun et al., 12 May 2025).
