
DynamicRAG: Adaptive Retrieval-Augmented Generation Framework

Updated 10 November 2025
  • DynamicRAG is a retrieval-augmented generation framework that dynamically selects and orders documents based on query context and LLM feedback.
  • It employs a reinforcement learning-based reranker to optimize document selection, balancing contextual informativeness with prompt brevity.
  • Empirical results show improved efficiency with fewer LLM calls and state-of-the-art performance across diverse benchmarks in QA and multi-hop tasks.

DynamicRAG refers to a class of retrieval-augmented generation (RAG) frameworks in which the set, ordering, or integration of retrieved documents is determined dynamically based on query context and often optimized via direct feedback from LLMs. The principal aim is to maximize the final generation quality—measured through multifaceted metrics aggregating factual correctness, textual overlap, and even LLM-based preference scores—by adaptively balancing the trade-off between contextual informativeness and distractive noise. Unlike traditional static reranking, which selects or reorders a fixed number of passages based on external retriever scores or heuristic policies, DynamicRAG incorporates reinforcement-learning-driven agents that consult the downstream generator’s actual output in a closed loop, with the reranker’s decisions modulated by rewards computed over the LLM’s response. This strategy has demonstrated state-of-the-art performance on diverse knowledge-intensive benchmarks, strong efficiency gains, and robust sample efficiency.

1. System Architecture: Components and Flow

DynamicRAG consists of three main modules coupled sequentially:

  1. Retriever (frozen): An off-the-shelf retriever (typically a dual-encoder, e.g. Contriever-MS MARCO) that returns the top-$N$ candidate documents $D$ for a user query $q$ from an external corpus $C$.
  2. Dynamic Reranker (trainable): A policy network $\pi_{\theta_r}$ (usually instantiated as an LLM, e.g. LLaMA2) which, given $(q, D)$, outputs a (possibly variable-sized) subset $\hat{D} \subset D$, an explicit ordering over $\hat{D}$, and implicitly decides the number of documents $k = |\hat{D}|$ to be passed to the generator.
  3. Generator (trainable): An LLM $\pi_{\theta_g}$ that produces the final answer $\hat{y}$, conditioning on $(q, \hat{D})$.

Inference workflow:

  1. $D = \text{Retriever}(q, C)$
  2. $\hat{D} = \pi_{\theta_r}(q, D)$, where $|\hat{D}| = k$ is determined dynamically per query
  3. $\hat{y} = \arg\max_y p(y \mid q, \hat{D})$ via $\pi_{\theta_g}$

This architecture decouples retrieval (frozen) from both reranking (dynamic, trainable) and generation (trainable), enabling fine-grained control and learning over document selection conditioned on actual output quality.
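
To make the decoupling concrete, the following sketch traces one query through the three modules. It is a minimal illustration, not the authors' released implementation; the `retrieve`, `select`, and `generate` methods and the prompt format are assumed interfaces.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    doc_id: str
    text: str


def dynamic_rag_infer(query: str, retriever, reranker, generator,
                      top_n: int = 40) -> str:
    """One query -> one answer, using exactly two LLM calls (reranker + generator)."""
    # 1. D = Retriever(q, C): the frozen retriever returns a fixed top-N candidate pool.
    candidates: List[Document] = retriever.retrieve(query, k=top_n)

    # 2. D_hat = pi_theta_r(q, D): the reranker picks and orders a variable-sized
    #    subset, implicitly choosing k via its STOP action.
    selected: List[Document] = reranker.select(query, candidates)

    # 3. y_hat = argmax_y p(y | q, D_hat): one generator call on the pruned context.
    context = "\n\n".join(d.text for d in selected)
    prompt = f"{context}\n\nQuestion: {query}\nAnswer:"
    return generator.generate(prompt)
```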

2. Reranker as a Reinforcement Learning Agent

The reranking module is cast as a partially observable Markov decision process (POMDP), with states represented implicitly by the tuple $(q, D, h_{t-1})$ (query, candidate list, selection history), and the action space comprising the indices of candidate passages plus a special $\texttt{STOP}$ token. At each step $t$, $\pi_{\theta_r}(a_t \mid q, D, h_{t-1})$ defines a distribution over the unselected document IDs and $\texttt{STOP}$.
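
A minimal sketch of this action loop is shown below. It assumes a hypothetical `policy(query, docs, history)` call returning one logit per candidate document plus a final logit for STOP; it is illustrative rather than the paper's code.

```python
import torch


def rollout_reranker(policy, query, docs, max_steps=20):
    """Sample a selection trajectory (a_1, ..., a_T) from pi_theta_r.

    Each step conditions on the query, the candidate list, and the selection
    history h_{t-1}; the action space is the unselected document indices
    plus a STOP action (index len(docs)).
    """
    stop_id = len(docs)
    history = []                                   # h_{t-1}: indices picked so far
    remaining = set(range(len(docs)))
    for _ in range(max_steps):
        logits = policy(query, docs, history)      # shape [len(docs) + 1]
        mask = torch.full_like(logits, float("-inf"))
        for i in remaining:                        # only unselected docs are legal
            mask[i] = 0.0
        mask[stop_id] = 0.0                        # STOP is always legal
        action = torch.distributions.Categorical(logits=logits + mask).sample().item()
        if action == stop_id:
            break
        history.append(action)
        remaining.discard(action)
    return history                                 # ordered indices of D_hat; k = len(history)
```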

Training Regimen

Behavioral Cloning (BC):

Initially, the agent is bootstrapped by supervised imitation of expert trajectories $\tau = (a_1, \dots, a_T)$ provided by a strong reference model (MonoT5). The BC objective is:

$$\mathcal{J}_{BC}(\theta_r) = \mathbb{E}_{(G, i, \tau)\sim \mathcal{D}_e}\Big[\sum_{t=1}^{T}\log\pi_{\theta_r}(a_t \mid G, i, h_{t-1})\Big]$$
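
In code, this objective reduces to a summed cross-entropy over the expert's actions. The sketch below reuses the hypothetical `policy(query, docs, history)` scoring interface from above and is only an illustration of the loss.

```python
import torch
import torch.nn.functional as F


def behavioral_cloning_loss(policy, query, docs, expert_trajectory):
    """Negative log-likelihood of an expert (MonoT5-derived) trajectory.

    Minimizing this is equivalent to maximizing
    sum_t log pi_theta_r(a_t | q, D, h_{t-1}).
    """
    history, nll = [], 0.0
    for a_t in expert_trajectory:                  # expert actions, ending with STOP
        logits = policy(query, docs, history)      # scores over candidate docs + STOP
        nll = nll + F.cross_entropy(logits.unsqueeze(0), torch.tensor([a_t]))
        history.append(a_t)
    return nll
```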

Reinforcement Learning (Direct Preference Optimization, DPO):

After pretraining, $\pi_{\theta_r}$ is further fine-tuned using DPO, with feedback from the generator $\pi_{\theta_g}$ serving as the environment:

  • For each sampled trajectory $\tau$, after generating output $\hat{y}$, assign the scalar reward:

$$r(G, i, \tau) = \alpha \cdot \mathrm{EM} + \beta \cdot \mathrm{SS} + \gamma \cdot \mathrm{TF} + \lambda \cdot \mathrm{LP} + \delta \cdot \mathrm{LLM\text{-}Eval}$$

  • $\mathrm{EM}$: exact match between the ground truth $y_{gt}$ and the model output $\hat{y}$
  • $\mathrm{SS}$: BERTScore
  • $\mathrm{TF}$: ROUGE
  • $\mathrm{LP}$: length penalty $1/(1 + |\hat{y}|)$
  • $\mathrm{LLM\text{-}Eval}$: learned LLM-based evaluator

All coefficients are set to $0.2$ in reported experiments.
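
As a concrete reading of this reward, the sketch below combines the five terms with equal weights of 0.2. The `bertscore_fn`, `rouge_fn`, and `llm_eval_fn` callables are assumed stand-ins for BERTScore, ROUGE, and the LLM-based evaluator, each returning a score in [0, 1]; measuring $|\hat{y}|$ in whitespace tokens is likewise an assumption.

```python
def trajectory_reward(pred, gold, bertscore_fn, rouge_fn, llm_eval_fn,
                      alpha=0.2, beta=0.2, gamma=0.2, lam=0.2, delta=0.2):
    """Scalar reward r(G, i, tau) assigned to one generated answer."""
    em = float(pred.strip().lower() == gold.strip().lower())   # EM: exact match
    ss = bertscore_fn(pred, gold)                              # SS: BERTScore
    tf = rouge_fn(pred, gold)                                  # TF: ROUGE
    lp = 1.0 / (1.0 + len(pred.split()))                       # LP: length penalty (token count assumed)
    llm = llm_eval_fn(pred, gold)                              # LLM-Eval: LLM judge
    return alpha * em + beta * ss + gamma * tf + lam * lp + delta * llm
```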

  • DPO step: Sample $N$ trajectories, select the best and worst $(\tau^+, \tau^-)$ according to $r(\cdot)$, and optimize

$$\mathcal{J}_{DPO}(\theta_r) = \mathbb{E}_{(\tau^+,\tau^-)}\left[ \log \sigma\Big( \beta_{\text{DPO}} \big[\log\pi_{\theta_r}(\tau^+) - \log\pi_{\theta_r}(\tau^-)\big] \Big) \right]$$

where $\sigma(\cdot)$ is the sigmoid function and $\beta_{\text{DPO}}$ is a temperature/scaling hyperparameter.
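
This pairwise objective translates directly into a softplus over the scaled log-probability margin, since $-\log\sigma(x) = \mathrm{softplus}(-x)$. The sketch below takes the summed trajectory log-probabilities as inputs; the default value of $\beta_{\text{DPO}}$ is illustrative, not taken from the paper.

```python
import torch.nn.functional as F


def dpo_loss(logp_pos, logp_neg, beta_dpo=0.1):
    """Loss for one (tau+, tau-) preference pair under the current reranker.

    logp_pos and logp_neg are scalar tensors holding log pi_theta_r(tau+)
    and log pi_theta_r(tau-), i.e. the summed per-step log-probabilities
    of each trajectory. Minimizing softplus(-beta * margin) is equivalent
    to maximizing log sigma(beta * margin).
    """
    return F.softplus(-beta_dpo * (logp_pos - logp_neg))
```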

3. Dynamic Selection of Document Count ($k$)

By incorporating a $\texttt{STOP}$ token in its action space, $\pi_{\theta_r}$ can end the selection process at any step $t$, thus choosing a variable number $k$ of supporting passages per query. Empirically, before RL, supervised models over-select ($k$ often peaks at 14–15 on NQ/HotpotQA); after RL (with the explicit length penalty), the distribution of $k$ shifts left (often peaking near 12–14), reflecting learned avoidance of prompt bloat and recognition that excess context can degrade answer quality.

4. Empirical Evaluation and Ablations

Benchmarks: Seven datasets spanning open-domain QA (Natural Questions, TriviaQA, ASQA, 2WikimQA), multi-hop (HotpotQA), fact verification (FEVER), and long-form QA (ELI5).

Baselines: Non-retrieval LLMs (GPT-3.5-Turbo, GPT-4, GPT-4o), RAG systems (IRCoT, ReAct, Reward-RAG, FLARE, RA-DIT), static rerankers (MonoT5, Vanilla-RAG, Self-RAG, RankRAG), and supervised fine-tuning.

Key Results:

| Dataset  | SOTA Baseline | Baseline EM/Accuracy/ROUGE | DynamicRAG (LLaMA3-8B) |
|----------|---------------|----------------------------|------------------------|
| NQ       | RankRAG       | 50.6 (470k ex.)            | 48.4 (150k ex.)        |
| HotpotQA | RankRAG       | 35.3                       | 36.7                   |
| ASQA     | ChatQA-1.5    | 46.8                       | 56.3                   |
| FEVER    | -             | -                          | 91.4 (Acc)             |
| ELI5     | -             | -                          | 24.6 (ROUGE-L)         |
  • Reranker recall: R@20 of 86.8 (NQ) and 67.2 (HotpotQA), matching or surpassing RankRAG with far less supervised data (20K vs. 470K samples).
  • Inference cost: Only two LLM calls per query (one by reranker, one by generator), compared to 20 for RankRAG—∼17× throughput gain.

Ablation Insights:

  • Removing retrieval or the reranker sharply reduces EM (by 13 to 25 points).
  • Skipping RL drops EM by 7.2 points.
  • Omitting the exact match reward term degrades QA benchmarks most severely; other reward terms (BERTScore, ROUGE, length penalty, LLM-Eval) are also essential for peak performance.

5. Efficiency, Robustness, and Generalization

DynamicRAG achieves substantial computational and sample efficiency:

  • 150K training examples suffice vs. 470K for previous SOTA.
  • 2 LLM forward passes at inference versus 20+ in iterative baselines.

The framework is robust to the choice of retriever backbone: tested with DPR, Contriever, and MonoT5, it consistently outperforms vanilla RAG. Model-scaling ablations reveal that larger rerankers and sharing parameters between the reranker and generator maximize accuracy gains.

The learned reranker is specialized to generation-targeted rewards, not simply IR metrics, and adapts dynamically to the information requirements of each query (e.g., more passages for multi-hop or ambiguous queries, fewer when context is confidently sufficient).

6. Limitations and Future Directions

Limitations:

  • Supervised signals for reward construction require gold answers, limiting out-of-the-box deployment to domains lacking reference data.
  • DPO-based RL demands sampling/batching multiple trajectories, which can stress memory/compute during fine-tuning.
  • Current DynamicRAG formulation uses per-generation scalar rewards; it does not model stepwise (per-selection) rewards or hierarchical stopping.

Proposed directions for extension:

  • Incorporate external critic models for richer, more informed reward shaping.
  • Enable truly online learning: adapt on continual deployment feedback, not just reference answers.
  • Extend to modalities and tasks beyond QA (e.g., summarization, dialog), with possible reward redefinition and per-step adaptation.
  • Explore observation-based RL (per-step feedback) to allow finer-grained policy adjustments.

7. Comparative Position and Impact

DynamicRAG distinguishes itself from static or purely supervised RAG frameworks by:

  • Framing document selection and ordering as an RL policy, tightly coupled to multi-metric, LLM-feedback-based generation rewards.
  • Empirically achieving or exceeding state-of-the-art results on diverse benchmarks with notably fewer training data.
  • Nearly eliminating LLM prompt bloat and token overhead, while increasing computational and sample efficiency.

Given these attributes, DynamicRAG provides a principled and practically validated architecture for retrieval-augmented generation systems that require dynamic context selection and explainable, reward-driven adaptivity (Sun et al., 12 May 2025).
