
Multi-turn Retrieval-Augmented Generation

Updated 24 November 2025
  • Multi-turn RAG is a neural paradigm that integrates dialogue history and dynamic retrieval to generate contextually enriched responses.
  • Architectural variants combine retrievers, context managers, and LLMs to address challenges like non-standalone queries and dynamic evidence shifts.
  • Advanced methods employ query rewriting, multi-hop planning, and reinforcement learning to boost retrieval accuracy and response fidelity.

Multi-turn Retrieval-Augmented Generation (RAG) refers to a class of neural systems in which an LLM generates responses to a sequence of user queries, augmenting each response by actively retrieving relevant information from external corpora at every conversational turn. Unlike single-turn RAG, multi-turn RAG requires conditioning not only on the current input but also on prior conversational history, retrieved evidence, and/or dynamically evolving tool states. This paradigm is foundational for advanced question answering, dialogue agents, legal and technical consultation, tool-augmented planning, and multi-modal assistant systems, and is the subject of significant academic benchmarking and methodological innovation.

1. Formal Problem Definition and Multi-turn RAG Challenges

In multi-turn RAG, the system takes as input a conversation history $H_{t-1} = \{(q_1, r_1), \ldots, (q_{t-1}, r_{t-1})\}$ and a current user query $q_t$. The RAG system must (a) retrieve a dynamic context $D_t$ from a large, external or hybrid corpus $\mathcal{C}$, and (b) generate a response $r_t$, typically by conditioning on $q_t$, $D_t$, and (crucially) the full or compressed conversational context $H_{t-1}$ (Katsis et al., 7 Jan 2025, Cheng et al., 30 Oct 2024, Li et al., 28 Feb 2025, Zhang et al., 25 Feb 2025).
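Concretely, each turn runs one retrieval step followed by one generation step, both conditioned on the running history. The sketch below is a minimal illustration of this loop; the `retrieve` and `generate` callables are placeholders for whatever retriever and LLM a given system uses, not components of any cited method.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    query: str       # user query q_t
    response: str    # system response r_t
    evidence: list   # retrieved context D_t used for this turn

@dataclass
class Conversation:
    history: list = field(default_factory=list)  # H_{t-1} as a list of Turn objects

def run_turn(conv, q_t, retrieve, generate, k=5):
    """One multi-turn RAG step: retrieve D_t conditioned on (H_{t-1}, q_t), then generate r_t."""
    # (a) retrieval is conditioned on the dialogue history, not on q_t alone
    d_t = retrieve(query=q_t, history=conv.history, top_k=k)
    # (b) generation conditions on q_t, D_t, and the (full or compressed) history H_{t-1}
    r_t = generate(query=q_t, evidence=d_t, history=conv.history)
    conv.history.append(Turn(query=q_t, response=r_t, evidence=d_t))
    return r_t
```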

Key challenges include:

  • Non-standalone queries: Later turns often reference previous dialogue content explicitly or via coreference, requiring context-aware rewriting and retrieval.
  • Dynamic retrieval: Relevance of passages or documents often shifts across turns, with required context evolving both in topic and specificity.
  • Unanswerable or ambiguous questions: Systems must reliably abstain or return "I don't know" when true answers are unavailable.
  • Domain adaptation and context drift: Real-world applications require robustness to domain-specific language, evolving user intent, abrupt topic shifts, and noise accumulation.

2. Architectural Variants and System Design

Multi-turn RAG architectures combine several modular components (a retriever, a history/context manager, and an LLM generator), typically orchestrated as a pipeline or agentic workflow.

The table below summarizes major multi-turn RAG systems and their core design:

| System/Benchmark | Retrieval Mechanism | History Encoding/Management | Generation Module |
|---|---|---|---|
| MTRAG (Katsis et al., 7 Jan 2025) | Sparse, dense, hybrid + LLM rewrite | Concatenated or rewritten turns | Llama/Mixtral/GPT family |
| DH-RAG (Zhang et al., 19 Feb 2025) | Static + dynamic history, clustering, tree, chain-of-thought | Weighted/pruned history DB | LLM with context fusion |
| MA-RAG (Nguyen et al., 26 May 2025) | Sub-query per agent, multi-hop/agentic | Per-turn agent plans & histories | Agentic LLMs (Planner, etc.) |
| LevelRAG (Zhang et al., 25 Feb 2025) | High-level logic planner; sparse/dense/web low-level | Per-turn summaries, cache | Generator on summarized context |
| CRAG-MM (Wang et al., 30 Oct 2025) | Vision + web hybrid retrieval, query rewrite | Prior multi-modal turns | MM-LLM (image, text, dialogue) |

3. Retrieval and Query Rewriting Strategies

Retrieval in multi-turn RAG deviates significantly from static QA pipelines: systems must interpret each query against the accumulated dialogue history, resolving coreference and implicit references, and typically rewrite non-standalone queries into self-contained form before retrieval.

In legal, technical, and open-domain settings, query rewriting for non-standalone queries consistently boosts retrieval accuracy (e.g. Recall@10 in LexRAG rises to 33.33% for GTE-Qwen2-1.5B + Query-Rewrite) (Li et al., 28 Feb 2025).
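A common implementation prompts an LLM to produce a standalone rewrite of the latest query before passing it to the retriever. Below is a minimal sketch; the prompt wording and the `llm` callable are illustrative assumptions rather than the prompt used by any cited system.

```python
def rewrite_query(history, current_query, llm):
    """Rewrite a possibly non-standalone query into a self-contained one using dialogue history."""
    transcript = "\n".join(f"User: {q}\nAssistant: {r}" for q, r in history)
    prompt = (
        "Rewrite the last user question so that it can be understood without the "
        "conversation, resolving pronouns and implicit references.\n\n"
        f"Conversation so far:\n{transcript}\n\n"
        f"Last user question: {current_query}\n"
        "Standalone question:"
    )
    return llm(prompt).strip()

# Example: a follow-up such as "What about its appeal period?" would be rewritten
# to name the statute discussed in the previous turn before retrieval.
```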

4. Generation, Evaluation Protocols, and Metrics

Multiple retrieval settings are typically defined for evaluation:

  • Reference: Gold-supporting passages only ("oracle"/upper-bound).
  • Reference + RAG: Reference plus top-k retrieved passages to simulate noisy upper-bound.
  • Full RAG: Top-k retrieved only (realistic pipeline) (Katsis et al., 7 Jan 2025, Cheng et al., 30 Oct 2024).
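These settings differ only in which passages are placed in the generator's context. A minimal sketch of how the three contexts can be assembled, assuming hypothetical `gold_passages` and `retriever` inputs per turn:

```python
def build_context(setting, gold_passages, retriever, query, k=5):
    """Assemble the generator's evidence under one of the three evaluation settings."""
    if setting == "reference":
        # oracle upper bound: gold supporting passages only
        return list(gold_passages)
    retrieved = retriever(query, top_k=k)
    if setting == "reference+rag":
        # gold passages plus top-k retrieved, simulating a noisy upper bound
        return list(gold_passages) + [p for p in retrieved if p not in gold_passages]
    if setting == "full_rag":
        # realistic pipeline: top-k retrieved passages only
        return retrieved
    raise ValueError(f"unknown setting: {setting}")
```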

Generation is evaluated via automatic metrics (e.g., answer accuracy, faithfulness, and completeness scores), LLM-as-judge protocols, and human evaluation.

Performance degrades with increased turn count, increased noise from retrieval, and in domains with complex reasoning or ambiguous/unanswerable questions (e.g., Full RAG settings yield Answer Accuracy ≈0.86, vs. 0.98 in the Reference setting (Katsis et al., 7 Jan 2025)). Hallucination and loss of faithfulness are recurrent failure modes.
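For illustration, the per-turn retrieval and answer metrics referenced above can be tracked with simple bookkeeping; the helper names below are hypothetical, not part of any benchmark's toolkit.

```python
def recall_at_k(retrieved_ids, gold_ids, k=5):
    """Fraction of gold passages appearing in the top-k retrieved results for one turn."""
    if not gold_ids:
        return None  # unanswerable turn / no gold evidence
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(gold_ids)) / len(set(gold_ids))

def answer_accuracy(judgements):
    """Mean of per-turn correctness flags (0/1), e.g. from an LLM judge or human raters."""
    flags = [j for j in judgements if j is not None]
    return sum(flags) / len(flags) if flags else 0.0
```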

5. Key Benchmarks and Empirical Findings

Major Multi-turn RAG Benchmarks

  • MTRAG (Katsis et al., 7 Jan 2025): 110 conversations, 842 turns, with passage diversity and active retrieval requirements; four domains (Wikipedia, financial forums, .gov/.mil, cloud).
  • CORAL (Cheng et al., 30 Oct 2024): Large-scale (>8000 conversations), open-domain, multi-turn from Wikipedia; tasks include passage retrieval, generation, and citation.
  • LexRAG (Li et al., 28 Feb 2025): 1,013 legal consultations (5 turns each), 17,228 Chinese legal articles, expert annotation.
  • DH-RAG (Zhang et al., 19 Feb 2025): Benchmarks on open-domain and customer service dialogue; focus on history-learning and dynamic context updating.
  • CRAG-MM (Wang et al., 30 Oct 2025): Multi-modal evaluation for vision-text QA with egocentric images and web evidence; 2,000+ multi-turn visual conversations.

Empirical Observations

  • Retrieval performance drops for later and non-standalone turns (e.g., Recall@5 for Elser drops from 0.89 on the first turn to 0.47 on later turns in MTRAG) (Katsis et al., 7 Jan 2025).
  • Query rewriting and context encoding yield clear improvements for both sparse and dense retrievers; hybrid setups combining query rewrite with powerful retrievers lead in benchmarks (Li et al., 28 Feb 2025, Katsis et al., 7 Jan 2025).
  • Noisy or imprecise retrieval degrades answer quality; reference+retrieval settings help disaggregate retriever vs. generator errors (Katsis et al., 7 Jan 2025, Li et al., 28 Feb 2025).
  • Multi-modal and vision-text systems exhibit further degradation under low-quality image conditions, egocentric perspectives, and high entity tailness (Wang et al., 30 Oct 2025).
  • Human evaluation indicates LLMs produce natural and appropriate responses, but faithfulness and completeness lag without gold retrieval (Katsis et al., 7 Jan 2025).

6. Advanced Methodologies: Agentic, RL, and Dynamic Context Approaches

  • Multi-agent orchestration (e.g., MA-RAG): Deploys specialized agents for planning, query definition, evidence extraction, and synthesis, using chain-of-thought prompting and intermediate confidence scoring. This modular approach achieves out-of-the-box robustness without fine-tuning, and demonstrates compositional scalability (Nguyen et al., 26 May 2025).
  • Reinforcement learning for retrieval and planning (IM-RAG, Q-RAG, Auto-RAG): Trains policy modules (Questioner, Embedder) to maximize coverage of supporting evidence across multiple rounds, optionally using mid-step progress rewards (Yang et al., 15 May 2024, Sorokin et al., 10 Nov 2025, Yu et al., 29 Nov 2024). RL-based methods significantly boost F1 on multi-hop benchmarks and allow efficient decoupling of retriever optimization from the LLM generator.
  • Dynamic historical context integration (DH-RAG, DCT): Maintains a relevance- and recency-weighted dynamic memory of prior queries, passages, and responses; applies attention-based fusion with static retrieval; and prunes/weights memory to manage the context window budget (Zhang et al., 19 Feb 2025, Soni et al., 5 Jun 2025). A minimal sketch of this weighting appears after this list.
  • Planning over logic graphs (LevelRAG): Decomposes questions into atomic queries, applies multi-hop hybrid retrieval via independent sparse/dense/web operators, and incrementally supplements evidence until verification criteria are met (Zhang et al., 25 Feb 2025).
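To illustrate the flavor of dynamic history weighting, the sketch below scores stored turns by embedding similarity to the current query multiplied by an exponential recency decay, then keeps the top entries within a token budget. The scoring form, decay constant, and data layout are illustrative assumptions, not the exact formulation of DH-RAG or DCT.

```python
import math

def select_dynamic_context(memory, query_vec, similarity, current_turn,
                           budget_tokens=1024, decay=0.7):
    """Pick prior (query, passage, response) entries by relevance x recency under a token budget.

    memory: list of dicts with keys 'vec', 'text', 'turn', 'n_tokens' (assumed layout).
    similarity: callable mapping (query_vec, entry_vec) -> float, e.g. cosine similarity.
    """
    scored = []
    for entry in memory:
        relevance = similarity(query_vec, entry["vec"])
        recency = math.exp(-decay * (current_turn - entry["turn"]))
        scored.append((relevance * recency, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    selected, used = [], 0
    for score, entry in scored:
        if used + entry["n_tokens"] > budget_tokens:
            continue  # prune entries that would exceed the context window budget
        selected.append(entry["text"])
        used += entry["n_tokens"]
    return selected
```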

7. Open Problems and Prospects for Future Research

Published benchmarks and ablation studies converge on several enduring challenges and directions:

  • Dynamic context management: Further advances are needed for scaling memory, minimizing context pollution, and efficiently representing dialogue for retrieval and generation under tight context budgets (Soni et al., 5 Jun 2025, Zhang et al., 19 Feb 2025).
  • Hallucination and abstention: Improved answerability classifiers—incorporated as "IDK" judges or faithfulness scorers—are required to increase model reliability on unanswerable or ambiguous queries (Katsis et al., 7 Jan 2025).
  • Unified retriever–generator optimization: Current systems often treat retrieval and generation as distinct components; joint or RL-based optimization remains underexplored (Cheng et al., 30 Oct 2024).
  • Domain and modality transfer: Transferring multi-turn RAG to new domains (legal, technical, egocentric visual), complex toolflows, and cross-lingual/multimodal environments reveals bottlenecks in both retrieval and generative grounding (Li et al., 28 Feb 2025, Wang et al., 30 Oct 2025).
  • Evaluation scalability: Human evaluation is highly reliable (FANC scoring ≥90% agreement) but not scalable; automated LLM-Judge and reference-less metrics are critical for progress (Katsis et al., 7 Jan 2025, Li et al., 28 Feb 2025).
  • Compositional multi-agent and hybrid systems: Modular, agent-based orchestration enables fine-grained reasoning control and interpretability but presents new interface and efficiency bottlenecks (Nguyen et al., 26 May 2025, Zhang et al., 25 Feb 2025).

A plausible implication is that scalable multi-turn RAG progress will be driven by advances in dynamic context handling, multi-hop/agentic integration, unified retriever–generator co-optimization, and formal benchmarking across richer modalities and domains.

