
Multi-turn Retrieval-Augmented Generation

Updated 24 November 2025
  • Multi-turn RAG is a neural paradigm that integrates dialogue history and dynamic retrieval to generate contextually enriched responses.
  • Architectural variants combine retrievers, context managers, and LLMs to address challenges like non-standalone queries and dynamic evidence shifts.
  • Advanced methods employ query rewriting, multi-hop planning, and reinforcement learning to boost retrieval accuracy and response fidelity.

Multi-turn Retrieval-Augmented Generation (RAG) refers to a class of neural systems in which an LLM generates responses to a sequence of user queries, augmenting each response by actively retrieving relevant information from external corpora at every conversational turn. Unlike single-turn RAG, multi-turn RAG requires conditioning not only on the current input but also on prior conversational history, retrieved evidence, and/or dynamically evolving tool states. This paradigm is foundational for advanced question answering, dialogue agents, legal and technical consultation, tool-augmented planning, and multi-modal assistant systems, and is the subject of significant academic benchmarking and methodological innovation.

1. Formal Problem Definition and Multi-turn RAG Challenges

In multi-turn RAG, the system takes as input a conversation history $H_{t-1} = \{(q_1, r_1), \ldots, (q_{t-1}, r_{t-1})\}$ and a current user query $q_t$. The RAG system must (a) retrieve a dynamic context $D_t$ from a large, external or hybrid corpus $\mathcal{C}$, and (b) generate a response $r_t$, typically by conditioning on $q_t$, $D_t$, and (crucially) the full or compressed conversational context $H_{t-1}$ (Katsis et al., 7 Jan 2025, Cheng et al., 30 Oct 2024, Li et al., 28 Feb 2025, Zhang et al., 25 Feb 2025).
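Concretely, each turn runs one retrieval step followed by one generation step, both conditioned on the running history. The sketch below is a minimal illustration of this loop; the `retrieve` and `generate` callables are placeholders for whatever retriever and LLM a given system uses, not components of any cited method.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    query: str       # user query q_t
    response: str    # system response r_t
    evidence: list   # retrieved context D_t used for this turn

@dataclass
class Conversation:
    history: list = field(default_factory=list)  # H_{t-1} as a list of Turn objects

def run_turn(conv, q_t, retrieve, generate, k=5):
    """One multi-turn RAG step: retrieve D_t conditioned on (H_{t-1}, q_t), then generate r_t."""
    # (a) retrieval is conditioned on the dialogue history, not on q_t alone
    d_t = retrieve(query=q_t, history=conv.history, top_k=k)
    # (b) generation conditions on q_t, D_t, and the (full or compressed) history H_{t-1}
    r_t = generate(query=q_t, evidence=d_t, history=conv.history)
    conv.history.append(Turn(query=q_t, response=r_t, evidence=d_t))
    return r_t
```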

Key challenges include:

  • Non-standalone queries: Later turns often reference previous dialogue content explicitly or via coreference, requiring context-aware rewriting and retrieval.
  • Dynamic retrieval: Relevance of passages or documents often shifts across turns, with required context evolving both in topic and specificity.
  • Unanswerable or ambiguous questions: Systems must reliably abstain or return "I don't know" when true answers are unavailable.
  • Domain adaptation and context drift: Real-world applications require robustness to domain-specific language, evolving user intent, abrupt topic shifts, and noise accumulation.

2. Architectural Variants and System Design

Multi-turn RAG architectures combine several modular components (a retriever, a history/context manager, and an LLM generator), typically orchestrated as a pipeline or agentic workflow.

The table below summarizes major multi-turn RAG systems and their core design:

| System/Benchmark | Retrieval Mechanism | History Encoding/Management | Generation Module |
|---|---|---|---|
| MTRAG (Katsis et al., 7 Jan 2025) | Sparse, dense, hybrid + LLM rewrite | Concatenated or rewritten turns | Llama/Mixtral/GPT family |
| DH-RAG (Zhang et al., 19 Feb 2025) | Static + dynamic history, clustering, tree, chain-of-thought | Weighted/pruned history DB | LLM with context fusion |
| MA-RAG (Nguyen et al., 26 May 2025) | Sub-query per agent, multi-hop/agentic | Per-turn agent plans & histories | Agentic LLMs (Planner, etc.) |
| LevelRAG (Zhang et al., 25 Feb 2025) | High-level logic planner; sparse/dense/web low-level | Per-turn summaries, cache | Generator on summarized context |
| CRAG-MM (Wang et al., 30 Oct 2025) | Vision + web hybrid retrieval, query rewrite | Prior multi-modal turns | MM-LLM (image, text, dialogue) |

3. Retrieval and Query Rewriting Strategies

Retrieval in multi-turn RAG deviates significantly from static QA pipelines: systems must interpret each query against the accumulated dialogue history, resolving coreference and implicit references, and typically rewrite non-standalone queries into self-contained form before retrieval.

In legal, technical, and open-domain settings, query rewriting for non-standalone queries consistently boosts retrieval accuracy (e.g. Recall@10 in LexRAG rises to 33.33% for GTE-Qwen2-1.5B + Query-Rewrite) (Li et al., 28 Feb 2025).
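A common implementation prompts an LLM to produce a standalone rewrite of the latest query before passing it to the retriever. Below is a minimal sketch; the prompt wording and the `llm` callable are illustrative assumptions rather than the prompt used by any cited system.

```python
def rewrite_query(history, current_query, llm):
    """Rewrite a possibly non-standalone query into a self-contained one using dialogue history."""
    transcript = "\n".join(f"User: {q}\nAssistant: {r}" for q, r in history)
    prompt = (
        "Rewrite the last user question so that it can be understood without the "
        "conversation, resolving pronouns and implicit references.\n\n"
        f"Conversation so far:\n{transcript}\n\n"
        f"Last user question: {current_query}\n"
        "Standalone question:"
    )
    return llm(prompt).strip()

# Example: a follow-up such as "What about its appeal period?" would be rewritten
# to name the statute discussed in the previous turn before retrieval.
```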

4. Generation, Evaluation Protocols, and Metrics

Multiple retrieval settings are typically defined for evaluation:

  • Reference: Gold-supporting passages only ("oracle"/upper-bound).
  • Reference + RAG: Reference plus top-k retrieved passages to simulate noisy upper-bound.
  • Full RAG: Top-k retrieved only (realistic pipeline) (Katsis et al., 7 Jan 2025, Cheng et al., 30 Oct 2024).
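These settings differ only in which passages are placed in the generator's context. A minimal sketch of how the three contexts can be assembled, assuming hypothetical `gold_passages` and `retriever` inputs per turn:

```python
def build_context(setting, gold_passages, retriever, query, k=5):
    """Assemble the generator's evidence under one of the three evaluation settings."""
    if setting == "reference":
        # oracle upper bound: gold supporting passages only
        return list(gold_passages)
    retrieved = retriever(query, top_k=k)
    if setting == "reference+rag":
        # gold passages plus top-k retrieved, simulating a noisy upper bound
        return list(gold_passages) + [p for p in retrieved if p not in gold_passages]
    if setting == "full_rag":
        # realistic pipeline: top-k retrieved passages only
        return retrieved
    raise ValueError(f"unknown setting: {setting}")
```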

Generation is evaluated via automatic metrics (e.g., answer accuracy, faithfulness, and completeness scores), LLM-as-judge protocols, and human evaluation.

Performance degrades with increased turn count, increased noise from retrieval, and in domains with complex reasoning or ambiguous/unanswerable questions (e.g., Full RAG settings yield Answer Accuracy ≈0.86, vs. 0.98 in the Reference setting (Katsis et al., 7 Jan 2025)). Hallucination and loss of faithfulness are recurrent failure modes.
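For illustration, the per-turn retrieval and answer metrics referenced above can be tracked with simple bookkeeping; the helper names below are hypothetical, not part of any benchmark's toolkit.

```python
def recall_at_k(retrieved_ids, gold_ids, k=5):
    """Fraction of gold passages appearing in the top-k retrieved results for one turn."""
    if not gold_ids:
        return None  # unanswerable turn / no gold evidence
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(gold_ids)) / len(set(gold_ids))

def answer_accuracy(judgements):
    """Mean of per-turn correctness flags (0/1), e.g. from an LLM judge or human raters."""
    flags = [j for j in judgements if j is not None]
    return sum(flags) / len(flags) if flags else 0.0
```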

5. Key Benchmarks and Empirical Findings

Major Multi-turn RAG Benchmarks

  • MTRAG (Katsis et al., 7 Jan 2025): 110 conversations, 842 turns, with passage diversity and active retrieval requirements; four domains (Wikipedia, financial forums, .gov/.mil, cloud).
  • CORAL (Cheng et al., 30 Oct 2024): Large-scale (>8000 conversations), open-domain, multi-turn from Wikipedia; tasks include passage retrieval, generation, and citation.
  • LexRAG (Li et al., 28 Feb 2025): 1,013 legal consultations (5 turns each), 17,228 Chinese legal articles, expert annotation.
  • DH-RAG (Zhang et al., 19 Feb 2025): Benchmarks on open-domain and customer service dialogue; focus on history-learning and dynamic context updating.
  • CRAG-MM (Wang et al., 30 Oct 2025): Multi-modal evaluation for vision-text QA with egocentric images and web evidence; 2,000+ multi-turn visual conversations.

Empirical Observations

  • Retrieval performance drops for later and non-standalone turns (e.g., Recall@5 for Elser drops from 0.89 on the first turn to 0.47 on later turns in MTRAG) (Katsis et al., 7 Jan 2025).
  • Query rewriting and context encoding yield clear improvements for both sparse and dense retrievers; hybrid setups combining query rewrite with powerful retrievers lead in benchmarks (Li et al., 28 Feb 2025, Katsis et al., 7 Jan 2025).
  • Noisy or imprecise retrieval degrades answer quality; reference+retrieval settings help disaggregate retriever vs. generator errors (Katsis et al., 7 Jan 2025, Li et al., 28 Feb 2025).
  • Multi-modal and vision-text systems exhibit further degradation under low-quality image conditions, egocentric perspectives, and high entity tailness (Wang et al., 30 Oct 2025).
  • Human evaluation indicates LLMs produce natural and appropriate responses, but faithfulness and completeness lag without gold retrieval (Katsis et al., 7 Jan 2025).

6. Advanced Methodologies: Agentic, RL, and Dynamic Context Approaches

  • Multi-agent orchestration (e.g., MA-RAG): Deploys specialized agents for planning, query definition, evidence extraction, and synthesis, using chain-of-thought prompting and intermediate confidence scoring. This modular approach achieves out-of-the-box robustness without fine-tuning, and demonstrates compositional scalability (Nguyen et al., 26 May 2025).
  • Reinforcement learning for retrieval and planning (IM-RAG, Q-RAG, Auto-RAG): Trains policy modules (Questioner, Embedder) to maximize coverage of supporting evidence across multiple rounds, optionally using mid-step progress rewards (Yang et al., 15 May 2024, Sorokin et al., 10 Nov 2025, Yu et al., 29 Nov 2024). RL-based methods significantly boost F1 on multi-hop benchmarks and allow efficient decoupling of retriever optimization from the LLM generator.
  • Dynamic historical context integration (DH-RAG, DCT): Maintains a relevance- and recency-weighted dynamic memory of prior queries, passages, and responses; applies attention-based fusion with static retrieval; and prunes/weights memory to manage the context window budget (Zhang et al., 19 Feb 2025, Soni et al., 5 Jun 2025). A minimal sketch of this weighting appears after this list.
  • Planning over logic graphs (LevelRAG): Decomposes questions into atomic queries, applies multi-hop hybrid retrieval via independent sparse/dense/web operators, and incrementally supplements evidence until verification criteria are met (Zhang et al., 25 Feb 2025).
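To illustrate the flavor of dynamic history weighting, the sketch below scores stored turns by embedding similarity to the current query multiplied by an exponential recency decay, then keeps the top entries within a token budget. The scoring form, decay constant, and data layout are illustrative assumptions, not the exact formulation of DH-RAG or DCT.

```python
import math

def select_dynamic_context(memory, query_vec, similarity, current_turn,
                           budget_tokens=1024, decay=0.7):
    """Pick prior (query, passage, response) entries by relevance x recency under a token budget.

    memory: list of dicts with keys 'vec', 'text', 'turn', 'n_tokens' (assumed layout).
    similarity: callable mapping (query_vec, entry_vec) -> float, e.g. cosine similarity.
    """
    scored = []
    for entry in memory:
        relevance = similarity(query_vec, entry["vec"])
        recency = math.exp(-decay * (current_turn - entry["turn"]))
        scored.append((relevance * recency, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    selected, used = [], 0
    for score, entry in scored:
        if used + entry["n_tokens"] > budget_tokens:
            continue  # prune entries that would exceed the context window budget
        selected.append(entry["text"])
        used += entry["n_tokens"]
    return selected
```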

7. Open Problems and Prospects for Future Research

Published benchmarks and ablation studies converge on several enduring challenges and directions:

  • Dynamic context management: Further advances are needed for scaling memory, minimizing context pollution, and efficiently representing dialogue for retrieval and generation under tight context budgets (Soni et al., 5 Jun 2025, Zhang et al., 19 Feb 2025).
  • Hallucination and abstention: Improved answerability classifiers—incorporated as "IDK" judges or faithfulness scorers—are required to increase model reliability on unanswerable or ambiguous queries (Katsis et al., 7 Jan 2025).
  • Unified retriever–generator optimization: Current systems often treat retrieval and generation as distinct components; joint or RL-based optimization remains underexplored (Cheng et al., 30 Oct 2024).
  • Domain and modality transfer: Transferring multi-turn RAG to new domains (legal, technical, egocentric visual), complex toolflows, and cross-lingual/multimodal environments reveals bottlenecks in both retrieval and generative grounding (Li et al., 28 Feb 2025, Wang et al., 30 Oct 2025).
  • Evaluation scalability: Human evaluation is highly reliable (FANC scoring ≥90% agreement) but not scalable; automated LLM-Judge and reference-less metrics are critical for progress (Katsis et al., 7 Jan 2025, Li et al., 28 Feb 2025).
  • Compositional multi-agent and hybrid systems: Modular, agent-based orchestration enables fine-grained reasoning control and interpretability but presents new interface and efficiency bottlenecks (Nguyen et al., 26 May 2025, Zhang et al., 25 Feb 2025).

A plausible implication is that scalable multi-turn RAG progress will be driven by advances in dynamic context handling, multi-hop/agentic integration, unified retriever–generator co-optimization, and formal benchmarking across richer modalities and domains.

