Adaptive Multi-Agent Response Refinement in Conversational Systems (2511.08319v1)

Published 11 Nov 2025 in cs.CL, cs.AI, and cs.MA

Abstract: LLMs have demonstrated remarkable success in conversational systems by generating human-like responses. However, they can fall short, especially when required to account for personalization or specific knowledge. In real-life settings, it is impractical to rely on users to detect these errors and request a new response. One way to address this problem is to refine the response before returning it to the user. While existing approaches focus on refining responses within a single LLM, this method struggles to consider diverse aspects needed for effective conversations. In this work, we propose refining responses through a multi-agent framework, where each agent is assigned a specific role for each aspect. We focus on three key aspects crucial to conversational quality: factuality, personalization, and coherence. Each agent is responsible for reviewing and refining one of these aspects, and their feedback is then merged to improve the overall response. To enhance collaboration among them, we introduce a dynamic communication strategy. Instead of following a fixed sequence of agents, our approach adaptively selects and coordinates the most relevant agents based on the specific requirements of each query. We validate our framework on challenging conversational datasets, demonstrating that ours significantly outperforms relevant baselines, particularly in tasks involving knowledge or user's persona, or both.

Summary

The paper introduces MARA, a framework that employs dynamic multi-agent refinement to enhance factuality, personalization, and coherence in conversational systems.
MARA uses a planner agent to adaptively select and sequence specialized agents for fact-checking, persona alignment, and discourse coherence.
Experimental results demonstrate that MARA significantly outperforms single-agent and static multi-agent approaches on benchmarks like PersonaChat and FoCus.

Overview and Motivation

The paper introduces MARA (Multi-Agent Refinement with Adaptive agent selection), a modular framework to improve response quality in LLM-driven conversational systems by employing multiple specialized agents for post-generation refinement. The approach is motivated by the limited efficacy of single-agent self-refinement: previous work demonstrates that an LLM refining its own output tends to amplify initial biases and struggles to holistically address complex criteria such as factuality, personalization, and discourse-level coherence. MARA decomposes response assessment and revision across three dedicated agents—fact-checking, persona alignment, and coherence—coordinated dynamically according to conversational context via a planner agent.

Figure 1: Illustration of single-agent failure modes and the modular collaborative refinement process in MARA, showing agent specialization and dynamic sequencing.

Methodology

Formalization

Let $r^i$ denote the response to user query $q^i$, generated by an LLM conditioned on the preceding context. In MARA, this initial response then passes through a refinement pipeline built from three agents:

Fact-Refining Agent ($A_{\text{fact}}$): Ensures factual grounding, mitigates hallucinations, and verifies explicit knowledge requirements.
Persona-Refining Agent ($A_{\text{persona}}$): Aligns responses with user profiles, preferences, and conversational style.
Coherence-Refining Agent ($A_{\text{coherence}}$): Enforces logical and discourse consistency spanning multi-turn exchanges.

Each agent is an LLM instance that receives role-specific instructions through prompting, with no task-specific training. The agents can collaborate simultaneously or sequentially, and sequential collaboration can be static or dynamic:

Simultaneous Aggregation: All agents independently refine the response in parallel, feeding their outputs to a "finalizer" agent.
Static Sequencing: Agents operate in a predetermined order (e.g., $A_{\text{fact}} \to A_{\text{coherence}} \to A_{\text{persona}}$), each building on the previous outcome.
Dynamic Sequencing (MARA): A planner agent ($A_{\text{planner}}$) analyzes the query and initial output, then adaptively selects relevant agents and their optimal order, with stepwise justifications informing each refinement stage.

The planner’s decisions are conditioned on the specific requirements of each conversational turn, so different queries, even within the same conversation, can invoke different refinement strategies.
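The three collaboration strategies can be contrasted in a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the `LLM` and `Planner` callables, the `ROLES` instructions, the prompt layout, and the `make_stub_llm` helper are all hypothetical names introduced here.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical abstraction: an "LLM" is any function from prompt text to completion text.
LLM = Callable[[str], str]
# Hypothetical planner: returns (agent_name, rationale) pairs in execution order.
Planner = Callable[[str, str], Iterable[Tuple[str, str]]]

# Illustrative role instructions (assumed wording, not the paper's prompts).
ROLES = {
    "fact": "Check factual grounding and correct unsupported claims.",
    "persona": "Align the response with the user's persona and preferences.",
    "coherence": "Keep the response consistent with the conversation so far.",
}


def refine(llm: LLM, role: str, query: str, response: str, note: str = "") -> str:
    """Ask one role-specific agent to revise the current response."""
    prompt = (
        f"Role: {ROLES[role]}\n"
        f"Query: {query}\n"
        f"Current response: {response}\n"
        f"Planner rationale: {note}"
    )
    return llm(prompt)


def simultaneous(llm: LLM, query: str, response: str) -> str:
    """Every agent revises the same initial response; a finalizer call merges the drafts."""
    drafts = [refine(llm, role, query, response) for role in ROLES]
    return llm("Merge these drafts into one response:\n" + "\n".join(drafts))


def static_sequence(llm: LLM, query: str, response: str,
                    order: Tuple[str, ...] = ("fact", "coherence", "persona")) -> str:
    """Agents run in a fixed order, each revising the previous agent's output."""
    for role in order:
        response = refine(llm, role, query, response)
    return response


def dynamic_sequence(llm: LLM, planner: Planner, query: str, response: str) -> str:
    """The planner chooses which agents run, in what order, and passes a rationale along."""
    for role, rationale in planner(query, response):
        response = refine(llm, role, query, response, note=rationale)
    return response


def make_stub_llm():
    """Return a fake LLM for dry runs, plus the list of prompts it has received."""
    log: List[str] = []

    def llm(prompt: str) -> str:
        log.append(prompt)
        return f"draft-{len(log)}"

    return llm, log
```

With the stub in place of a real model, the sketch makes the structural difference visible: the simultaneous variant always shows the initial response to every agent, while the sequential variants feed each agent the previous agent's revision.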

Implementation

Base LLM: Claude Sonnet 3 or 3.5, with additional evaluation on GPT-4o-mini and LLaMA 3.1 (8B and 70B) for cross-model validation.
Modular prompts: Each agent is instantiated with a template detailing its responsibilities and evaluation criteria.
Planner agent: Receives the query, conversation history, and initial response, then outputs the selected agents and their order, accompanied by a rationale for each step.
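As a hedged illustration of the planner's interface, the sketch below assumes the planner is asked to answer with a JSON list of steps. The JSON schema, the `PlanStep` type, and the functions `build_planner_prompt` and `parse_plan` are hypothetical; the summary does not specify the planner's actual output format. The rule that each agent appears at most once is likewise an assumption made here.

```python
import json
from dataclasses import dataclass
from typing import List

# Agent names assumed for this sketch.
KNOWN_AGENTS = {"fact", "persona", "coherence"}


@dataclass(frozen=True)
class PlanStep:
    agent: str      # which refining agent to run
    rationale: str  # the planner's justification, passed on to that agent


def build_planner_prompt(query: str, history: List[str], initial_response: str) -> str:
    """Assemble the planner input from the three pieces named in the text."""
    return (
        "Conversation history:\n" + "\n".join(history) + "\n"
        f"Query: {query}\n"
        f"Initial response: {initial_response}\n"
        'Answer with a JSON list such as [{"agent": "fact", "rationale": "..."}], '
        "using only the agents fact, persona, coherence, in the order they should run."
    )


def parse_plan(raw: str) -> List[PlanStep]:
    """Parse planner output defensively; malformed output yields an empty plan."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    steps: List[PlanStep] = []
    seen = set()
    for item in items:
        if not isinstance(item, dict):
            continue
        agent = str(item.get("agent", "")).strip().lower()
        # Skip unknown agents and repeats (assumption: each agent runs at most once).
        if agent in KNOWN_AGENTS and agent not in seen:
            seen.add(agent)
            steps.append(PlanStep(agent, str(item.get("rationale", ""))))
    return steps
```

Returning an empty plan on malformed output means the initial response passes through unchanged, which is one conservative fallback among several possible ones.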

Experimental Results

Comprehensive evaluations are conducted on datasets representing distinct conversational demands:

PersonaChat: User persona alignment.
INSCIT: Fact grounding via Wikipedia context.
FoCus: Joint persona and knowledge requirements.

MARA consistently outperforms both single-agent (Self-Refine, SPP) and alternative multi-agent (LLMvLLM, MADR, MultiDebate) baselines across coherence, groundedness, naturalness, and engagingness metrics, as measured by G-Eval and further validated through human annotation:

FoCus: MARA achieves an overall score of 74.51, compared with 60.47 for SPP (single-agent) and 54.81 for MultiDebate (multi-agent).
PersonaChat: MARA’s overall score is 62.00, ahead of Self-Refine (58.41) and far ahead of MADR (23.21).
Agreement between G-Eval and human ratings is especially strong for engagingness and weaker for "naturalness," reflecting the metric's limitations in modeling conversational subtlety.
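The margins implied by the reported overall scores can be worked out directly. The short sketch below uses only the numbers quoted above; the `REPORTED` table and the `margins` helper are names introduced here for illustration.

```python
# Overall scores exactly as quoted in the text above.
REPORTED = {
    "FoCus": {"MARA": 74.51, "SPP": 60.47, "MultiDebate": 54.81},
    "PersonaChat": {"MARA": 62.00, "Self-Refine": 58.41, "MADR": 23.21},
}


def margins(scores: dict, system: str = "MARA") -> dict:
    """Absolute score gap between `system` and every other entry, rounded to 2 places."""
    ours = scores[system]
    return {name: round(ours - value, 2) for name, value in scores.items() if name != system}


if __name__ == "__main__":
    for dataset, scores in REPORTED.items():
        print(dataset, margins(scores))
```

On these figures the gap to the strongest quoted baseline is much larger on FoCus (14.04 points over SPP) than on PersonaChat (3.59 points over Self-Refine).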

Ablation studies further demonstrate:

Each refining agent contributes distinct, non-redundant gains, and deploying the agents jointly yields larger improvements. By contrast, setups that rely on a single agent or apply an agent iteratively show error amplification.
Dynamic agent selection by the planner is essential. Fixed or random orderings degrade performance, while the remaining gap to an ideal (oracle) agent selection shows potential for future optimization.

Figure 2: Refining agent selection patterns across datasets, revealing context-sensitive planner adaptation to persona and factuality requirements.
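To make the comparison against an oracle concrete, the sketch below enumerates every plan a planner could choose and picks the one that a scoring function rates highest. This is one plausible way to define an oracle, not necessarily the paper's: it assumes each agent runs at most once, that the empty plan (no refinement) is allowed, and that a `score` function over plans is available. The names `candidate_plans` and `oracle_plan` are hypothetical.

```python
from itertools import permutations
from typing import Callable, List, Tuple

AGENTS = ("fact", "persona", "coherence")
Plan = Tuple[str, ...]


def candidate_plans(agents: Tuple[str, ...] = AGENTS) -> List[Plan]:
    """All ordered selections of distinct agents, including the empty plan."""
    plans: List[Plan] = [()]
    for k in range(1, len(agents) + 1):
        plans.extend(permutations(agents, k))
    return plans


def oracle_plan(score: Callable[[Plan], float],
                agents: Tuple[str, ...] = AGENTS) -> Plan:
    """Return the plan with the highest score (the first such plan on ties)."""
    return max(candidate_plans(agents), key=score)
```

With three agents there are 1 + 3 + 6 + 6 = 16 candidate plans, so exhaustive search is cheap to enumerate; the expensive part in practice would be running and scoring each plan.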

Discussion

Trade-Offs and Resource Implications

Computational Overhead: The multi-agent approach increases the number of model invocations per response, and efficiency depends largely on the planner's design and on agent specialization. A smarter planner and lighter-weight agents are suggested as optimizations for scalable deployment.
Model Specialization: Assigning distinct LLMs with strengths in factuality or discourse coherence further enhances composite performance beyond single-model setups.
Robustness: MARA generalizes across underlying LLM architectures and scales to domain-specific contexts (e.g., Ubuntu Dialogue Corpus), augmenting outputs even for already powerful models.
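A rough way to reason about the overhead is to count LLM calls per refined response. The sketch below is a back-of-the-envelope estimate under assumptions that the summary does not confirm: each agent, the planner, and the finalizer cost exactly one call, the dynamic strategy uses no finalizer, and the call that produces the initial response is excluded. The function name `refinement_calls` is hypothetical.

```python
from typing import Optional


def refinement_calls(strategy: str, n_agents: int = 3,
                     n_selected: Optional[int] = None) -> int:
    """Estimated LLM calls spent on refinement for one response (assumptions in lead-in)."""
    if strategy == "simultaneous":
        return n_agents + 1          # every agent, plus one finalizer call
    if strategy == "static":
        return n_agents              # every agent, in a fixed order
    if strategy == "dynamic":
        k = n_agents if n_selected is None else n_selected
        if not 0 <= k <= n_agents:
            raise ValueError("n_selected must be between 0 and n_agents")
        return 1 + k                 # one planner call, plus the selected agents
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Under these assumptions, dynamic sequencing is cheaper than static sequencing only when the planner selects fewer than `n_agents - 1` agents, since the planner call itself must be paid for on every turn.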

Limitations

The planner is currently an LLM prompted without any supervised training. Supervised fine-tuning on agent selection and orchestration data could further improve its performance.
G-Eval’s limitations call for broader evaluation, including metrics for social bias, safety, and user engagement beyond the standardized quality components.

Implications and Future Directions

The explicit division of revision responsibilities among specialized agents enables modular system design, yielding enhanced explainability, fine-grained control, and potential for plug-and-play incorporation of external tools (e.g., RAG). This paradigm is extensible to scenarios demanding not only factual and persona alignment but also safety, style adaptation, and domain-specific requirements. Further improvements in orchestration algorithms (planner agent) and agent specialization (fine-tuned or tool-augmented LLMs) are anticipated to drive practical deployment for real-world conversational AI systems.

Case Studies

The included qualitative analyses highlight MARA's strengths. Where SPP or MADR may fail to recognize context, hallucinate facts, or miss alignment with user interests, MARA delivers responses that integrate precise factual details and engage the user through tailored follow-up questions and personalized narrative, while maintaining coherence across turns.

Conclusion

Adaptive Multi-Agent Response Refinement establishes a robust framework for conversational LLMs to generate quality-controlled, contextually aware, and personalized outputs. Modular agent specialization and dynamic planning yield substantial improvements over monolithic and static multi-agent approaches. The findings strongly support the utility of MARA for systems requiring nuanced conversational quality and open avenues for scalable multi-agent orchestration, tighter integration of domain-specific modules, and research into supervised planner optimization.