MARA: Adaptive Multi-Agent Response Refinement
- MARA is a zero-shot multi-agent framework that dynamically refines conversational responses using adaptive, per-query planning with specialized LLM agents.
- It orchestrates agents such as planner, fact-, persona-, and coherence-refiners to improve response quality by addressing factuality, personalization, and coherence.
- Empirical results show MARA significantly outperforms fixed-pipeline and self-refinement approaches on benchmarks like FoCus and PersonaChat, demonstrating robust conversational enhancement.
MARA (Adaptive Multi-Agent Response Refinement) is a zero-shot, multi-agent framework for conversational response refinement in LLM-based systems. It addresses the challenge of integrating multiple conversational quality dimensions (factuality, personalization, and coherence) by orchestrating aspect-specialized agents under adaptive, per-query planning. For each user query, MARA dynamically selects and orders a subset of these agents, improving over both single-agent self-refinement and fixed-pipeline multi-agent approaches by tailoring the refinement process to context-specific needs, thereby better aligning responses with user profiles and external knowledge sources (Jeong et al., 11 Nov 2025).
1. Formal Problem Setting and Motivation
In conversational systems, LLMs generate responses based on the user’s current query, dialogue history, and potentially external user or knowledge context. Despite strong in-domain performance, standard approaches frequently exhibit deficits in factual accuracy, personalization (user persona alignment), and dialogue coherence—deficits that are not reliably identified or corrected on-the-fly. The practical infeasibility of repeated user-in-the-loop refinement motivates automated, preemptive improvement of responses prior to user exposure.
Let $Q$ be the user's query at the current turn, $H$ the conversation history up to that turn, $U$ a user profile/persona, and, optionally, $F$ a grounding fact (e.g., a Wikipedia snippet). The initial LLM responds with $R_0$. The refinement objective is to generate $R^*$ maximizing an aggregate conversation quality metric,

$$R^* = \arg\max_{R}\ \big[\, s_{\mathrm{coh}}(R) + s_{\mathrm{grd}}(R) + s_{\mathrm{nat}}(R) + s_{\mathrm{eng}}(R) \,\big],$$

where $s_{\mathrm{coh}}$, $s_{\mathrm{grd}}$, $s_{\mathrm{nat}}$, and $s_{\mathrm{eng}}$ score coherence, groundedness, naturalness, and engagingness, respectively, as assessed by G-Eval-style LLM-based evaluation (Jeong et al., 11 Nov 2025).
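As a minimal sketch of how such an aggregate metric can be computed, the following averages the four G-Eval-style aspect scores and normalizes the result; the function name, the 1–5 score scale, and equal weighting are illustrative assumptions, not details taken from the paper:

```python
# Hedged sketch: fold four per-aspect G-Eval-style scores into one
# normalized overall score. Scale and weights are assumptions.

def overall_score(scores: dict) -> float:
    """Average four aspect scores (assumed on a 1-5 scale) and map to [0, 100]."""
    aspects = ("coherence", "groundedness", "naturalness", "engagingness")
    mean = sum(scores[a] for a in aspects) / len(aspects)
    return (mean - 1) / 4 * 100  # map 1..5 onto 0..100

print(overall_score({"coherence": 4.0, "groundedness": 4.5,
                     "naturalness": 4.0, "engagingness": 3.5}))  # 75.0
```

In practice each aspect score would come from an LLM judge rather than a fixed dictionary; the normalization here simply makes scores comparable across datasets.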
2. System Components and Agent Roles
MARA comprises five prompt-driven LLM agents, each associated with a specialized refinement aspect and prompt template:
- Responding Agent: Produces the initial response $R_0$ using context $(Q, H)$.
- Planner Agent: Adapts the agent-invocation workflow to each query. Consumes $(Q, H, U, F, R_0)$, emits a sequence $S = [a_1, \dots, a_k]$ over the refining agents, plus a justification for each.
- Fact-Refining Agent ($a_{\mathrm{fact}}$): Validates and edits the response for consistency with $F$, outputting both a verification and a hallucination-corrected revision.
- Persona-Refining Agent ($a_{\mathrm{per}}$): Ensures alignment of the response with user profile $U$, integrating personal interests or rejecting misalignment.
- Coherence-Refining Agent ($a_{\mathrm{coh}}$): Enforces logical continuity within the ongoing dialogue, correcting the response to better respect conversational context.
Each refining agent operates by applying an LLM to a dedicated prompt that encodes its aspect-specific constraints, with access to all necessary contextual features and the preceding agent’s output.
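To make the prompt-driven design concrete, here is a hedged sketch of what an aspect-specific prompt might look like. The exact wording of MARA's prompts is not reproduced in this summary; the template text and the field names (`query`, `history`, `fact`, `rationale`, `draft`) are illustrative assumptions:

```python
# Hedged sketch of an aspect-specific refining prompt. The template
# wording and field names are assumptions, not MARA's actual prompts.

FACT_REFINE_TEMPLATE = """You are a fact-refinement agent.
Check the draft response against the grounding fact and rewrite it so
that every claim is supported. Report any hallucination you corrected.

Query: {query}
Dialogue history: {history}
Grounding fact: {fact}
Planner rationale: {rationale}
Draft response: {draft}

Verified and corrected response:"""

def build_fact_prompt(query: str, history: str, fact: str,
                      rationale: str, draft: str) -> str:
    """Fill the fact-refinement template with the current context."""
    return FACT_REFINE_TEMPLATE.format(query=query, history=history,
                                       fact=fact, rationale=rationale,
                                       draft=draft)
```

The persona- and coherence-refining agents would use analogous templates, swapping the grounding fact for the user profile or the dialogue-continuity constraint.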
3. Adaptive Planning and Dynamic Communication
The planner agent governs the adaptive composition and sequencing of refinement agents, producing a query-specific plan with rationales. The communication protocol is as follows:
```text
1. R_curr ← RespondAgent(Q, H)
2. plan ← LLM(𝒫_planner(Q, H, U, F, R_curr))
3. parse plan → sequence S = [a₁, …, a_k] and justifications J₁, …, J_k
4. for t = 1 … k:
       R_curr ← LLM(𝒫_{a_t}(Q, H, U, F, R_curr, J_t))
5. Output R* = R_curr
```
In this framework, only those agents relevant to the context, as determined by the planner, are invoked, and their order is flexibly adjusted. This dynamic approach contrasts with both static and simultaneous communication baselines, in which agent invocation order is fixed or all agents are run in parallel with results aggregated.
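The protocol above can be sketched as runnable Python, with stub callables standing in for the LLM calls. The planner heuristic, agent names, and plan format here are assumptions for illustration; in MARA every step is a zero-shot LLM call on the corresponding aspect prompt:

```python
# Runnable sketch of MARA's dynamic communication protocol with stub
# agents in place of LLM calls. All agent logic here is illustrative.

def respond(query: str, history: str) -> str:
    """Stub responding agent: produces the initial draft R_0."""
    return f"draft answer to: {query}"

def plan(query, history, profile, fact, draft):
    """Stub planner: invoke only agents relevant to the available context."""
    chosen = []
    if fact:
        chosen.append("fact")        # grounding fact present -> fact-check
    if profile:
        chosen.append("persona")     # persona present -> personalize
    chosen.append("coherence")       # always enforce dialogue continuity
    return chosen

REFINERS = {                         # stub refining agents
    "fact": lambda r: r + " [fact-checked]",
    "persona": lambda r: r + " [personalized]",
    "coherence": lambda r: r + " [coherent]",
}

def mara_refine(query: str, history: str, profile=None, fact=None) -> str:
    """Apply the planner's agent sequence to the draft, one agent at a time."""
    r_curr = respond(query, history)
    for agent in plan(query, history, profile, fact, r_curr):
        r_curr = REFINERS[agent](r_curr)   # sequential feedback merge
    return r_curr

print(mara_refine("Who wrote Hamlet?", "",
                  profile="likes theatre",
                  fact="Hamlet was written by Shakespeare."))
```

Note how the plan, and hence both the subset and order of refiners, changes with the available context: a query without a grounding fact skips the fact agent entirely.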
4. Feedback Merging and Refinement Mechanism
MARA employs a sequential feedback merge, applying each agent in turn: $R_t = a_t(Q, H, U, F, R_{t-1}, J_t)$, with each agent $a_t$ taking the preceding output $R_{t-1}$ as input. An alternative "simultaneous" fusion, in which all agents operate in parallel on $R_0$ and a finalizing agent merges their outputs, was empirically inferior, yielding lower aggregate conversational quality and greater integration complexity.
No cross-agent negotiation is performed: the output of each stage is the sole input to the next. This approach, while straightforward, was shown to yield both practical and empirical benefits in overall response integration.
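Because each stage consumes only its predecessor's output, the sequential merge is simply a fold over the agent sequence. A minimal sketch (the agent callables here are illustrative stubs, not MARA's agents):

```python
# The sequential feedback merge as a fold: each agent sees only the
# previous agent's output, with no cross-agent negotiation.

from functools import reduce

def sequential_merge(initial: str, agents) -> str:
    """Thread the response through the agent list, left to right."""
    return reduce(lambda resp, agent: agent(resp), agents, initial)

steps = [
    lambda r: r.strip(),        # stub "cleanup" agent
    lambda r: r.capitalize(),   # stub "style" agent
    lambda r: r + ".",          # stub "finishing" agent
]
print(sequential_merge("  the answer is 42  ", steps))  # The answer is 42.
```

The fold structure makes the design choice explicit: agent order matters, which is exactly the degree of freedom the planner exploits.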
5. Training and Inference Protocols
All agents in MARA, including the planner, are implemented as zero-shot LLMs with hand-crafted prompts. No gradient-based fine-tuning occurs. Inference requires $k+2$ LLM invocations per query: one responding call, one planner call, and $k$ refiner calls for a plan of length $k$. The framework is model-agnostic: for each task, agent LLMs can be replaced or specialized (e.g., higher-capacity models for fact agents).
Evaluation leverages G-Eval scores (with GPT-4o mini as the judge) on multiple axes, with human evaluations further verifying the LLM-based assessments.
6. Empirical Results and Ablative Analysis
MARA was evaluated on several benchmarks, including PersonaChat (persona alignment), INSCIT (knowledge grounding), FoCus (persona + fact), PRODIGy, and Ubuntu Dialogue. Performance metrics centered on G-Eval’s normalized overall score (integrating coherence, groundedness, naturalness, engagingness).
| Dataset | No Refine | Self-Refine | SPP | MultiDebate | MARA |
|---|---|---|---|---|---|
| FoCus | 56.7 | 47.1 | 60.5 | 54.8 | 74.5 |

On PersonaChat and INSCIT, ablations show that MARA's specialized-agent variants consistently outperform single-aspect or static-sequence versions.
- On FoCus, MARA achieves roughly +14 points over the best baseline (SPP, 74.5 vs. 60.5) and is favored by humans (82.9 vs. 65.5 for the strongest competitor).
- Single-aspect ablations (“fact only,” “persona only,” etc.) yield lower scores (62.3–68.8), and composing all three in sequence (without dynamic planning) still underperforms full MARA.
- Planner ablations: random (56.3), actual MARA (72.6), oracle (81.0).
- Communication ablations: fixed order (60–64), simultaneous (60.5), dynamic (74.4).
7. Limitations, Open Challenges, and Prospects
MARA’s primary limitations include sub-optimality of the planner agent in the absence of explicit supervision (the gap to oracle planning remains significant), and the resource costs introduced by multiple LLM agent calls per round. The paper recommends fine-tuning or reinforcement learning for planner optimization, exploration of distilled/lightweight agents, and integration of retrieval-augmented modules as future directions. The sequential nature of feedback merging, while empirically effective, could be sub-optimal in scenarios requiring more complex cross-aspect negotiation or interactive feedback.
8. Comparison to Related Adaptive Multi-Agent Frameworks
Compared to frameworks in adjacent domains, such as TCAndon-Router (TCAR) for multi-agent collaboration in enterprise ITSM (Zhao et al., 8 Jan 2026) or MAO-ARAG for adaptive RAG in QA (Chen et al., 1 Aug 2025), MARA is unique in its focus on conversational quality axes and dynamic per-query planning with per-aspect refinement. Whereas TCAR resolves multi-label agent conflicts via a downstream refining agent, and MAO-ARAG uses reinforcement-learning-trained planners to allocate executor agents for cost/accuracy trade-offs, MARA foregrounds agent specialization for conversational aspects and a human-interpretable planner rationale. A plausible implication is that the core MARA paradigm—dynamic planner-driven aspect-specialized response refinement—can generalize beyond dialogue to complex multi-dimensional response synthesis tasks in other LLM application domains.