Multi-Agent Critique & Revision

Updated 2 December 2025
  • Multi-agent Critique and Revision is a collaborative framework where diverse agents provide structured feedback and iterative refinements to improve output quality.
  • The methodology employs specialized agent roles, defined communication protocols, and decision aggregation strategies to mitigate issues like hallucination and error propagation.
  • Empirical studies show that limited critique-revision cycles with optimally diverse agents yield significant gains in accuracy, faithfulness, and robustness.

Multi-agent critique and revision refers to a class of collaborative computational frameworks in which multiple autonomous agents—typically LLMs, task-specific expert models, or hybrid systems—engage in structured cycles of mutual evaluation, critical feedback, and iterative refinement to improve the quality, reliability, or faithfulness of generated outputs. These frameworks are motivated by limitations of single-agent generation, such as hallucination, myopic reasoning, lack of robustness, and insufficiently diverse perspectives. Multi-agent critique and revision systems draw on principles from distributed problem-solving, peer review, debate, and hierarchical review workflows. Empirical studies demonstrate that such frameworks support error detection, factual correction, quality gains, and domain adaptability beyond those achievable by naive agent ensembles or self-refinement approaches.

1. Foundational Patterns in Multi-Agent Critique and Revision

Multi-agent critique and revision mechanisms can be classified along several axes: agent specialization, communication protocol, decision aggregation, and workflow depth.

Agent Specialization:

Frameworks often instantiate heterogeneous agent pools, each with specialized roles. For example, in conversation systems, separate Fact, Persona, and Coherence agents are orchestrated by a Planner agent that sequences or merges their critiques and revisions as needed (Jeong et al., 11 Nov 2025). In scientific computing, a Consultant (problem expansion), Programmer (code generation), and Reviewer (debugging via runtime audit) collaborate in a loop that incorporates both domain augmentation and technical scrutiny (Cheng et al., 28 Aug 2025). Some frameworks, such as Table-Critic for table reasoning, further emphasize a modular division of labor by introducing separate Judge, Critic, Refiner, and Curator agents, with the Curator managing a dynamic “template tree” of critique patterns for reuse and growth (Yu et al., 17 Feb 2025).

Communication Protocols:

Protocols include round-table debate (Wynn et al., 5 Sep 2025), role-based peer review (Xu et al., 2023), hierarchical review aggregation (e.g., MARS (Wang et al., 24 Sep 2025)), and asynchronous threaded discussion as in research ideation (Perspectra (Liu et al., 24 Sep 2025, Ueda et al., 11 Jul 2025)). Key design variables include whether agents see only each other’s critiques (not solutions), the use of confidence-weighted or majority-vote aggregation, and whether a meta-reviewer performs conflict resolution. Some systems eschew agent–agent communication in favor of Reviewer–Author–MetaReviewer chains for efficiency (Wang et al., 24 Sep 2025). Others employ explicit locution schemas—ISSUE, CLAIM, SUPPORT, REBUT, QUESTION—to structure agent discourse and enhance adversarial exploration (Liu et al., 24 Sep 2025).
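The explicit locution schema can be made concrete with a small data model. The following sketch is illustrative only: the class and field names are assumptions, not Perspectra's actual implementation, but it shows one way structured, threaded agent discourse can be represented and filtered.

from dataclasses import dataclass
from enum import Enum, auto


class Locution(Enum):
    """Speech-act types used to structure agent discourse (Liu et al., 24 Sep 2025)."""
    ISSUE = auto()
    CLAIM = auto()
    SUPPORT = auto()
    REBUT = auto()
    QUESTION = auto()


@dataclass
class Utterance:
    """One structured message in an asynchronous, threaded discussion (names are illustrative)."""
    author: str            # agent identifier, e.g. "Critic-2"
    locution: Locution     # which speech act this message performs
    content: str           # natural-language body of the message
    reply_to: int | None   # index of the parent utterance; None for thread roots


def rebuttals_of(thread: list[Utterance], claim_index: int) -> list[Utterance]:
    """Collect all REBUT messages attached to a given CLAIM, e.g. for a meta-reviewer."""
    return [u for u in thread
            if u.locution is Locution.REBUT and u.reply_to == claim_index]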

Aggregation and Loop Termination:

Decision aggregation strategies include majority vote (default in many revision loops), confidence-weighted voting (reviewer feedback with scalar trust levels (Xu et al., 2023)), or acceptance by a meta-reviewer based on a threshold of error or inadequacy (Wang et al., 24 Sep 2025, Srinivas et al., 21 Sep 2024). Termination criteria are often empirical, such as consensus, satisfaction of correctness checks, or a bounded number of refinement rounds.
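For illustration, the two most common aggregation strategies reduce to a few lines of code. This is a hedged sketch: real systems additionally normalize confidence scores and extract canonical answer strings before voting.

from collections import Counter, defaultdict


def majority_vote(answers: list[str]) -> str:
    """Return the most common revised answer; ties break toward the first one seen."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(answers: list[str], confidences: list[float]) -> str:
    """Sum scalar confidence per distinct answer and return the heaviest one."""
    weights: dict[str, float] = defaultdict(float)
    for answer, conf in zip(answers, confidences):
        weights[answer] += conf
    return max(weights, key=weights.get)


# Example: three agents agree, one dissents; a single high-confidence dissenter can win
# under confidence weighting but not under plain majority vote.
print(majority_vote(["42", "42", "41", "42"]))               # "42"
print(confidence_weighted_vote(["42", "41"], [0.55, 0.95]))  # "41"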

Workflow Depth and Iteration:

Studies consistently find that one or two critique–revision cycles suffice to realize most gains, with performance benefits saturating or even degrading at higher depths due to over-correction or echo-chamber dynamics (Xu et al., 2023, Yu et al., 17 Feb 2025, Ueda et al., 11 Jul 2025).

2. Formal Models and Canonical Algorithms

The core structure of these systems can be expressed using tuple-based workflows and explicit pseudocode. A general three-phase pattern is commonly observed:

  1. Initial Generation: Each agent produces an independent output (answer, summary, plan) via a generative model.
  2. Peer Critique: Agents provide critiques (often structured and scored) on the outputs of others, sometimes attaching confidence or severity levels.
  3. Revision: Agents update their outputs using peer feedback; this may be done via re-generation, reranking, or selective modification.

A canonical algorithmic schema for peer-review-style revision (Xu et al., 2023, Wang et al., 24 Sep 2025):

Input: question/input x, agents {A₁, ..., Aₙ}
For each agent Aᵢ:
    Generate sᵢ = (rationaleᵢ, answerᵢ)
For each agent Aᵢ:
    For each agent Aⱼ ≠ Aᵢ:
        Provide feedback f_{ji}, with optional confidence c_{ji}
For each agent Aᵢ:
    Update revised answer aᵢ' using peer feedback {f_{ji}, c_{ji}}
Aggregate {a₁', ..., aₙ'} (e.g., majority vote)
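A minimal Python rendering of this schema is given below. It is an illustrative sketch rather than any paper's reference implementation: call_llm(agent, prompt) is a hypothetical hook for whatever model API each agent wraps, the prompt strings are placeholders, and final-answer extraction and voting are left to the caller.

def peer_review_round(question, agents, call_llm, n_rounds=2):
    """Generate -> critique -> revise for a bounded number of rounds.

    `agents` is a list of agent identifiers (e.g., system prompts or model names);
    `call_llm(agent, prompt)` is a hypothetical hook returning the agent's text reply.
    """
    # Phase 1: independent initial generation (rationale + answer).
    solutions = [call_llm(a, f"Solve step by step, then state a final answer:\n{question}")
                 for a in agents]

    for _ in range(n_rounds):
        # Phase 2: each agent critiques every other agent's solution.
        feedback = {i: [] for i in range(len(agents))}
        for j, critic in enumerate(agents):
            for i, sol in enumerate(solutions):
                if i == j:
                    continue
                feedback[i].append(
                    call_llm(critic,
                             f"Question:\n{question}\n\nPeer solution:\n{sol}\n\n"
                             "Critique this solution and rate your confidence from 0 to 1."))

        # Phase 3: each agent revises its own solution given the collected peer feedback.
        solutions = [
            call_llm(agents[i],
                     f"Question:\n{question}\n\nYour solution:\n{solutions[i]}\n\n"
                     "Peer feedback:\n" + "\n---\n".join(feedback[i]) +
                     "\n\nRevise your solution if the feedback warrants it.")
            for i in range(len(agents))
        ]

    # Aggregation over final answers (e.g., majority vote) is applied by the caller.
    return solutions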

In hierarchical pipelines (e.g., MARS (Wang et al., 24 Sep 2025), PatExpert (Srinivas et al., 21 Sep 2024)), a meta-reviewer integrates reviewer decisions and justifications before authoring revisions, enforcing a clear division between solution generation, critique, and final decision.
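The same hook can express a hierarchical Reviewer–Author–MetaReviewer chain in the style of MARS; the role prompts and the ACCEPT convention below are assumptions for illustration, not the published protocol.

def review_revise(question, author, reviewers, meta_reviewer, call_llm, max_rounds=2):
    """Author drafts; reviewers critique independently; a meta-reviewer decides."""
    draft = call_llm(author, f"Answer the following, showing your reasoning:\n{question}")

    for _ in range(max_rounds):
        # Reviewers see only the author's draft, never each other's critiques.
        reviews = [call_llm(r, f"Question:\n{question}\n\nDraft:\n{draft}\n\n"
                               "List concrete errors, or reply PASS if there are none.")
                   for r in reviewers]

        # The meta-reviewer resolves conflicts among reviews and issues a verdict.
        verdict = call_llm(meta_reviewer,
                           f"Draft:\n{draft}\n\nReviews:\n" + "\n---\n".join(reviews) +
                           "\n\nReply ACCEPT, or summarize the required fixes.")
        if verdict.strip().upper().startswith("ACCEPT"):
            break

        # The author revises against the meta-reviewer's consolidated feedback only.
        draft = call_llm(author, f"Question:\n{question}\n\nYour draft:\n{draft}\n\n"
                                 f"Required fixes:\n{verdict}\n\nProduce a revised answer.")
    return draft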

3. Empirical Effects and Evaluation Methodologies

Multi-agent critique and revision yields substantial improvements in accuracy, factuality, and robustness across diverse domains:

Reasoning Accuracy:

In mathematical and commonsense reasoning tasks, multi-agent peer review outperforms both self-correction and standard debate, with reported gains such as GSM8K: 75.3% (Zero-shot CoT) vs. 83.2% (Peer Review) (Xu et al., 2023). Review-based architectures (MARS) match or exceed the accuracy of round-table debate methods with half the token and compute cost (Wang et al., 24 Sep 2025).

Faithfulness in Generation:

In summarization and long-form QA, joint multi-agent and multi-model refinement (MAMM-Refine) achieves significant end-to-end gains in faithfulness metrics (e.g., MiniCheck +5.3 points; Likert +0.6) compared to single-agent and basic critique-refine pipelines, with rerank framing outperforming iterative generation (Wan et al., 19 Mar 2025). Empirically, the greatest gains are observed when both agent count and agent diversity are optimized, with three collaborators often representing the sweet spot for diversity/feasibility tradeoffs (Ueda et al., 11 Jul 2025, Wan et al., 19 Mar 2025).

Stability and Error Propagation:

Specialized modular frameworks—such as Table-Critic’s division into error detection, critique, and refinement—demonstrate low degradation rates (e.g., correcting 9.6% of errors, degrading only 0.7% of correct answers) (Yu et al., 17 Feb 2025). Iterative review–revision loops in scientific code generation drive execution success rates from ~60% (single pass) to ~85–87% post-review (Cheng et al., 28 Aug 2025).
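Such stability figures reduce to two transition rates over a labelled evaluation set. The helper below is hypothetical and not taken from the cited papers (denominator conventions also vary across studies), but it makes the bookkeeping explicit.

def correction_and_degradation(before: list[bool], after: list[bool]) -> tuple[float, float]:
    """Per-example correctness before and after a critique-revision pass.

    Returns (share of initially wrong answers that were fixed,
             share of initially correct answers that were broken)."""
    wrong_before = [i for i, ok in enumerate(before) if not ok]
    right_before = [i for i, ok in enumerate(before) if ok]
    corrected = sum(after[i] for i in wrong_before)
    degraded = sum(not after[i] for i in right_before)
    return (corrected / len(wrong_before) if wrong_before else 0.0,
            degraded / len(right_before) if right_before else 0.0)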

Critical Thinking and Creativity:

Explicit argumentation structures and persona diversity (Perspectra, Multi-Agent Ideator) boost both the depth of critical thinking (as scored by Bloom–Anderson or similar taxonomies) and the novelty/feasibility balance in ideation tasks (Liu et al., 24 Sep 2025, Ueda et al., 11 Jul 2025).

Evaluation is performed using both objective metrics (accuracy, balanced accuracy, edit progress, execution success rates) and external judgments (LLM-based graders, human Likert scales, weak/strong correctness signals).

4. Theoretical Insights, Limitations, and Failure Modes

Empirical ablations, error analyses, and theoretical arguments inform best practices and unearth key pitfalls:

Positive Synergy of Diversity:

Diversity in reasoning style, persona, or underlying model (multi-model collaboration) enhances collective error detection and correction rates, but large capability gaps among agents can erode synergy—strong agents may “sycophantically” defer to persuasive but incorrect peers in naive debate (Xu et al., 2023, Wynn et al., 5 Sep 2025, Wan et al., 19 Mar 2025).

Failure Modes in Debate:

Unstructured debate (majority voting after argument exchange) may propagate errors via conformity and rhetorical bias, leading to accuracy degradation—even when strong agents are the majority (Wynn et al., 5 Sep 2025). Agreement bias (propensity to switch a correct answer to an incorrect peer answer) typically outweighs correction flips, with quantifiable transition counts (e.g., in 3×LLaMA, 53% correct→incorrect vs. ∼7% incorrect→correct over one round).

Aggregation and Over-correction:

Central meta-reviewers or confidence weighting can mitigate but not eliminate over-correction risks (rejecting correct answers due to overly critical reviews) (Wang et al., 24 Sep 2025). Confidence estimation and credibility tracking remain open challenges.

Resource–Performance Tradeoffs:

Hierarchical, role-decomposed architectures such as MARS scale reviewer count linearly, outperforming all-to-all communication schemes in both efficiency and accuracy at equivalent or lower computational cost (Wang et al., 24 Sep 2025). Gains from additional reviewers or deeper interaction depth saturate after three agents / two rounds of critique–revision (Ueda et al., 11 Jul 2025, Xu et al., 2023).
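The linear-versus-quadratic contrast can be illustrated with a back-of-envelope message count (a simplification of the actual token accounting in MARS): an all-to-all scheme exchanges n(n−1) critiques per round, whereas an author plus n−1 reviewers and one meta-reviewer exchange roughly n.

def critique_messages_per_round(n_agents: int, hierarchical: bool) -> int:
    # All-to-all peer review: every agent critiques every other agent's solution.
    # Hierarchical (MARS-style): (n-1) reviewers file one review each and the
    # meta-reviewer issues one verdict, so the count grows linearly in n.
    if hierarchical:
        return (n_agents - 1) + 1
    return n_agents * (n_agents - 1)


for n in (3, 5, 9):
    print(n, critique_messages_per_round(n, hierarchical=False),
          critique_messages_per_round(n, hierarchical=True))
# 3 agents: 6 vs 3;  5 agents: 20 vs 5;  9 agents: 72 vs 9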

Convergence Guarantees:

While empirical convergence (to consensus or a correct solution) is typically observed after a handful of rounds, few systems provide formal theoretical guarantees. Table-Critic in particular lacks formal error-reduction proofs, though its authors report rapid convergence in practice (Yu et al., 17 Feb 2025).

5. Domain Extensions and Best-Practice Guidelines

Multi-agent critique and revision has generalized to domains including scientific computing (Cheng et al., 28 Aug 2025), code review (Tang et al., 3 Feb 2024), patent analysis (Srinivas et al., 21 Sep 2024), science storytelling (Zhang et al., 3 Mar 2025), and research proposal ideation (Liu et al., 24 Sep 2025, Ueda et al., 11 Jul 2025). Key guidelines emerging from cross-domain studies include:

  • Composition: For generation tasks, maximize diversity among critics, not necessarily among revisers. For reasoning, ensure agents have similar but not identical capabilities.
  • Interaction Protocol: Prefer reranking or voting among finite candidates to open-ended generation for critique and refinement—this ensures computable convergence and minimizes drift (Wan et al., 19 Mar 2025).
  • Iteration Depth and Agent Count: Limit critique–revision cycles to two or three rounds, and use around three agents as a strong cost–quality trade-off; beyond this, resource costs outstrip gains (Ueda et al., 11 Jul 2025). These defaults are collected in the configuration sketch after this list.
  • Aggregation: Use confidence-weighted aggregation or majority vote, but incorporate meta-reviewers or explicit justifications to avoid sycophancy and over-correction.
  • Critique Focus: Encapsulate critique as structured units (Analytical Critique Units with severity and revision suggestion (Lan et al., 20 Oct 2024)) to aid both SFT and RL, as in the MultiCritique pipeline.
  • Transparency and User-in-the-loop: When supporting human users, expose both agent reasoning chains and allow graduated agency in applying revision (RevTogether (Zhang et al., 3 Mar 2025)).
  • Adaptation: Dynamically engage only those agents whose contributions are justified for each input (dynamic planning as in conversational response refinement (Jeong et al., 11 Nov 2025)).
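These guidelines can be collected into a single configuration object. The field names and default values below are illustrative choices that encode the cross-study recommendations, not the settings of any particular framework.

from dataclasses import dataclass


@dataclass
class CritiqueRevisionConfig:
    """Illustrative defaults encoding the cross-domain guidelines above."""
    n_agents: int = 3                 # three collaborators: the reported diversity/cost sweet spot
    max_rounds: int = 2               # gains saturate after one or two critique-revision cycles
    refine_mode: str = "rerank"       # prefer reranking finite candidates over open-ended regeneration
    aggregation: str = "meta_review"  # meta-reviewer with explicit justifications over raw majority vote
    diverse_critics: bool = True      # maximize diversity among critics for generation tasks
    dynamic_planning: bool = True     # engage only the agents whose contributions are justified per input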

6. Key Research Systems and Comparative Results

Framework | Domain(s) | Agent Roles / Protocol | Main Empirical Results
Peer Review Collaboration | Reasoning | Independent agents, peer feedback, confidence weighting | +7.9 pts over zero-shot CoT on GSM8K; optimal with 3–4 agents (Xu et al., 2023)
MARS | Reasoning | Author, multiple reviewers, meta-reviewer | Halves token/time overhead vs. debate at the same accuracy (Wang et al., 24 Sep 2025)
Table-Critic | Table QA | Judge, Critic, Refiner, Curator with template tree | +8.9% net error correction over baseline; low degradation (Yu et al., 17 Feb 2025)
MAMM-Refine | Summarization, QA | Multi-agent, multi-model, rerank loops | +2.5–5.3 pts on faithfulness metrics; outperforms single-agent critique-refine (Wan et al., 19 Mar 2025)
Re4 | Scientific computing | Consultant, Programmer, Reviewer | +20% execution success; +30–50% solving rate on hard PDEs (Cheng et al., 28 Aug 2025)
Perspectra | Research ideation | Forum with adversarial locutions, user control | Significantly higher proposal quality, clarity, and edit volume (Liu et al., 24 Sep 2025)

7. Theoretical and Practical Extensions

The general critique-and-revision paradigm abstracts to any coordination mechanism cast as a constraint on system behaviors, as formalized by Dell'Anna et al. for norm revision—relaxations (enlarging allowable runs) or strengthenings (reducing them) are dual operators on sets of behaviors, with critique triggered by empirical or contextual mismatch between outcomes and system objectives (Dell'Anna et al., 2018). This principle underpins both discrete norm alteration and statistical calibration of multi-agent outputs—agent outputs and sanctioning (or scoring) map to points in a system's allowed run space.
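In set terms (the notation below is a paraphrase for illustration, not Dell'Anna et al.'s exact formalism): if \(\mathcal{R}\) denotes the set of all possible system runs and \(N \subseteq \mathcal{R}\) the runs the current norm allows, the two revision operators are dual inclusions:

\[
N \;\subseteq\; \mathrm{relax}(N) \;\subseteq\; \mathcal{R},
\qquad\qquad
\mathrm{strengthen}(N) \;\subseteq\; N .
\]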

Key practical extensions include domain-agnostic judge–critic roles (e.g., Gold-LLM-as-a-Judge, Reward-LLM-as-a-Judge (Srinivas et al., 21 Sep 2024)), dynamic agent orchestration by planners (Jeong et al., 11 Nov 2025), MARS-style hierarchical collaboration for compute scaling, and critic–actor multi-agent RL architectures with explicit critique trajectories (Yang et al., 20 Mar 2025). The explicit use of structured, actionable critique units (e.g., Analytical Critique Units (Lan et al., 20 Oct 2024)) enables construction of high-fidelity SFT and RL datasets, further improving the critique abilities of both base and meta-agents.

The open challenges include robust calibration of agent confidence, theoretical estimates of error-reduction per critique–revision round, automated role adaptation and spawning, and increasing transparency or “explainability” of agent decision chains for human-in-the-loop systems.

