
Multi-Agent Critique Aggregation

Updated 18 December 2025
  • Multi-agent critique aggregation is a formal approach that combines outputs from multiple autonomous agents, such as LLMs and specialist evaluators, to overcome individual biases and enhance overall decision quality.
  • It employs various methods—including weighted averaging, majority voting, and iterative debate—with mathematical guarantees to ensure robustness and improved convergence.
  • Applications span reinforcement learning, reward modeling, systematic reviews, and summarization, demonstrating measurable gains in accuracy, calibration, and interpretability.

Multi-agent critique aggregation encompasses a set of formal methods by which outputs, evaluations, or critiques generated by multiple autonomous agents—typically LLMs or specialist evaluators—are fused into a unified signal for downstream use. This paradigm has achieved broad impact in LLM alignment, structured evaluation, multi-step reasoning, reinforcement learning, and decision support, where single-agent evaluations are limited by bias, lack of coverage, or susceptibility to idiosyncratic failure. Recent literature explores both fixed heuristics (e.g., voting, averaging) and adaptive, deliberative, or debate-based aggregation routines, often with mathematical guarantees or empirical gains across diverse tasks (Lan et al., 20 Oct 2024, Yu et al., 17 Feb 2025, Wan et al., 19 Mar 2025, Hu et al., 14 Oct 2025, Chang et al., 6 Oct 2025, Yang et al., 20 Nov 2025, Mushtaq et al., 21 Sep 2025).

1. Foundational Principles and Theoretical Motivation

Multi-agent critique aggregation is motivated by the limitations of single-critic or single-judge systems (e.g., mode collapse, shared blind spots, or failure to capture conflicting desiderata). By exposing a base artifact (text, reasoning trace, policy rollout, research article) to a panel of agents with independent perspectives, the system collects a diversity of critiques, spanning coverage of errors and evaluative axes inaccessible to any single agent. Aggregation aims to improve correctness, robustness, and interpretability by harnessing the strengths of the group through explicit mechanisms for selection, weighting, and refinement.

In structured frameworks, aggregation is formalized via operators such as weighted mean, majority vote, mixture models, or iterative deliberative debate. Theoretical analyses prove correctness amplification, error cascading prevention, or convergence properties under task-specific assumptions (Yu et al., 17 Feb 2025, Hu et al., 14 Oct 2025, Chang et al., 6 Oct 2025).

2. Architectures and Workflow Patterns

Most modern systems instantiate multi-agent critique aggregation via specialist agents orchestrated in pipelines with explicit data and control flow.

Typical workflows involve: (1) decomposition of input; (2) assignment of sub-tasks or artifacts to multiple agents; (3) collection and structuring of critiques; (4) aggregation via deterministic or learnable schema; (5) derivation of the final metric, artifact, or decision.
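
The following minimal sketch illustrates this generic five-step workflow; the `Critique` record and the `decompose`, per-agent critique, and `aggregate` callables are illustrative placeholders rather than components of any cited framework.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Critique:
    agent_id: str
    target: str      # sub-task or artifact being critiqued
    content: str     # free-text critique
    score: float     # scalar quality judgment

def run_pipeline(artifact: str,
                 decompose: Callable[[str], List[str]],
                 agents: List[Callable[[str], Critique]],
                 aggregate: Callable[[List[Critique]], float]) -> float:
    """Steps (1)-(5): decompose the input, farm sub-tasks out to agents,
    collect structured critiques, and fuse them into a final signal."""
    critiques: List[Critique] = []
    for sub_task in decompose(artifact):       # (1) decomposition of input
        for agent in agents:                   # (2) assignment to multiple agents
            critiques.append(agent(sub_task))  # (3) collection of critiques
    return aggregate(critiques)                # (4)-(5) aggregation and final decision
```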

3. Aggregation Mechanisms and Mathematical Formulations

A variety of aggregation schemes are employed, depending on modality and task:

Averaging and Weighted Sums:

  • Scores $s_i$ from agent $i$ are aggregated as $Q = \sum_{i=1}^{M} w_i s_i$, with the weights $w_i$ learned or uniform (Mushtaq et al., 21 Sep 2025, Yang et al., 20 Nov 2025).
  • Agreement and penalty terms (e.g., for consensus or repetition) can be introduced: $R_t = \sum_{i=1}^{N} w_i r_t^i + \lambda_{\mathrm{agree}} A_t - \lambda_{\mathrm{rep}} P_t$ (Yang et al., 20 Nov 2025).
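
A minimal numeric sketch of these two aggregators follows; the uniform default weights, the variance-based agreement term $A_t$, and the repetition count used for $P_t$ are assumptions made for illustration, not the formulations of the cited papers.

```python
import numpy as np

def weighted_score(scores, weights=None):
    """Q = sum_i w_i * s_i, with uniform weights when none are supplied."""
    scores = np.asarray(scores, dtype=float)
    if weights is None:
        weights = np.full(len(scores), 1.0 / len(scores))
    return float(np.dot(weights, scores))

def shaped_reward(agent_rewards, weights, lam_agree, lam_rep, repetition_count):
    """R_t = sum_i w_i * r_t^i + lambda_agree * A_t - lambda_rep * P_t.
    A_t is approximated by the negative variance of the agent rewards
    (more agreement -> larger A_t) and P_t by a repetition count."""
    r = np.asarray(agent_rewards, dtype=float)
    base = float(np.dot(np.asarray(weights, dtype=float), r))
    agreement = -float(np.var(r))
    return base + lam_agree * agreement - lam_rep * repetition_count
```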

Majority/Plurality Voting:

  • Each agent casts a discrete verdict (e.g., a categorical label or an accept/reject decision); the candidate or label receiving the most votes is selected as the consensus, with a plurality sufficing when more than two options are in play.

Rerank Voting:

  • Agents choose among a fixed set of candidate critiques or revisions; the consensus is the candidate with maximal votes or best mean score (Wan et al., 19 Mar 2025).
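
Both voting schemes reduce to a few lines, as sketched below; the tie-breaking rule and the use of mean scores for reranking are assumptions of this sketch.

```python
from collections import Counter
import numpy as np

def majority_vote(labels):
    """Plurality vote over discrete agent verdicts; ties break toward the
    label encountered first."""
    return Counter(labels).most_common(1)[0][0]

def rerank_vote(candidates, agent_scores):
    """Each row of agent_scores holds one agent's scores over the fixed
    candidate set; the consensus is the candidate with the best mean score."""
    mean_scores = np.mean(np.asarray(agent_scores, dtype=float), axis=0)
    return candidates[int(np.argmax(mean_scores))]

# Example: three agents rerank two candidate revisions.
best = rerank_vote(["rev_A", "rev_B"], [[0.7, 0.4], [0.6, 0.9], [0.8, 0.5]])
```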

Iterative Debate and Bayesian Update:

  • Agents update beliefs over latent concepts, with each round reducing group uncertainty. Judge consensus is modeled as evolving via mixture models (Beta–Binomial), with adaptive stopping triggered by plateau tests (Kolmogorov–Smirnov distance between posterior distributions) (Hu et al., 14 Oct 2025, Chang et al., 6 Oct 2025).
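
A simplified sketch of this stopping rule is shown below. It tracks a single Beta posterior over the probability that judges favor a candidate answer and halts when the sup-norm gap between successive posterior CDFs (a Kolmogorov–Smirnov-style distance) plateaus; the full Beta–Binomial mixture of the cited work is collapsed to one component for illustration.

```python
import numpy as np
from scipy.stats import beta

def debate_with_plateau_stop(vote_rounds, eps=0.02, max_rounds=10):
    """vote_rounds: list of (votes_for, votes_against) tuples, one per debate round.
    Returns the number of rounds used and the posterior mean agreement rate."""
    grid = np.linspace(0.0, 1.0, 1001)
    a, b = 1.0, 1.0                              # uniform Beta(1, 1) prior
    prev_cdf = beta.cdf(grid, a, b)
    rounds_used = 0
    for votes_for, votes_against in vote_rounds[:max_rounds]:
        a, b = a + votes_for, b + votes_against  # conjugate Beta-Binomial update
        rounds_used += 1
        cdf = beta.cdf(grid, a, b)
        if np.max(np.abs(cdf - prev_cdf)) < eps: # KS-style plateau test
            break                                # group uncertainty has stabilized
        prev_cdf = cdf
    return rounds_used, a / (a + b)

# Example: judge panel votes across four rounds.
rounds_used, consensus = debate_with_plateau_stop([(4, 1), (4, 1), (5, 0), (5, 0)])
```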

Template and Knowledge Tree Accumulation:

  • Critiques are accumulated and distilled into a template tree, guiding future agents' reasoning and enabling systematic pattern aggregation (Yu et al., 17 Feb 2025).
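
A toy version of such a template tree might look as follows; the (error type, sub-type) path schema and the retrieval rule are illustrative assumptions rather than Table-Critic's exact design.

```python
from collections import defaultdict

class TemplateTree:
    """Accumulates distilled critique templates along hierarchical error paths,
    so that later agents can retrieve patterns relevant to the error at hand."""
    def __init__(self):
        self.children = defaultdict(TemplateTree)
        self.templates = []

    def insert(self, path, template):
        node = self
        for key in path:                 # e.g. ("aggregation_error", "wrong_column")
            node = node.children[key]
        node.templates.append(template)

    def retrieve(self, path):
        """Collect templates stored at every node along the path."""
        node, collected = self, list(self.templates)
        for key in path:
            if key not in node.children:
                break
            node = node.children[key]
            collected.extend(node.templates)
        return collected

# Usage: distill a critique into the tree, then guide a future agent with it.
tree = TemplateTree()
tree.insert(("aggregation_error", "wrong_column"),
            "Check that the summed column matches the question.")
hints = tree.retrieve(("aggregation_error", "wrong_column"))
```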

Human-in-the-loop or Meta-LLM Filtering:

  • A strong judge LLM evaluates all candidate critiques, labels their quality, and selects or merges only high-quality units (Analytical Critique Units, ACUs) (Lan et al., 20 Oct 2024).
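
A minimal filtering sketch, assuming a placeholder `judge` callable that wraps the meta-LLM and returns a scalar quality label in [0, 1]; the threshold and merge format are likewise illustrative.

```python
from typing import Callable, List

def filter_acus(acus: List[str], judge: Callable[[str], float],
                threshold: float = 0.9) -> List[str]:
    """Keep only Analytical Critique Units the stronger judge rates above threshold."""
    return [acu for acu in acus if judge(acu) >= threshold]

def merge_acus(acus: List[str]) -> str:
    """Merge the surviving units into a single aggregated critique."""
    return "\n".join(f"- {acu}" for acu in acus)
```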

4. Applications and Empirical Outcomes

Reinforcement Learning and Reward Modeling

Multi-agent aggregation is central to collaborative reward modeling, where each evaluator agent targets a distinct property (e.g., factuality, helpfulness), and a centralized aggregator fuses their partial rewards, sometimes with explicit encouragement of agreement. Empirical results show that multi-agent setups outperform single scalar reward models in accuracy, variance reduction, and interpretability across tasks such as GSM8K (Yang et al., 20 Nov 2025).
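
A schematic of such a collaborative reward model is sketched below; the property-specific evaluators are arbitrary callables, and the negative-variance agreement bonus is an illustrative stand-in for the agreement term used in the cited work.

```python
import numpy as np

class CollaborativeRewardModel:
    """Fuses property-specific rewards (factuality, helpfulness, ...) into one
    scalar while returning the per-property breakdown for interpretability."""
    def __init__(self, evaluators, weights=None, lam_agree=0.1):
        self.evaluators = evaluators    # dict: property -> callable(prompt, response)
        n = len(evaluators)
        self.weights = weights or {k: 1.0 / n for k in evaluators}
        self.lam_agree = lam_agree

    def reward(self, prompt, response):
        scores = {k: fn(prompt, response) for k, fn in self.evaluators.items()}
        base = sum(self.weights[k] * s for k, s in scores.items())
        agreement = -float(np.var(list(scores.values())))  # bonus when evaluators agree
        return base + self.lam_agree * agreement, scores
```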

LLM Critique and Feedback Generation

MultiCritique demonstrates that fine-tuning on critiques aggregated from multiple LLMs—after filtering for perfect ACUs—substantially increases critique quality and performance relative to single-agent SFT. Quantitatively, models trained with MultiCritique data show +0.6–1.2 points improvement in text quality metrics over single-agent baselines and higher F1 on zero-shot critique detection (Lan et al., 20 Oct 2024).

Reasoning and Data Processing Pipelines

Frameworks such as Table-Critic orchestrate specialized agents (Judge, Critic, Refiner, Curator) in a loop, iteratively refining reasoning chains based on collaborative critique aggregation, with proven reductions in error propagation and improved convergence rates (Yu et al., 17 Feb 2025).

Summarization, QA, and Long-Form Generation

MAMM-Refine applies multi-agent and multi-model aggregation in the detection, critique, and refinement steps, consistently improving faithfulness on summarization and QA benchmarks, with the largest gains observed for diverse, rerank-based agent pools (Wan et al., 19 Mar 2025).

Debate, Deliberation, and Adaptive Control

Flexible controllers, such as MACI, employ dials over evidence quality and contentiousness, gating both admissible contributions and adversarial challenge schedules, resulting in improved accuracy, calibration, and sample efficiency in tasks where deliberation and consensus building are required (Chang et al., 6 Oct 2025). Debate-based frameworks with formalized stopping criteria provably amplify correctness while controlling cost and compute (Hu et al., 14 Oct 2025).

5. Evaluation, Guarantees, and Limitations

Multi-agent critique aggregation has been evaluated via:

  • Agreement rates, e.g., 84% agreement with PRISMA-aligned human annotation in systematic-review (SLR) evaluation (Mushtaq et al., 21 Sep 2025).
  • End-to-end accuracy and calibration curves, e.g., MACI reduces expected calibration error by 20–30% alongside gains of 3–6 points in absolute accuracy (Chang et al., 6 Oct 2025); a minimal ECE computation sketch follows this list.
  • Convergence analysis, such as Table-Critic's 3–6 iteration convergence in table reasoning (Yu et al., 17 Feb 2025).
  • Ablations revealing that agent diversity, appropriately designed aggregation (e.g., rerank over open-ended generation), and careful filtering are all necessary for maximal performance (see Table 2 in (Lan et al., 20 Oct 2024) for critique source ablation; Table 3 for preference filtering impact).
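
As a concrete instance of the calibration criterion above, the standard binned expected calibration error can be computed as follows; the 10-bin discretization is a common default, not a choice prescribed by the cited papers.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: occupancy-weighted average |accuracy - mean confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1], right=True), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return float(ece)
```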

Theoretical guarantees offered include monotonic reduction in dispersion (KL divergence among agent predictions), termination proofs under bounded improvements, and regret bounds for adjustable schedule controllers (Chang et al., 6 Oct 2025, Hu et al., 14 Oct 2025). Adaptive mechanisms for majority/debate aggregation provably reduce cost while maintaining or amplifying correctness (Hu et al., 14 Oct 2025).
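
The dispersion quantity can be monitored directly; the sketch below assumes each agent emits a categorical predictive distribution per round, and mean pairwise KL divergence is used as one of several reasonable dispersion measures.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for categorical distributions, with smoothing for zero entries."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def dispersion(agent_dists):
    """Mean pairwise KL divergence among the agents' predictive distributions."""
    n = len(agent_dists)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    return sum(kl_divergence(agent_dists[i], agent_dists[j]) for i, j in pairs) / len(pairs)

def is_monotone_decreasing(rounds):
    """Check the guarantee empirically: dispersion should not increase across rounds."""
    values = [dispersion(r) for r in rounds]
    return all(later <= earlier for earlier, later in zip(values, values[1:]))
```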

Limitations include potential performance degradation when agents are low-quality, the need for careful calibration of aggregation weights, and the computational burden introduced by multi-turn debates and multi-agent inference. Empirical results indicate diminishing returns in judgment accuracy gains for ensembles larger than 7 agents (Hu et al., 14 Oct 2025) and highlight the necessity of agent diversity to avoid correlated bias.

6. Significance and Outlook

Multi-agent critique aggregation provides a general blueprint for ensemble-based evaluation, robust decision-making, and interpretability in AI systems, and it underpins the advances in alignment, structured evaluation, multi-step reasoning, and decision support surveyed above.

A plausible implication is that as LLMs scale and are deployed to gate critical or safety-sensitive domains, multi-agent critique aggregation will become a default design for mediating between conflicting evaluative axes, adversarial example defense, or synthesizing structured feedback efficiently and transparently.

7. Summary Table of Representative Multi-Agent Critique Aggregation Frameworks

| Framework | Aggregation Scheme | Application Domain |
|---|---|---|
| MultiCritique (Lan et al., 20 Oct 2024) | Meta-LLM filtering & merge of ACUs | LLM critique ability, RL/SFT |
| CRM (Yang et al., 20 Nov 2025) | Weighted sum + agreement & penalties | RLHF multi-objective reward |
| Table-Critic (Yu et al., 17 Feb 2025) | Judge→Critic→Refiner→Curator loop | Table reasoning (step correction) |
| MAMM-Refine (Wan et al., 19 Mar 2025) | Rerank voting at subtask layers | Summarization/QA faithfulness |
| Debate/Stability (Hu et al., 14 Oct 2025) | Iterative debate, Beta–Binomial mixture, KS stop | LLM judge ensembles |
| MACI (Chang et al., 6 Oct 2025) | Dial-gated debate + reliability, plateau halt | Medical diagnosis / news bias |
| SLR Copilot (Mushtaq et al., 21 Sep 2025) | Weighted/majority vote on checklists | Systematic review (PRISMA) |

These frameworks illustrate the diversity of aggregation mechanisms and their alignment with the unique structure and requirements of multi-agent systems across machine learning subfields.
