
Free-MAD: Consensus-Free Debate

Updated 9 February 2026
  • Free-MAD is a consensus-free, multi-agent debate protocol that aggregates agent responses using a global trajectory-aware scoring function to improve reasoning accuracy and token efficiency.
  • It eliminates forced consensus by incorporating anti-conformity mechanisms, reducing error propagation and ensuring robustness against adversarial influences.
  • Empirical evaluations show Free-MAD reduces token usage by about 50% while boosting factual accuracy and adapting across domains and languages.

Free-MAD Framework

The Free-MAD framework is a consensus-free, multi-agent debate protocol designed to enhance the reasoning accuracy, efficiency, and robustness of LLMs through agent interaction, while eliminating the weaknesses associated with traditional consensus-based MAD approaches. In Free-MAD, agent responses are aggregated using a global, trajectory-aware scoring function that operates across all debate rounds and agents, instead of relying on majority voting in the final round. This principled detachment from forced consensus, coupled with explicit anti-conformity mechanisms, yields improvements in token efficiency, robustness to adversarial influence, and fairness. Free-MAD is employed as a foundational paradigm in recent multi-agent long-form factuality assessment systems such as MAD-Fact (Ning et al., 27 Oct 2025), and it has been comprehensively analyzed in both theoretical and empirical contexts (Cui et al., 14 Sep 2025).

1. Design Principles and Motivation

The core motivation for Free-MAD arises from the limitations observed in previous MAD protocols:

  • Token inefficiency: Prior MAD approaches require multiple rounds of interaction (often $R \ge 2$), substantially increasing token consumption and system latency.
  • Conformity-induced error propagation: LLMs tend to overweight peer responses, leading early-correct agents to abandon correct answers under majority influence.
  • Randomness and unfairness in decisions: Choosing the final answer by majority vote in the last round disregards the evolution of agent reasoning, leaving the protocol vulnerable to tie-breaking artifacts and random drift.

Free-MAD is specifically engineered to:

  • Enable high reasoning accuracy with single-round debate ($R = 1$), minimizing token usage.
  • Evaluate agent outputs impartially, using a deterministic scoring mechanism that considers the full debate history.
  • Mitigate conformity bias through prompt-level anti-conformity and algorithmic downweighting of later-round opinion changes.
  • Provide robust resistance to Byzantine attacks by ensuring that the final selection is not unduly influenced by compromised agents.

2. Formal Debate Protocol and Mathematical Foundations

Let $A = \{a_i\}_{i=1}^N$ denote $N$ agents, with each agent $a_i$ generating a sequence of responses $T_i = (r_i^0, r_i^1, \ldots, r_i^R)$, where $r_i^k$ is the $k$-th round response. Debate proceeds as follows:

  1. Initialization ($k = 0$): For each $i$, $r_i^0 \sim P_{a_i}(r \mid x, p)$, with $x$ the input and $p$ the debate prompt.
  2. Debate ($k \ge 1$): Each agent observes peers' previous utterances $C^{(k-1)}$ and samples $r_i^k \sim P_{a_i}(r \mid C^{(k-1)}, p)$.

To quantify the influence of conformity, $P_{a_i}(r \mid C, p)$ is decomposed as

$$P_{a_i}(r \mid C, p) = \frac{1}{Z}\, P_{\text{in}}(r \mid x, p)\, \exp\bigl(\beta(p)\, S_{\text{con}}(r, C)\bigr),$$

where $P_{\text{in}}$ is the intrinsic policy, $S_{\text{con}}$ measures alignment with peers, and $\beta(p)$ regulates (anti-)conformity.
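The effect of the sign of $\beta$ can be checked numerically. The sketch below tilts an intrinsic answer distribution by the conformity term; the probabilities and alignment scores are made up for illustration, not values from the paper:

```python
import math

def conformity_tilted(p_in, s_con, beta):
    # p_in  : dict answer -> intrinsic probability P_in(r | x, p)
    # s_con : dict answer -> alignment score S_con(r, C) with peer context C
    # beta  : conformity strength; beta < 0 induces anti-conformity
    unnorm = {r: p_in[r] * math.exp(beta * s_con[r]) for r in p_in}
    z = sum(unnorm.values())  # partition constant Z
    return {r: v / z for r, v in unnorm.items()}

# Hypothetical two-answer debate state: peers mostly said "B".
p_in = {"A": 0.6, "B": 0.4}    # agent's intrinsic preference
s_con = {"A": 0.0, "B": 1.0}   # "B" aligns with the peer context

conformist = conformity_tilted(p_in, s_con, beta=1.0)   # beta > 0 pulls toward "B"
anti = conformity_tilted(p_in, s_con, beta=-1.0)        # beta < 0 resists the pull
```

With $\beta > 0$ the probability of the peer-aligned answer rises above its intrinsic 0.4; with $\beta < 0$ it falls below it, which is the behaviour Free-MAD's anti-conformity prompting aims to induce.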

The central innovation is the global scoring rule, which aggregates contributions over all agent-round pairs:

$$S(r) = \sum_{i=1}^N \sum_{k=0}^R c_i^k(r),$$

with the contributions $c_i^k(r)$ defined as

$$c_i^k(r) = \begin{cases} w_1 f(0) & k = 0,\ r_i^0 = r \\ w_4 f(k) & k \ge 1,\ r_i^k = r_i^{k-1} = r \\ w_3 f(k) & k \ge 1,\ r_i^k = r \neq r_i^{k-1} \\ -w_2 f(k) & k \ge 1,\ r_i^{k-1} = r \neq r_i^k \\ 0 & \text{otherwise}, \end{cases}$$

where $w_1, w_2, w_3, w_4$ are tunable weights and $f(k) = 1/(k+1)$ downweights later rounds.

The final answer is $r_{\text{final}} = \arg\max_r S(r)$, with random tie-breaking if needed.
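A minimal Python sketch of this selection rule, using the default weights $\mathcal{W} = \{20, 25, 30, 20\}$ reported for Free-MAD; the trajectories in the example are illustrative:

```python
import random
from collections import defaultdict

W1, W2, W3, W4 = 20, 25, 30, 20  # default weights W reported in the paper

def f(k):
    # Round discount f(k) = 1/(k+1): later rounds count less.
    return 1.0 / (k + 1)

def free_mad_select(trajectories, rng=random.Random(0)):
    # trajectories: per-agent answer sequences T_i = [r_i^0, ..., r_i^R]
    scores = defaultdict(float)
    for traj in trajectories:
        scores[traj[0]] += W1 * f(0)          # initial, independent answer
        for k in range(1, len(traj)):
            prev, cur = traj[k - 1], traj[k]
            if cur == prev:
                scores[cur] += W4 * f(k)      # answer retained across rounds
            else:
                scores[cur] += W3 * f(k)      # credit for switching to this answer
                scores[prev] -= W2 * f(k)     # penalty for abandoning it
    best = max(scores.values())
    return rng.choice([r for r, s in scores.items() if s == best])

# Three agents, one debate round: all start with "A", two flip to "B".
# A final-round majority vote would return "B"; the trajectory-aware
# score keeps "A" (45.0 vs 30.0).
answer = free_mad_select([["A", "B"], ["A", "B"], ["A", "A"]])  # -> "A"
```

The example shows the key difference from consensus-based MAD: late, possibly conformity-driven flips are discounted by $f(k)$ and penalized by $-w_2$, so the early, independently held answer can prevail over the final-round majority.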

3. Anti-Conformity and Robustness Mechanisms

Free-MAD explicitly counteracts LLM conformity:

  • Prompt-level anti-conformity ($\beta(p) < 0$): Chain-of-Thought prompts direct agents to independently justify their responses, enumerate flaws in others' arguments, and revise only with justification.
  • Algorithmic downweighting ($f(k)$): Early-round contributions, reflecting independent reasoning, are prioritized; later changes, which are more likely to be conformity-driven, are suppressed.
  • Scoring-based selection: No agent or answer receives structural privilege; all candidate outputs are assessed over the full interaction trace, avoiding lock-in to spurious consensus.

This architecture ensures that the protocol is resilient to communication attacks and Byzantine agents; selection remains robust provided only a strict minority of agents attempt malicious influence (Cui et al., 14 Sep 2025).

4. Instantiation in Factuality Evaluation: MAD-Fact

MAD-Fact extends Free-MAD to the domain of long-form factuality verification (Ning et al., 27 Oct 2025) using the following modules:

  • Clerk agent: Decomposes a generated long-form answer $a_i$ into atomic, fact-checkable claims $\{c_{i,1}, \ldots, c_{i,T}\}$.
  • Jury (multi-role evaluator agents): Each agent, acting as Challenger, Critic, News Author, Scientist, Psychologist, or Data Analyst, debates claims over $R$ rounds, employing direct, retrieval-based, or conditional strategies.
  • Judge agent: Aggregates final opinions via majority (or Free-MAD scoring), delivering binary verdicts per claim.
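The interaction among these modules can be sketched schematically; here `jury_fn` is a hypothetical stand-in for the role-prompted LLM evaluators (not the paper's API), and the Judge uses simple majority voting, which the Free-MAD scoring variant would replace:

```python
from collections import Counter

JURY_ROLES = ["Challenger", "Critic", "News Author",
              "Scientist", "Psychologist", "Data Analyst"]

def judge(opinions):
    # Judge agent: majority vote over the jury's final-round opinions.
    votes = Counter(opinions.values())
    return votes[True] > votes[False]

def mad_fact(claims, jury_fn, rounds=1):
    # claims:  atomic claims produced by the Clerk agent
    # jury_fn: (role, claim, round_idx) -> bool, a stub for an LLM evaluator
    verdicts = {}
    for claim in claims:
        opinions = {}
        for k in range(rounds):
            # Real jurors would also see peers' prior opinions here.
            opinions = {role: jury_fn(role, claim, k) for role in JURY_ROLES}
        verdicts[claim] = judge(opinions)
    return verdicts
```

Calling `mad_fact` with a list of claims and an evaluator function yields a binary verdict per claim, mirroring the Clerk → Jury → Judge flow described above.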

A formal fact-importance hierarchy is constructed: reference answers from $G$ domain-expert LLMs are decomposed into atomic claims, merged into a golden set $\{g_{i,1}, \ldots, g_{i,K_{\text{gold}}}\}$, and organized into a pyramid by (inverse) reference frequency. Each claim receives weight $\omega_{\ell(c)}$, with $\ell(c)$ the pyramid layer.

Weighted factuality metrics are then computed:

$$\mathrm{Prec}_w(a_i) = \frac{\sum_{j \in S} I(c_{i,j})}{\sum_{j=1}^{T} I(c_{i,j})},$$

$$R_w^{@\gamma}(a_i) = \min\!\Bigl(\frac{1}{\gamma} \cdot \frac{\sum_{j \in S} I(c_{i,j})}{\sum_{k=1}^{K_{\text{gold}}} I(g_{i,k})},\; 1\Bigr),$$

$$F_1^{@\gamma}(a_i) = \frac{2\, \mathrm{Prec}_w\, R_w^{@\gamma}}{\mathrm{Prec}_w + R_w^{@\gamma}},$$

where $S$ is the set of true positives.
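These metrics translate directly into code, assuming $I(\cdot)$ returns each claim's pyramid importance weight (an assumption; the exact definition of $I$ is not reproduced here):

```python
def weighted_factuality(claim_weights, true_pos, gold_weights, gamma=1.0):
    # claim_weights: importance weights I(c_{i,j}) for the T extracted claims
    # true_pos:      indices j in S, the claims judged factually correct
    # gold_weights:  importance weights I(g_{i,k}) of the golden-set claims
    # gamma:         recall normalization for the expected answer depth
    tp_mass = sum(claim_weights[j] for j in true_pos)
    prec = tp_mass / sum(claim_weights)
    rec = min(tp_mass / (gamma * sum(gold_weights)), 1.0)
    if prec + rec == 0.0:
        return 0.0, 0.0, 0.0
    f1 = 2 * prec * rec / (prec + rec)
    return prec, rec, f1

# Toy example: three claims with weights 3, 2, 1; the first two are correct.
prec, rec, f1 = weighted_factuality([3, 2, 1], {0, 1}, [3, 2, 1])
```

In this toy case the true-positive mass is 5 of 6, so precision, recall, and $F_1$ all equal $5/6$.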

5. Empirical Evaluation and Findings

Free-MAD and its derivatives have been evaluated across reasoning, mathematical, and factuality-oriented benchmarks:

  • Efficiency: Free-MAD, with $R = 1$, achieves or surpasses the accuracy of multi-round baselines while using roughly 50% fewer tokens (Cui et al., 14 Sep 2025).
  • Robustness: Under simulated communication attacks, standard baselines lose up to 20% accuracy; Free-MAD maintains performance within $\pm 2\%$ of its clean accuracy.
  • Factuality benchmarks: MAD-Fact (with GPT-4o-mini and Rule 1 debate) achieved the best $F_1$ scores in 8 of 10 label-category slots versus baseline fact-checkers, with a win rate of 80% (Ning et al., 27 Oct 2025).
  • Human alignment: Weighted $F_1^{@\gamma}$ correlates with human judgments at $r = 0.701$, $p = 0.036$.
  • Language/domain adaptability: On LongFact (English), GPT-4-Turbo leads ($F_1 \approx 0.569$); on LongHalluQA (Chinese), Chinese-developed models (QwQ-32B, $F_1 \approx 0.592$) outperform GPT-4-Turbo, indicating a substantial multilingual/domain gap.

6. Adaptation Across Domains and Practical Considerations

The Free-MAD architecture is agnostic to domain and language, provided that agent roles and reference hierarchies reflect the properties of the application:

  • Reference hierarchy: Domain-expert LLMs supply gold reference answers; the number of experts $G$ and the layer weights $\{\omega_m\}$ can be domain-specific and possibly learned.
  • Agent specialization: Custom prompts and agent roles (e.g., “Physician,” “Lawyer”) are readily substituted.
  • Retrieval integration: Domain corpora or library APIs can replace open-domain web search, and multilingual retrieval capacities enable adaptation to low-resource languages.
  • Metric hyperparameters: The recall normalization factor $\gamma$ can be optimized for the expected answer depth.

A plausible implication is that this protocol could be incorporated into any LLM-based evaluation infrastructure requiring robust, efficient, and fair aggregation of multi-agent outputs without the artifacts of forced consensus.

7. Limitations and Future Directions

Reported limitations include:

  • Fixed scoring weights: The default $\mathcal{W} = \{20, 25, 30, 20\}$ may not be optimal; further empirical tuning could enhance accuracy or robustness (Cui et al., 14 Sep 2025).
  • Agent homogeneity: Empirical evaluations have primarily used homogeneous agent pools; heterogeneous ensembles (across model size or architecture) may affect system dynamics.
  • Adversarial scope: While Free-MAD demonstrates robustness to communication denial, more sophisticated adversarial settings (e.g., prompt injection) require further analysis.

Future research directions include systematic search over scoring hyperparameters, dynamic agent pool selection, deeper theoretical analysis of security properties under indirect attacks, and broader empirical validation in settings with substantial agent/model diversity.

