Canonical Role-Based MARS Framework

Updated 3 March 2026

Canonical role-based MARS is a structured multi-agent review system that assigns distinct roles (author, reviewer, meta-reviewer) to efficiently generate, evaluate, and revise solutions.
It achieves linear communication scaling by restricting interactions to a star topology, reducing token usage and inference time compared to traditional round-table debates.
Empirical benchmarks show that MARS maintains accuracy comparable to round-table debate while roughly halving token usage and inference time, making it a scalable approach for complex reasoning tasks.

Canonical role-based MARS (Multi-Agent Review System) is a framework for collaborative reasoning among LLMs, modeled on the academic peer-review process. Unlike prior approaches such as Multi-Agent Debate (MAD), which use round-table agent interactions whose communication cost scales quadratically with the number of agents, MARS enforces role separation (author, reviewers, meta-reviewer) and achieves linear communication cost while maintaining comparable accuracy. The system formalizes agent responsibilities and interaction patterns to efficiently elicit and integrate diverse model judgments for complex reasoning tasks (Wang et al., 24 Sep 2025).

1. Formalization of Agent Roles and System Structure

MARS instantiates three canonical agent types for each problem instance:

Author Agent ( $\mathcal{A}$ ): Receives input query $Q$ , generates a full solution $S_0 = (t, y)$ , where $t = (t^1, t^2, ..., t^K)$ is the chain-of-thought trace and $y$ is the answer. $S_0 = f_{\text{author}}(Q)$ ; concretely, $(t, y) = \mathcal{A}(Q)$ .
$m$ Reviewer Agents ( $\mathcal{R}_1, ..., \mathcal{R}_m$ ): Each reviewer sees the query $Q$ and the author's solution $S_0$ , but never the outputs of the other reviewers, and independently produces for each $j$ :
- Decision $D_j \in \{\text{accept}, \text{reject}\}$ ;
- Confidence score $c_j \in [1,5]$ ;
- Justification $C_j$ .
- Output $r_j = (D_j, c_j, C_j) = \mathcal{R}_j(Q, t, y)$ . Functionally $D_j = f_{\text{review},j}(S_0)$ , $C_j = g_{\text{review},j}(S_0)$ .
Meta-Reviewer Agent ( $\mathcal{M}$ ): Aggregates $S_0$ and $\{r_j\}_{j=1}^{m}$ , and issues:
- Meta-decision $D_{\text{meta}} \in \{\text{accept}, \text{reject}\}$ ,
- Rationale $J_{\text{meta}}$ ,
- If rejected: actionable feedback $F_{\text{meta}}$ .
- Formally: $m_{\text{out}} = \mathcal{M}(Q, t, y, r_1, ..., r_m)$ . Integration rule: $S_1 = h_{\text{meta}}(S_0, \{D_j, C_j\}_{j=1..m})$ .

The system halts if $D_{\text{meta}} = \text{accept}$ , or after a maximum number of rounds $T$ (by default, $T=1$ in canonical MARS).
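The three message types above can be written down as plain data containers. The following is a minimal sketch, not the authors' implementation: the class and field names are hypothetical, and treating the confidence score as any number in $[1,5]$ (rather than strictly an integer) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

DECISIONS = ("accept", "reject")


@dataclass(frozen=True)
class Solution:
    """Author output S_0 = (t, y). Names are illustrative only."""
    trace: List[str]  # chain-of-thought steps t^1, ..., t^K
    answer: str       # final answer y


@dataclass(frozen=True)
class Review:
    """Reviewer output r_j = (D_j, c_j, C_j)."""
    decision: str       # D_j in {accept, reject}
    confidence: float   # c_j in [1, 5] (integer-only scale is not assumed)
    justification: str  # C_j

    def __post_init__(self) -> None:
        if self.decision not in DECISIONS:
            raise ValueError(f"decision must be one of {DECISIONS}")
        if not 1 <= self.confidence <= 5:
            raise ValueError("confidence must lie in [1, 5]")


@dataclass(frozen=True)
class MetaOutput:
    """Meta-reviewer output (D_meta, J_meta, F_meta)."""
    decision: str                   # D_meta
    rationale: str                  # J_meta
    feedback: Optional[str] = None  # F_meta, only needed on reject

    def __post_init__(self) -> None:
        if self.decision not in DECISIONS:
            raise ValueError(f"decision must be one of {DECISIONS}")
        if self.decision == "reject" and not self.feedback:
            raise ValueError("a rejection must carry actionable feedback")
```

Validating at construction time keeps malformed reviews (for example an out-of-range confidence) from reaching the meta-reviewer.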

2. Algorithmic Workflow and Data Flow

Canonical MARS operates as a four-stage sequential protocol:

Author agent $\mathcal{A}$ produces the initial solution $(t, y) = \mathcal{A}(Q)$ .
Each reviewer $\mathcal{R}_j$ independently evaluates $S_0$ , issuing $r_j = (D_j, c_j, C_j)$ .
Meta-reviewer $\mathcal{M}$ receives all $r_j$ and $S_0$ , emits $m_{\text{out}} = (D_{\text{meta}}, J_{\text{meta}}, F_{\text{meta}})$ .
If $D_{\text{meta}} = \text{accept}$ , the process stops and outputs $y$ . If rejected, the author revises using $F_{\text{meta}}$ : $y^* = \mathcal{A}_{\text{revise}}(t, y, F_{\text{meta}})$ .

Pseudocode summary:

Input: Q, 𝒜, {ℛ₁,…,ℛₘ}, 𝒨
Output: y*

(t, y) = 𝒜.generate_solution(Q)
for j in 1…m:
    r[j] = ℛⱼ.evaluate(Q, t, y)
m_out = 𝒨.integrate_and_decide(Q, t, y, {r[1],…,r[m]})
if m_out.D_meta == accept:
    y* = y
else:
    y* = 𝒜.revise(t, y, m_out.F_meta)
return y*

Information flow is strictly from author to reviewers, from reviewers to meta-reviewer, and meta-reviewer back to author if revision is needed.
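The pseudocode can be turned into a runnable single-round sketch. Here the agents are passed in as plain callables; the toy agents at the bottom are stand-ins for LLM calls and are entirely hypothetical, including the "reject if any reviewer rejects" rule used by the toy meta-reviewer, which is not the aggregation rule of the paper.

```python
from typing import Callable, List, Sequence, Tuple

Solution = Tuple[List[str], str]  # (t, y)
Review = Tuple[str, int, str]     # (D_j, c_j, C_j)
MetaOut = Tuple[str, str, str]    # (D_meta, J_meta, F_meta)


def run_mars(
    question: str,
    author: Callable[[str], Solution],
    reviewers: Sequence[Callable[[str, List[str], str], Review]],
    meta_reviewer: Callable[[str, List[str], str, List[Review]], MetaOut],
    revise: Callable[[List[str], str, str], str],
) -> str:
    """One round of author -> reviewers -> meta-reviewer -> optional revision."""
    trace, answer = author(question)
    # Each reviewer receives only (Q, t, y), never another reviewer's output.
    reviews = [review(question, trace, answer) for review in reviewers]
    d_meta, _rationale, feedback = meta_reviewer(question, trace, answer, reviews)
    if d_meta == "accept":
        return answer
    return revise(trace, answer, feedback)


# --- Toy stand-ins for LLM calls (hypothetical, for illustration only) ---
def toy_author(question: str) -> Solution:
    return (["2 + 2 = 5"], "5")  # deliberately wrong


def toy_reviewer(question: str, trace: List[str], answer: str) -> Review:
    ok = answer == "4"
    return ("accept" if ok else "reject", 5, "checked the arithmetic")


def toy_meta(question, trace, answer, reviews) -> MetaOut:
    # Assumed rule: reject as soon as any reviewer rejects.
    if any(decision == "reject" for decision, _c, _j in reviews):
        return ("reject", "a reviewer found an error", "recompute 2 + 2")
    return ("accept", "all reviewers accepted", "")


def toy_revise(trace: List[str], answer: str, feedback: str) -> str:
    return "4"


print(run_mars("What is 2 + 2?", toy_author,
               [toy_reviewer, toy_reviewer], toy_meta, toy_revise))
```

Passing agents as callables keeps the control flow independent of any particular model API, so the same function can be exercised with stubs or with real model calls.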

3. Mathematical Formulation and Communication Complexity

Key equations defining the system:

Author generation: $S_0 = f_{\text{author}}(Q)$ , $(t, y) = \mathcal{A}(Q)$
Reviewer output: $r_i = \mathcal{R}_i(Q, t, y) = (D_i, c_i, C_i)$
Meta-review: $S_{k+1} = h_{\text{meta}}(S_k, \{D_i, C_i\}_{i=1..m})$ , where $k$ indexes the review round (not to be confused with the trace $t$ ), and $m_{\text{out}} = \mathcal{M}(Q, t, y, r_1 \oplus ... \oplus r_m)$
Rebuttal/revision (if needed): $y^* = \mathcal{A}_{\text{revise}}(t, y, m_{\text{out}}.F_{\text{meta}})$

Token and time complexity:

For round-table MAD (multi-agent debate with $m$ agents): $\mathrm{Tokens}_{\text{MAD}} \approx m \cdot L_{\text{author}} + m \cdot (m-1) \cdot L_{\text{msg}} \implies O(m^2)$
For MARS: $\mathrm{Tokens}_{\text{MARS}} \approx L_{\text{author}} + m \cdot L_{\text{rev}} + L_{\text{meta}} + L_{\text{reb}} \implies O(m)$
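The two estimates can be evaluated directly to see the growth rates. This is a sketch of the formulas above only; the message lengths are arbitrary placeholders chosen for illustration, not measurements, so the absolute numbers carry no meaning and the comparison at small $m$ depends on the lengths chosen.

```python
def tokens_mad(m: int, l_author: int, l_msg: int) -> int:
    """m * L_author + m * (m - 1) * L_msg, quadratic in m."""
    return m * l_author + m * (m - 1) * l_msg


def tokens_mars(m: int, l_author: int, l_rev: int, l_meta: int, l_reb: int) -> int:
    """L_author + m * L_rev + L_meta + L_reb, linear in m."""
    return l_author + m * l_rev + l_meta + l_reb


# Placeholder lengths (hypothetical): a full solution is assumed to be longer
# than a single review or meta-review.
L_AUTHOR, L_MSG, L_REV, L_META, L_REB = 600, 600, 200, 200, 300

for m in (2, 4, 8, 16):
    mad = tokens_mad(m, L_AUTHOR, L_MSG)
    mars = tokens_mars(m, L_AUTHOR, L_REV, L_META, L_REB)
    print(f"m={m:2d}  MAD~{mad:6d}  MARS~{mars:5d}")
```

Doubling $m$ adds a fixed $m \cdot L_{\text{rev}}$ increment to the MARS estimate, whereas the cross-message term of the MAD estimate roughly quadruples.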

Empirically, for $m=2$ reviewers:

$\mathrm{Tokens}_{\text{MAD}} \approx 5042$ (ChatGPT, GPQA)
$\mathrm{Tokens}_{\text{MARS}} \approx 2479$
This corresponds to a reduction of roughly $50\%$ in token usage, with a similar reduction in wall-clock inference time, when $m$ is modest.

4. Communication Pattern: Elimination of Reviewer-to-Reviewer Dependence

A core property of canonical MARS is that each reviewer only sees the author’s output $S_0$ , never the views of other reviewers. The meta-reviewer serves as the sole aggregation point. This "star" topology replaces the complete $m$ -node graph of message-passing in MAD with a strict pipeline: author $\to$ reviewers $\to$ meta-reviewer $\to$ (author if revision). The implication is linear scaling; in contrast, MAD’s message passing is quadratic in the number of agents due to repeated cross-review. Empirical measurements with $m=2$ or $3$ confirm a reliable $\approx 50\%$ saving in resource consumption for MARS compared to MAD.
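The difference between the two communication graphs can be made concrete by enumerating their directed edges. The sketch below assumes one message per directed edge per round and counts the author's solution reaching the meta-reviewer as its own edge; both are counting conventions adopted here for illustration, and the node names are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple

Edge = Tuple[str, str]  # (sender, receiver)


def mad_edges(m: int) -> List[Edge]:
    """Complete directed graph: every agent messages every other agent."""
    agents = [f"agent{i}" for i in range(1, m + 1)]
    return list(permutations(agents, 2))  # m * (m - 1) edges


def mars_edges(m: int) -> List[Edge]:
    """Pipeline: author -> reviewers -> meta-reviewer -> author."""
    reviewers = [f"reviewer{j}" for j in range(1, m + 1)]
    edges = [("author", r) for r in reviewers]  # S_0 to each reviewer
    edges += [(r, "meta") for r in reviewers]   # r_j to the meta-reviewer
    edges += [("author", "meta"), ("meta", "author")]  # S_0 in, feedback out
    return edges  # 2 * m + 2 edges


for m in (2, 3, 5, 10):
    print(m, len(mad_edges(m)), len(mars_edges(m)))

# No edge in the MARS graph connects two reviewers.
assert not any(a.startswith("reviewer") and b.startswith("reviewer")
               for a, b in mars_edges(10))
```

Under this convention the MARS edge count grows by two per added reviewer, while the MAD edge count grows with $m(m-1)$.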

5. Empirical Benchmark Performance

Experiments were conducted with GPT-3.5-turbo and Mixtral-8×22B as model backbones, $m=2$ reviewers, and one round each for review and rebuttal. Benchmarks included GPQA, MMLU, and GSM8K. Key results per Table 1 of (Wang et al., 24 Sep 2025):

GPQA (ChatGPT): MAD: 31.00% accuracy, 5042 tokens, 11.92 s; MARS: 36.33% accuracy, 2479 tokens, 6.01 s.
MMLU (ChatGPT): MAD: 71.33% accuracy, 3194 tokens, 7.64 s; MARS: 71.00% accuracy, 1702 tokens, 4.71 s.
GSM8K (ChatGPT): MAD: 79.00% accuracy, 2906 tokens, 7.92 s; MARS: 75.67% accuracy, 1655 tokens, 4.32 s.
Mixtral-8×22B results exhibit qualitatively similar scaling, with tokens and inference time approximately halved.

Statistical analysis over 1000 samples per benchmark shows no significant difference in accuracy ( $p>0.1$ ) but highly significant improvements in efficiency ( $p<0.001$ ).

| Backbone | Benchmark | MAD Accuracy | MARS Accuracy | Token Saving | Time Saving |
|---|---|---|---|---|---|
| GPT-3.5-turbo | GPQA | 31.00% | 36.33% | 50% | ~50% |
| GPT-3.5-turbo | MMLU | 71.33% | 71.00% | 47% | ~38% |
| GPT-3.5-turbo | GSM8K | 79.00% | 75.67% | 43% | ~45% |
| Mixtral-8×22B | GPQA | 47.00% | 44.00% | 56% | ~55% |

The reduction in token and compute usage is attributable directly to the elimination of reviewer-to-reviewer communications.
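The savings columns follow from the per-benchmark token counts and latencies reported above for the GPT-3.5-turbo (ChatGPT) backbone. As a quick check, the following sketch recomputes them as $1 - \text{MARS}/\text{MAD}$; the dictionary layout and function name are illustrative choices, and only the numbers themselves come from the text.

```python
# (tokens, seconds) per method, copied from the results listed above.
results = {
    "GPQA":  {"MAD": (5042, 11.92), "MARS": (2479, 6.01)},
    "MMLU":  {"MAD": (3194, 7.64),  "MARS": (1702, 4.71)},
    "GSM8K": {"MAD": (2906, 7.92),  "MARS": (1655, 4.32)},
}


def saving(baseline: float, new: float) -> float:
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    return 100.0 * (1.0 - new / baseline)


for name, row in results.items():
    mad_tokens, mad_secs = row["MAD"]
    mars_tokens, mars_secs = row["MARS"]
    print(f"{name}: token saving {saving(mad_tokens, mars_tokens):.1f}%, "
          f"time saving {saving(mad_secs, mars_secs):.1f}%")
```

The recomputed values agree with the table to within about one percentage point.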

6. Experimental and Implementation Details

The canonical configuration:

LLMs: GPT-3.5-turbo, Mixtral-8×22B (via NVIDIA NIM).
Agents: $m=2$ reviewers, 1 meta-reviewer, all using the same model backbone.
Prompts: Templates for author, reviewer, meta-reviewer, and rebuttal, as detailed in Appendix C.
LLM parameters: temperature=0.7, max_tokens=2048.
Hardware: NVIDIA A100 GPUs.
Protocol: One review and one rebuttal round; process stops when $D_{\text{meta}} = \text{accept}$ or after $T$ rounds.
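The canonical configuration can be captured in a small settings object. This is a sketch with hypothetical field names; the backbone string is a placeholder identifier, and the call-count formula assumes that each round consists of $m$ reviews, one meta-review, and at most one revision, which is an interpretation of the protocol rather than something the source states for $T > 1$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarsConfig:
    """Canonical settings as listed above; field names are illustrative."""
    backbone: str = "gpt-3.5-turbo"  # placeholder model identifier
    num_reviewers: int = 2           # m
    temperature: float = 0.7
    max_tokens: int = 2048
    max_rounds: int = 1              # T

    def max_llm_calls(self) -> int:
        """Upper bound on model calls per problem instance.

        Assumes one initial author call, then per round: m reviews,
        one meta-review, and at most one revision.
        """
        return 1 + self.max_rounds * (self.num_reviewers + 2)


cfg = MarsConfig()
print(cfg, cfg.max_llm_calls())
```

With the canonical values ($m=2$, $T=1$) this bound is five calls: one author, two reviewers, one meta-reviewer, and one optional revision.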

The architecture can be summarized textually:

Stage 1 (Author): $Q \to$ CoT reasoning $\to S_0$
Stage 2 (Review): $S_0 \to$ reviewers $[\mathcal{R}_1,..,\mathcal{R}_m]$ (evaluated in parallel) $\to \{r_i\}$
Stage 3 (Meta): $\{r_i\}, S_0 \to \mathcal{M} \to D_{\text{meta}}, J_{\text{meta}}, F_{\text{meta}}$
Stage 4 (Rebuttal): If $D_{\text{meta}} = \text{reject}$ , $S_0$ revised per $F_{\text{meta}}$ , produce final $y^*$
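Because reviewers never read one another's output, Stage 2 has no ordering constraints and the reviewer calls can be issued concurrently. One way to do this, sketched here with Python's standard `concurrent.futures` module, is a thread pool; the reviewer callables and their signature are hypothetical stand-ins for network-bound LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Review = Tuple[str, int, str]  # (D_j, c_j, C_j)
Reviewer = Callable[[str, List[str], str], Review]


def review_in_parallel(
    reviewers: Sequence[Reviewer],
    question: str,
    trace: List[str],
    answer: str,
) -> List[Review]:
    """Run all reviewers concurrently; results keep the reviewers' order."""
    with ThreadPoolExecutor(max_workers=max(1, len(reviewers))) as pool:
        futures = [pool.submit(r, question, trace, answer) for r in reviewers]
        # Collect in submission order so r[j] still belongs to reviewer j.
        return [f.result() for f in futures]


# Toy reviewers standing in for LLM calls (illustration only).
def strict(question: str, trace: List[str], answer: str) -> Review:
    return ("reject", 4, "step 1 is unjustified")


def lenient(question: str, trace: List[str], answer: str) -> Review:
    return ("accept", 2, "answer looks plausible")


print(review_in_parallel([strict, lenient], "Q", ["step 1"], "42"))
```

Threads are a reasonable fit when each review is dominated by waiting on a remote model; the wall-clock time of Stage 2 is then bounded by the slowest reviewer rather than the sum over reviewers.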

No special fine-tuning or proprietary data is used; all agents use standard, publicly available models with the decoding parameters listed above.

7. Comparative Significance and Paradigm Implications

Canonical MARS demonstrates that role-based, hierarchical agent decomposition retains most of the collaborative accuracy benefits observed in previous multi-agent LLM protocols, while reducing the resource footprint by eliminating quadratic communication dependencies. The strict segregation of reviewer perspectives and the single-aggregation meta-decision are empirically shown to be efficient, with accuracy comparable to MAD under controlled experimental conditions: higher on GPQA and essentially equal on MMLU with GPT-3.5-turbo, but lower on GSM8K and on GPQA with Mixtral-8×22B. These results support the MARS architecture as a scalable protocol, in both theory and practice, for multi-agent LLM reasoning systems (Wang et al., 24 Sep 2025).

A plausible implication is that as LLM collaborative frameworks scale to larger agent pools or more complex decision processes, design patterns emulating human academic workflows—centralized arbiter, independent review, iterative revision—may become increasingly advantageous, both for efficiency and for controllability of agent interactions.


References

Wang et al. (2025). MARS: toward more efficient multi-agent collaboration for LLM reasoning.
