Canonical Role-Based MARS Framework

Updated 3 March 2026
  • Canonical role-based MARS is a structured multi-agent review system that assigns distinct roles (author, reviewer, meta-reviewer) to efficiently generate, evaluate, and revise solutions.
  • It achieves linear communication scaling by restricting interactions to a star topology, reducing token usage and inference time compared to traditional round-table debates.
  • Empirical benchmarks show that MARS maintains competitive accuracy while significantly lowering computational resources, making it a scalable approach for complex reasoning tasks.

Canonical role-based MARS (Multi-Agent Review System) is a computational framework for collaborative reasoning among LLMs, modeled on the academic peer-review process. Unlike prior approaches such as Multi-Agent Debate (MAD), which employ round-table agent interactions with quadratic communication scaling, MARS enforces role separation (author, reviewer(s), meta-reviewer), achieving linear communication cost while maintaining inferential accuracy. The system formalizes agent responsibilities and interaction patterns to efficiently elicit and integrate diverse model judgments for complex reasoning tasks (Wang et al., 24 Sep 2025).

1. Formalization of Agent Roles and System Structure

MARS instantiates three canonical agent types for each problem instance:

  • Author Agent ($\mathcal{A}$): Receives input query $Q$, generates a full solution $S_0 = (t, y)$, where $t = (t^1, t^2, ..., t^K)$ is the chain-of-thought trace and $y$ is the answer. $S_0 = f_{\text{author}}(Q)$; concretely, $(t, y) = \mathcal{A}(Q)$.
  • $m$ Reviewer Agents ($\mathcal{R}_1, ..., \mathcal{R}_m$): Each sees only $S_0$, producing independently for each $j$:
    • Decision $D_j \in \{\text{accept}, \text{reject}\}$;
    • Confidence score $c_j \in [1,5]$;
    • Justification $C_j$.
    • Output $r_j = (D_j, c_j, C_j) = \mathcal{R}_j(Q, t, y)$. Functionally $D_j = f_{\text{review},j}(S_0)$, $C_j = g_{\text{review},j}(S_0)$.
  • Meta-Reviewer Agent ($\mathcal{M}$): Aggregates $S_0$ and $\{r_j\}$, issues:
    • Meta-decision $D_{\text{meta}} \in \{\text{accept}, \text{reject}\}$,
    • Rationale $J_{\text{meta}}$,
    • If rejected: actionable feedback $F_{\text{meta}}$.
    • Formally: $m_{\text{out}} = \mathcal{M}(Q, t, y, r_1, ..., r_m)$. Integration rule: $S_1 = h_{\text{meta}}(S_0, \{D_j, C_j\}_{j=1..m})$.

The system halts if $D_{\text{meta}} = \text{accept}$, or after a maximum number of rounds $T$ (by default, $T=1$ in canonical MARS).
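
As an illustration of the role outputs defined above, the following is a minimal Python sketch of the three record types. The class and field names (`Solution`, `Review`, `MetaOutput`, and so on) are hypothetical and do not come from the paper; the validation rules simply encode the stated ranges ($c_j \in [1,5]$) and the condition that feedback accompanies a rejection.

```python
from dataclasses import dataclass
from typing import List, Literal, Optional

Decision = Literal["accept", "reject"]


@dataclass(frozen=True)
class Solution:
    """Author output S_0 = (t, y). Names are illustrative."""
    trace: List[str]  # chain-of-thought steps t^1 .. t^K
    answer: str       # final answer y


@dataclass(frozen=True)
class Review:
    """Reviewer output r_j = (D_j, c_j, C_j)."""
    decision: Decision
    confidence: float  # c_j, constrained to [1, 5]
    justification: str

    def __post_init__(self) -> None:
        if not 1 <= self.confidence <= 5:
            raise ValueError("confidence must lie in [1, 5]")


@dataclass(frozen=True)
class MetaOutput:
    """Meta-reviewer output (D_meta, J_meta, F_meta)."""
    decision: Decision
    rationale: str
    feedback: Optional[str] = None  # F_meta, expected when rejecting

    def __post_init__(self) -> None:
        # Assumption: a rejection without feedback is treated as invalid.
        if self.decision == "reject" and not self.feedback:
            raise ValueError("a rejection must carry actionable feedback")
```

Keeping the records immutable mirrors the one-way data flow: a reviewer's output is produced once and then only read by the meta-reviewer.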

2. Algorithmic Workflow and Data Flow

Canonical MARS operates as a four-stage sequential protocol:

  1. Author agent $\mathcal{A}$ produces the initial solution $(t, y) = \mathcal{A}(Q)$.
  2. Each reviewer $\mathcal{R}_j$ independently evaluates $S_0$, issuing $r_j = (D_j, c_j, C_j)$.
  3. Meta-reviewer $\mathcal{M}$ receives all $r_j$ and $S_0$, emits $m_{\text{out}} = (D_{\text{meta}}, J_{\text{meta}}, F_{\text{meta}})$.
  4. If $D_{\text{meta}} = \text{accept}$, the process stops and outputs $y$. If rejected, the author revises using $F_{\text{meta}}$: $y^* = \mathcal{A}_{\text{revise}}(t, y, F_{\text{meta}})$.

Pseudocode summary:

Input: Q, 𝒜, {ℛ₁, …, ℛₘ}, 𝒨
Output: y*

(t, y) = 𝒜.generate_solution(Q)
for j in 1..m:
    r[j] = ℛⱼ.evaluate(Q, t, y)
m_out = 𝒨.integrate_and_decide(Q, t, y, {r[1], …, r[m]})
if m_out.D_meta == accept:
    y* = y
else:
    y* = 𝒜.revise(t, y, m_out.F_meta)
return y*

Information flow is strictly from author to reviewers, from reviewers to meta-reviewer, and meta-reviewer back to author if revision is needed.
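
The single-round protocol can also be sketched as runnable Python. This is an illustrative sketch, not the authors' implementation: the agents are plain callables standing in for LLM calls, the function name `run_mars` and the toy agents are hypothetical, and the toy meta-reviewer's rule (reject if any reviewer rejects) is an assumption made only to exercise both branches.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical type aliases; a real agent would wrap an LLM call.
Solution = Tuple[str, str]                # (trace t, answer y)
Review = Tuple[str, int, str]             # (D_j, c_j, C_j)
MetaOut = Tuple[str, str, Optional[str]]  # (D_meta, J_meta, F_meta)


def run_mars(
    query: str,
    author: Callable[[str], Solution],
    reviewers: List[Callable[[str, str, str], Review]],
    meta_reviewer: Callable[[str, str, str, List[Review]], MetaOut],
    revise: Callable[[str, str, Optional[str]], str],
) -> str:
    """One review round followed by at most one revision (T = 1)."""
    trace, answer = author(query)
    # Reviewers see only the query and the author's solution, never each other.
    reviews = [review(query, trace, answer) for review in reviewers]
    decision, _rationale, feedback = meta_reviewer(query, trace, answer, reviews)
    if decision == "accept":
        return answer
    return revise(trace, answer, feedback)


# Toy stand-ins for LLM-backed agents (purely illustrative).
def toy_author(query: str) -> Solution:
    return ("2 + 2 = 5", "5")  # deliberately wrong first attempt


def toy_reviewer(query: str, trace: str, answer: str) -> Review:
    verdict = "accept" if answer == "4" else "reject"
    return (verdict, 5, "checked the arithmetic")


def toy_meta(query: str, trace: str, answer: str, reviews: List[Review]) -> MetaOut:
    # Assumed aggregation rule: reject if any reviewer rejects.
    if any(decision == "reject" for decision, _, _ in reviews):
        return ("reject", "a reviewer found an error", "recompute the sum")
    return ("accept", "all reviewers accepted", None)


def toy_revise(trace: str, answer: str, feedback: Optional[str]) -> str:
    return "4"


result = run_mars(
    "What is 2 + 2?", toy_author, [toy_reviewer, toy_reviewer], toy_meta, toy_revise
)
print(result)  # prints: 4
```

Because the reviewers are evaluated in a plain list comprehension with no shared state, the loop could be replaced by parallel calls without changing the result.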

3. Mathematical Formulation and Communication Complexity

Key equations defining the system:

  • Author generation: $S_0 = f_{\text{author}}(Q)$, $(t, y) = \mathcal{A}(Q)$
  • Reviewer output: $r_i = \mathcal{R}_i(Q, t, y) = (D_i, c_i, C_i)$
  • Meta-review: $S_{t+1} = h_{\text{meta}}(S_t, \{D_i, C_i\}_{i=1..m})$, $m_{\text{out}} = \mathcal{M}(Q, t, y, r_1 \oplus ... \oplus r_m)$
  • Rebuttal/revision (if needed): $y^* = \mathcal{A}_{\text{revise}}(t, y, m_{\text{out}}.F_{\text{meta}})$

Token and time complexity:

  • For round-table MAD (multi-agent debate with $m$ agents): $\mathrm{Tokens}_{\text{MAD}} \approx m \cdot L_{\text{author}} + m \cdot (m-1) \cdot L_{\text{msg}} \implies O(m^2)$
  • For MARS: $\mathrm{Tokens}_{\text{MARS}} \approx L_{\text{author}} + m \cdot L_{\text{rev}} + L_{\text{meta}} + L_{\text{reb}} \implies O(m)$

Empirically, for $m=2$ reviewers:

  • $\mathrm{Tokens}_{\text{MAD}} \approx 5042$ (ChatGPT, GPQA)
  • $\mathrm{Tokens}_{\text{MARS}} \approx 2479$
  • Approximately $50\%$ reduction in both token usage and wall-clock inference time when $m$ is modest.
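
To make the two growth rates concrete, the token estimates can be evaluated directly. The sketch below implements the two approximations as written; the per-message lengths passed in the loop are made-up placeholder values (assuming reviews are shorter than full solutions), not measurements from the paper.

```python
def tokens_mad(m: int, l_author: int, l_msg: int) -> int:
    """Round-table debate estimate: m * L_author + m * (m - 1) * L_msg."""
    return m * l_author + m * (m - 1) * l_msg


def tokens_mars(m: int, l_author: int, l_rev: int, l_meta: int, l_reb: int) -> int:
    """MARS estimate: L_author + m * L_rev + L_meta + L_reb."""
    return l_author + m * l_rev + l_meta + l_reb


# Hypothetical lengths in tokens, chosen only to illustrate the scaling.
L_AUTHOR, L_MSG, L_REV, L_META, L_REB = 800, 800, 200, 200, 400

for m in (2, 4, 8, 16):
    mad = tokens_mad(m, L_AUTHOR, L_MSG)
    mars = tokens_mars(m, L_AUTHOR, L_REV, L_META, L_REB)
    print(f"m={m:2d}  MAD={mad:7d}  MARS={mars:5d}  ratio={mad / mars:.1f}")
```

Each added reviewer raises the MARS estimate by a fixed $L_{\text{rev}}$, whereas the increment in the MAD estimate itself grows with $m$, so the ratio widens as the agent pool grows.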

4. Communication Pattern: Elimination of Reviewer-to-Reviewer Dependence

A core property of canonical MARS is that each reviewer only sees the author’s output $S_0$, never the views of other reviewers. The meta-reviewer serves as the sole aggregation point. This "star" topology replaces the complete $m$-node graph of message-passing in MAD with a strict pipeline: author $\to$ reviewers $\to$ meta-reviewer $\to$ (author if revision). The implication is linear scaling; in contrast, MAD’s message passing is quadratic in the number of agents due to repeated cross-review. Empirical measurements with $m=2$ or $3$ confirm a reliable $\approx 50\%$ saving in resource consumption for MARS compared to MAD.

5. Empirical Benchmark Performance

Experiments were conducted with GPT-3.5-turbo and Mixtral-8×22B as model backbones, $m=2$ reviewers, and one round each for review and rebuttal. Benchmarks included GPQA, MMLU, and GSM8K. Key results per Table 1 of (Wang et al., 24 Sep 2025):

  • GPQA (ChatGPT): MAD—31.00% accuracy, 5042 tokens, 11.92s; MARS—36.33%, 2479 tokens, 6.01s.
  • MMLU (ChatGPT): MAD—71.33% accuracy, 3194 tokens, 7.64s; MARS—71.00%, 1702 tokens, 4.71s.
  • GSM8K (ChatGPT): MAD—79.00% accuracy, 2906 tokens, 7.92s; MARS—75.67%, 1655 tokens, 4.32s.
  • Mixtral-8×22B results exhibit qualitatively similar scaling, with tokens and inference time approximately halved.

Statistical analysis over 1000 samples per benchmark shows no significant difference in accuracy ($p>0.1$) but highly significant improvements in efficiency ($p<0.001$).

| Backbone | Benchmark | MAD Accuracy | MARS Accuracy | Token Saving | Time Saving |
|---|---|---|---|---|---|
| GPT-3.5-turbo | GPQA | 31.00% | 36.33% | 50% | ~50% |
| GPT-3.5-turbo | MMLU | 71.33% | 71.00% | 47% | ~38% |
| GPT-3.5-turbo | GSM8K | 79.00% | 75.67% | 43% | ~45% |
| Mixtral-8×22B | GPQA | 47.00% | 44.00% | 56% | ~55% |

The reduction in token and compute usage is attributable directly to the elimination of reviewer-to-reviewer communications.
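
The savings can be recomputed from the reported ChatGPT token counts and latencies listed above. The short sketch below does only that arithmetic; the dictionary layout and the helper name `saving` are illustrative choices, and the results agree with the summarized percentages to within about one percentage point.

```python
# (tokens, seconds) per method, copied from the ChatGPT results listed above.
reported = {
    "GPQA": {"MAD": (5042, 11.92), "MARS": (2479, 6.01)},
    "MMLU": {"MAD": (3194, 7.64), "MARS": (1702, 4.71)},
    "GSM8K": {"MAD": (2906, 7.92), "MARS": (1655, 4.32)},
}


def saving(baseline: float, new: float) -> float:
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline


for name, row in reported.items():
    (mad_tokens, mad_secs), (mars_tokens, mars_secs) = row["MAD"], row["MARS"]
    print(
        f"{name}: token saving {saving(mad_tokens, mars_tokens):.1f}%, "
        f"time saving {saving(mad_secs, mars_secs):.1f}%"
    )
```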

6. Experimental and Implementation Details

The canonical configuration:

  • LLMs: GPT-3.5-turbo, Mixtral-8×22B (via NVIDIA NIM).
  • Agents: $m=2$ reviewers, 1 meta-reviewer, all using the same model backbone.
  • Prompts: Templates for author, reviewer, meta-reviewer, and rebuttal, as detailed in Appendix C.
  • LLM parameters: temperature=0.7, max_tokens=2048.
  • Hardware: NVIDIA A100 GPUs.
  • Protocol: One review and one rebuttal round; process stops when $D_{\text{meta}} = \text{accept}$ or after $T$ rounds.
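
The configuration above could be captured in a small settings object. This is a hypothetical sketch: `MarsConfig`, its field names, and the backbone identifier string are illustrative, and `max_llm_calls` assumes each stage costs exactly one model call per agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarsConfig:
    """Illustrative container for the canonical settings listed above."""
    backbone: str = "gpt-3.5-turbo"  # assumed identifier; all agents share one backbone
    num_reviewers: int = 2           # m
    temperature: float = 0.7
    max_tokens: int = 2048
    max_rounds: int = 1              # T

    def max_llm_calls(self) -> int:
        # One author call, then per round: m reviews, one meta-review,
        # and at most one revision (assumes one call per stage per agent).
        return 1 + self.max_rounds * (self.num_reviewers + 2)


config = MarsConfig()
print(config.max_llm_calls())  # prints: 5
```

Under these assumptions the worst-case call count grows linearly in both $m$ and $T$, consistent with the linear communication cost described for the star topology.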

The architecture can be summarized textually:

  • Stage 1 (Author): $Q \to$ CoT reasoning $\to S_0$
  • Stage 2 (Review): $S_0 \to$ reviewers $[\mathcal{R}_1, ..., \mathcal{R}_m]$ (evaluated in parallel) $\to \{r_i\}$
  • Stage 3 (Meta): $\{r_i\}, S_0 \to \mathcal{M} \to D_{\text{meta}}, J_{\text{meta}}, F_{\text{meta}}$
  • Stage 4 (Rebuttal): If $D_{\text{meta}} = \text{reject}$, $S_0$ is revised per $F_{\text{meta}}$ to produce the final $y^*$

No special fine-tuning or proprietary data is used; all agents employ standard, publicly available models and default parameters.

7. Comparative Significance and Paradigm Implications

Canonical MARS demonstrates that role-based, hierarchical agent decomposition retains the collaborative accuracy benefits observed in previous multi-agent LLM protocols while reducing the resource footprint by eliminating quadratic communication dependencies. The strict segregation of reviewer perspectives and the single-aggregation meta-decision are empirically validated as efficient, with accuracy statistically comparable to MAD under controlled experimental conditions: higher on GPQA with GPT-3.5-turbo (36.33% vs. 31.00%), roughly equal on MMLU (71.00% vs. 71.33%), and somewhat lower on GSM8K (75.67% vs. 79.00%). This establishes the MARS architecture as both a theoretically and practically scalable protocol for multi-agent LLM reasoning systems (Wang et al., 24 Sep 2025).

A plausible implication is that as LLM collaborative frameworks scale to larger agent pools or more complex decision processes, design patterns emulating human academic workflows—centralized arbiter, independent review, iterative revision—may become increasingly advantageous, both for efficiency and for controllability of agent interactions.
