
Multi-Agent Reflexion (MAR)

Updated 7 March 2026
  • Multi-Agent Reflexion (MAR) is a framework where multiple autonomous agents, each with a unique persona, iteratively critique and refine outputs.
  • It employs structured interaction protocols—such as panel-style moderation, debate, and critique loops—to reconcile divergent stakeholder perspectives.
  • MAR is applied in generative design, personalized content, legal summarization, and argument generation to enhance diversity and alignment of output.

Multi-Agent Reflexion (MAR) refers to frameworks in which multiple autonomous agents, each endowed with explicit, contextually grounded personas (or preferences), engage in iterative discussion and consensus-building to critique, refine, and steer the generation or evaluation of artifacts such as text, designs, or decisions. MAR protocols have recently been instantiated as practical systems in generative design, creative writing, summarization, and other domains where capturing diverse stakeholder perspectives or user-centric criteria is essential.

1. Formal Foundations and Definitions

MAR extends traditional single-agent reflexion (where a model iteratively critiques and refines its own outputs) to a multi-agent paradigm. The defining characteristic is the instantiation of $N \geq 2$ agents, typically modeled as LLM "personas," that operate independently (in parallel or in sequence), first generating feedback or proposals from diverse, sometimes conflicting vantage points, followed by a structured interaction—debate, moderation, or aggregation—to synthesize or select among these outputs.

In a canonical MAR setup:

  • Let $\mathcal{A}_1,\ldots,\mathcal{A}_N$ denote agents, each with its own persona parameterization $\pi_i$ (e.g., user profile, audience segment, expert archetype).
  • For an input artifact $x$ (e.g., poster draft, story, legal summary), each agent evaluates or proposes edits $f_i(x)$, possibly under a unique rubric $r_i$.
  • A moderator module (another LLM, or algorithmic process) $M$ observes $\{f_i(x)\}$, optionally anchors on user input, facilitates discussion or questioning across agents, and generates a consensus output $\hat{y}$ or a critique summary.

This process is codified by conditional generation and evaluation chains such as:

$$\hat{y} = M\left(\{f_i(x,\pi_i,r_i)\}_{i=1}^N, \text{user comment}\right)$$

The approach systematically exposes, reconciles, or calibrates conflicts between differing persona-grounded critiques to yield outputs that are more robust, diverse, or stakeholder-aligned than single-agent or persona-agnostic baselines (Shin et al., 24 Jul 2025, Ueda et al., 16 Sep 2025, Hu et al., 2024).
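The conditional chain above can be sketched in Python. This is a minimal illustration of the formalism, not any paper's implementation: `llm` is a hypothetical stand-in for a text-completion call, and the persona and rubric strings are invented for the example.

```python
from dataclasses import dataclass
from typing import List

def llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM completion call.
    return f"[response to: {prompt[:40]}...]"

@dataclass
class Agent:
    persona: str  # pi_i: persona parameterization
    rubric: str   # r_i: agent-specific evaluation rubric

    def critique(self, artifact: str) -> str:
        # f_i(x, pi_i, r_i): persona-grounded critique of the artifact x
        return llm(f"As {self.persona}, critique under rubric "
                   f"'{self.rubric}':\n{artifact}")

def moderate(critiques: List[str], user_comment: str = "") -> str:
    # M({f_i}, user comment): fuse the critiques into a consensus y-hat
    joined = "\n---\n".join(critiques)
    anchor = f" given user comment: {user_comment}" if user_comment else ""
    return llm(f"Reconcile these critiques{anchor}:\n{joined}")

agents = [Agent("a price-sensitive shopper", "cost clarity"),
          Agent("a fashion-forward buyer", "visual appeal")]
critiques = [a.critique("Poster draft: 50% off all jackets") for a in agents]
consensus = moderate(critiques, user_comment="keep the headline short")
```

In a real system each call to `llm` would hit a model endpoint; the structure of interest is that each agent critiques independently before the moderator sees the pooled feedback.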

2. Persona Construction and Assignment

Persona specification is foundational in MAR. Current systems leverage multimodal LLMs or trait-extraction pipelines to instantiate personas via three primary strategies:

  • Dimension-based Elicitation: PosterMate (Shin et al., 24 Jul 2025) prompts an LLM to extract two "steerable" audience dimensions (e.g., fashion sensitivity vs. price sensitivity) from an input brief. Extrema in each dimension define a $2\times2$ grid of personas.
  • Data-driven Clustering: Proxona (Choi et al., 2024) analyzes large-scale audience comment corpora to extract dimension–value taxonomies, assigns each comment a tuple of values, embeds them (SBERT), and clusters to yield empirically grounded personas representing diverse audience segments.
  • Profile Synthesis from User Histories: PREFINE (Ueda et al., 16 Sep 2025) synthesizes a pseudo-user agent by prompting a high-capacity LLM on a user's historical selections to extract explicit preferences, then bases the persona on these bullet-pointed observations.

The diversity, granularity, and realism of these personas directly affect the fidelity and coverage of the ensuing MAR process. Automatic, LLM-prompted persona construction is favored because it reduces variance and yields higher task relevance than handcrafted personas (Kim et al., 2024).

| Approach | Data Source | Representation |
|---|---|---|
| Dimension-based | Task/Design brief | LLM-generated matrix |
| Clustering-based | User comments | Embedding + k-means |
| Profile-based | User history | LLM-synthesized text |
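The clustering-based route can be sketched as follows. A real pipeline would embed comments with SBERT; here a toy bag-of-words vector and a minimal k-means stand in for both, so the vocabulary and comments are purely illustrative.

```python
import random
from collections import Counter

VOCAB = ["cheap", "price", "deal", "style", "trendy", "look"]

def embed(comment: str) -> tuple:
    # Toy bag-of-words vector standing in for an SBERT embedding.
    counts = Counter(comment.lower().split())
    return tuple(float(counts[w]) for w in VOCAB)

def kmeans(points, k, iters=20, seed=0):
    # Minimal Lloyd's k-means: assign to nearest center, recompute means.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        for n, p in enumerate(points):
            labels[n] = min(range(k),
                            key=lambda i: sum((a - b) ** 2
                                              for a, b in zip(p, centers[i])))
        for i in range(k):
            members = [points[n] for n in range(len(points)) if labels[n] == i]
            if members:
                centers[i] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels

comments = ["cheap price deal", "price deal cheap deal",
            "trendy style look", "style look trendy style"]
labels = kmeans([embed(c) for c in comments], k=2)
# Price-focused and style-focused comments land in different clusters,
# and each cluster would then seed one audience persona.
```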

3. Agent-to-Agent Interaction and Consensus Mechanisms

MAR systems differ in how agent critiques are synthesized. Three core paradigms have emerged:

  1. Panel-Style Moderation: PosterMate (Shin et al., 24 Jul 2025) aggregates persona feedback, automatically detects points of conflict, and has a moderator LLM orchestrate open-ended discussion, prompting each persona for compromise, then fusing responses. No explicit voting; majority rationales are privileged.
  2. Debate and Critic Protocols: Debate-to-Write (Hu et al., 2024) forms N=3 agents (distinct personas) plus a "critic" persona. Agents take turns debating, responding to previous statements, until the critic is satisfied—yielding a nonlinear, multi-turn transcript that is distilled into an output plan.
  3. Critique-and-Refine Loops: PREFINE (Ueda et al., 16 Sep 2025) iteratively critiques and revises an artifact according to user-specific rubrics synthesized from the persona, alternating between proposal, rubric-guided critique, and refinement, until convergence or satisfaction.

These frameworks make explicit both the divergences and admissible consensus among simulated stakeholders, supporting traceable rationales for downstream edits or decisions.
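The critique-and-refine paradigm can be reduced to a short control loop. The sketch below assumes a critic that returns `None` when satisfied; the toy critic and reviser are hypothetical, not taken from PREFINE.

```python
def refine_loop(draft, critic, reviser, max_rounds=3):
    # Alternate rubric-guided critique and revision until the critic
    # is satisfied (returns None) or the round budget is exhausted.
    for _ in range(max_rounds):
        critique = critic(draft)
        if critique is None:  # converged: critic has no objections
            return draft
        draft = reviser(draft, critique)
    return draft

# Hypothetical toy stages: the critic flags a missing required detail,
# the reviser patches it in. Real systems back both with LLM calls.
def critic(draft):
    return "mention the deadline" if "deadline" not in draft else None

def reviser(draft, critique):
    return draft + " (deadline: Friday)"

final = refine_loop("Submit the report.", critic, reviser)
```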

4. Evaluation Metrics and Empirical Results

Assessment of MAR hinges on evaluating both artifact quality and the degree to which persona-driven diversity/consistency is realized. Notable evaluation dimensions:

  • Persona Appropriateness: PosterMate's Study 2 measured the accuracy with which human raters could attribute feedback to the correct persona (text: 52.1%, image: 64.3%, theme: 21.9%; chance = 25% for four personas) (Shin et al., 24 Jul 2025).
  • Diversity and Coverage: Debate-to-Write reported semantic and perspective diversity via Self-BLEU, Self-Emb, and PersDiv, showing MAR yields outputs with lowest similarity across runs and highest diversity relative to baselines (Hu et al., 2024).
  • Stakeholder-Specific Evaluation: PersonaMatrix (Pang et al., 19 Sep 2025) introduced persona-by-criterion matrices and the Diversity-Coverage Index (DCI) to quantify how distinct persona preferences are along controlled dimensions (depth, technicality, narrative), confirming substantial divergence between legal experts, journalists, and the public.
  • Personalization and Rubric Contribution: PREFINE ablation on story generation tasks showed both the explicit persona (EP) and user-specific rubric (R) are necessary; removing either dropped win rates by 15–20 points (Ueda et al., 16 Sep 2025).
  • Consensus Quality: PosterMate found that consensus outputs were preferred over any individual persona's feedback by sizeable margins (text: 40.5% vs. 14.9%; image: 69.5% vs. 7.6%; theme: 42.5% vs. 14.3%) (Shin et al., 24 Jul 2025).
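To make the diversity measurement concrete, the sketch below computes mean pairwise n-gram overlap across generated outputs. This is a simplified Jaccard-based proxy for Self-BLEU, not the exact metric from the paper; lower overlap indicates more diverse outputs.

```python
from itertools import combinations

def ngrams(text, n=2):
    # Set of word n-grams in a text (bigrams by default).
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def self_overlap(outputs, n=2):
    # Mean pairwise Jaccard overlap of n-gram sets across outputs.
    scores = []
    for a, b in combinations(outputs, 2):
        ga, gb = ngrams(a, n), ngrams(b, n)
        scores.append(len(ga & gb) / len(ga | gb) if ga | gb else 0.0)
    return sum(scores) / len(scores)

diverse = ["the cat sat here", "dogs run very fast", "rain fell all night"]
similar = ["the cat sat here", "the cat sat there", "the cat sat here too"]
d = self_overlap(diverse)   # no shared bigrams -> 0.0
s = self_overlap(similar)   # heavy bigram reuse -> well above 0
```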

5. Applications and Domain Extensions

MAR is applicable wherever reconciling multiple, potentially competing stakeholder perspectives enhances the value of an output. Documented instantiations include:

  • Design Feedback and Prototyping: PosterMate assists in poster design, but the MAR approach generalizes to e-commerce UI, packaging, email marketing, and signage, with multimodal artifact representations seamlessly integrated via LLMs and image models (Shin et al., 24 Jul 2025).
  • Personalized Content Generation: PREFINE demonstrates personalized story and dialogue generation by simulating critics tailored to individual user profiles, obviating the need for continual explicit feedback or fine-tuning (Ueda et al., 16 Sep 2025).
  • Opinion and Argument Generation: Debate-to-Write establishes that multi-agent persona debate yields arguments with enhanced diversity and coherence, outperforming linear planning (Hu et al., 2024).
  • Legal Summarization: PersonaMatrix's persona-by-criterion grids reveal divergences in summary preference across litigators, journalists, educators, and advocates, providing actionable critique for summary refinement (Pang et al., 19 Sep 2025).
  • Audience Sensemaking: Proxona operationalizes MAR to support creators in interrogating clustered audience personas drawn from real comment data, closing the loop between user data and content ideation (Choi et al., 2024).

6. Implementation Patterns and Limitations

Best practices for MAR construction gleaned from empirical studies include:

  • Use LLM-generated personas to minimize variance and maximize specific task relevance (Kim et al., 2024).
  • Structure agent pipelines in modular stages (persona builder, agent feedback, moderator/critic, integrator) to support extensibility.
  • Moderate persona diversity to balance coverage with cognitive load (overabundance of personas can overwhelm end users).
  • Ground persona construction in empirical data (e.g., real comments, user histories) to enhance relevance and interpretability.
  • Aggregate outputs via rationality-based moderation (rather than strict voting), aligning with the majority rationale absent outlier bias.
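The modular staging recommended above might be wired together as follows. The stage names and toy lambdas are hypothetical; each callable stands in for an LLM-backed component, and the point is that every stage is independently swappable.

```python
from typing import Callable, List

def run_pipeline(brief: str,
                 build_personas: Callable[[str], List[str]],
                 agent_feedback: Callable[[str, str], str],
                 moderate: Callable[[List[str]], str],
                 integrate: Callable[[str, str], str]) -> str:
    # Persona builder -> per-agent feedback -> moderator -> integrator.
    personas = build_personas(brief)
    critiques = [agent_feedback(p, brief) for p in personas]
    consensus = moderate(critiques)
    return integrate(brief, consensus)

# Toy stage implementations for illustration only.
result = run_pipeline(
    "Poster: summer sale",
    build_personas=lambda brief: ["budget shopper", "design enthusiast"],
    agent_feedback=lambda persona, brief: f"{persona}: emphasize value",
    moderate=lambda critiques: "; ".join(critiques),
    integrate=lambda brief, consensus: f"{brief} [revised per: {consensus}]",
)
```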

Recognized limitations include the risk of hallucination or drift in LLM-generated persona criteria, as well as challenges in calibrating consensus when simulated personas lack grounding in the real user distribution or when the task structure is highly open-ended.

7. Broader Implications and Outlook

MAR introduces a scalable, systematic mechanism for simulating, interrogating, and reconciling diverse perspectives in generative AI workflows. The formalization of persona-driven critics—as structured, interactive agents equipped with explicit priorities—enables coverage of niche or underrepresented viewpoints and transparent trade-off surfacing. Frameworks such as DCI operationalize the measurement of diversity and stakeholder coverage. Ongoing research explores extending MAR to more complex domain scenarios (multi-level clustering, multimodal tasks), incorporating atomic-level fidelity metrics for in-character generation (Shin et al., 24 Jun 2025), and optimizing critique/refinement loops via real-time and ensemble-based protocols. Open challenges include scaling to higher agent counts, refining criteria aggregation techniques, and continued validation against real human judgments. MAR provides a general recipe for aligning AI outputs with multidimensional, stakeholder- or user-centric criteria in high-stakes and creative settings.
