Co-Evolving Multi-Agent Systems via Interaction Rewards

Updated 12 March 2026

The paper's main contribution is demonstrating that agents can self-evolve using intrinsic rewards from inter-agent interactions, removing the need for external labels.
CoMAS employs methodologies like multi-stage workflows, peer evaluation, and reward-sharing in both LLM and Markov game frameworks to boost collaborative performance.
Empirical results reveal performance gains (e.g., a 4.54% improvement on challenging benchmarks) and enhanced stability over extended reinforcement learning periods.

Co-Evolving Multi-Agent Systems via Interaction Rewards (CoMAS) refers to a class of frameworks in which multiple autonomous agents iteratively improve their policies using intrinsic rewards derived exclusively from inter-agent interactions, without external labels or human-curated reward functions. In both language task domains and broader Markovian environments, the CoMAS paradigm enables agents to advance their reasoning, collaboration, and task performance through self-generated feedback, mutual evaluation, and decentralized reinforcement learning. Central instantiations include frameworks in which LLMs improve through discussion and peer evaluation, and general multi-agent Markov games with learned incentive structures (Xue et al., 9 Oct 2025, Chen et al., 27 Oct 2025, Kölle et al., 2023).

1. Foundational Principles of Co-Evolving Multi-Agent Systems

CoMAS systems are predicated on the hypothesis that agents can achieve continual self-evolution when the reward landscape is shaped by the dynamic information exchange among agents. In text-based reasoning domains, these interactions typically manifest as agents proposing solutions, evaluating peers' responses, and negotiating a consensus or a ranking, with judgments explicitly formulated through agent-generated scores. In Markov games and grid-worlds, CoMAS is instantiated by having agents directly internalize each other's rewards via participation mechanisms or share trading.

Formally, agents $u_k$ are parameterized by policies $\pi_{\theta_k}$, which are updated on the basis of rewards derived from discussion, critique, and mutual assessment rather than from environment-defined extrinsic goals or externally validated datasets (Xue et al., 9 Oct 2025, Chen et al., 27 Oct 2025, Kölle et al., 2023).

2. CoMAS Architectures and Protocols

Two representative architectures have emerged:

Discussion-based LLM CoMAS: Agents engage in a multi-stage workflow comprising solution generation, peer evaluation, and scoring. Each stage runs in rolling rounds over sampled agents, with a shared or agent-specific LLM acting as both problem solver and judge. Generated outputs, evaluations, and meta-evaluations are parsed for reward computation (Xue et al., 9 Oct 2025, Chen et al., 27 Oct 2025).
Reward-Participation CoMAS: In environments formalized as general-sum Markov games, each agent's reward at each timestep is defined as a linear combination (via a share matrix $W$) of the raw rewards of all agents. Participation rights, that is, tradable shares of others' returns, are learned and adjusted as part of each agent's policy (Kölle et al., 2023).
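The solve, evaluate, score workflow of the discussion-based variant can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Agent` callable signature, the `Interaction` record, and the round-robin assignment of evaluator and scorer are all assumptions made here (the text above only says that agents are sampled in rolling rounds).

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical signature: an agent maps (role, prompt) to generated text.
Agent = Callable[[str, str], str]


@dataclass
class Interaction:
    """One solve -> evaluate -> score chain (illustrative record type)."""
    solver: int
    evaluator: int
    scorer: int
    solution: str
    evaluation: str
    score_text: str


def discussion_round(agents: List[Agent], question: str) -> List[Interaction]:
    """Run one pass in which every agent solves once.

    Assumption: evaluator and scorer are chosen round-robin; with a single
    agent the same model plays all three roles.
    """
    n = len(agents)
    records = []
    for i in range(n):
        j = (i + 1) % n  # index of the evaluating agent
        k = (i + 2) % n  # index of the scoring agent
        solution = agents[i]("solve", question)
        evaluation = agents[j]("evaluate", f"{question}\n{solution}")
        score_text = agents[k]("score", f"{question}\n{solution}\n{evaluation}")
        records.append(Interaction(i, j, k, solution, evaluation, score_text))
    return records
```

The `score_text` field is kept as raw text because, per the description above, rewards are obtained by parsing the generated outputs afterwards.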

The following table summarizes salient aspects:

CoMAS Variant	Domain	Interaction Mechanism
Discussion/Evaluation (LLM)	Text synthesis	Peer evaluation + zero-sum rewards
Reward Participation (Markov games)	Environment RL	Reward share trading

3. Intrinsic Reward Formulation and Role-Specific Incentives

CoMAS frameworks define intrinsic rewards through explicit assessment of agent-to-agent interactions:

LLM-based: For each solution $s_i$ and evaluation $e_{i,j}$, the agent-scored output $\tau_{i,j}$ is parsed to $\hat{\tau}_{i,j} \in \{1,2,3\}$ and linearly normalized to form a zero-sum-style coupling in which the solver's and evaluator's rewards always sum to 1:

$r(s_i) = \frac{\hat{\tau}_{i,j} - 1}{2}, \quad r(e_{i,j}) = 1 - r(s_i) = \frac{3 - \hat{\tau}_{i,j}}{2}$

This ties solver and evaluator incentives in a strictly adversarial manner, driving truthful critique and robust solution finding (Xue et al., 9 Oct 2025).
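The normalization above is simple enough to check by hand: a score of 3 gives the solver reward 1 and the evaluator reward 0, a score of 1 reverses that, and a score of 2 splits the reward evenly. A minimal sketch follows; the regular expression used to extract the score is an assumption for illustration, since the actual output format expected from the scoring agent is not specified here.

```python
import re
from typing import Optional, Tuple


def parse_score(text: str) -> Optional[int]:
    """Return the first standalone digit 1, 2, or 3 in the text, else None.

    Assumption: the scorer writes its verdict as a bare digit; the real
    prompt and parsing format may differ.
    """
    match = re.search(r"\b([123])\b", text)
    return int(match.group(1)) if match else None


def interaction_rewards(score: int) -> Tuple[float, float]:
    """Map a parsed score in {1, 2, 3} to (r_solution, r_evaluation)."""
    if score not in (1, 2, 3):
        raise ValueError("score must be 1, 2, or 3")
    r_solution = (score - 1) / 2      # r(s_i) = (tau_hat - 1) / 2
    r_evaluation = 1 - r_solution     # r(e_ij) = (3 - tau_hat) / 2
    return r_solution, r_evaluation
```

Returning `None` for unparseable output leaves the caller free to decide how to handle a malformed score, which matters because reward extraction depends on format parsing.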

Participation-based: For agent $i$, the adjusted reward is

$\tilde{r}_i^t = \sum_{j=1}^N w_{ij} r_j (s_t, a_t^1, \dots, a_t^N)$

The reward-distribution shares $w_{ij}$ are themselves learned (via actor-critic or analytic-gradient methods) so that agents internalize group-level externalities (Kölle et al., 2023).
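The adjusted reward is a matrix-vector product between the share matrix and the vector of raw rewards at one timestep. The sketch below computes it for fixed shares in plain Python; the function name and the example share values are illustrative assumptions, and the learning of the shares themselves is not shown.

```python
from typing import List, Sequence


def participation_rewards(
    shares: Sequence[Sequence[float]], raw_rewards: Sequence[float]
) -> List[float]:
    """Compute r_tilde_i = sum_j w_ij * r_j for every agent i.

    shares[i][j] is the share w_ij that agent i holds in agent j's raw reward.
    """
    n = len(raw_rewards)
    if len(shares) != n or any(len(row) != n for row in shares):
        raise ValueError("shares must be an N x N matrix matching raw_rewards")
    return [
        sum(w_ij * r_j for w_ij, r_j in zip(row, raw_rewards))
        for row in shares
    ]


# Example with assumed values: agent 0 holds half of agent 1's reward.
example_shares = [[0.5, 0.5], [0.0, 1.0]]
example_raw = [2.0, 4.0]
adjusted = participation_rewards(example_shares, example_raw)
```

With an identity share matrix every agent simply receives its own raw reward, so the standard Markov game is recovered as a special case.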

Multi-role LLM: In MAE, three agents (Proposer, Solver, Judge) are instantiated from a single LLM. Role-specific returns integrate answer correctness, question difficulty (1 minus average solver success), and regulatory format compliance:

(Role-specific reward equations for the Proposer, Solver, and Judge; the formulas are not recoverable from this copy.)

This multifaceted scheme allows both competitive and cooperative pressures, increasing the generative robustness and quality of all agent policies (Chen et al., 27 Oct 2025).
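Of the reward components listed above, the question-difficulty term is defined explicitly as 1 minus the average solver success, so it can be sketched directly. The function name and the boolean encoding of solver outcomes are assumptions; how this term is weighted against correctness and format compliance is not given in this text and is deliberately left out.

```python
from typing import Sequence


def question_difficulty(solver_successes: Sequence[bool]) -> float:
    """Difficulty = 1 - (fraction of solver attempts judged correct)."""
    if not solver_successes:
        raise ValueError("need at least one solver attempt")
    success_rate = sum(solver_successes) / len(solver_successes)
    return 1.0 - success_rate


# A question solved in 1 of 4 attempts is treated as fairly hard.
hard = question_difficulty([True, False, False, False])
# A question solved every time carries no difficulty signal.
easy = question_difficulty([True, True])
```

Under this definition a Proposer is pushed toward questions the Solver often fails, which is the competitive pressure described above.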

4. Reinforcement Learning Optimization and Synchronized Co-Evolution

In all cited frameworks, policy optimization proceeds via RL objectives targeting cumulative intrinsic (interaction-based) reward:

LLM/Discussion: Rollouts are collected per agent, and policy-gradient updates are performed via REINFORCE++ with KL regularization relative to a static reference model, stabilizing learning and mitigating mode collapse. Advantages are normalized across the batch and rewards are clipped (Xue et al., 9 Oct 2025).
Reward Participation: Policy and participation-share parameters are updated concurrently, either via joint gradient steps (two-agent theory) or as part of an augmented action space (actor-critic). Constraints on the shares $w_{ij}$ are imposed via regularization.
Synchronized Updates (MAE): Role-specific advantages are normalized, and a lock-step sum of all roles' policy gradients is applied to the shared LLM parameter vector $\theta$ in each iteration. This ensures stable and efficient gradient flow in a multi-agent LLM setting (Chen et al., 27 Oct 2025).
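Three of the numerical steps mentioned above (reward clipping, batch-level advantage normalization, and the lock-step sum of per-role gradients) can be illustrated in a few lines. This is a simplified sketch under stated assumptions: rewards stand in for returns, the clipping range and `eps` are arbitrary illustrative values, gradients are plain lists of floats, and the KL penalty and the actual REINFORCE++ estimator are omitted.

```python
import math
from typing import Dict, List


def clip_rewards(rewards: List[float], low: float = 0.0, high: float = 1.0) -> List[float]:
    """Clamp each reward into [low, high] (range is an assumed default)."""
    return [min(max(r, low), high) for r in rewards]


def normalize_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Subtract the batch mean and divide by the batch standard deviation."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def summed_role_gradient(role_grads: Dict[str, List[float]]) -> List[float]:
    """Add per-role gradients elementwise for one shared parameter vector."""
    return [sum(parts) for parts in zip(*role_grads.values())]


# Illustrative values only.
advantages = normalize_advantages(clip_rewards([-0.5, 0.0, 1.0, 1.7]))
update = summed_role_gradient(
    {"proposer": [1.0, 2.0], "solver": [0.5, -1.0], "judge": [0.0, 1.0]}
)
```

Normalizing within each role before summing keeps one role's reward scale from dominating the shared update, which is the stabilizing effect the synchronized scheme relies on.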

5. Empirical Evaluation and Scalability

Experiments consistently demonstrate:

Performance Gains: Across benchmarks in mathematics, coding, science, and general knowledge, CoMAS frameworks yield absolute gains over both untrained LLMs and rule-based or voting-based interaction schemes. For example, MAE achieves an average improvement of 4.54% on challenging benchmarks without human-annotated data (Chen et al., 27 Oct 2025), while the CoMAS framework of Xue et al. attains 1–20% gains depending on interaction protocol, agent count, and evaluation setting (Xue et al., 9 Oct 2025).
Stability: Training remains stable over hundreds of RL fine-tuning steps, without catastrophic collapse or degenerate behaviors. By contrast, prior self-play LLMs are reported to collapse beyond 50 steps (Chen et al., 27 Oct 2025).
Role and Reward Ablations: Removing evaluation, scoring, or format components degrades performance, and reward hacking emerges unless adversarial or role-balanced reward signals are applied.
Scalability: Increasing agent pool size boosts performance, with best gains in collaborative or debate protocols. Agent heterogeneity (e.g., Qwen + Llama) amplifies gains, consistent with theoretical expectations that diverse policies yield richer mutual correction and exploration (Xue et al., 9 Oct 2025).
Emergent Specialization: In Markovian environments, participation-based CoMAS schemes lead to emergent division of labor and endogenous role specialization (Kölle et al., 2023).

6. Theoretical Insights and Limitations

CoMAS does not guarantee formal Nash equilibrium convergence in the general multi-agent LLM setting. However, several mechanisms contribute to empirical robustness:

Cooperative components (such as a regulatory Judge) break adversarial cycles.
Difficulty-based rewards implement a “curriculum learning” dynamic, stabilizing training progress.
Task-relative reward normalization and synchronized updates reduce gradient variance and enhance learning efficiency (Chen et al., 27 Oct 2025).

Limitations include dependence on prompt engineering and format parsing for reward extraction, susceptibility to collusion in unregulated scoring scenarios, and empirical focus on text-based or small-agent-pool domains. Broader real-world generalization and adversarial resilience remain active research questions (Xue et al., 9 Oct 2025).

7. Extensions and Ongoing Directions

Continuing avenues include:

Broadening CoMAS to open-ended tasks (planning, multimodal reasoning) and real-world environments.
Automating protocol design (dynamic role assignments, adaptive interaction depths).
Integrating human-in-the-loop feedback and simulated external environments for richer reward structures.
Exploring more expressive intrinsic reward functions (e.g., rewarding novelty, consensus, or diversity within agent collections).
Scaling frameworks to larger agent populations and longer interaction horizons (Xue et al., 9 Oct 2025).

These developments position Co-Evolving Multi-Agent Systems via Interaction Rewards as a promising approach to scalable, data-efficient, and decentralized self-improvement in contemporary AI systems.


References (3)

Xue et al. (2025). CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards.

Chen et al. (2025). Multi-Agent Evolve: LLM Self-Improve through Co-evolution.

Kölle et al. (2023). Learning to Participate through Trading of Reward Shares.
