Co-Evolving Multi-Agent Systems via Interaction Rewards
- The paper's main contribution is demonstrating that agents can self-evolve using intrinsic rewards from inter-agent interactions, removing the need for external labels.
- CoMAS employs methodologies like multi-stage workflows, peer evaluation, and reward-sharing in both LLM and Markov game frameworks to boost collaborative performance.
- Empirical results reveal performance gains (e.g., a 4.54% improvement on challenging benchmarks) and enhanced stability over extended reinforcement learning periods.
Co-Evolving Multi-Agent Systems via Interaction Rewards (CoMAS) refers to a class of frameworks in which multiple autonomous agents iteratively improve their policies using structured, intrinsic rewards derived exclusively from inter-agent interactions, without external labels or human-curated reward functions. In both language-task domains and broader Markovian environments, the CoMAS paradigm enables agents to advance their reasoning, collaboration, and task performance through self-generated feedback, mutual evaluation, and decentralized reinforcement learning. Central instantiations include frameworks in which LLMs improve through discussion and general multi-agent Markov games with learned incentive structures (Xue et al., 9 Oct 2025, Chen et al., 27 Oct 2025, Kölle et al., 2023).
1. Foundational Principles of Co-Evolving Multi-Agent Systems
CoMAS frameworks rest on the hypothesis that agents can keep improving when the reward landscape is shaped by the information they exchange with one another. In text-based reasoning domains, these interactions typically take the form of agents proposing solutions, evaluating peers' responses, and negotiating a consensus or a ranking, with judgments expressed as agent-generated scores. In Markov games and grid-worlds, CoMAS is instantiated by having agents directly internalize each other's rewards via participation mechanisms or share trading.
Formally, each agent $i$ is parameterized by a policy $\pi_{\theta_i}$, which is updated on the basis of rewards derived from discussion, critique, and mutual assessment rather than from environment-defined extrinsic goals or externally validated datasets (Xue et al., 9 Oct 2025, Chen et al., 27 Oct 2025, Kölle et al., 2023).
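As a hedged illustration, the per-agent objective can be written in generic notation. The symbols $\theta_i$, $N$, $T$, and $r_i^{\mathrm{int}}$ are assumptions introduced here for readability and are not taken from the cited papers:

```latex
J(\theta_i) = \mathbb{E}_{\tau \sim (\pi_{\theta_1}, \dots, \pi_{\theta_N})}
  \left[ \sum_{t=1}^{T} r_i^{\mathrm{int}}(\tau_t) \right]
```

Here $\tau$ denotes a joint interaction trajectory of all $N$ agents, and $r_i^{\mathrm{int}}$ is an intrinsic reward computed only from the agents' own outputs and mutual assessments.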
2. CoMAS Architectures and Protocols
Two representative architectures have emerged:
- Discussion-based LLM CoMAS: Agents engage in a multi-stage workflow comprising solution generation, peer evaluation, and scoring. Each step involves rolling, agent-sampled rounds, with a shared or agent-specific LLM acting as both problem solver and judge. Generated outputs, evaluations, and metaevaluations are parsed for reward computation (Xue et al., 9 Oct 2025, Chen et al., 27 Oct 2025).
- Reward-Participation CoMAS: In environments formalized as general-sum Markov games, each agent’s reward at each timestep is defined as a linear combination (via a share matrix $S$) of the raw rewards of all agents. Participation rights, that is, tradable shares of others’ returns, are learned and adjusted as part of each agent’s policy (Kölle et al., 2023).
The following table summarizes salient aspects:
| CoMAS Variant | Domain | Interaction Mechanism |
|---|---|---|
| Discussion/Evaluation (LLM) | Text synthesis | Peer evaluation + zero-sum rewards |
| Reward Participation (Markov games) | Environment RL | Reward share trading |
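The discussion-based workflow described above (solution generation, peer evaluation, scoring) can be sketched as a single interaction round. This is a minimal sketch under stated assumptions, not the cited authors' implementation: the `Agent` callable interface, the `SOLVE`/`EVALUATE`/`SCORE` prompt prefixes, the `SCORE: n` reply format with `n` in 1 to 3, and the choice of three distinct agents per round are all hypothetical.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical agent interface: any callable from prompt text to reply text.
Agent = Callable[[str], str]

# Assumed reply format for the scoring step; real prompt formats may differ.
SCORE_PATTERN = re.compile(r"SCORE:\s*([1-3])")


@dataclass
class Interaction:
    solver: int              # index of the agent that proposed the solution
    evaluator: int           # index of the agent that critiqued it
    scorer: int              # index of the agent that issued the score
    solution: str
    evaluation: str
    score: Optional[int]     # None when the score could not be parsed


def parse_score(text: str) -> Optional[int]:
    """Extract the integer score from a scorer reply, or None on failure."""
    match = SCORE_PATTERN.search(text)
    return int(match.group(1)) if match else None


def run_round(question: str, agents: List[Agent], rng: random.Random) -> Interaction:
    """Run one solve -> evaluate -> score round with three sampled agents."""
    solver, evaluator, scorer = rng.sample(range(len(agents)), 3)
    solution = agents[solver](f"SOLVE\n{question}")
    evaluation = agents[evaluator](f"EVALUATE\n{question}\n{solution}")
    verdict = agents[scorer](f"SCORE\n{question}\n{solution}\n{evaluation}")
    return Interaction(solver, evaluator, scorer, solution, evaluation,
                       parse_score(verdict))


def stub_agent(prompt: str) -> str:
    """Stand-in for an LLM call; replies depend only on the prompt prefix."""
    if prompt.startswith("SOLVE"):
        return "x = 2"
    if prompt.startswith("EVALUATE"):
        return "The derivation is correct."
    return "SCORE: 3"


result = run_round("Solve 2x = 4.", [stub_agent] * 3, random.Random(0))
print(result.score)  # prints 3
```

Returning `None` for an unparseable score keeps format failures explicit, so a caller can decide whether to penalize or discard such rounds.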
3. Intrinsic Reward Formulation and Role-Specific Incentives
CoMAS frameworks define intrinsic rewards through explicit assessment of agent-to-agent interactions:
- LLM-based: For each solution and its evaluation, the score assigned by the scoring agent is parsed from the generated text and linearly normalized to form a zero-sum coupling between the solver's and the evaluator's rewards.
This ties solver and evaluator incentives in a strictly adversarial manner, driving truthful critique and robust solution finding (Xue et al., 9 Oct 2025).
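- Participation-based: For agent $i$, the adjusted reward is $\tilde{r}_i = \sum_j S_{ij}\, r_j$, where $r_j$ is the raw reward of agent $j$ and $S_{ij}$ is the share of that reward assigned to agent $i$.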
Reward-distribution shares are themselves learned (via actor-critic or analytic-gradient methods) so agents internalize group-level externalities (Kölle et al., 2023).
- Multi-role LLM: In MAE, three agents (Proposer, Solver, Judge) are instantiated from a single LLM. Role-specific returns integrate answer correctness, question difficulty (1 minus average solver success), and regulatory format compliance.
This multifaceted scheme allows both competitive and cooperative pressures, increasing the generative robustness and quality of all agent policies (Chen et al., 27 Oct 2025).
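The three reward formulations above can be illustrated with a short sketch. All function names are hypothetical. The score range of 1 to 3 and the mapping to `[-1, 1]` are assumptions chosen so that the solver and evaluator rewards sum to zero; the cited papers may normalize differently. The participation reward and the difficulty term follow the descriptions in the text.

```python
from typing import List, Sequence, Tuple


def zero_sum_rewards(score: float, lo: float = 1.0, hi: float = 3.0) -> Tuple[float, float]:
    """Map a parsed score in [lo, hi] to (solver_reward, evaluator_reward).

    Assumption: the score is rescaled linearly to [-1, 1] and the evaluator
    receives the negation, so the two rewards always sum to zero.
    """
    if not lo <= score <= hi:
        raise ValueError(f"score {score} outside [{lo}, {hi}]")
    solver = 2.0 * (score - lo) / (hi - lo) - 1.0
    return solver, -solver


def participation_rewards(shares: Sequence[Sequence[float]],
                          raw: Sequence[float]) -> List[float]:
    """Adjusted reward of agent i: sum over j of shares[i][j] * raw[j]."""
    return [sum(s_ij * r_j for s_ij, r_j in zip(row, raw)) for row in shares]


def question_difficulty(solver_successes: Sequence[bool]) -> float:
    """Difficulty as 1 minus the average solver success rate."""
    return 1.0 - sum(solver_successes) / len(solver_successes)


print(zero_sum_rewards(3))                      # (1.0, -1.0)
print(participation_rewards([[0.75, 0.25],
                             [0.25, 0.75]],
                            [4.0, 0.0]))        # [3.0, 1.0]
print(question_difficulty([True, False, False, False]))  # 0.75
```

In the participation example, agent 1 earns nothing on its own but still receives 1.0 through its 25% share of agent 0's raw reward, which is the mechanism by which agents internalize each other's outcomes.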
4. Reinforcement Learning Optimization and Synchronized Co-Evolution
In all cited frameworks, policy optimization proceeds via RL objectives targeting cumulative intrinsic (interaction-based) reward:
- LLM/Discussion: Actor-specific rollouts are collected and policy gradient updates performed via REINFORCE++ with KL regularization relative to a static reference model, stabilizing learning and mitigating mode collapse. Batch normalization of the advantage and reward clipping are standard (Xue et al., 9 Oct 2025).
- Reward Participation: Policy and participation-share parameters are updated concurrently, either via joint gradient steps (two-agent theory) or as part of an augmented action space (actor-critic). Constraints on the share parameters are imposed via regularization.
- Synchronized Updates (MAE): Role-specific advantages are normalized, and a lock-step sum of all roles’ policy gradients is applied to the shared LLM parameter vector in each iteration. This ensures stable and efficient gradient flow in a multi-agent LLM setting (Chen et al., 27 Oct 2025).
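The optimization ingredients listed above (reward clipping, batch-normalized advantages, a KL penalty toward a reference model, and a lock-step sum of role gradients) can be sketched in plain Python. This is a simplified REINFORCE-style surrogate under stated assumptions, not the REINFORCE++ algorithm itself: the function names, the clipping range, the coefficient `beta`, and the per-sample log-ratio KL estimate are illustrative choices.

```python
import math
from typing import Dict, List, Sequence


def normalize(values: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize a batch of values to zero mean and (near) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (math.sqrt(var) + eps) for v in values]


def surrogate_loss(logp: Sequence[float], logp_ref: Sequence[float],
                   rewards: Sequence[float], beta: float = 0.01,
                   clip_lo: float = -1.0, clip_hi: float = 1.0) -> float:
    """Policy-gradient surrogate plus a KL penalty toward a fixed reference.

    logp / logp_ref are log-probabilities of the sampled outputs under the
    current policy and the static reference model.
    """
    clipped = [max(clip_lo, min(clip_hi, r)) for r in rewards]
    advantages = normalize(clipped)
    n = len(logp)
    pg_term = -sum(a * lp for a, lp in zip(advantages, logp)) / n
    # Monte Carlo estimate of KL(policy || reference) from policy samples.
    kl_term = sum(lp - lr for lp, lr in zip(logp, logp_ref)) / n
    return pg_term + beta * kl_term


def synchronized_step(params: Sequence[float],
                      role_grads: Dict[str, Sequence[float]],
                      lr: float) -> List[float]:
    """Apply the sum of all roles' gradients to shared parameters in one step."""
    total = [sum(g[k] for g in role_grads.values()) for k in range(len(params))]
    return [p - lr * t for p, t in zip(params, total)]


new_params = synchronized_step(
    [1.0, 1.0],
    {"proposer": [1.0, 0.0], "solver": [0.0, 2.0], "judge": [1.0, 1.0]},
    lr=0.5,
)
print(new_params)  # [0.0, -0.5]
```

Summing the role gradients before a single parameter update means no role sees parameters that another role has already modified within the same iteration, which is one way to read the "lock-step" description.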
5. Empirical Evaluation and Scalability
Experiments consistently demonstrate:
- Performance Gains: Across benchmarks in mathematics, coding, science, and general knowledge, CoMAS frameworks yield absolute gains over both untrained LLMs and rule-based or voting-based interaction schemes. For example, MAE achieves an average improvement of 4.54% on challenging benchmarks without human-annotated data (Chen et al., 27 Oct 2025), while General CoMAS attains 1–20% gains depending on interaction protocol, agent count, and evaluation setting (Xue et al., 9 Oct 2025).
- Stability: Training stability is preserved over hundreds of RL fine-tuning steps, with no catastrophic collapse or degenerate behaviors. By contrast, prior self-play LLM methods are reported to collapse beyond 50 steps (Chen et al., 27 Oct 2025).
- Role and Reward Ablations: Removing evaluation, scoring, or format components degrades performance, and reward hacking emerges unless adversarial or role-balanced reward signals are applied.
- Scalability: Increasing agent pool size boosts performance, with best gains in collaborative or debate protocols. Agent heterogeneity (e.g., Qwen + Llama) amplifies gains, consistent with theoretical expectations that diverse policies yield richer mutual correction and exploration (Xue et al., 9 Oct 2025).
- Emergent Specialization: In Markovian environments, participation-based CoMAS schemes lead to emergent division of labor and endogenous role specialization (Kölle et al., 2023).
6. Theoretical Insights and Limitations
CoMAS does not guarantee formal Nash equilibrium convergence in the general multi-agent LLM setting. However, several mechanisms contribute to empirical robustness:
- Cooperative components (such as a regulatory Judge) break adversarial cycles.
- Difficulty-based rewards implement a “curriculum learning” dynamic, stabilizing training progress.
- Task-relative reward normalization and synchronized updates reduce gradient variance and enhance learning efficiency (Chen et al., 27 Oct 2025).
Limitations include dependence on prompt engineering and format parsing for reward extraction, susceptibility to collusion in unregulated scoring scenarios, and empirical focus on text-based or small-agent-pool domains. Broader real-world generalization and adversarial resilience remain active research questions (Xue et al., 9 Oct 2025).
7. Extensions and Ongoing Directions
Continuing avenues include:
- Broadening CoMAS to open-ended tasks (planning, multimodal reasoning) and real-world environments.
- Automating protocol design (dynamic role assignments, adaptive interaction depths).
- Integrating human-in-the-loop feedback and simulated external environments for richer reward structures.
- Exploring more expressive intrinsic reward functions (e.g., rewarding novelty, consensus, or diversity within agent collections).
- Scaling frameworks to larger agent populations and longer interaction horizons (Xue et al., 9 Oct 2025).
These developments position Co-Evolving Multi-Agent Systems via Interaction Rewards as a principal methodology for scalable, data-efficient, and decentralized self-improvement in contemporary AI systems.