Multi-Agent Evolve: LLM Self-Improve through Co-evolution
- Multi-Agent Evolve (MAE) is a framework where a Proposer, Solver, and Judge co-evolve to enhance large language model reasoning through reinforcement learning.
- The system employs a closed-loop interaction where the Proposer creates challenging questions, the Solver provides answers, and the Judge evaluates quality, enabling data-efficient improvements.
- MAE demonstrates a 4.54% average boost across benchmarks, offering a scalable and minimal-supervision approach to general reasoning enhancement.
The article "Multi-Agent Evolve: LLM Self-Improve through Co-evolution" introduces a scalable, data-efficient framework that lets LLMs improve their reasoning across diverse tasks, including mathematics, logical reasoning, and general-knowledge question answering, without relying on human-annotated data. The framework trains a triplet of interacting agents, a Proposer, a Solver, and a Judge, through reinforcement learning. These agents co-evolve in a closed-loop system: the Proposer generates questions, the Solver attempts solutions, and the Judge evaluates both, giving LLMs a scalable path to self-evolution. Experiments show an average improvement of 4.54% across multiple benchmarks, establishing MAE as an effective method for general reasoning improvement with minimal reliance on human supervision.
1. Motivation and Framework Overview
Motivation
Traditional reinforcement learning (RL) methods applied to LLMs typically require human-curated datasets and explicit, verifiable rewards, making them challenging to scale across general domains. The paper proposes overcoming these limitations through Multi-Agent Evolve (MAE), which allows LLMs to self-improve in solving tasks spanning mathematics, reasoning, and knowledge question-answering without human annotation or ground-truth environments.
Framework Overview
The MAE framework consists of three agents:
- Proposer: Generates challenging questions intended to probe and extend the Solver's capabilities.
- Solver: Attempts to solve the questions generated by the Proposer.
- Judge: Evaluates the quality and correctness of both the questions and answers, providing reward signals.
All agents share a backbone LLM and optimize their behaviors through reinforcement learning in a co-evolutionary setup. This configuration supports a data-efficient, scalable evolution of reasoning abilities.
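The sketch below illustrates one iteration of this closed loop in Python. The helper callables (propose_fn, solve_fn, judge_question_fn, judge_answer_fn) are assumed wrappers around the single shared backbone LLM prompted in each role; the paper's actual prompts, scoring, and update machinery differ in detail.

```python
from typing import Callable, Dict, List

def mae_iteration(
    propose_fn: Callable[[str], str],             # topic -> question (Proposer role)
    solve_fn: Callable[[str], str],               # question -> answer (Solver role)
    judge_question_fn: Callable[[str], float],    # question -> quality score (Judge role)
    judge_answer_fn: Callable[[str, str], float], # (question, answer) -> correctness score
    seed_topics: List[str],
) -> List[Dict]:
    """One closed-loop pass: propose, solve, judge, and collect rewards.

    Illustrative sketch only; not the paper's implementation.
    """
    trajectories = []
    for topic in seed_topics:
        question = propose_fn(topic)                       # Proposer generates a question
        answer = solve_fn(question)                        # Solver attempts a solution
        proposer_reward = judge_question_fn(question)      # Judge grades question quality
        solver_reward = judge_answer_fn(question, answer)  # Judge grades the answer
        trajectories.append({
            "question": question,
            "answer": answer,
            "proposer_reward": proposer_reward,
            "solver_reward": solver_reward,
        })
    # These trajectories feed the RL update (Task-Relative REINFORCE++), which
    # adjusts the single shared backbone so all three roles improve together.
    return trajectories
```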
2. Agent Roles and Interactions
Proposer
- Function: Create high-quality questions designed to challenge the Solver.
- Reward Structure: The reward combines question quality, difficulty, and format compliance (see the original paper for the exact formula).
- Difficulty Reward: Measured by how much the Solver struggles with the question, encouraging questions that are challenging yet still solvable (a hedged sketch of one such formulation follows this list).
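One way to instantiate a struggle-based difficulty reward, shown purely as an illustration (the paper's exact formula may differ), is to reward questions most when the Solver succeeds only some of the time, so that trivially easy and effectively unsolvable questions earn nothing.

```python
from typing import List

def difficulty_reward(solver_scores: List[float]) -> float:
    """Illustrative difficulty reward based on the Solver's struggle.

    solver_scores: Judge-assigned correctness scores (in [0, 1]) for several
    Solver attempts at the same question. This is an assumed formulation,
    not necessarily the paper's: reward peaks when the Solver succeeds only
    part of the time and vanishes for trivial or unsolvable questions.
    """
    if not solver_scores:
        return 0.0
    success_rate = sum(solver_scores) / len(solver_scores)
    if success_rate == 0.0 or success_rate == 1.0:
        return 0.0               # no signal from impossible or trivial questions
    return 1.0 - success_rate    # harder (lower success) -> larger reward
```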
Solver
- Function: Attempt to solve questions and provide solutions.
- Reward Structure: Evaluated on the correctness and format of the answers provided, as scored by the Judge (see the original paper for the exact formula).
Judge
- Function: Provide evaluative feedback on both questions and answers, using rubric-driven LLM-based reward signals.
- Rubrics: Define criteria for question clarity and answer correctness, promoting consistent grading (a sketch of rubric-based scoring appears below).
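The following sketch shows what rubric-driven, LLM-based grading can look like. The rubric text, scoring scale, and llm_complete helper are assumptions for illustration, not the paper's actual rubrics or interface.

```python
ANSWER_RUBRIC = """Score the answer to the question on a 0-10 scale.
Consider factual correctness, completeness, and adherence to the requested format.
Respond with a single integer."""

def judge_answer(llm_complete, question: str, answer: str) -> float:
    """Illustrative rubric-based Judge call; returns a reward in [0, 1].

    `llm_complete` is an assumed text-in/text-out wrapper around the shared
    backbone LLM prompted in the Judge role.
    """
    prompt = f"{ANSWER_RUBRIC}\n\nQuestion:\n{question}\n\nAnswer:\n{answer}\n\nScore:"
    reply = llm_complete(prompt)
    try:
        score = int(reply.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0                            # unparseable Judge output earns no reward
    return max(0, min(10, score)) / 10.0      # clamp and normalize to [0, 1]
```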
3. Reinforcement Learning Methodology
Task-Relative REINFORCE++
MAE implements a task-relative reinforcement algorithm, Task-Relative REINFORCE++, which involves:
- Role-specific baselines for variance reduction: each sample's advantage is computed relative to a baseline estimated from that agent role's own rewards (a sketch follows this list).
- Synchronous updates across shared LLM parameters of all agent roles, encouraging stable, synchronized evolution.
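A minimal sketch of the role-specific baseline idea, assuming a simple per-role mean-reward baseline; Task-Relative REINFORCE++ as described in the paper may group rewards more finely (e.g., per task), so treat this only as the general spirit of the method.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def role_relative_advantages(
    samples: List[Tuple[str, float]],   # (role, reward) pairs, e.g. ("solver", 0.7)
) -> List[float]:
    """Illustrative role-specific baseline: subtract each role's mean reward.

    This reduces gradient variance without a learned critic; the paper's exact
    estimator may differ.
    """
    # Compute a baseline (mean reward) per role.
    rewards_by_role: Dict[str, List[float]] = defaultdict(list)
    for role, reward in samples:
        rewards_by_role[role].append(reward)
    baselines = {role: sum(rs) / len(rs) for role, rs in rewards_by_role.items()}

    # Advantage = reward minus that role's baseline.
    return [reward - baselines[role] for role, reward in samples]
```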
Optimization Strategy
The reinforcement learning setup includes methods such as quality filtering and format rewards to maintain stability and maximize improvement across iterations.
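The snippet below sketches plausible forms of these two mechanisms; the tag names and quality threshold are assumptions, not the paper's settings.

```python
import re
from typing import List, Tuple

def format_reward(response: str) -> float:
    """Illustrative format reward: full credit only if the response uses the
    expected tags (the tag names here are assumptions, not the paper's)."""
    has_thinking = bool(re.search(r"<think>.*?</think>", response, re.DOTALL))
    has_answer = bool(re.search(r"<answer>.*?</answer>", response, re.DOTALL))
    return 1.0 if (has_thinking and has_answer) else 0.0

def quality_filter(
    scored_questions: List[Tuple[str, float]], min_quality: float = 0.5
) -> List[Tuple[str, float]]:
    """Illustrative quality filter: drop proposed questions whose Judge score
    falls below a threshold before they enter the RL batch (threshold assumed),
    so low-quality questions cannot degrade the training signal."""
    return [(q, s) for q, s in scored_questions if s >= min_quality]
```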
4. Self-Evolution and Co-evolution Dynamics
MAE's unique feature is the co-evolution of the three agents:
- Proposer improves question difficulty based on the Solver's performance.
- Solver adapts to increasingly complex questions from the Proposer.
- Judge applies and maintains rubrics that enforce quality, keeping the reward signal meaningful as the other two agents improve.
This setup fosters an auto-curriculum learning environment, where both problem complexity and agent capability scale naturally, without external supervision.
5. Experimental Results and Benchmark Performance
Benchmarks
MAE is evaluated using the Qwen2.5-3B-Instruct LLM across varied benchmarks, including:
- Mathematics: GSM8K, MATH, AMC
- Reasoning/Logic: ARC, MMLU
- QA/Commonsense: SQuAD, TriviaQA
- Code: MBPP, HumanEval
Quantitative Results
- Overall Improvement: Demonstrated a 4.54% increase in average accuracy compared to baseline models.
- Domain Generality: Excelled not only in coding tasks but also in general reasoning and QA tasks.
- Ablation Studies: Showed that disabling any agent resulted in degraded overall performance, validating the integral nature of the triadic system.
6. Reward Mechanisms and Stability Considerations
Self-Rewarding System
- Domain-Agnostic: Replaces traditional external rewards with LLM-driven, rubric-based assessments, minimizing dependency on ground-truth datasets.
- Stability Features: Quality filtering and format rewards prevent data degradation, maintain consistency, and ensure robust evolution.
Optimization Details
MAE employs the AdamW optimizer with learning-rate and batch-size settings, detailed in the paper, chosen so that long self-improvement training runs proceed without reward drift or collapse.
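As a rough illustration using PyTorch's AdamW, with placeholder hyperparameters rather than the values reported in the paper:

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    """Illustrative AdamW setup for the shared backbone; the learning rate,
    weight decay, and betas below are placeholder values, not the paper's
    reported settings."""
    return torch.optim.AdamW(
        model.parameters(),
        lr=1e-6,            # placeholder: small LR typical of RL fine-tuning
        weight_decay=0.01,  # placeholder
        betas=(0.9, 0.95),  # placeholder
    )
```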
7. Broader Implications for LLM Training
Scalability and Efficiency
- MAE offers data-efficient scalability by removing the dependency on costly human-curated datasets, letting training scale with compute rather than with annotation effort.
- Supports general-domain applications beyond game and coding environments with built-in verifiable rewards, indicating potential for broad use in LLM self-evolution.
Flexibility and Generalization
- Utilizes a single backbone LLM for all roles, fostering flexible adaptation across diverse tasks without role-specific infrastructure.
- Preserves diversity across the three roles, helping prevent collapse into degenerate single-agent behavior and maintaining long-term training stability.
MAE provides a template for scalable, efficient LLM evolution, paving the way toward advanced self-improving AI systems capable of achieving and surpassing human-level reasoning autonomously.
References: See the original paper for formulas, detailed algorithm strategies, Task-Relative REINFORCE++ specifics, and comprehensive results analysis.