
Multi-Agent Evolve: LLM Self-Improve through Co-evolution

Updated 29 October 2025
  • Multi-Agent Evolve (MAE) is a framework where a Proposer, Solver, and Judge co-evolve to enhance large language model reasoning through reinforcement learning.
  • The system employs a closed-loop interaction where the Proposer creates challenging questions, the Solver provides answers, and the Judge evaluates quality, enabling data-efficient improvements.
  • MAE demonstrates a 4.54% average boost across benchmarks, offering a scalable and minimal-supervision approach to general reasoning enhancement.

The article "Multi-Agent Evolve: LLM Self-Improve through Co-evolution" introduces a scalable and data-efficient framework for enabling LLMs to improve their reasoning capabilities across diverse tasks, such as mathematics, reasoning, and general knowledge question-answering, without dependency on human-annotated data. This approach leverages a novel framework that uses a triplet of interacting agents—Proposer, Solver, and Judge—to optimize behavior through reinforcement learning. These agents co-evolve in a closed-loop system, where the Proposer generates questions, the Solver attempts solutions, and the Judge evaluates both, introducing a scalable methodology for LLMs to self-evolve. Experiments demonstrated an average improvement of 4.54% across multiple benchmarks, highlighting MAE as an effective method for general reasoning improvement with minimal reliance on human supervision.

1. Motivation and Framework Overview

Motivation

Traditional reinforcement learning (RL) methods applied to LLMs typically require human-curated datasets and explicit, verifiable rewards, making them challenging to scale across general domains. The paper proposes overcoming these limitations through Multi-Agent Evolve (MAE), which allows LLMs to self-improve in solving tasks spanning mathematics, reasoning, and knowledge question-answering without human annotation or ground-truth environments.

Framework Overview

The MAE framework consists of three agents:

  • Proposer: Generates challenging questions to stimulate the Solver's capabilities.
  • Solver: Attempts to solve the questions generated by the Proposer.
  • Judge: Evaluates the quality and correctness of both the questions and answers, providing reward signals.

All agents share a backbone LLM and optimize their behaviors through reinforcement learning in a co-evolutionary setup. This configuration supports a data-efficient, scalable evolution of reasoning abilities.
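
To make the interaction concrete, here is a minimal sketch of one Proposer-Solver-Judge iteration. The `llm` helper, the prompts, and the score parsing are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    role: str      # "proposer" or "solver"
    prompt: str
    response: str
    reward: float  # reward signal derived from the Judge's evaluation

def llm(prompt: str) -> str:
    """Placeholder for the shared backbone LLM; swap in a real model call."""
    return "0.5"

def mae_step(topic: str) -> list[Transition]:
    # 1. Proposer generates a challenging question on the given topic.
    question = llm(f"Propose a hard but solvable question about {topic}.")
    # 2. Solver attempts an answer.
    answer = llm(f"Solve the following question step by step:\n{question}")
    # 3. Judge scores the question and the answer against rubrics (0-1 scale).
    q_score = float(llm(f"Rate this question's clarity and solvability from 0 to 1:\n{question}"))
    a_score = float(llm(f"Rate this answer's correctness from 0 to 1:\nQ: {question}\nA: {answer}"))
    # Both transitions feed the reinforcement learning update of the shared model.
    return [
        Transition("proposer", topic, question, reward=q_score),
        Transition("solver", question, answer, reward=a_score),
    ]
```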

2. Agent Roles and Interactions

Proposer

  • Function: Create high-quality questions designed to challenge the Solver.
  • Reward Structure: The reward combines question quality, difficulty, and format terms, calculated as:

R_P(q) = \lambda_{\text{quality}} R_{\text{quality}} + \lambda_{\text{difficulty}} R_{\text{difficulty}} + \lambda_{\text{format}} R_{\text{format}}

  • Difficulty Reward: Measured by how much the Solver struggles with the question, encouraging challenging yet solvable questions; one plausible instantiation is sketched below.
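
A minimal sketch of how these terms might be combined follows. The lambda weights and the difficulty proxy (one minus the Solver's success rate) are assumptions, not values reported in the paper.

```python
def difficulty_reward(solver_success_rate: float) -> float:
    # Assumed proxy: the more often the Solver fails, the harder the question.
    return 1.0 - solver_success_rate

def proposer_reward(r_quality: float, r_difficulty: float, r_format: float,
                    lam_quality: float = 1.0,
                    lam_difficulty: float = 1.0,
                    lam_format: float = 0.5) -> float:
    # Weighted sum mirroring R_P(q); the lambda values are illustrative placeholders.
    return (lam_quality * r_quality
            + lam_difficulty * r_difficulty
            + lam_format * r_format)
```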

Solver

  • Function: Attempt to solve questions and provide solutions.
  • Reward Structure: Based on the Judge-assessed correctness and the format of the answer, defined as follows (a composition sketch appears after the formula):

R_S(a) = \lambda_{\text{judge}} R_{\text{judge}} + \lambda_{\text{format}} R_{\text{format}}
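
The Solver's reward can be composed analogously; again, the weights are illustrative assumptions.

```python
def solver_reward(r_judge: float, r_format: float,
                  lam_judge: float = 1.0, lam_format: float = 0.5) -> float:
    # Weighted sum mirroring R_S(a): Judge-assessed correctness plus a format term.
    return lam_judge * r_judge + lam_format * r_format
```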

Judge

  • Function: Provide evaluative feedback on both questions and answers, using rubric-driven LLM-based reward signals.
  • Rubrics: Define criteria for question clarity and answer correctness, supporting consistent grading; an illustrative rubric prompt is sketched below.
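
The sketch below shows one way a rubric-driven Judge prompt and score extraction could look; the rubric wording, the score tags, and the parsing are assumptions for illustration, not the paper's rubric.

```python
import re

ANSWER_RUBRIC = """Score the answer from 0 to 1 using this rubric:
- 1.0: fully correct and clearly justified
- 0.5: partially correct or poorly justified
- 0.0: incorrect or off-topic
Reply with a single number inside <score></score> tags."""

def judge_answer(llm, question: str, answer: str) -> float:
    # Ask the shared backbone LLM (acting as Judge) to grade the answer.
    reply = llm(f"{ANSWER_RUBRIC}\n\nQuestion:\n{question}\n\nAnswer:\n{answer}")
    match = re.search(r"<score>\s*(0(?:\.\d+)?|1(?:\.0+)?)\s*</score>", reply)
    # Fall back to 0.0 if the Judge's reply is malformed.
    return float(match.group(1)) if match else 0.0
```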

3. Reinforcement Learning Methodology

Task-Relative REINFORCE++

MAE implements a task-relative reinforcement learning algorithm, Task-Relative REINFORCE++, which involves:

  • Role-specific baselines for variance reduction, computed as follows (a normalization sketch appears after this list):

A_{\text{role}}^{\text{norm}} = \frac{r - \mu_{\text{role}}}{\sigma_{\text{role}}}

  • Synchronous updates to the shared LLM parameters across all agent roles, encouraging stable co-evolution.
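
Here is a sketch of the role-relative normalization, following the advantage formula above; the epsilon term is added for numerical stability, and the surrounding REINFORCE++ machinery (clipping, policy update) is omitted.

```python
import numpy as np

def role_normalized_advantages(rewards, roles, eps: float = 1e-8):
    """Normalize each reward against the mean/std of its own role's batch."""
    rewards = np.asarray(rewards, dtype=float)
    advantages = np.empty_like(rewards)
    for role in set(roles):
        mask = np.array([r == role for r in roles])
        mu, sigma = rewards[mask].mean(), rewards[mask].std()
        advantages[mask] = (rewards[mask] - mu) / (sigma + eps)  # A^norm per role
    return advantages
```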

Optimization Strategy

The reinforcement learning setup includes methods such as quality filtering and format rewards to maintain stability and maximize improvement across iterations.
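
For example, a format reward can check that a response follows an expected output template, and a quality filter can drop low-scoring proposals before they enter the RL batch. The `<answer>` tag convention and the threshold below are assumptions for illustration.

```python
import re

def format_reward(response: str) -> float:
    # 1.0 if the final answer is wrapped in the expected tags, else 0.0
    # (the <answer> tag convention is assumed, not taken from the paper).
    return 1.0 if re.search(r"<answer>.*?</answer>", response, re.DOTALL) else 0.0

def quality_filter(questions, judge_scores, threshold: float = 0.5):
    # Keep only questions the Judge rates at or above the (assumed) threshold.
    return [q for q, score in zip(questions, judge_scores) if score >= threshold]
```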

4. Self-Evolution and Co-evolution Dynamics

MAE's unique feature is the co-evolution of the three agents:

  • The Proposer raises question difficulty in response to the Solver's performance.
  • The Solver adapts to increasingly complex questions from the Proposer.
  • The Judge applies its rubrics to keep questions and answers high quality, so that model skills improve over time.

This setup fosters an auto-curriculum learning environment, where both problem complexity and agent capability scale naturally, without external supervision.

5. Experimental Results and Benchmark Performance

Benchmarks

MAE is evaluated using the Qwen2.5-3B-Instruct LLM across varied benchmarks, including:

  • Mathematics: GSM8K, MATH, AMC
  • Reasoning/Logic: ARC, MMLU
  • QA/Commonsense: SQuAD, TriviaQA
  • Code: MBPP, HumanEval

Quantitative Results

  • Overall Improvement: Demonstrated a 4.54% increase in average accuracy compared to baseline models.
  • Domain Generality: Excelled not only in coding tasks but also in general reasoning and QA tasks.
  • Ablation Studies: Showed that disabling any agent resulted in degraded overall performance, validating the integral nature of the triadic system.

6. Reward Mechanisms and Stability Considerations

Self-Rewarding System

  • Domain-Agnostic: Replaces traditional external rewards with LLM-driven, rubric-based assessments, minimizing dependency on ground-truth datasets.
  • Stability Features: Quality filtering and format rewards prevent data degradation, maintain consistency, and ensure robust evolution.

Optimization Details

MAE employs the AdamW optimizer with learning-rate and batch-size settings chosen for long self-improvement training runs, avoiding reward drift or collapse.
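
As a rough sketch of such a setup (the actual learning rate, weight decay, and batch size are not reproduced here; the values below are placeholders, not the paper's settings):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the shared backbone LLM's parameters
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-6,            # placeholder learning rate
    weight_decay=0.01,  # placeholder weight decay
)
```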

7. Broader Implications for LLM Training

Scalability and Efficiency

  • MAE offers data-efficient scalability by removing the dependency on costly human-annotated datasets, so training can be expanded with computational resources alone.
  • Supports general-domain applications beyond game or code-execution environments, indicating broad potential for LLM self-evolution.

Flexibility and Generalization

  • Utilizes a single backbone LLM for all roles, fostering flexible adaptation across diverse tasks without role-specific infrastructure.
  • Ensures agent diversity, preventing single-agent collapse, and maintains long-term training stability.

MAE provides a template for scalable, efficient LLM evolution, paving the way toward advanced self-improving AI systems capable of achieving and surpassing human-level reasoning autonomously.


References: See the original paper for formulas, detailed algorithm strategies, Task-Relative REINFORCE++ specifics, and comprehensive results analysis.
