Multi-Agent Evolve: LLM Self-Improve through Co-evolution
- Multi-Agent Evolve (MAE) is a framework where a Proposer, Solver, and Judge co-evolve to enhance large language model reasoning through reinforcement learning.
- The system employs a closed-loop interaction where the Proposer creates challenging questions, the Solver provides answers, and the Judge evaluates quality, enabling data-efficient improvements.
- MAE demonstrates a 4.54% average boost across benchmarks, offering a scalable and minimal-supervision approach to general reasoning enhancement.
The article "Multi-Agent Evolve: LLM Self-Improve through Co-evolution" introduces a scalable, data-efficient framework that lets LLMs improve their reasoning across diverse tasks, including mathematics, logical reasoning, and general-knowledge question answering, without relying on human-annotated data. The framework trains a triplet of interacting agents, a Proposer, a Solver, and a Judge, through reinforcement learning. These agents co-evolve in a closed-loop system: the Proposer generates questions, the Solver attempts solutions, and the Judge evaluates both, giving LLMs a scalable path to self-evolution. Experiments show an average improvement of 4.54% across multiple benchmarks, establishing MAE as an effective method for general reasoning improvement with minimal reliance on human supervision.
1. Motivation and Framework Overview
Motivation
Traditional reinforcement learning (RL) methods applied to LLMs typically require human-curated datasets and explicit, verifiable rewards, making them challenging to scale across general domains. The paper proposes overcoming these limitations through Multi-Agent Evolve (MAE), which allows LLMs to self-improve in solving tasks spanning mathematics, reasoning, and knowledge question-answering without human annotation or ground-truth environments.
Framework Overview
The MAE framework consists of three agents:
- Proposer: Generates challenging questions intended to probe and extend the Solver's capabilities.
- Solver: Attempts to solve the questions generated by the Proposer.
- Judge: Evaluates the quality and correctness of both the questions and answers, providing reward signals.
All agents share a backbone LLM and optimize their behaviors through reinforcement learning in a co-evolutionary setup. This configuration supports a data-efficient, scalable evolution of reasoning abilities.
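The sketch below illustrates one iteration of this closed loop in Python. The helper callables (propose_fn, solve_fn, judge_question_fn, judge_answer_fn) are assumed wrappers around the single shared backbone LLM prompted in each role; the paper's actual prompts, scoring, and update machinery differ in detail.

```python
from typing import Callable, Dict, List

def mae_iteration(
    propose_fn: Callable[[str], str],             # topic -> question (Proposer role)
    solve_fn: Callable[[str], str],               # question -> answer (Solver role)
    judge_question_fn: Callable[[str], float],    # question -> quality score (Judge role)
    judge_answer_fn: Callable[[str, str], float], # (question, answer) -> correctness score
    seed_topics: List[str],
) -> List[Dict]:
    """One closed-loop pass: propose, solve, judge, and collect rewards.

    Illustrative sketch only; not the paper's implementation.
    """
    trajectories = []
    for topic in seed_topics:
        question = propose_fn(topic)                       # Proposer generates a question
        answer = solve_fn(question)                        # Solver attempts a solution
        proposer_reward = judge_question_fn(question)      # Judge grades question quality
        solver_reward = judge_answer_fn(question, answer)  # Judge grades the answer
        trajectories.append({
            "question": question,
            "answer": answer,
            "proposer_reward": proposer_reward,
            "solver_reward": solver_reward,
        })
    # These trajectories feed the RL update (Task-Relative REINFORCE++), which
    # adjusts the single shared backbone so all three roles improve together.
    return trajectories
```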
2. Agent Roles and Interactions
Proposer
- Function: Create high-quality questions designed to challenge the Solver.
- Reward Structure: The reward combines question quality, difficulty, and format compliance (see the original paper for the exact formula).
- Difficulty Reward: Measured by how much the Solver struggles with the question, encouraging questions that are challenging yet still solvable (a hedged sketch of one such formulation follows this list).
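One way to instantiate a struggle-based difficulty reward, shown purely as an illustration (the paper's exact formula may differ), is to reward questions most when the Solver succeeds only some of the time, so that trivially easy and effectively unsolvable questions earn nothing.

```python
from typing import List

def difficulty_reward(solver_scores: List[float]) -> float:
    """Illustrative difficulty reward based on the Solver's struggle.

    solver_scores: Judge-assigned correctness scores (in [0, 1]) for several
    Solver attempts at the same question. This is an assumed formulation,
    not necessarily the paper's: reward peaks when the Solver succeeds only
    part of the time and vanishes for trivial or unsolvable questions.
    """
    if not solver_scores:
        return 0.0
    success_rate = sum(solver_scores) / len(solver_scores)
    if success_rate == 0.0 or success_rate == 1.0:
        return 0.0               # no signal from impossible or trivial questions
    return 1.0 - success_rate    # harder (lower success) -> larger reward
```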
Solver
- Function: Attempt to solve questions and provide solutions.
- Reward Structure: Evaluated on the correctness and format of the answers provided, as scored by the Judge (see the original paper for the exact formula).
Judge
- Function: Provide evaluative feedback on both questions and answers, using rubric-driven LLM-based reward signals.
- Rubrics: Define criteria for question clarity and answer correctness, promoting consistent grading (a sketch of rubric-based scoring appears below).
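The following sketch shows what rubric-driven, LLM-based grading can look like. The rubric text, scoring scale, and llm_complete helper are assumptions for illustration, not the paper's actual rubrics or interface.

```python
ANSWER_RUBRIC = """Score the answer to the question on a 0-10 scale.
Consider factual correctness, completeness, and adherence to the requested format.
Respond with a single integer."""

def judge_answer(llm_complete, question: str, answer: str) -> float:
    """Illustrative rubric-based Judge call; returns a reward in [0, 1].

    `llm_complete` is an assumed text-in/text-out wrapper around the shared
    backbone LLM prompted in the Judge role.
    """
    prompt = f"{ANSWER_RUBRIC}\n\nQuestion:\n{question}\n\nAnswer:\n{answer}\n\nScore:"
    reply = llm_complete(prompt)
    try:
        score = int(reply.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0                            # unparseable Judge output earns no reward
    return max(0, min(10, score)) / 10.0      # clamp and normalize to [0, 1]
```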
3. Reinforcement Learning Methodology
Task-Relative REINFORCE++
MAE implements a task-relative reinforcement algorithm, Task-Relative REINFORCE++, which involves:
- Role-specific baselines for variance reduction: each sample's advantage is computed relative to a baseline estimated from that agent role's own rewards (a sketch follows this list).
- Synchronous updates across shared LLM parameters of all agent roles, encouraging stable, synchronized evolution.
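A minimal sketch of the role-specific baseline idea, assuming a simple per-role mean-reward baseline; Task-Relative REINFORCE++ as described in the paper may group rewards more finely (e.g., per task), so treat this only as the general spirit of the method.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def role_relative_advantages(
    samples: List[Tuple[str, float]],   # (role, reward) pairs, e.g. ("solver", 0.7)
) -> List[float]:
    """Illustrative role-specific baseline: subtract each role's mean reward.

    This reduces gradient variance without a learned critic; the paper's exact
    estimator may differ.
    """
    # Compute a baseline (mean reward) per role.
    rewards_by_role: Dict[str, List[float]] = defaultdict(list)
    for role, reward in samples:
        rewards_by_role[role].append(reward)
    baselines = {role: sum(rs) / len(rs) for role, rs in rewards_by_role.items()}

    # Advantage = reward minus that role's baseline.
    return [reward - baselines[role] for role, reward in samples]
```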
Optimization Strategy
The reinforcement learning setup includes methods such as quality filtering and format rewards to maintain stability and maximize improvement across iterations.
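The snippet below sketches plausible forms of these two mechanisms; the tag names and quality threshold are assumptions, not the paper's settings.

```python
import re
from typing import List, Tuple

def format_reward(response: str) -> float:
    """Illustrative format reward: full credit only if the response uses the
    expected tags (the tag names here are assumptions, not the paper's)."""
    has_thinking = bool(re.search(r"<think>.*?</think>", response, re.DOTALL))
    has_answer = bool(re.search(r"<answer>.*?</answer>", response, re.DOTALL))
    return 1.0 if (has_thinking and has_answer) else 0.0

def quality_filter(
    scored_questions: List[Tuple[str, float]], min_quality: float = 0.5
) -> List[Tuple[str, float]]:
    """Illustrative quality filter: drop proposed questions whose Judge score
    falls below a threshold before they enter the RL batch (threshold assumed),
    so low-quality questions cannot degrade the training signal."""
    return [(q, s) for q, s in scored_questions if s >= min_quality]
```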
4. Self-Evolution and Co-evolution Dynamics
MAE's unique feature is the co-evolution of the three agents:
- Proposer improves question difficulty based on the Solver's performance.
- Solver adapts to increasingly complex questions from the Proposer.
- Judge applies and maintains rubrics that enforce quality, keeping the reward signal meaningful as the other two agents improve.
This setup fosters an auto-curriculum learning environment, where both problem complexity and agent capability scale naturally, without external supervision.
5. Experimental Results and Benchmark Performance
Benchmarks
MAE is evaluated using the Qwen2.5-3B-Instruct LLM across varied benchmarks, including:
- Mathematics: GSM8K, MATH, AMC
- Reasoning/Logic: ARC, MMLU
- QA/Commonsense: SQuAD, TriviaQA
- Code: MBPP, HumanEval
Quantitative Results
- Overall Improvement: Demonstrated a 4.54% increase in average accuracy compared to baseline models.
- Domain Generality: Excelled not only in coding tasks but also in general reasoning and QA tasks.
- Ablation Studies: Showed that disabling any agent resulted in degraded overall performance, validating the integral nature of the triadic system.
6. Reward Mechanisms and Stability Considerations
Self-Rewarding System
- Domain-Agnostic: Replaces traditional external rewards with LLM-driven, rubric-based assessments, minimizing dependency on ground-truth datasets.
- Stability Features: Quality filtering and format rewards prevent data degradation, maintain consistency, and ensure robust evolution.
Optimization Details
MAE employs the AdamW optimizer with learning-rate and batch-size settings, detailed in the paper, chosen so that long self-improvement training runs proceed without reward drift or collapse.
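As a rough illustration using PyTorch's AdamW, with placeholder hyperparameters rather than the values reported in the paper:

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    """Illustrative AdamW setup for the shared backbone; the learning rate,
    weight decay, and betas below are placeholder values, not the paper's
    reported settings."""
    return torch.optim.AdamW(
        model.parameters(),
        lr=1e-6,            # placeholder: small LR typical of RL fine-tuning
        weight_decay=0.01,  # placeholder
        betas=(0.9, 0.95),  # placeholder
    )
```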
7. Broader Implications for LLM Training
Scalability and Efficiency
- MAE offers data-efficient scalability by removing the dependency on costly human-curated datasets, letting training scale with compute rather than with annotation effort.
- Supports general-domain applications beyond game and coding environments with built-in verifiable rewards, indicating potential for broad use in LLM self-evolution.
Flexibility and Generalization
- Utilizes a single backbone LLM for all roles, fostering flexible adaptation across diverse tasks without role-specific infrastructure.
- Preserves diversity across the three roles, helping prevent collapse into degenerate single-agent behavior and maintaining long-term training stability.
MAE provides a template for scalable, efficient LLM evolution, paving the way toward advanced self-improving AI systems capable of achieving and surpassing human-level reasoning autonomously.
References: See the original paper for formulas, detailed algorithm strategies, Task-Relative REINFORCE++ specifics, and comprehensive results analysis.