Self-Evolving LLM Agents
- Recent work demonstrates that self-evolving LLM agents autonomously refine their reasoning policies through integrated monitoring, analysis, and feedback cycles.
- Self-evolving LLM agents are frameworks where large language models dynamically adjust control loops, toolsets, and collaboration protocols in response to real-time feedback.
- These systems leverage techniques such as symbolic learning, trajectory optimization, and multi-agent consensus to achieve significant performance gains across diverse tasks.
Self-evolving LLM-based agents are architectural and algorithmic frameworks in which LLMs actively and autonomously adapt their reasoning policies, control loops, toolsets, or collaboration protocols in response to feedback, performance outcomes, and environmental shifts. These systems leverage internal mechanisms—such as symbolic or gradient-based optimization, reflection, trajectory evolution, or multi-agent self-organization—to continuously update both model-internal and system-level parameters with the objective of increasing robustness, adaptability, and task efficacy without direct, manual human intervention.
1. Core Architectures and Control Loops
At the heart of self-evolving LLM agents is the explicit embedding of the LLM within a dynamic control architecture; a prominent example is the adaptation of the MAPE-K loop, as demonstrated in multi-agent system (MAS) integrations (Nascimento et al., 2023). The pipeline is structured as Monitor → Analyze → Plan → Execute, operating over a shared Knowledge base.
LLMs are embedded in the Analyze, Plan, and Knowledge modules, acting as interpreters of high-dimensional state and history. After the Monitor stage gathers inputs, the data is converted to a prompt and dispatched to the LLM, whose response directly influences both policy updates and inter-agent communication. This tight coupling allows for expressive, adaptive, and context-dependent problem solving, particularly in dynamic or competitive settings. The feedback cycle, encompassing message expressiveness, emergent negotiation strategies, and online updating, marks a paradigm shift compared to traditional, fixed-protocol agent systems.
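A minimal sketch of one such control step is shown below, with the LLM driving the Analyze and Plan stages over a shared knowledge store. All names here (including `llm_complete` and the `environment` interface) are illustrative assumptions, not the cited paper's implementation:

```python
# Illustrative MAPE-K-style control step with an LLM embedded in the
# Analyze/Plan/Knowledge stages. llm_complete stands in for any chat-completion
# call; the environment is assumed to expose observe() and act().
from dataclasses import dataclass, field

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to an LLM API (assumption, not a real client)."""
    raise NotImplementedError

@dataclass
class Knowledge:
    history: list[str] = field(default_factory=list)   # past observations and decisions

    def as_context(self) -> str:
        return "\n".join(self.history[-10:])            # bounded context for the prompt

def mape_k_step(environment, kb: Knowledge) -> None:
    obs = environment.observe()                         # Monitor: gather raw state/messages
    prompt = (
        "You are the analysis and planning module of an autonomous agent.\n"
        f"Recent history:\n{kb.as_context()}\n"
        f"Current observation: {obs}\n"
        "Propose the next action and any message to send to other agents."
    )
    plan = llm_complete(prompt)                         # Analyze + Plan: LLM interprets state
    environment.act(plan)                               # Execute: apply action / communicate
    kb.history.append(f"obs={obs} plan={plan}")         # Knowledge: update shared memory
```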
2. Methods for Autonomous Adaptation and Self-Reflection
Self-evolving LLM agents implement a range of mechanisms to enable adaptation beyond static fine-tuning:
- Iterative Feedback and Reflection: Agents operate in closed loops with structured evaluation phases, such as iterative output–checker cycles (Liang et al., 1 Sep 2024). After each output, the agent receives feedback from a checker and, if the result is unsatisfactory, iterates with policy or memory adjustments. Reflective mechanisms further allow the agent to aggregate and analyze its performance history, and the distilled lessons are stored in memory for future reference and meta-learning (a schematic version of this loop is sketched after this list).
- Self-Talk and Self-Generated Data: In dialogue systems, agents simulate both user and agent roles, generating dialogue corpora for supervised fine-tuning (self-talk) (Ulmer et al., 10 Jan 2024). Generated interactions are filtered (e.g., via ROUGE-L–based subgoal completion) to select high-quality data for further model refinement.
- Self-Agentic Modification and Code Rewriting: Certain agents are designed to modify their own source code and system logic based on observed benchmark performance and reflection-driven meta-policies (Robeyns et al., 21 Apr 2025). These agents iteratively archive, select, and improve their own implementation, operationalizing agentic self-improvement at the software scaffolding layer.
- Symbolic Learning and Trajectory Optimization: Agents optimize over symbolic “networks” where the weights are not numerical but are prompts, tool call definitions, and pipeline topologies. Language-based analogs of loss and gradient (constructed and applied via prompts) enable holistic self-evolution of symbolic weights—i.e., pipeline-level architecture, prompts, and tool usage—using language as the optimization substrate (Zhou et al., 26 Jun 2024).
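As referenced in the first item above, the output–checker–reflection cycle can be summarized schematically. In the sketch below, `generate`, `check`, and `reflect` are caller-supplied callables (e.g., wrapping LLM and evaluator calls); all names are assumptions rather than any specific paper's API:

```python
# Sketch of an iterative output-checker loop with a reflection memory.
from typing import Callable

def solve_with_reflection(
    task: str,
    generate: Callable[[str, list[str]], str],              # produce an attempt given task + memory
    check: Callable[[str, str], tuple[bool, str]],           # evaluate attempt -> (ok, feedback)
    reflect: Callable[[str, list[tuple[str, str]]], str],    # distill failures into a reusable lesson
    memory: list[str],
    max_iters: int = 5,
) -> str:
    attempts: list[tuple[str, str]] = []
    output = ""
    for _ in range(max_iters):
        output = generate(task, memory)
        ok, feedback = check(task, output)
        if ok:
            return output                                    # checker is satisfied, stop iterating
        attempts.append((output, feedback))
        memory.append(reflect(task, attempts))               # stored lesson informs future attempts
    return output                                            # best effort once the budget is spent
```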
3. Multi-Agent Collaboration Structures and Self-Organization
Advances in self-evolving agents are not limited to the single-agent setting. Several approaches pursue self-evolution through dynamic multi-agent organization:
- Decentralized Self-Evolving Profiles: Frameworks such as MorphAgent (Lu et al., 19 Oct 2024) enable each LLM-based agent within a multi-agent collective to autonomously adapt its “profile”—a vectorized representation of expertise and responsibility—while optimizing for Role Clarity (RCS), Role Differentiation (RDS), and Task-Role Alignment (TRAS). Agents perform Observe–Think–Act cycles, iteratively updating their profiles based on both environmental feedback and inter-agent complementarities (a rough profile-update sketch follows this list).
- Self-Organizing Agent Structures: S-Agents employ a tree-based organization, with a root “leadership agent” coordinating asynchronous leaf agents (Chen et al., 7 Feb 2024). The hourglass agent architecture filters synthesized sensory and communication input through a bottleneck to long- and short-term objectives, and non-obstructive collaboration removes round-based bottlenecks, facilitating robust real-time adaptation.
- Multi-Round Consensus Aggregation: MDTeamGPT (Chen et al., 18 Mar 2025) demonstrates agents in multi-disciplinary medical consultation settings using residualized multi-agent discussion, consensus aggregation, and knowledge base construction (both CorrectKB and ChainKB) for experience distillation and future inference refinement.
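As a rough illustration of the decentralized profile adaptation referenced above, the sketch below scores candidate profiles with placeholder clarity, differentiation, and alignment terms and keeps the best one per Observe–Think–Act cycle. The concrete RCS/RDS/TRAS definitions from MorphAgent are not reproduced here; all functions and metrics are hypothetical stand-ins:

```python
# Hypothetical Observe-Think-Act profile update for a decentralized agent collective.
import numpy as np

def profile_score(profile, peer_profiles, task_embedding):
    """Placeholder objective combining role clarity, differentiation, and task alignment."""
    p = np.abs(profile) / (np.abs(profile).sum() + 1e-9)
    clarity = float((p ** 2).sum())                          # concentration on a few skills
    differentiation = (float(np.mean([np.linalg.norm(profile - q) for q in peer_profiles]))
                       if peer_profiles else 0.0)            # distance from peers' profiles
    alignment = float(profile @ task_embedding /
                      (np.linalg.norm(profile) * np.linalg.norm(task_embedding) + 1e-9))
    return clarity + differentiation + alignment

def observe_think_act(profile, peer_profiles, task_embedding, propose, n_candidates=4):
    """One adaptation cycle: propose candidate profiles (e.g., via an LLM-driven edit),
    score them, and keep the highest-scoring one."""
    best, best_score = profile, profile_score(profile, peer_profiles, task_embedding)
    for _ in range(n_candidates):
        candidate = propose(profile)                         # caller-supplied proposal function
        score = profile_score(candidate, peer_profiles, task_embedding)
        if score > best_score:
            best, best_score = candidate, score
    return best
```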
4. Evolutionary Algorithms and Trajectory-Based Self-Improvement
Trajectory evolution plays a central role in recent self-evolving agents:
- Revision, Recombination, and Refinement: SE-Agent (Lin et al., 4 Aug 2025) formalizes agent evolution as an iterative process in which the agent maintains a pool of pilot trajectories (reasoning chains) that are repeatedly revised (self-critique and flaw elimination), recombined (cross-trajectory integration), and refined (via a reward function that scores trajectory quality). Hybridizing reasoning paths helps escape local optima and promotes discovery of previously unexplored solution spaces.
- Partial Masking and Self-Reflected Trajectories: STeP (Chen et al., 26 May 2025) further refines trajectory-based training by generating self-reflective, error-corrected trajectories (with explicit error marking and correction) and employing loss masking (ignoring tokens associated with errors during fine-tuning) to prevent overfitting to failure modes; a minimal loss-masking sketch follows this list.
- Monte Carlo Tree Search and Group-wise RL: SEEA-R1 (Tian et al., 26 Jun 2025) leverages Tree-GRPO, combining MCTS with group relative policy optimization to assign denser, more informative rewards to intermediate agent actions, enabling more effective multi-step credit assignment and robust reinforcement fine-tuning in embodied, multi-modal environments.
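To make the loss-masking idea concrete, here is a minimal PyTorch-style sketch that excludes tokens inside marked error spans from the fine-tuning loss. Tensor shapes and names are assumptions for illustration, not the STeP implementation:

```python
# Minimal sketch of trajectory loss masking: tokens flagged as part of an error span
# are excluded from the cross-entropy so the model learns corrections, not mistakes.
import torch
import torch.nn.functional as F

def masked_trajectory_loss(logits: torch.Tensor,
                           target_ids: torch.Tensor,
                           error_token_mask: torch.Tensor) -> torch.Tensor:
    """logits: (T, V) per-token vocabulary scores; target_ids: (T,) gold tokens;
    error_token_mask: (T,) bool, True where the token belongs to a marked error span."""
    targets = target_ids.clone()
    targets[error_token_mask] = -100                  # -100 is cross_entropy's ignore_index
    return F.cross_entropy(logits, targets, ignore_index=-100)

# Usage sketch: the error mask comes from the trajectory's explicit error markers.
# loss = masked_trajectory_loss(model_logits, trajectory_tokens, error_mask)
```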
5. Self-Evolving Agents in Strategic, Embodied, and Real-World Domains
The self-evolving paradigm has been validated across a variety of real-world and simulation-intensive settings:
- Strategic Planning in Multi-Agent Games: Complex domains such as diplomacy (Richelieu (Guan et al., 9 Jul 2024)) and Settlers of Catan (Belle et al., 5 Jun 2025) serve as testbeds for agents that autonomously refine long-term strategies through reflective memory modules, sub-goal planning, negotiation, and coordinated prompt/code rewriting via specialized Analyst, Researcher, and Coder roles.
- Web and Software Agents: WebEvolver (Fang et al., 23 Apr 2025) integrates a co-evolving world model that predicts next-step web observations to support lookahead simulation and policy improvement; A Self-Improving Coding Agent (Robeyns et al., 21 Apr 2025) demonstrates measurable gains on SWE Bench Verified by iteratively updating its own code via archive selection and meta-reasoning.
- Proactive, Privacy-Preserving Assistants: The Galaxy framework (Bao et al., 6 Aug 2025) unifies cognitive architecture and system-level design by embedding a Cognition Forest, with agents (KoRa and Kernel) that detect, reflect, and proactively instantiate new functionalities and privacy-preserving pipelines, closing the cognitive–system feedback loop.
6. Evaluation, Performance Metrics, and Limitations
Benchmarks and empirical evaluations consistently validate the utility of self-evolving mechanisms:
| Framework/Paper | Benchmark | Reported Improvement / Metric |
|---|---|---|
| MDTeamGPT (Chen et al., 18 Mar 2025) | MedQA, PubMedQA | 90.1% and 83.9% accuracy |
| S-Agents (Chen et al., 7 Feb 2024) | Minecraft | Outperforms chain/fully-connected structures (task completion) |
| SE-Agent (Lin et al., 4 Aug 2025) | SWE-bench Verified | Up to 55% relative improvement |
| STeP (Chen et al., 26 May 2025) | ALFWorld, WebShop | ~10% reward/completion-rate gains |
| WebEvolver (Fang et al., 23 Apr 2025) | Mind2Web-Live, etc. | ~10% performance gain |
| SEEA-R1 (Tian et al., 26 Jun 2025) | ALFWorld | 85.07% (textual), 36.19% (multimodal) |
| EvolveSearch (Zhang et al., 28 May 2025) | 7 MHQA datasets | Avg. +4.7% over SOTA |
| AgentGym/AgentEvol (Xi et al., 6 Jun 2024) | WebShop, ALFWorld, etc. | SOTA-level performance |
Performance improvements are generally established via fine-grained quantitative metrics such as task accuracy, success rates, average reward, or task completion steps. Iterative self-evolution and trajectory-driven learning consistently outperform static imitation or singular reinforcement paradigms, especially in domains featuring complex multi-step reasoning, tool use, and large or dynamic action spaces.
Limiting factors include model capability bottlenecks, context window or memory constraints, non-trivial resource requirements for multi-agent or trajectory-based training, and challenges in aligning symbolic evolution with parametric updates.
7. Future Directions and Implications
Recent developments indicate several major avenues for further exploration:
- Holistic Agent–Environment Co-evolution: Combining world modeling (virtual environment simulation) with real-time policy evolution for robust, low-cost lookahead and imagination-guided planning (Fang et al., 23 Apr 2025).
- Embodied, Multi-Modal, and Tool-Augmented Agents: Scaling symbolic learning, trajectory recombination, and multi-modal reward estimation for agents operating in richly interactive or sensorimotor-rich settings (Tian et al., 26 Jun 2025, Zhou et al., 26 Jun 2024).
- Privacy, Adaptation, and Proactivity: Integration of privacy-preserving protocols, meta-cognitive oversight mechanisms, and proactive behavior generation (Bao et al., 6 Aug 2025).
- Data-centric, Autonomous Skill Acquisition: Reducing hand-crafted prompt/tool engineering in favor of genuinely data-driven evolution, leveraging symbolic network optimization, and reflective or meta-architectural loops (Zhou et al., 26 Jun 2024, Chen et al., 18 Mar 2025).
- Scalable Multi-Agent Specialization and Robustness: Advancing decentralized, profile-adaptive agent collectives capable of flexible, resilient role reallocation and collaborative problem solving (Lu et al., 19 Oct 2024, Chen et al., 7 Feb 2024).
In summary, self-evolving LLM-based agent frameworks represent a substantive conceptual expansion of autonomous system design. They incorporate self-adaptive control, asynchronous collaboration, automatic tool acquisition, symbolic and parametric optimization, and trajectory-level self-refinement—collectively enabling systems to achieve higher generality, adaptability, and robustness in open, complex tasks.