
AgentEvolver: Autonomous Self-Evolving Agents

Updated 14 November 2025
  • AgentEvolver is a self-evolving framework that autonomously generates tasks and refines policies using LLM feedback.
  • It employs self-questioning, experience-guided exploration, and fine-grained reward attribution to enhance efficiency and scalability.
  • Its modular architecture integrates multi-agent systems, hybrid optimization, and evolutionary algorithms for continual, unsupervised improvement.

AgentEvolver refers to a class of self-evolving agent architectures in which LLM-driven or multi-agent systems autonomously generate, select, and incorporate new tasks, workflows, experiences, or policy refinements to incrementally improve across complex and open-ended environments. Distinct from traditional RL pipelines requiring human-constructed datasets and fixed exploration strategies, AgentEvolver systems are characterized by curiosity-driven task generation, autonomous workflow and prompt evolution, experience summarization and reuse, and fine-grained credit assignment—often orchestrated in closed feedback loops involving multiple specialized agents, modular infrastructure, and hybrid optimization mechanisms. These frameworks aim to enhance exploration efficiency, sample utilization, and adaptability, supporting continually improving agentic performance without manual supervision.

1. Foundational Principles and Motivation

The AgentEvolver paradigm targets the limitations of manual data curation, brute-force reinforcement learning, and static workflow design in LLM-based and agentic AI systems (Zhai et al., 13 Nov 2025, Wang et al., 4 Jul 2025, Yuksel et al., 2024). Traditional RL-based agents are subject to high data-construction costs, inefficient exploration due to random trial-and-error sampling, and poor sample utilization, as sparse trajectory-level rewards often fail to assign differentiated credit to individual steps or decisions. Furthermore, in open-ended or unstructured novel environments, the absence of human-crafted tasks or reliable reward functions inhibits effective agent learning and adaptation.

AgentEvolver approaches address these bottlenecks by placing the LLM or multi-agent system at the center of its own learning loop: synthesizing new tasks, guiding exploration by reusing distilled experience, and autonomously attributing credit using natural language feedback from an LLM judge. This decouples agent development from fixed pipelines and enables scalable, cost-effective, and continual improvement.

2. Core Mechanisms: Self-Questioning, Self-Navigating, Self-Attributing

AgentEvolver systems—a name that denotes a specific framework in some works (Zhai et al., 13 Nov 2025) and the broader paradigm elsewhere—typically integrate three synergistic mechanisms:

2.1 Self-Questioning (Curiosity-Driven Task Generation)

  • The agent explores the environment using a high-temperature policy (π_explore) to sample diverse actions and states.
  • Observed trajectories are processed by an LLM to synthesize new, diverse proxy tasks (g ∈ 𝒢), along with reference solutions extracted from successful (or unsuccessful) trajectories.
  • Deduplication and feasibility checks ensure that generated tasks are non-redundant and solvable; hallucinated or infeasible tasks are filtered by reference solution execution.
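The deduplication and feasibility filtering described above can be sketched as follows. The `ProxyTask` fields, the whitespace/case normalization rule, and the `is_feasible` callback are illustrative assumptions, not the framework's actual interface; in practice the feasibility check would replay the reference solution in the environment.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ProxyTask:
    goal: str                # natural-language task description g
    reference_solution: str  # solution extracted from an observed trajectory

def filter_tasks(
    candidates: List[ProxyTask],
    is_feasible: Callable[[ProxyTask], bool],
) -> List[ProxyTask]:
    """Drop duplicate goals, then drop tasks whose reference solution fails."""
    seen, kept = set(), []
    for task in candidates:
        key = " ".join(task.goal.lower().split())  # normalize case/whitespace
        if key in seen:
            continue                               # deduplication
        seen.add(key)
        if is_feasible(task):                      # e.g. replay the solution
            kept.append(task)                      # infeasible tasks filtered out
    return kept
```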

2.2 Self-Navigating (Experience-Guided Exploration)

  • Past rollouts are summarized into natural-language “experiences” that encode success or failure cases and stored in an experience pool (𝒫_exp).
  • For new tasks, top-k relevant experiences are retrieved via embedding similarity and used to guide new rollouts, balancing vanilla exploration and experience-guided rollouts.
  • During policy optimization (e.g., via Group Relative Policy Optimization, GRPO), experience-guided samples receive a selective boost in the clipped loss, propagating learned heuristics efficiently.
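A minimal sketch of the retrieval step above, assuming experiences are stored alongside precomputed embeddings; the pure-Python cosine similarity stands in for whatever embedding model and vector index a real system would use:

```python
import math
from typing import List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(
    query_emb: List[float],
    pool: List[Tuple[str, List[float]]],  # (experience text, its embedding)
    k: int = 3,
) -> List[str]:
    """Return the k stored experiences most similar to the new task."""
    ranked = sorted(pool, key=lambda item: cosine(query_emb, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

The retrieved texts would then be prepended to the rollout prompt to produce experience-guided trajectories, alongside a share of vanilla (unguided) rollouts.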

2.3 Self-Attributing (Differentiated Reward Assignment)

  • An LLM is prompted with the full trajectory and the final outcome to assign token-level or stepwise “GOOD/BAD” labels based on causal contribution to success.
  • These attributions are normalized within-trajectory and fused with outcome-based rewards to yield composite step rewards and advantages.
  • Fine-grained credit assignment roughly doubles sample efficiency and accelerates convergence compared to trajectory-level rewards.
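The fusion of step attributions with the outcome reward might look like the following sketch. The ±1 label encoding, the max-deviation normalization, and the mixing weight `alpha` are illustrative assumptions, since the source does not specify the exact formula:

```python
from typing import List

def composite_step_rewards(
    labels: List[str],      # per-step "GOOD"/"BAD" judgments from the LLM
    outcome_reward: float,  # trajectory-level success signal, e.g. 1.0 or 0.0
    alpha: float = 0.5,     # mixing weight (assumed; not specified in the text)
) -> List[float]:
    """Normalize step attributions within the trajectory, then fuse them
    with the shared outcome reward into per-step composite rewards."""
    raw = [1.0 if lab == "GOOD" else -1.0 for lab in labels]
    mean = sum(raw) / len(raw)
    spread = max(abs(x - mean) for x in raw) or 1.0
    normed = [(x - mean) / spread for x in raw]  # within-trajectory normalization
    return [alpha * n + (1 - alpha) * outcome_reward for n in normed]
```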

3. Architectural Characteristics and Integrated Algorithms

AgentEvolver platforms are built on modular multi-agent or service-oriented infrastructures (Zhai et al., 13 Nov 2025, Yuksel et al., 2024), including:

  • Environment Services: Gym-compatible, Ray-backed APIs for generality across domains.
  • LLM Services: For prompt-based reasoning and self-assessment, typically using state-of-the-art LLMs (e.g., Qwen2.5-7B/14B, Llama 3, Claude, GPT-4o).
  • Experience Pools and Context Managers: For storing, retrieving, and filtering past experiences and in-context cues.
  • Orchestrators: Master agents or scripts that coordinate task synthesis, rollout scheduling, experience updating, and policy optimization.
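One plausible shape for the orchestrator's closed loop, with each service reduced to a callback; all names and signatures here are hypothetical, intended only to show how the components listed above fit together:

```python
from typing import Any, Callable, Dict, List

def evolve(
    propose_tasks: Callable[[], List[Any]],             # self-questioning service
    rollout: Callable[[Any, List[str]], Dict],          # environment + LLM services
    retrieve: Callable[[Any], List[str]],               # experience pool lookup
    summarize: Callable[[Dict], str],                   # experience summarizer
    update_policy: Callable[[List[Dict]], None],        # RL optimizer (e.g. GRPO)
    experience_pool: List[str],
    iterations: int = 3,
) -> None:
    """One closed loop: synthesize tasks, run experience-guided rollouts,
    bank new experiences, and update the policy."""
    for _ in range(iterations):
        batch = []
        for task in propose_tasks():
            hints = retrieve(task)                      # top-k relevant experiences
            traj = rollout(task, hints)
            experience_pool.append(summarize(traj))     # grow the pool
            batch.append(traj)
        update_policy(batch)
```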

Integrated optimization methods span:

  • RL variants: PPO, GRPO (critical for off-policy, high-throughput environments).
  • Evolutionary Algorithms: Applied to workflows (e.g., EvoFlow (Zhang et al., 11 Feb 2025)), agent code (e.g., AgentEvolver in strategic planning tasks (Belle et al., 5 Jun 2025)), or agentic graphs/topologies (e.g., InfiAgent (Yu et al., 26 Sep 2025)).
  • LLM-Driven Feedback Loops: Agents for hypothesis generation, code modification, evaluation, and documentation, all orchestrated in iterative cycles (Yuksel et al., 2024).

4. Mathematical Formalism

The formal learning objective of an AgentEvolver-style system is to maximize expected return over the true (but unknown) target task distribution:

$$J_{\rm target}(\theta) = \mathbb{E}_{g \sim p_{\rm target},\, s_0 \sim p_0}\, V^{\pi_\theta}(s_0, g), \qquad V^{\pi_\theta}(s_0, g) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R_g(s_t, a_t) \,\Big|\, s_0, g, \pi_\theta\Big]$$

Because both $p_{\rm target}$ and $R_g$ are generally unknown, AgentEvolver introduces learned proxy task and reward functions:

$$F_{\rm task}: \mathcal{E} \to \Delta(\mathcal{G}), \qquad F_{\rm reward}: (\mathcal{E}, \mathcal{G}) \to (\mathcal{S} \times \mathcal{A} \to \mathbb{R})$$

so that training is performed on generated proxy tasks and rewards. Experience-guided rollouts and fine-grained reward attributions are formalized as instance weights and token-level rewards, propagated through advantage estimation in the RL objective.
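The group-relative advantage estimation with selective boosting of experience-guided samples can be sketched as follows; the `boost` factor and the rule of boosting only positive advantages of guided rollouts are illustrative assumptions, not the published loss:

```python
import math
from typing import List

def group_relative_advantages(
    returns: List[float],  # one return per rollout in the sampled group
    guided: List[bool],    # True if the rollout was experience-guided
    boost: float = 1.5,    # selective-boosting factor (assumed value)
) -> List[float]:
    """GRPO-style advantages: standardize returns within the group, then
    scale experience-guided samples with positive advantage by `boost`."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / len(returns)
    std = math.sqrt(var) or 1.0        # guard against a degenerate group
    adv = [(r - mean) / std for r in returns]
    return [a * boost if g and a > 0 else a for a, g in zip(adv, guided)]
```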

For workflow and agentic architecture evolution, multi-objective (e.g., cost vs. performance) Pareto optimization and evolutionary search play central roles (Zhang et al., 11 Feb 2025, Wang et al., 4 Jul 2025). Workflow graphs and prompts are evolved via mutation, crossover, and selection, with dominance and Pareto-front maintenance.
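The dominance relation underlying Pareto-front maintenance can be made concrete for the two-objective (maximize performance, minimize cost) case; this is a generic sketch, not the cited systems' implementation:

```python
from typing import List, Tuple

# A candidate workflow is scored as (performance, cost).
Candidate = Tuple[float, float]

def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b when it is no worse on both objectives and strictly
    better on at least one (performance maximized, cost minimized)."""
    perf_a, cost_a = a
    perf_b, cost_b = b
    no_worse = perf_a >= perf_b and cost_a <= cost_b
    strictly_better = perf_a > perf_b or cost_a < cost_b
    return no_worse and strictly_better

def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only workflows not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o != c)]
```

Evolutionary search then mutates and recombines members of the surviving front, re-evaluates them, and repeats the dominance filter each generation.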

5. Empirical Evaluation and Quantitative Benchmarks

AgentEvolver frameworks are evaluated on a range of agentic and tool-augmented benchmarks, including AppWorld, BFCL-v3, HotPotQA, MBPP, MATH, and generalist settings (e.g., AgentGym/AgentEvol across 89 heterogeneous tasks (Xi et al., 2024)). Representative results:

| Model (Avg@8) | Baseline | +Questioning | +Questioning+Navigating | +Questioning+Attributing | AgentEvolver (full) |
|---|---|---|---|---|---|
| Qwen2.5-7B | 15.8 | 36.1 | 39.8 | 41.3 | 45.2 |
| Qwen2.5-14B | 29.8 | 52.3 | 54.1 | 56.4 | 57.6 |

Key findings across benchmarks:

  • Each core mechanism yields complementary gains; self-questioning drives a ~20 pp improvement over vanilla RL (Zhai et al., 13 Nov 2025).
  • Experience-guided and attributed rollouts yield a further 3–5 pp.
  • Convergence is accelerated, with sample efficiency roughly doubled (convergence in 40 vs. 90 RL epochs).
  • Ablations confirm the necessity of experience summarization and token-level attribution for highest performance.

6. Extensions, Strengths, and Limitations

AgentEvolver systems generalize across domains and tasks, as demonstrated by performance on both in-domain and out-of-domain splits (AppWorld, BFCL-v3, generalist AgentGym evaluation (Xi et al., 2024)). They can be combined with curiosity-based curriculum generation, multi-agent workflow evolution, and modular experience architectures (e.g., EvolveR (Wu et al., 17 Oct 2025), InfiAgent (Yu et al., 26 Sep 2025)).

Strengths:

  • Autonomy: Eliminates human-engineered task and reward design.
  • Continual improvement: Frameworks support lifelong learning via closed feedback loops.
  • Efficiency: Better exploration, utilization, and credit assignment yield higher performance at reduced computational cost.

Limitations:

  • Quality and stability of LLM judgments for experience and reward attributions.
  • Potential for accumulating spurious or misleading experiences; requires robust filtering.
  • Scaling overhead for large experience or workflow libraries.

7. Significance and Future Directions

AgentEvolver represents a paradigm shift toward autonomous, LLM-driven agentic self-improvement, unifying curiosity, experience abstraction, navigation, and differentiated credit in a single system. It forms the foundation for future work on challenge-oriented curricula, model scaling, and unification of all stages of self-evolution (task synthesis, navigation, attribution, policy update) in a single continually evolving LLM loop (Zhai et al., 13 Nov 2025, Wu et al., 17 Oct 2025).

Open challenges include curriculum design for safety-critical workflows, experience library compression and retrieval, automatic agent birth/death/specialization in hierarchical graphs, integration with symbolic planners or Monte Carlo tree search, and deeper meta-learning for zero-shot agent selection.

In summary, AgentEvolver provides a comprehensive framework for efficient, scalable, and autonomous LLM-based agent evolution, demonstrating robust gains in practical agentic tasks and establishing a pathway toward fully self-improving, generalist artificial intelligence.
