SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents

Published 9 Apr 2026 in cs.AI and cs.LG | (2604.07791v2)

Abstract: Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have demonstrated significant potential in single-turn reasoning tasks. With the paradigm shift toward self-evolving agentic learning, models are increasingly expected to learn from trajectories by synthesizing tools or accumulating explicit experiences. However, prevailing methods typically rely on large-scale LLMs or multi-agent frameworks, which hinder their deployment in resource-constrained environments. The inherent sparsity of outcome-based rewards also poses a substantial challenge, as agents typically receive feedback only upon completion of tasks. To address these limitations, we introduce a Tool-Memory based self-evolving agentic framework SEARL. Unlike approaches that directly utilize interaction experiences, our method constructs a structured experience memory that integrates planning with execution. This provides a novel state abstraction that facilitates generalization across analogous contexts, such as tool reuse. Consequently, agents extract explicit knowledge from historical data while leveraging inter-trajectory correlations to densify reward signals. We evaluate our framework on knowledge reasoning and mathematics tasks, demonstrating its effectiveness in achieving more practical and efficient learning.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper introduces a joint optimization framework that refines agent policies and a structured tool graph memory to support compositional reasoning.
A hybrid advantage estimation mechanism combines episode-level and tool-anchored step-level rewards to improve credit assignment in sparse-reward environments.
Empirical evaluations on multi-hop QA and mathematical benchmarks demonstrate SEARL's modularity, efficient transfer, and superior performance.

SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents

Motivation and Problem Statement

Contemporary RL-based agent frameworks for LLMs are limited by sparse reward signals, static predefined tool sets, and unstructured external memory systems. Monolithic tool synthesis frequently results in combinatorial overwhelm and lack of compositional generalization, especially when leveraging small-to-midsize LLMs in resource-limited deployments. Additionally, current memory augmentation practices do not support explicit inter-tool dependencies, which constrains long-term knowledge accumulation and multi-step planning.

Figure 1: Existing monolithic paradigms overwhelm LLMs and lack the structural memory connectivity required for compositional reasoning.

SEARL Framework Overview

The SEARL framework addresses these deficiencies by jointly optimizing both the agent policy and a structured external Tool Graph Memory. The architecture leverages: (1) task decomposition and trajectory encoding based on agent-environment interaction, (2) tool-anchored step grouping for fine-grained advantage estimation, and (3) a dynamic, self-evolving graph-structured memory that encodes tool semantics and operational dependencies.

Figure 2: SEARL architecture integrates structured agent-environment interaction, anchor-action step grouping, and a dynamic tool graph memory for continual agent self-evolution.

Methodology

Tool Graph Memory and Task Decomposition

SEARL formalizes external memory as a directed Tool Graph $\mathcal{T}_G = (V,E)$ , where nodes denote MCP tools and edges represent inter-tool dependency relations derived from execution order within generated plans. For each new task, the agent generates a DAG-like subplan structure, which is mapped to a corresponding subgraph in the tool memory. Modular tools are encouraged, improving reusability and facilitating decomposition strategies that mitigate the burdens of monolithic reasoning.

Hybrid Advantage Estimation

Credit assignment in sparse-reward RL environments is improved using a two-level advantage structure:

Episode-Level Relative Advantage: Aggregates cumulative group returns across trajectory rollouts under the same task, offering a stabilized global reward signal.
Tool-Anchored Step-Level Advantage: For each trajectory, steps are grouped by the tool-invocation anchor, allowing for process-level, tool-specific reward assignment that abstracts away environment state sparsity.

This dual estimation provides both coarse and fine-grained optimization, where tool-anchored groupings are essential for isolating reusable reasoning primitives—an observation confirmed by ablation studies.

Memory Consolidation and Evolution

Tool creation is decoupled from immediate registration to avoid redundancy. After each rollout, only tools w.r.t. highest-rewarded trajectories are preserved and merged into the global memory. Tool equivalence is determined via semantic embedding and cosine similarity, enabling robust merging and effective accumulation of domain experience. Edge integration maintains the causal precedence of tool invocation, resulting in ever-expanding, richly structured agent memory.

Empirical Evaluation

Extensive experiments on long-horizon reasoning tasks—including AIME24, MATH500, GSM8K for mathematics, and HotpotQA, 2WikiMultihopQA, Musique, Bamboogle, and WebWalker for knowledge reasoning—demonstrate the efficiency and generalization capabilities of SEARL-trained agents.

On multi-hop QA, SEARL surpasses all baselines, especially in evidence integration and compositional question answering. Baselines relying on transient context or unstructured memories perform poorly under retrieval noise, while SEARL's tool graph delivers persistent, fine-grained retrieval and reduces hallucination rates.
For challenging mathematical benchmarks, while a minor drop is observed on simple tasks (tool creation overhead), SEARL excels on highly compositional benchmarks (e.g., AIME24), matching or exceeding ARPO and significantly outperforming static-policy baselines.
Figure 3: SEARL yields consistently higher training rewards and maintains greater entropy (exploration) than standard GRPO optimization.

Ablation analysis further demonstrates that step-level grouping is critical: removing this mechanism causes marked degradation, particularly in structurally complex datasets.

Figure 4: The ablation removing step-level grouping causes the sharpest accuracy decline, highlighting its importance for advantage estimation.

Tool Graph Efficiency and Dynamical Evolution

In compositional reasoning scenarios (e.g., sequence optimization), SEARL prevents brute-force search explosion by constructing modular tool chains, hierarchically enforcing constraints, and maximizing reusability.

Figure 5: Modular tool chains induced by SEARL prune search spaces and outperform monolithic, brute-force strategies in constrained optimization.

The learned tool graph evolves from fragmented, disjoint structures to densely connected, multi-branch topologies, reflecting transfer and cross-domain synthesis across training.

Figure 6: Tool memory evolves from single-path reuse to complex, multi-branch convergence, indicating accumulation of diverse problem-solving experience.

Theoretical and Practical Implications

SEARL formalizes a new paradigm for agent optimization where external knowledge is not a static resource but a continually adapted, topologically structured memory that is intimately linked with policy optimization. Tight coupling between policy learning and structured memory fosters adaptive behavior, efficient transfer, and domain generalization. Process-level reward shaping embedded in the group-based estimation framework provides robust gradients even in environments with sparse outcome rewards.

Practically, this joint optimization unlocks efficient long-horizon reasoning capabilities in small LLMs with minimal deployment overhead, making high-quality agentic capabilities accessible for real-world, resource-constrained environments. The compositional Tool Graph can serve as a persistent skill repository for transfer and meta-learning scenarios.

Theoretically, the structured abstraction and the compositional credit assignment mitigate well-known RL pathologies such as reward sparsity, credit diffusion, and catastrophic forgetting.

Future Prospects

Further work may extend the framework with richer syntactic and semantic tool signatures, cross-agent tool sharing, and hierarchical meta-controller policies governing tool synthesis. This may open avenues toward foundation models capable of structure-aware, persistent, and open-ended learning and reasoning.

Conclusion

SEARL establishes a robust self-evolution paradigm for tool-using agents, wherein policy and graph-structured tool memory co-evolve. The system achieves efficient long-range credit assignment and robust generalization. Empirical results confirm significant improvements in complex multi-hop and mathematical reasoning, especially where compositional reuse and persistent skill accumulation are essential. The approach provides a modular, pluggable framework for continual agent improvement and represents a significant advancement toward structure-aware reinforcement learning in agentic AI (2604.07791).