DeepAgent Framework
- DeepAgent Framework is an end-to-end autonomous agent system that integrates general reasoning, dynamic tool discovery, memory management, and reinforcement learning to solve long-horizon tasks.
- It employs an innovative memory folding mechanism to compress episodic, working, and tool memories, effectively managing context length and mitigating error accumulation.
- The ToolPO reinforcement learning strategy, coupled with dense tool retrieval, ensures stable, efficient tool use that outperforms traditional workflow-based agents in diverse benchmarks.
DeepAgent Framework is an end-to-end autonomous agent architecture designed to integrate general reasoning, robust tool use, memory management, and reinforcement learning into a single, unified process for solving complex, long-horizon tasks. The framework centers on a large reasoning model and introduces several technical innovations: dynamic tool discovery and dense retrieval, an autonomous memory folding mechanism that compresses task history, and the ToolPO reinforcement learning strategy for stable and efficient tool-use learning. DeepAgent establishes a globally coherent agentic reasoning paradigm, consistently outperforming traditional workflow-based and monolithic agent baselines in empirical evaluations across diverse tool-use benchmarks and real-world applications (Li et al., 24 Oct 2025).
1. Unified Agentic Reasoning Process
DeepAgent implements a tightly interleaved reasoning and action cycle. At its core, the large reasoning model (LRM) generates a continuous agentic reasoning stream that alternates between internal thought, tool discovery, tool use, and memory folding. The agent invokes special tokens in its output (e.g., <tool_search>, <tool_call>, <fold_thought>) to initiate specific system operations:
- Internal Thought: The LRM produces free-form language-based reflections that guide the subsequent step in the process.
- Dynamic Tool Discovery: On demand, DeepAgent formulates dense retrieval queries to select among tens of thousands of external tools or APIs. Tool selection is conditioned on the full episodic context and the original user task, using cosine similarity to rank candidate tool documents: $\mathrm{score}(d_i) = \cos\big(E(\mathcal{H}, q, g),\, E(d_i)\big)$, where $\mathcal{H}$ is the full current history, $q$ and $g$ represent the user query and global instruction, respectively, and $E(\cdot)$ denotes the retriever's embedding function.
- Action Execution: After tool selection, the agent's structured output is parsed and executed by the system, and the tool result is compactly integrated back into the agent context.
- Memory Folding: When triggered, an auxiliary LLM compresses the dialogue history $\mathcal{H}$ into a structured tuple $(M_E, M_W, M_T)$ of episodic, working, and tool memory, from which reasoning continues efficiently.
Notably, the process eschews the traditional fixed Reason–Act–Observe loop in favor of a unified context, ensuring that tool use and memory management remain dynamically accessible throughout the agent's entire operation.
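To make the control flow concrete, the following minimal Python sketch illustrates one way such a special-token dispatch loop could be structured. The special tokens mirror those named above; everything else (the generate, search_tools, execute_tool, and fold_memory callables) is a hypothetical stand-in, not DeepAgent's actual interface.

```python
import re
from typing import Callable

# Special tokens as named in the paper; the surrounding plumbing is hypothetical.
SPECIAL = re.compile(r"<(tool_search|tool_call|fold_thought)>(.*?)</\1>", re.S)

def agent_loop(
    generate: Callable[[str], str],      # LRM continuation of the current context
    search_tools: Callable[[str], str],  # dense retrieval over tool documentation
    execute_tool: Callable[[str], str],  # parse and run a structured tool call
    fold_memory: Callable[[str], str],   # auxiliary LLM: history -> (M_E, M_W, M_T)
    task: str,
    max_steps: int = 50,
) -> str:
    context = task  # one unified context: thoughts, tool results, folded memory
    for _ in range(max_steps):
        segment = generate(context)
        context += segment
        match = SPECIAL.search(segment)
        if match is None:                # no special token emitted: final answer
            return context
        kind, payload = match.group(1), match.group(2)
        if kind == "fold_thought":
            context = fold_memory(context)   # compress the history in place
        else:
            handler = search_tools if kind == "tool_search" else execute_tool
            context += f"\n<result>{handler(payload)}</result>\n"
    return context
```

Because folding replaces the context rather than appending to it, the loop's token footprint stays bounded even over long horizons.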
2. Autonomous Memory Folding Mechanism
DeepAgent's memory folding mechanism is engineered to address context window limitations and avoid error accumulation in extended, multi-phase interactions. At agent-chosen checkpoints, the auxiliary LLM synthesizes a compressed memory state:
- Episodic Memory (M_E): A high-level summary of the ongoing task, capturing major decisions and their rationales.
- Working Memory (M_W): Fine-grained record of immediate sub-goals, problem decompositions, and active challenges.
- Tool Memory (M_T): Trace of tool usage, parameterizations, outcome patterns, and experiential lessons.
This triadic schema is serialized in a structured (JSON-like) format that maintains critical context yet enables drastic token reduction. This process allows DeepAgent to revisit and re-strategize tasks mid-process without losing essential informational dependencies.
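As an illustration of the triadic schema, the sketch below shows one plausible JSON-like serialization of a folded memory state. The dataclass fields follow the M_E/M_W/M_T description above, while the summarize callable is an assumed stand-in for the auxiliary LLM, not the paper's exact interface.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable

@dataclass
class FoldedMemory:
    episodic: str  # M_E: high-level task summary, major decisions and rationales
    working: str   # M_W: immediate sub-goals, decompositions, active challenges
    tool: str      # M_T: tool-call traces, parameterizations, lessons learned

def fold(history: str, summarize: Callable[[str, str], str]) -> str:
    """Compress a long dialogue history into the (M_E, M_W, M_T) tuple.

    summarize(history, facet) is an assumed auxiliary-LLM call returning a
    summary of the history focused on the requested facet.
    """
    memory = FoldedMemory(
        episodic=summarize(history, "major decisions and their rationales"),
        working=summarize(history, "current sub-goals and open challenges"),
        tool=summarize(history, "tool usage, parameters, and outcome patterns"),
    )
    # Structured, JSON-like serialization: keeps critical context while
    # achieving a drastic token reduction relative to the raw history.
    return json.dumps(asdict(memory), indent=2)
```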
3. Reinforcement Learning via ToolPO
To support general-purpose, robust tool use, DeepAgent introduces ToolPO—a reinforcement learning regime tailored to the high-dimensional action space of tool invocation:
- LLM-Simulated APIs: To sidestep instability and cost associated with live API endpoints, LLM-based simulators emulate RESTful tools during training, providing a controlled-yet-representative interaction landscape.
- Tool-Call Advantage Attribution: Unlike sparse end-of-trajectory rewards, ToolPO propagates feedback to all trajectory tokens involved in tool use or memory operations. For each token $t$, the attributed advantage is $A_t = A_{\text{global}} + m_t \cdot A_{\text{local}}$, where $m_t \in \{0,1\}$ masks the tokens belonging to tool calls or memory operations, with $A_{\text{global}}$ and $A_{\text{local}}$ representing global (task-level) and intermediate (tool-call-level) rewards, respectively.
This approach is operationalized via a clipped surrogate policy optimization (akin to PPO) that directly encourages both global task success and precise intermediate actions—yielding stable, fine-grained tool-using behavior and reducing spurious or redundant tool calls.
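A minimal PyTorch-style sketch of this objective follows, assuming the additive advantage decomposition described above; the tensor names, shapes, and scalar advantages are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def toolpo_loss(logp_new: torch.Tensor,   # [T] per-token log-probs, current policy
                logp_old: torch.Tensor,   # [T] per-token log-probs, behavior policy
                tool_mask: torch.Tensor,  # [T] 1.0 on tool-call/memory-op tokens
                adv_global: float,        # task-level advantage, shared by all tokens
                adv_local: float,         # intermediate, tool-call-level advantage
                clip_eps: float = 0.2) -> torch.Tensor:
    # A_t = A_global + m_t * A_local: every token carries the task-level
    # signal, and tool-call tokens additionally receive fine-grained credit.
    adv = adv_global + tool_mask * adv_local
    ratio = torch.exp(logp_new - logp_old)           # importance-sampling ratio
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()     # PPO-style clipped surrogate
```

Masking the local advantage onto tool-call tokens concentrates credit exactly where a call was emitted, which is what damps spurious or redundant invocations.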
4. Toolset Integration and Dynamic Retrieval
DeepAgent deploys a large-scale, dense tool retriever component. Tool documentation (potentially exceeding 16,000 APIs) is embedded, indexed, and rapidly retrieved to support <tool_search> queries. The retriever's conditioning on the full reasoning context and explicit user requirements enables not only labeled-tool but also open-set tool scenarios—where the agent must discover and utilize previously unseen or unlisted APIs. This highly generalizable approach contrasts with static workflow-tree baselines, allowing for adaptive execution in unfamiliar or evolving domains.
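A minimal dense-retrieval sketch under these assumptions appears below: tool documentation is embedded and normalized once, and <tool_search> queries are scored by cosine similarity against the index. The embed encoder is a placeholder; DeepAgent's actual retriever model is not specified here.

```python
import numpy as np

def build_index(tool_docs: list[str], embed) -> np.ndarray:
    """Embed and L2-normalize all tool documentation once, up front.

    embed: assumed encoder mapping a list of strings to an [N, d] float array.
    """
    vecs = embed(tool_docs)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def retrieve(history: str, query: str, instruction: str,
             index: np.ndarray, tool_docs: list[str], embed, k: int = 5):
    # Condition the retrieval query on the full reasoning context, the user
    # query, and the global instruction, matching the scoring rule in Section 1.
    q = embed(["\n".join([history, query, instruction])])[0]
    q = q / np.linalg.norm(q)
    scores = index @ q              # cosine similarity (all vectors unit-norm)
    top = np.argsort(-scores)[:k]
    return [(tool_docs[i], float(scores[i])) for i in top]
```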
Auxiliary LLMs are tasked with the pre-processing and summarization of tool documentation and output, ensuring that excessive tool-related information does not overwhelm the reasoning context.
5. Empirical Results
DeepAgent was benchmarked on eight datasets, spanning both general-purpose tool-use and downstream application tasks:
- General Tool-Use: ToolBench, API-Bank, TMDB, Spotify, ToolHop. DeepAgent-32B-RL achieved a success rate of 89.0% on TMDB (labeled-tool), far exceeding workflow-based baselines (≈55.0%). Substantial improvements were observed for open-set tool retrieval as well.
- Downstream Tasks: ALFWorld (embodied simulation), WebShop (e-commerce simulation), GAIA, and Humanity’s Last Exam. DeepAgent consistently outperformed prior methods in both success rates and reasoning flexibility, attributed to its end-to-end agentic reasoning, on-the-fly tool integration, and memory compression.
Experimental evidence indicates that the unified agentic process is markedly more effective than prior sequential or modularized frameworks for real-world, long-horizon, and tool-intensive scenarios.
6. Applications and Prospective Development
DeepAgent's architecture lends itself to:
- Autonomous information synthesis across web sources and APIs, with integrated multi-step or multi-modal tool chains.
- Embodied and simulated environments (e.g., ALFWorld), where context adapts and toolsets expand dynamically.
- E-commerce and virtual assistant platforms, leveraging open-ended tool discovery and action orchestration.
- Advanced AI assistants for multidisciplinary domains, as evidenced by superior GAIA and HLE benchmark performance.
Future work anticipates scaling LRM capacity, extending tool retriever coverage, refining the memory folding implementation to further minimize context loss, enhancing RL methodologies (e.g., fine-grained reward shaping beyond advantage attribution), and exploring increasingly challenging, longer-horizon agentic tasks.
7. Relationship to Other Agent Frameworks
In contrast to predefined workflow agents or planners that rely on monolithic or staged execution (e.g., ReAct, CodeAct), DeepAgent's continuous, context-unified design and dynamic tool retrieval represent a significant architectural divergence. The explicit memory folding mechanism addresses the context-length explosion that is especially prevalent in multi-tool, long-horizon interactions.
Notably, DeepAgent's ToolPO method provides a finer-grained learning signal than was previously standard, attributing advantage directly to tool-use tokens, which yields superior stability and behavioral adaptation relative to prior end-to-end training regimes.
An implication is that prevalent agentic systems should formalize dynamic memory compaction and dense toolset retrieval as core design primitives—rather than post hoc or ad hoc extensions—to attain greater generality and integration into real-world deployments (Li et al., 24 Oct 2025).