DeepAgent: A General Reasoning Agent with Scalable Toolsets (2510.21618v1)
Abstract: Large reasoning models have demonstrated strong problem-solving abilities, yet real-world tasks often require external tools and long-horizon interactions. Existing agent frameworks typically follow predefined workflows, which limit autonomous and global task completion. In this paper, we introduce DeepAgent, an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution within a single, coherent reasoning process. To address the challenges of long-horizon interactions, particularly the context length explosion from multiple tool calls and the accumulation of interaction history, we introduce an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories, reducing error accumulation while preserving critical information. To teach general-purpose tool use efficiently and stably, we develop an end-to-end reinforcement learning strategy, namely ToolPO, that leverages LLM-simulated APIs and applies tool-call advantage attribution to assign fine-grained credit to the tool invocation tokens. Extensive experiments on eight benchmarks, including general tool-use tasks (ToolBench, API-Bank, TMDB, Spotify, ToolHop) and downstream applications (ALFWorld, WebShop, GAIA, HLE), demonstrate that DeepAgent consistently outperforms baselines across both labeled-tool and open-set tool retrieval scenarios. This work takes a step toward more general and capable agents for real-world applications. The code and demo are available at https://github.com/RUC-NLPIR/DeepAgent.
Explain it Like I'm 14
What is this paper about?
This paper introduces DeepAgent, a smart computer “agent” that can think through problems, find and use the right tools (like apps or websites), and remember important information while working on long tasks. It aims to handle real-world jobs that need many steps, outside resources, and careful planning—like researching online, shopping on a website, or solving complex puzzles with code.
What questions were they trying to answer?
The researchers wanted to solve a few big challenges:
- How can an AI agent think deeply and plan without being locked into a rigid, pre-made workflow?
- How can it discover and use the right tools at the right time, instead of only using a small set of fixed tools?
- How can it manage long tasks without forgetting important details or getting overwhelmed by too much history?
- How can we train such an agent efficiently and reliably, even when the tools are complicated or expensive to use?
How did they do it? Methods and ideas explained simply
To tackle these challenges, they built DeepAgent with several key parts. Think of DeepAgent like a focused student with a giant toolbox and a smart way of taking notes.
The agent’s unified reasoning process
Most older agents follow a strict loop: “Think → Act → Observe” over and over. DeepAgent is different. It keeps one continuous chain of thought. In that stream, it can:
- Think about the task,
- Search for the tools it needs,
- Call those tools with the right settings,
- Organize its memory,
- And keep going without losing the bigger picture.
This helps it stay flexible and make better decisions overall.
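To make this concrete, here is a minimal sketch of what such a single-stream loop could look like in code. Everything here (the Step type, model.generate, toolbox.retrieve, toolbox.call, fold) is a hypothetical interface used for illustration, not the paper's actual implementation:

```python
from dataclasses import dataclass, field

# A minimal sketch of DeepAgent's single-stream reasoning loop. All names
# (Step, model.generate, toolbox.retrieve/call) are illustrative interfaces.

@dataclass
class Step:
    text: str                       # the model's reasoning text for this step
    action: str = "think"           # think | search_tools | call_tool | fold_memory | final_answer
    query: str = ""                 # tool-search query, if any
    tool_name: str = ""             # tool to invoke, if any
    arguments: dict = field(default_factory=dict)
    answer: str = ""                # final answer, if any

def run_agent(task, model, toolbox, fold, max_steps=50):
    context = f"Task: {task}\n"
    for _ in range(max_steps):
        step: Step = model.generate(context)     # one continuous chain of thought
        context += step.text
        if step.action == "search_tools":        # discover tools on demand
            context += toolbox.retrieve(step.query)
        elif step.action == "call_tool":         # structured tool invocation
            context += toolbox.call(step.tool_name, step.arguments)
        elif step.action == "fold_memory":       # compress history, then continue
            context = fold(context)              # structured summary replaces raw log
        elif step.action == "final_answer":
            return step.answer
    return "max steps reached without a final answer"
```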
Finding and using tools on demand
Imagine you’re doing homework and need different apps: a calculator, a map, or a music database. DeepAgent can:
- Search for tools by writing a short query (like “find a movie database tool”).
- Retrieve the best tools from a huge library using a “search-by-meaning” system (similar to searching by topic rather than exact words).
- Call a tool by sending structured instructions (like telling the app exactly what to do and with what inputs).
- Summarize long tool outputs with a helper model so the main agent doesn’t get distracted by too much text.
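The "search-by-meaning" step is standard dense retrieval: embed the query and every tool description, then rank by cosine similarity. A minimal sketch, assuming the bge-large-en-v1.5 embedding model the paper reports using (the toy tool descriptions are made up):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# bge-large-en-v1.5 is the embedding model the paper uses for tool retrieval.
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

tool_docs = [
    "TMDB: search movies, actors, and ratings",
    "Spotify: search tracks, artists, and playlists",
    "Calculator: evaluate arithmetic expressions",
]
doc_vecs = model.encode(tool_docs, normalize_embeddings=True)

def retrieve_tools(query: str, top_k: int = 2) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                  # cosine similarity (vectors are normalized)
    best = np.argsort(-scores)[:top_k]     # indices of the top-k closest tools
    return [tool_docs[i] for i in best]

print(retrieve_tools("find a movie database tool"))
```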
Memory folding: tidying up the agent’s notes
Long tasks create lots of text and history, which can overload the agent. DeepAgent can “fold” its memory—like tidying a messy notebook into neat summaries—whenever it needs a reset. It compresses past interactions into three organized parts:
- Episodic memory: a high-level timeline of important events and decisions.
- Working memory: the current sub-goal, recent obstacles, and next steps.
- Tool memory: which tools were used, how they were used, and whether they worked.
These summaries are stored in a structured format (like clear, labeled checklists) so the agent can easily use them later. This saves space, reduces mistakes, and lets the agent “take a breath” and rethink its strategy.
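As a concrete illustration, a folded memory record might look like the following. The paper stores memories in a structured JSON schema with these three components, but the exact field names and contents below are hypothetical:

```python
# A hypothetical folded-memory record; field names are illustrative.
folded_memory = {
    "episodic_memory": [  # high-level timeline of key events and decisions
        {"event": "searched for a movie database tool", "outcome": "found TMDB"},
        {"event": "queried TMDB for 2023 releases", "outcome": "40 results"},
    ],
    "working_memory": {   # the current sub-goal and near-term plan
        "current_subgoal": "filter to sci-fi films rated above 7.5",
        "recent_obstacles": ["rating missing for 3 entries"],
        "next_steps": ["re-query missing entries", "rank remaining films"],
    },
    "tool_memory": [      # which tools were used, how, and whether they worked
        {"tool": "tmdb.search_movies", "calls": 2, "worked": True,
         "note": "requires a 'primary_release_year' argument"},
    ],
}
```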
Training with ToolPO (a smart practice routine)
Training AI with real online tools can be slow, costly, and unstable. So the authors created ToolPO, a reinforcement learning (RL) method that uses simulated tools:
- Tool simulator: instead of calling real APIs, the agent practices with a fast and reliable simulator that mimics real responses.
- Two types of rewards:
- Global success: Did the agent finish the task correctly?
- Action-level credit: Did it choose and call the right tools at the right moments?
- Fine-grained credit assignment: Training gives extra “points” specifically to the tokens (the small pieces of text) that make up correct tool calls, so the agent learns the tricky parts precisely.
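The credit-assignment idea can be sketched as follows: every token shares the trajectory-level advantage, and the tokens inside tool-call segments receive an extra bonus (or penalty) scaled by a weight λ. This is an illustration of the concept, not ToolPO's exact formulation:

```python
import torch

def toolpo_advantages(seq_len: int,
                      global_advantage: float,
                      tool_call_spans: list[tuple[int, int]],
                      tool_call_correct: list[bool],
                      lam: float = 0.5) -> torch.Tensor:
    """Blend a trajectory-level advantage with action-level credit.

    tool_call_spans:   (start, end) token indices of each tool-call segment.
    tool_call_correct: whether each of those calls was judged correct.
    lam:               weight on the action-level bonus (a hyperparameter).
    """
    adv = torch.full((seq_len,), global_advantage)      # every token shares task reward
    for (start, end), ok in zip(tool_call_spans, tool_call_correct):
        adv[start:end] += lam * (1.0 if ok else -1.0)   # extra credit lands only on
                                                        # tool-invocation tokens
    return adv
```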
What did they find?
The team tested DeepAgent on many benchmarks, including tool-use tasks and real-world applications:
- Tool-use tasks: ToolBench, API-Bank, TMDB, Spotify, and ToolHop
- Applications: ALFWorld (virtual home tasks), WebShop (online shopping), GAIA (research and reasoning), and HLE (very hard exam-style problems)
Across these tests, DeepAgent consistently performed better than traditional agents that rely on fixed workflows (like ReAct, Plan-and-Solve, CodeAct, and Reflexion). Two standout results:
- In “open-set” scenarios (where the agent must search for tools from large collections), DeepAgent’s ability to discover tools on demand led to big improvements.
- Using ToolPO training made DeepAgent even stronger, improving both tool-use accuracy and performance on long, complex tasks.
In short, the agent’s unified reasoning, dynamic tool discovery, and memory folding gave it clear advantages.
Why does this matter? Implications and impact
DeepAgent points toward smarter, more independent AI systems that can:
- Handle real-world tasks that need many steps and different tools,
- Keep track of what matters during long interactions without getting lost,
- Learn robustly and affordably using simulated tools,
- And adapt to new toolsets in the wild (like the growing ecosystem of APIs and apps).
This could improve AI assistants for research, shopping, coding, data analysis, and more—especially tasks that require careful planning, reliable memory, and using the right tools at the right time.
Knowledge Gaps
Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper; each item is concrete enough to guide follow-up research:
- Real-world API fidelity: The RL training relies on LLM-simulated APIs, but there is no systematic sim-to-real evaluation with production APIs (e.g., authentication, rate limits, pagination, non-deterministic behavior, schema drift, latency spikes, outages). Quantify performance drop and robustness when swapping the simulator for actual APIs across diverse providers.
- Tool-call correctness definition: “Correct tool invocation” is rewarded, but the paper does not specify how correctness is determined across tasks lacking labeled intermediate calls. Establish formal, task-agnostic criteria (schema validation, argument semantics, side-effect verification) and measure inter-annotator agreement.
- Retrieval scaling beyond 16k tools: Dense retrieval is tested up to ~16k tools; real-world ecosystems can exceed 100k–1M tools. Evaluate retrieval accuracy, latency, memory footprint, and top-k sensitivity at much larger scales, and explore approximate nearest neighbor indices and caching strategies.
- Retriever training and adaptation: The retriever is fixed (bge-large-en) and not trained or adapted end-to-end with the agent. Investigate joint training of the retriever with the reasoning policy, online adaptation to new toolsets, and feedback loops from tool-call outcomes.
- Tool documentation quality and summarization loss: The auxiliary LLM summarizes tool docs and outputs, but the paper does not quantify information loss, hallucination, or misinterpretation introduced by summarization. Conduct controlled studies measuring critical-field omission rates, argument mis-specification, and downstream error propagation.
- Dynamic toolset updates: No mechanism or evaluation for handling tool additions/removals, version changes, or deprecated endpoints during a running session. Design incremental index updates, capability negotiation, and backward-compatibility handling.
- Memory folding policy characterization: While memory folding improves GAIA, the paper does not analyze fold timing, frequency, token savings, and accuracy trade-offs. Provide quantitative analysis of fold-trigger conditions, impact on trajectory length, and fold-related failure modes (e.g., forgetting recent constraints).
- Catastrophic forgetting and memory staleness: The effects of compressing long histories into episodic/working/tool memories on recall of fine-grained details are not measured. Evaluate retention of task-critical facts pre/post-fold, stale-memory detection, and recovery strategies.
- Schema robustness: The JSON memory schema is proposed but not stress-tested for malformed entries, partial writes, or adversarial injection (e.g., prompt poisoning within fields). Add schema validation, type-checking, and sanitization; measure robustness under adversarial inputs.
- Safety and security: The agent can discover and invoke external tools, but the paper does not address permissioning, sandboxing, data exfiltration risks, or potentially harmful actions. Develop a safety layer with allowlists/denylists, least-privilege scopes, and runtime policy checks; evaluate attack scenarios (prompt injection, tool masquerading).
- Latency, throughput, and cost: Inference uses up to 81,920 tokens and 64 H20 GPUs, but there is no analysis of end-to-end latency, throughput, or cost per task. Provide runtime breakdowns (retrieval, reasoning, tool calls, folding), and quantify the efficiency gains from folding under realistic constraints.
- Advantage attribution sensitivity: ToolPO’s advantage attribution relies on λ hyperparameters and masking of action tokens, but sensitivity analyses and theoretical justification are limited. Study convergence properties, variance under different λ values, and alternative credit assignment strategies (e.g., per-argument token weighting, structured action heads).
- Simulator coverage and failure modeling: The tool simulator design, coverage of error modes (timeouts, partial data, 4xx/5xx codes, schema evolution), and calibration to real APIs are not detailed. Build a benchmark of realistic failure distributions and report agent resilience and recovery behaviors.
- Auxiliary LLM dependence: The auxiliary LLM handles summarization, denoising, and compression, but its contribution is not isolated beyond coarse ablation. Quantify its error rates, compare different auxiliary models, and assess how its quality scales with task difficulty.
- Multilingual and cross-lingual generalization: Tasks and tools appear predominantly English. Evaluate tool retrieval and usage with non-English user queries and tool documentation; explore multilingual embeddings and cross-lingual argument generation.
- Coverage of non-text modalities and tools: Beyond VQA, the framework does not explore audio, geospatial, IoT/robotics, or interactive UI tools. Assess generalization to diverse modalities and device APIs, including time-critical control loops.
- MCP interoperability and ecosystem integration: The paper references MCP conceptually but does not implement formal MCP tool discovery/permission flows. Evaluate plug-in lifecycle management, consent workflows, and interoperability across MCP-compliant providers.
- Robustness to noisy/adversarial tool outputs: No experiments assess behavior under misleading or adversarial tool responses (e.g., poisoned search results). Introduce self-verification, cross-tool consistency checks, and rollback/undo mechanisms.
- Argument validation and execution safety: Tool calls are parsed from JSON, but there is no formal schema validation, unit constraints, or exception handling pipeline. Implement typed schemas, contract checks, and safe execution wrappers; report the rate of runtime errors and recovery success.
- Termination criteria and “done” detection: The agent relies on a max step limit, but criteria for detecting completion or futility are not formalized. Develop confidence-based stopping rules and measure trade-offs (precision/recall of completion detection).
- Failure mode analysis: The paper lacks granular error attribution (retrieval miss, argument formatting, tool failure, reasoning flaw) on benchmarks like ToolHop. Instrument trajectories to tag failure causes and prioritize targeted fixes.
- Evaluation breadth: Metrics focus on Pass@1 and path scores; there is no human evaluation of helpfulness, safety, or user satisfaction. Add qualitative assessments, task success under user constraints (budget, time), and compliance/adherence metrics.
- Reproducibility and data availability: Training data sources are listed, but the exact splits, prompts, and simulator configurations are not fully disclosed. Release detailed datasets, simulator scripts, and seeds to enable faithful reproduction.
- Model generality across sizes/backbones: Results center on QwQ-32B; portability to smaller models or different backbones is not studied. Test scaling laws, minimum viable model sizes, and transfer across open/closed-source backbones.
- Measuring “global perspective” claims: The paper asserts a globally coherent reasoning process but does not provide metrics (e.g., plan consistency, re-planning success after fold). Propose and report quantitative measures of global task coherence and long-horizon planning quality.
- Online learning and continual adaptation: The agent does not adapt its policy or memories across tasks or sessions. Explore continual learning, memory reuse, and personalized tool-use profiles with drift detection and mitigation.
- Privacy and compliance: Web and file tools may process sensitive data; no discussion of privacy guarantees or regulatory compliance (GDPR/CCPA). Define data handling policies, logging minimization, and compliance auditing; evaluate privacy-preserving variants.
- Multi-agent coordination: The framework uses a single agent plus auxiliary LLM; potential benefits of specialized sub-agents or negotiators are unexplored. Investigate multi-agent decomposition, communication protocols, and shared memory structures.
Practical Applications
Immediate Applications
Below are practical use cases that can be deployed now, leveraging DeepAgent’s unified reasoning, dynamic tool retrieval, and autonomous memory folding, together with the ToolPO training workflow and tool simulation.
- Autonomous web research co-pilot for analysts (industry, academia, media, legal)
- What it does: Plans multi-hop searches, browses pages, extracts and cross-checks facts, runs quick code-based checks, and maintains a structured “research log” via memory folding for long investigations (GAIA-like).
- Tools/workflows: Web search (e.g., Serper), page reading (e.g., Jina Reader), VQA for charts/tables, code execution for light data cleaning/validation, JSON memory for auditability.
- Assumptions/dependencies: High-quality tool documentation; permissioned access to paywalled sources; retrieval quality (embedding model + tool doc quality); human-in-the-loop approval for claims.
- E-commerce shopping assistant (retail/marketplaces)
- What it does: Searches and compares products across stores, summarizes reviews/specs, justifies trade-offs, and prepares carts with vendor-specific APIs (WebShop-style).
- Tools/workflows: Retailer APIs, price/availability APIs, payment initiation with human confirmation, JSON episodic memory to span long user sessions.
- Assumptions/dependencies: API rate limits; identity and payment authorization; fraud/chargeback guardrails; product catalog freshness.
- L1 customer support triage and resolution (software, telecom, fintech, logistics)
- What it does: Retrieves relevant tools on demand (ticketing, KB, account data), reproduces issues via scripted tool calls, and summarizes steps taken using structured tool memory.
- Tools/workflows: Zendesk/Jira/ServiceNow, account lookups, log search, knowledge base, automatic case “memory fold” summaries attached to tickets.
- Assumptions/dependencies: Secure data access; PII/PHI handling and audit; escalation policies; well-instrumented APIs.
- DevOps/SRE co-pilot (software)
- What it does: Investigates service incidents end-to-end by querying CI/CD, APM, logs, and error trackers; runs diagnostic scripts; folds long histories into concise working memory to avoid getting stuck.
- Tools/workflows: GitHub/GitLab, Jenkins, Datadog/New Relic, Splunk/ELK, Sentry; tool memory to track attempted remediations and outcomes.
- Assumptions/dependencies: Safe execution environment and permissioning; runbooks/standard operating procedures; change management with human approval.
- Data engineering and BI assistant (software, analytics)
- What it does: Diagnoses broken pipelines, queries warehouses (Snowflake/BigQuery), inspects lineage and data quality checks, and compiles a structured incident report.
- Tools/workflows: SQL execution, data catalog/lineage APIs, quality monitors; memory folding to keep long triage contexts affordable and coherent.
- Assumptions/dependencies: Stable connectors; cost controls for query execution; role-based access control (RBAC).
- SOC triage co-pilot (security)
- What it does: Correlates SIEM alerts, enriches with threat intel, retrieves EDR/IDS evidence, and drafts response playbooks, storing a detailed, structured episode log for audits.
- Tools/workflows: Splunk/QRadar/Chronicle, EDR (e.g., CrowdStrike), threat intel feeds, case management; tool-call attribution logs for forensics.
- Assumptions/dependencies: Strict access governance; risk thresholds; human gatekeeping on containment/remediation actions.
- Healthcare admin workflow assistant (healthcare)
- What it does: Automates scheduling, eligibility checks, benefit verification, prior authorizations, and billing status checks using dynamic tool retrieval; maintains case histories with memory folds.
- Tools/workflows: FHIR/EHR APIs, payer portals, scheduling/billing APIs; episodic memory as a HIPAA-aligned audit trail.
- Assumptions/dependencies: Regulatory compliance (HIPAA, GDPR); secure handling of PHI; payer/EHR API variability.
- Finance back-office reconciliation and compliance (finance)
- What it does: Matches transactions across ERP/banking APIs, flags anomalies, checks policies/regulatory rules, and produces explainable, structured audit logs from memory.
- Tools/workflows: ERP (e.g., NetSuite, SAP), bank APIs, compliance knowledge bases, file reading and code-based validation for edge cases.
- Assumptions/dependencies: Precise policy rules and test suites; segregation of duties; read/write permissions with approvals.
- Personal travel concierge (consumer)
- What it does: Searches flights/hotels, compares trade-offs, books with multiple providers, handles changes/refunds, and maintains itinerary “working memory” across long interactions.
- Tools/workflows: Aggregators (e.g., Skyscanner), airline/hotel APIs, document/file tools (passport/visa checks), secure vault for credentials.
- Assumptions/dependencies: Human confirmation for payments; vendor-specific quirks; robust error handling for failed bookings.
- Education tutor and lab assistant (education)
- What it does: Solves multi-step problems, fetches references, runs code to verify solutions, explains reasoning, and supports visual Q&A for diagrams; maintains student progress memory.
- Tools/workflows: Web search, code execution, VQA; structured episodic memory for lesson continuity.
- Assumptions/dependencies: Alignment to curricula and academic integrity standards; sandboxed execution; content safety filters.
- Agent development sandbox with tool simulation (software/tooling)
- What it does: Uses LLM-simulated APIs to test agents against large toolsets, enabling low-cost, low-latency RL fine-tuning with tool-call advantage attribution for precise debugging.
- Tools/workflows: ToolPO training pipeline, tool simulators, advantage attribution telemetry dashboards.
- Assumptions/dependencies: Simulator fidelity to production APIs; synthetic-to-real generalization checks; evaluation harnesses.
- Enterprise API catalog and retrieval service (software/platform)
- What it does: Indexes API docs across the enterprise and third parties, supports dense retrieval for agents, and enforces governance/permissions; fits the MCP paradigm.
- Tools/workflows: Embedding index (e.g., bge-large), MCP servers, tool governance UI, ranking/feedback loops.
- Assumptions/dependencies: Up-to-date and standardized API documentation; metadata quality; access control integration.
- Agent memory service for cost and reliability (software/platform)
- What it does: Provides a JSON schema for episodic/working/tool memories and automatic “folding” endpoints to compress long trajectories, lowering token costs and error accumulation.
- Tools/workflows: Memory folding microservice, storage + retrieval of structured memories, observability for fold triggers/outcomes.
- Assumptions/dependencies: Stable schema versioning; privacy retention policies; compatibility across backbone LLMs.
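As a sketch of what such a folding endpoint's contract could look like (a hypothetical FastAPI service, not an existing product):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class FoldRequest(BaseModel):
    trajectory: str             # raw interaction history to compress

class FoldResponse(BaseModel):
    episodic_memory: list[str]  # high-level timeline of key events
    working_memory: dict        # current sub-goal, obstacles, next steps
    tool_memory: list[dict]     # which tools were used and how they fared

@app.post("/fold", response_model=FoldResponse)
def fold(req: FoldRequest) -> FoldResponse:
    # In practice an auxiliary LLM would produce these summaries; here we
    # return placeholders to show only the schema contract.
    return FoldResponse(
        episodic_memory=["<summarized events>"],
        working_memory={"current_subgoal": "<subgoal>"},
        tool_memory=[{"tool": "<name>", "worked": True}],
    )
```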
Long-Term Applications
These use cases become feasible as reliability, safety, and ecosystem maturity improve (e.g., standard tool schemas, robust guardrails, stronger verification, and broader MCP adoption).
- Clinical decision support with autonomous tool use (healthcare)
- What it could do: Query guidelines/medical databases, interpret multimodal inputs (labs, images via VQA), simulate treatment options, and explain reasoning with auditable, structured memories.
- Tools/workflows: EHR/FHIR, medical knowledge bases, imaging tools, code for risk calculators.
- Assumptions/dependencies: Regulatory approval; validated clinical performance; strict human oversight and post-hoc verification; liability frameworks.
- Household/industrial robotics orchestration (robotics/manufacturing)
- What it could do: Use dynamic tool discovery as “skill discovery” (e.g., ROS actions, PLC commands), adapt to new devices on demand, and fold memory during long-horizon tasks (assembly/maintenance).
- Tools/workflows: ROS/ROS2, digital twins/simulators, safety interlocks, environment mapping.
- Assumptions/dependencies: Robust perception-actuation; real-time constraints; formal safety guarantees; sim-to-real transfer.
- Fully autonomous procurement and vendor management (enterprise/policy)
- What it could do: Discover suppliers, evaluate bids, negotiate (email/chat), check compliance, generate contracts, and reconcile deliveries and invoices end-to-end with detailed audit traces.
- Tools/workflows: Sourcing platforms, contract management, ERP, supplier risk/intel; structured memory for audits.
- Assumptions/dependencies: Organizational policy encoding; negotiation ethics; human approvals for commitments; robust identity and authority management.
- Automated scientific discovery loops (academia, pharma, materials)
- What it could do: Formulate hypotheses, mine literature, design/run code experiments, control lab equipment or simulators, and iteratively refine approaches with embedded tool memories.
- Tools/workflows: LIMS, lab robots, simulation frameworks, dataset registries; RL with tool simulation before lab runs.
- Assumptions/dependencies: Reliable experiment control; data provenance; reproducibility checks; domain-specific validation pipelines.
- Regulated financial advisors with transactional autonomy (finance)
- What it could do: Monitor markets, query risk/compliance tools, simulate portfolios, execute trades under policy constraints, and produce regulator-grade logs via episodic/tool memory.
- Tools/workflows: Broker/trading APIs, risk engines, KYC/AML checks, supervisory dashboards for approvals.
- Assumptions/dependencies: Licensing and fiduciary requirements; strict guardrails; robust backtesting; incident handling protocols.
- Cross-agency digital service orchestrator (public sector)
- What it could do: Dynamically discover agency services, retrieve personalized records, pre-fill forms, schedule appointments, and track cases across departments with a single agent session.
- Tools/workflows: MCP-based service registry, identity federation, case management; memory folding for multi-month episodes.
- Assumptions/dependencies: Interoperable standards; privacy/security frameworks; equitable access; strong auditability.
- Agent OS / enterprise agent platform
- What it could do: Provide a standardized runtime embedding autonomous tool retrieval, memory folding, ToolPO training, and tool simulators; support multi-tenant governance and observability.
- Tools/workflows: Policy engines, capability-based permissions, tool marketplaces, telemetry and cost controls.
- Assumptions/dependencies: Vendor-neutral standards (e.g., MCP); compatibility with diverse LLM backbones; organizational adoption.
- Standardized tool schemas and compliance/audit pipelines
- What it could do: Establish widely adopted JSON schemas for episodic/working/tool memories and tool invocation logs, enabling cross-vendor audits, reproducibility, and forensics.
- Tools/workflows: Schema registries, conformance validators, signed execution receipts; chain-of-custody for actions.
- Assumptions/dependencies: Industry consortia; regulator engagement; secure logging and attestation.
- Multi-agent ecosystems with shared tool memories (cross-domain)
- What it could do: Specialized agents coordinate via shared episodic/tool memories, hand off subtasks, and negotiate tool ownership and access in dynamic environments.
- Tools/workflows: Orchestration fabric, shared memory stores, role-based tool routing; inter-agent protocols.
- Assumptions/dependencies: Communication safety; conflict resolution; provenance and accountability for collective actions.
- On-device/private agents with local tool discovery (consumer/enterprise)
- What it could do: Run privacy-sensitive workflows locally (e.g., document analysis, offline search), use memory folding to fit small-context models, and selectively escalate to cloud tools.
- Tools/workflows: Local vector stores, sandboxed code tools, private search indices; hybrid execution policies.
- Assumptions/dependencies: Efficient local LRMs; hardware constraints; differential privacy; secure enclave support.
Cross-cutting assumptions and dependencies that affect feasibility
- Tool ecosystem readiness: Availability, stability, and documentation quality of APIs; adoption of Model Context Protocol (MCP) or equivalent.
- Governance and safety: Permissioning, identity, guardrails for high-risk actions; human-in-the-loop checkpoints; content safety and data governance (PII/PHI/financial).
- Reliability and evaluation: Robust retrieval quality, verification/grounding of outputs, test suites and post-hoc validation; simulator-to-reality fidelity for ToolPO-trained behaviors.
- Cost and scalability: Token budgets and latency; memory folding services to control context bloat; observability and cost controls for long-horizon agents.
- Compliance and audit: Use of structured JSON memories for traceability; regulator-grade logging and attestations in sensitive sectors.
- Organizational adoption: Integration with existing stacks (SaaS, data, security), change management, and user training.
Glossary
- ALFWorld: A text-based embodied AI environment used to evaluate agents on household tasks via discrete actions. "downstream applications (ALFWorld, WebShop, GAIA, HLE)"
- API-Bank: A benchmark with human-annotated dialogues and API calls to test planning, retrieval, and calling abilities. "API-Bank, which includes 314 human-annotated dialogues with 73 APIs and 753 API calls"
- Autonomous Memory Folding: A mechanism where the agent compresses past interactions into structured memories to manage long-horizon reasoning. "We introduce an Autonomous Memory Folding strategy that allows DeepAgent to consolidate its previous thoughts and interaction history into a structured memory schema"
- Auxiliary LLM: A secondary LLM that supports the main agent by summarizing tool docs/results and compressing histories. "DeepAgent employs an auxiliary LLM to handle complex interactions with large toolsets and manage long histories."
- Brain-inspired memory architecture: A memory design modeled after human cognition, comprising episodic, working, and tool memories. "we introduce a brain-inspired memory architecture comprising episodic memory, working memory, and tool memory"
- Chain-of-Thought (CoT): A prompting/learning approach that elicits step-by-step reasoning in LLMs. "elicit extended Chain-of-Thought (CoT) reasoning"
- Clipped surrogate objective function: A PPO-style training objective that constrains policy updates to stabilize learning. "ToolPO then optimizes the policy using a clipped surrogate objective function:"
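  (For reference, the standard PPO-style clipped objective that this builds on is $L^{\text{CLIP}}(\theta) = \mathbb{E}_t\big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)\big]$, where $r_t(\theta)$ is the probability ratio and $\hat{A}_t$ the advantage; ToolPO's exact variant may differ.)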
- Cosine similarity: A similarity measure between embeddings used for ranking retrieved tools. "by ranking them based on the cosine similarity"
- Dense retrieval: A retrieval method using vector embeddings to find relevant tools/documents. "The system's tool retriever operates via dense retrieval."
- Embedding model: A model that maps text (e.g., tool docs, queries) into vector representations for retrieval. "using an embedding model."
- End-to-end reinforcement learning (RL): Training that optimizes the entire agent behavior directly through RL signals. "we propose ToolPO, an end-to-end reinforcement learning (RL) training method tailored for general tool use."
- Episodic Memory: A high-level log of key events and decisions to preserve long-term task context. "Episodic Memory: This component serves as a high-level log of the task"
- GAIA: A complex information-seeking benchmark requiring multi-tool reasoning (e.g., search, browsing, VQA, code). "GAIA, a complex information-seeking benchmark"
- Group Relative Policy Optimization (GRPO): A policy optimization method that uses group-normalized rewards for stability. "compared to the commonly used GRPO."
- Humanity's Last Exam (HLE): A benchmark of extremely challenging reasoning problems for agents. "Humanity's Last Exam (HLE)"
- Large Language Models (LLMs): Foundation models trained on vast corpora to perform language tasks and tool-augmented reasoning. "The rapid advancement of LLMs has inspired the development of LLM-powered agents"
- Large Reasoning Models (LRMs): LLMs specialized or trained to perform extended, step-by-step reasoning. "Large Reasoning Models (LRMs) have demonstrated significant performance improvements"
- Memory Fold: An action where the agent compresses the interaction history into structured memory. "Memory Fold ($a_t^{\text{fold}}$): A special action to compress the interaction history into a structured memory summary."
- Model Context Protocol (MCP): A paradigm for dynamically accessing diverse, non-preselected tools at inference time. "aligning with the emerging Model Context Protocol (MCP) paradigm."
- Multi-hop reasoning: Solving tasks that require multiple sequential inference steps and tool calls. "a multi-hop reasoning dataset with 3,912 locally executable tools"
- Open-set tool retrieval: Retrieving and using tools from a large, not pre-labeled pool during task execution. "open-set tool retrieval scenarios."
- Parametric knowledge: Information stored in the model’s parameters as opposed to external sources or tools. "models relying solely on parametric knowledge face inherent limitations"
- Pass@1: A metric reporting the success of the top (first) attempt at solving a task. "We report Pass@1 metric for all tasks."
- Policy (in RL): The agent’s action-selection function mapping states/histories to action probabilities. "driven by a policy parameterized by $\theta$"
- Probability ratio: The ratio of new-to-old policy probabilities for a token/action used in PPO-style updates. "$r_t(\theta)$ is the probability ratio for token $t$."
- Sequential decision-making process: A formalization where an agent takes a series of actions over time to maximize reward. "We frame the agent's task as a sequential decision-making process."
- Sparse reward: A reward signal that provides feedback only at the end or infrequently, making learning harder. "A sparse reward based solely on the final outcome is often insufficient to guarantee the accuracy of intermediate tool calls."
- Supervised Fine-Tuning (SFT): Training a model on labeled data to improve performance on targeted tasks. "Supervised Fine-Tuning (SFT)"
- Tool Memory: A structured record of which tools were used, how, and with what effectiveness. "Tool Memory: This consolidates all tool-related interactions"
- Tool Policy Optimization (ToolPO): The proposed RL algorithm that combines global and action-level rewards with advantage attribution. "We train DeepAgent end-to-end with Tool Policy Optimization (ToolPO), an RL approach designed for general tool-using agents."
- Tool retrieval: Selecting relevant tools from a large toolset based on a query or context. "Tool retrieval is performed using bge-large-en-v1.5"
- Tool Simulator: An LLM-based component that mimics real-world APIs to enable stable, low-cost RL training. "we develop an LLM-based Tool Simulator."
- ToolBench: A large-scale tool-use benchmark with over 16k APIs that stress-tests multi-step tool calling. "ToolBench, based on over 16,000 real-world APIs"
- ToolHop: A benchmark requiring sequences of tool calls (3–7 steps) to solve multi-hop tasks. "ToolHop, a multi-hop reasoning dataset"
- Trajectory: The sequence of states, actions, and observations generated during an episode. "The sequence of states, actions, and observations forms a trajectory $\tau$"
- Visual Question Answering (VQA): A task/tool where the agent answers questions about images. "Visual Question Answering (VQA)"
- WebShop: An online shopping environment for evaluating goal-directed browsing and purchasing via tools. "WebShop, an online shopping environment"
- Working Memory: The short-term memory that maintains the current sub-goals, obstacles, and near-term plans. "Working Memory: This contains the most recent information"