
Towards General Agentic Intelligence via Environment Scaling (2509.13311v1)

Published 16 Sep 2025 in cs.CL

Abstract: Advanced agentic intelligence is a prerequisite for deploying LLMs in practical, real-world applications. Diverse real-world APIs demand precise, robust function-calling intelligence, which needs agents to develop these capabilities through interaction in varied environments. The breadth of function-calling competence is closely tied to the diversity of environments in which agents are trained. In this work, we scale up environments as a step towards advancing general agentic intelligence. This gives rise to two central challenges: (i) how to scale environments in a principled manner, and (ii) how to effectively train agentic capabilities from experiences derived through interactions with these environments. To address these, we design a scalable framework that automatically constructs heterogeneous environments that are fully simulated, systematically broadening the space of function-calling scenarios. We further adapt a two-phase agent fine-tuning strategy: first endowing agents with fundamental agentic capabilities, then specializing them for domain-specific contexts. Extensive experiments on agentic benchmarks, tau-bench, tau2-Bench, and ACEBench, demonstrate that our trained model, AgentScaler, significantly enhances the function-calling capability of models.

Summary

  • The paper presents a novel method for constructing diverse, verifiable simulated environments that enhance function-calling capabilities in language models.
  • It implements a two-stage agent experience learning protocol that achieves state-of-the-art performance even in compact model regimes like AgentScaler-4B.
  • Empirical results on multiple benchmarks validate the framework’s effectiveness, while also revealing challenges in long-horizon tool calling and domain specialization.

Advancing General Agentic Intelligence via Environment Scaling

Introduction

The paper "Towards General Agentic Intelligence via Environment Scaling" (AgentScaler) presents a systematic framework for developing general agentic intelligence in LLMs by programmatically scaling and diversifying simulated environments. The central thesis is that robust function-calling capabilities in agents are tightly coupled to the diversity and verifiability of the environments in which they are trained. The authors introduce a pipeline that automates environment construction, synthesizes agentic tasks, and implements a two-stage agent experience learning protocol. The approach is validated on multiple agentic benchmarks, demonstrating strong performance, particularly in compact model regimes.

Environment Construction and Scaling

The environment scaling methodology is grounded in the abstraction of function calls as read–write operations over domain-specific databases. The pipeline begins with large-scale scenario collection, aggregating over 30,000 APIs from public and internal sources. These APIs are filtered and refined to ensure explicit input–output specifications and compositional compatibility.
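To make the read–write abstraction concrete, here is a minimal sketch under assumptions of my own; the Environment class, tool names, and database schema below are illustrative stand-ins, not the paper's implementation:

```python
# Illustrative sketch only: tools are executable functions that read from
# or write to a domain database representing the environment state.
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Environment:
    db: dict = field(default_factory=dict)  # domain database = state space

    def call(self, tool: Callable, **kwargs) -> Any:
        return tool(self.db, **kwargs)       # tools manipulate the db directly

def read_order(db: dict, order_id: str):     # read operation over the state
    return db["orders"].get(order_id)

def cancel_order(db: dict, order_id: str):   # write operation over the state
    db["orders"][order_id]["status"] = "cancelled"
    return {"ok": True}

env = Environment(db={"orders": {"O1": {"status": "placed"}}})
env.call(cancel_order, order_id="O1")
print(env.call(read_order, order_id="O1"))   # {'status': 'cancelled'}
```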

A tool dependency graph is constructed, where nodes represent tools and edges encode parameter-based compositional relationships. Louvain community detection is employed to partition the tool graph into coherent domains, each associated with a database schema that formalizes the environment's state space. Tools within each domain are programmatically materialized as executable code, enabling direct manipulation of the underlying database.

Figure 1: Overview of the automatic environment build and agentic task construction pipeline, illustrating scenario collection, tool graph modeling, and programmatic materialization.
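A rough sketch of this partitioning step, assuming networkx and a toy compatibility rule (an edge when one tool's output parameter can feed another's input; the tool specs are hypothetical):

```python
# Toy sketch of tool-graph partitioning with Louvain community detection.
# Tool specs and the edge rule are hypothetical stand-ins, not the paper's data.
import networkx as nx

tools = {
    "search_flights": {"out": ["flight_id"]},
    "book_seat":      {"in":  ["flight_id", "seat"]},
    "lookup_order":   {"out": ["order_id"]},
    "refund_order":   {"in":  ["order_id"]},
}

G = nx.Graph()
G.add_nodes_from(tools)
for a, sa in tools.items():
    for b, sb in tools.items():
        if a != b and set(sa.get("out", [])) & set(sb.get("in", [])):
            G.add_edge(a, b)  # b can consume an output of a

domains = nx.community.louvain_communities(G, seed=0)
print(domains)  # e.g. [{'book_seat', 'search_flights'}, {'lookup_order', 'refund_order'}]
```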

Agentic task construction proceeds via forward simulated agent–human interplay. Initial environment states are sampled for diversity, and tool sequences are generated by traversing the domain-specific tool graph. Each step involves argument generation and tool invocation, with state transitions tracked for verifiability at both the database and tool-sequence levels.
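A simplified sketch of that sampling loop follows; the graph, argument generator, and executor are placeholders, not the paper's pipeline:

```python
# Hypothetical sketch of task construction: walk the domain tool graph,
# generate arguments for each tool, execute it, and record state transitions.
import random

tool_graph = {"search_flights": ["choose_flight"],
              "choose_flight":  ["book_flight"],
              "book_flight":    []}

def gen_args(tool):                       # stand-in for LLM argument generation
    return {"seat": "12A"} if tool == "book_flight" else {}

def execute(tool, args, state):           # stand-in for the materialized tools
    return dict(state, last_tool=tool, **args)

def sample_task(start, state, rng=random.Random(0), max_len=5):
    trace, node = [], start
    while True:
        state = execute(node, gen_args(node), state)
        trace.append((node, state))       # transitions tracked for verifiability
        nxt = tool_graph[node]
        if not nxt or len(trace) >= max_len:
            return trace
        node = rng.choice(nxt)

for step in sample_task("search_flights", {"db": "initial"}):
    print(step)
```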

Agent Experience Learning

Agent experience is collected through simulated human–agent interactions within the constructed environments. A simulated user with a predefined intent interacts with the agent, which must leverage domain-specific tools to fulfill the task. Completed interaction traces are subjected to a rigorous three-stage filtering process: validity control, environment state alignment, and function calling exact match. This ensures high-fidelity supervision and robustness, even retaining trajectories with intermediate tool-call errors if the overall intent is achieved.

Figure 2: The agent interacts with the simulated user and changes the environment state through generated functions, enabling scalable experience collection.
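A compressed sketch of how such filtering might look; the trajectory format and individual checks are assumptions of this example, since the paper describes the criteria rather than code:

```python
# Hypothetical trajectory filter mirroring the three stages above:
# (1) validity control, (2) environment-state alignment, (3) exact match
# of the function-call sequence against the gold task specification.
def keep_trajectory(traj, gold):
    valid    = traj["completed"] and not traj["malformed"]      # stage 1
    state_ok = traj["final_db"] == gold["final_db"]             # stage 2
    calls_ok = traj["tool_calls"] == gold["tool_calls"]         # stage 3
    # Intermediate tool-call errors can be tolerated as long as the
    # final state and call sequence realize the intended task.
    return valid and state_ok and calls_ok

traj = {"completed": True, "malformed": False,
        "final_db": {"O1": "cancelled"},
        "tool_calls": [("cancel_order", {"order_id": "O1"})]}
gold = {"final_db": {"O1": "cancelled"}, "tool_calls": traj["tool_calls"]}
print(keep_trajectory(traj, gold))  # True
```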

The agentic experience learning protocol is two-phased. The first phase focuses on general tool-usage competence across broad domains, while the second phase specializes the agent in vertical domains with domain-specific tasks and tools. The training objective masks out human instructions and tool responses from the loss, propagating gradients only through assistant-generated tool calls and responses, thereby conditioning the model on context while optimizing for agentic actions.
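This kind of masking is commonly implemented by setting non-assistant token labels to the loss's ignore index; a minimal sketch, assuming a PyTorch-style cross-entropy loss (a standard pattern, not the paper's code):

```python
# Gradients flow only through assistant tokens; user instructions and tool
# responses are masked with -100, PyTorch's default ignore_index.
import torch

IGNORE_INDEX = -100

def build_labels(token_ids: torch.Tensor, roles: list[str]) -> torch.Tensor:
    """roles[i] names the source of token i: 'user', 'tool', or 'assistant'."""
    labels = token_ids.clone()
    for i, role in enumerate(roles):
        if role != "assistant":
            labels[i] = IGNORE_INDEX  # condition on context, no gradient
    return labels

ids = torch.tensor([11, 12, 13, 14])
print(build_labels(ids, ["user", "assistant", "tool", "assistant"]))
# tensor([-100,   12, -100,   14])
```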

Experimental Results

AgentScaler models (4B, 8B, 30B-A3B) are trained on Qwen-3 backbones and evaluated on τ-bench, τ²-Bench, and ACEBench. The results indicate that AgentScaler-30B-A3B achieves state-of-the-art performance among open-source models under 1T parameters, and in several domains approaches or matches closed-source systems. Notably, AgentScaler-4B demonstrates competitive performance with models several times larger, underscoring the efficiency of the environment scaling and experience learning pipeline.

Ablation studies confirm the necessity of the two-stage training protocol: general foundation learning is critical for tool-usage competence, and domain specialization further consolidates these capabilities.

Robustness, Generalization, and Long-Horizon Analysis

AgentScaler models exhibit strong robustness and generalization, as evidenced by out-of-distribution performance on ACEBench-zh. The synthetic data approach enables efficient knowledge transfer, with substantial gains in agentic capabilities for compact models.

Figure 3: pass^k metric results across all domains in τ²-Bench, demonstrating stability and consistency of AgentScaler models.

Stability analysis using the pass^k metric reveals that AgentScaler-30B-A3B consistently outperforms its backbone across all evaluated settings. However, a negative correlation is observed between the number of tool calls in a trajectory and task accuracy, indicating that long-horizon tool calling remains a fundamental challenge for agentic models.

Implications and Future Directions

The AgentScaler framework demonstrates that scalable, verifiable environment construction and principled agent experience learning can yield robust agentic intelligence in LLMs, even at modest parameter scales. This has significant practical implications for deploying agentic models in resource-constrained or latency-sensitive scenarios.

Theoretically, the work highlights the importance of environment diversity and verifiability for generalization and robustness in agentic intelligence. The pipeline's modularity suggests extensibility to broader modalities and real-world deployment.

Future research directions include integrating reinforcement learning atop the simulated environments, scaling to larger model architectures, and addressing the long-horizon tool-calling challenge. The authors also note the potential for edge deployment and broader applicability of compact agentic models.

Conclusion

This paper presents a principled approach to advancing general agentic intelligence via environment scaling and agent experience learning. The automated construction of diverse, verifiable environments and the two-stage training protocol enable efficient synthesis of agentic capabilities, validated by strong empirical results. The framework sets a foundation for scalable, robust, and generalizable agentic LLMs, with clear avenues for further research in RL integration, model scaling, and long-horizon reasoning.


Explain it Like I'm 14

What this paper is about (in simple terms)

The paper is about teaching AI “agents” (smart chatbots) to use tools and apps reliably in the real world. Think of an agent that can check flight times, book a seat, look up your order, or change a calendar—each of those is a tool call. The authors say the best way to make agents good at using many tools is to let them practice in many different, realistic “worlds.” They build a system called AgentScaler that creates lots of these practice worlds and trains agents on them.

The big questions the paper asks

  • How can we create many different, high‑quality practice environments where an agent can safely learn to use tools?
  • How can we train an agent from those practice experiences so it learns general tool-using skills (and also gets good in special areas like retail or airlines)?

How they did it (methods explained with simple ideas)

Imagine the agent’s world like a video game with a “world state” (what’s true right now). Tools are like apps the agent can use to read or change that world state (see the code sketch after the list below).

  • Read = checking information (like “What’s my next flight?”)
  • Write = changing information (like “Book seat 12A”)
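Here's a toy version of that idea in Python; the flight data and tool names are made up for illustration:

```python
# Toy world state: a plain dict plays the role of the game's save file.
db = {"user": {"next_flight": "F123"},
      "flights": {"F123": {"seat": None}}}

def whats_my_next_flight():            # read: looks at the world state
    return db["user"]["next_flight"]

def book_seat(flight_id, seat):        # write: changes the world state
    db["flights"][flight_id]["seat"] = seat
    return f"Seat {seat} booked on {flight_id}"

print(book_seat(whats_my_next_flight(), "12A"))  # Seat 12A booked on F123
print(db["flights"]["F123"])                     # {'seat': '12A'}
```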

To scale up training, the authors do two big things:

1) Build lots of simulated environments automatically

They start with a huge collection of real-world APIs (think: instructions for how to talk to apps). Then they:

  • Group similar tools into “domains” (like putting phone apps into folders: travel, shopping, telecom). They use a network method (Louvain community detection) to cluster tools that work well together.
  • For each domain, create a database that represents the world state (like a game save file).
  • Turn each tool into runnable code that either reads from or writes to that database. Now tool calls have real, checkable effects.
  • Generate practice tasks by sampling sensible sequences of tool uses (like “search flights -> choose flight -> book flight”), filling in arguments (dates, names, seats), and executing them to change the database. Because everything is simulated and grounded in a database, they can verify whether actions were correct.

2) Collect “agent experience” and train in two stages

They simulate a user with a goal (e.g., “I need to change my flight”), let the agent interact with the tools, and record the whole conversation plus tool calls. Then they filter the data carefully so only good, verifiable examples remain. Finally, they fine-tune the model in two steps:

  • Stage 1: General training across many domains (learn the basics of when and how to use tools, and how to talk to users about the results).
  • Stage 2: Specialize in a target domain (e.g., retail or airlines) so the agent gets extra good at one area.

A few helpful translations of terms:

  • API: a standard way for software to talk to another app (like giving the agent a phone number and script so it can “call” an app).
  • Environment/database: the “world” the agent can read and change.
  • Tool call/function call: asking an app to do something with specific inputs.
  • Verification: checking the database state and tool sequence to be sure the agent really did the right thing.

What they found (results) and why it matters

They trained several AgentScaler models (small to medium size) and tested them on well-known benchmarks that measure tool use:

  • τ-bench and τ²-Bench (customer-like tasks in retail, airline, telecom)
  • ACEBench (broader tool-usage tests, including English and Chinese versions)

Key takeaways:

  • Their models beat most other open-source models of similar or even much larger size. The 30B AgentScaler often performs close to very large or closed-source systems.
  • Even a small 4B model learned strong tool-using skills after this training pipeline, which is impressive for such a compact model.
  • The method generalizes well: performance stayed strong on out-of-distribution tests (like the Chinese ACEBench-zh), showing good robustness.
  • Stability is better but still a challenge: as you ask the same question multiple times, consistency can drop across all models—so this is an open problem in the field.
  • Long tool chains (many steps) are hard for everyone, including AgentScaler. Accuracy goes down as the number of tool calls grows. This highlights a key area to improve.

Why it’s important:

  • It shows you don’t need a trillion-parameter model to get great tool-using agents—smart training in rich environments can make smaller models very capable.
  • The pipeline is automatic and verifiable, making it practical to scale up training without tons of human labor.

What this could mean in the real world

If AI agents can reliably use tools across many domains, they can:

  • Help with customer support (check orders, update accounts, fix issues)
  • Plan travel (search flights, book seats, change reservations)
  • Manage tasks (calendar updates, reminders, data lookups)
  • Work faster and cheaper because smaller models can still perform well

The authors also point to future steps:

  • Add reinforcement learning (letting the agent learn by trying actions and getting rewards) on top of these simulated environments.
  • Expand to more domains and modalities (e.g., combining text with images or other inputs).
  • Tackle long, multi-step tool sequences to make agents handle complex, real-world workflows even better.

In short: AgentScaler shows that building lots of realistic, checkable practice worlds and training in two smart stages can make AI agents much better at using tools—without needing gigantic models.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper, to guide future research.

  • Sim-to-real transfer: How well do skills learned in fully simulated, DB-grounded environments transfer to real APIs that are non-deterministic, rate-limited, versioned, and require authentication, streaming, or asynchronous handling?
  • Real API execution: No evaluation with live MCP/OpenAPI endpoints. What is the degradation under real network latency, transient failures, pagination, throttling, and schema drift?
  • Distributional shift: The simulator enforces deterministic state transitions; real systems exhibit partial observability, delayed consistency, side effects, and noisy responses. How to model and train for these properties?
  • Safety and security: No analysis of prompt injection, tool misuse, data exfiltration, or malicious API responses within the environment. How to harden agents against adversarial tools and untrusted outputs?
  • Transactionality and idempotency: The DB abstraction omits transactional semantics (rollback, idempotency keys) common in real APIs. How should agents reason about retries, compensating actions, and exactly-once effects?
  • Asynchrony and events: Simulated calls are synchronous; real-world agents must handle callbacks, webhooks, streaming, and long-running jobs. How to extend the environment and training to event-driven workflows?
  • Multi-modality: The pipeline focuses on text and DB operations. How to incorporate images, audio, UI actions, and web navigation in the same scalable, verifiable framework?
  • Multi-agent coordination: No study of collaboration, delegation, or tool ownership across multiple agents. How to scale environments to multi-agent settings with shared state and role-based access?
  • Tool-graph construction fidelity: Dependency edges are initialized via parameter-text cosine similarity and then LLM-audited. What is the precision/recall of edges, and how sensitive is performance to the threshold τ, embedding choice, and auditing quality?
  • Cross-domain dependencies: Domains are disjoint communities; many real tasks span multiple domains (e.g., travel+payments+identity). How to represent and train cross-domain compositions and orchestration?
  • Tool schema induction: The method for inducing domain DB schemas from tool parameters is under-specified. How to verify correctness, normalization, and minimality of schemas at scale, and how do schema choices affect agent learning?
  • Programmatic materialization validity: “High degree of consistency” with τ-bench implementations is claimed via manual inspection; no quantitative code- or behavior-level agreement metrics are reported. How to automate unit/integration testing for generated tool code?
  • Argument generation realism: Parameter generation for tool calls lacks explicit validation against real usage distributions and constraints. How to ensure realistic value ranges, entity consistency, and referential integrity?
  • Sequence sampling bias: Tool sequences are sampled from the graph via directed walks; it is unclear if the sampling covers long-horizon, branching, and rare compositions. How to actively balance sequence length, difficulty, and coverage?
  • Verifiability scope: Environment-level verification checks state equality and exact call sequences; it does not validate semantic equivalence when multiple tool plans are correct. How to design verifiers tolerant to plan diversity while remaining precise?
  • Exact-match filtering brittleness: The strict sequence/argument exact-match filter likely discards alternative correct trajectories, biasing data toward canonical plans. Can relaxed or equivalence-class filters preserve diversity without harming supervision quality?
  • Error handling and recovery: Although errorful trajectories are kept, there is no targeted training/evaluation of diagnosis, retry, backoff, or fallback strategies. How to measure and improve robust recovery behavior?
  • Credit assignment and planning: The approach uses SFT only; no RL for long-horizon planning, tool budgeting, or non-myopic trade-offs. What RL formulations (e.g., environment rewards, curriculum) best improve multi-step tool chains?
  • Long-horizon compositionality: The paper shows accuracy declines with more tool calls but proposes no concrete mitigations. Which methods (hierarchical planning, graph-constrained decoding, search, tool-belief states) actually bend this curve?
  • Calibration and uncertainty: No evaluation of confidence estimation, abstention, or tool-call calibration. How to train calibrated agents that know when to call tools, when to stop, and when to ask for clarification?
  • Memory and state summarization: The framework doesn’t examine how memory mechanisms (summaries, scratchpads, slot-filling) help in long dialogues with evolving environment state. Which memory strategies provide stable gains?
  • Data scaling laws: No analysis of how performance scales with the number of environments, tools, trajectories, or sequence length. What are the data–model–compute scaling relationships specific to agentic tool use?
  • Component ablations: Limited ablations on the environment pipeline. What is the marginal contribution of (i) LLM edge refinement, (ii) schema induction, (iii) each filtering stage, and (iv) error-trajectory retention?
  • OOD generalization breadth: Evaluation covers τ/τ²/ACE and a single OOD (ACEBench-zh). How does the model perform on other multi-hop tool-use benchmarks (e.g., ToolHop) and entirely new domains/tools unseen in training?
  • Fairness of baselines: Inference-time settings (tool budgets, temperatures, system prompts, “thinking” modes) across baselines are not standardized. How sensitive are results to these knobs, and what is a fair comparison protocol?
  • Real-world cost and latency: The framework does not measure end-to-end costs (tool calls, tokens) or latency under practical constraints. How to optimize for resource-aware tool policies?
  • Continual learning and tool churn: Real APIs change. How can the agent and environment handle tool additions/removals, versioning, and continual fine-tuning without catastrophic forgetting?
  • API documentation grounding: Agents are trained on interactions, not on reading API docs at inference time. Would retrieval-augmented API doc grounding improve zero-shot tool adoption?
  • Security of execution: Even if “fully simulated,” executing generated code can pose risks. What sandboxing, capability restrictions, and auditing are required for safe large-scale materialization?
  • Data leakage risks: The simulator’s schemas and tools may overlap with evaluation designs (e.g., τ-bench) due to “high consistency.” How is leakage prevented, and what strict isolation protocols ensure clean evaluation?
  • Reproducibility details: Key hyperparameters (graph thresholds, schema rules), dataset sizes after each filter, and environment-generation code are not fully specified. What exact settings are needed to reproduce results?
  • Licensing and release: It’s unclear which APIs, tools, and environments (especially internal repositories) will be released and under what licenses, limiting community validation and extension.
  • Multi-lingual and code-switching: Aside from ACEBench-zh, broader multilingual coverage, code-switching, and locale-specific tool behavior (date/time/currency formats) remain unexplored.
  • Complex constraints and compliance: No modeling of compliance constraints (PII, KYC/AML, HIPAA/GDPR). How to encode and enforce policy constraints in planning and tool selection?
  • Evaluation metrics: Reliance on pass@1/accuracy underplays user-centric metrics (task success under failure modes, satisfaction, safety violations). What richer, reliability-aware metrics should be adopted?
  • Orchestration architectures: The work uses monolithic SFT; it does not explore specialized routers, tool planners, or modular controllers. Which architectures best leverage the scaled environments?

