EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

Published 18 May 2026 in cs.CL and cs.LG | (2605.18703v1)

Abstract: Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches depend on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or depend on pre-collected documents. Moreover, synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, reducing their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Despite using significantly fewer environments than prior work, which are often 5 times more, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including $τ^2$-Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.


Authors (15)

Summary

The paper introduces a fully automated pipeline for synthesizing verified, realistic tool-use environments that minimize reliance on over-specified synthetic data.
It proposes a dependency-aware trajectory synthesis method that recursively resolves tool input dependencies to mimic authentic multi-turn interactions.
Empirical results show improved training efficiency and downstream performance, with gains of up to 15% on BFCLv3, while using fewer synthetic environments than baselines.

EnvFactory: Automated Synthesis of Executable Environments and Robust RL for Scalable Tool-Use Agents

Problem Setting and Motivation

Recent advances in LLMs reveal strong potential for tool-augmented agents capable of performing complex, compositional tasks through interactive use of external tools. However, reinforcement learning frameworks for such agents (Agentic RL) are acutely limited by two persistent bottlenecks: (1) a lack of scalable, robust, and verifiable executable tool environments, and (2) the unavailability of training data that faithfully captures naturalistic, multi-turn, and implicitly reasoned human interaction patterns. Existing paradigms—reliant on expensive production APIs, hallucination-prone LLM simulators, or synthetic tool sandboxes built from documentation—fail to address these limitations at scale. Additionally, synthetic data often consists of rigid, over-specified instruction lists that do not reflect the conciseness, ambiguity, and contextual dependencies of real-world user behavior.

Methodology

Autonomous Environment Generation

EnvFactory introduces a fully automated pipeline for constructing verified, stateful, and executable environments, each representing a realistic tool-use ecosystem (e.g., commerce, finance, travel). The pipeline combines three agentic components:

Search Agent: Identifies functional gaps, retrieves authentic online resources, and drafts environment metadata encompassing tool definitions, schemas, and descriptions.
Code Agent: Derives stateful Pydantic-based database schemas, implements production-quality Python classes with standardized MCP interfaces, and guarantees schema-level validation.
Test Agent: Generates diverse, type-checked scenario data and comprehensive test cases; executes iterative revision until all tools satisfy interface, behavior, and state-consistency criteria.

This pipeline yields robust, independently verifiable environments capable of supporting complex tool-agent interactions. Unlike AutoForge (Cai et al., 28 Dec 2025), AgentScaler (Fang et al., 16 Sep 2025), and similar work, EnvFactory’s pipeline autonomously recovers real-world tool ecosystems, not only those pre-curated via documents or static task sets.
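To make the idea of a stateful, verifiable environment concrete, here is a minimal sketch. It is not EnvFactory's code: the `CommerceEnv` class, its two tools, and the `check_state_consistency` test are hypothetical, and standard-library dataclasses with manual checks stand in for the Pydantic schemas and MCP interfaces the paper describes.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    order_id: str
    item: str
    quantity: int
    status: str = "placed"


@dataclass
class CommerceEnv:
    """Hypothetical stateful environment: tools read and mutate `orders`."""
    orders: dict[str, Order] = field(default_factory=dict)

    def place_order(self, item: str, quantity: int) -> dict:
        # Manual validation, standing in for schema-level checks (assumption).
        if not isinstance(item, str) or not item:
            return {"ok": False, "error": "item must be a non-empty string"}
        if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity <= 0:
            return {"ok": False, "error": "quantity must be a positive integer"}
        order_id = f"ord-{len(self.orders) + 1}"
        self.orders[order_id] = Order(order_id, item, quantity)
        return {"ok": True, "order_id": order_id}

    def cancel_order(self, order_id: str) -> dict:
        order = self.orders.get(order_id)
        if order is None:
            return {"ok": False, "error": "unknown order_id"}
        if order.status == "cancelled":
            return {"ok": False, "error": "order already cancelled"}
        order.status = "cancelled"
        return {"ok": True, "order_id": order_id}


def check_state_consistency() -> bool:
    """A Test-Agent-style case covering interface, behavior, and state."""
    env = CommerceEnv()
    placed = env.place_order("notebook", 2)
    rejected = env.place_order("notebook", 0)        # invalid input must not mutate state
    cancelled = env.cancel_order(placed["order_id"])
    repeated = env.cancel_order(placed["order_id"])  # second cancel must fail
    return (
        placed["ok"] and not rejected["ok"]
        and cancelled["ok"] and not repeated["ok"]
        and len(env.orders) == 1
        and env.orders[placed["order_id"]].status == "cancelled"
    )
```

In a pipeline of the kind described above, a failing check like this would presumably trigger another revision round of the generated tool code rather than being accepted as is.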

Dependency-Aware and Topology-Guided Trajectory Synthesis

To overcome the limitations of semantically shallow or instruction-like synthetic queries, EnvFactory synthesizes tool-use trajectories via a tool dependency graph $G$ over all tools and parameters. Semantic and LLM-based logical matching ensures that both direct and latent parameter dependencies are encoded. Based on $G$, a topology-aware sampling process recursively resolves tool input dependencies, ensuring that every required input of each tool in a sampled chain is either user-providable or supplied by the output of a preceding tool. This enables complex, non-linear, multi-step trajectories representing realistic user goals.
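The recursive resolution step described above can be sketched as a depth-first walk over a parameter-level dependency graph. This is an illustrative sketch only: the `Tool` record, the example tools, and exact name matching between outputs and inputs are assumptions (the paper uses semantic and LLM-based matching to build $G$), and the returned chain is valid but not guaranteed to be minimal.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def resolve_chain(target, tools, user_params, rng=None):
    """Return a tool order in which every input of every tool is either
    user-providable or produced by an earlier tool; None if unresolvable."""
    rng = rng or random.Random(0)
    by_name = {t.name: t for t in tools}
    producers = {}  # parameter name -> tools whose outputs supply it
    for tool in tools:
        for param in tool.outputs:
            producers.setdefault(param, []).append(tool)

    chain, done, visiting = [], set(), set()

    def visit(tool):
        if tool.name in done:
            return True
        if tool.name in visiting:  # cyclic dependency: abandon this path
            return False
        visiting.add(tool.name)
        for param in tool.inputs:
            if param in user_params:
                continue
            candidates = list(producers.get(param, []))
            rng.shuffle(candidates)  # sample among alternative producers
            if not any(visit(c) for c in candidates):
                visiting.discard(tool.name)
                return False
        visiting.discard(tool.name)
        done.add(tool.name)
        chain.append(tool.name)  # appended only after its dependencies
        return True

    return chain if visit(by_name[target]) else None


# Hypothetical travel-domain tools, invented for illustration.
TOOLS = [
    Tool("search_flights", ("origin", "destination", "date"), ("flight_id",)),
    Tool("get_user_profile", ("user_id",), ("passenger_id",)),
    Tool("book_flight", ("flight_id", "passenger_id"), ("booking_id",)),
    Tool("send_confirmation", ("booking_id", "email"), ("message_id",)),
]
USER_PARAMS = {"origin", "destination", "date", "user_id"}

if __name__ == "__main__":
    print(resolve_chain("book_flight", TOOLS, USER_PARAMS))
    print(resolve_chain("send_confirmation", TOOLS, USER_PARAMS))
```

In this toy setup `book_flight` resolves to a three-tool chain, while `send_confirmation` is unresolvable until `email` is treated as user-providable, which mirrors the requirement that no tool in a sampled chain is left with an unsupplied input.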

QueryGen incorporates calibrated, multi-stage refinement to inject:

Implicit references, contextual ambiguity, and omission of deducible arguments
Pragmatic action compression and goal expansion
Plausible subgoals, scenario-motivated plans, and real user communication patterns

Verified, scenario-driven agent–user interactions are conducted in sandboxed environments, producing multi-turn, multi-step, non-deterministic “ground-truth” agentic trajectories for both SFT and RL.

Composite Reward and Post-Training

Post-training leverages both supervised fine-tuning (SFT) on multi-turn dialogue and RL in constructed environments. Reward signals integrate trajectory fidelity, final state equivalence, and explicit length penalties to handle non-uniqueness and non-determinism inherent in real tool use:

$R = \alpha R_\text{traj} + (1-\alpha) R_\text{state} - \gamma P_\text{length}$

Balanced reward weighting is empirically found to optimize performance, as ablation studies show that relying solely on either trajectory or state reduces both task completion and generalization.
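A minimal sketch of how such a composite reward might be computed is given below. The summary does not define the individual terms, so everything here is an assumption for illustration: $R_\text{traj}$ is taken as the longest-common-subsequence overlap with a reference call sequence, $R_\text{state}$ as exact equality of final states, $P_\text{length}$ as the capped fraction of surplus calls, and the defaults `alpha=0.5` and `gamma=0.1` are placeholders, not the paper's values.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def composite_reward(pred_calls, ref_calls, pred_state, ref_state,
                     alpha=0.5, gamma=0.1):
    """R = alpha * R_traj + (1 - alpha) * R_state - gamma * P_length.

    All three terms use assumed definitions (see the lead-in above)."""
    if not ref_calls:
        raise ValueError("reference trajectory must contain at least one call")
    # Trajectory fidelity: in-order overlap with the reference calls.
    r_traj = lcs_length(pred_calls, ref_calls) / len(ref_calls)
    # State equivalence: credit any trajectory that reaches the same final state.
    r_state = 1.0 if pred_state == ref_state else 0.0
    # Length penalty: only calls beyond the reference length are penalized.
    surplus = max(0, len(pred_calls) - len(ref_calls))
    p_length = min(1.0, surplus / len(ref_calls))
    return alpha * r_traj + (1 - alpha) * r_state - gamma * p_length


if __name__ == "__main__":
    ref = ["search", "select", "book"]
    state = {"booking": "confirmed"}
    print(composite_reward(ref, ref, state, state))
    print(composite_reward(["select", "search", "book"], ref, state, state))
```

Under these assumptions, a trajectory that reorders calls but reaches the reference final state still earns the full state term, which is one way a state-based component can accommodate the non-uniqueness of valid tool-call sequences.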

Experimental Results

Training Efficiency and Downstream Performance

EnvFactory constructs 85 verified executable environments across 7 domains (e.g., office, finance, travel). It generates 1,622 SFT and 953 RL trajectories (2,575 in total), using fewer environments and tasks than baselines such as EnvScaler (191 environments, 11,572 tasks) or AWM (526 environments, 3,315 tasks). Despite this, models post-trained with EnvFactory’s data consistently outperform both SFT-only and RL-enhanced baselines across all major tool-use benchmarks:

Key Out-of-the-Box Performance Gains (Qwen3-series):

BFCLv3 (multi-turn): Qwen3-4B increases from 33.50 to 48.50 (+15.0 points absolute), while using fewer synthetic environments than prior work
MCP-Atlas (pass rate): Qwen3-8B rises from 5.15 to 13.75 (+8.6 points absolute)
Conversational Benchmarks ($τ^2$-Bench/VitaBench): Consistent improvements (+6% absolute on average), especially pronounced in settings with high ambiguity or implicit user intent
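Because the reported gains are differences between benchmark scores, it can help to separate absolute from relative improvement. The short sketch below recomputes both from the before/after figures quoted in the list above; the `gains` helper is hypothetical and only performs this arithmetic.

```python
def gains(before, after):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)


# Before/after scores as quoted in the summary.
REPORTED = {
    "BFCLv3 multi-turn (Qwen3-4B)": (33.50, 48.50),
    "MCP-Atlas pass rate (Qwen3-8B)": (5.15, 13.75),
}

if __name__ == "__main__":
    for name, (before, after) in REPORTED.items():
        absolute, relative = gains(before, after)
        print(f"{name}: +{absolute} points absolute, +{relative}% relative")
```

The headline figures of +15% and +8.6% correspond to the absolute point differences; the relative improvements over the base scores are considerably larger, particularly on MCP-Atlas, where the starting pass rate is low.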

These gains are robust to model size, with strong improvements in both 1.7B and 8B parameter regimes. SFT alone delivers substantial cold-start performance; RL post-training further increases robustness and compositional reasoning depth.

Generalization and Data Efficiency

Scaling the number of environments shows diminishing returns beyond 75–85 environments, indicating that EnvFactory’s automated, authenticity-driven environment generation captures sufficient domain and tool diversity early. The strong performance achieved with only a fraction of the synthetic environments used by competing methods highlights both the data efficiency and the higher supervisory quality of EnvFactory’s output.

Ablations

Omitting the topology-aware sampling or refinement stages during query synthesis degrades performance, especially on tasks where user intent is ambiguous or contextually specified (e.g., missing-function and missing-parameter settings). SFT initialization is also critical, as direct RL from synthetic data suffers instability and lower coverage; this aligns with prior RLHF findings.

Theoretical and Practical Implications

Theoretically, EnvFactory operationalizes the synthesis of Markovian, stateful, and compositional tool ecosystems from real-world online data—advancing the simulation infrastructure required for robust Agentic RL at scale. Its recursive, dependency-aware planning for trajectory generation more faithfully models the logical and temporal structures underlying complex human–tool interactions, thus facilitating the emergence of robust policy learning.

Practically, EnvFactory dramatically lowers the cost and barriers to developing and scaling tool-use-capable agents. It enables the research community to iterate over diverse, compositional, and realistic environments without relying on unstable simulators or proprietary production APIs. This both democratizes and systematizes reproducible research in this domain.

Risks and Limitations: While the system improves safety and coverage for benign agent deployment (office, finance, travel, etc.), its automated environment synthesis could be repurposed for malicious domains (e.g., financial fraud, phishing). Mitigation requires transparent documentation, restricted artifacts, and integration of safety constraints. Additionally, session isolation and per-connection resource requirements may constrain synthetic throughput, but are partially offset by asynchronous pipeline execution.

Conclusion

EnvFactory is a comprehensive, automated framework for scaling up tool-use agents via real-world environment synthesis and robust RL. By combining autonomous executable environment construction, dependency-aware multi-turn trajectory generation, and reward-aligned RL, it achieves strong empirical gains in both training efficiency and agentic task performance, outperforming methods that use several times more synthetic environments. The methodological contributions extend the foundation for AI agents capable of nuanced, contextual, and robust tool utilization, and provide a scalable alternative to expensive or ungrounded approaches. Future research will likely explore even richer environment topologies, dynamic environment mutation, and further transfer to real-world, production-grade API ecosystems.