EnvFactory: Automated Verified RL Environments
- EnvFactory is an automated framework for Agentic RL that generates verified, executable tool environments and realistic multi-turn trajectories.
- It integrates EnvGen and QueryGen to autonomously construct stateful environments and synthesize human-like, calibrated multi-step queries from authentic online data.
- Topology-aware sampling and dependency graph construction ensure validated workflows and scalable supervision, boosting performance on diverse benchmarks.
EnvFactory is a fully automated framework for Agentic RL that targets two bottlenecks in tool-use agent training: scalable, robust execution environments and realistic training data that reflects implicit human reasoning. It combines EnvGen, which autonomously constructs stateful, executable tool environments from authentic online resources, with QueryGen, which synthesizes natural multi-turn, multi-step tool-use trajectories from those environments through topology-aware sampling and calibrated refinement. The framework is presented as an end-to-end pipeline for producing verified environments and grounded trajectories suitable for SFT and RL post-training, and is evaluated on Qwen3-series models with reported gains on BFCLv3, MCP-Atlas, τ²-Bench, and VitaBench (Xu et al., 18 May 2026).
1. Problem setting and design motivation
EnvFactory is situated within the broader problem of equipping LLMs with tool-use capabilities via Agentic RL. The motivating claim is that current Agentic RL is bottlenecked not only by model optimization, but by the quality and scalability of the environments and trajectories used for post-training. The framework is proposed in response to two shortcomings identified in prior pipelines: the environment bottleneck and the data realism bottleneck (Xu et al., 18 May 2026).
The environment bottleneck is described in terms of three existing environment types. Production environments, such as real APIs or MCP servers, are authentic but costly, latency-prone, and hard to scale. Simulated environments based on LLM simulators are fast but hallucination-prone and unstable for RL. Synthetic environments implemented as sandbox code or reconstructed tools offer better scalability, but are often stateless, dependent on pre-collected documents or fixed task sets, and not generalizable to unseen tool ecosystems.
The data realism bottleneck concerns the character of synthetic trajectories. The paper states that these trajectories often become over-specified: they explicitly enumerate all requirements, resemble instruction lists, and fail to capture implicit intent, ambiguity, contextual references, and natural human brevity. In this formulation, the challenge is not merely to generate more tool traces, but to generate grounded queries with implicit intents that are useful for robust tool-use learning.
A plausible implication is that EnvFactory is designed to shift the unit of scaling from raw trajectory count to verified environment quality and interaction realism. This interpretation is consistent with the paper’s emphasis on executable verification, statefulness, topology-aware dependency handling, and calibrated query refinement.
2. Formal structure of environments and environment synthesis
EnvGen begins from an empty environment set and constructs environments one by one. Each environment is defined as a tuple of four components: metadata, including descriptions, tool definitions, and tool schemas; a stateful database schema; an executable Python implementation; and the tool interface exposed to agents, including tool names, descriptions, parameters, and the default MCP interface. The global toolset is defined over the tools exposed by the constructed environments (Xu et al., 18 May 2026).
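The four-component structure can be pictured as a small record type. The following is a minimal sketch, not the paper's implementation: the field names, the `name` key in the metadata, and the namespaced form of the global toolset are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Environment:
    # Hypothetical field names; the paper's own notation is not reproduced here.
    metadata: Dict[str, Any]                       # descriptions, tool definitions, schemas
    db_schema: Dict[str, List[str]]                # stateful database schema: table -> columns
    implementation: Dict[str, Callable[..., Any]]  # executable Python logic per tool
    interface: List[Dict[str, Any]]                # tool names and parameters exposed to agents


def global_toolset(envs: List[Environment]) -> List[str]:
    """Collect every exposed tool, namespaced by environment (an assumed convention)."""
    return [
        f"{env.metadata['name']}.{tool['name']}"
        for env in envs
        for tool in env.interface
    ]


def add_item(db: Dict[str, list], item: str, qty: int) -> None:
    """Toy tool that mutates the environment state."""
    db.setdefault("cart", []).append((item, qty))


demo = Environment(
    metadata={"name": "shop", "description": "toy store"},
    db_schema={"cart": ["item", "qty"]},
    implementation={"add_item": add_item},
    interface=[{"name": "add_item", "parameters": ["item", "qty"]}],
)
toolset = global_toolset([demo])
```

Keeping the interface separate from the implementation mirrors the text's distinction between what agents see and what actually executes against the database.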
The synthesis procedure has four stages. In the proposal and sketch stage, a Search Agent inspects current environment coverage, identifies missing domains or capabilities, searches authentic external sources such as API documentation, technical reports, and usage examples, and drafts candidate environments grounded in real, widely applicable functionality. The output is structured metadata, which serves as the blueprint for construction.
In database modeling, a Code Agent derives the stateful schema from the metadata, including entities, relationships, mutable state, tool parameters, intermediate states, and persistent records. In code implementation, the same agent implements the executable Python logic for every tool so that the tools align with the metadata, obey schema constraints, and correctly mutate the environment state.
The final stage is a revision loop. A Test Agent validates the environment with unit tests and checks four criteria: tool interfaces are consistent with the metadata, tools import and execute successfully, outputs match expected behavior, and database states transition correctly after tool use. If validation fails, the Test Agent generates a structured error report, the Code Agent revises the relevant component, and the environment is rebuilt. The loop continues until all tests pass or the revision budget is exhausted.
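The control flow of the revision loop can be sketched as follows. This is an illustrative outline under stated assumptions: `build_with_revision`, `toy_build`, and `toy_tests` are hypothetical names, and plain callables stand in for the Code Agent and Test Agent, which in the paper are LLM-driven.

```python
from typing import Callable, List, Optional, Tuple


def build_with_revision(
    build: Callable[[Optional[List[str]]], dict],
    run_tests: Callable[[dict], List[str]],
    max_revisions: int,
) -> Tuple[dict, bool, int]:
    """Rebuild an environment until its tests pass or the budget is spent.

    `build` receives the previous error report (None on the first attempt);
    `run_tests` returns a list of error messages, empty when all checks pass.
    Returns the last environment, a pass flag, and the revisions used.
    """
    errors: Optional[List[str]] = None
    env: dict = {}
    for attempt in range(max_revisions + 1):
        env = build(errors)
        errors = run_tests(env)
        if not errors:
            return env, True, attempt
    return env, False, max_revisions


def toy_build(report):
    # The first draft returns a string; the revision fixes it once an error is reported.
    if report:
        return {"add": lambda a, b: a + b}
    return {"add": lambda a, b: str(a + b)}


def toy_tests(env):
    return [] if env["add"](1, 2) == 3 else ["add: expected 3"]


env, ok, revisions = build_with_revision(toy_build, toy_tests, max_revisions=2)
```

In this toy run the first build fails its test, the error report triggers one revision, and the second build passes.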
The verification output is described as cross-validated across all components. This is central to the framework’s identity: EnvFactory is not limited to generating tool descriptions, but generates verified, runnable tool systems for stable RL training.
3. Dependency graph construction and topology-aware sampling
After environment construction, EnvFactory builds a tool dependency graph in which the nodes are tools. The graph representation is fine-grained in the sense that it models not only tools as nodes, but also parameters, with the stated purpose of capturing dependencies more precisely (Xu et al., 18 May 2026).
Graph construction proceeds in two steps. First, EnvFactory performs semantic parameter matching using the BAAI/bge-m3 embedding model. It encodes input and output parameters and, for each ordered pair of tools, computes the cosine similarity between the outputs of the first tool and the inputs of the second. If the similarity exceeds a threshold, a directed edge from the first tool to the second is added. Second, an LLM performs logical dependency refinement by adding missing logical edges and removing spurious semantic edges. The paper notes that this is especially important for parameter-less tools, tools with implicit dependencies, and tools in the same functional pipeline but with different signatures.
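The semantic matching step can be illustrated with a small sketch. Several things here are assumptions rather than the paper's method: `toy_embed` is a bag-of-words stand-in for the BAAI/bge-m3 embeddings, the best-matching output/input pair decides the edge, and the threshold value is arbitrary. The LLM refinement step is not modelled.

```python
import math
from collections import Counter
from itertools import permutations


def toy_embed(text):
    """Stand-in for a real embedding model: counts of lowercase words."""
    return Counter(text.lower().replace("_", " ").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_edges(tools, threshold):
    """Add edge (src, dst) when some output of src resembles some input of dst."""
    edges = set()
    for src, dst in permutations(tools, 2):
        best = max(
            (
                cosine(toy_embed(out), toy_embed(inp))
                for out in tools[src]["outputs"]
                for inp in tools[dst]["inputs"]
            ),
            default=0.0,
        )
        if best > threshold:
            edges.add((src, dst))
    return edges


tools = {
    "search_flights": {"inputs": ["origin city", "destination city"], "outputs": ["flight id"]},
    "book_flight": {"inputs": ["flight id", "passenger name"], "outputs": ["booking id"]},
    "cancel_booking": {"inputs": ["booking id"], "outputs": ["refund amount"]},
}
edges = semantic_edges(tools, threshold=0.8)
```

With these toy tools, only the exact parameter matches clear the threshold, so the graph forms the chain search, book, cancel. The partial overlap between "flight id" and "booking id" scores 0.5 and is rejected, which is the kind of near-miss the refinement step is meant to adjudicate.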
Topology-aware sampling is the mechanism for creating realistic tool sequences from this graph. Its stated motivation is that naive random walks are insufficient because real workflows are nonlinear, tools often require outputs from multiple prior tools, and some parameters are user-supplied whereas others are only available internally from tool outputs. EnvFactory therefore enforces a core constraint: every required input parameter of a sampled tool must be either provided by the user or derived from outputs of earlier sampled tools.
To operationalize this, an LLM classifies input parameters into external parameters, which can be naturally provided by a user, and internal parameters, which must come from prior tool outputs. A parameter is valid if it is optional, externally providable, or internally satisfiable from previous tools. If it is not valid, the sampler recursively looks backward in the graph for tools that can produce it. The algorithmic details specify a cap on recursion depth, a stochastic override probability, and forward expansion by sampling one outgoing neighbor uniformly. Elsewhere in the description, the sampler is also said to select a bounded number of outgoing neighbors, rather than exactly one, uniformly at random after resolving dependencies. The presence of both phrasings indicates a branching-capable sampling strategy whose precise operationalization is implementation-specific within the paper’s appendix and method description.
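The validity rule and the backward search can be sketched in a few lines. This is a simplified illustration, not the paper's algorithm: parameter kinds are given as labels instead of being classified by an LLM, the `producers` map and the depth cap value are hypothetical, and the stochastic override and forward expansion are omitted.

```python
import random


def resolve_chain(target, tools, producers, rng, max_depth):
    """Order tools so each required internal input is produced by an earlier tool.

    Optional and external inputs are valid without a producer.
    """
    chain, available = [], set()

    def visit(name, depth):
        if name in chain:
            return
        for param, kind in tools[name]["inputs"].items():
            if kind in ("optional", "external") or param in available:
                continue  # valid parameter: nothing to resolve
            if depth >= max_depth:
                raise ValueError(f"cannot satisfy {param!r} within the depth cap")
            candidates = sorted(producers.get(param, []))
            if not candidates:
                raise ValueError(f"no tool produces {param!r}")
            # Look backward: pick one producer and resolve its own inputs first.
            visit(rng.choice(candidates), depth + 1)
        chain.append(name)
        available.update(tools[name]["outputs"])

    visit(target, 0)
    return chain


tools = {
    "search_flights": {"inputs": {"origin": "external", "date": "external"}, "outputs": ["flight_id"]},
    "get_user": {"inputs": {"email": "external"}, "outputs": ["user_id"]},
    "book_flight": {
        "inputs": {"flight_id": "internal", "user_id": "internal", "seat": "optional"},
        "outputs": ["booking_id"],
    },
}
producers = {"flight_id": ["search_flights"], "user_id": ["get_user"]}
chain = resolve_chain("book_flight", tools, producers, random.Random(0), max_depth=3)
```

Here `book_flight` needs outputs from two earlier tools, so the resolved order is nonlinear in the sense the text describes: it cannot be produced by a single-predecessor random walk.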
This sampling design is significant because it constrains trajectory generation by prerequisite structure rather than by surface similarity alone. A plausible implication is that the resulting trajectories more faithfully approximate executable workflows than chain-like or weakly structured synthetic paths.
4. Query synthesis, calibrated refinement, and grounded trajectory generation
Given a sampled tool chain, QueryGen synthesizes multi-turn, multi-step conversations. The planning stage constructs a user profile and scenario, derives a database state consistent with the schema, and partitions the tool chain into multiple dialogue turns, with each turn assigned a bounded number of tools. The generation stage then produces a query for each turn conditioned on the current database state, dialogue history, and sampled tools. This generation is decomposed into subgoal decomposition, which breaks tools into fine-grained subgoals or intents, and goal articulation, which composes natural language requests from those subgoals (Xu et al., 18 May 2026).
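The partitioning of a tool chain into dialogue turns can be pictured with a short sketch. It assumes that turns are contiguous slices of the chain and that turn sizes are drawn uniformly from a range; both are illustrative assumptions, and the bounds used below are hypothetical values, not those of the paper.

```python
import random


def partition_into_turns(chain, min_tools, max_tools, rng):
    """Split a tool chain into consecutive turns of bounded size.

    The final turn may be shorter than `min_tools` if the chain runs out.
    """
    turns, start = [], 0
    while start < len(chain):
        size = rng.randint(min_tools, max_tools)
        turns.append(chain[start:start + size])
        start += size
    return turns


chain = ["search_flights", "get_user", "book_flight", "add_baggage", "send_receipt"]
turns = partition_into_turns(chain, min_tools=1, max_tools=3, rng=random.Random(0))
```

Preserving chain order inside and across turns keeps every tool's prerequisites in an earlier or the same turn, so later queries can refer back to results already in the dialogue history.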
Calibrated refinement is presented as the realism mechanism that transforms rigid, over-specified synthetic prompts into natural human queries. It applies four edits: implicit reference, action compression, ambiguity introduction, and goal expansion. Implicit reference replaces explicit identifiers with contextual references and omits deducible parameters. Action compression removes logically inferable intermediate steps. Ambiguity introduction adds reasonable referential ambiguity. Goal expansion adds plausible secondary objectives. The resulting queries are characterized as shorter, more human-like, more ambiguous, and more realistic for Agentic RL.
Ground-truth trajectory generation is carried out through sandboxed agent-user interactions. The agent uses tools, the user responds with clarifications or required information, and the conversation continues until termination or the maximum number of steps is reached. For each query, EnvFactory independently generates multiple candidate solution trajectories to cover different valid execution paths. These candidates are evaluated using the corresponding database state changes, and the best one is selected. Post-processing removes redundant tool calls, filters unnecessary user interactions, and masks arguments whose values do not affect correctness.
The framework draws an explicit distinction between over-specified synthetic trajectories and grounded queries with implicit intents. Over-specified trajectories expose all reasoning steps and enumerate task requirements, making them easier to solve but less realistic. Grounded queries contain missing, contextual, or ambiguous elements and force reasoning over user intent and environment state. In the paper’s framing, this difference is central because the objective is to generate human-like interaction patterns rather than only executable task scripts.
5. Training pipeline, optimization setup, and reward design
EnvFactory uses the synthesized data for both SFT and RL. In the SFT stage, training is initialized with user interaction trajectories, each tool-call or user-interaction step is treated as a training sample, and failed tool calls are filtered out. In the RL stage, only tool-call trajectories are used, each interaction turn is treated as a sample, and the implementation uses GRPO in VeRL (Xu et al., 18 May 2026).
The appendix specifies the RL hyperparameters: learning rate, rollout size, batch size, maximum trajectory length and maximum generation length (both in thousands of tokens), and number of training epochs. The SFT configuration uses LlamaFactory, with its own learning rate, batch size, and epoch count. The data synthesis models are specified as Kimi-K2-Thinking for EnvGen, DeepSeek-V3.2-Chat for QueryGen RL trajectories, and Qwen3-30B-A3B-Thinking-2507 for QueryGen SFT trajectories. Evaluation inference uses SGLang, with separate temperature settings for non-thinking and thinking models and one further inference setting.
The RL stage employs a composite reward because valid tool-use executions are often non-unique. The paper identifies three reward components: trajectory-based reward, state-based reward, and length penalty. The ambiguity motivating this design includes read-only tools that can be called in different orders, arguments that may vary without changing correctness, and multiple valid trajectories that reach the same final state. The intended reward combines trajectory match, final database-state equivalence, and a penalty for unnecessarily long trajectories, with weighting coefficients balancing the three terms.
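One way to picture a reward with these three terms is the sketch below. It is a hypothetical formulation, not the paper's: the order-insensitive call matching, the binary state check, the form of the length penalty, and the default weights are all assumptions chosen for illustration.

```python
from collections import Counter


def call_key(call):
    """Hashable identity of a tool call: name plus sorted arguments."""
    return (call["name"], tuple(sorted(call["args"].items())))


def composite_reward(pred_calls, ref_calls, pred_state, ref_state,
                     w_traj=0.5, w_state=0.5, w_len=0.1):
    """Weighted trajectory match plus state match, minus a length penalty."""
    pred = Counter(map(call_key, pred_calls))
    ref = Counter(map(call_key, ref_calls))
    # Trajectory term: share of reference calls reproduced, ignoring order.
    traj = sum((pred & ref).values()) / max(len(ref_calls), 1)
    # State term: the final database must equal the reference final state.
    state = 1.0 if pred_state == ref_state else 0.0
    # Length term: fraction of predicted calls beyond the reference length.
    extra = max(len(pred_calls) - len(ref_calls), 0)
    length_penalty = extra / max(len(pred_calls), 1)
    return w_traj * traj + w_state * state - w_len * length_penalty


get_cart = {"name": "get_cart", "args": {"user": 1}}
checkout = {"name": "checkout", "args": {"user": 1}}
final_state = {"orders": [101]}

exact = composite_reward([get_cart, checkout], [get_cart, checkout], final_state, final_state)
padded = composite_reward([get_cart, get_cart, checkout], [get_cart, checkout], final_state, final_state)
```

The padded rollout reaches the same final state and reproduces every reference call, so it keeps full trajectory and state credit and loses only the small length penalty for the redundant read.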
A plausible implication is that the reward design privileges behavioral equivalence over strict trace imitation. That interpretation follows directly from the inclusion of both trajectory-based and state-based terms and from the paper’s explicit discussion of non-unique valid executions.
6. Experimental scale, benchmark performance, and efficiency claims
EnvFactory constructs verified MCP environments across seven domains: commerce, finance, travel, office, lifestyle, research, and utilities. From these environments it synthesizes separate sets of SFT trajectories and RL trajectories. Average trajectory statistics are reported as turns per conversation and steps per turn, where steps include both tool calls and user interactions. The backbone models are Qwen3-1.7B, Qwen3-4B, and Qwen3-8B, and the primary baselines are AWM and EnvScaler (Xu et al., 18 May 2026).
Evaluation is reported on BFCLv3, MCP-Atlas, τ²-Bench, and VitaBench. The paper claims improvements on BFCLv3, on MCP-Atlas, and on conversational benchmarks including τ²-Bench and VitaBench. It further reports that SFT alone yields strong gains, and that RL after SFT yields further improvements. The gains are described as generalizing across conversational benchmarks and non-conversational or compositional tool benchmarks.
Several benchmark-specific patterns are highlighted. On BFCLv3, EnvFactory improves both single-turn and multi-turn performance, with especially strong multi-turn gains. On MCP-Atlas, it improves pass rate and mean coverage, which the paper interprets as evidence that verified environments help generalize to real server-like tool ecosystems. On τ²-Bench, gains appear across domains such as airline, retail, and telecom. On VitaBench, improvements are described as strong for real-world interactive tasks.
A central efficiency claim is that EnvFactory uses fewer environments than AWM and EnvScaler, yet performs better. The paper attributes this to verified executability, stateful dynamics, topology-aware trajectory structure, calibrated refinement, and data efficiency. On this account, EnvFactory presents quality of supervision as more consequential than environment count alone.
The scaling study uses three environment subsets of increasing size. It reports that increasing the number of environments generally improves BFCL multi-turn performance, but with diminishing returns: the gain from the first increase is larger than the gain from the second. This suggests that added environments increasingly overlap in logic and that diversity matters, but not linearly.
7. Ablations, implementation constraints, and broader implications
The reported ablations address both optimization and data realism. A direct RL setting without SFT warm start improves some metrics, but the gains are smaller and less stable, leading to the conclusion that SFT initialization remains important for stable policy optimization. A refinement ablation comparing refined and unrefined trajectories on SFT samples finds that refined trajectories generally outperform unrefined ones, especially on ambiguous settings such as Miss-Func and Miss-Param. The stated conclusion is that calibrated refinement improves ambiguity handling and supervision quality (Xu et al., 18 May 2026).
The implementation notes include a specific MCP-Atlas evaluation caveat. Because of connectivity constraints, evaluation is run on a subset of the servers and tasks. The excluded servers are mongodb, oxylabs, brave-search, wikipedia, slack, and google-workspace. This caveat is relevant to interpretation because it delimits the empirical basis of the reported MCP-Atlas results.
The paper also identifies an MCP session isolation bottleneck. Because MCP servers are stateful and write-capable tools can modify shared state, strict session isolation is required: each conversation needs a dedicated transport connection. This constrains parallel tool invocation and throughput during large-scale synthesis. The mitigation described is an asynchronous synthesis pipeline that runs many isolated sessions concurrently.
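The isolation pattern can be sketched with `asyncio`. This is an illustration under stated assumptions: `ToySession` is a hypothetical stand-in for one MCP transport connection, not a real MCP client, and the semaphore that bounds the number of open sessions is an added design choice, not something the paper specifies.

```python
import asyncio


class ToySession:
    """Stand-in for one transport connection with its own private state."""

    def __init__(self):
        self.db = {"orders": []}

    async def call(self, tool, **args):
        await asyncio.sleep(0)  # yield control, as real network I/O would
        if tool == "create_order":
            self.db["orders"].append(args["item"])
        return list(self.db["orders"])


async def run_conversation(item, limiter):
    async with limiter:         # bound the number of sessions open at once
        session = ToySession()  # dedicated session: no state is shared
        await session.call("create_order", item=item)
        return await session.call("list_orders")


async def synthesize(items, max_sessions):
    limiter = asyncio.Semaphore(max_sessions)
    return await asyncio.gather(*(run_conversation(i, limiter) for i in items))


results = asyncio.run(synthesize(["book", "lamp", "desk"], max_sessions=2))
```

Each conversation sees only its own write, so concurrency comes from running many sessions side by side instead of issuing parallel calls against one shared stateful server.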
The broader impact discussion mentions possible misuse in sensitive domains, generation of malicious automation, and amplification of source or model bias. The mitigations named are restrictive licensing, transparent documentation, and safety constraints in synthesized environments. These points frame EnvFactory not only as a technical method for scalable Agentic RL, but also as a system whose automation capacity introduces governance and deployment considerations.
A common misconception would be to treat EnvFactory as merely a synthetic data generator. The paper’s own formulation argues against that reading: the framework’s defining components are verified environment synthesis, stateful executability, dependency-constrained sampling, and calibrated realism in multi-turn trajectories. Another misconception would be that larger numbers of environments necessarily yield better training signals. The reported comparison with AWM and EnvScaler is presented specifically to dispute that assumption and to emphasize compact, high-quality supervision.
Taken together, EnvFactory advances the proposition that effective Agentic RL depends on executable environments and grounded multi-turn interaction traces rather than on textual task descriptions alone. This suggests a methodological reorientation in tool-use training toward verification, state transitions, and implicit-intent supervision as first-class design variables.