EnvScaler: Automated LLM Training Environments
- EnvScaler is an automated framework that synthesizes large-scale, executable environments to train and benchmark LLM agents.
- It employs SkelBuilder for environment skeleton synthesis and ScenGenerator for multi-tool scenario generation, ensuring rigorous quality and scalability.
- Benchmark results on Qwen3 models show significant performance gains with supervised fine-tuning and reinforcement learning, validating its practical efficacy.
EnvScaler is an automated framework for synthesizing large-scale, executable, tool-interactive environments to facilitate the training and evaluation of LLM agents. Its principal aim is to address the limitations of restricted real-system access, the unreliability of LLM-simulated environments, and the scalability challenges of manual sandbox construction by programmatically generating diverse environments (“sandboxes”) and associated task scenarios. EnvScaler supports both supervised fine-tuning (SFT) and reinforcement learning (RL), as demonstrated on Qwen3 models, substantially improving agent performance in complex, multi-turn, multi-tool interaction settings (Song et al., 9 Jan 2026).
1. Architectural Components
EnvScaler is composed of two main modules: SkelBuilder and ScenGenerator.
- SkelBuilder synthesizes environment skeletons by mining environment topics, inferring logic models, and conducting quality evaluation.
- ScenGenerator generates diverse task scenarios and rule-based trajectory validators tailored to each synthesized environment.
For each environment, SkelBuilder outputs a triplet $(\mathcal{F}_{\text{exec}}, E_{\text{des}}, \Sigma_{\text{tool}})$, where $\mathcal{F}_{\text{exec}}$ is an executable Python program (state classes, tool methods, domain rules), $E_{\text{des}}$ is a human-readable environment specification, and $\Sigma_{\text{tool}}$ encapsulates the schema of all agent-exposed method interfaces.
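As a concrete illustration, a minimal hand-written sketch of what such a triplet could look like is shown below; the LibraryEnv domain, its attributes, tools, and the SIGMA_TOOL schema are hypothetical examples, not artifacts from the paper.

```python
# Hypothetical sketch of a synthesized environment skeleton (F_exec).
# Class name, attributes, and tools are illustrative, not taken from the paper.
from dataclasses import dataclass, field


@dataclass
class LibraryEnv:
    """Executable environment: state variables plus agent-exposed tool methods."""
    books: dict = field(default_factory=dict)   # state: book_id -> {"title": str, "copies": int}
    loans: dict = field(default_factory=dict)   # state: user_id -> list of borrowed book_ids
    max_loans_per_user: int = 3                 # domain rule constant

    def search_book(self, title: str) -> list:
        """Tool: return ids of books whose title contains the query."""
        return [bid for bid, meta in self.books.items()
                if title.lower() in meta["title"].lower()]

    def borrow_book(self, user_id: str, book_id: str) -> dict:
        """Tool: lend a copy if the domain rules allow; otherwise return a structured error."""
        if len(self.loans.get(user_id, [])) >= self.max_loans_per_user:
            return {"ok": False, "error": "loan limit reached"}        # rule constraint
        if self.books.get(book_id, {}).get("copies", 0) <= 0:
            return {"ok": False, "error": "no copies available"}
        self.books[book_id]["copies"] -= 1
        self.loans.setdefault(user_id, []).append(book_id)
        return {"ok": True, "book_id": book_id}


# Sigma_tool: machine-readable schema of the agent-exposed interfaces.
SIGMA_TOOL = [
    {"name": "search_book", "params": {"title": "str"}, "returns": "list[str]"},
    {"name": "borrow_book", "params": {"user_id": "str", "book_id": "str"}, "returns": "dict"},
]
```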
2. Environment Skeleton Synthesis: SkelBuilder
SkelBuilder operates in three primary stages:
2.1 Task-Guided Environment Discovery (Topic Mining)
Starting from an existing set of tasks $\mathcal{T}_{\text{exist}}$, SkelBuilder:
- Filters tasks for statefulness via an LLM-prompted classifier.
- Infers environment descriptions for qualifying tasks using environment-inference prompts.
- Collects, embeds, and deduplicates environment descriptions to yield a diverse set $E_{\text{des}}$.
Pseudocode outline:
```
for t in T_exist:
    if M(task_filter_prompt, t) == YES:
        E_des_candidates.append(M(env_infer_prompt, t))
E_des = DedupByEmbedding(E_des_candidates)
```
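The deduplication step is only named in the pseudocode; a minimal sketch, assuming a caller-supplied embedding function and a cosine-similarity threshold (the 0.85 value is chosen here purely for illustration), might look like:

```python
# Minimal sketch of the embedding-based deduplication step (DedupByEmbedding).
# The embed() callable and the similarity threshold are assumptions.
import numpy as np


def dedup_by_embedding(descriptions, embed, sim_threshold=0.85):
    """Greedily keep a description only if it is not too similar to any kept one."""
    kept, kept_vecs = [], []
    for desc in descriptions:
        v = np.asarray(embed(desc), dtype=float)
        v = v / (np.linalg.norm(v) + 1e-12)          # normalize for cosine similarity
        if all(float(v @ u) < sim_threshold for u in kept_vecs):
            kept.append(desc)
            kept_vecs.append(v)
    return kept
```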
2.2 Automated Executable Environment Construction (Logic Modeling)
Each $e_{\text{des}} \in E_{\text{des}}$ undergoes:
- Logic planning via LLMs to infer state variables ($E_{\text{state}}$), rule constraints ($E_{\text{rule}}$), and tool blueprints ($E_{\text{tool}}$).
- Program modeling: attribute and method generation based on $E_{\text{state}}$, $E_{\text{rule}}$, and $E_{\text{tool}}$.
- Program assembly into $\mathcal{F}_{\text{exec}}$, verifying Python AST correctness and extracting $\Sigma_{\text{tool}}$ (a sketch of these checks follows below).
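A hedged sketch of the assembly-time checks, using Python's standard ast module; the helper name and the extraction heuristic (public methods of top-level classes) are assumptions, not the paper's exact logic:

```python
# Verify that the generated program parses as valid Python and extract the
# public method interfaces (Sigma_tool) from its class definitions.
import ast


def assemble_and_extract(source_code: str):
    """Return (is_valid, interfaces) for a generated environment program."""
    try:
        tree = ast.parse(source_code)                # AST correctness check
    except SyntaxError:
        return False, []

    interfaces = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and not item.name.startswith("_"):
                    params = [a.arg for a in item.args.args if a.arg != "self"]
                    interfaces.append({"tool": item.name, "params": params})
    return True, interfaces
```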
2.3 Dual-Agent Environment Assessment (Quality Evaluation)
EnvScaler employs N = 100 rounds of tool-call tests:
- A frontend agent issues (valid or invalid) tool calls given the current state and $\Sigma_{\text{tool}}$.
- The backend agent verifies the call/response/state delta against the implementation, outputting Pass/Warning/Fail.
- The environment’s quality score $\text{score}_{\text{env}}$ aggregates the Pass/Warning/Fail outcomes over the $N$ rounds.
Only environments with $\text{score}_{\text{env}} \geq \tau$ are retained.
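A hedged sketch of this dual-agent loop is given below; the frontend/backend agent interfaces, the Warning weight, and the exact scoring rule are assumptions, since the source specifies only $N$ rounds of Pass/Warning/Fail judgments and a retention threshold $\tau$.

```python
# Illustrative dual-agent quality check over N rounds of tool-call tests.
def evaluate_environment(env, frontend_llm, backend_llm, n_rounds=100, tau=0.9):
    outcomes = []
    for _ in range(n_rounds):
        # Frontend agent proposes a (valid or deliberately invalid) tool call.
        call = frontend_llm.propose_tool_call(env.state(), env.tool_schema())
        response, state_delta = env.execute(call)
        # Backend agent checks call/response/state delta against the implementation.
        verdict = backend_llm.judge(call, response, state_delta, env.source())  # "Pass"/"Warning"/"Fail"
        outcomes.append(verdict)

    weights = {"Pass": 1.0, "Warning": 0.5, "Fail": 0.0}   # assumed weighting
    score = sum(weights[v] for v in outcomes) / len(outcomes)
    return score, score >= tau                              # retain iff score >= tau
```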
3. Scenario Generation: ScenGenerator
ScenGenerator uses SkelBuilder’s outputs to curate scenario data critical for agent training.
3.1 Initial State Generation
For each $\mathcal{F}_{\text{exec}}$ and its state schema $E_{\text{state}}$, an LLM prompt produces a JSON-compliant initial state $S_{\text{init}}$ conforming to attribute constraints, establishing realistic database states.
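For the hypothetical library environment sketched earlier, a generated initial state might look like the following; all field names and values are purely illustrative:

```python
# Hypothetical example of a generated initial state S_init.
S_INIT = {
    "books": {
        "b001": {"title": "Introduction to Algorithms", "copies": 2},
        "b002": {"title": "Deep Learning", "copies": 0},
    },
    "loans": {"u42": ["b002"]},   # consistent with b002 having no copies left
}
```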
3.2 Task Generation
Given $(S_{\text{init}}, \Sigma_{\text{tool}}, E_{\text{rule}})$, ScenGenerator prompts for tasks that are (a hypothetical example follows the list):
- Feasible with the initial state and rules,
- Multi-tool and multi-stage,
- Nontrivial.
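Continuing the hypothetical library example, a generated task satisfying these criteria might read:

```python
# Hypothetical generated task: requires multiple tools (search, borrow, state
# inspection), respects the loan-limit rule, and is not solvable in one call.
TASK = (
    "Check whether any copy of 'Introduction to Algorithms' is available; "
    "if so, borrow it for user u42 and report how many copies remain."
)
```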
3.3 Validation Function Generation
Rule-based checkers are synthesized for task validation: each task is decomposed into a checklist of conditions $\{c_k\}$, and a rule-based check function $f_{c_k}$ is generated for each item. The overall trajectory reward aggregates the outcomes of these check functions over the final environment state and interaction trace.
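A minimal sketch of what such checkers and reward aggregation could look like for the hypothetical task above; the uniform averaging over checklist items is an assumption, as the source states only that the checkers are rule-based.

```python
# Rule-based check functions for the hypothetical borrowing task.
def check_book_borrowed(final_state) -> bool:
    return "b001" in final_state["loans"].get("u42", [])

def check_copies_decremented(final_state) -> bool:
    return final_state["books"]["b001"]["copies"] == 1

CHECKERS = [check_book_borrowed, check_copies_decremented]

def trajectory_reward(final_state) -> float:
    """Aggregate checker outcomes into a scalar reward (assumed: fraction passed)."""
    results = [chk(final_state) for chk in CHECKERS]
    return sum(results) / len(results)
```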
3.4 Agent-Environment Rollouts
Interactions are modeled as a Partially Observable Markov Decision Process (POMDP): the agent observes only tool responses and, in conversational settings, user messages, rather than the full underlying environment state.
Two interaction types are supported:
- Non-conversation (agent receives the task upfront),
- Conversation (task information released incrementally via a simulated user agent).
Rollouts yield (i) SFT data: states, histories, tool calls, reasoning, and (ii) RL reward signals.
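A minimal sketch of a non-conversational rollout loop under these definitions; the `agent` and `env` method names are placeholders, as the source does not specify the rollout API at this level of detail.

```python
# Roll out an agent against a synthesized environment for one scenario.
def rollout(env, agent, task, checkers, max_turns=20):
    history = [{"role": "user", "content": task}]            # non-conversation: task given upfront
    for _ in range(max_turns):
        step = agent.act(history, env.tool_schema())          # reasoning + tool call or final answer
        history.append({"role": "assistant", "content": step.text})
        if step.is_final:
            break
        observation = env.execute(step.tool_call)             # POMDP: agent sees only tool outputs
        history.append({"role": "tool", "content": observation})

    reward = sum(chk(env.state()) for chk in checkers) / len(checkers)
    return history, reward                                    # SFT trajectory + RL reward signal
```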
4. Experimental Results and Benchmarks
EnvScaler was instantiated on 191 environments and approximately 7,000 scenarios. Training and evaluation were conducted on Qwen3 models (1.7B, 4B, 8B) in “Thinking” mode, using both SFT and RL with Reinforce++ and KL penalty.
Benchmarks:
- BFCL-v3 Multi-Turn: 8 environments, 800 tasks, with Base, Miss-Param, Miss-Func, Long-Context subsets.
- Tau-Bench: Retail and Airline domains.
- ACEBench-Agent: Multi-Step, Multi-Turn tasks.
Key results:
| Model (Thinking) | BFCL-MT Base | Tau-Bench | ACEBench |
|---|---|---|---|
| Qwen3-4B (base) | 25.38% | 26.00% | 55.28% |
| + SFT | 34.88% (+9.50) | 38.20% (+12.20) | 66.67% (+11.39) |
| + RL | 38.00% (+12.62) | 41.06% (+15.06) | 70.55% (+15.27) |
| Qwen3-8B (base) | 28.88% | 30.00% | 60.00% |
| + SFT | 37.00% (+8.12) | 41.35% (+11.35) | 71.67% (+11.67) |
| + RL | 41.88% (+13.00) | 44.81% (+14.81) | 72.50% (+12.50) |
Averaged across models, SFT adds 8.67 points on BFCL-MT, 4.29 on Tau-Bench, and 11.57 on ACEBench; RL yields further improvements, notably for larger models. The gains are robust to similarity between synthetic and test environments, indicating the framework promotes transferable tool-use patterns rather than memorization.
5. Formal Definitions and Pseudocode
The following summarizes core formal structures:
- Environment Skeleton: $(\mathcal{F}_{\text{exec}}, E_{\text{des}}, \Sigma_{\text{tool}})$
- State/tool inference: $(E_{\text{state}}, E_{\text{rule}}, E_{\text{tool}}) = \text{LogicPlan}(e_{\text{des}})$
- Code Assembly: $\mathcal{F}_{\text{exec}} = \text{CodeGen}(E_{\text{state}}, E_{\text{rule}}, E_{\text{tool}})$, with $\Sigma_{\text{tool}} = \text{ExtractInterfaces}(\mathcal{F}_{\text{exec}})$
- Dual-agent Evaluation: $\text{score}_{\text{env}} = \text{EvaluateEnv}(\mathcal{F}_{\text{exec}}, \Sigma_{\text{tool}})$; retain the environment iff $\text{score}_{\text{env}} \geq \tau$
- Scenario Reward: aggregation of the rule-based check functions $f_{c_k}$ over the completed trajectory
- Agent Step: at each turn, the policy conditions on the task, interaction history, and $\Sigma_{\text{tool}}$ to emit a reasoning trace plus a tool call or final response
High-level pseudocode sketches:
```
for e_des in E_des:
    (E_state, E_rule, E_tool) = LogicPlan(e_des)
    F_exec = CodeGen(E_state, E_rule, E_tool)
    Σ_tool = ExtractInterfaces(F_exec)
    score_env = EvaluateEnv(F_exec, Σ_tool)
    if score_env >= τ:
        env_list.append((F_exec, e_des, Σ_tool))
```
```
for (F_exec, E_des, Σ_tool) in env_list:
    for r in range(R):
        S_init = GenInitialState(F_exec, E_state)
        task = GenTask(S_init, Σ_tool, E_rule)
        c_k = GenChecklist(task)
        f_c_k = GenCheckFuncs(c_k)
        scenario_list.append((F_exec, S_init, task, f_c_k))
```
6. Ablations, Limitations, and Future Directions
Ablations show that train-test environment similarity has only a minor effect on transfer (∼1–2 points). As the number of synthetic environments increases, BFCL-MT success rates rise monotonically, indicating that the framework scales effectively. Models trained on a mix of non-conversational and conversational SFT data surpass those exposed to a single interaction pattern, underscoring the value of diverse interaction dynamics.
Identified limitations include:
- Reliance on LLMs for synthesis introduces possible domain simplification or logic bias.
- Current scope is restricted to text-based, stateful, domain-specific tool interactions, excluding open-world web search and multimodal tasks.
- Only nominal tool latencies are modeled; error realism and stochasticity are excluded.
Envisioned extensions comprise integration of multimodal tools, simulation of stochastic or noisy environments, expansion to open-domain tool interactions, and the incorporation of human-in-the-loop validation to calibrate logic fidelity.
7. Significance and Availability
EnvScaler’s combination of topic mining, skeleton code generation, dual-agent validation, and automated scenario synthesis constitutes a scalable, rigorous solution for constructing training sandboxes, yielding significant and consistent improvements for tool-enabled LLM agents in both SFT and RL regimes. The code and data are available at https://github.com/RUC-NLPIR/EnvScaler (Song et al., 9 Jan 2026).