EnvScaler: Automated LLM Training Environments
- EnvScaler is an automated framework that synthesizes large-scale, executable environments to train and benchmark LLM agents.
- It employs SkelBuilder for environment skeleton synthesis and ScenGenerator for multi-tool scenario generation, ensuring rigorous quality and scalability.
- Benchmark results on Qwen3 models show significant performance gains with supervised fine-tuning and reinforcement learning, validating its practical efficacy.
EnvScaler is an automated framework for synthesizing large-scale, executable, tool-interactive environments to facilitate the training and evaluation of LLM agents. Its principal aim is to address the limitations of restricted real-system access, the unreliability of LLM-simulated environments, and the scalability challenges of manual sandbox construction by programmatically generating diverse environments (“sandboxes”) and associated task scenarios. EnvScaler supports both supervised fine-tuning (SFT) and reinforcement learning (RL), as demonstrated on Qwen3 models, substantially improving agent performance in complex, multi-turn, multi-tool interaction settings (Song et al., 9 Jan 2026).
1. Architectural Components
EnvScaler is composed of two main modules: SkelBuilder and ScenGenerator.
- SkelBuilder synthesizes environment skeletons by mining environment topics, inferring logic models, and conducting quality evaluation.
- ScenGenerator generates diverse task scenarios and rule-based trajectory validators tailored to each synthesized environment.
For each environment, SkelBuilder outputs a triplet $(\mathcal{F}_{\text{exec}}, E_{\text{des}}, \Sigma_{\text{tool}})$, where $\mathcal{F}_{\text{exec}}$ is an executable Python program (state classes, tool methods, domain rules), $E_{\text{des}}$ is a human-readable environment specification, and $\Sigma_{\text{tool}}$ encapsulates the schema of all agent-exposed method interfaces.
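As a concrete illustration, a minimal hand-written sketch of what such a triplet could look like is shown below; the LibraryEnv domain, its attributes, tools, and the SIGMA_TOOL schema are hypothetical examples, not artifacts from the paper.

```python
# Hypothetical sketch of a synthesized environment skeleton (F_exec).
# Class name, attributes, and tools are illustrative, not taken from the paper.
from dataclasses import dataclass, field


@dataclass
class LibraryEnv:
    """Executable environment: state variables plus agent-exposed tool methods."""
    books: dict = field(default_factory=dict)   # state: book_id -> {"title": str, "copies": int}
    loans: dict = field(default_factory=dict)   # state: user_id -> list of borrowed book_ids
    max_loans_per_user: int = 3                 # domain rule constant

    def search_book(self, title: str) -> list:
        """Tool: return ids of books whose title contains the query."""
        return [bid for bid, meta in self.books.items()
                if title.lower() in meta["title"].lower()]

    def borrow_book(self, user_id: str, book_id: str) -> dict:
        """Tool: lend a copy if the domain rules allow; otherwise return a structured error."""
        if len(self.loans.get(user_id, [])) >= self.max_loans_per_user:
            return {"ok": False, "error": "loan limit reached"}        # rule constraint
        if self.books.get(book_id, {}).get("copies", 0) <= 0:
            return {"ok": False, "error": "no copies available"}
        self.books[book_id]["copies"] -= 1
        self.loans.setdefault(user_id, []).append(book_id)
        return {"ok": True, "book_id": book_id}


# Sigma_tool: machine-readable schema of the agent-exposed interfaces.
SIGMA_TOOL = [
    {"name": "search_book", "params": {"title": "str"}, "returns": "list[str]"},
    {"name": "borrow_book", "params": {"user_id": "str", "book_id": "str"}, "returns": "dict"},
]
```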
2. Environment Skeleton Synthesis: SkelBuilder
SkelBuilder operates in three primary stages:
2.1 Task-Guided Environment Discovery (Topic Mining)
Starting from an existing set of tasks $\mathcal{T}_{\text{exist}}$, SkelBuilder:
- Filters tasks for statefulness via an LLM-prompted classifier.
- Infers environment descriptions for qualifying tasks using environment-inference prompts.
- Collects, embeds, and deduplicates environment descriptions to yield a diverse set $E_{\text{des}}$.
Pseudocode outline:
```
for t in T_exist:
    if M(task_filter_prompt, t) == YES:
        E_des_candidates.append(M(env_infer_prompt, t))
E_des = DedupByEmbedding(E_des_candidates)
```
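The deduplication step is only named in the pseudocode; a minimal sketch, assuming a caller-supplied embedding function and a cosine-similarity threshold (the 0.85 value is chosen here purely for illustration), might look like:

```python
# Minimal sketch of the embedding-based deduplication step (DedupByEmbedding).
# The embed() callable and the similarity threshold are assumptions.
import numpy as np


def dedup_by_embedding(descriptions, embed, sim_threshold=0.85):
    """Greedily keep a description only if it is not too similar to any kept one."""
    kept, kept_vecs = [], []
    for desc in descriptions:
        v = np.asarray(embed(desc), dtype=float)
        v = v / (np.linalg.norm(v) + 1e-12)          # normalize for cosine similarity
        if all(float(v @ u) < sim_threshold for u in kept_vecs):
            kept.append(desc)
            kept_vecs.append(v)
    return kept
```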
2.2 Automated Executable Environment Construction (Logic Modeling)
Each $e_{\text{des}} \in E_{\text{des}}$ undergoes:
- Logic planning via LLMs to infer state variables ($E_{\text{state}}$), rule constraints ($E_{\text{rule}}$), and tool blueprints ($E_{\text{tool}}$).
- Program modeling: attribute and method generation based on $E_{\text{state}}$, $E_{\text{rule}}$, and $E_{\text{tool}}$.
- Program assembly into $\mathcal{F}_{\text{exec}}$, verifying Python AST correctness and extracting $\Sigma_{\text{tool}}$ (a sketch of these checks follows below).
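A hedged sketch of the assembly-time checks, using Python's standard ast module; the helper name and the extraction heuristic (public methods of top-level classes) are assumptions, not the paper's exact logic:

```python
# Verify that the generated program parses as valid Python and extract the
# public method interfaces (Sigma_tool) from its class definitions.
import ast


def assemble_and_extract(source_code: str):
    """Return (is_valid, interfaces) for a generated environment program."""
    try:
        tree = ast.parse(source_code)                # AST correctness check
    except SyntaxError:
        return False, []

    interfaces = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and not item.name.startswith("_"):
                    params = [a.arg for a in item.args.args if a.arg != "self"]
                    interfaces.append({"tool": item.name, "params": params})
    return True, interfaces
```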
2.3 Dual-Agent Environment Assessment (Quality Evaluation)
EnvScaler employs N = 100 rounds of tool-call tests:
- A frontend agent issues (valid or invalid) tool calls given the current state and $\Sigma_{\text{tool}}$.
- The backend agent verifies the call/response/state delta against the implementation, outputting Pass/Warning/Fail.
- The environment’s quality score $\text{score}_{\text{env}}$ aggregates the Pass/Warning/Fail outcomes over the $N$ rounds.
Only environments with $\text{score}_{\text{env}} \geq \tau$ are retained.
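A hedged sketch of this dual-agent loop is given below; the frontend/backend agent interfaces, the Warning weight, and the exact scoring rule are assumptions, since the source specifies only $N$ rounds of Pass/Warning/Fail judgments and a retention threshold $\tau$.

```python
# Illustrative dual-agent quality check over N rounds of tool-call tests.
def evaluate_environment(env, frontend_llm, backend_llm, n_rounds=100, tau=0.9):
    outcomes = []
    for _ in range(n_rounds):
        # Frontend agent proposes a (valid or deliberately invalid) tool call.
        call = frontend_llm.propose_tool_call(env.state(), env.tool_schema())
        response, state_delta = env.execute(call)
        # Backend agent checks call/response/state delta against the implementation.
        verdict = backend_llm.judge(call, response, state_delta, env.source())  # "Pass"/"Warning"/"Fail"
        outcomes.append(verdict)

    weights = {"Pass": 1.0, "Warning": 0.5, "Fail": 0.0}   # assumed weighting
    score = sum(weights[v] for v in outcomes) / len(outcomes)
    return score, score >= tau                              # retain iff score >= tau
```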
3. Scenario Generation: ScenGenerator
ScenGenerator uses SkelBuilder’s outputs to curate scenario data critical for agent training.
3.1 Initial State Generation
For each $\mathcal{F}_{\text{exec}}$ and its state schema $E_{\text{state}}$, an LLM prompt produces a JSON-compliant initial state $S_{\text{init}}$ conforming to attribute constraints, establishing realistic database states.
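For the hypothetical library environment sketched earlier, a generated initial state might look like the following; all field names and values are purely illustrative:

```python
# Hypothetical example of a generated initial state S_init.
S_INIT = {
    "books": {
        "b001": {"title": "Introduction to Algorithms", "copies": 2},
        "b002": {"title": "Deep Learning", "copies": 0},
    },
    "loans": {"u42": ["b002"]},   # consistent with b002 having no copies left
}
```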
3.2 Task Generation
Given $(S_{\text{init}}, \Sigma_{\text{tool}}, E_{\text{rule}})$, ScenGenerator prompts for tasks that are (a hypothetical example follows the list):
- Feasible with the initial state and rules,
- Multi-tool and multi-stage,
- Nontrivial.
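Continuing the hypothetical library example, a generated task satisfying these criteria might read:

```python
# Hypothetical generated task: requires multiple tools (search, borrow, state
# inspection), respects the loan-limit rule, and is not solvable in one call.
TASK = (
    "Check whether any copy of 'Introduction to Algorithms' is available; "
    "if so, borrow it for user u42 and report how many copies remain."
)
```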
3.3 Validation Function Generation
Rule-based checkers are synthesized for task validation: each task is decomposed into a checklist of conditions $\{c_k\}$, and a rule-based check function $f_{c_k}$ is generated for each item. The overall trajectory reward aggregates the outcomes of these check functions over the final environment state and interaction trace.
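A minimal sketch of what such checkers and reward aggregation could look like for the hypothetical task above; the uniform averaging over checklist items is an assumption, as the source states only that the checkers are rule-based.

```python
# Rule-based check functions for the hypothetical borrowing task.
def check_book_borrowed(final_state) -> bool:
    return "b001" in final_state["loans"].get("u42", [])

def check_copies_decremented(final_state) -> bool:
    return final_state["books"]["b001"]["copies"] == 1

CHECKERS = [check_book_borrowed, check_copies_decremented]

def trajectory_reward(final_state) -> float:
    """Aggregate checker outcomes into a scalar reward (assumed: fraction passed)."""
    results = [chk(final_state) for chk in CHECKERS]
    return sum(results) / len(results)
```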
3.4 Agent-Environment Rollouts
Interactions are modeled as a Partially Observable Markov Decision Process (POMDP): the agent observes only tool responses and, in conversational settings, user messages, rather than the full underlying environment state.
Two interaction types are supported:
- Non-conversation (agent receives the task upfront),
- Conversation (task information released incrementally via a simulated user agent).
Rollouts yield (i) SFT data: states, histories, tool calls, reasoning, and (ii) RL reward signals.
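A minimal sketch of a non-conversational rollout loop under these definitions; the `agent` and `env` method names are placeholders, as the source does not specify the rollout API at this level of detail.

```python
# Roll out an agent against a synthesized environment for one scenario.
def rollout(env, agent, task, checkers, max_turns=20):
    history = [{"role": "user", "content": task}]            # non-conversation: task given upfront
    for _ in range(max_turns):
        step = agent.act(history, env.tool_schema())          # reasoning + tool call or final answer
        history.append({"role": "assistant", "content": step.text})
        if step.is_final:
            break
        observation = env.execute(step.tool_call)             # POMDP: agent sees only tool outputs
        history.append({"role": "tool", "content": observation})

    reward = sum(chk(env.state()) for chk in checkers) / len(checkers)
    return history, reward                                    # SFT trajectory + RL reward signal
```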
4. Experimental Results and Benchmarks
EnvScaler was instantiated on 191 environments and approximately 7,000 scenarios. Training and evaluation were conducted on Qwen3 models (1.7B, 4B, 8B) in “Thinking” mode, using both SFT and RL with Reinforce++ and KL penalty.
Benchmarks:
- BFCL-v3 Multi-Turn: 8 environments, 800 tasks, with Base, Miss-Param, Miss-Func, Long-Context subsets.
- Tau-Bench: Retail and Airline domains.
- ACEBench-Agent: Multi-Step, Multi-Turn tasks.
Key results:
| Model (Thinking) | BFCL-MT Base | Tau-Bench | ACEBench |
|---|---|---|---|
| Qwen3-4B (base) | 25.38% | 26.00% | 55.28% |
| + SFT | 34.88% (+9.50) | 38.20% (+12.20) | 66.67% (+11.39) |
| + RL | 38.00% (+12.62) | 41.06% (+15.06) | 70.55% (+15.27) |
| Qwen3-8B (base) | 28.88% | 30.00% | 60.00% |
| + SFT | 37.00% (+8.12) | 41.35% (+11.35) | 71.67% (+11.67) |
| + RL | 41.88% (+13.00) | 44.81% (+14.81) | 72.50% (+12.50) |
Averaged across models, SFT adds 8.67 points on BFCL-MT, 4.29 on Tau-Bench, and 11.57 on ACEBench; RL yields further improvements, notably for larger models. The gains are robust to similarity between synthetic and test environments, indicating the framework promotes transferable tool-use patterns rather than memorization.
5. Formal Definitions and Pseudocode
The following summarizes core formal structures:
- Environment Skeleton: $(\mathcal{F}_{\text{exec}}, E_{\text{des}}, \Sigma_{\text{tool}})$
- State/tool inference: $(E_{\text{state}}, E_{\text{rule}}, E_{\text{tool}}) = \text{LogicPlan}(e_{\text{des}})$
- Code Assembly: $\mathcal{F}_{\text{exec}} = \text{CodeGen}(E_{\text{state}}, E_{\text{rule}}, E_{\text{tool}})$, with $\Sigma_{\text{tool}} = \text{ExtractInterfaces}(\mathcal{F}_{\text{exec}})$
- Dual-agent Evaluation: $\text{score}_{\text{env}} = \text{EvaluateEnv}(\mathcal{F}_{\text{exec}}, \Sigma_{\text{tool}})$; retain the environment iff $\text{score}_{\text{env}} \geq \tau$
- Scenario Reward: aggregation of the rule-based check functions $f_{c_k}$ over the completed trajectory
- Agent Step: at each turn, the policy conditions on the task, interaction history, and $\Sigma_{\text{tool}}$ to emit a reasoning trace plus a tool call or final response
High-level pseudocode sketches:
```
for e_des in E_des:
    (E_state, E_rule, E_tool) = LogicPlan(e_des)
    F_exec = CodeGen(E_state, E_rule, E_tool)
    Σ_tool = ExtractInterfaces(F_exec)
    score_env = EvaluateEnv(F_exec, Σ_tool)
    if score_env >= τ:
        env_list.append((F_exec, e_des, Σ_tool))
```
```
for (F_exec, E_des, Σ_tool) in env_list:
    for r in range(R):
        S_init = GenInitialState(F_exec, E_state)
        task = GenTask(S_init, Σ_tool, E_rule)
        c_k = GenChecklist(task)
        f_c_k = GenCheckFuncs(c_k)
        scenario_list.append((F_exec, S_init, task, f_c_k))
```
6. Ablations, Limitations, and Future Directions
Ablations show that train-test environment similarity has only a minor effect on transfer (∼1–2 points). As the number of synthetic environments increases, BFCL-MT success rates rise monotonically, indicating that the framework scales effectively. Models trained on a mix of non-conversational and conversational SFT data surpass those exposed to a single interaction pattern, underscoring the value of diverse interaction dynamics.
Identified limitations include:
- Reliance on LLMs for synthesis introduces possible domain simplification or logic bias.
- Current scope is restricted to text-based, stateful, domain-specific tool interactions, excluding open-world web search and multimodal tasks.
- Only nominal tool latencies are modeled; error realism and stochasticity are excluded.
Envisioned extensions comprise integration of multimodal tools, simulation of stochastic or noisy environments, expansion to open-domain tool interactions, and the incorporation of human-in-the-loop validation to calibrate logic fidelity.
7. Significance and Availability
EnvScaler’s combination of topic mining, skeleton code generation, dual-agent validation, and automated scenario synthesis constitutes a scalable, rigorous solution for constructing training sandboxes, yielding significant and consistent improvements for tool-enabled LLM agents in both SFT and RL regimes. The code and data are available at https://github.com/RUC-NLPIR/EnvScaler (Song et al., 9 Jan 2026).