
EnvScaler: Automated LLM Training Environments

Updated 12 January 2026
  • EnvScaler is an automated framework that synthesizes large-scale, executable environments to train and benchmark LLM agents.
  • It employs SkelBuilder for environment skeleton synthesis and ScenGenerator for multi-tool scenario generation, ensuring rigorous quality and scalability.
  • Benchmark results on Qwen3 models show significant performance gains with supervised fine-tuning and reinforcement learning, validating its practical efficacy.

EnvScaler is an automated framework for synthesizing large-scale, executable, tool-interactive environments to facilitate the training and evaluation of LLM agents. Its principal aim is to address the limitations of restricted real-system access, the unreliability of LLM-simulated environments, and the scalability challenges of manual sandbox construction by programmatically generating diverse environments (“sandboxes”) and associated task scenarios. EnvScaler supports both supervised fine-tuning (SFT) and reinforcement learning (RL), as demonstrated on Qwen3 models, substantially improving agent performance in complex, multi-turn, multi-tool interaction settings (Song et al., 9 Jan 2026).

1. Architectural Components

EnvScaler is composed of two main modules: SkelBuilder and ScenGenerator.

  • SkelBuilder synthesizes environment skeletons by mining environment topics, inferring logic models, and conducting quality evaluation.
  • ScenGenerator generates diverse task scenarios and rule-based trajectory validators tailored to each synthesized environment.

For each environment $E$, SkelBuilder outputs a triplet:

$$E = \{ F_{exec},\ E_{doc},\ \Sigma_{tool} \}$$

where $F_{exec}$ is an executable Python program (state classes, tool methods, domain rules), $E_{doc}$ is a human-readable environment specification, and $\Sigma_{tool}$ encapsulates the schema of all agent-exposed method interfaces.
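A minimal sketch of this triplet as a Python data structure; the field names and types are illustrative assumptions rather than EnvScaler's actual representation:

from dataclasses import dataclass, field
from typing import Any

@dataclass
class EnvironmentSkeleton:
    """One synthesized environment E = {F_exec, E_doc, Sigma_tool} (assumed layout)."""
    f_exec: str                       # executable Python source: state classes, tool methods, rules
    e_doc: str                        # human-readable environment specification
    sigma_tool: list[dict[str, Any]] = field(default_factory=list)  # agent-exposed tool schemas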

2. Environment Skeleton Synthesis: SkelBuilder

SkelBuilder operates in three primary stages:

2.1 Task-Guided Environment Discovery (Topic Mining)

Starting from an existing set of tasks $T_{exist}$, SkelBuilder:

  • Filters tasks for statefulness via an LLM-prompted classifier.
  • Infers environment descriptions for qualifying tasks using environment-inference prompts.
  • Collects, embeds, and deduplicates environment descriptions to yield a diverse set $\{E_{des}\}$.

Pseudocode outline:

env_des_candidates = []
for t in T_exist:
    if M(task_filter_prompt, t) == "YES":                 # LLM-prompted statefulness filter
        env_des_candidates.append(M(env_infer_prompt, t)) # infer an environment description
E_des = DedupByEmbedding(env_des_candidates)              # embed and deduplicate descriptions
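The paper does not detail the deduplication step; the following is a minimal sketch of one common approach (greedy cosine-similarity filtering), where the encode function and the 0.9 threshold are assumptions:

import numpy as np

def dedup_by_embedding(descriptions, encode, threshold=0.9):
    """Keep a description only if its cosine similarity to every
    already-kept description is below `threshold`.
    `encode` maps a string to a fixed-size embedding vector (assumed)."""
    kept, kept_vecs = [], []
    for desc in descriptions:
        v = np.asarray(encode(desc), dtype=float)
        v = v / (np.linalg.norm(v) + 1e-12)
        if all(float(v @ u) < threshold for u in kept_vecs):
            kept.append(desc)
            kept_vecs.append(v)
    return kept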

2.2 Automated Executable Environment Construction (Logic Modeling)

Each $E_{des}$ undergoes:

  • Logic planning via LLMs to infer state variables ($E_{state}$), rule constraints ($E_{rule}$), and tool blueprints ($\{E_{tool_i}\}$).
  • Program modeling: attribute and method generation based on $E_{state}$, $E_{rule}$, and $E_{tool_i}$.
  • Program assembly into $F_{exec}$, verifying Python AST correctness and extracting $\Sigma_{tool}$ (a minimal check of this step is sketched below).
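A minimal sketch of how the assembly step might verify syntactic correctness and extract the tool schema using Python's standard ast module; the function and the schema format are assumptions, not the paper's implementation:

import ast

def assemble_and_extract(f_exec_source: str):
    """Parse the generated program, fail on syntax errors, and collect a
    simple tool schema (method name + argument names) from public functions.
    The schema format is an assumption for illustration."""
    tree = ast.parse(f_exec_source)              # raises SyntaxError if the program is invalid
    sigma_tool = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and not node.name.startswith("_"):
            args = [a.arg for a in node.args.args if a.arg != "self"]
            sigma_tool.append({"name": node.name, "args": args})
    return tree, sigma_tool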

2.3 Dual-Agent Environment Assessment (Quality Evaluation)

EnvScaler employs $N = 100$ rounds of tool-call tests:

  • A frontend agent $M_{test}$ issues (valid/invalid) tool calls given the current state and $\Sigma_{tool}$.
  • The backend agent $M_{check}$ verifies the call/response/state delta against the implementation, outputting Pass/Warning/Fail.
  • The environment’s quality score is:

$$\text{score}_{env} = \frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[\text{judge}_j = \text{Pass}]$$

Only environments with $\text{score}_{env} \geq 0.85$ are retained.
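A minimal sketch of this evaluation loop under the definitions above; frontend_call, backend_judge, and the env methods are hypothetical wrappers around the two agents and the synthesized sandbox, not EnvScaler's actual interfaces:

def evaluate_env(env, sigma_tool, frontend_call, backend_judge, n_rounds=100):
    """Run n_rounds of tool-call tests and return the fraction judged 'Pass'.
    frontend_call plays M_test, backend_judge plays M_check (both assumed stubs)."""
    passes = 0
    for _ in range(n_rounds):
        call = frontend_call(env.state, sigma_tool)    # valid or deliberately invalid tool call
        state_before = env.snapshot()                  # assumed state-snapshot helper
        response = env.execute(call)                   # assumed call-execution helper
        verdict = backend_judge(call, response, state_before, env.snapshot())
        passes += (verdict == "Pass")
    return passes / n_rounds

# An environment is retained only if evaluate_env(...) >= 0.85.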

3. Scenario Generation: ScenGenerator

ScenGenerator uses SkelBuilder’s outputs to curate scenario data critical for agent training.

3.1 Initial State Generation

For each $F_{exec}$ and $E_{state}$, an LLM prompt produces a JSON-compliant $S_{init}$ conforming to attribute constraints, establishing realistic database states.
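For illustration, a hypothetical initial state for an invented retail-style environment might look like the following; the domain and field names are not taken from the paper:

# Hypothetical S_init for an invented retail-style environment (illustrative only).
S_init = {
    "users": [{"id": "u1", "name": "Alice", "balance": 120.0}],
    "orders": [{"id": "o1", "user_id": "u1", "status": "pending", "total": 45.5}],
    "inventory": {"sku-001": 12, "sku-002": 0},
}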

3.2 Task Generation

Given $(S_{init}, E_{tool}, E_{rule})$, ScenGenerator prompts for tasks that are:

  • Feasible with the initial state and rules,
  • Multi-tool and multi-stage,
  • Nontrivial.

3.3 Validation Function Generation

Rule-based checkers $f_{c_k}$ are synthesized for task validation:

$$\{c_k\}_{k=1..K} = M(P^{check\_list} \,\|\, \text{task}),\qquad f_{c_k} = M(P^{check\_func} \,\|\, c_k)$$

The overall trajectory reward is:

$$\text{reward} = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}[f_{c_k}(S_{final}) = \text{True}]$$
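A minimal sketch of this rule-based reward, assuming each generated checker is an executable predicate over the final environment state:

def trajectory_reward(check_funcs, final_state):
    """Fraction of generated checkers f_{c_k} that hold on the final state.
    Each checker is assumed to be a callable final_state -> bool."""
    if not check_funcs:
        return 0.0
    return sum(bool(f(final_state)) for f in check_funcs) / len(check_funcs)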

3.4 Agent-Environment Rollouts

Interactions model a Partially Observable Markov Decision Process (POMDP):

$$a_t = \pi_\theta(H_t, o_t),\qquad (o_{t+1}, S_{t+1}) = E(a_t, S_t)$$

Two interaction types are supported:

  • Non-conversation (agent receives the task upfront),
  • Conversation (task information released incrementally via a simulated user $\pi_{user}$).

Rollouts yield (i) SFT data: states, histories, tool calls, reasoning, and (ii) RL reward signals.
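A minimal sketch of the non-conversation rollout loop implied by these equations; policy, env, and the termination signal are stand-ins, not EnvScaler's actual interfaces:

def rollout(policy, env, task, max_turns=20):
    """Collect one agent–environment trajectory for SFT data and RL rewards.
    `policy` plays pi_theta, `env` implements E(a_t, S_t); both are assumed stubs."""
    obs = env.reset()
    history = [task]
    trajectory = []
    for _ in range(max_turns):
        action = policy(history, obs)          # a_t = pi_theta(H_t, o_t)
        obs, done = env.step(action)           # (o_{t+1}, S_{t+1}) = E(a_t, S_t)
        trajectory.append((action, obs))
        history.append((action, obs))
        if done:
            break
    return trajectory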

4. Experimental Results and Benchmarks

EnvScaler was instantiated on 191 environments and approximately 7,000 scenarios. Training and evaluation were conducted on Qwen3 models (1.7B, 4B, 8B) in “Thinking” mode, using both SFT and RL with Reinforce++ and KL penalty.

Benchmarks:

  • BFCL-v3 Multi-Turn: 8 environments, 800 tasks, with Base, Miss-Param, Miss-Func, Long-Context subsets.
  • Tau-Bench: Retail and Airline domains.
  • ACEBench-Agent: Multi-Step, Multi-Turn tasks.

Key results:

Model (Thinking)  | BFCL-MT Base     | Tau-Bench        | ACEBench
Qwen3-4B (base)   | 25.38%           | 26.00%           | 55.28%
  + SFT           | 34.88% (+9.50)   | 38.20% (+12.20)  | 66.67% (+11.39)
  + RL            | 38.00% (+12.62)  | 41.06% (+15.06)  | 70.55% (+15.27)
Qwen3-8B (base)   | 28.88%           | 30.00%           | 60.00%
  + SFT           | 37.00% (+8.12)   | 41.35% (+11.35)  | 71.67% (+11.67)
  + RL            | 41.88% (+13.00)  | 44.81% (+14.81)  | 72.50% (+12.50)

Averaged across models, SFT adds 8.67 points on BFCL-MT, 4.29 on Tau-Bench, and 11.57 on ACEBench; RL yields further improvements, notably for larger models. The gains are robust to similarity between synthetic and test environments, indicating the framework promotes transferable tool-use patterns rather than memorization.

5. Formal Definitions and Pseudocode

The following summarizes core formal structures:

  • Environment Skeleton: $E = \{ F_{exec}, E_{doc}, \Sigma_{tool} \}$
  • State/tool inference: $E_{state}, E_{rule} = M(P^{state\_plan} \,\|\, E_{des})$
  • Code Assembly: $F_{exec} = \text{Merge}(F_{attr}, \{F_{meth_i}\}_i)$
  • Dual-agent Evaluation: $\text{score}_{env} = \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}[\text{judge}_j = \text{Pass}]$
  • Scenario Reward: $\text{reward} = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}[f_{c_k}(S_{final}) = \text{True}]$
  • Agent Step: $a_t = \pi_\theta(H_t, o_t)$, $(o_{t+1}, S_{t+1}) = E(a_t, S_t)$

High-level pseudocode sketches:

for e_des in E_des:
    (E_state, E_rule, E_tool) = LogicPlan(e_des)        # infer state vars, rules, tool blueprints
    F_exec = CodeGen(E_state, E_rule, E_tool)           # generate and assemble the program
    Σ_tool = ExtractInterfaces(F_exec)                  # parse agent-exposed tool schemas
    score_env = EvaluateEnv(F_exec, Σ_tool)             # dual-agent quality evaluation
    if score_env >= τ:                                  # τ = 0.85
        env_list.append((F_exec, e_des, E_state, E_rule, Σ_tool))

for (F_exec, E_des, E_state, E_rule, Σ_tool) in env_list:
    for r in range(R):                                  # R scenarios per environment
        S_init = GenInitialState(F_exec, E_state)       # JSON-compliant initial state
        task = GenTask(S_init, Σ_tool, E_rule)          # feasible, multi-tool, multi-stage task
        checklist = GenChecklist(task)                  # {c_k}
        check_funcs = GenCheckFuncs(checklist)          # {f_{c_k}}
        scenario_list.append((F_exec, S_init, task, check_funcs))

6. Ablations, Limitations, and Future Directions

Experiments demonstrate that train-test environment similarity minimally affects transfer (∼1–2 points). As the number of synthetic environments increases, BFCL-MT success rises monotonically, indicating scaling efficacy. Models trained with both non-conversational and conversational SFT settings surpass those exposed to a single pattern, confirming the necessity for diverse interaction dynamics.

Identified limitations include:

  • Reliance on LLMs for synthesis introduces possible domain simplification or logic bias.
  • Current scope is restricted to text-based, stateful, domain-specific tool interactions, excluding open-world web search and multimodal tasks.
  • Only nominal tool latencies are modeled; error realism and stochasticity are excluded.

Envisioned extensions comprise integration of multimodal tools, simulation of stochastic or noisy environments, expansion to open-domain tool interactions, and the incorporation of human-in-the-loop validation to calibrate logic fidelity.

7. Significance and Availability

EnvScaler’s combination of topic mining, skeleton code generation, dual-agent validation, and automated scenario synthesis constitutes a scalable, rigorous solution for constructing training sandboxes, yielding significant and consistent improvements for tool-enabled LLM agents in both SFT and RL regimes. The code and data are available at https://github.com/RUC-NLPIR/EnvScaler (Song et al., 9 Jan 2026).
