Simia-RL: LLM-Simulated RL Framework
- Simia-RL is a reinforcement learning framework that uses LLMs to simulate both environmental dynamics and reward feedback, enabling robust and scalable agent training.
- It integrates a simulated MDP with PPO/GRPO-style updates, employing prompt-based queries to generate diverse, synthetic trajectories for supervised and RL modalities.
- Empirical evaluations show that Simia-RL achieves competitive performance on complex tool-use and dialogue benchmarks while reducing manual environment engineering.
The Simia-RL framework is a reinforcement learning methodology in which both environment transitions and reward feedback are simulated by an LLM rather than by a hard-coded environment or real-world API. Conceived as part of a broader effort to eliminate brittle, labor-intensive environment engineering for language agents, Simia-RL enables scalable, environment-agnostic agent training. By leveraging LLMs to simulate both environment feedback and reward computation, Simia-RL permits training and evaluation of agents entirely in synthetic contexts, supporting both supervised and reinforcement learning modalities. The framework is positioned as a unified solution for robust language agent training, with demonstrated empirical gains on multiple complex tool-use and dialogue benchmarks.
1. Markov Decision Process Formulation
Simia-RL is defined as a standard reinforcement learning problem over a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where:
- $\mathcal{S}$ is the state space, comprising full agent–environment dialogue histories, tool-use contexts, and schema information.
- $\mathcal{A}$ is the action space, including free-form text replies and structured tool calls.
- $\gamma$ is the discount factor.
The key distinction is that both the transition function $P$ and reward function $R$ are implemented by LLM-based simulators, not by hand-coded or real environments. Formally, $s_{t+1} \sim P_{\mathrm{LLM}}(\cdot \mid s_t, a_t)$ and $r_t = R_{\mathrm{LLM}}(s_t, a_t, s_{t+1})$, where $P_{\mathrm{LLM}}$ and $R_{\mathrm{LLM}}$ are outputs of the LLM environment simulator and reward model, respectively. The episodic reward is binary, with successful task completion (as defined by oracle criteria) yielding $r = 1$ and failure yielding $r = 0$.
The agent objective is to maximize the expected discounted return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]$, where trajectories $\tau$ are simulated entirely with LLM-based feedback.
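To make the formulation concrete, the following is a minimal sketch of a single episode rollout in this simulated MDP; the `policy`, `llm_transition`, and `llm_reward` callables are hypothetical placeholders for the agent and the prompt-based simulator queries, not names from the source.

```python
from typing import Callable, List, Tuple

def rollout_episode(policy: Callable[[str], str],
                    llm_transition: Callable[[str, str], str],
                    llm_reward: Callable[[str, str, str], float],
                    s0: str,
                    max_turns: int = 25,
                    gamma: float = 1.0) -> Tuple[List[Tuple[str, str, float, str]], float]:
    """Roll out one episode in the LLM-simulated MDP; return transitions and discounted return."""
    s, transitions, ret = s0, [], 0.0
    for t in range(max_turns):
        a = policy(s)                   # a_t ~ pi_theta(. | s_t)
        s_next = llm_transition(s, a)   # s_{t+1} ~ P_LLM(. | s_t, a_t), a prompt-based query
        r = llm_reward(s, a, s_next)    # r_t = R_LLM(s_t, a_t, s_{t+1}), binary success signal
        transitions.append((s, a, r, s_next))
        ret += (gamma ** t) * r         # accumulate the discounted return
        s = s_next
    return transitions, ret
```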
2. Algorithmic Structure
Training proceeds in interleaved policy rollouts and updates, closely akin to PPO or GRPO, with all feedback provided by the LLM-based simulator.
High-Level Algorithm
- Initialize policy $\pi_\theta$ (typically an instruction-tuned transformer such as Qwen or Llama).
- For $N$ RL iterations (e.g., 64):
  - Collect $K$ simulated episodes; for each:
    - Start from an initial state $s_0$.
    - For up to max-turns $T$:
      - Sample action $a_t \sim \pi_\theta(\cdot \mid s_t)$.
      - Query the LLM simulator with $(s_t, a_t)$ to obtain $(s_{t+1}, r_t)$.
      - Store the transition $(s_t, a_t, r_t, s_{t+1})$.
  - Compute advantage estimates $\hat{A}_t$ (e.g., GAE; see the sketch after the pseudocode below).
  - Update $\theta$ by maximizing the clipped surrogate PPO or GRPO objective
    $$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(\rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$
    where $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$.
- Optional fine-tuning with supervised learning on high-reward trajectories.
Pseudocode
```
initialize θ                                              # policy parameters
for iteration in range(N):
    trajectories = []
    for rollout in range(K):
        s = s0                                            # initial state (task, schemas, history)
        for t in range(T):
            a = pi_theta(s)                               # sample agent action
            s_prime = LLM_env_simulator(prompt=(s, a))    # simulated environment transition
            r = LLM_reward_computer(prompt=(s, a, s_prime))  # simulated (binary) reward
            trajectories.append((s, a, r, s_prime))
            s = s_prime
    advantages = compute_advantages(trajectories)         # e.g., GAE
    update_theta(trajectories, advantages)                # PPO/GRPO with L_CLIP objective
```
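As a sketch of the `compute_advantages` and `update_theta` steps referenced above, the following shows a standard GAE estimator and the PPO clipped surrogate loss in PyTorch. This is a generic PPO-style implementation under assumed tensor shapes, not the authors' code: the clip threshold is left as a required argument because its value is not given in this article, the GAE parameters shown are common defaults, and a GRPO variant would instead normalize returns within a group of rollouts rather than use a learned value function.

```python
import torch

def compute_advantages(rewards: torch.Tensor,
                       values: torch.Tensor,
                       gamma: float = 1.0,
                       lam: float = 0.95) -> torch.Tensor:
    """Generalized Advantage Estimation over one episode (1-D tensors of length T)."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae, next_value = 0.0, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual delta_t
        gae = delta + gamma * lam * gae                      # recursive GAE accumulation
        advantages[t] = gae
        next_value = values[t]
    return advantages

def clipped_surrogate_loss(logp_new: torch.Tensor,
                           logp_old: torch.Tensor,
                           advantages: torch.Tensor,
                           clip_eps: float) -> torch.Tensor:
    """PPO clipped objective L_CLIP, negated so it can be minimized by gradient descent."""
    ratio = torch.exp(logp_new - logp_old)  # rho_t(theta) = pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```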
3. LLM-Based Simulation Details
Both the transition simulator and reward function are implemented as prompt-based queries to a large instruction-tuned LLM, typically o4-mini or GPT-5.
- Simulator/Reward Models: Designed atop transformer-based LLMs with context windows up to 60k tokens.
- Environment Feedback Prompts: Bundle tool schemas, available functions and signatures, interaction formatting instructions, and at least one exemplar trajectory.
- Reward Prompts: Include natural-language task success criteria, full history, and explicit instructions to return JSON-formatted success/failure with justification and confidence (see the prompt sketch after this list).
- Prompt Diversity and Amplification: Trajectories are generated with stochastic sampling (temperature = 1.0), supporting the construction of tens of thousands of unique episodes. The same environment prompt is reused, ensuring API consistency while retaining stochasticity in the simulator's outputs.
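As a concrete illustration of the environment-feedback and reward prompts described above, the sketch below shows how such prompts might be assembled and how the judge's JSON verdict maps onto the binary reward. The templates and field names are hypothetical and abbreviated; the source does not specify the exact wording.

```python
import json

# Hypothetical, abbreviated templates; the exact prompt wording is not given in the article.
ENV_PROMPT_TEMPLATE = """You are simulating the backend of a tool-use environment.
Available tools and signatures:
{tool_schemas}

Interaction formatting rules:
{format_instructions}

Exemplar trajectory:
{exemplar}

Conversation so far:
{history}

Agent's latest message or tool call:
{action}

Respond only with the environment's next observation."""

REWARD_PROMPT_TEMPLATE = """Task success criteria:
{criteria}

Full interaction history:
{trajectory}

Decide whether the task was completed successfully. Respond with JSON:
{{"success": true|false, "justification": "...", "confidence": 0.0-1.0}}"""

def parse_reward(judge_output: str) -> float:
    """Map the reward model's JSON verdict onto the binary episodic reward."""
    verdict = json.loads(judge_output)
    return 1.0 if verdict["success"] else 0.0
```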
4. Training Regime and Hyperparameters
Simia-RL employs the following RL training hyperparameters:
- Learning rate:
- Rollouts per iteration: 16 episodes
- Gradient steps: 64 per training epoch
- Mini-batch size: 32
- Clip parameter:
- KL penalty coefficient: 0.001
- Sampling temperature: 0.7; Top-P: 1.0
- Max rounds per episode: 25
- Max context length: $12,000$ tokens
Exploration incentives are implemented via entropy bonuses from GRPO. A cosine annealing scheduler with warm-up is used. No explicit data augmentation is needed, as the LLM-simulated environment naturally injects trajectory diversity due to its stochastic outputs.
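For reference, these settings might be collected in a single configuration object, sketched below. The field names are illustrative, and the learning rate and clip parameter are left unset because their concrete values are not listed above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SimiaRLConfig:
    """RL training hyperparameters as listed above (field names are illustrative)."""
    rollouts_per_iteration: int = 16          # episodes collected per RL iteration
    gradient_steps_per_epoch: int = 64
    mini_batch_size: int = 32
    kl_penalty_coef: float = 0.001
    sampling_temperature: float = 0.7
    top_p: float = 1.0
    max_rounds_per_episode: int = 25
    max_context_length: int = 12_000          # tokens
    lr_schedule: str = "cosine_with_warmup"   # cosine annealing with warm-up
    learning_rate: Optional[float] = None     # value not specified in this article
    clip_eps: Optional[float] = None          # value not specified in this article
```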
5. Empirical Evaluation and Performance
Experiments target benchmarks requiring realistic multi-tool, multi-turn agent interaction:
Benchmarks evaluated:
- τ-Bench (Airline, Retail), using GPT-4.1 as a user simulator.
- OfficeBench (2-apps, 3-apps workflows).
- AgentBench (Operating System, WebShop, Mind2Web).
Performance metrics (selected results):
| Benchmark | Closed-source Baseline | Simia (SFT) | Simia-RL (SFT+RL) |
|---|---|---|---|
| τ-Bench | GPT-4o: 54.2 | 58.9 (Qwen2.5-32B) | 51.0 (Qwen3-8B) |
| OfficeBench | GPT-4: 31.1 | 44.0 (Qwen3-8B) | 49.6 (Qwen2.5-7B) |
| AgentBench | GPT-4o: 38.1 | 42.6 (Qwen3-8B) | N/A |
Ablation analyses demonstrate that LLM-simulated trajectories yield similar or better downstream performance than real-environment data, with greater scalability. Using different LLM simulators (o4-mini, GPT-5) produces nearly indistinguishable downstream results, subject to domain effects. RL training on LLM-simulated environments converges more quickly and robustly than training against real test-bed APIs, plausibly due to richer and more consistent feedback.
6. Strengths, Limitations, and Prospective Extensions
Strengths
- Eliminates dependency on handcrafted environments or APIs for agent training.
- Supports data amplification and diversity, scaling SFT and RL to tens of thousands of trajectories with minimal engineering effort.
- Open-source models fine-tuned with Simia-RL achieve or surpass the performance of larger closed models on standard tool-use and workflow benchmarks.
Limitations
- Experimental evaluation is limited in scope (Airline, Retail, Office, WebShop, OS, Mind2Web).
- LLM simulators may introduce distributional shifts or biases relative to deployment environments.
- The binary reward function constrains the granularity of credit assignment; structured reward shaping remains open.
Potential Extensions
- Application to robotics, science education/tutoring, multi-agent coordination.
- Integration of hierarchical or curriculum-based RL leveraging subtask decomposition guided by LLM simulators.
- Ensembles of LLMs for more robust or diversified simulation feedback.
- Human-in-the-loop corrections to improve simulator accuracy during online fine-tuning.
In sum, Simia-RL establishes that LLM-based environment and reward simulation constitute a viable and scalable alternative to traditional environment engineering for both supervised and reinforcement learning of complex language agents, enabling flexible and robust training pipelines independent of real-world APIs or testbeds.