Simia-RL: LLM-Simulated RL Framework
- Simia-RL is a reinforcement learning framework that uses LLMs to simulate both environmental dynamics and reward feedback, enabling robust and scalable agent training.
- It integrates a simulated MDP with PPO/GRPO-style updates, employing prompt-based queries to generate diverse, synthetic trajectories for supervised and RL modalities.
- Empirical evaluations show that Simia-RL achieves competitive performance on complex tool-use and dialogue benchmarks while reducing manual environment engineering.
The Simia-RL framework is a reinforcement learning methodology in which both environment transitions and reward feedback are simulated by an LLM rather than by a hard-coded environment or real-world API. Conceived as part of a broader effort to eliminate brittle, labor-intensive environment engineering for language agents, Simia-RL enables scalable, environment-agnostic agent training. By leveraging LLMs to simulate both environment feedback and reward computation, Simia-RL permits training and evaluation of agents entirely in synthetic contexts, supporting both supervised and reinforcement learning modalities. The framework is positioned as a unified solution for robust language agent training, with demonstrated empirical gains on multiple complex tool-use and dialogue benchmarks.
1. Markov Decision Process Formulation
Simia-RL is defined as a standard reinforcement learning problem over a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where:
- $\mathcal{S}$ is the state space, comprising full agent–environment dialogue histories, tool-use contexts, and schema information.
- $\mathcal{A}$ is the action space, including free-form text replies and structured tool calls.
- $\gamma$ is the discount factor.
The key distinction is that both the transition function $P$ and reward function $R$ are implemented by LLM-based simulators, not by hand-coded or real environments. Formally, $s_{t+1} \sim P_{\mathrm{LLM}}(\cdot \mid s_t, a_t)$ and $r_t = R_{\mathrm{LLM}}(s_t, a_t, s_{t+1})$, where $P_{\mathrm{LLM}}$ and $R_{\mathrm{LLM}}$ are outputs of the LLM environment simulator and reward model, respectively. The episodic reward is binary, with successful task completion (as defined by oracle criteria) yielding $r = 1$ and failure yielding $r = 0$.
The agent objective is to maximize the expected discounted return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]$, where trajectories $\tau$ are simulated entirely with LLM-based feedback.
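To make the formulation concrete, the following is a minimal sketch of a single episode rollout in this simulated MDP; the `policy`, `llm_transition`, and `llm_reward` callables are hypothetical placeholders for the agent and the prompt-based simulator queries, not names from the source.

```python
from typing import Callable, List, Tuple

def rollout_episode(policy: Callable[[str], str],
                    llm_transition: Callable[[str, str], str],
                    llm_reward: Callable[[str, str, str], float],
                    s0: str,
                    max_turns: int = 25,
                    gamma: float = 1.0) -> Tuple[List[Tuple[str, str, float, str]], float]:
    """Roll out one episode in the LLM-simulated MDP; return transitions and discounted return."""
    s, transitions, ret = s0, [], 0.0
    for t in range(max_turns):
        a = policy(s)                   # a_t ~ pi_theta(. | s_t)
        s_next = llm_transition(s, a)   # s_{t+1} ~ P_LLM(. | s_t, a_t), a prompt-based query
        r = llm_reward(s, a, s_next)    # r_t = R_LLM(s_t, a_t, s_{t+1}), binary success signal
        transitions.append((s, a, r, s_next))
        ret += (gamma ** t) * r         # accumulate the discounted return
        s = s_next
    return transitions, ret
```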
2. Algorithmic Structure
Training proceeds in interleaved policy rollouts and updates, closely akin to PPO or GRPO, with all feedback provided by the LLM-based simulator.
High-Level Algorithm
- Initialize policy $\pi_\theta$ (typically an instruction-tuned transformer such as Qwen or Llama).
- For $N$ RL iterations (e.g., 64):
  - Collect $K$ simulated episodes; for each:
    - Start from an initial state $s_0$.
    - For up to max-turns $T$:
      - Sample action $a_t \sim \pi_\theta(\cdot \mid s_t)$.
      - Query the LLM simulator with $(s_t, a_t)$ to obtain $(s_{t+1}, r_t)$.
      - Store the transition $(s_t, a_t, r_t, s_{t+1})$.
  - Compute advantage estimates $\hat{A}_t$ (e.g., GAE; see the sketch after the pseudocode below).
  - Update $\theta$ by maximizing the clipped surrogate PPO or GRPO objective
    $$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(\rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$
    where $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$.
- Optional fine-tuning with supervised learning on high-reward trajectories.
Pseudocode
```
initialize θ                                              # policy parameters
for iteration in range(N):
    trajectories = []
    for rollout in range(K):
        s = s0                                            # initial state (task, schemas, history)
        for t in range(T):
            a = pi_theta(s)                               # sample agent action
            s_prime = LLM_env_simulator(prompt=(s, a))    # simulated environment transition
            r = LLM_reward_computer(prompt=(s, a, s_prime))  # simulated (binary) reward
            trajectories.append((s, a, r, s_prime))
            s = s_prime
    advantages = compute_advantages(trajectories)         # e.g., GAE
    update_theta(trajectories, advantages)                # PPO/GRPO with L_CLIP objective
```
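As a sketch of the `compute_advantages` and `update_theta` steps referenced above, the following shows a standard GAE estimator and the PPO clipped surrogate loss in PyTorch. This is a generic PPO-style implementation under assumed tensor shapes, not the authors' code: the clip threshold is left as a required argument because its value is not given in this article, the GAE parameters shown are common defaults, and a GRPO variant would instead normalize returns within a group of rollouts rather than use a learned value function.

```python
import torch

def compute_advantages(rewards: torch.Tensor,
                       values: torch.Tensor,
                       gamma: float = 1.0,
                       lam: float = 0.95) -> torch.Tensor:
    """Generalized Advantage Estimation over one episode (1-D tensors of length T)."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae, next_value = 0.0, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual delta_t
        gae = delta + gamma * lam * gae                      # recursive GAE accumulation
        advantages[t] = gae
        next_value = values[t]
    return advantages

def clipped_surrogate_loss(logp_new: torch.Tensor,
                           logp_old: torch.Tensor,
                           advantages: torch.Tensor,
                           clip_eps: float) -> torch.Tensor:
    """PPO clipped objective L_CLIP, negated so it can be minimized by gradient descent."""
    ratio = torch.exp(logp_new - logp_old)  # rho_t(theta) = pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```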
3. LLM-Based Simulation Details
Both the transition simulator and reward function are implemented as prompt-based queries to a large instruction-tuned LLM, typically o4-mini or GPT-5.
- Simulator/Reward Models: Designed atop transformer-based LLMs with context windows up to 60k tokens.
- Environment Feedback Prompts: Bundle tool schemas, available functions and signatures, interaction formatting instructions, and at least one exemplar trajectory.
- Reward Prompts: Include natural-language task success criteria, full history, and explicit instructions to return JSON-formatted success/failure with justification and confidence (see the prompt sketch after this list).
- Prompt Diversity and Amplification: Trajectories are generated with stochastic sampling (temperature = 1.0), supporting the construction of tens of thousands of unique episodes. The same environment prompt is reused, ensuring API consistency while retaining stochasticity in the simulator's outputs.
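As a concrete illustration of the environment-feedback and reward prompts described above, the sketch below shows how such prompts might be assembled and how the judge's JSON verdict maps onto the binary reward. The templates and field names are hypothetical and abbreviated; the source does not specify the exact wording.

```python
import json

# Hypothetical, abbreviated templates; the exact prompt wording is not given in the article.
ENV_PROMPT_TEMPLATE = """You are simulating the backend of a tool-use environment.
Available tools and signatures:
{tool_schemas}

Interaction formatting rules:
{format_instructions}

Exemplar trajectory:
{exemplar}

Conversation so far:
{history}

Agent's latest message or tool call:
{action}

Respond only with the environment's next observation."""

REWARD_PROMPT_TEMPLATE = """Task success criteria:
{criteria}

Full interaction history:
{trajectory}

Decide whether the task was completed successfully. Respond with JSON:
{{"success": true|false, "justification": "...", "confidence": 0.0-1.0}}"""

def parse_reward(judge_output: str) -> float:
    """Map the reward model's JSON verdict onto the binary episodic reward."""
    verdict = json.loads(judge_output)
    return 1.0 if verdict["success"] else 0.0
```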
4. Training Regime and Hyperparameters
Simia-RL employs the following RL training hyperparameters:
- Learning rate:
- Rollouts per iteration: 16 episodes
- Gradient steps: 64 per training epoch
- Mini-batch size: 32
- Clip parameter:
- KL penalty coefficient: 0.001
- Sampling temperature: 0.7; Top-P: 1.0
- Max rounds per episode: 25
- Max context length: $12,000$ tokens
Exploration incentives are implemented via entropy bonuses from GRPO. A cosine annealing scheduler with warm-up is used. No explicit data augmentation is needed, as the LLM-simulated environment naturally injects trajectory diversity due to its stochastic outputs.
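For reference, these settings might be collected in a single configuration object, sketched below. The field names are illustrative, and the learning rate and clip parameter are left unset because their concrete values are not listed above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SimiaRLConfig:
    """RL training hyperparameters as listed above (field names are illustrative)."""
    rollouts_per_iteration: int = 16          # episodes collected per RL iteration
    gradient_steps_per_epoch: int = 64
    mini_batch_size: int = 32
    kl_penalty_coef: float = 0.001
    sampling_temperature: float = 0.7
    top_p: float = 1.0
    max_rounds_per_episode: int = 25
    max_context_length: int = 12_000          # tokens
    lr_schedule: str = "cosine_with_warmup"   # cosine annealing with warm-up
    learning_rate: Optional[float] = None     # value not specified in this article
    clip_eps: Optional[float] = None          # value not specified in this article
```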
5. Empirical Evaluation and Performance
Experiments target benchmarks requiring realistic multi-tool, multi-turn agent interaction:
Benchmarks evaluated:
- τ-Bench (Airline, Retail), using GPT-4.1 as a user simulator.
- OfficeBench (2-apps, 3-apps workflows).
- AgentBench (Operating System, WebShop, Mind2Web).
Performance metrics (selected results):
| Benchmark | Closed-source Baseline | Simia (SFT) | Simia-RL (SFT+RL) |
|---|---|---|---|
| τ-Bench | GPT-4o: 54.2 | 58.9 (Qwen2.5-32B) | 51.0 (Qwen3-8B) |
| OfficeBench | GPT-4: 31.1 | 44.0 (Qwen3-8B) | 49.6 (Qwen2.5-7B) |
| AgentBench | GPT-4o: 38.1 | 42.6 (Qwen3-8B) | N/A |
Ablation analyses demonstrate that LLM-simulated trajectories yield similar or better downstream performance than real-environment data, with greater scalability. Using different LLM simulators (o4-mini, GPT-5) produces nearly indistinguishable downstream results, subject to domain effects. RL training on LLM-simulated environments converges more quickly and robustly than training against real test-bed APIs, plausibly due to richer and more consistent feedback.
6. Strengths, Limitations, and Prospective Extensions
Strengths
- Eliminates dependency on handcrafted environments or APIs for agent training.
- Supports data amplification and diversity, scaling SFT and RL to tens of thousands of trajectories with minimal engineering effort.
- Open-source models fine-tuned with Simia-RL achieve or surpass the performance of larger closed models on standard tool-use and workflow benchmarks.
Limitations
- Experimental evaluation is limited in scope (Airline, Retail, Office, WebShop, OS, Mind2Web).
- LLM simulators may introduce distributional shifts or biases relative to deployment environments.
- The binary reward function constrains the granularity of credit assignment; structured reward shaping remains open.
Potential Extensions
- Application to robotics, science education/tutoring, multi-agent coordination.
- Integration of hierarchical or curriculum-based RL leveraging subtask decomposition guided by LLM simulators.
- Ensembles of LLMs for more robust or diversified simulation feedback.
- Human-in-the-loop corrections to improve simulator accuracy during online fine-tuning.
In sum, Simia-RL establishes that LLM-based environment and reward simulation constitute a viable and scalable alternative to traditional environment engineering for both supervised and reinforcement learning of complex language agents, enabling flexible and robust training pipelines independent of real-world APIs or testbeds.