
Simia-RL: LLM-Simulated RL Framework

Updated 9 November 2025
  • Simia-RL is a reinforcement learning framework that uses LLMs to simulate both environmental dynamics and reward feedback, enabling robust and scalable agent training.
  • It integrates a simulated MDP with PPO/GRPO-style updates, employing prompt-based queries to generate diverse, synthetic trajectories for supervised and RL modalities.
  • Empirical evaluations show that Simia-RL achieves competitive performance on complex tool-use and dialogue benchmarks while reducing manual environment engineering.

The Simia-RL framework is a reinforcement learning methodology in which both environment transitions and reward feedback are simulated by an LLM rather than produced by a hand-coded environment or real-world API. Conceived as part of a broader effort to eliminate brittle, labor-intensive environment engineering for language agents, Simia-RL enables scalable, environment-agnostic agent training. By using LLMs to simulate both environment feedback and reward computation, it permits training and evaluation of agents entirely in synthetic contexts, supporting both supervised and reinforcement learning modalities. The framework is positioned as a unified solution for robust language-agent training, with demonstrated empirical gains on multiple complex tool-use and dialogue benchmarks.

1. Markov Decision Process Formulation

Simia-RL is defined as a standard reinforcement learning problem over a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, \gamma)$, where:

  • $\mathcal{S}$ is the state space, comprising full agent–environment dialogue histories, tool-use contexts, and schema information.
  • $\mathcal{A}$ is the action space, including free-form text replies and structured tool calls.
  • $\gamma \in [0, 1]$ is the discount factor.

The key distinction is that both the transition function $T(s' \mid s, a)$ and the reward function $r(s, a)$ are implemented by LLM-based simulators, not by hand-coded or real environments. Formally, $(s', r_t) \sim \left( \widehat{T}(s' \mid s, a),\ \widehat{r}(s, a) \right)$, where $\widehat{T}$ and $\widehat{r}$ are outputs of the LLM environment simulator and reward model, respectively. The episodic reward is binary: successful task completion (as defined by oracle criteria) yields $r_t = 1$, and failure yields $r_t = 0$.

The agent objective is to maximize the expected discounted return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t = 0}^{T} \gamma^t r_t \right]$, where trajectories $\tau$ are simulated entirely with LLM-based feedback.
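
A minimal Python sketch of this return computation under the simulated MDP, assuming a hypothetical simulate_step(state, action) helper that wraps the LLM transition and reward simulators described above; the discount value and turn limit are illustrative:

def discounted_return(rewards, gamma=0.99):
    # Sum of gamma^t * r_t over one episode's rewards.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def rollout_return(policy, s0, simulate_step, max_turns=25, gamma=0.99):
    # Roll out one episode entirely against the LLM simulator and
    # return its discounted return (one sample contributing to J(theta)).
    s, rewards = s0, []
    for _ in range(max_turns):
        a = policy(s)                      # sample a_t ~ pi_theta(. | s_t)
        s, r, done = simulate_step(s, a)   # LLM-simulated (s_{t+1}, r_t) plus a termination flag
        rewards.append(r)
        if done:
            break
    return discounted_return(rewards, gamma)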

2. Algorithmic Structure

Training proceeds in interleaved policy rollouts and updates, closely following PPO or GRPO, with all feedback provided by the LLM-based simulator.

High-Level Algorithm

  1. Initialize the policy $\pi_\theta$ (typically an instruction-tuned transformer such as Qwen or Llama).
  2. For $N$ RL iterations (e.g., 64):
    • Collect $K$ simulated episodes: for each,
      • Start from an initial state $s_0$.
      • For up to max-turns:
        • Sample action $a_t \sim \pi_\theta(\cdot \mid s_t)$.
        • Query the LLM simulator with $(s_t, a_t)$ to obtain $(s_{t+1}, r_t)$.
        • Store $(s_t, a_t, r_t, s_{t+1})$.
  3. Compute advantage estimates $\hat{A}_t$ (e.g., via GAE).
  4. Update $\theta$ by maximizing the clipped surrogate PPO or GRPO objective:

    $$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}\left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left( r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon \right) \hat{A}_t \right) \right]$$

    where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$.

  5. Optional fine-tuning with supervised learning on high-reward trajectories.

Pseudocode

initialize θ                                                  # policy pi_theta (e.g., instruction-tuned Qwen/Llama)
for iteration in range(N):                                    # N RL iterations
    trajectories = []
    for rollout in range(K):                                  # K simulated episodes per iteration
        s = s0                                                # initial state: task, schemas, empty history
        for t in range(T):                                    # up to max-turns steps
            a = pi_theta(s)                                   # sample a_t ~ pi_theta(. | s_t)
            s_prime = LLM_env_simulator(prompt=(s, a))        # LLM-simulated transition
            r = LLM_reward_computer(prompt=(s, a, s_prime))   # LLM-simulated (binary) reward
            trajectories.append((s, a, r, s_prime))
            s = s_prime
    advantages = compute_advantages(trajectories)             # e.g., GAE
    update_theta(trajectories, advantages)                    # PPO/GRPO with L_CLIP objective
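
For the update in the last line, the clipped surrogate term can be written out as in the following PyTorch-style sketch. This is a generic PPO-style loss using the clip parameter reported below ($\epsilon = 0.28$), not the authors' exact implementation; log-probabilities and advantages are assumed to be precomputed per action:

import torch

def clipped_surrogate_loss(log_probs, old_log_probs, advantages, epsilon=0.28):
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space.
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Negate because optimizers minimize; the surrogate objective itself is maximized.
    return -torch.min(unclipped, clipped).mean()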

3. LLM-Based Simulation Details

Both the transition simulator and reward function are implemented as prompt-based queries to a large instruction-tuned LLM, typically o4-mini or GPT-5.

  • Simulator/Reward Models: Designed atop transformer-based LLMs with context windows up to 60k tokens.
  • Environment Feedback Prompts: Bundle tool schemas, available functions and signatures, interaction formatting instructions, and at least one exemplar trajectory.
  • Reward Prompts: Include natural-language task success criteria, full history, and explicit instructions to return JSON-formatted success/failure with justification and confidence.
  • Prompt Diversity and Amplification: Trajectories are generated with stochastic sampling (temperature = 1.0), supporting the construction of tens of thousands of unique episodes. The same environment prompt is reused, ensuring API consistency while still allowing stochastic variation in simulator outputs (see the sketch after this list).
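
The following minimal Python sketch illustrates how these prompt-based transition and reward queries might be wired together. The llm_complete helper, the prompt templates, the reward-query temperature, and the JSON field names are hypothetical placeholders, not the exact prompts or API used by the framework:

import json

def simulate_transition(llm_complete, tool_schemas, exemplar, history, action):
    # Environment-feedback prompt: bundle tool schemas and signatures, formatting rules,
    # and at least one exemplar trajectory, then ask for the next observation.
    prompt = (
        f"Tool schemas and signatures:\n{tool_schemas}\n\n"
        f"Example trajectory:\n{exemplar}\n\n"
        f"Dialogue so far:\n{history}\n\n"
        f"Agent action:\n{action}\n\n"
        "Respond with the environment's next observation."
    )
    return llm_complete(prompt, temperature=1.0)  # stochastic sampling for trajectory diversity

def compute_reward(llm_complete, success_criteria, full_history):
    # Reward prompt: natural-language success criteria plus the full history,
    # asking for JSON with success, justification, and confidence fields.
    prompt = (
        f"Task success criteria:\n{success_criteria}\n\n"
        f"Full interaction history:\n{full_history}\n\n"
        'Return JSON: {"success": true|false, "justification": "...", "confidence": 0.0-1.0}'
    )
    verdict = json.loads(llm_complete(prompt, temperature=0.0))  # low temperature assumed for judging
    return 1.0 if verdict.get("success") else 0.0  # binary episodic reward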

4. Training Regime and Hyperparameters

Simia-RL uses the following RL training hyperparameters:

  • Learning rate: $1 \times 10^{-6}$
  • Rollouts per iteration: 16 episodes
  • Gradient steps: 64 per training epoch
  • Mini-batch size: 32
  • Clip parameter: $\epsilon = 0.28$
  • KL penalty coefficient: 0.001
  • Sampling temperature: 0.7; Top-P: 1.0
  • Max rounds per episode: 25
  • Max context length: 12,000 tokens

Exploration incentives are provided via entropy bonuses in the GRPO objective. A cosine annealing learning-rate schedule with 10% warm-up is used. No explicit data augmentation is needed, as the LLM-simulated environment naturally injects trajectory diversity through its stochastic outputs.
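
As a compact reference, the reported settings can be collected into a configuration object, and the warm-up-plus-cosine schedule expressed as a learning-rate multiplier. This is an illustrative reconstruction of the listed hyperparameters, not the authors' training code:

import math
from dataclasses import dataclass

@dataclass
class SimiaRLConfig:
    # Values mirror the hyperparameters listed above.
    learning_rate: float = 1e-6
    rollouts_per_iteration: int = 16
    grad_steps_per_epoch: int = 64
    mini_batch_size: int = 32
    clip_epsilon: float = 0.28
    kl_coef: float = 0.001
    temperature: float = 0.7
    top_p: float = 1.0
    max_rounds_per_episode: int = 25
    max_context_tokens: int = 12_000

def lr_multiplier(step, total_steps, warmup_frac=0.1):
    # Linear warm-up over the first 10% of steps, then cosine annealing to zero.
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))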

5. Empirical Evaluation and Performance

Experiments target benchmarks requiring realistic multi-tool, multi-turn agent interaction:

Benchmarks evaluated:

  • $\tau^2$-Bench (Airline, Retail), using GPT-4.1 as the user simulator.
  • OfficeBench (2-apps, 3-apps workflows).
  • AgentBench (Operating System, WebShop, Mind2Web).

Performance metrics (selected results):

Benchmark       | Closed-source Baseline | Simia (SFT)        | Simia-RL (SFT+RL)
$\tau^2$-Bench  | GPT-4o: 54.2           | 58.9 (Qwen2.5-32B) | 51.0 (Qwen3-8B)
OfficeBench     | GPT-4: 31.1            | 44.0 (Qwen3-8B)    | 49.6 (Qwen2.5-7B)
AgentBench      | GPT-4o: 38.1           | 42.6 (Qwen3-8B)    | N/A

Ablation analyses demonstrate that synthetic, LLM-simulated trajectories yield similar or better downstream performance than real-environment data, with greater scalability. Using different LLM simulators (o4-mini, GPT-5) produces nearly indistinguishable downstream results, subject to domain effects. RL training on LLM-simulated environments converges more quickly and robustly than training against real test-bed APIs, plausibly due to richer and more consistent feedback.

6. Strengths, Limitations, and Prospective Extensions

Strengths

  • Eliminates dependency on handcrafted environments or APIs for agent training.
  • Supports data amplification and diversity, scaling SFT and RL to tens of thousands of trajectories with minimal engineering effort.
  • Open-source models fine-tuned with Simia-RL match or surpass the performance of larger closed models on standard tool-use and workflow benchmarks.

Limitations

  • Experimental evaluation is limited in scope (Airline, Retail, Office, WebShop, OS, Mind2Web).
  • LLM simulators may introduce distributional shifts or biases relative to deployment environments.
  • The binary reward function constrains the granularity of credit assignment; structured reward shaping remains open.

Potential Extensions

  • Application to robotics, science education/tutoring, and multi-agent coordination.
  • Integration of hierarchical or curriculum-based RL leveraging subtask decomposition guided by LLM simulators.
  • Ensembles of LLMs for more robust or diversified simulation feedback.
  • Human-in-the-loop corrections to improve simulator accuracy during online fine-tuning.

In sum, Simia-RL establishes that LLM-based environment and reward simulation constitutes a viable and scalable alternative to traditional environment engineering for both supervised and reinforcement learning of complex language agents, enabling flexible and robust training pipelines independent of real-world APIs or testbeds.
