Scaling Agent Learning via Experience Synthesis (2511.03773v1)
Abstract: While reinforcement learning (RL) can empower LLM agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an experience replay buffer initialized with offline real-world data and continuously enriched with fresh interactions to actively support agent training. To improve knowledge acquisition, DreamGym adaptively generates new tasks that challenge the current agent policy, enabling more effective online curriculum learning. Experiments across diverse environments and agent backbones demonstrate that DreamGym substantially improves RL training, both in fully synthetic settings and in sim-to-real transfer scenarios. On non-RL-ready tasks like WebArena, DreamGym outperforms all baselines by over 30%. And in RL-ready but costly settings, it matches GRPO and PPO performance using only synthetic interactions. When transferring a policy trained purely on synthetic experiences to real-environment RL, DreamGym yields significant additional performance gains while requiring far fewer real-world interactions, providing a scalable warm-start strategy for general-purpose RL.
Explain it Like I'm 14
What is this paper about?
This paper introduces DreamGym, a new way to train smart computer agents (powered by LLMs) using reinforcement learning (RL) without relying on expensive, slow, and messy real-world interactions. Instead of practicing on real websites or apps all the time, DreamGym lets agents practice in a “dream” version of the world that is text-based, consistent, and easy to scale—like a safe, smart training simulator.
What questions did the researchers ask?
In simple terms, they asked:
- Can we train LLM agents to get better at multi-step tasks (like using websites or tools) using RL—but with far less cost, chaos, and setup?
- Can we create a “simulated experience” that is detailed and realistic enough for agents to learn useful skills?
- Can we generate lots of varied, challenging tasks automatically so agents keep improving?
- If we train agents in this simulator first, will they perform better and need fewer interactions when moved to the real world?
How did they do it?
Think of DreamGym like a smart training arena where the agent practices, gets feedback, and faces new challenges—without touching the real world too often.
Here’s how it works, using everyday ideas:
- A reasoning-based “experience model”: This is like a coach who understands the rules of the task. When the agent takes an action, the coach explains what would happen next (the next “state”) and gives a reward (like a score) based on step-by-step reasoning. It works in a clean text format instead of raw, messy data (like HTML code), so it’s fast and consistent.
- A memory (replay buffer): The coach looks up similar past experiences to stay accurate and avoid making things up. This memory is filled with both existing examples and new ones created during training, so it grows alongside the agent’s skills.
- An automatic curriculum (task generator): The system creates new, harder task variations over time. It picks tasks where the agent sometimes succeeds and sometimes fails—because those are the best for learning. This is like raising the difficulty as the player improves.
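To make these three pieces concrete, here is a minimal Python sketch (not the authors' implementation). The `llm.generate_transition`, `replay_buffer.top_k_similar`, and `replay_buffer.add` calls are hypothetical interfaces standing in for the paper's components, and task selection is scored with within-group reward variance, which for binary rewards is highest when the agent succeeds about half the time.

```python
def experience_model_step(llm, task, state, action, replay_buffer, k=3):
    """One synthetic step: the experience model reasons about what `action`
    does in `state`, grounded by similar past transitions from the buffer."""
    demos = replay_buffer.top_k_similar(state, action, k=k)  # hypothetical retrieval API
    prompt = (
        f"Task: {task}\nCurrent state: {state}\nAgent action: {action}\n"
        f"Similar past transitions:\n{demos}\n"
        "Reason step by step, then give the next state and a reward "
        "(1 if the task is now complete, 0 otherwise)."
    )
    next_state, reward = llm.generate_transition(prompt)     # hypothetical LLM wrapper
    replay_buffer.add(state, action, next_state, reward)     # the memory grows with the agent
    return next_state, reward


def select_curriculum_tasks(task_to_rewards, num_tasks):
    """Rank candidate tasks by within-group reward variance: tasks the agent
    sometimes solves and sometimes fails carry the most learning signal."""
    def reward_variance(rewards):
        p = sum(rewards) / len(rewards)   # success rate over a group of rollouts
        return p * (1.0 - p)              # 0 if always/never solved, largest when mixed
    return sorted(task_to_rewards,
                  key=lambda t: reward_variance(task_to_rewards[t]),
                  reverse=True)[:num_tasks]


# Example: among three task variants, keep the two most informative ones.
print(select_curriculum_tasks(
    {"task_a": [1, 1, 1, 1], "task_b": [1, 0, 1, 0], "task_c": [0, 0, 0, 1]}, 2))
# -> ['task_b', 'task_c']
```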
To train the agent, they use standard RL methods (like PPO and GRPO). In RL, the agent tries actions, gets rewards, and updates its policy—the “strategy” it uses—to do better next time. DreamGym supplies the agent with lots of high-quality practice runs (called “rollouts”) inside the simulator.
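And here is a tiny, generic sketch of the group-relative trick that GRPO-style training applies to these rollouts: each rollout's reward is compared to the other rollouts sampled for the same task, so no separate value model is needed. This illustrates the idea only, not the paper's exact update.

```python
def grpo_advantages(group_rewards, eps=1e-6):
    """Normalize each rollout's reward against the group sampled for the same task."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = (sum((r - mean) ** 2 for r in group_rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four rollouts of one task with binary outcome rewards:
print(grpo_advantages([1, 0, 0, 1]))  # successes get positive advantages, failures negative
```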
What did they find, and why is it important?
The results show DreamGym really helps:
- In places that aren’t ready for RL (like WebArena, which is hard to automate and reset), DreamGym-enabled agents beat all baselines by over 30%. This makes RL possible where it used to be impractical.
- In RL-ready but expensive environments (like WebShop and ALFWorld), agents trained only in DreamGym matched the performance of strong RL methods (PPO and GRPO) that used 80,000 real interactions—without using any real interactions during training.
- Sim-to-real transfer: If you first train the agent in DreamGym and then do a small amount of RL in the real environment, performance improves by over 40% compared to starting from scratch. Even better, this needs less than 10% of the usual real-world data. That means saving time, money, and effort.
- Training is faster and more stable: DreamGym’s clean, reasoned feedback makes learning smoother and less noisy than real-world training, which often has delayed or misleading rewards.
- Generalization: Agents trained in DreamGym learn patterns they can carry across similar domains (for example, learning on one website environment helps on another). Extremely different domains (like web tasks versus 3D embodied tasks) are harder, but the approach still shows promise.
Why does this matter?
- Lower cost, higher scale: Training agents in real environments is slow, expensive, and risky. DreamGym makes it cheaper and safer by using synthetic, well-structured experiences.
- Better learning signals: By explaining outcomes step-by-step, DreamGym provides clear rewards and consistent state changes, which makes RL more efficient.
- Practical deployment: It helps teams bootstrap strong agents quickly, then fine-tune them with far fewer real-world interactions.
- Opens RL to new areas: Some environments are too messy or unsafe for RL today. DreamGym makes RL feasible there by simulating reliable experiences.
- Future potential: With better “dream worlds,” we could train general-purpose agents that adapt across many tasks and environments, further reducing reliance on costly real-world training.
In short, DreamGym shows that you don’t always need perfect realism to train powerful agents—you need smart, diverse, well-reasoned practice. This makes RL for LLM agents more accessible, scalable, and effective.
Knowledge Gaps
Unresolved knowledge gaps, limitations, and open questions
Below is a concise list of concrete gaps and questions left open by the paper that future researchers can directly act on:
- Quantify and bound the fidelity gap between synthetic and real environments: measure how T_exp and R_exp diverge from T_real and R_real (e.g., transition accuracy, reward precision/recall) across tasks and domains.
- Strengthen theoretical guarantees: extend the trust-region lower bound to cover model mismatch, stochastic dynamics, and partial observability; provide convergence/performance guarantees under distribution shift between synthetic and real MDPs.
- Reward design and shaping: the paper uses binary outcome rewards; investigate shaping via intermediate subgoals inferred from reasoning traces, dense feedback derived from state changes, and the impact on credit assignment and stability.
- Reward correctness calibration: develop programmatic or human-in-the-loop verifiers to estimate false-positive/false-negative rates in synthetic rewards, and study how reward errors affect policy learning and sim-to-real transfer.
- Task feasibility validation: specify and implement automatic validators that filter unrealistic or malformed generated tasks; report feasibility rates and their training impact.
- Curriculum policy analysis: formalize the properties of entropy-based task selection; compare against difficulty estimators, uncertainty-driven curricula, and success-rate banding; ablate the curriculum mixing ratio λ and selection thresholds.
- Experience model updating: clarify whether M_exp is frozen or updated online; study continual fine-tuning strategies, drift control, and catastrophic forgetting; quantify how online updates change rollout quality and RL outcomes.
- Replay buffer and retrieval design: specify the semantic encoder φ; study retrieval quality vs. buffer size, aging/priority mechanisms, on/off-policy discrepancies, and methods to prevent staleness or bias in top-k demonstrations.
- Hallucination and consistency metrics: replace or complement GPT-4o judging with task-grounded automatic checks; standardize metrics for consistency, informativeness, and hallucination; report confidence intervals and statistical significance.
- Baseline completeness: compare against world-model RL baselines (e.g., Dreamer variants, WebDreamer used for RL) and LLM-based simulators with RL rather than only IL; ensure fair, apples-to-apples cost and performance accounting.
- Bridging large domain gaps: the method underperforms in web→embodied transfer; explore multimodal meta-representations (vision, UI graphs), better action-space alignment, and domain adaptation techniques to improve cross-domain generalization.
- Modeling stochasticity and non-stationarity: introduce stochastic transitions and time-varying dynamics in M_exp, reflecting noisy real environments (web/GUI); evaluate robustness under changing interfaces and reward noise.
- Action/state mapping for sim-to-real: formalize the mapping from abstract synthetic S, A to real environment observations/actions; quantify mapping error, automate alignment, and measure its effect on transfer performance.
- Safety in sim-to-real: assess whether synthetic training induces risky real-world behaviors; integrate safe-RL constraints, reversible action filters, and reliable reset policies; report safety incident rates during transfer.
- Scalability and cost accounting: provide an end-to-end compute breakdown (LLM inference for annotation, training M_exp, synthetic generation, RL updates); analyze scaling with model size, rollout parallelism, and inference latency.
- Long-horizon robustness: evaluate on truly long, sparse tasks (complex web workflows, OS/desktop control like OSWorld); study history length requirements, memory bottlenecks, and degradation with horizon.
- Reward hacking and overfitting to M_exp: detect whether agents exploit simulator quirks; design adversarial tests, domain randomization, and regularizers to improve robustness and prevent overfitting to synthetic dynamics.
- Data quality and contamination: ensure strict train/test separation for offline trajectories; measure sensitivity to noisy or biased demonstrations and LLM-generated reasoning annotations; study robustness in low-quality data regimes.
- Formal MDP properties of the abstract state space: verify Markov and causal consistency; define a grammar/ontology for S and A that enforces well-formed transitions and reliable planning across environments.
Practical Applications
Immediate Applications
The following applications leverage DreamGym’s reasoning-based experience model, curriculum task generation, and replay-buffered synthetic rollouts to deliver deployable value now, with clear tools and workflows and noted dependencies.
- Software and web automation for e-commerce and support — Software
- Train robust web agents for product search, cart/checkout, ticket triage, and FAQ navigation without hitting production sites by using DreamGym’s synthetic rollouts; then warm-start with small sim-to-real RL.
- Potential tools/workflows: REMaaS (Reasoning Experience Model as a Service) for web states and rewards; a Browser Sandbox Connector; a Sim-to-Real Warm-Start pipeline; an Agent Training Console with entropy-based curriculum.
- Dependencies/assumptions: a consistent state/action mapping from abstract states to real DOM/UX; access to de-identified offline logs to seed the replay buffer; guardrails for safe deployment; outcome-based reward sufficiency for target KPIs.
- Enterprise GUI automation and RPA — Software
- Automate repetitive ERP/CRM workflows (record creation, report generation, field reconciliation) by pretraining agents entirely in DreamGym’s synthetic GUI environment, avoiding heavy VM/Docker infrastructure.
- Potential tools/workflows: Abstract State Mapper (GUI → textual schema); Replay Buffer seeded with click-stream logs; Curriculum Task Generator for edge-case coverage; policy gating for production.
- Dependencies/assumptions: reliable UI element-to-action mapping; adequate offline trajectory quality; access controls and data privacy for logs; stability of reward signals across app versions.
- Academic research and benchmarking of agent RL — Academia
- Use DreamGym as a unified, reproducible RL training bed for long-horizon agent tasks, enabling ablations across GRPO/PPO, replay strategies, and curriculum policies at a fraction of real rollout cost.
- Potential tools/workflows: DreamGym SDK; standardized benchmarking suites; auto-ablation scripts; reward audit dashboards.
- Dependencies/assumptions: availability of public offline datasets to train the experience model; reported metrics aligned with downstream tasks; transparent evaluation of sim-to-real transfer.
- Adaptive task generation for tutoring in math/coding — Education
- Build tutors that adapt difficulty via reward entropy, generating progressively challenging problem variants while providing multi-turn, reasoned feedback and rewards.
- Potential tools/workflows: Curriculum Task Generator API; Reasoned Feedback Synthesizer; student-level progress tracking; group-relative difficulty calibration (GRPO).
- Dependencies/assumptions: accurate outcome rewards (solution verification, test suites); pedagogy-aligned difficulty ramps; content safety (no harmful code/data).
- Synthetic dataset augmentation for SFT/DPO — Software/Academia
- Expand training corpora with diverse trajectory variants and task instructions derived from DreamGym’s experience model to improve supervised baselines and reduce label costs.
- Potential tools/workflows: Trajectory Diversifier; Task Variant Generator; quality filters using entropy, consistency, and hallucination judges.
- Dependencies/assumptions: quality scoring to avoid compounding bias; domain coverage checks; IP/privacy constraints on seed data.
- Agent safety testing and red-teaming — Policy/Software
- Simulate risky or irreversible interactions (deletions, sensitive data exposure) in synthetic environments to uncover failure modes before real-world deployment.
- Potential tools/workflows: Adversarial Curriculum Generator; Failure-Mode Replay Library; sandboxed evaluation harness with policy-gradient tracking.
- Dependencies/assumptions: faithful abstraction of risky operations; comprehensive edge-case modeling; alignment of synthetic reward penalties with organizational risk frameworks.
- MLOps cost reduction for agent RL — Software
- Replace heterogeneous VM/Docker-based rollouts with DreamGym’s lightweight abstract state transitions hosted in scalable LLM services, cutting sampling time and GPU hours.
- Potential tools/workflows: LLM-hosted Experience Service; batch rollout orchestration; semantic retrieval for replay; metrics for efficiency vs. performance.
- Dependencies/assumptions: service-level reliability; model sizing and latency budgets; consistent token-efficient state design.
- Personal productivity agents — Daily life
- Train agents for email triage, calendar coordination, and form filling using synthetic workflows (e.g., inbox schemas, scheduling conflicts), then lightly fine-tune on a user’s data.
- Potential tools/workflows: Personal Workflow Simulator; user-specific State Mapper; minimal sim-to-real RL with privacy-preserving feedback.
- Dependencies/assumptions: privacy-preserving data ingestion; accurate abstraction of user tools (email/calendar); outcome reward definitions (e.g., meeting scheduled, correct label).
- Back-office finance portal navigation — Finance
- Pretrain agents to reconcile transactions, download statements, and submit reports using synthetic portals and reward-verified tasks, then transfer with small-scale real RL.
- Potential tools/workflows: Finance Portal Simulator; compliance-aware task generator; audit logs with entropy-based task difficulty tracking.
- Dependencies/assumptions: regulatory constraints on synthetic-to-real mapping; accurate verification signals (balance reconciliation, reconciliation completeness); robust credential handling.
Long-Term Applications
These applications require further research, scaling, or development (e.g., richer state mappings, safety assurance, domain-specific reward design, or multi-environment generalization) before broad deployment.
- Clinical workflow assistants navigating EHRs — Healthcare
- Train agents to assist with order entry, chart review, and documentation in synthetic EHR environments, then cautiously introduce sim-to-real RL under strict compliance.
- Potential tools/workflows: DreamGym-EHR with de-identified logs; clinical reward models (task completion + safety checks); policy oversight with human-in-the-loop review.
- Dependencies/assumptions: HIPAA/GDPR compliance; validated reward functions beyond outcome-based binary signals; high-fidelity state mapping to real EHRs; rigorous safety evaluation.
- Embodied robotics for household or warehouse tasks — Robotics
- Use DreamGym to pretrain high-level planning (textual/meta states) and integrate with low-level control; transfer to sim and then real robots for navigation/manipulation.
- Potential tools/workflows: Hybrid Planner (LLM high-level + controller low-level); Embodied State Mapper; sim-to-real pipelines; dense subgoal rewards.
- Dependencies/assumptions: bridging abstract textual states to sensorimotor realities; domain gap management; reliable reward shaping; safety constraints in physical environments.
- Industrial process control and energy grid ops — Energy/Industrial
- Train agents to monitor dashboards, propose interventions, and escalate alarms with synthetic control rooms to reduce real-world risk before pilot deployments.
- Potential tools/workflows: Control-Room Simulator; safety-critical curriculum emphasizing edge cases; operator-in-the-loop RL with strict trust-region constraints.
- Dependencies/assumptions: safety certification; high-fidelity state-action abstractions; interpretation of complex temporal dynamics; regulatory approval pathways.
- Universal multi-environment world models for agents — Academia/Software
- Extend DreamGym into a unified world model spanning web, GUI, tools, and embodied domains to learn domain-agnostic behavioral priors and enable zero-shot adaptation.
- Potential tools/workflows: Multi-Environment REM; cross-domain replay federation; transferability evaluators; adaptive state harmonization.
- Dependencies/assumptions: modular action/state abstractions; scalable retrieval across heterogeneous trajectories; mitigation of domain gaps (observed limitations across web → embodied).
- Public-sector digital services assistants — Policy
- Train agents to help citizens complete complex benefit applications, tax forms, or licensing, using synthetic government portals to prepare for real-world constraints.
- Potential tools/workflows: GovPortal DreamGym; compliance-aware rewards and audit trails; accessibility-aware task generation; human oversight programs.
- Dependencies/assumptions: procurement and security requirements; fairness and accessibility metrics; careful reward alignment to policy outcomes; multilingual support.
- Compliance-aware RL pipelines for regulated domains — Finance/Healthcare/Policy
- Integrate reward auditability, traceable chain-of-thought, and entropy-based curriculum tuning into standardized training pipelines with compliance monitoring.
- Potential tools/workflows: Reward Audit Tooling; CoT provenance logs; bias/fairness dashboards; gated policy updates with rollback.
- Dependencies/assumptions: accepted governance standards; explainability requirements; mapping of synthetic failures to real operational risks.
- Multi-agent training in synthetic ecosystems — Software/Academia
- Simulate interacting agents (cooperation/competition) for marketplaces, customer support swarms, or tool orchestration ensembles, then transfer to controlled real pilots.
- Potential tools/workflows: Multi-Agent DreamGym; reward shaping for collective outcomes; conflict resolution curricula; orchestration schedulers.
- Dependencies/assumptions: scalable experience synthesis for agent ensembles; robust credit assignment; avoidance of emergent unsafe behaviors.
- Autonomously maintained agent workflows (“self-evolving” agents) — Software
- Agents continuously generate new tasks with entropy criteria, learn from synthetic experiences, and periodically validate in real settings, minimizing ongoing human curation.
- Potential tools/workflows: Continuous Curriculum Engine; periodic sim-to-real validation jobs; drift detection; replay pruning and governance.
- Dependencies/assumptions: reliable drift and failure detection; lifecycle governance; limits on autonomy in safety-critical contexts; continuous state mapping maintenance.
Glossary
- Actor–critic methods: A class of reinforcement learning algorithms that combine a policy (actor) with a value estimator (critic) to improve learning stability and efficiency. "Classical RL algorithms such as policy gradients and actor–critic methods~\citep{williams1992simple, schulman2017proximal} have achieved strong results in robotics, games, and control~\citep{Silver2016AlphaGo}."
- Advantage function: A measure of how much better an action is compared to the average action at a given state, used to guide policy updates. "where Â_t is the advantage function, estimating how favorable an action is compared to others."
- ALFWorld: A benchmark environment involving multi-turn embodied control tasks in simulated 3D spaces for evaluating agent capabilities. "ALFWorld~\citep{shridharalfworld}, which involves multi-turn tool-based embodied control to navigate 3D environments;"
- Chain-of-thought (CoT): A prompting technique that elicits step-by-step reasoning in LLMs to improve consistency and correctness. "via chain-of-thought (CoT)~\citep{wei2022chain}:"
- Curriculum-based task generation: A strategy to progressively create harder tasks based on the agent’s current performance to improve learning efficiency. "we propose curriculum-based task generation, where the same experience model actively generates new tasks as variations of a set of seed tasks:"
- Curriculum task generator: A component that selects and produces future tasks with high learning value using reward entropy to guide curriculum design. "tasks with high reward entropy are proposed by the curriculum task generator for future training."
- Direct Preference Optimization (DPO): An offline alignment method that optimizes model outputs using pairwise human preference data. "direct preference optimization (DPO)~\citep{rafailov2023direct};"
- Discount factor: A scalar in [0, 1] that weights future rewards relative to immediate rewards in cumulative return computations. "γ is the discount factor, and ρ_0 specifies the initial state distribution that includes the task instruction."
- DreamGym-S2R (sim-to-real): A training approach that pretrains agents with synthetic experiences before transferring to real environments for improved sample efficiency. "Moreover, we introduce DreamGym-S2R (sim-to-real), which first trains agents in DreamGym using diverse, curriculum-driven synthetic experiences before transferring them to external environments."
- Experience replay buffer: A memory structure storing trajectories that can be retrieved to condition state predictions and stabilize learning. "DreamGym equips the experience model with an experience replay buffer, from which it retrieves similar yet diverse trajectories to guide its current state prediction."
- Generalized Advantage Estimation (GAE): A method to compute advantages that balances bias and variance using a decay parameter. "PPO~\citep{schulman2017proximal} is a popular policy gradient method that improves stability by computing Â_t with Generalized Advantage Estimation (GAE):"
- Goal-conditioned RL: Reinforcement learning where policies are conditioned on explicit task goals or instructions. "leaving current environments insufficient for goal-conditioned RL."
- Group Relative Policy Optimization (GRPO): An RL algorithm that normalizes rewards within a group of responses to compute advantages without a value function. "GRPO~\citep{shao2024deepseekmath} extends PPO by discarding the value function and normalizing advantages within each group of responses sampled for the same task instruction."
- Hyperparameter λ: A training control parameter that limits the proportion of synthetic tasks introduced per iteration to stabilize curriculum. "we introduce a hyperparameter λ that bounds the proportion of synthetic tasks sampled per iteration."
- Imitation learning: Learning policies by mimicking expert trajectories rather than optimizing rewards directly. "training agents primarily through imitation learning~\citep{yao2022webshop, pahuja2025explorer, deng2023mind2web}."
- Markov Decision Process (MDP): A formal framework for sequential decision-making defined by states, actions, transitions, rewards, and discounting. "We formalize the agent learning problem as a Markov Decision Process (MDP)~\citep{bellman1957markovian}, defined by the tuple (S, A, T, R, γ, ρ_0)."
- Meta-representational textual space: An abstract textual state space that captures environment dynamics without raw observations to enable efficient reasoning and synthesis. "we design an efficient reasoning experience model, denoted as $\mathcal{M}_{\text{exp}}$, that operates in an abstract, meta-representational textual space."
- On-policy experiences: Trajectories generated by the current policy used to update that same policy. "which interact with agents to generate unlimited on-policy experiences."
- Outcome-based reward scheme: A reward design that provides a terminal reward only upon successful task completion, otherwise zero. "we adopt an outcome-based reward scheme, assigning r = 1 only at the final step when the task is successfully completed and r = 0 in all other cases."
- Policy gradient: A family of methods that directly optimize policy parameters via gradients of expected returns. "optimize π_θ via policy gradient as follows:"
- Probability simplex: The set of all probability distributions over a discrete set, used to denote distributions over states or actions. "where Δ(S) denotes the probability simplex over S."
- Proximal Policy Optimization (PPO): A policy gradient algorithm that stabilizes updates via clipped objectives and advantage estimation. "PPO~\citep{schulman2017proximal} is a popular policy gradient method that improves stability by computing Â_t with Generalized Advantage Estimation (GAE):"
- Reasoning-based experience model: An LLM-driven simulator that predicts next states and rewards through explicit reasoning traces conditioned on context. "DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning"
- Reward entropy: The variance of rewards within a group of rollouts for the same task, measuring task difficulty and informativeness. "we introduce a group-based reward entropy as a criteria for selecting high-quality and challenging tasks."
- Reward sparsity: A condition where informative rewards are rare or delayed, making learning unstable and slow. "With this unified design, DreamGym addresses both task and reward sparsity while enabling scalable RL with diverse and curriculum-driven environments."
- Rollouts: Sequences of state-action transitions collected from interacting with an environment or simulator. "its practical adoption remains challenging due to costly rollouts"
- Semantic encoder: A model that embeds states and actions into a representation space to compute similarity for retrieval. "where φ denotes an arbitrary semantic encoder."
- Sim-to-real (S2R): Transferring policies learned in simulation or synthetic environments to real-world environments. "We further extend DreamGym to a sim-to-real (S2R) setting, where the agent policy is first trained with synthetic experiences and then transferred to RL in real environments."
- State space: The set of all possible states that describe the environment’s configuration observable by the agent. "where S denotes the state space"
- Top-k demonstrations: The k most similar past trajectories retrieved to guide current state prediction and reduce hallucination. "top-k demonstrations retrieved from the replay buffer"
- Transition function: A function defining the probabilistic dynamics of moving from one state to another given an action. "The transition function T governs the environment dynamics"
- Trust-region assumptions: Constraints on policy update magnitudes that ensure stable improvements and theoretical guarantees. "an analytical lower bound of the policy improvement in real environments when training with purely synthetic experiences from DreamGym under trust-region assumptions"
- Value function: An estimator of expected return from a state, used to compute advantages and guide policy updates. "where V is a value function approximated by an LLM,"
- World model: A model that simulates environment dynamics to generate feedback and trajectories without interacting with the real world. "constructed world models to produce environment feedback for agent planning and training."