
ImagineBench: Synthetic Data for Offline RL

Updated 23 February 2026
  • ImagineBench is a standardized benchmark for offline RL that evaluates algorithms using both environment-collected rollouts and LLM-generated synthetic experiences.
  • It provides curated datasets and evaluation protocols across locomotion, robotic manipulation, and navigation tasks to assess performance on varied instruction complexities.
  • Empirical analysis reveals significant performance gaps on hard tasks, highlighting the need for robust algorithms and improved synthetic rollout quality.

ImagineBench is a standardized benchmark for the evaluation of offline reinforcement learning (RL) algorithms that utilize both environment-collected (“real”) rollouts and LLM-generated synthetic (“imaginary”) rollouts. Designed to address the absence of rigorous evaluation frameworks for LLM-imagined synthetic experiences in RL, ImagineBench provides curated datasets and protocols encompassing diverse tasks, modalities, and levels of instruction complexity. Its primary goal is to facilitate scientific progress in RL research where learning relies not only on environmental interaction data but also on large-scale, language-driven synthetic trajectories (2505.10010).

1. Formal Problem Definition and Benchmark Construction

ImagineBench operationalizes a goal-conditioned Markov Decision Process (MDP) defined as

(\mathcal{S}, \mathcal{A}, P, r, \gamma, \mathcal{G}),

with state space \mathcal{S} \subseteq \mathbb{R}^n, action space \mathcal{A} \subseteq \mathbb{R}^m (or discrete), transition kernel P, reward function r, discount factor \gamma \in (0, 1), and a set \mathcal{G} of natural language goals.

The benchmark provides two complementary datasets:

  • \mathcal{D}_{\text{real}} = \{\tau_{\text{real}}^i\}: Environment-collected rollouts.
  • \mathcal{D}_{\text{imag}} = \{\tau_{\text{imag}}^j\}: LLM-generated “imaginary” rollouts.

Both datasets are mixed into a composite set

\mathcal{D}_\alpha = \mathcal{D}_{\text{real}} \cup \alpha \cdot \mathcal{D}_{\text{imag}},

with \alpha controlling the synthetic data proportion. Experiments use \alpha = 1, i.e., 1:1 sampling. The RL objective is

J(\pi) = \mathbb{E}_{\tau \sim \mathcal{D}_\alpha}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right],

or equivalently as a mixture of expectations over the real and imaginary distributions. The success metric is the fraction of held-out evaluation episodes where the agent achieves the target goal.
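
As a concrete illustration, the sketch below computes the discounted return of a single trajectory and assembles \mathcal{D}_\alpha under one plausible reading of the weighted union; all function names are hypothetical and do not come from the ImagineBench codebase.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted return sum_t gamma^t * r_t of one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def build_mixed_dataset(d_real, d_imag, alpha=1.0, rng=None):
    """One plausible reading of D_alpha: keep every real rollout plus an
    alpha-fraction of the imaginary ones (alpha=1 keeps all of them,
    matching the paper's 1:1 setting)."""
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(len(d_imag), size=int(alpha * len(d_imag)), replace=False)
    return list(d_real) + [d_imag[i] for i in idx]
```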

2. Dataset Composition and Coverage

ImagineBench spans three domains—locomotion, robotic manipulation, and navigation—covering a spectrum of physical simulation, control, and language-grounded environments:

  • Locomotion (MuJoCo HalfCheetah): s \in \mathbb{R}^{18}, a \in \mathbb{R}^{6}. Tasks: Run-forward, Run-backward, Jump; complexity gradations range from paraphrasing to hierarchical sequencing.
  • Robotic Manipulation:
    • Meta-World: s \in \mathbb{R}^{91}, a \in \mathbb{R}^{4}; tasks like Reach, Push, Door-open; hard variants such as Locked-door-open and Make-coffee.
    • CLEVR-Robot: s \in \mathbb{R}^{10}, discrete action space of 40 movement directions; tasks like Move, Make-circle.
    • LIBERO-Object suite: s \in \mathbb{R}^{44}, a \in \mathbb{R}^{7}; tasks like Pick and Place, with hard sequential tasks.
  • Navigation (BabyAI): s \in \mathbb{Z}^{17}, a \in \{\text{move}, \text{drop}, \text{pickup}, \text{toggle}\}; tasks such as Goto, Pickup, Open, Put-next; hard tasks e.g. Open-lock, Put-pile.
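
For reference, the specifications above can be collected into a small lookup table; this dict is purely illustrative and is not part of the benchmark API.

```python
# State/action specifications as reported above (illustrative only).
ENV_SPECS = {
    "mujoco_halfcheetah": {"state_dim": 18, "action_dim": 6,  "actions": "continuous"},
    "metaworld":          {"state_dim": 91, "action_dim": 4,  "actions": "continuous"},
    "clevr_robot":        {"state_dim": 10, "action_dim": 40, "actions": "discrete"},
    "libero_object":      {"state_dim": 44, "action_dim": 7,  "actions": "continuous"},
    "babyai":             {"state_dim": 17, "action_dim": 4,
                           "actions": ["move", "drop", "pickup", "toggle"]},
}
```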

Rollout counts for each environment are summarized in the table below:

| Domain/Env  | N_{\text{real}} | N_{\text{imag}} |
|-------------|-----------------|-----------------|
| Meta-World  | 20,000          | 72,400          |
| CLEVR-Robot | 100,000         | 72,400          |
| BabyAI      | 19,200          | 19,200          |
| LIBERO      | 29,780          | 12,000          |
| MuJoCo      | 16,000          | 10,000          |

Real and synthetic rollouts are mixed on a per-batch basis during offline training.
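
A minimal sketch of this per-batch mixing, assuming rollouts have already been flattened into transition buffers; the function name is hypothetical.

```python
import numpy as np

def sample_mixed_batch(real_buffer, imag_buffer, batch_size, rng):
    """Draw half of each training batch from real transitions and half from
    imaginary transitions, matching the benchmark's 1:1 per-batch mixing."""
    n_real = batch_size // 2
    real_idx = rng.integers(0, len(real_buffer), size=n_real)
    imag_idx = rng.integers(0, len(imag_buffer), size=batch_size - n_real)
    batch = [real_buffer[i] for i in real_idx] + [imag_buffer[i] for i in imag_idx]
    rng.shuffle(batch)  # avoid a fixed real/imaginary ordering within the batch
    return batch
```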

3. Data Generation: Real and Imaginary Rollouts

Real rollouts are generated using expert policies:

  • Meta-World and CLEVR-Robot use existing offline datasets and demonstrations.
  • BabyAI relies on a rule-based policy for 19,200 trajectories.
  • LIBERO applies behavior cloning on public demonstrations, producing 29,780 rollouts.
  • MuJoCo uses a Soft Actor-Critic (SAC) trained expert for 16,000 rollouts.

Each rollout is paired with a natural language instruction G from a controlled vocabulary.

Imaginary rollouts are generated by:

  1. Fine-tuning Llama-2-7B-chat on real rollout/instruction pairs, supervised on three tasks: dynamics prediction, rollout explanation, and full trajectory generation.
  2. Prompting with initial state s_0 and goal G as “Generate a rollout for the following goal: [GOAL]. Rollout:”, yielding \{a_0, s_1, a_1, \ldots, s_T\}.
  3. Filtering rollouts by consistency with the goal, legality of transitions, and overall state plausibility (see the sketch after this list).
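
The filtering stage can be sketched as follows; the three predicates stand in for environment-specific checks, and none of these names come from the released code.

```python
def filter_rollouts(rollouts, goal_consistent, transition_legal, state_plausible):
    """Keep only imaginary rollouts that pass all three quality checks.

    Each predicate is environment-specific:
      goal_consistent(rollout)   -> bool  # does the rollout achieve its goal?
      transition_legal(s, a, s2) -> bool  # is each (s, a, s') step valid?
      state_plausible(s)         -> bool  # is each visited state plausible?
    """
    kept = []
    for ro in rollouts:  # ro = {"states": [s_0..s_T], "actions": [a_0..a_{T-1}], "goal": G}
        steps = zip(ro["states"][:-1], ro["actions"], ro["states"][1:])
        if (goal_consistent(ro)
                and all(transition_legal(s, a, s2) for s, a, s2 in steps)
                and all(state_plausible(s) for s in ro["states"])):
            kept.append(ro)
    return kept
```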

Quality assessment for the BabyAI Hard regime yields: 25.8% consistency, 72.9% transition correctness, and 66.8% legality.

Natural language instructions are organized into four levels (Training, Rephrasing, Easy, Hard), which enable systematic evaluation of generalization and composition in RL. At every timestep, the instruction G is encoded with BERT and concatenated to the state input.
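
A minimal sketch of this conditioning using Hugging Face Transformers; pooling the CLS token is an assumption, as the exact pooling scheme is not specified here.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def encode_instruction(goal: str) -> np.ndarray:
    """Embed a natural language goal with BERT (CLS-token pooling assumed)."""
    tokens = tokenizer(goal, return_tensors="pt")
    hidden = bert(**tokens).last_hidden_state  # shape (1, seq_len, 768)
    return hidden[0, 0].numpy()                # CLS embedding, shape (768,)

def build_observation(state: np.ndarray, goal: str) -> np.ndarray:
    """Concatenate the goal embedding to the raw state at each timestep."""
    return np.concatenate([state, encode_instruction(goal)])
```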

4. Evaluation Protocols, Algorithm Adaptation, and Metrics

Evaluation uses a strict success rate metric:

\text{SuccessRate} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}[\text{task}_i\ \text{completed}],

where \mathbb{I} is the indicator function. Automated checkers evaluate task completion per environment (e.g., < 5 cm error for manipulation; ≥ 85% semantic match for HalfCheetah locomotion).
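
Computing the metric is straightforward; the sketch below assumes a per-environment `check_success` callable standing in for the automated checkers.

```python
def success_rate(episodes, check_success) -> float:
    """Fraction of held-out evaluation episodes whose goal was achieved.

    `check_success(episode) -> bool` stands in for the environment-specific
    automated checker (e.g., final object pose within 5 cm of the target).
    """
    if not episodes:
        return 0.0
    return sum(check_success(ep) for ep in episodes) / len(episodes)
```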

The following offline RL baselines are implemented:

  • Behavior Cloning (BC)
  • Conservative Q-Learning (CQL)
  • Batch-Constrained deep Q-learning (BCQ)
  • Twin Delayed DDPG + BC (TD3+BC)
  • PRDC
  • COMBO
  • Soft Actor-Critic (SAC, offline)

For synthetic data, each algorithm is trained on the union dataset \mathcal{D}_{\text{real}} \cup \mathcal{D}_{\text{imag}} with equal batch sampling. Methods trained with imaginary rollouts are denoted with the suffix “w/ IR”.
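
As one concrete instantiation, a behavior cloning update on the mixed data might look like the sketch below (PyTorch, hypothetical names), reusing the sample_mixed_batch sampler sketched in Section 2.

```python
import numpy as np
import torch
import torch.nn.functional as F

def bc_with_ir_step(policy, optimizer, real_buffer, imag_buffer, batch_size, rng):
    """One behavior cloning update on an equal real/imaginary mix ("BC w/ IR").
    Buffer entries are (observation, action) pairs, with the goal embedding
    already concatenated to each observation."""
    batch = sample_mixed_batch(real_buffer, imag_buffer, batch_size, rng)
    obs = torch.as_tensor(np.stack([b[0] for b in batch]), dtype=torch.float32)
    act = torch.as_tensor(np.stack([b[1] for b in batch]), dtype=torch.float32)
    loss = F.mse_loss(policy(obs), act)  # continuous-action regression assumed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```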

5. Empirical Findings and Analysis

Quantitative results show a significant success gap on held-out hard tasks between models trained with imaginary rollouts and an oracle trained on real rollouts of those tasks. For MuJoCo hard tasks:

  • Best “w/ IR” method achieves 35.44% success rate.
  • Real rollout oracle training achieves 64.37%.

On Training tasks, nearly all methods, including those using synthetic data, exceed 90% success. On Rephrasing tasks, synthetic rollouts offer only modest benefit; CQL trained solely on real data remains comparatively robust due to its conservatism. On Easy tasks, including imaginary rollouts yields a 10–20 percentage point improvement over real-data-only baselines. On Hard tasks, performance remains under 40% even with imaginary data, whereas training on real rollouts of the novel tasks achieves approximately 64%.

Identified limitations include:

  • Quality gap in imaginary rollouts: Only ∼25% goal consistency for complex instructions.
  • Distribution mismatch: Imaginary rollouts exhibit transition and state biases; existing RL methods are not designed to accommodate such discrepancies.
  • Lack of hierarchical skill composition: Current RL methods underperform on tasks requiring long-horizon or hierarchical behavior composition.

6. Future Research Directions

Key avenues for advancing performance on ImagineBench are outlined as follows:

  • Algorithmic robustness: Developing offline RL algorithms that model uncertainty or explicit bias within \mathcal{D}_{\text{imag}}; incorporating automatic filtering or adaptive weighting of low-quality rollouts (one such weighting scheme is sketched after this list).
  • Fast online adaptation and continual learning: Applying meta-RL or regularization strategies to prevent catastrophic forgetting when incorporating limited real-world data post-deployment; bias correction to align policy learning with realistic dynamics.
  • Improved LLM-based rollout generation: Introducing physics-based or simulator-in-the-loop constraints to improve the verisimilitude of generated rollouts; exploring iterative co-training between the RL agent and the LLM generator.
  • Multi-modal extensions: Integrating vision-LLMs for pixel input environments; leveraging cross-modal attention for joint text/image action planning and rollout synthesis.
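
As an example of the adaptive-weighting direction, one simple hypothetical scheme samples imaginary transitions in proportion to an estimated quality score, so that low-quality rollouts contribute less to each update.

```python
import numpy as np

def quality_weighted_indices(quality_scores, batch_size, rng):
    """Sample imaginary transitions with probability proportional to a
    quality score in (0, 1], e.g., an estimated dynamics-consistency value.
    Purely illustrative; not part of the benchmark."""
    q = np.asarray(quality_scores, dtype=np.float64)
    return rng.choice(len(q), size=batch_size, p=q / q.sum())
```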

Code, datasets, and evaluation protocols are publicly available in the repository at https://github.com/LAMDA-RL/ImagineBench (2505.10010).

References

  • arXiv:2505.10010.
