ImagineBench: Synthetic Data for Offline RL
- ImagineBench is a standardized benchmark for offline RL that evaluates algorithms using both environment-collected rollouts and LLM-generated synthetic experiences.
- It provides curated datasets and evaluation protocols across locomotion, robotic manipulation, and navigation tasks to assess performance on varied instruction complexities.
- Empirical analysis reveals significant performance gaps on hard tasks, highlighting the need for robust algorithms and improved synthetic rollout quality.
ImagineBench is a standardized benchmark for the evaluation of offline reinforcement learning (RL) algorithms that utilize both environment-collected (“real”) rollouts and LLM-generated synthetic (“imaginary”) rollouts. Designed to address the absence of rigorous evaluation frameworks for LLM-imagined synthetic experiences in RL, ImagineBench provides curated datasets and protocols encompassing diverse tasks, modalities, and levels of instruction complexity. Its primary goal is to facilitate scientific progress in RL research where learning relies not only on environmental interaction data but also on large-scale, language-driven synthetic trajectories (2505.10010).
1. Formal Problem Definition and Benchmark Construction
ImagineBench operationalizes a goal-conditioned Markov Decision Process (MDP) defined as
$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma, \mathcal{G}),$$
with state space $\mathcal{S}$, action space $\mathcal{A}$ (continuous or discrete), transition kernel $P(s' \mid s, a)$, reward function $r(s, a, g)$, discount factor $\gamma$, and a set $\mathcal{G}$ of natural language goals.
The benchmark provides two complementary datasets:
- $\mathcal{D}_{\text{real}}$: Environment-collected rollouts.
- $\mathcal{D}_{\text{imaginary}}$: LLM-generated “imaginary” rollouts.
Both datasets are mixed into a composite set
$$\mathcal{D}_{\text{mix}} = \mathcal{D}_{\text{real}} \cup \mathcal{D}_{\text{imaginary}},$$
with a mixing ratio $\alpha$ controlling the synthetic data proportion. Experiments use $\alpha = 0.5$, i.e., 1:1 sampling. The RL objective is
$$\max_{\pi} \; \mathbb{E}_{g \sim \mathcal{G}}\,\mathbb{E}_{\tau \sim \mathcal{D}_{\text{mix}}}\!\left[\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t, g)\right],$$
or equivalently a mixture of expectations over the real and imaginary distributions. The success metric is the fraction of held-out evaluation episodes in which the agent achieves the target goal.
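The following minimal Python sketch (not part of the benchmark's code) illustrates the mixture-of-expectations view of this objective, assuming each trajectory is stored as a dict with a per-step "rewards" array:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted return sum_t gamma^t * r_t for one goal-conditioned trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sum(gamma ** np.arange(len(rewards)) * rewards))

def mixed_objective_estimate(real_trajs, imaginary_trajs, gamma=0.99, alpha=0.5):
    """Monte Carlo estimate of the objective as a mixture over both datasets.

    alpha is the synthetic-data proportion; alpha = 0.5 matches the 1:1
    sampling used in the benchmark's experiments.
    """
    j_real = np.mean([discounted_return(t["rewards"], gamma) for t in real_trajs])
    j_imag = np.mean([discounted_return(t["rewards"], gamma) for t in imaginary_trajs])
    return (1.0 - alpha) * j_real + alpha * j_imag
```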
2. Dataset Composition and Coverage
ImagineBench spans three domains—locomotion, robotic manipulation, and navigation—covering a spectrum of physical simulation, control, and language-grounded environments:
- Locomotion (MuJoCo HalfCheetah): continuous state and action spaces. Tasks: Run-forward, Run-backward, Jump; complexity gradations range from paraphrasing to hierarchical sequencing.
- Robotic Manipulation:
  - Meta-World: continuous state and action spaces; tasks like Reach, Push, Door-open; hard variants such as Locked-door-open and Make-coffee.
  - CLEVR-Robot: discrete action space of movement directions; tasks like Move, Make-circle.
  - LIBERO-Object suite: continuous state and action spaces; tasks like Pick, Place, with hard sequential tasks.
- Navigation (BabyAI): discrete action set (move, drop, pickup, toggle); tasks such as Goto, Pickup, Open, Put-next; hard tasks, e.g., Open-lock, Put-pile.
Rollout counts per environment are summarized in the table below:
| Domain/Env | Real rollouts | Imaginary rollouts |
|---|---|---|
| Meta-World | 20,000 | 72,400 |
| CLEVR-Robot | 100,000 | 72,400 |
| BabyAI | 19,200 | 19,200 |
| LIBERO | 29,780 | 12,000 |
| MuJoCo | 16,000 | 10,000 |
Real and synthetic rollouts are mixed on a per-batch basis during offline training.
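For illustration, this per-batch mixing can be realized by drawing half of every training batch from each buffer; the sketch below assumes each buffer is a dict of equally indexed NumPy arrays and is not the benchmark's reference implementation:

```python
import numpy as np

def sample_mixed_batch(real_buffer, imaginary_buffer, batch_size, rng=None):
    """Draw a 1:1 mixture of real and imaginary transitions for one gradient step.

    Each buffer is assumed to be a dict of arrays sharing the same length,
    e.g. {"obs": ..., "action": ..., "reward": ..., "next_obs": ..., "goal": ...}.
    """
    rng = rng or np.random.default_rng()
    half = batch_size // 2
    idx_r = rng.integers(0, len(real_buffer["obs"]), size=half)
    idx_i = rng.integers(0, len(imaginary_buffer["obs"]), size=batch_size - half)
    return {
        key: np.concatenate([real_buffer[key][idx_r], imaginary_buffer[key][idx_i]])
        for key in real_buffer
    }
```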
3. Data Generation: Real and Imaginary Rollouts
Real rollouts are generated using expert policies:
- Meta-World and CLEVR-Robot use existing offline datasets and demonstrations.
- BabyAI relies on a rule-based policy for 19,200 trajectories.
- LIBERO applies behavior cloning on public demonstrations, producing 29,780 rollouts.
- MuJoCo uses a Soft Actor-Critic (SAC) trained expert for 16,000 rollouts.
Each rollout is paired with a natural language instruction from a controlled vocabulary.
Imaginary rollouts are generated by:
- Fine-tuning Llama-2-7B-chat on real rollout/instruction pairs, supervised for three tasks: dynamics prediction, rollout explanation, and full trajectory generation.
- Prompting with the initial state and goal, using the template “Generate a rollout for the following goal: [GOAL]. Rollout:”, which yields an imagined trajectory of state-action steps.
- Filtering rollouts by consistency with the goal, legality of transitions, and overall state plausibility.
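A hedged sketch of this generation step using Hugging Face Transformers is shown below; the prompt string follows the description above, but the model path, decoding settings, and the quality-check predicates passed to `keep_rollout` are illustrative placeholders rather than the released pipeline:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: assumes Llama-2-7B-chat fine-tuned on real rollout/instruction pairs.
MODEL_ID = "path/to/llama-2-7b-chat-rollout-finetuned"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def imagine_rollout_text(initial_state, goal, max_new_tokens=512):
    """Prompt the fine-tuned LLM for one imaginary rollout, returned as raw text."""
    prompt = (
        f"State: {initial_state}\n"
        f"Generate a rollout for the following goal: {goal}. Rollout:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    # Strip the prompt tokens and return only the generated continuation.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

def keep_rollout(rollout, goal, checks):
    """Apply the quality filters (goal consistency, transition legality,
    state plausibility), each supplied as a predicate function in `checks`."""
    return all(check(rollout, goal) for check in checks)
```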
Quality assessment for the BabyAI Hard regime yields: 25.8% consistency, 72.9% transition correctness, and 66.8% legality.
Natural language instructions are organized into four levels (Training, Rephrasing, Easy, Hard), which enable systematic evaluation of generalization and composition in RL. At every timestep, instructions are encoded with BERT and concatenated to the state input.
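A minimal sketch of this conditioning step follows, assuming the widely used bert-base-uncased encoder with [CLS]-token pooling (the benchmark specifies only that BERT is used):

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_encoder = AutoModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def encode_instruction(instruction: str) -> np.ndarray:
    """Embed a natural language instruction with BERT ([CLS]-token pooling assumed)."""
    inputs = bert_tokenizer(instruction, return_tensors="pt", truncation=True)
    outputs = bert_encoder(**inputs)
    return outputs.last_hidden_state[0, 0].numpy()  # 768-dim embedding

def build_policy_input(state: np.ndarray, instruction: str) -> np.ndarray:
    """Concatenate the instruction embedding onto the raw environment state."""
    return np.concatenate([state, encode_instruction(instruction)])
```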
4. Evaluation Protocols, Algorithm Adaptation, and Metrics
Evaluation uses a strict success rate metric:
$$\text{SR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big[\text{goal } g_i \text{ achieved in episode } i\big],$$
where $\mathbb{1}[\cdot]$ is the indicator function and $N$ is the number of held-out evaluation episodes. Automated checkers evaluate task completion per environment (e.g., <5 cm positional error for manipulation; 85% semantic match for HalfCheetah locomotion).
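In code, this reduces to averaging per-episode success indicators returned by the environment checkers, for example:

```python
import numpy as np

def success_rate(episode_successes):
    """Fraction of held-out evaluation episodes whose goal was achieved.

    `episode_successes` is an iterable of booleans produced by the
    per-environment automated checkers.
    """
    successes = np.asarray(list(episode_successes), dtype=float)
    return float(successes.mean()) if successes.size else 0.0

# Example: 7 successful episodes out of 10 -> 0.7
print(success_rate([True] * 7 + [False] * 3))
```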
The following offline RL baselines are implemented:
- Behavior Cloning (BC)
- Conservative Q-Learning (CQL)
- Batch-Constrained deep Q-learning (BCQ)
- Twin Delayed DDPG + BC (TD3+BC)
- PRDC
- COMBO
- Soft Actor-Critic (SAC, offline)
Each algorithm is trained either on real rollouts alone or on the union dataset $\mathcal{D}_{\text{real}} \cup \mathcal{D}_{\text{imaginary}}$ with equal per-batch sampling; variants trained with imaginary rollouts are denoted with the suffix “w/ IR”.
5. Empirical Findings and Analysis
Quantitative results show a significant success gap between models trained with imaginary rollouts and those trained on real rollouts for held-out, hard tasks. For MuJoCo hard tasks:
- Best “w/ IR” method achieves 35.44% success rate.
- Real rollout oracle training achieves 64.37%.
On Training tasks, nearly all methods—including those with synthetic data—exceed 90% success. For Rephrasing tasks, synthetic rollouts offer modest benefit, except for CQL trained only on real data, which is more robust due to its conservatism. The inclusion of imaginary rollouts in Easy tasks yields a 10–20 percentage point improvement over real-data-only baselines. For Hard tasks, performance remains under 40% even with imaginary data, whereas real rollout training on the novel tasks achieves approximately 64%.
Identified limitations include:
- Quality gap in imaginary rollouts: Only ∼25% goal consistency for complex instructions.
- Distribution mismatch: Imaginary rollouts exhibit transition and state biases; existing RL methods are not designed to accommodate such discrepancies.
- Lack of hierarchical skill composition: Current RL methods underperform on tasks requiring long-horizon or hierarchical behavior composition.
6. Future Research Directions
Key avenues for advancing performance on ImagineBench are outlined as follows:
- Algorithmic robustness: Developing offline RL algorithms that model uncertainty or explicit bias within $\mathcal{D}_{\text{imaginary}}$; incorporating automatic filtering or adaptive weighting of low-quality rollouts (a sketch of such weighting appears after this list).
- Fast online adaptation and continual learning: Applying meta-RL or regularization strategies to prevent catastrophic forgetting when incorporating limited real-world data post-deployment; bias correction to align policy learning with realistic dynamics.
- Improved LLM-based rollout generation: Introducing physics-based or simulator-in-the-loop constraints to improve the verisimilitude of generated rollouts; exploring iterative co-training between the RL agent and the LLM generator.
- Multi-modal extensions: Integrating vision-LLMs for pixel input environments; leveraging cross-modal attention for joint text/image action planning and rollout synthesis.
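As one concrete, hypothetical instance of the adaptive-weighting idea above, imaginary transitions could be down-weighted in the training loss by a per-rollout quality score, for example derived from the goal-consistency and legality checks:

```python
import numpy as np

def weighted_bc_loss(policy_log_probs, is_imaginary, quality_scores, min_weight=0.1):
    """Behavior-cloning-style loss with adaptive weights on imaginary samples.

    Real samples keep weight 1.0; imaginary samples are scaled by a quality
    score in [0, 1] (clipped below at `min_weight`), so low-quality imagined
    data contributes less to the gradient. All inputs are equal-length NumPy
    arrays; this is an illustrative sketch, not the benchmark's code.
    """
    weights = np.where(is_imaginary, np.clip(quality_scores, min_weight, 1.0), 1.0)
    return float(-np.mean(weights * policy_log_probs))
```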
Resource availability is ensured through the public repository at https://github.com/LAMDA-RL/ImagineBench, including code, datasets, and evaluation protocols (2505.10010).