Efficient Reinforcement Learning by Guiding Generalist World Models with Non-Curated Data (2502.19544v2)

Published 26 Feb 2025 in cs.LG and cs.RO

Abstract: Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by leveraging abundant non-curated data that is reward-free, of mixed quality, and collected across multiple embodiments. Although learning a world model appears promising for utilizing such data, we find that naive fine-tuning fails to accelerate RL training on many tasks. Through careful investigation, we attribute this failure to the distributional shift between offline and online data during fine-tuning. To address this issue and effectively use the offline data, we propose two essential techniques: (i) experience rehearsal and (ii) execution guidance. With these modifications, the non-curated offline data substantially improves RL's sample efficiency. Under limited sample budgets, our method achieves a 102.8% relative improvement in aggregate score over learning-from-scratch baselines across 72 visuomotor tasks spanning 6 embodiments. On challenging tasks such as locomotion and robotic manipulation, it outperforms prior methods that utilize offline data by a decent margin.

Summary

  • The paper introduces a novel method to pre-train a multi-embodiment world model using non-curated data, eliminating the need for costly curated datasets.
  • The paper demonstrates that integrating retrieval-based experience rehearsal with execution guidance significantly enhances sample efficiency and stabilizes fine-tuning.
  • The paper's experiments show roughly 35% higher normalized scores than DrQ v2 and Dreamer v3 on continuous control tasks under a limited sample budget, highlighting its potential for fast task adaptation in robotics.

Generalist World Model Pre-Training for Efficient Reinforcement Learning

This paper introduces a novel approach to reinforcement learning (RL) that leverages non-curated offline data to improve sample efficiency and enable fast task adaptation. The core idea is generalist world model pre-training (WPT) on a diverse dataset comprising reward-free, non-expert, and multi-embodiment data. The pre-trained world model is then fine-tuned through online interaction, augmented by retrieval-based experience rehearsal and execution guidance, to achieve superior RL performance across a wide range of tasks. The approach addresses a critical gap in existing methods, which often require curated datasets with expert demonstrations or reward labels that are costly to obtain and limit practical applicability.

Methodology: WPT with Experience Rehearsal and Execution Guidance

The WPT framework consists of two primary stages: multi-embodiment world model pre-training and RL-based fine-tuning.

Multi-Embodiment World Model Pre-training: The authors train a single, multi-task world model on non-curated offline data. This is achieved using a recurrent state space model (RSSM) with modifications to handle diverse data sources. The key modifications include removing task-related losses, padding actions to unify action spaces across different embodiments, and scaling the model size. The model is trained to minimize a loss function that combines pixel reconstruction and latent state consistency, enabling it to learn the dynamics of multiple embodiments.
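
The following minimal sketch illustrates the two data-handling choices described above: zero-padding actions to a shared maximum dimension so one model can consume trajectories from several embodiments, and a reward-free pre-training loss combining pixel reconstruction with latent-state consistency. All names, the PyTorch framing, and the specific action dimension are illustrative assumptions, not the authors' implementation.

```python
# Sketch of multi-embodiment data handling and the reward-free pre-training
# objective. Names are illustrative stand-ins; the paper's actual model is an
# RSSM trained with Dreamer-style losses.
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

MAX_ACTION_DIM = 12  # assumed upper bound over the action spaces of all embodiments


def pad_action(action: torch.Tensor, max_dim: int = MAX_ACTION_DIM) -> torch.Tensor:
    """Zero-pad a (batch, act_dim) action tensor to the shared action dimension."""
    return F.pad(action, (0, max_dim - action.shape[-1]))


def pretrain_loss(obs: torch.Tensor,
                  recon: torch.Tensor,
                  posterior: Normal,
                  prior: Normal,
                  kl_scale: float = 1.0) -> torch.Tensor:
    """Task-agnostic loss: pixel reconstruction + latent-state consistency (KL).

    Note there is no reward or task head; task-related losses are removed
    during pre-training, as described above.
    """
    recon_loss = F.mse_loss(recon, obs)
    consistency = kl_divergence(posterior, prior).mean()
    return recon_loss + kl_scale * consistency
```

With padding, a low-dimensional gripper action and a higher-dimensional quadruped action both become vectors of the same length, so a single action encoder can be shared across embodiments.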

RL-Based Fine-tuning: In the fine-tuning stage, the agent interacts with the environment to collect new data, which is used to learn a reward function and fine-tune the world model. The policy is trained using imagined trajectories generated by rolling out the policy within the world model. To address challenges in hard-exploration tasks, the authors introduce two key techniques:

  • Retrieval-based Experience Rehearsal: This technique augments the initial state distribution by retrieving a subset of trajectories from the non-curated offline data that are close to the downstream task. Closeness is measured by the neural feature distance between online samples and offline trajectories. This helps prevent catastrophic forgetting and improves policy learning by providing informative initial states.
  • Execution Guidance via Prior Actors: This technique trains a prior actor via behavior cloning on the retrieved buffer and uses it to guide online data collection: the agent switches between the prior actor and the RL agent according to a pre-defined schedule. This encourages exploration and keeps the online data distribution close to the offline data distribution, improving reward predictions and stabilizing fine-tuning. A minimal sketch of both mechanisms appears after Figure 1.

    Figure 1: An overview of generalist world model pre-training (WPT), highlighting the training of a task and embodiment-agnostic world model using non-curated offline data.
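
As a concrete, hypothetical illustration of the two fine-tuning aids above, the sketch below retrieves the offline trajectories whose encoded features are nearest to recent online samples and switches between the behavior-cloned prior actor and the RL actor on a simple schedule. The trajectory-level feature averaging, the Euclidean distance, and the 30%-per-1000-steps schedule are assumptions for illustration, not the paper's exact choices.

```python
# Illustrative sketch of retrieval-based experience rehearsal and execution
# guidance. Feature extraction, the distance metric, and the switching
# schedule are assumptions; only the overall mechanism follows the paper.
import numpy as np


def retrieve_rehearsal_set(offline_feats: np.ndarray,
                           online_feats: np.ndarray,
                           k: int) -> np.ndarray:
    """Return indices of the k offline trajectories whose (trajectory-averaged)
    features are closest to the mean feature of recent online samples."""
    query = online_feats.mean(axis=0)                       # (feat_dim,)
    dists = np.linalg.norm(offline_feats - query, axis=-1)  # (num_offline,)
    return np.argsort(dists)[:k]


def select_actor(step: int, prior_actor, rl_actor,
                 prior_fraction: float = 0.3, period: int = 1000):
    """Pre-defined switching schedule: use the behavior-cloned prior actor for
    the first `prior_fraction` of every `period` environment steps and the RL
    actor otherwise, keeping online data close to the offline distribution."""
    return prior_actor if (step % period) < prior_fraction * period else rl_actor
```

The retrieved trajectories serve double duty: they are rehearsed during fine-tuning and they provide the behavior-cloning data for the prior actor that guides exploration.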

Experimental Results: Sample Efficiency and Task Adaptation

The WPT framework was evaluated on 72 pixel-based continuous control tasks covering locomotion and manipulation with different action spaces, hard exploration, high dimensions, and complex dynamics. The results demonstrate that WPT significantly outperforms existing methods in terms of sample efficiency. Specifically, WPT achieves 35.65% and 35% higher normalized scores compared to DrQ v2 and Dreamer v3, respectively, under a limited sample budget of 150k samples (Figure 2).

Figure 2: The aggregate performance on 50 manipulation tasks from Meta-World and 22 locomotion tasks from DMControl with pixel inputs, illustrating WPT's improved sample efficiency.

Moreover, the framework enables fast task adaptation, allowing the agent to continually adapt its skills to new tasks. In a continual learning experiment using an Ant robot, WPT significantly outperformed a standard continual learning baseline. These results highlight the potential of WPT for lifelong learning in robotic agents (Figure 3).

Figure 3: Performance comparison of WPT against a widely used baseline in a task adaptation scenario, showcasing WPT's ability to quickly adapt to new tasks.

Ablation studies were conducted to investigate the role of each component in the WPT framework (Figure 4). The results indicate that world model pre-training alone can be effective when the offline dataset consists of diverse trajectories. However, the combination of world model pre-training, retrieval-based experience rehearsal, and execution guidance is crucial for achieving strong performance across a wide range of tasks.

Figure 4: This ablation study shows the impact of each component of WPT, including world model pre-training, experience rehearsal, and execution guidance, on RL performance.

Implications and Future Directions

The WPT framework has significant implications for the field of RL, particularly in robotics and autonomous systems. By demonstrating the feasibility of leveraging non-curated offline data, this research unlocks a vast pool of readily available data for pre-training RL agents. This can significantly reduce the cost and effort associated with data collection, making RL more accessible for real-world applications. The ability of WPT to enable fast task adaptation also opens up new possibilities for lifelong learning agents that can continuously acquire new skills and adapt to changing environments.

Future research directions could focus on extending WPT to real-world applications, exploring novel world model architectures, and developing more sophisticated techniques for experience rehearsal and execution guidance. Additionally, investigating the use of self-supervised learning methods and advanced architectures could further improve the performance of world model pre-training.

Conclusion

The WPT framework presents a promising approach to efficient RL by effectively leveraging non-curated offline data. The combination of generalist world model pre-training, retrieval-based experience rehearsal, and execution guidance enables agents to achieve superior sample efficiency and fast task adaptation across a wide range of tasks. This research has the potential to significantly advance the field of RL and facilitate the development of more capable and adaptable autonomous systems.
