Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration (2410.18076v3)

Published 23 Oct 2024 in cs.LG, cs.AI, and stat.ML

Abstract: Unsupervised pretraining has been transformative in many supervised domains. However, applying such ideas to reinforcement learning (RL) presents a unique challenge in that fine-tuning does not involve mimicking task-specific data, but rather exploring and locating the solution through iterative self-improvement. In this work, we study how unlabeled offline trajectory data can be leveraged to learn efficient exploration strategies. While prior data can be used to pretrain a set of low-level skills, or as additional off-policy data for online RL, it has been unclear how to combine these ideas effectively for online exploration. Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits. Our method first extracts low-level skills using a variational autoencoder (VAE), and then pseudo-labels unlabeled trajectories with optimistic rewards and high-level action labels, transforming prior data into high-level, task-relevant examples that encourage novelty-seeking behavior. Finally, SUPE uses these transformed examples as additional off-policy data for online RL to learn a high-level policy that composes pretrained low-level skills to explore efficiently. In our experiments, SUPE consistently outperforms prior strategies across a suite of 42 long-horizon, sparse-reward tasks. Code: https://github.com/rail-berkeley/supe.

Summary

  • The paper introduces SUPE, a method that leverages unlabeled data to pretrain low-level skills and enhance online exploration in reinforcement learning tasks.
  • It employs a variational autoencoder to decompose trajectories and uses optimistic pseudo-labeling to guide high-level policy exploration.
  • Empirical results demonstrate SUPE’s superior performance in long-horizon, sparse-reward tasks, advancing efficient and strategic exploration in RL.

Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration

The paper "Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration" introduces an approach called SUPE (Skills from Unlabeled Prior data for Exploration) that focuses on employing unlabeled trajectory data to improve the exploration process within reinforcement learning (RL) environments. The main premise is to resolve the challenges present in translating the success of unsupervised pretraining, seen in supervised learning domains, to RL tasks, which require iterative self-improvement rather than just fine-tuning on pre-defined datasets.

Methodology

SUPE addresses the key challenge in RL pretraining: not just acquiring useful representations but developing effective exploration strategies for solving downstream tasks. This approach comprises two main phases:

  1. Offline Pretraining Phase: The method extracts low-level skills from the prior trajectory data using a variational autoencoder (VAE), decomposing trajectories into short segments and learning a latent space of skills that are broadly useful across tasks (a minimal sketch of this phase follows the list).
  2. Online Exploration Phase: The method pseudo-labels the prior trajectories with optimistic rewards and with high-level action labels inferred by the skill encoder, turning them into task-relevant, high-level examples. These examples serve as additional off-policy data for learning a high-level policy that explores the environment by composing the pretrained low-level skills (see the second sketch below).
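
As a concrete illustration of the offline pretraining phase, the sketch below trains a trajectory-segment VAE: an encoder maps fixed-length (state, action) segments to a latent skill z, and a decoder serves as the low-level skill policy pi(a | s, z). This is a minimal sketch assuming PyTorch; the module names, layer sizes, segment length, and loss weighting are illustrative choices, not the paper's exact architecture.

```python
# Sketch of phase 1: pretrain a segment-level VAE on unlabeled trajectories.
# All sizes below are illustrative placeholders, not the paper's settings.
import torch
import torch.nn as nn

SEG_LEN, OBS_DIM, ACT_DIM, SKILL_DIM = 8, 29, 8, 8

class SkillEncoder(nn.Module):
    """Maps a (state, action) segment to a Gaussian skill posterior q(z | tau)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(SEG_LEN * (OBS_DIM + ACT_DIM), 256), nn.ReLU(),
            nn.Linear(256, 2 * SKILL_DIM),  # mean and log-std of z
        )

    def forward(self, obs_seg, act_seg):
        x = torch.cat([obs_seg, act_seg], dim=-1).flatten(start_dim=1)
        mean, log_std = self.net(x).chunk(2, dim=-1)
        return mean, log_std

class SkillDecoder(nn.Module):
    """Low-level skill policy pi(a | s, z), later executed by the high-level agent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + SKILL_DIM, 256), nn.ReLU(),
            nn.Linear(256, ACT_DIM),
        )

    def forward(self, obs, z):
        return self.net(torch.cat([obs, z], dim=-1))

def vae_loss(encoder, decoder, obs_seg, act_seg, kl_weight=0.1):
    """ELBO-style loss: reconstruct the segment's actions, regularize z toward N(0, I)."""
    mean, log_std = encoder(obs_seg, act_seg)
    std = log_std.exp()
    z = mean + std * torch.randn_like(std)          # reparameterization trick
    z_seg = z.unsqueeze(1).expand(-1, SEG_LEN, -1)  # one skill for the whole segment
    recon_loss = ((decoder(obs_seg, z_seg) - act_seg) ** 2).mean()
    kl = 0.5 * (mean ** 2 + std ** 2 - 2 * log_std - 1).sum(-1).mean()
    return recon_loss + kl_weight * kl
```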

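The pseudo-labeling step of the online phase can be sketched as follows, reusing the frozen skill encoder from above to supply high-level action labels. Here an RND-style novelty bonus (the hypothetical callable rnd_bonus) stands in for the paper's optimistic reward estimate, and the transition format is an assumption for illustration.

```python
# Sketch of phase 2's relabeling: turn unlabeled trajectories into high-level
# off-policy transitions (s, z, r_optimistic, s') for the online agent.
import torch

@torch.no_grad()
def pseudo_label_trajectory(encoder, rnd_bonus, obs, acts, seg_len=8):
    """obs: (T, OBS_DIM) tensor, acts: (T, ACT_DIM) tensor from one prior trajectory."""
    transitions = []
    for t in range(0, len(obs) - seg_len, seg_len):
        obs_seg = obs[t:t + seg_len].unsqueeze(0)
        act_seg = acts[t:t + seg_len].unsqueeze(0)
        z_mean, _ = encoder(obs_seg, act_seg)   # posterior mean as the high-level action label
        r_opt = rnd_bonus(obs[t + seg_len])     # optimistic (novelty-based) pseudo-reward
        transitions.append({
            "obs": obs[t],                      # high-level state at segment start
            "skill": z_mean.squeeze(0),         # pseudo high-level action
            "reward": float(r_opt),             # optimistic reward label
            "next_obs": obs[t + seg_len],       # state after the skill would finish
        })
    return transitions
```

In use, these relabeled transitions would be mixed into the replay buffer of an off-policy high-level agent together with freshly collected online data, so that exploration benefits from the prior trajectories without assuming they carry task rewards.
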
Empirical Results

The efficacy of SUPE is demonstrated through extensive empirical evaluations on a suite of 42 long-horizon, sparse-reward tasks. SUPE consistently outperforms prior strategies, indicating that pretrained skills combined with optimistic reuse of prior data yield more efficient exploration.

  • Performance: SUPE significantly improves the speed and quality of learning in environments with challenging sparse rewards, underscoring the importance of skill pretraining and data utilization during exploration.
  • Robust Evaluations: Across various domains, SUPE achieves superior cumulative returns compared to baselines, with noteworthy performance on complex tasks requiring strategic exploration.

Implications and Future Directions

The results suggest significant practical implications for deploying RL in scenarios where exploration efficiency is paramount. By effectively utilizing unlabeled prior data, SUPE opens possibilities for more sophisticated hierarchical RL that can seamlessly integrate pretraining with efficient online learning.

In terms of theoretical implications, SUPE reinforces the value of hierarchical decomposition for turning long, unstructured trajectory data into reusable structure and for improving data efficiency during exploration. Future research could explore fine-tuning the low-level skills online, alternative optimistic reward models, or more advanced hierarchical frameworks.

SUPE demonstrates that bridging the gap between pretraining and exploration in RL through careful reuse of unlabeled prior data can enhance learning efficiency, providing a pathway for further innovations in reinforcement learning methods.
