
First-Explore, then Exploit: Meta-Learning to Solve Hard Exploration-Exploitation Trade-Offs (2307.02276v2)

Published 5 Jul 2023 in cs.LG and cs.AI

Abstract: Standard reinforcement learning (RL) agents never intelligently explore like a human (i.e. taking into account complex domain priors and adapting quickly based on previous exploration). Across episodes, RL agents struggle to perform even simple exploration strategies, for example systematic search that avoids exploring the same location multiple times. This poor exploration limits performance on challenging domains. Meta-RL is a potential solution, as unlike standard RL, meta-RL can learn to explore, and potentially learn highly complex strategies far beyond those of standard RL, strategies such as experimenting in early episodes to learn new skills, or conducting experiments to learn about the current environment. Traditional meta-RL focuses on the problem of learning to optimally balance exploration and exploitation to maximize the cumulative reward of the episode sequence (e.g., aiming to maximize the total wins in a tournament -- while also improving as a player). We identify a new challenge with state-of-the-art cumulative-reward meta-RL methods. When optimal behavior requires exploration that sacrifices immediate reward to enable higher subsequent reward, existing state-of-the-art cumulative-reward meta-RL methods become stuck on the local optimum of failing to explore. Our method, First-Explore, overcomes this limitation by learning two policies: one to solely explore, and one to solely exploit. When exploring requires forgoing early-episode reward, First-Explore significantly outperforms existing cumulative meta-RL methods. By identifying and solving the previously unrecognized problem of forgoing reward in early episodes, First-Explore represents a significant step towards developing meta-RL algorithms capable of human-like exploration on a broader range of domains.
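The core mechanism described in the abstract, training one policy that only explores and a second that only exploits, then conditioning exploitation on the context gathered by exploration, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; every name here (run_episode, explore_policy, exploit_policy, n_explore, n_exploit, and the gym-style env interface) is a hypothetical placeholder.

```python
# Minimal sketch of the "first explore, then exploit" deployment loop described
# in the abstract. One policy only explores (its episodes may forgo reward
# entirely), the other only exploits, conditioned on the exploration context.
# All names and the gym-style environment API are assumptions for illustration.

def run_episode(env, policy, context):
    """Roll out one episode, conditioning the policy on prior-episode context."""
    obs, done, trajectory, total_reward = env.reset(), False, [], 0.0
    while not done:
        action = policy.act(obs, context)        # history-conditioned (in-context) policy
        obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward))
        total_reward += reward
    return trajectory, total_reward

def first_explore_then_exploit(env, explore_policy, exploit_policy,
                               n_explore=3, n_exploit=7):
    """Spend the first episodes purely gathering information, then purely exploiting."""
    context = []
    for _ in range(n_explore):                   # explore episodes sacrifice immediate reward
        trajectory, _ = run_episode(env, explore_policy, context)
        context.extend(trajectory)               # exploration exists only to enrich the context
    returns = []
    for _ in range(n_exploit):                   # exploit episodes maximize per-episode reward
        trajectory, ep_return = run_episode(env, exploit_policy, context)
        returns.append(ep_return)
    return returns
```

Because the explore policy is never asked to earn reward, it avoids the local optimum the abstract identifies, where cumulative-reward meta-RL methods stop exploring whenever exploration costs early-episode reward.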

Authors (2)
  1. Ben Norman (1 paper)
  2. Jeff Clune (65 papers)
