
In-context Exploration-Exploitation for Reinforcement Learning (2403.06826v1)

Published 11 Mar 2024 in cs.LG, cs.AI, and stat.ML

Abstract: In-context learning is a promising approach for online policy learning in offline reinforcement learning (RL), since it can be carried out at inference time without gradient optimization. However, this approach is hindered by significant computational costs, stemming from the need to gather large training-trajectory sets and to train large Transformer models. We address this challenge by introducing an In-context Exploration-Exploitation (ICEE) algorithm, designed to optimize the efficiency of in-context policy learning. Unlike existing models, ICEE performs an exploration-exploitation trade-off at inference time within a Transformer model, without the need for explicit Bayesian inference. Consequently, ICEE can solve Bayesian optimization problems as efficiently as Gaussian process-based methods do, but in significantly less time. Through experiments in grid world environments, we demonstrate that ICEE can learn to solve new RL tasks using only tens of episodes, a substantial improvement over the hundreds of episodes required by the previous in-context learning method.
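
To make the inference-time mechanism described in the abstract concrete, below is a minimal sketch (not the authors' implementation) of in-context policy learning: the agent's weights are never updated; instead, every finished episode is appended to the sequence model's context, and the next action is sampled conditioned on that growing cross-episode history, so exploration and exploitation must emerge from the pretrained model itself. The names `SequenceModel`, `act`, and `GridWorld` are hypothetical placeholders, and the stand-in policy simply acts randomly.

```python
# Hedged sketch of inference-time in-context policy learning (not ICEE itself).
# No gradient updates occur; adaptation happens only through the context.

import random
from typing import List, Tuple

class SequenceModel:
    """Stand-in for a pretrained Transformer policy; here it acts randomly."""
    def act(self, context: List[Tuple[int, int, float]], state: int) -> int:
        # A real model would attend over `context` (past episodes of the task)
        # and trade off exploration vs. exploitation implicitly.
        return random.randrange(4)  # four grid-world actions

class GridWorld:
    """Toy 5x5 grid; reward 1.0 only on reaching the bottom-right goal cell."""
    def __init__(self, size: int = 5):
        self.size, self.goal = size, size * size - 1
    def reset(self) -> int:
        self.pos = 0
        return self.pos
    def step(self, action: int) -> Tuple[int, float, bool]:
        r, c = divmod(self.pos, self.size)
        dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        r = min(max(r + dr, 0), self.size - 1)
        c = min(max(c + dc, 0), self.size - 1)
        self.pos = r * self.size + c
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done

def in_context_learning(model: SequenceModel, env: GridWorld, episodes: int = 20):
    """Run tens of episodes with a frozen model, growing the context each step."""
    context: List[Tuple[int, int, float]] = []   # (state, action, reward) tokens
    returns = []
    for _ in range(episodes):
        state, total, done, steps = env.reset(), 0.0, False, 0
        while not done and steps < 50:
            action = model.act(context, state)   # conditioned on all past episodes
            next_state, reward, done = env.step(action)
            context.append((state, action, reward))
            state, total, steps = next_state, total + reward, steps + 1
        returns.append(total)
    return returns

print(in_context_learning(SequenceModel(), GridWorld()))
```

Under these assumptions, the per-episode returns would improve across the loop only if the pretrained sequence model has learned to exploit its own accumulated context, which is the behaviour the paper reports for ICEE within tens of episodes.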

Authors (3)
  1. Zhenwen Dai (29 papers)
  2. Federico Tomasi (6 papers)
  3. Sina Ghiassian (18 papers)