Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback (2410.23022v2)
Abstract: Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration, and hierarchical skill design. Recent works have made progress by exploiting the prior knowledge of LLMs. However, these approaches suffer from important limitations: they either do not scale to problems requiring billions of environment samples, because they need LLM annotations for each observation, or they require a diverse offline dataset, which may not exist or may be impossible to collect. In this work, we address these limitations through a combination of algorithmic and systems-level contributions. We propose ONI, a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function using LLM feedback. Our approach annotates the agent's collected experience via an asynchronous LLM server, and this feedback is then distilled into an intrinsic reward model. We explore a range of algorithmic choices for reward modeling with varying complexity, including hashing, classification, and ranking models. By studying their relative tradeoffs, we shed light on questions regarding intrinsic reward design for sparse reward problems. Our approach achieves state-of-the-art performance across a range of challenging, sparse reward tasks from the NetHack Learning Environment in a simple unified process, using only the agent's gathered experience and without requiring external datasets. We make our code available at https://github.com/facebookresearch/oni.
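To make the abstract's core loop concrete, the snippet below is a minimal, illustrative sketch of the classification variant of reward modeling described above. It assumes PyTorch, integer-encoded environment messages (e.g., NetHack's text line), and binary "interesting / not interesting" labels returned asynchronously by an LLM server; all names here (`MessageRewardClassifier`, `distill_llm_labels`, `intrinsic_reward`) are hypothetical and are not taken from the ONI codebase.

```python
# Sketch of distilling LLM feedback into an intrinsic reward model (hypothetical names,
# not the authors' implementation). An asynchronous LLM server labels a subset of
# observed messages; a small classifier is distilled from those labels and queried
# online to provide a dense intrinsic reward for every transition.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MessageRewardClassifier(nn.Module):
    """Tiny character-level classifier mapping an environment message to a reward logit."""

    def __init__(self, vocab_size: int = 128, embed_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, msg_chars: torch.Tensor) -> torch.Tensor:
        # msg_chars: (batch, max_len) integer-encoded characters, padded with 0.
        x = self.embed(msg_chars)
        _, h = self.encoder(x)                      # h: (1, batch, hidden_dim)
        return self.head(h.squeeze(0)).squeeze(-1)  # one logit per message


def distill_llm_labels(model, optimizer, msgs, llm_labels):
    """One distillation step on messages the asynchronous LLM server has annotated."""
    logits = model(msgs)
    loss = F.binary_cross_entropy_with_logits(logits, llm_labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def intrinsic_reward(model, msgs, scale: float = 0.1) -> torch.Tensor:
    """Dense intrinsic bonus added to the sparse environment reward during RL updates."""
    return scale * torch.sigmoid(model(msgs))
```

Under the same assumptions, the hashing variant mentioned in the abstract would instead cache the LLM label for each unique message and reuse it directly as the bonus, while the ranking variant would replace the binary cross-entropy loss with a pairwise preference loss (e.g., Bradley-Terry style) over message pairs.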
Authors: Qinqing Zheng, Mikael Henaff, Amy Zhang, Aditya Grover, Brandon Amos