Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback (2410.23022v2)

Published 30 Oct 2024 in cs.LG, cs.AI, cs.CL, and cs.RO

Abstract: Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration, and hierarchical skill design. Recent works have made promising steps by exploiting the prior knowledge of LLMs. However, these approaches suffer from important limitations: they are either not scalable to problems requiring billions of environment samples, due to requiring LLM annotations for each observation, or they require a diverse offline dataset, which may not exist or be impossible to collect. In this work, we address these limitations through a combination of algorithmic and systems-level contributions. We propose ONI, a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function using LLM feedback. Our approach annotates the agent's collected experience via an asynchronous LLM server, which is then distilled into an intrinsic reward model. We explore a range of algorithmic choices for reward modeling with varying complexity, including hashing, classification, and ranking models. By studying their relative tradeoffs, we shed light on questions regarding intrinsic reward design for sparse reward problems. Our approach achieves state-of-the-art performance across a range of challenging, sparse reward tasks from the NetHack Learning Environment in a simple unified process, solely using the agent's gathered experience, without requiring external datasets. We make our code available at https://github.com/facebookresearch/oni.

Authors (5)
  1. Qinqing Zheng
  2. Mikael Henaff
  3. Amy Zhang
  4. Aditya Grover
  5. Brandon Amos

Summary

Overview of "Online Intrinsic Rewards for Decision Making Agents from LLM Feedback"

The paper introduces ONI, a framework that improves reinforcement learning (RL) by synthesizing intrinsic rewards from LLM feedback. These intrinsic rewards aid policy optimization in challenging environments with sparse or delayed extrinsic rewards. The work targets three recurring pitfalls of prior methods: limited scalability, restricted expressiveness of the reward function, and dependence on pre-existing offline datasets.

Motivation and Challenges

In reinforcement learning, defining reward functions is crucial and often poses significant challenges. Many RL environments offer sparse rewards, complicating the optimization of agent policies. To address this, intrinsic rewards can facilitate exploration and learning by providing intermediate signals. However, designing such rewards requires task-specific expertise. Additionally, relying on dense reward functions or extensive pre-existing datasets is impractical for many tasks.

The research leverages LLMs to automate reward design, offering an approach that requires neither hand-crafted rewards nor extensive offline datasets. ONI synthesizes intrinsic rewards from observations annotated with LLM feedback as the agent collects them, which keeps the method scalable and avoids any need for environment source code or external data.
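
Concretely, the learned intrinsic reward is used to densify the sparse environment reward during policy optimization. The snippet below is a minimal illustrative sketch of this idea rather than the authors' implementation; the observation features, the reward model, and the scaling coefficient lam are placeholder assumptions.

    # Minimal sketch (not the paper's code): adding a learned intrinsic reward
    # to the sparse extrinsic reward before policy optimization.
    import torch

    def shaped_reward(extrinsic_reward: torch.Tensor,
                      observation_features: torch.Tensor,
                      intrinsic_reward_model: torch.nn.Module,
                      lam: float = 0.1) -> torch.Tensor:
        # The intrinsic reward model is trained separately on LLM annotations;
        # here it is only evaluated, so no gradients are needed.
        with torch.no_grad():
            intrinsic = intrinsic_reward_model(observation_features).squeeze(-1)
        return extrinsic_reward + lam * intrinsic

    # Usage with a placeholder linear reward model over 64-dimensional features.
    reward_model = torch.nn.Linear(64, 1)
    rewards = shaped_reward(torch.zeros(8), torch.randn(8, 64), reward_model)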

Methodology

ONI operates as a distributed system that concurrently learns both an RL policy and intrinsic reward functions in an online manner. The system consists of several key components:

  1. Architecture: ONI is built on Sample Factory's asynchronous proximal policy optimization (APPO) framework and annotates the agent's experience through an asynchronous LLM server, so environment throughput is maintained even while data is being exchanged with the LLM.
  2. LLM Feedback: Observations collected by the agent are annotated by an LLM, and this feedback is distilled into an intrinsic reward model. Because the reward model generalizes the LLM's judgments, the agent benefits from LLM guidance without requiring an annotation for every observation it encounters.
  3. Reward Modeling Options: ONI explores three reward modeling approaches of increasing complexity: retrieval via hashing of cached LLM annotations, a classification model, and a ranking model. Studying their relative trade-offs provides flexibility in adapting the design to different RL scenarios (a sketch of the ranking variant follows this list).
  4. State-of-the-Art Performance: The method was evaluated on the NetHack Learning Environment, a benchmark known for its complexity and sparse reward structure. ONI achieved state-of-the-art performance on these sparse reward tasks using only the agent's own gathered experience, without hand-crafted dense rewards or external datasets.
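
To make the reward modeling options more concrete, the following sketch shows one way a ranking-style reward model could be trained from pairwise LLM preferences with a Bradley-Terry (logistic) loss. This is an illustrative assumption, not the paper's implementation: the observation encoder, tensor shapes, network sizes, and training loop are all placeholders. A classification variant would instead fit per-observation labels with a standard cross-entropy loss, and the hashing variant can be as simple as a lookup table of cached LLM labels.

    # Illustrative sketch (assumed details, not the paper's code): training a
    # ranking-based intrinsic reward model on pairwise LLM preferences.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RewardModel(nn.Module):
        """Maps an encoded observation (e.g. an embedded caption) to a scalar reward."""
        def __init__(self, embed_dim: int = 256, hidden_dim: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(embed_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )

        def forward(self, embedding: torch.Tensor) -> torch.Tensor:
            return self.net(embedding).squeeze(-1)

    def ranking_loss(model: RewardModel,
                     emb_a: torch.Tensor,          # embeddings of observation A, shape (B, D)
                     emb_b: torch.Tensor,          # embeddings of observation B, shape (B, D)
                     llm_prefers_a: torch.Tensor   # 1.0 where the LLM preferred A, else 0.0
                     ) -> torch.Tensor:
        # Bradley-Terry / logistic loss: the preferred observation should get
        # the higher scalar reward.
        logits = model(emb_a) - model(emb_b)
        return F.binary_cross_entropy_with_logits(logits, llm_prefers_a)

    # One gradient step on a batch of hypothetical annotated pairs.
    model = RewardModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    emb_a, emb_b = torch.randn(32, 256), torch.randn(32, 256)
    preferences = torch.randint(0, 2, (32,)).float()
    loss = ranking_loss(model, emb_a, emb_b, preferences)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Once trained, such a reward model scores new observations directly, which is what allows the agent to receive dense feedback without querying the LLM at every environment step.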

Implications and Future Directions

This research has several practical and theoretical implications:

  • Scalability and Independence: By effectively removing the dependency on external datasets and simplifying the reward design process, ONI demonstrates a scalable approach to RL in complex environments. This could significantly impact the development of RL systems where domain-specific dense rewards aren't available.
  • Theoretical Insights: Comparing different intrinsic reward models and understanding their trade-offs provide insights into designing reward systems adapted to different levels of complexity and observation diversity.
  • Future Prospects in AI: Looking ahead, integrating LLMs into adaptive, self-improving RL systems opens avenues for autonomous agents capable of more sophisticated decision-making and problem-solving.

Overall, ONI marks a significant step forward in RL by demonstrating that robust intrinsic rewards can be synthesized effectively in an online fashion, leveraging the prior knowledge encoded in LLMs without extensive pre-existing data. This paper’s contributions could shape future research and applications in environments requiring adaptive and scalable learning mechanisms.