Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback (2410.23022v2)

Published 30 Oct 2024 in cs.LG, cs.AI, cs.CL, and cs.RO

Abstract: Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration, and hierarchical skill design. Recent works have made promising steps by exploiting the prior knowledge of LLMs. However, these approaches suffer from important limitations: they are either not scalable to problems requiring billions of environment samples, due to requiring LLM annotations for each observation, or they require a diverse offline dataset, which may not exist or be impossible to collect. In this work, we address these limitations through a combination of algorithmic and systems-level contributions. We propose ONI, a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function using LLM feedback. Our approach annotates the agent's collected experience via an asynchronous LLM server, which is then distilled into an intrinsic reward model. We explore a range of algorithmic choices for reward modeling with varying complexity, including hashing, classification, and ranking models. By studying their relative tradeoffs, we shed light on questions regarding intrinsic reward design for sparse reward problems. Our approach achieves state-of-the-art performance across a range of challenging, sparse reward tasks from the NetHack Learning Environment in a simple unified process, solely using the agent's gathered experience, without requiring external datasets. We make our code available at https://github.com/facebookresearch/oni.

Authors (5)
  1. Qinqing Zheng
  2. Mikael Henaff
  3. Amy Zhang
  4. Aditya Grover
  5. Brandon Amos

Summary

Overview of "Online Intrinsic Rewards for Decision Making Agents from LLM Feedback"

The paper introduces ONI, a framework that improves reinforcement learning (RL) by synthesizing intrinsic rewards from LLM feedback. These intrinsic rewards aid policy optimization in challenging environments with sparse or delayed extrinsic rewards. The work targets three recurring pitfalls of prior methods: limited scalability, restricted expressiveness of the reward function, and dependence on pre-existing offline datasets.

Motivation and Challenges

In reinforcement learning, defining reward functions is crucial and often poses significant challenges. Many RL environments offer sparse rewards, complicating the optimization of agent policies. To address this, intrinsic rewards can facilitate exploration and learning by providing intermediate signals. However, designing such rewards requires task-specific expertise. Additionally, relying on dense reward functions or extensive pre-existing datasets is impractical for many tasks.

The research leverages LLMs to automate reward design, offering an approach that requires neither hand-crafted rewards nor extensive offline datasets. ONI synthesizes intrinsic rewards from observations annotated with LLM feedback as the agent collects them, which keeps the method scalable and avoids any need for environment source code or external data.
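
Concretely, the learned intrinsic reward is used to densify the sparse environment reward during policy optimization. The snippet below is a minimal illustrative sketch of this idea rather than the authors' implementation; the observation features, the reward model, and the scaling coefficient lam are placeholder assumptions.

    # Minimal sketch (not the paper's code): adding a learned intrinsic reward
    # to the sparse extrinsic reward before policy optimization.
    import torch

    def shaped_reward(extrinsic_reward: torch.Tensor,
                      observation_features: torch.Tensor,
                      intrinsic_reward_model: torch.nn.Module,
                      lam: float = 0.1) -> torch.Tensor:
        # The intrinsic reward model is trained separately on LLM annotations;
        # here it is only evaluated, so no gradients are needed.
        with torch.no_grad():
            intrinsic = intrinsic_reward_model(observation_features).squeeze(-1)
        return extrinsic_reward + lam * intrinsic

    # Usage with a placeholder linear reward model over 64-dimensional features.
    reward_model = torch.nn.Linear(64, 1)
    rewards = shaped_reward(torch.zeros(8), torch.randn(8, 64), reward_model)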

Methodology

ONI operates as a distributed system that concurrently learns both an RL policy and intrinsic reward functions in an online manner. The system consists of several key components:

  1. Architecture: ONI is built on Sample Factory's asynchronous proximal policy optimization (APPO) framework and annotates the agent's experience through an asynchronous LLM server, so environment throughput is maintained even while data is being exchanged with the LLM.
  2. LLM Feedback: Observations collected by the agent are annotated by an LLM, and this feedback is distilled into an intrinsic reward model. Because the reward model generalizes the LLM's judgments, the agent benefits from LLM guidance without requiring an annotation for every observation it encounters.
  3. Reward Modeling Options: ONI explores three reward modeling approaches of increasing complexity: retrieval via hashing of cached LLM annotations, a classification model, and a ranking model. Studying their relative trade-offs provides flexibility in adapting the design to different RL scenarios (a sketch of the ranking variant follows this list).
  4. State-of-the-Art Performance: The method was evaluated on the NetHack Learning Environment, a benchmark known for its complexity and sparse reward structure. ONI achieved state-of-the-art performance on these sparse reward tasks using only the agent's own gathered experience, without hand-crafted dense rewards or external datasets.
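
To make the reward modeling options more concrete, the following sketch shows one way a ranking-style reward model could be trained from pairwise LLM preferences with a Bradley-Terry (logistic) loss. This is an illustrative assumption, not the paper's implementation: the observation encoder, tensor shapes, network sizes, and training loop are all placeholders. A classification variant would instead fit per-observation labels with a standard cross-entropy loss, and the hashing variant can be as simple as a lookup table of cached LLM labels.

    # Illustrative sketch (assumed details, not the paper's code): training a
    # ranking-based intrinsic reward model on pairwise LLM preferences.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RewardModel(nn.Module):
        """Maps an encoded observation (e.g. an embedded caption) to a scalar reward."""
        def __init__(self, embed_dim: int = 256, hidden_dim: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(embed_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )

        def forward(self, embedding: torch.Tensor) -> torch.Tensor:
            return self.net(embedding).squeeze(-1)

    def ranking_loss(model: RewardModel,
                     emb_a: torch.Tensor,          # embeddings of observation A, shape (B, D)
                     emb_b: torch.Tensor,          # embeddings of observation B, shape (B, D)
                     llm_prefers_a: torch.Tensor   # 1.0 where the LLM preferred A, else 0.0
                     ) -> torch.Tensor:
        # Bradley-Terry / logistic loss: the preferred observation should get
        # the higher scalar reward.
        logits = model(emb_a) - model(emb_b)
        return F.binary_cross_entropy_with_logits(logits, llm_prefers_a)

    # One gradient step on a batch of hypothetical annotated pairs.
    model = RewardModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    emb_a, emb_b = torch.randn(32, 256), torch.randn(32, 256)
    preferences = torch.randint(0, 2, (32,)).float()
    loss = ranking_loss(model, emb_a, emb_b, preferences)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Once trained, such a reward model scores new observations directly, which is what allows the agent to receive dense feedback without querying the LLM at every environment step.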

Implications and Future Directions

This research has several practical and theoretical implications:

  • Scalability and Independence: By effectively removing the dependency on external datasets and simplifying the reward design process, ONI demonstrates a scalable approach to RL in complex environments. This could significantly impact the development of RL systems where domain-specific dense rewards aren't available.
  • Theoretical Insights: Comparing different intrinsic reward models and understanding their trade-offs provide insights into designing reward systems adapted to different levels of complexity and observation diversity.
  • Future Prospects in AI: Looking ahead, integrating LLMs into adaptive, self-improving RL systems opens avenues for autonomous agents capable of more sophisticated decision-making and problem-solving.

Overall, ONI marks a significant step forward in RL by demonstrating that robust intrinsic rewards can be synthesized effectively in an online fashion, leveraging the prior knowledge encoded in LLMs without extensive pre-existing data. This paper’s contributions could shape future research and applications in environments requiring adaptive and scalable learning mechanisms.