Reward Design with LLMs: An Overview
The paper "Reward Design with LLMs" presents a novel approach to addressing the complexities of reward design in reinforcement learning (RL). The authors explore the use of LLMs, such as GPT-3, to generate reward signals that reflect human-like behavior based on textual prompts. This work aims to simplify the reward design process, which traditionally involves the challenging task of specifying reward functions or requiring extensive expert demonstrations.
Key Concepts and Approach
At its core, the paper proposes using an LLM as a proxy reward function within a standard RL loop. Instead of crafting a complex reward function or gathering large amounts of labeled data, users provide either a handful of examples (few-shot) or a short description (zero-shot) of the desired behavior. These inputs are combined into a prompt, and the LLM scores the RL agent's behavior during training to produce a reward signal.
The framework operates as follows: users state their objective in natural language at the outset; during training, the LLM evaluates the agent's behavior against that objective and returns a reward signal that guides learning. The method leverages the LLM's pre-trained knowledge of human norms and behaviors, offering a more intuitive interface for non-expert users.
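To make the pipeline concrete, here is a minimal sketch of the proxy-reward idea, not the authors' code. It assumes a generic `complete(prompt) -> str` call standing in for whatever LLM API is used (the paper used GPT-3); the helper names `build_prompt` and `llm_reward` are illustrative. The prompt concatenates the user's objective, any few-shot examples, and a textual summary of the episode, and the model's Yes/No answer is parsed into a binary reward.

```python
# Sketch only: assumes `complete` is some text-completion call
# (e.g., a GPT-3-style API) mapping a prompt string to the model's reply.
from typing import Callable, Sequence, Tuple

def build_prompt(objective: str,
                 examples: Sequence[Tuple[str, str]],
                 episode_summary: str) -> str:
    """Compose the prompt: the user's objective, optional few-shot
    examples, then the current episode's outcome and a yes/no question."""
    lines = [f"Objective: {objective}", ""]
    for outcome, label in examples:  # empty in the zero-shot case
        lines += [f"Outcome: {outcome}",
                  f"Does this satisfy the objective? {label}",
                  ""]
    lines += [f"Outcome: {episode_summary}",
              "Does this satisfy the objective? Answer Yes or No."]
    return "\n".join(lines)

def llm_reward(complete: Callable[[str], str],
               objective: str,
               examples: Sequence[Tuple[str, str]],
               episode_summary: str) -> float:
    """Query the LLM and map its Yes/No answer to a binary reward."""
    answer = complete(build_prompt(objective, examples, episode_summary))
    return 1.0 if answer.strip().lower().startswith("yes") else 0.0
```

The same two functions cover both settings described above: passing an empty `examples` list yields a zero-shot prompt, while a few labeled outcomes yield a few-shot prompt.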
Experimental Evaluation
The authors assess their method across three different tasks:
- Ultimatum Game: This single-timestep game tests learning a behavior from examples when the exact objective is hard to state precisely. Results indicate that the LLM achieves high labeling accuracy from only a few examples, comparing favorably with supervised-learning baselines.
- Matrix Games: For well-known decision-theoretic concepts such as Pareto optimality, the paper tests the zero-shot setting, with no examples provided. The results show that the LLM can apply these concepts and produce consistent reward signals, outperforming a control in which no objective is specified.
- DealOrNoDeal Negotiation Task: In this longer-horizon environment, the LLM is used to train agents toward user-specified negotiation styles such as "versatile" or "stubborn." In pilot studies, agents trained with LLM-derived rewards are judged more aligned with the users' objectives than the alternative baselines (a training-loop sketch for this setting follows the list).
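As a rough illustration of how such a reward could drive training on a longer-horizon task like negotiation, the sketch below reuses the hypothetical `llm_reward` helper from the earlier snippet and assumes a made-up Gym-like `env`/`agent` interface; none of these names come from the paper. The LLM is queried once per episode on a textual transcript of the interaction, and the resulting binary reward feeds an episodic policy update.

```python
def train(agent, env, complete, objective, examples, episodes: int = 500):
    """Episodic training loop with an LLM-scored terminal reward (sketch)."""
    for _ in range(episodes):
        obs, done, transcript = env.reset(), False, []
        while not done:
            action = agent.act(obs)               # sample from the current policy
            obs, done, info = env.step(action)    # hypothetical step signature
            transcript.append(info["utterance"])  # textual record of the exchange
        # One LLM call per episode: judge the full outcome against the
        # user's natural-language objective (e.g., "negotiate stubbornly").
        reward = llm_reward(complete, objective, examples, " ".join(transcript))
        agent.update(reward)                      # any episodic RL update
```

Scoring once per episode keeps the number of LLM queries small and is consistent with the binary, episode-level reward signals described above.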
Numerical Results and Implications
The experiments indicate that in the Ultimatum Game and the DealOrNoDeal task, RL agents trained with LLM-generated rewards align more closely with user objectives than baseline agents. In the matrix games, the framework generalizes zero-shot to objectives such as total welfare and equality.
These findings suggest that LLMs used as proxy reward functions can reduce the expertise and effort required for reward design, making RL accessible to a broader range of users. The broader implication is a shift toward more user-friendly AI systems in which human objectives are translated into agent policies through a natural-language interface.
Future Directions
The authors highlight future research avenues, including extending the framework beyond binary signals to more granular reward structures, integrating multi-modal foundation models for richer inputs, and conducting larger user studies. The approach is a foundational step toward AI systems that better reflect nuanced human objectives.
Overall, this research offers a promising alternative path in RL by drawing on the strengths of LLMs to streamline reward design, paving the way for better-aligned and more interpretable AI systems.