Large Language Models can Implement Policy Iteration

Published 7 Oct 2022 in cs.LG | (2210.03821v2)

Abstract: This work presents In-Context Policy Iteration, an algorithm for performing Reinforcement Learning (RL), in-context, using foundation models. While the application of foundation models to RL has received considerable attention, most approaches rely on either (1) the curation of expert demonstrations (either through manual design or task-specific pretraining) or (2) adaptation to the task of interest using gradient methods (either fine-tuning or training of adapter layers). Both of these techniques have drawbacks. Collecting demonstrations is labor-intensive, and algorithms that rely on them do not outperform the experts from which the demonstrations were derived. All gradient techniques are inherently slow, sacrificing the "few-shot" quality that made in-context learning attractive to begin with. In this work, we present an algorithm, ICPI, that learns to perform RL tasks without expert demonstrations or gradients. Instead, we present a policy-iteration method in which the prompt content is the entire locus of learning. ICPI iteratively updates the contents of the prompt from which it derives its policy through trial-and-error interaction with an RL environment. In order to eliminate the role of in-weights learning (on which approaches like Decision Transformer rely heavily), we demonstrate our algorithm using Codex, an LLM with no prior knowledge of the domains on which we evaluate it.

Authors (4)

Citations (16)

Summary

The paper introduces an in-context policy iteration method that leverages LLMs to update reinforcement learning policies without expert demonstrations.
It employs iterative prompt updates and Q-value estimation from LLM-generated rollouts to self-improve the policy during various RL tasks.
Experiments reveal that larger models, like Codex variants, effectively generate task-specific rollouts that accelerate policy learning.

Introduction

Combining reinforcement learning (RL) with foundation models such as large language models (LLMs) has attracted considerable attention. Most existing approaches rely on either curated expert demonstrations or gradient-based adaptation (fine-tuning or training adapter layers), and both have drawbacks. This paper presents an alternative that uses the in-context learning capabilities of LLMs to perform policy iteration on RL tasks, with no need for expert demonstrations or gradient-based optimization.

Existing applications of foundation models to RL fall into two categories: those that rely on expert demonstrations and those that adapt the model with gradient-based methods such as fine-tuning or training adapter layers. Collecting demonstrations is labor-intensive, and algorithms that rely on them do not outperform the experts from whom the demonstrations were derived. Gradient-based methods are slow, sacrificing the few-shot quality that made in-context learning attractive in the first place. The proposed method avoids both constraints by learning entirely in-context, and the paper demonstrates it across several LLMs on a set of simple RL tasks.

Methodology

In-Context Policy Iteration (ICPI) iteratively updates the contents of the prompt through trial-and-error interaction with an RL environment. The LLM plays the roles of both a world model and a rollout policy, and both roles are induced solely through in-context learning. The policy acts in the environment by choosing the action that maximizes Q-value estimates obtained from LLM-generated rollouts, and the experience it gathers feeds back into later prompts. This lets ICPI refine its policy iteratively without gradients or expert demonstrations. The paper also details how prompts are constructed and how Q-values are computed from the accumulated experience held in the model's context window.
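The rollout-based Q-value estimate and greedy action choice described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `world_model` and `rollout_policy` are hypothetical callables standing in for LLM completions conditioned on prompts built from past experience, and the toy chain environment, the discount `gamma`, and the rollout horizon `max_steps` are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces. In ICPI these roles are played by an LLM prompted
# with past transitions; here they are plain Python callables.
WorldModel = Callable[[int, int], Tuple[int, float, bool]]  # (s, a) -> (s', r, done)
RolloutPolicy = Callable[[int], int]                        # s -> a


def estimate_q(state: int, action: int, world_model: WorldModel,
               rollout_policy: RolloutPolicy,
               gamma: float = 0.9, max_steps: int = 8) -> float:
    """Discounted return of one simulated rollout that starts with (state, action)."""
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        state, reward, done = world_model(state, action)
        total += discount * reward
        discount *= gamma
        if done:
            break
        action = rollout_policy(state)  # later actions come from the rollout policy
    return total


def greedy_action(state: int, actions: List[int], world_model: WorldModel,
                  rollout_policy: RolloutPolicy) -> Tuple[int, Dict[int, float]]:
    """Pick the action with the highest estimated Q-value."""
    q_values = {a: estimate_q(state, a, world_model, rollout_policy) for a in actions}
    return max(q_values, key=q_values.get), q_values


# Toy stand-ins (assumptions): a 5-state chain with the goal at state 4.
def toy_world_model(state: int, action: int) -> Tuple[int, float, bool]:
    next_state = max(0, min(4, state + (1 if action == 1 else -1)))
    done = next_state == 4
    return next_state, (1.0 if done else 0.0), done


def toy_rollout_policy(state: int) -> int:
    return 1  # always move right


best, q = greedy_action(2, [0, 1], toy_world_model, toy_rollout_policy)
print(best, q)
```

In this toy setting, moving right from state 2 reaches the goal sooner and therefore receives the higher estimate. In the method itself, the quality of these estimates depends on how well the LLM's generated rollouts reflect the real environment.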

Experiments and Results

The approach was evaluated empirically on six illustrative RL tasks, on which ICPI learned policies rapidly. Several pre-trained LLMs, including GPT-J, OPT-30B, and variants of Codex, were tested to investigate the effect of model size and domain knowledge. The experiments showed that larger models, in particular the code-davinci-001 variant of Codex, learned consistently. A model's ability to generate rollouts that reflect the logic of the task was crucial to learning success.

Conclusion

The paper shows that the in-context learning capabilities of LLMs can be used to perform policy iteration without expert demonstrations or updates to model parameters. The approach is agnostic to both model architecture and expert data, and it highlights the potential of LLMs to generalize and adapt to diverse RL tasks. Although the empirical results are preliminary, the idea suggests a promising direction that could extend to more complex and varied applications as foundation models become more capable.
