Introduction
The integration of reinforcement learning (RL) with foundation models such as large language models (LLMs) has led to fascinating developments. The prevailing strategies for applying these models to RL rely either on curated expert demonstrations or on gradient-based training such as fine-tuning and adapter layers, both of which have inherent drawbacks. This paper presents an alternative that uses the in-context learning capabilities of LLMs to perform policy iteration, eliminating the need for expert demonstrations and for gradient-based optimization.
Related Work
Existing applications of foundation models to RL fall into two categories: those that learn from expert demonstrations and those that rely on gradient-based training of model parameters. Demonstration-based methods rarely surpass the experts from whom the demonstrations were collected. Gradient-based methods, while powerful, sacrifice the appealing property of foundation models that they can learn new tasks without task-specific training. The proposed method sidesteps both constraints by relying solely on in-context learning, and its effectiveness is demonstrated across different LLMs on a variety of simple RL tasks.
Methodology
The proposed In-Context Policy Iteration (ICPI) method iteratively updates the contents of the prompt with experience gathered from the RL environment, inducing both a world model and a rollout policy purely through in-context learning. The policy is improved by acting greedily with respect to Q-value estimates computed from LM-generated rollouts, and the new experience this produces in turn improves subsequent rollouts. This self-improvement loop lets ICPI refine its policy iteratively without gradients or expert demonstrations. The paper also details how prompts are constructed and how Q-values are computed from the experience accumulated within the model's context window, as sketched below.
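To make this loop concrete, the following is a minimal Python sketch of the idea. The `llm_complete` helper, the `env_step` callback, the prompt formats, the discount factor, the rollout horizon, and the action space are all illustrative assumptions introduced here for exposition; they are not the paper's exact prompts or hyperparameters.

```python
# Minimal sketch of in-context policy iteration (ICPI), assuming a generic
# text-completion function and simple text encodings of states and actions.
GAMMA = 0.9          # discount factor (illustrative)
HORIZON = 8          # rollout length (illustrative)
ACTIONS = [0, 1, 2]  # toy discrete action space (illustrative)

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to a pre-trained LLM (e.g., Codex or GPT-J)."""
    raise NotImplementedError

def format_transitions(transitions):
    """Serialize (state, action, reward, next_state) tuples as prompt lines."""
    return "\n".join(f"state: {s}, action: {a}, reward: {r}, next: {s2}"
                     for s, a, r, s2 in transitions)

def world_model_step(buffer, state, action):
    """Ask the LM to predict (reward, next_state) from in-context transitions."""
    prompt = (format_transitions(buffer)
              + f"\nstate: {state}, action: {action}, reward:")
    completion = llm_complete(prompt)          # assumed to follow the same format
    reward_text, next_text = completion.split(", next:")
    return float(reward_text), next_text.strip()

def rollout_policy(buffer, state):
    """Ask the LM to imitate the in-context behaviour and propose an action."""
    prompt = format_transitions(buffer) + f"\nstate: {state}, action:"
    return int(llm_complete(prompt).strip())

def estimate_q(buffer, state, action):
    """Monte-Carlo Q estimate from a single LM-generated rollout."""
    total, discount = 0.0, 1.0
    s, a = state, action
    for _ in range(HORIZON):
        r, s = world_model_step(buffer, s, a)  # LM acts as the world model
        total += discount * r
        discount *= GAMMA
        a = rollout_policy(buffer, s)          # LM acts as the rollout policy
    return total

def icpi_step(env_step, buffer, state):
    """One ICPI step: act greedily w.r.t. Q estimates, then store new experience."""
    q_values = {a: estimate_q(buffer, state, a) for a in ACTIONS}
    action = max(q_values, key=q_values.get)
    reward, next_state = env_step(state, action)        # real environment transition
    buffer.append((state, action, reward, next_state))  # grows the in-context prompt
    return next_state
```

Because the buffer both feeds the world model and defines the rollout policy, each greedy action taken in the real environment improves the very context that future Q estimates are computed from, which is what gives the method its policy-iteration character.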
Experiments and Results
The approach was empirically validated on six illustrative RL tasks, demonstrating that ICPI can learn policies quickly. Several pre-trained LLMs, including GPT-J, OPT-30B, and variants of Codex, were tested to investigate the impact of model size and domain knowledge. The experiments revealed that the larger models, in particular the code-davinci-002 variant of Codex, consistently demonstrated learning. Notably, a model's ability to generate rollouts that respected the logic of the task proved crucial to learning success.
Conclusion
The paper takes a significant stride in RL by leveraging the in-context learning capabilities of LLMs to perform policy iteration without expert demonstrations or updates to model parameters. It offers an approach that is agnostic to model architecture and requires no expert input, highlighting the potential of large LLMs to generalize and adapt to diverse RL tasks. The empirical results are preliminary, but the approach points to a promising avenue that leverages the ever-increasing capabilities of foundation models, opening the door to more complex and varied applications as LLMs evolve.