Analyzing LLMs as In-Context Reinforcement Learners
The paper under review explores the intriguing capacity of LLMs to engage in in-context reinforcement learning (ICRL). This concept extends beyond in-context supervised learning, which involves embedding supervised input-output pairs within an LLM’s context. In ICRL, however, models attempt to learn from interactions that consist of inputs, predictions, and associated rewards, thus aligning the framework with reinforcement learning (RL).
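To make this setup concrete, here is a minimal sketch (in Python, with hypothetical names such as `Episode` and `build_icrl_prompt`; the paper's exact prompt format may differ) of how an ICRL context stores the model's own predictions and the rewards they earned, rather than gold labels:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    text: str        # task input, e.g. a customer query in an intent-classification task
    prediction: str  # the label the LLM itself produced
    reward: int      # 1 if the prediction was judged correct, 0 otherwise

def build_icrl_prompt(episodes: list[Episode], new_input: str) -> str:
    """Serialize past (input, prediction, reward) interactions followed by the new query."""
    blocks = [
        f"Input: {ep.text}\nPrediction: {ep.prediction}\nReward: {ep.reward}"
        for ep in episodes
    ]
    blocks.append(f"Input: {new_input}\nPrediction:")
    return "\n\n".join(blocks)
```

The key difference from supervised ICL is that the demonstrations are the model's own (possibly wrong) predictions annotated with scalar rewards, not curated input-label pairs.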
Core Contributions
The authors identify a fundamental challenge in applying LLMs to ICRL: a deficiency in exploration that leads to degeneration of model predictions. They propose stochastic prompt construction and selective use of positive rewards to remedy this. Their methodology shows substantial improvements, suggesting LLMs can indeed learn in-context from rewards.
Methodological Insights
- Naive ICRL: This straightforward approach quickly degenerates. The model repeatedly predicts the same output due to the lack of exploratory behavior. This highlights an intrinsic inability of LLMs to navigate the action space without guided exploration.
- Explorative ICRL: By introducing stochasticity into prompt composition and filtering the context to include only positively rewarded episodes, Explorative ICRL proves highly effective, dramatically increasing performance across several classification tasks and suggesting that exploration can be induced through prompt variability alone (see the first sketch after this list).
- Approximate ICRL: Aimed at reducing computational overhead, Approximate ICRL maintains multiple potential contexts and updates them stochastically, trading some learning effectiveness for computational efficiency (see the second sketch after this list).
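The first sketch below illustrates the Explorative ICRL idea as described above: keep only positively rewarded episodes and resample the context stochastically for every new query. It reuses the `Episode` and `build_icrl_prompt` helpers from the earlier sketch; `p_keep`, `reward_fn`, and `llm` are illustrative placeholders, not the paper's exact hyperparameters or interface.

```python
import random

def build_explorative_context(episodes, p_keep=0.5, rng=random):
    """Filter to positive-reward episodes, then independently keep each with
    probability p_keep so that every query sees a different random context."""
    positives = [ep for ep in episodes if ep.reward > 0]
    return [ep for ep in positives if rng.random() < p_keep]

def explorative_icrl_step(llm, episodes, new_input, reward_fn, p_keep=0.5):
    """One bandit-style interaction: predict under a freshly resampled context,
    observe the reward, and append the new episode to the buffer."""
    context = build_explorative_context(episodes, p_keep)
    prediction = llm(build_icrl_prompt(context, new_input))
    reward = reward_fn(new_input, prediction)
    episodes.append(Episode(new_input, prediction, reward))
    return prediction, reward
```

The stochastic resampling is what injects exploration: because each prediction is conditioned on a different subset of past successes, the model is less prone to collapsing onto a single repeated output.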
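The second sketch gives one plausible reading of Approximate ICRL as summarized above: rather than resampling from the full episode buffer at every step, a small fixed pool of contexts is maintained, and each positively rewarded episode is added to each context with some probability. The pool size and `p_add` are assumptions for illustration, not values from the paper.

```python
import random

def init_contexts(k: int):
    """Maintain a fixed pool of k contexts instead of rebuilding one from scratch each step."""
    return [[] for _ in range(k)]

def approximate_icrl_step(llm, contexts, new_input, reward_fn, p_add=0.5, rng=random):
    """Predict with one randomly chosen context; if the episode earns a positive
    reward, stochastically add it to each maintained context."""
    context = rng.choice(contexts)
    prediction = llm(build_icrl_prompt(context, new_input))
    reward = reward_fn(new_input, prediction)
    if reward > 0:
        for ctx in contexts:
            if rng.random() < p_add:
                ctx.append(Episode(new_input, prediction, reward))
    return prediction, reward
```

This trades the exact stochastic resampling of Explorative ICRL for bounded, incrementally updated contexts, which is where the computational savings come from.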
Empirical Evaluation
The authors evaluate on benchmarks such as Banking-77, CLINC-150, and TREC, framing classification as a contextual bandit problem and focusing on the challenges posed by large output spaces. Results indicate that Explorative ICRL achieves notable performance improvements, narrowing the gap to supervised ICL, particularly in tasks with extensive label spaces.
Key Results
- Llama's performance in the Banking-77 task improves from 17.2% to 66.0% accuracy through ICRL.
- The approximate approach shows promise, especially with the Llama model, though its effectiveness appears to depend on model strength and task complexity.
- Exploration deficiencies and the inability to learn from negative examples are the major barriers identified; the proposed methods address the former directly and sidestep the latter by discarding negatively rewarded episodes.
Implications and Reflections
The paper suggests that LLMs possess untapped potential for learning through simpler RL signals, opening avenues for their application in dynamic environments without explicit supervision. However, significant challenges remain in tuning exploration parameters and managing computational demands.
Future Directions
This work sets a foundation for future studies to explore the scalability of ICRL to more complex settings such as summarization or question answering, where reward structures are more nuanced. Addressing negative signal processing and ensuring efficient computation over long contexts remain pivotal challenges.
In summary, the paper highlights a novel direction for LLM capabilities, suggesting a bridge between explicit supervised learning paradigms and more autonomous learning systems adapting through inherent RL capabilities. This exploration enriches the ongoing discourse on LLMs’ capacity to generalize learning skills as emergent properties rather than solely engineered features.