Large Language Models Are Semi-Parametric Reinforcement Learning Agents

Published 9 Jun 2023 in cs.CL and cs.AI | (2306.07929v2)

Abstract: Inspired by the insights in cognitive science with respect to human memory and reasoning mechanism, a novel evolvable LLM-based (LLM) agent framework is proposed as REMEMBERER. By equipping the LLM with a long-term experience memory, REMEMBERER is capable of exploiting the experiences from the past episodes even for different task goals, which excels an LLM-based agent with fixed exemplars or equipped with a transient working memory. We further introduce Reinforcement Learning with Experience Memory (RLEM) to update the memory. Thus, the whole system can learn from the experiences of both success and failure, and evolve its capability without fine-tuning the parameters of the LLM. In this way, the proposed REMEMBERER constitutes a semi-parametric RL agent. Extensive experiments are conducted on two RL task sets to evaluate the proposed framework. The average results with different initialization and training sets exceed the prior SOTA by 4% and 2% for the success rate on two task sets and demonstrate the superiority and robustness of REMEMBERER.

Abstract PDF Upgrade to Chat

Authors (6)

Citations (13)

View on Semantic Scholar

Summary

The paper proposes a semi-parametric framework using a persistent memory to improve LLM-driven reinforcement learning agents.
It employs the RLEM pipeline to incorporate past experiences into in-context learning, yielding up to a 4% performance boost across benchmarks.
Ablation studies confirm that n-step bootstrapping and the use of discouraged actions are key to preventing repetitive failures and ensuring consistency.

LLMs Are Semi-Parametric Reinforcement Learning Agents

Introduction

The paper "LLMs Are Semi-Parametric Reinforcement Learning Agents" (2306.07929) introduces Rememberer, an LLM-based agent framework designed to enhance decision-making capabilities by integrating a long-term experience memory. Unlike LLM agents with transient working memory, Rememberer leverages Reinforcement Learning with Experience Memory (RLEM) to update and utilize past interaction experiences without fine-tuning model parameters, positioning itself as a semi-parametric RL agent. The experimental findings indicate that Rememberer outperforms state-of-the-art benchmarks by 4% and 2% in success rates across two task sets.

Methodology

RLEM Pipeline

RLEM enables LLM agents to learn from interaction experiences via a persistent external memory, contrasting traditional RL approaches that primarily modify model parameters.

Figure 1: Pipeline of RLEM and architecture of Rememberer.

Experience Memory

The experience memory maintains a record of task information, observations, actions, and associated $Q$ value estimations, treated as non-parametric augmentations to the LLM. Experiences are sourced from interactions with the environment, recorded without exhaustive explorations due to limited training trajectories.

Figure 2: Comparison of the LLM-based agents with short-term working memory and long-term experience memory.

Usage of Experience

Stored experiences serve as dynamic exemplars for few-shot in-context learning. The memory is queried based on task and observation similarity functions to construct prompts that guide LLM decision-making through encouraged and discouraged action forecasts.

Figure 3: Exemplar for WebShop. YAML markups are adopted to avoid confusing the keywords like ``Observation:'' with the colon-ended titles in the page representation.

Experimental Results

Task Setup

Evaluations on WebShop and WikiHow task sets employed GPT-3.5 (text-davinci-003) API for implementation. Experiments demonstrated a significant performance increase, affirming the efficacy of equipping LLMs with a persistent memory to exploit interaction experiences.

Figure 4: Capability evolving on WebShop.

Performance Metrics

In WebShop, Rememberer attained an average score of 0.68 with a 0.39 success rate, surpassing both previous SOTA methodologies and RL/IL benchmarks. On WikiHow, Rememberer achieved a 2.63 average reward and 0.93 success rate.

Figure 5: Number of experiences in each training epoch.

Ablation Studies

Multiple ablation studies corroborated the pivotal role of $n$ -step bootstrapping in refining $Q$ value estimations, denoting improved agent decision-making consistency. The inclusion of discouraged actions in feedback further proved crucial in preventing repetitive failures.

Figure 6: Case of the ablation study on the discouraged actions. As there are no valuable actions to encourage in the experience, a random action is generated. When the discouraged actions with low value are omitted, the LLM may repeat the failure with the same pattern.

Implications and Future Work

The Rememberer framework prescribes a scalable alternative to fine-tuning-intensive RL implementations for LLM-based agents, enhancing efficiency through persistent memory utilization. Future research should adapt Rememberer to environments with extensive observation and elaborate the integration of contemporary RL techniques under RLEM.

Conclusion

The paper advances the paradigm of semi-parametric LLM agents, demonstrating compelling performance enhancements through structured memory integration, positioning Rememberer as a robust framework for decision-making tasks reliant on episodic memory akin to human cognitive processes.

Markdown Report Issue