Reinforcement Learning for Long-Horizon Interactive LLM Agents (2502.01600v3)

Published 3 Feb 2025 in cs.LG and cs.AI

Abstract: Interactive digital agents (IDAs) leverage APIs of stateful digital environments to perform tasks in response to user requests. While IDAs powered by instruction-tuned LLMs can react to feedback from interface invocations in multi-step exchanges, they have not been trained in their respective digital environments. Prior methods accomplish less than half of tasks in sophisticated benchmarks such as AppWorld. We present a reinforcement learning (RL) approach that trains IDAs directly in their target environments. We formalize this training as a partially observable Markov decision process and derive LOOP, a data- and memory-efficient variant of proximal policy optimization. LOOP uses no value network and maintains exactly one copy of the underlying LLM in memory, making its implementation straightforward and as memory-efficient as fine-tuning a single LLM. A 32-billion-parameter agent trained with LOOP in the AppWorld environment outperforms the much larger OpenAI o1 agent by 9 percentage points (15% relative). To our knowledge, this is the first reported application of RL to IDAs that interact with a stateful, multi-domain, multi-app environment via direct API calls. Our analysis sheds light on the effectiveness of RL in this area, showing that the agent learns to consult the API documentation, avoid unwarranted assumptions, minimize confabulation, and recover from setbacks.

Summary

  • The paper introduces a novel RL algorithm, LOOP, that improves an LLM-based interactive agent's task success on AppWorld by 9 percentage points (15% relative) over the larger OpenAI o1 agent.
  • It presents a tailored approach that bypasses traditional value networks, enabling agents to manage long-horizon tasks across stateful environments.
  • Empirical evaluations show improved API documentation usage and reduced execution setbacks, marking a significant step in autonomous agent training.

Reinforcement Learning for Long-Horizon Interactive LLM Agents

The paper under review presents a comprehensive study of the application of reinforcement learning (RL) to train Interactive Digital Agents (IDAs) that use LLMs to operate autonomously in multi-app, stateful digital environments, notably AppWorld. This research is pertinent for scenarios where agents must perform complex, long-horizon tasks by interacting with APIs in a stateful digital workspace. The paper's central contribution is the proposal and validation of an RL training strategy for LLM-based agents, realized in a new algorithm termed LOOP (Leave-One-Out Proximal Policy Optimization).

Research Overview

The paper begins by contextualizing the problem IDAs face: executing complex, sequence-dependent tasks across various software applications under LLM-based control. Existing approaches fail to achieve satisfactory success rates: on the AppWorld benchmark, even top-performing models such as OpenAI's o1 agent achieve success rates only slightly above 50%, while most prior methods complete fewer than half of the tasks. These benchmarks demand sophisticated reasoning, planning, and adaptability to varying task specifications, which prior approaches have not adequately addressed.
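To make this interaction model concrete, below is a minimal, hypothetical sketch of the multi-step loop such an agent runs: the LLM emits an API-calling action as text, the stateful environment executes it and returns an observation, and the exchange repeats until the task terminates. The environment interface (reset, step) and the llm_generate callable are illustrative placeholders, not AppWorld's actual API.

# Hypothetical agent-environment loop for an interactive digital agent (IDA).
# `env` and `llm_generate` are placeholders for illustration only.
def run_episode(env, llm_generate, max_turns=40):
    """Roll out one task: the agent alternates between emitting an API call
    (as text) and reading the environment's response until the task ends."""
    observation = env.reset()                   # task instruction + initial state
    transcript = [observation]
    for _ in range(max_turns):
        action = llm_generate(transcript)       # e.g. code that invokes an app's API
        observation, reward, done = env.step(action)
        transcript += [action, observation]
        if done:                                # task solved, failed, or aborted
            return transcript, reward
    return transcript, 0.0                      # ran out of turns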

Methods

The authors introduce LOOP, a variant of Proximal Policy Optimization (PPO) tailored to the challenges IDAs face in complex environments. LOOP eschews the traditional value network, instead keeping exactly one copy of the underlying LLM in memory. The algorithm employs a policy-gradient method with a leave-one-out baseline estimate, improving sample efficiency by leveraging off-policy samples without incurring the memory overhead typically associated with maintaining multiple policy versions during training. These choices allow LOOP to support agents that must manage the extensive contexts and long dependencies inherent in AppWorld tasks.
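As a rough illustration of the two ingredients described above, the sketch below combines a leave-one-out baseline (computed from K rollouts of the same task, replacing a learned value function) with a PPO-style clipped surrogate loss. Function names, tensor shapes, and the clipping constant are assumptions made for this sketch rather than the paper's reference implementation; it presumes K >= 2 rollouts per task.

import torch

def leave_one_out_advantages(rewards):
    # Advantage of rollout i = its reward minus the mean reward of the
    # other K-1 rollouts sampled for the same task (requires K >= 2).
    rewards = torch.as_tensor(rewards, dtype=torch.float32)   # shape (K,)
    k = rewards.numel()
    baselines = (rewards.sum() - rewards) / (k - 1)
    return rewards - baselines

def ppo_clipped_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    # PPO-style clipped surrogate on the agent's action tokens, using the
    # leave-one-out advantage in place of a value-network estimate.
    ratio = torch.exp(logp_new - logp_old)                     # importance ratio
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.minimum(unclipped, clipped).mean()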

Results

The empirical results are striking. A 32-billion-parameter IDA trained with LOOP outperformed significantly larger models such as the OpenAI o1 agent, registering a 9 percentage point gain (15% relative) in task goal completion. As evaluated on the two AppWorld test splits (test-normal and test-challenge), LOOP-trained agents more reliably consult API documentation, avoid unwarranted assumptions, reduce confabulation, and recover from execution setbacks.

Implications

Practically, this RL-driven approach indicates a viable pathway to improving interactive-agent performance in real-world digital environments where manually programming every possible task sequence is infeasible. Theoretically, its formulation of training as a partially observable Markov decision process over API interactions may inform more general learning frameworks that operate under partial observability in stateful contexts.

The paper highlights key emergent behaviors in trained agents, such as increased interaction with documentation and fewer incorrect assumptions, supporting the notion that RL not only improves task success metrics but also instills more human-like problem-solving strategies.

Future Directions

The paper opens several avenues for further exploration. Beyond AppWorld, other domains could benefit from LOOP-style optimization, particularly if the method is adapted to handle non-deterministic environments or tasks with stochastic outcomes. Additionally, further tuning of reinforcement learning hyperparameters for dynamically evolving environments could yield even more robust agents. Progress in these areas could lead toward more general intelligence in machine interactions with digital environments.

In summary, the paper demonstrates a substantial advancement in the training of LLM-based interactive agents using reinforcement learning. It provides a robust framework and evidence of the efficacy of RL in enhancing agent performance in complex API-driven environments, thereby setting a standard for future developments in autonomous digital agents.