Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL (2505.18098v1)

Published 23 May 2025 in cs.CL and cs.AI

Abstract: LLMs excel in tasks like question answering and dialogue, but complex tasks requiring interaction, such as negotiation and persuasion, require additional long-horizon reasoning and planning. Reinforcement learning (RL) fine-tuning can enable such planning in principle, but suffers from drawbacks that hinder scalability. In particular, multi-turn RL training incurs high memory and computational costs, which are exacerbated when training LLMs as policies. Furthermore, the largest LLMs do not expose the APIs necessary to be trained in such a manner. As a result, modern methods to improve the reasoning of LLMs rely on sophisticated prompting mechanisms rather than RL fine-tuning. To remedy this, we propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents, one that scales even to large API-based models. These value functions predict how a task will unfold given an action, allowing the LLM agent to evaluate multiple possible outcomes, both positive and negative, to plan effectively. In addition, these value functions are trained over reasoning steps rather than full actions, to be a concise and light-weight module that facilitates decision-making in multi-turn interactions. We validate our method on tasks requiring interaction, including tool use, social deduction, and dialogue, demonstrating superior performance over both RL fine-tuning and prompting methods while maintaining efficiency and scalability.

This paper introduces Planning with a Natural Language Critic (PNLC), a novel approach to imbue LLM agents with long-horizon planning and reasoning capabilities for complex, multi-turn interactive tasks. The authors highlight that while LLMs excel in many language tasks, scaling traditional Reinforcement Learning (RL) fine-tuning to large, frontier models like GPT-4 for interactive environments is challenging due to high computational and memory costs and limited API access. Existing methods often rely on sophisticated prompting or costly inference-time search, which can be computationally expensive and may not lead to truly data-driven decision-making.

PNLC addresses these limitations by proposing a method that leverages offline RL to train a lightweight, auxiliary goal-conditioned value function. Rather than fine-tuning the LLM policy directly, the method uses this value function as a natural language critic at inference time to guide the LLM agent's reasoning. The core idea is to learn a critic from offline data that predicts the likelihood of achieving various outcomes given the current state and a proposed action (specifically, a high-level "thought").
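
Informally, the learned critic can be read as a goal-conditioned success probability (notation anticipates the training procedure below; this is an illustrative reading, not the paper's formal definition):

$$Q(s, a^{\mathsf{tht}}, g) \;\approx\; \Pr\big(\text{the interaction eventually reaches outcome } g \mid \text{state } s,\ \text{thought } a^{\mathsf{tht}}\big).$$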

The training process involves:

  1. Offline Data Collection: Using a dataset of task-specific trajectories generated by a prior agent (e.g., a suboptimal LLM agent).
  2. Data Processing: Summarizing full interaction histories into compact descriptions and embedding these natural language descriptions (states and thoughts) using a cheaper LLM (like GPT-3).
  3. Value Function Training: Training a goal-conditioned Q-value function $Q(s, a^{\mathsf{tht}}, g)$ using a modified Implicit Q-learning (IQL) algorithm. This function predicts the likelihood of reaching a future state (goal $g$) after executing a specific thought ($a^{\mathsf{tht}}$) from state $s$. Goals are sampled as random future states from the offline trajectories, and the reward $r(s, g)$ is a binary indicator of whether state $s$ is the goal state $g$. The value function is a lightweight MLP trained over the LLM embeddings, not directly on natural language; a sketch of the resulting objective follows this list.
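
As a sketch of what the modified goal-conditioned IQL objective could look like, using the standard IQL expectile losses with discount $\gamma$ and expectile $\tau$ (an illustrative reconstruction; the paper's exact variant may differ):

$$r(s, g) = \mathbb{1}[s = g], \qquad L_2^{\tau}(u) = \lvert \tau - \mathbb{1}[u < 0] \rvert\, u^2,$$

$$L_Q(\theta) = \mathbb{E}\Big[\big(r(s, g) + \gamma V_{\psi}(s', g) - Q_{\theta}(s, a^{\mathsf{tht}}, g)\big)^2\Big], \qquad L_V(\psi) = \mathbb{E}\Big[L_2^{\tau}\big(Q_{\bar{\theta}}(s, a^{\mathsf{tht}}, g) - V_{\psi}(s, g)\big)\Big].$$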

At inference time, the LLM agent utilizes the learned goal-conditioned value function as a natural language critic. Given the current state and a proposed thought, the LLM generates $n$ hypothetical positive and negative future outcomes (goals). The critic then uses the trained value function to compute the likelihood $Q(s, a^{\mathsf{tht}}, g)$ for each generated goal. These likelihoods are presented to the LLM agent as a natural language value assessment. The LLM agent can then use this rich feedback, which summarizes potential future outcomes without costly online simulation, to refine its proposed thought for the current step. The authors found that a small number of refinement iterations (e.g., $m = 2$) is often sufficient.
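
A minimal Python sketch of this refinement loop, assuming hypothetical callables `propose_thought`, `generate_goals`, and `revise_thought` (LLM prompts), `embed` (the embedding model), and a trained `q_value` MLP; names and interfaces are illustrative, not the paper's implementation:

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]

def refine_with_critic(
    propose_thought: Callable[[str], str],            # LLM: state summary -> initial high-level thought
    generate_goals: Callable[[str, str], List[str]],  # LLM: (state, thought) -> imagined positive/negative outcomes
    revise_thought: Callable[[str, str, str], str],   # LLM: (state, thought, critique) -> refined thought
    q_value: Callable[[Vector, Vector, Vector], float],  # trained lightweight MLP critic over embeddings
    embed: Callable[[str], Vector],                   # cheaper LLM used as an embedding model
    state_text: str,
    m_iters: int = 2,                                 # a few refinement iterations usually suffice
) -> str:
    """Sketch of PNLC-style inference: score imagined outcomes, feed them back, refine the thought."""
    thought = propose_thought(state_text)
    s_emb = embed(state_text)
    for _ in range(m_iters):
        # The LLM imagines hypothetical positive and negative future outcomes (goals)...
        goals = generate_goals(state_text, thought)
        # ...and the goal-conditioned critic scores how likely each is under the current thought.
        t_emb = embed(thought)
        assessment = "\n".join(
            f"- {g}: estimated likelihood {q_value(s_emb, t_emb, embed(g)):.2f}" for g in goals
        )
        critique = "Predicted outcomes of the current plan:\n" + assessment
        # The LLM refines its thought in light of this natural language value assessment.
        thought = revise_thought(state_text, thought, critique)
    return thought
```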

The method is evaluated on three diverse interactive tasks: Web Shopping (Yao et al., 2022), a tool-use task; AvalonBench (Light et al., 2023), a social deduction game; and a Persuasion task (Salvi et al., 2024; Wang et al., 2019), a goal-oriented dialogue. PNLC consistently outperforms state-of-the-art methods, including direct RL fine-tuning such as ArCHer (Zhou et al., 2024) and prompting-based approaches relying on inference-time search such as LATS (Light et al., 2024) or task-specific methods like Agent Q (Putta et al., 2024) and Strategist (Light et al., 2024). A key finding is that PNLC achieves superior performance while requiring significantly lower inference-time computation compared to search-based methods. For example, on WebShop, PNLC achieves a higher score (78.2) and a comparable success rate (48.0%) relative to Agent Q ($n=30$; 77.1 score, 48.0% SR) while being much faster (5s vs. 46s per decision). Similarly, on Avalon, PNLC achieves a 47.0% winrate compared to Strategist ($n=30$)'s 42.0%, with a much lower inference time (6s vs. 62s).

An ablation study confirms that both the goal-conditioned training and the inference-time refinement process using natural language values are crucial for PNLC's performance. Removing either component degrades performance to the level of simple prompting baselines like ReAct (Yao et al., 2022).

The paper's contributions are summarized as:

  1. Introducing goal-conditioned value functions that model outcome probabilities for LLMs to reason about long-term effects.
  2. Applying these value functions to high-level thoughts for efficient decision-making.
  3. Proposing a novel inference process where the LLM refines decisions based on predicted positive and negative outcomes from the critic.

While effective and scalable, the authors acknowledge limitations, including the need to train task-specific value functions, potential reliance on the LLM's ability to reason about hypothetical futures in out-of-domain tasks, and the dependence on the quality and diversity of the offline training data. The potential for dual use in tasks like persuasion is also noted.

Implementation details, including specific prompt templates and hyperparameters for each task, are provided in the appendix.

Authors (3)
  1. Joey Hong (23 papers)
  2. Anca Dragan (62 papers)
  3. Sergey Levine (531 papers)