This paper introduces Planning with a Natural Language Critic (\textsf{PNLC}), a novel approach that equips LLM agents with long-horizon planning and reasoning capabilities for complex, multi-turn interactive tasks. The authors note that while LLMs excel at many language tasks, scaling traditional Reinforcement Learning (RL) fine-tuning to large frontier models like GPT-4 for interactive environments is challenging due to high computational and memory costs and limited API access. Existing methods often rely on sophisticated prompting or inference-time search, which can be computationally expensive and may not lead to truly data-driven decision-making.
PNLC addresses these limitations by using offline RL to train a lightweight, auxiliary goal-conditioned value function. Rather than fine-tuning the LLM policy directly, the method uses this function as a natural language critic at inference time to guide the LLM agent's reasoning. The core idea is to learn a critic from offline data that predicts the likelihood of achieving various outcomes given the current state and a proposed action (specifically, a high-level "thought").
The training process involves:
- Offline Data Collection: Using a dataset of task-specific trajectories generated by a prior agent (e.g., a suboptimal LLM agent).
- Data Processing: Summarizing full interaction histories into compact descriptions and embedding these natural language descriptions (states and thoughts) using a cheaper LLM (like GPT-3).
- Value Function Training: Training a goal-conditioned Q-value function $Q(s, a^{\mathsf{tht}}, g)$ with a modified Implicit Q-Learning (IQL) algorithm. This function predicts the likelihood of reaching a future state (goal $g$) after executing a specific thought $a^{\mathsf{tht}}$ from state $s$. Goals are randomly sampled future states from the offline trajectories, and the reward is a binary indicator of whether the current state matches the goal state $g$. The value function is a lightweight MLP trained over the LLM embeddings rather than directly on natural language (see the sketch after this list).
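To make the training step concrete, below is a minimal sketch of how such a critic could be trained: a small MLP over frozen LLM embeddings, updated with an IQL-style expectile objective on hindsight-relabeled transitions with a binary goal-reaching reward. The class names, network sizes, embedding dimension, and hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 1536      # dimension of the frozen LLM text embeddings (assumed)
EXPECTILE = 0.9     # IQL expectile parameter (assumed)
GAMMA = 0.99        # discount factor (assumed)


class GoalConditionedQ(nn.Module):
    """Lightweight MLP critic Q(s, a_tht, g) over pre-computed LLM embeddings."""
    def __init__(self, emb_dim=EMB_DIM, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # output in [0, 1]: likelihood of reaching goal g
        )

    def forward(self, state_emb, thought_emb, goal_emb):
        return self.net(torch.cat([state_emb, thought_emb, goal_emb], dim=-1)).squeeze(-1)


class GoalConditionedV(nn.Module):
    """State-goal value function V(s, g) used for the IQL expectile target."""
    def __init__(self, emb_dim=EMB_DIM, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, state_emb, goal_emb):
        return self.net(torch.cat([state_emb, goal_emb], dim=-1)).squeeze(-1)


def expectile_loss(diff, tau=EXPECTILE):
    # Asymmetric L2 loss used by IQL: under-estimation is penalized more when tau > 0.5.
    weight = torch.where(diff > 0, torch.full_like(diff, tau), torch.full_like(diff, 1 - tau))
    return (weight * diff.pow(2)).mean()


def iql_update(q_net, v_net, q_opt, v_opt, batch):
    """One IQL-style update on a batch of hindsight-relabeled transitions.

    Each goal g is a randomly sampled future state from the same offline trajectory,
    and `reached` is 1.0 when the next state matches that goal (the binary reward).
    """
    s, tht, ns, g = batch["s"], batch["tht"], batch["next_s"], batch["g"]
    reached, done = batch["reached"], batch["done"]

    # Value step: regress V(s, g) toward Q(s, a_tht, g) with the expectile loss.
    with torch.no_grad():
        q_val = q_net(s, tht, g)
    v_loss = expectile_loss(q_val - v_net(s, g))
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # Q step: regress Q toward the bootstrapped goal-reaching target,
    # clamped to [0, 1] so it reads as a reach probability.
    with torch.no_grad():
        target = (reached + GAMMA * (1.0 - done) * v_net(ns, g)).clamp(max=1.0)
    q_loss = F.mse_loss(q_net(s, tht, g), target)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    return q_loss.item(), v_loss.item()
```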
At inference time, the LLM agent uses the learned goal-conditioned value function as a natural language critic. Given the current state and a proposed thought, the LLM generates hypothetical positive and negative future outcomes (goals). The critic then computes the likelihood $Q(s, a^{\mathsf{tht}}, g)$ for each generated goal, and these likelihoods are presented to the LLM agent as a natural language value assessment. The agent uses this feedback, which summarizes potential future outcomes without costly online simulation, to refine its proposed thought for the current step. The authors report that a small number of refinement iterations is often sufficient.
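The inference-time loop could look roughly like the sketch below, reusing the critic sketched in the previous block. The `llm` and `embed` interfaces, the prompt wording, and the default iteration and goal counts are assumptions for illustration rather than the paper's exact procedure.

```python
import torch


def refine_thought(llm, embed, q_net, state_text, n_iters=3, n_goals=4):
    """Sketch of PNLC-style inference: propose, imagine outcomes, score, refine.

    `llm(prompt) -> str` and `embed(text) -> torch.Tensor` are assumed interfaces;
    `q_net` is a goal-conditioned critic like the one sketched above.
    """
    thought = llm(f"State:\n{state_text}\n\nPropose a high-level thought for the next step.")
    for _ in range(n_iters):
        # 1. The LLM imagines hypothetical positive and negative future outcomes (goals).
        goals = llm(
            f"State:\n{state_text}\nThought:\n{thought}\n"
            f"List {n_goals} plausible good outcomes and {n_goals} plausible bad outcomes, one per line."
        ).splitlines()

        # 2. The learned critic scores how likely each outcome is under this thought.
        with torch.no_grad():
            s_emb, t_emb = embed(state_text), embed(thought)
            scores = {g: float(q_net(s_emb, t_emb, embed(g))) for g in goals if g.strip()}

        # 3. The scores are verbalized as a natural language value assessment.
        assessment = "\n".join(f"- '{g}': estimated likelihood {p:.2f}" for g, p in scores.items())

        # 4. The LLM refines its thought in light of the predicted outcomes.
        thought = llm(
            f"State:\n{state_text}\nCurrent thought:\n{thought}\n"
            f"Predicted outcomes:\n{assessment}\n"
            "Revise the thought to make the good outcomes more likely and the bad ones less likely."
        )
    return thought
```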
The method is evaluated on three diverse interactive tasks: Web Shopping (Yao et al., 2022) (a tool-use task), AvalonBench (Light et al., 2023) (a social deduction game), and a Persuasion task (Salvi et al., 2024; Wang et al., 2019) (a goal-oriented dialogue). PNLC consistently outperforms state-of-the-art methods, including direct RL fine-tuning (like ArCHer (Zhou et al., 2024)) and prompting-based approaches that rely on inference-time search (like LATS (Light et al., 2024)) or task-specific methods like Agent Q (Putta et al., 2024) and Strategist (Light et al., 2024). A key finding is that PNLC achieves superior performance while requiring significantly lower inference-time computation than search-based methods. For example, on WebShop, PNLC achieves a higher score (78.2) and a comparable success rate (48.0%) relative to Agent Q (77.1 score, 48.0% SR) while being much faster (5 s vs. 46 s per decision). Similarly, on Avalon, PNLC achieves a 47.0% win rate compared to Strategist's 42.0%, with much lower inference time (6 s vs. 62 s).
An ablation study confirms that both the goal-conditioned training and the inference-time refinement using natural language values are crucial to PNLC's performance. Removing either component degrades performance to the level of simple prompting baselines such as ReAct (Yao et al., 2022).
The paper's contributions are summarized as:
- Introducing goal-conditioned value functions that model outcome probabilities for LLMs to reason about long-term effects.
- Applying these value functions to high-level thoughts for efficient decision-making.
- Proposing a novel inference process where the LLM refines decisions based on predicted positive and negative outcomes from the critic.
While effective and scalable, the authors acknowledge limitations, including the need to train task-specific value functions, potential reliance on the LLM's ability to reason about hypothetical futures in out-of-domain tasks, and the dependence on the quality and diversity of the offline training data. The potential for dual use in tasks like persuasion is also noted.
Implementation details, including specific prompt templates and hyperparameters for each task, are provided in the appendix.