Reasoning with Language Model is Planning with World Model (2305.14992v2)

Published 24 May 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have shown remarkable reasoning capabilities, especially when prompted to generate intermediate reasoning steps (e.g., Chain-of-Thought, CoT). However, LLMs can still struggle with problems that are easy for humans, such as generating action plans for executing tasks in a given environment, or performing complex math, logical, and commonsense reasoning. The deficiency stems from the key fact that LLMs lack an internal $\textit{world model}$ to predict the world $\textit{state}$ (e.g., environment status, intermediate variable values) and simulate long-term outcomes of actions. This prevents LLMs from performing deliberate planning akin to human brains, which involves exploring alternative reasoning paths, anticipating future states and rewards, and iteratively refining existing reasoning steps. To overcome the limitations, we propose a new LLM reasoning framework, $\underline{R}$easoning vi$\underline{a}$ $\underline{P}$lanning $\textbf{(RAP)}$. RAP repurposes the LLM as both a world model and a reasoning agent, and incorporates a principled planning algorithm (based on Monte Carlo Tree Search) for strategic exploration in the vast reasoning space. During reasoning, the LLM (as agent) incrementally builds a reasoning tree under the guidance of the LLM (as world model) and task-specific rewards, and obtains a high-reward reasoning path efficiently with a proper balance between exploration $\textit{vs.}$ exploitation. We apply RAP to a variety of challenging reasoning problems including plan generation, math reasoning, and logical inference. Empirical results on these tasks demonstrate the superiority of RAP over various strong baselines, including CoT and least-to-most prompting with self-consistency. RAP on LLaMA-33B surpasses CoT on GPT-4 with 33% relative improvement in a plan generation setting.

Citations (354)

Summary

  • The paper introduces RAP, which integrates world models with LLMs using Monte Carlo Tree Search to overcome reasoning limitations.
  • Experimental results demonstrate RAP's superior performance over Chain-of-Thought in Blocksworld, GSM8K, and logical inference tasks.
  • The framework offers a robust strategy by simulating future states and optimizing decision-making via dynamic reward formulations and iterative planning.

Reasoning with LLM is Planning with World Model

The paper introduces "Reasoning via Planning (RAP)," a framework that augments LLMs with world models and planning algorithms to enhance reasoning capabilities. By integrating deliberate planning, RAP addresses inherent limitations in LLM's reasoning by employing Monte Carlo Tree Search (MCTS) to balance exploration and exploitation efficiently.

Introduction to RAP

RAP aims to overcome the lack of internal world models in traditional LLM approaches, which restricts their ability to simulate world states and predict long-term outcomes of actions. The framework repurposes LLMs as both reasoning agents and world models, enabling them to generate and simulate alternative reasoning paths, anticipate future states, and iteratively refine reasoning strategies.

Figure 1: An overview of Reasoning via Planning (RAP), compared with previous LLM reasoning methods such as Chain-of-Thought (Wei et al., 2022).

Implementing RAP

RAP is implemented as a structured framework in which the LLM, acting as a world model, predicts the next state given the current state and an action. This is illustrated across domains such as Blocksworld for plan generation, GSM8K for math reasoning, and PrOntoQA for logical reasoning.
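
As a rough illustration of the world-model role, the sketch below shows how an LLM might be prompted to predict the next state from the current state and an action. The `llm_complete` helper and the prompt template are assumptions for illustration, not the paper's exact few-shot format.

```python
# A minimal sketch of the LLM-as-world-model role, assuming a hypothetical
# `llm_complete(prompt) -> str` text-completion helper; the prompt template
# below is illustrative, not the paper's exact few-shot format.

def predict_next_state(llm_complete, state: str, action: str) -> str:
    """Ask the LLM, acting as a world model, to describe the state after `action`."""
    prompt = (
        "You are simulating a Blocksworld environment.\n"
        f"Current state: {state}\n"
        f"Action: {action}\n"
        "Resulting state:"
    )
    return llm_complete(prompt).strip()


# Example usage with a stub completion function standing in for a real LLM call:
if __name__ == "__main__":
    stub = lambda p: "the orange block is on the red block; the hand is empty"
    print(predict_next_state(
        stub,
        state="the orange block is on the table; the red block is on the table",
        action="stack the orange block on the red block",
    ))
```
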

  1. World Model: In RAP, the world model predicts transitions between states. For example, in Blocksworld, the LLM predicts the resulting block configuration after actions such as stacking or unstacking blocks.
  2. Reward Formulation: RAP employs several reward functions to evaluate reasoning steps, such as the likelihood of an action, the confidence of the predicted state, and LLM self-evaluation. These leverage the LLM's internal world knowledge to assess and guide decision-making.
  3. Monte Carlo Tree Search: MCTS facilitates strategic exploration by constructing a reasoning tree. It selectively expands the tree by sampling actions, simulating potential future outcomes using the world model, and back-propagating estimated rewards to update state-action value estimates; a compressed sketch of this loop follows the figure caption below.

    Figure 2: An illustration of the four phases in an iteration in MCTS planning.
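
To make the planning loop concrete, here is a compressed sketch of the four MCTS phases (selection, expansion, evaluation, back-propagation) with the LLM plugged in through assumed callables: `propose_actions` (LLM as agent), `predict_next_state` (LLM as world model), and `reward` (e.g., a mix of action likelihood and self-evaluation, as the paper describes). The UCT constant, iteration count, and depth limit are illustrative defaults, not the paper's settings.

```python
import math
import random

class Node:
    """A node in the reasoning tree: a world state reached by an action."""
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.visits = 0      # how often this node was visited
        self.value = 0.0     # running mean of back-propagated rewards

def uct_select(node, c=1.4):
    """Pick the child with the best exploration/exploitation trade-off (UCT)."""
    def score(child):
        if child.visits == 0:
            return float("inf")            # always try unvisited children first
        return child.value + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children, key=score)

def mcts_plan(root_state, propose_actions, predict_next_state, reward,
              iterations=50, depth_limit=6):
    root = Node(root_state)
    for _ in range(iterations):
        # 1) Selection: descend the tree via UCT until reaching a leaf.
        node, depth = root, 0
        while node.children:
            node = uct_select(node)
            depth += 1
        # 2) Expansion: the LLM agent proposes actions; the LLM world model
        #    simulates the successor state for each proposed action.
        if depth < depth_limit:
            for action in propose_actions(node.state):
                child = Node(predict_next_state(node.state, action),
                             parent=node, action=action)
                node.children.append(child)
            if node.children:
                node = random.choice(node.children)
        # 3) Evaluation: score the reached step with the task-specific reward.
        r = reward(node.state, node.action)
        # 4) Back-propagation: update visit counts and value estimates up to the root.
        while node is not None:
            node.visits += 1
            node.value += (r - node.value) / node.visits
            node = node.parent
    # Read out the highest-value action sequence found so far.
    plan, node = [], root
    while node.children:
        node = max(node.children, key=lambda ch: ch.value)
        plan.append(node.action)
    return plan
```
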

Experimental Evaluation

RAP demonstrates substantial improvements across diverse reasoning tasks:

  • Blocksworld Task: RAP significantly outperforms the Chain-of-Thought method, achieving a higher success rate in transitioning blocks to target configurations, especially in complex scenarios requiring multiple steps.
  • Mathematical Reasoning (GSM8K): RAP shows enhanced accuracy in solving math problems by dynamically generating and evaluating sub-questions, with RAP-Aggregation over multiple reasoning paths further improving results (a sketch of such aggregation follows this list).

    Figure 3: Results on GSM-8K, with different numbers of sampled paths or iterations.

  • Logical Inference (PrOntoQA): RAP outperforms baseline methods in both prediction accuracy and proof generation, illustrating its superior ability in logical reasoning tasks.
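
For settings like GSM8K where only a final answer is needed, the aggregation idea can be sketched as running the planner several times, collecting the final answers from the resulting reasoning paths, and voting over them. The reward-weighted vote below is an illustrative assumption about the weighting; the paper's exact aggregation scheme may differ.

```python
from collections import defaultdict

def aggregate_answers(reasoning_paths):
    """Vote over final answers from multiple reasoning paths.

    `reasoning_paths` is a list of (final_answer, reward) pairs, e.g. produced
    by repeated planner runs. The reward weighting here is an illustrative
    assumption; an unweighted majority vote is the simplest alternative.
    """
    scores = defaultdict(float)
    for answer, reward in reasoning_paths:
        scores[answer] += reward
    return max(scores, key=scores.get)

# Example: three sampled paths, two of which agree on the answer "18".
print(aggregate_answers([("18", 0.8), ("18", 0.6), ("24", 0.9)]))  # -> "18"
```
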

Implications and Future Work

RAP's versatile approach to reasoning paves the way for improved LLM applications in complex, real-world scenarios requiring strategic planning. Integrating planning algorithms with LLMs as world models presents a robust methodology for tackling previously challenging tasks. Future research may focus on refining reward functions and exploring the integration of fine-tuning techniques or external tools to expand RAP's applicability across various domains.

In conclusion, the Reasoning via Planning framework elevates the reasoning capabilities of LLMs by leveraging internal simulations and principled planning frameworks, marking a notable advance in AI research and applications.
