This paper introduces Reasoning via Planning (RAP), a framework designed to enhance the reasoning capabilities of LLMs by incorporating principles from planning and world models. The authors argue that standard LLM reasoning approaches like Chain-of-Thought (CoT) lack crucial components analogous to human planning: an internal world model to simulate states and action outcomes, a reward mechanism for guidance, and strategic exploration of the reasoning space.
RAP addresses these limitations by repurposing the LLM itself to function as both a reasoning agent and a world model.
- LLM as World Model: RAP defines reasoning as a Markov Decision Process (MDP).
- State (s_t): Represents the current situation in the reasoning process, defined per task (e.g., the block configuration in Blocksworld, intermediate variable values in math problems, the currently focused facts in logical reasoning).
- Action (a_t): Represents the next reasoning step (e.g., moving a block, proposing a sub-question, selecting a logical rule). Actions are sampled by the LLM acting as the agent.
- Transition: The LLM (as world model) predicts the next state s_{t+1} given the current state s_t and the chosen action a_t using dedicated prompts. Iterating this step yields a reasoning trace (s_0, a_0, s_1, ..., a_{T-1}, s_T); a minimal sketch of this loop appears after this list.
- Reward Design: A reward function assesses the quality of each reasoning step (see the reward sketch after this list). Several types of rewards are proposed:
- Action Likelihood: The log probability of the LLM generating action a_t given the current state s_t.
- State Confidence: For tasks involving state prediction (like answering sub-questions in math), the confidence is measured by sampling multiple answers and using the frequency of the most common one.
- Self-Evaluation: The LLM assesses the correctness of a step (e.g., by predicting the probability of "Yes" to "Is this step correct?").
- Task-Specific Heuristics: Custom rewards based on task goals (e.g., comparing the current block state to the goal state in Blocksworld).
- Planning with MCTS: RAP employs Monte Carlo Tree Search (MCTS) to explore the vast space of possible reasoning paths.
- MCTS iteratively builds a tree where nodes are states and edges are actions.
- It uses the Upper Confidence bounds applied to Trees (UCT) algorithm during the Selection phase to balance exploring less-visited paths and exploiting high-reward paths, based on estimated state-action values Q(s, a) (see the MCTS sketch after this list).
- The Expansion phase adds new actions/states to the tree from leaf nodes using the LLM.
- The Simulation phase estimates future rewards from new nodes (though in RAP, rewards are often calculated directly or based on heuristics rather than full rollouts).
- The Back-propagation phase updates the values along the selected path using the obtained rewards.
- After a budget of iterations, the best reasoning trace is selected (e.g., the path with the highest reward or the most visited path).
- RAP-Aggregation: For tasks where only the final answer matters (like math), the final answers from multiple MCTS traces can be aggregated (e.g., via majority voting) to improve robustness, similar to self-consistency (a small voting sketch is included below).
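The list above describes the LLM playing two roles. Below is a minimal sketch of that agent/world-model loop, assuming a hypothetical `llm` callable that returns `n` sampled completions for a prompt; the prompt strings are illustrative placeholders, not the paper's actual templates.

```python
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption, not the paper's API):
# llm(prompt, n) -> n sampled text completions.
LLM = Callable[[str, int], List[str]]

def propose_actions(llm: LLM, state: str, n_actions: int = 4) -> List[str]:
    """LLM as agent: sample candidate next reasoning steps a_t for state s_t."""
    prompt = f"Current state:\n{state}\n\nPropose the next reasoning step:"
    return llm(prompt, n_actions)

def predict_next_state(llm: LLM, state: str, action: str) -> str:
    """LLM as world model: predict the state s_{t+1} that results from action a_t."""
    prompt = (
        f"Current state:\n{state}\n\n"
        f"Reasoning step taken:\n{action}\n\n"
        f"Describe the resulting state:"
    )
    return llm(prompt, 1)[0]

def rollout_trace(llm: LLM, s0: str, depth: int) -> List[Tuple[str, str, str]]:
    """Greedy illustration of unrolling a trace (s_0, a_0, s_1, ..., a_{T-1}, s_T);
    RAP itself explores many such traces with MCTS rather than a single rollout."""
    trace, state = [], s0
    for _ in range(depth):
        action = propose_actions(llm, state, n_actions=1)[0]
        next_state = predict_next_state(llm, state, action)
        trace.append((state, action, next_state))
        state = next_state
    return trace
```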
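The reward terms can be sketched as follows. Here `sample_answers` and `yes_probability` are stand-ins for an answer-sampling call and a next-token probability query (real LLM APIs expose these differently); the action-likelihood reward would come from a token-level log-probability query and is omitted.

```python
from collections import Counter
from typing import Callable, List

def state_confidence(sample_answers: Callable[[str], List[str]], sub_question: str) -> float:
    """State confidence: sample several answers to a sub-question and reward the
    frequency of the most common one."""
    answers = sample_answers(sub_question)
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

def self_evaluation(yes_probability: Callable[[str], float], state: str, action: str) -> float:
    """Self-evaluation: probability the LLM assigns to 'Yes' when asked whether
    the proposed step is correct (prompt wording is illustrative)."""
    prompt = (
        f"State:\n{state}\n\nProposed step:\n{action}\n\n"
        f"Is this step correct? Answer Yes or No:"
    )
    return yes_probability(prompt)
```

How these terms are weighted or combined is a per-task design choice in the paper (see the task setups under Experiments and Results).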
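The MCTS procedure itself is standard. The skeleton below shows the selection (UCT), expansion, evaluation, and back-propagation structure; the `expand` and `evaluate` callbacks would wrap the LLM helpers and rewards sketched above. This is an illustration, not the paper's implementation.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    state: str
    parent: Optional["Node"] = None
    action_from_parent: Optional[str] = None   # a_t that led to this state
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    q_value: float = 0.0                       # running estimate of Q(s, a)

def uct_select(node: Node, c: float = 1.0) -> Node:
    """Selection: pick the child maximizing Q(s, a) plus an exploration bonus."""
    return max(
        node.children,
        key=lambda ch: ch.q_value
        + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-6)),
    )

def backpropagate(node: Optional[Node], value: float) -> None:
    """Back-propagation: update visit counts and Q estimates along the path."""
    while node is not None:
        node.visits += 1
        node.q_value += (value - node.q_value) / node.visits
        node = node.parent

def mcts(
    root: Node,
    expand: Callable[[Node], List[Node]],    # LLM proposes actions / next states
    evaluate: Callable[[Node], float],       # reward of the new step (no full rollout)
    n_iters: int = 30,
) -> Node:
    for _ in range(n_iters):
        node = root
        # 1. Selection: descend through already-expanded nodes via UCT.
        while node.children:
            node = uct_select(node)
        # 2. Expansion: add candidate actions/states from this leaf.
        children = expand(node)
        if children:
            node.children = children
            node = random.choice(children)
        # 3. Evaluation: RAP typically scores the new step directly (reward/heuristic)
        #    rather than simulating a full rollout.
        value = evaluate(node)
        # 4. Back-propagation.
        backpropagate(node, value)
    # One common read-out: the most visited child of the root as the preferred first step.
    return max(root.children, key=lambda ch: ch.visits) if root.children else root
```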
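RAP-Aggregation can then be as simple as majority voting over the final answers extracted from the sampled traces; `extract_answer` below is a hypothetical, task-specific parser.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def aggregate_answers(traces: Sequence[T], extract_answer: Callable[[T], str]) -> str:
    """Majority vote over the final answers of multiple MCTS reasoning traces,
    analogous to self-consistency."""
    answers = [extract_answer(trace) for trace in traces]
    return Counter(answers).most_common(1)[0][0]
```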
Experiments and Results:
RAP was evaluated on three tasks using LLaMA-33B by default:
- Plan Generation (Blocksworld):
- State: Block configuration. Action: Moving a block. Rewards: Action likelihood + task-specific goal proximity (see the sketch after this results list).
- RAP significantly outperformed CoT. Averaged over 2-, 4-, and 6-step problems, RAP achieved a 64% success rate, while CoT was near 0% on the 4- and 6-step problems. LLaMA-33B with RAP surpassed GPT-4 with CoT by a 33% relative improvement.
- Improvements attributed to valid action generation via state tracking, backtracking ability, and effective reward signals.
- Math Reasoning (GSM8k):
- State: Intermediate variable values. Action: Proposing a sub-question. Rewards: Self-evaluation + state confidence.
- RAP + Aggregation achieved 51.6% accuracy, outperforming CoT+SC (46.8%) and Least-to-Most+SC (42.5%).
- Logical Reasoning (PrOntoQA):
- State: Current focused fact. Action: Selecting a rule. Reward: Self-evaluation.
- RAP achieved 94.2% prediction accuracy and 78.8% proof accuracy, surpassing CoT (87.8% / 64.8%) and CoT+SC (89.8% prediction).
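As an illustration of the task-specific heuristic mentioned for Blocksworld, a hedged sketch of a goal-proximity reward: score the predicted block configuration by the fraction of goal conditions it already satisfies. The fact representation is hypothetical, and the exact scaling and its combination with action likelihood are the paper's design choices, not reproduced here.

```python
from typing import Set

def goal_proximity_reward(state_facts: Set[str], goal_facts: Set[str]) -> float:
    """Fraction of goal conditions satisfied by the current predicted state."""
    if not goal_facts:
        return 1.0
    return len(goal_facts & state_facts) / len(goal_facts)

# Example: two of the three goal conditions already hold.
state = {"red on table", "blue on red", "orange on table"}
goal = {"blue on red", "orange on blue", "red on table"}
assert abs(goal_proximity_reward(state, goal) - 2 / 3) < 1e-9
```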
Further experiments on the full Blocksworld dataset with Llama-2 70B showed RAP maintaining strong performance on much harder problems (up to 12 steps) where CoT largely failed, demonstrating the framework's scalability with model capability. Ablation studies confirmed the benefits of combining different reward types.
Conclusion: The RAP framework effectively integrates world modeling and planning (via MCTS) into LLM reasoning, enabling more strategic exploration and leading to substantial performance improvements on complex reasoning tasks compared to standard prompting methods.