This paper introduces Reasoning via Planning (RAP), a framework designed to enhance the reasoning capabilities of LLMs by incorporating principles from planning and world models. The authors argue that standard LLM reasoning approaches like Chain-of-Thought (CoT) lack crucial components analogous to human planning: an internal world model to simulate states and action outcomes, a reward mechanism for guidance, and strategic exploration of the reasoning space.
RAP addresses these limitations by repurposing the LLM itself to function as both a reasoning agent and a world model.
- LLM as World Model: RAP defines reasoning as a Markov Decision Process (MDP).
- State (s_t): Represents the current situation in the reasoning process, defined per task (e.g., the block configuration in Blocksworld, intermediate variable values in math problems, the currently focused facts in logical reasoning).
- Action (a_t): Represents the next reasoning step (e.g., moving a block, proposing a sub-question, selecting a logical rule). Actions are sampled by the LLM acting as the agent.
- Transition: The LLM (as world model) predicts the next state s_{t+1} given the current state s_t and the chosen action a_t using dedicated prompts. Iterating this step yields a reasoning trace (s_0, a_0, s_1, ..., a_{T-1}, s_T); a minimal sketch of this loop appears after this list.
- Reward Design: A reward function assesses the quality of each reasoning step (see the reward sketch after this list). Several types of rewards are proposed:
- Action Likelihood: The log probability of the LLM generating action a_t given the current state s_t.
- State Confidence: For tasks involving state prediction (like answering sub-questions in math), the confidence is measured by sampling multiple answers and using the frequency of the most common one.
- Self-Evaluation: The LLM assesses the correctness of a step (e.g., by predicting the probability of "Yes" to "Is this step correct?").
- Task-Specific Heuristics: Custom rewards based on task goals (e.g., comparing the current block state to the goal state in Blocksworld).
- Planning with MCTS: RAP employs Monte Carlo Tree Search (MCTS) to explore the vast space of possible reasoning paths.
- MCTS iteratively builds a tree where nodes are states and edges are actions.
- It uses the Upper Confidence bounds applied to Trees (UCT) algorithm during the Selection phase to balance exploring less-visited paths and exploiting high-reward paths, based on estimated state-action values Q(s, a) (see the MCTS sketch after this list).
- The Expansion phase adds new actions/states to the tree from leaf nodes using the LLM.
- The Simulation phase estimates future rewards from new nodes (though in RAP, rewards are often calculated directly or based on heuristics rather than full rollouts).
- The Back-propagation phase updates the values along the selected path using the obtained rewards.
- After a budget of iterations, the best reasoning trace is selected (e.g., the path with the highest reward or the most visited path).
- RAP-Aggregation: For tasks where only the final answer matters (like math), the final answers from multiple MCTS traces can be aggregated (e.g., via majority voting) to improve robustness, similar to self-consistency (a small voting sketch is included below).
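The list above describes the LLM playing two roles. Below is a minimal sketch of that agent/world-model loop, assuming a hypothetical `llm` callable that returns `n` sampled completions for a prompt; the prompt strings are illustrative placeholders, not the paper's actual templates.

```python
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption, not the paper's API):
# llm(prompt, n) -> n sampled text completions.
LLM = Callable[[str, int], List[str]]

def propose_actions(llm: LLM, state: str, n_actions: int = 4) -> List[str]:
    """LLM as agent: sample candidate next reasoning steps a_t for state s_t."""
    prompt = f"Current state:\n{state}\n\nPropose the next reasoning step:"
    return llm(prompt, n_actions)

def predict_next_state(llm: LLM, state: str, action: str) -> str:
    """LLM as world model: predict the state s_{t+1} that results from action a_t."""
    prompt = (
        f"Current state:\n{state}\n\n"
        f"Reasoning step taken:\n{action}\n\n"
        f"Describe the resulting state:"
    )
    return llm(prompt, 1)[0]

def rollout_trace(llm: LLM, s0: str, depth: int) -> List[Tuple[str, str, str]]:
    """Greedy illustration of unrolling a trace (s_0, a_0, s_1, ..., a_{T-1}, s_T);
    RAP itself explores many such traces with MCTS rather than a single rollout."""
    trace, state = [], s0
    for _ in range(depth):
        action = propose_actions(llm, state, n_actions=1)[0]
        next_state = predict_next_state(llm, state, action)
        trace.append((state, action, next_state))
        state = next_state
    return trace
```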
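The reward terms can be sketched as follows. Here `sample_answers` and `yes_probability` are stand-ins for an answer-sampling call and a next-token probability query (real LLM APIs expose these differently); the action-likelihood reward would come from a token-level log-probability query and is omitted.

```python
from collections import Counter
from typing import Callable, List

def state_confidence(sample_answers: Callable[[str], List[str]], sub_question: str) -> float:
    """State confidence: sample several answers to a sub-question and reward the
    frequency of the most common one."""
    answers = sample_answers(sub_question)
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

def self_evaluation(yes_probability: Callable[[str], float], state: str, action: str) -> float:
    """Self-evaluation: probability the LLM assigns to 'Yes' when asked whether
    the proposed step is correct (prompt wording is illustrative)."""
    prompt = (
        f"State:\n{state}\n\nProposed step:\n{action}\n\n"
        f"Is this step correct? Answer Yes or No:"
    )
    return yes_probability(prompt)
```

How these terms are weighted or combined is a per-task design choice in the paper (see the task setups under Experiments and Results).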
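The MCTS procedure itself is standard. The skeleton below shows the selection (UCT), expansion, evaluation, and back-propagation structure; the `expand` and `evaluate` callbacks would wrap the LLM helpers and rewards sketched above. This is an illustration, not the paper's implementation.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    state: str
    parent: Optional["Node"] = None
    action_from_parent: Optional[str] = None   # a_t that led to this state
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    q_value: float = 0.0                       # running estimate of Q(s, a)

def uct_select(node: Node, c: float = 1.0) -> Node:
    """Selection: pick the child maximizing Q(s, a) plus an exploration bonus."""
    return max(
        node.children,
        key=lambda ch: ch.q_value
        + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-6)),
    )

def backpropagate(node: Optional[Node], value: float) -> None:
    """Back-propagation: update visit counts and Q estimates along the path."""
    while node is not None:
        node.visits += 1
        node.q_value += (value - node.q_value) / node.visits
        node = node.parent

def mcts(
    root: Node,
    expand: Callable[[Node], List[Node]],    # LLM proposes actions / next states
    evaluate: Callable[[Node], float],       # reward of the new step (no full rollout)
    n_iters: int = 30,
) -> Node:
    for _ in range(n_iters):
        node = root
        # 1. Selection: descend through already-expanded nodes via UCT.
        while node.children:
            node = uct_select(node)
        # 2. Expansion: add candidate actions/states from this leaf.
        children = expand(node)
        if children:
            node.children = children
            node = random.choice(children)
        # 3. Evaluation: RAP typically scores the new step directly (reward/heuristic)
        #    rather than simulating a full rollout.
        value = evaluate(node)
        # 4. Back-propagation.
        backpropagate(node, value)
    # One common read-out: the most visited child of the root as the preferred first step.
    return max(root.children, key=lambda ch: ch.visits) if root.children else root
```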
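RAP-Aggregation can then be as simple as majority voting over the final answers extracted from the sampled traces; `extract_answer` below is a hypothetical, task-specific parser.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def aggregate_answers(traces: Sequence[T], extract_answer: Callable[[T], str]) -> str:
    """Majority vote over the final answers of multiple MCTS reasoning traces,
    analogous to self-consistency."""
    answers = [extract_answer(trace) for trace in traces]
    return Counter(answers).most_common(1)[0][0]
```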
Experiments and Results:
RAP was evaluated on three tasks using LLaMA-33B by default:
- Plan Generation (Blocksworld):
- State: Block configuration. Action: Moving a block. Rewards: Action likelihood + task-specific goal proximity (see the sketch after this results list).
- RAP significantly outperformed CoT. Averaged over 2-, 4-, and 6-step problems, RAP achieved a 64% success rate, while CoT was near 0% on the 4- and 6-step problems. LLaMA-33B with RAP surpassed GPT-4 with CoT by a 33% relative improvement.
- Improvements attributed to valid action generation via state tracking, backtracking ability, and effective reward signals.
- Math Reasoning (GSM8k):
- State: Intermediate variable values. Action: Proposing a sub-question. Rewards: Self-evaluation + state confidence.
- RAP + Aggregation achieved 51.6% accuracy, outperforming CoT+SC (46.8%) and Least-to-Most+SC (42.5%).
- Logical Reasoning (PrOntoQA):
- State: Current focused fact. Action: Selecting a rule. Reward: Self-evaluation.
- RAP achieved 94.2% prediction accuracy and 78.8% proof accuracy, surpassing CoT (87.8% / 64.8%) and CoT+SC (89.8% prediction).
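As an illustration of the task-specific heuristic mentioned for Blocksworld, a hedged sketch of a goal-proximity reward: score the predicted block configuration by the fraction of goal conditions it already satisfies. The fact representation is hypothetical, and the exact scaling and its combination with action likelihood are the paper's design choices, not reproduced here.

```python
from typing import Set

def goal_proximity_reward(state_facts: Set[str], goal_facts: Set[str]) -> float:
    """Fraction of goal conditions satisfied by the current predicted state."""
    if not goal_facts:
        return 1.0
    return len(goal_facts & state_facts) / len(goal_facts)

# Example: two of the three goal conditions already hold.
state = {"red on table", "blue on red", "orange on table"}
goal = {"blue on red", "orange on blue", "red on table"}
assert abs(goal_proximity_reward(state, goal) - 2 / 3) < 1e-9
```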
Further experiments on the full Blocksworld dataset with Llama-2 70B showed RAP maintaining strong performance on much harder problems (up to 12 steps) where CoT largely failed, demonstrating the framework's scalability with model capability. Ablation studies confirmed the benefits of combining different reward types.
Conclusion: The RAP framework effectively integrates world modeling and planning (via MCTS) into LLM reasoning, enabling more strategic exploration and leading to substantial performance improvements on complex reasoning tasks compared to standard prompting methods.