Plan of Thought (PoT) Framework

Updated 19 November 2025
  • Plan of Thought (PoT) is a framework that formalizes multi-step reasoning for LLMs using a POMDP structure to guide decision-making under uncertainty.
  • It leverages LLM self-reflection as a dynamic heuristic and incorporates an anytime Monte Carlo planning method to efficiently explore complex reasoning spaces.
  • Empirical evaluations on the Game of 24 benchmark show that PoT significantly outperforms chain-of-thought and tree-search methods, achieving an 89.4% success rate.

Plan of Thought (PoT)

Plan of Thought (PoT) is a framework for multi-step problem solving with LLMs that formalizes the reasoning process as online planning in a partially observable Markov decision process (POMDP) and leverages LLM self-reflection as a search heuristic. PoT was introduced to address the limitations of chain-of-thought (CoT) and earlier tree-search-augmented methods in compositional reasoning domains, demonstrating significant gains in both accuracy and computational efficiency, especially in tasks such as the "Game of 24" (Liu, 29 Apr 2024).

1. Formal POMDP Formulation of Reasoning

PoT frames LLM-based stepwise reasoning as sequential decision making over a state-action space characterized by uncertainty in the value of intermediate reasoning states. Formally, the POMDP is specified as $(S, A, O, T, Z, R, \gamma)$:

  • State Space ($S = S_{sub} \times U$):
    • $s_{sub}$: The explicit current subproblem or partial solution; for example, the remaining numbers and partial expressions in Game of 24.
    • $u$: A latent, unobservable "true usefulness" score of the current subproblem.
  • Action Space ($A = \{\text{continue}, \text{rollback}, \text{think}\}$):
    • "continue": Fix the current thought and proceed.
    • "rollback": Undo the last thought and revert.
    • "think": Prompt the LLM to generate a new candidate substep.
  • Observation Space ($O \cong S_{sub} \times V$):
    • $o = (s_{sub}, v)$, where $v$ is provided by the LLM's own reflection, $V(s_{sub}) \sim p^{value}_\theta(o \mid s_{sub})$, mapping to qualitative labels ("sure", "likely", "impossible").
  • Transition Function ($T(s' \mid s, a)$): Deterministic transitions reflecting action semantics.
  • Observation Function ($Z(o \mid s, a)$): LLM value judgment via $p^{value}_\theta$.
  • Reward Function ($R(s, a)$): Sparse reward: $R_{max}$ if a completed trajectory is correct (by LLM judgment), $R_{min}$ otherwise.
  • Discount Factor ($\gamma = 1$): No penalty for long reasoning chains.

This formalism provides a unified interface for both "thinking" (substep generation) and "reasoning" (selection among thoughts), allowing generalization to arbitrary multi-step reasoning domains (Liu, 29 Apr 2024).
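
A minimal sketch of how these POMDP components might be represented in Python for the Game of 24 setting; the class and field names (`Subproblem`, `remaining_numbers`, etc.) are illustrative assumptions, not taken from the paper:

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    CONTINUE = "continue"   # commit the current thought and proceed
    ROLLBACK = "rollback"   # undo the last thought and revert to the parent subproblem
    THINK = "think"         # prompt the LLM for a new candidate substep

@dataclass(frozen=True)
class Subproblem:
    """Observable part of the state s_sub, illustrated for Game of 24."""
    remaining_numbers: tuple[int, ...]   # e.g. (4, 6) after combining two of the inputs
    steps_so_far: tuple[str, ...]        # e.g. ("13 - 9 = 4", "10 - 4 = 6")

@dataclass(frozen=True)
class Observation:
    """o = (s_sub, v): the subproblem plus the LLM's qualitative reflection v."""
    subproblem: Subproblem
    value_label: str                     # one of "sure", "likely", "impossible"

# Sparse, undiscounted reward as in the PoT formulation.
R_MAX, R_MIN, GAMMA = 1.0, 0.0, 1.0
```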

2. Heuristic-Guided Search via LLM Reflections

A core feature of PoT is the use of LLM self-reflections to generate state-specific search heuristics. At each non-terminal state $s$, the LLM is prompted to evaluate the promise of $s$ using a single-token prompt that samples from {"sure", "likely", "impossible"}. Probabilities are then mapped to a scalar heuristic $h(s)$ via

$$h(s) = \sum_o \text{score}(o) \cdot p^{value}_\theta(o \mid s)$$

with typical score mapping: $\text{score}(\text{sure}) = 1.0$, $\text{score}(\text{likely}) = 0.5$, $\text{score}(\text{impossible}) = 0$. This dynamic, model-driven heuristic replaces hand-coded or synthetic priorities, biasing the planner toward chains that the LLM itself judges more promising (Liu, 29 Apr 2024).
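
For concreteness, the mapping from reflection probabilities to $h(s)$ is a one-line expectation. The sketch below assumes the three-label distribution has already been extracted from the LLM (obtaining it, e.g. from token log-probabilities, is omitted):

```python
# Score mapping for the three reflection labels, following the values above.
SCORE = {"sure": 1.0, "likely": 0.5, "impossible": 0.0}

def heuristic(value_probs: dict[str, float]) -> float:
    """h(s) = sum over labels o of score(o) * p_theta^value(o | s)."""
    return sum(SCORE[label] * p for label, p in value_probs.items())

# Example: a state judged mostly "likely" receives an intermediate priority.
# heuristic({"sure": 0.2, "likely": 0.7, "impossible": 0.1}) -> 0.55
```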

3. POMCP-Based Anytime Planning

PoT integrates the LLM and heuristic mechanism into a variant of Partially Observable Monte Carlo Planning (POMCP):

  • Statistics: Each action-observation history $h$ is tracked with counts $N(h)$, $N(h,a)$ and value estimates $Q(h,a)$.
  • Selection: The next action is chosen via an upper-confidence bound (UCB) criterion augmented by the LLM-based heuristic:

$$a^* = \arg\max_a \left[ Q(h, a) + c \sqrt{\frac{\ln N(h)}{N(h,a)+1}} + h(s') \right]$$

where $s'$ is the successor of $(h, a)$ and $c$ encodes the exploration level.

  • Expansion: New histories are expanded on first visitation by sampling $o \sim p^{value}_\theta(\cdot \mid s')$.
  • Rollout: Instead of random action sampling, the rollout policy executes up to a fixed depth $d$ using LLM chain-of-thought completions. The resulting leaf trajectory $\tau$ is assigned a posterior-weighted value estimate:

$$V_{rollout}(s_f) = \frac{R_{max} \cdot p^{evaluate}_\theta(\tau^* \mid \tau) + R_{min} \cdot p^{evaluate}_\theta(\neg\tau^* \mid \tau)}{R_{max} + R_{min}}$$

  • Backpropagation: Standard temporal-difference update.

This approach ensures anytime characteristics: when the computation budget is exhausted, the most plausible action/trajectory found so far is returned (Liu, 29 Apr 2024).
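
The selection rule and the rollout valuation can be written compactly. The following Python sketch is illustrative, not the paper's implementation; the `SearchStats` class and the `+1` inside the logarithm (a guard against unvisited histories) are assumptions made for the sake of a self-contained example:

```python
import math
from collections import defaultdict

class SearchStats:
    """Visit counts N(h), N(h,a) and value estimates Q(h,a), keyed by history tuples."""
    def __init__(self):
        self.N = defaultdict(int)
        self.Q = defaultdict(float)

def ucb_select(stats: SearchStats, h: tuple, actions, heuristic_of_successor, c: float = 1.0):
    """argmax_a [ Q(h,a) + c * sqrt(ln N(h) / (N(h,a)+1)) + h(s') ]."""
    def score(a):
        exploration = c * math.sqrt(math.log(stats.N[h] + 1) / (stats.N[(h, a)] + 1))
        return stats.Q[(h, a)] + exploration + heuristic_of_successor(a)
    return max(actions, key=score)

def rollout_value(p_correct: float, r_max: float = 1.0, r_min: float = 0.0) -> float:
    """Posterior-weighted leaf value, with p_correct = p_theta^evaluate(tau* | tau)."""
    return (r_max * p_correct + r_min * (1.0 - p_correct)) / (r_max + r_min)
```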

4. Empirical Evaluation and Performance

PoT has been evaluated on the canonical "Game of 24" benchmark (100 problems), under conditions directly comparable to Tree of Thoughts (ToT) (Liu, 29 Apr 2024). Experimental setup:

  • Thought Generation: GPT-3.5-Turbo-Instruct ($p^{thought}_\theta$)
  • Reflection/Evaluation: GPT-4-Turbo ($p^{value}_\theta$, $p^{evaluate}_\theta$)

Method                       Success Rate
Chain-of-Thought (CoT)       4.0%
Tree of Thoughts (b=1)       45%
Tree of Thoughts (b=5)       74%
Plan of Thought (PoT)        89.4%

PoT thus outperforms both CoT and ToT by a wide margin. Additional metrics for anytime behavior: with a 1-hour normalized cutoff, PoT achieves an AUC of 81.4%, and 83.7% of instances are solved within 10 minutes (Liu, 29 Apr 2024).
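
For context, a Game of 24 instance asks for an arithmetic expression that uses each of the four given numbers exactly once to reach 24. Success can be checked programmatically, as in the brute-force sketch below; note that the PoT experiments themselves use LLM judgment for the reward signal, so this checker is purely illustrative:

```python
from itertools import permutations, product

def solves_24(numbers: list[int]) -> str | None:
    """Return an expression over the four numbers that evaluates to 24, if one exists."""
    ops = "+-*/"
    for a, b, c, d in permutations(numbers):
        for o1, o2, o3 in product(ops, repeat=3):
            # The five parenthesizations of a binary expression over four operands.
            for expr in (
                f"(({a}{o1}{b}){o2}{c}){o3}{d}",
                f"({a}{o1}({b}{o2}{c})){o3}{d}",
                f"({a}{o1}{b}){o2}({c}{o3}{d})",
                f"{a}{o1}(({b}{o2}{c}){o3}{d})",
                f"{a}{o1}({b}{o2}({c}{o3}{d}))",
            ):
                try:
                    if abs(eval(expr) - 24) < 1e-6:
                        return expr
                except ZeroDivisionError:
                    continue
    return None

# Example: solves_24([4, 9, 10, 13]) returns an expression such as "(10-4)*(13-9)".
```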

5. Algorithmic and Practical Advantages

PoT exhibits several practical strengths relative to previous approaches:

  • Anytime Computation: The simulation budget is allocated adaptively; easy instances are solved quickly, while more effort is dynamically devoted to harder cases.
  • Heuristic Bias: LLM-derived heuristics prioritize search toward higher-value chains, preventing wasteful expansion of low-potential branches.
  • Scalability: Monte Carlo planning, rather than fixed-breadth tree search, enables resource-efficient scaling to problems with high solution multiplicity or deep compositionality (Liu, 29 Apr 2024).

Additionally, PoT's modular reward and heuristic structure permits adaptation to alternative domains and user utility functions.

6. Limitations and Future Directions

Limitations highlighted for PoT include:

  • Compute Cost: The planning process is more computationally intensive than plain CoT, especially due to repeated LLM calls for both proposing thoughts and heuristic evaluation.
  • Heuristic Reliability: Dependence on zero-shot LLM value posteriors can introduce bias or failure if the LLM's self-reflections misalign with the true likelihood of solution success.
  • Domain Specialization: Current evaluations are limited to arithmetic combinatorial settings (e.g., Game of 24); extension to broader domains (e.g., symbolic proofs, program synthesis) remains to be validated.

Proposed future work targets the training of surrogate value models to approximate $p^{value}_\theta$ for rollout cost reduction and the application of PoT to richer reasoning environments beyond arithmetic puzzles (Liu, 29 Apr 2024).

7. Relation to Tree-Search and Planning-Based Reasoning

PoT builds on but generalizes prior approaches such as Tree of Thoughts (ToT), which employs deterministic (fixed-breadth) search over LLM-generated reasoning trees. By adopting a POMDP formalism, leveraging stochastic heuristics, and integrating Monte Carlo anytime search, PoT enables more efficient exploration of high-branching reasoning spaces. Unlike ToT, which statically enumerates proposals at each subproblem, PoT dynamically adjusts both the depth and breadth of exploration, guided by the LLM's statewise confidence.

This synergy of LLM generation, reflective evaluation, and principled planning constitutes a new paradigm for LLM-based compositional problem solving in discrete domains (Liu, 29 Apr 2024).
