Plan-and-Solve Prompting Overview
- Plan-and-Solve Prompting is a strategy that decomposes complex tasks into distinct planning and solving phases, enhancing multi-step consistency.
- It employs an explicit planning phase to identify subgoals and a systematic execution phase to reduce errors like missing steps.
- The framework spans diverse applications, from mathematical problem solving to vision-based UI automation, and is instantiated by methods such as CAAP and BSM.
Plan-and-solve prompting is a family of prompting strategies for LLMs that decomposes complex reasoning or action-generation tasks into explicit planning and solving/execution phases. Unlike monolithic prompt designs or basic chain-of-thought (CoT) prompting, plan-and-solve frameworks enforce a structured, multi-phase approach: first inducing a plan (task decomposition, subgoal identification, or high-level prescription), then guiding the model to execute that plan systematically. This paradigm improves multi-step consistency and interpretability and reduces common errors such as missing-step and semantic reasoning failures across diverse domains, including mathematical problem solving, complex question answering, vision-based UI automation, and document reasoning.
1. Core Principles and Taxonomy
All plan-and-solve prompting methods share the architectural motif of separating the generation of a problem-solving plan from its subsequent execution. The main stages can be categorized as:
- Planning Phase: The LLM is required to pause and produce a high-level plan, decomposition, or list of subgoals before attempting detailed solutions or actions.
- Solving/Execution Phase: Conditioned on the explicit plan, the LLM (and/or downstream components) executes each subtask, computes intermediate results, or issues actions in sequence or parallel; a minimal sketch of the two phases follows this list.
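The two phases can be chained as separate model calls. A minimal sketch, assuming a hypothetical `call_llm` callable (prompt string in, completion string out) standing in for any completion API:

```python
# Minimal plan-then-solve pipeline as two chained LLM calls.
# `call_llm` is a hypothetical callable: prompt string in, completion string out.

def plan_and_solve(problem: str, call_llm) -> str:
    # Planning phase: elicit an explicit, numbered plan before any solving.
    plan_prompt = (
        f"Q: {problem}\n"
        "A: Let's first understand the problem and devise a plan to solve it.\n"
        "List the plan as numbered steps; do not solve anything yet."
    )
    plan = call_llm(plan_prompt)

    # Solving/execution phase: condition the model on its own plan and execute it.
    solve_prompt = (
        f"Q: {problem}\n"
        f"Plan:\n{plan}\n"
        "Now carry out the plan and solve the problem step by step."
    )
    return call_llm(solve_prompt)
```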
Most plan-and-solve prompting frameworks differ along several methodological axes:
| Dimension | Variants | Notable Methods |
|---|---|---|
| Plan structure | Linear list, tree, or graph | Plan-and-Solve (Wang et al., 2023), BSM (Saha et al., 2023) |
| Decomposition modality | Manual, in-context/few-shot, zero-shot, or learned from feedback | Learning to Plan (Guo et al., 2023), UPAR (Geng et al., 2023) |
| Execution granularity | Stepwise monolithic, iterative with state update, parallel | CAAP (Cho et al., 11 Jun 2024), PEARL (Sun et al., 2023) |
| Task domain | Math, QA, action, vision, document, multi-agent | CAAP, PEARL, Successive Prompting (Dua et al., 2022) |
This taxonomy enables compositional extension and integration with related paradigms, such as program-of-thoughts, least-to-most, branching, and memory-augmented prompting.
2. Foundational Frameworks and Formalisms
Basic Plan-and-Solve Prompting
The canonical form, as introduced in "Plan-and-Solve Prompting" (Wang et al., 2023), divides the prompt in two:
- Produce a high-level plan:
- Instruction: "Let's first understand the problem and devise a plan to solve the problem."
- Output: natural-language enumeration of subgoals or operations: 1., 2., ...
- Execute the plan step by step:
- Instruction: "Then, let's carry out the plan and solve the problem step by step."
- Output: chain-of-thought reasoning, with explicit computation per plan step.
The process may be formalized as: $\text{Prompt:}\quad Q: x \qquad A:\ \text{(Plan) } p_1,\dots,p_k \qquad \text{(Solve) inference steps following } p_1,\dots,p_k.$ The decoupling of planning and solving reduces missing-step and semantic errors compared to direct CoT prompting.
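As a concrete illustration, a sketch of the canonical single-prompt form followed by a separate answer-extraction call; `call_llm` is a hypothetical completion callable and the extraction wording is only indicative:

```python
# Canonical single-prompt Plan-and-Solve, followed by answer extraction.
# `call_llm` is a hypothetical completion callable (see the earlier sketch).

PS_TRIGGER = (
    "Let's first understand the problem and devise a plan to solve the problem. "
    "Then, let's carry out the plan and solve the problem step by step."
)

def plan_and_solve_single_prompt(problem: str, call_llm) -> str:
    # One prompt elicits both the plan and the step-by-step solution.
    reasoning = call_llm(f"Q: {problem}\nA: {PS_TRIGGER}")
    # Separate reasoning from answer extraction via a follow-up prompt (see Section 3).
    return call_llm(f"Q: {problem}\nA: {reasoning}\nTherefore, the answer is").strip()
```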
Successive Prompting
"Successive Prompting for Decomposing Complex Questions" (Dua et al., 2022) operationalizes plan-and-solve as an iterative QD (Question Decomposition) and QA (Question Answering) loop:
- At each round, the model generates a simpler subquestion, solves it, and accumulates the history.
- Iteration halts upon emitting an "EOQ" (end-of-questions) marker; the last answer is the final solution.
Formally, for input $x$ and history $H_{k-1} = \{(q_1, a_1), \dots, (q_{k-1}, a_{k-1})\}$, stage $k$ produces $q_k = \mathrm{QD}(x, H_{k-1})$ and $a_k = \mathrm{QA}(x, H_{k-1}, q_k)$. The full generative process iterates these two steps, appending $(q_k, a_k)$ to the history, until $\mathrm{QD}$ emits the EOQ marker; the most recent $a_k$ is returned as the final answer. This strategy enables modular supervision, explicit decomposition annotation, and direct integration of symbolic solvers when necessary.
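A sketch of this loop, assuming hypothetical `qd` and `qa` callables wrapping the decomposition and answering components (LLM prompts or symbolic solvers):

```python
# Successive Prompting: alternate question decomposition (QD) and question
# answering (QA) until the decomposer emits the EOQ marker.

def successive_prompting(question: str, qd, qa, max_steps: int = 10) -> str:
    history = []  # accumulated (sub-question, sub-answer) pairs
    answer = ""
    for _ in range(max_steps):
        sub_q = qd(question, history)          # propose the next simpler sub-question
        if sub_q.strip() == "EOQ":             # end-of-questions: stop decomposing
            break
        answer = qa(question, history, sub_q)  # solve the sub-question in context
        history.append((sub_q, answer))
    return answer                              # last sub-answer is the final solution
```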
Context-Aware Action Planning Prompting (CAAP)
"CAAP" (Cho et al., 11 Jun 2024) adapts the plan-and-solve paradigm to vision-only software control by alternating between:
- Planning (Decision Maker): Uses screenshot-derived UI descriptions, task goals, and action histories to prompt LLM-based action planning.
- Execution (Action Executor): Parses planned actions, executes them (keyboard/mouse), captures new state, and repeats the planning.
At each time step $t$, context collection integrates three perspectives: the screenshot-derived UI description $s_t$, the task objective $g$, and the action history $h_t = (a_1, \dots, a_{t-1})$, with the next action $a_t$ generated by the LLM conditioned on the assembled context $c_t = (s_t, g, h_t)$.
Chain-of-thought and few-shot demonstrations are tightly interleaved in prompt templates, yielding high robustness in UI automation.
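A sketch of the alternation, with hypothetical `describe_screen`, `plan_actions`, and `execute` callables standing in for the screen parser, LLM decision maker, and action executor:

```python
# CAAP-style plan-execute loop for vision-only UI control (schematic).

def caap_loop(goal: str, describe_screen, plan_actions, execute, max_steps: int = 30):
    history = []  # previously executed actions
    for _ in range(max_steps):
        screen = describe_screen()                     # screenshot-derived UI description
        actions = plan_actions(goal, screen, history)  # LLM planning (CoT + demonstrations)
        if not actions:                                # empty plan: task judged complete
            break
        for action in actions:
            execute(action)                            # issue keyboard/mouse command
            history.append(action)
    return history
```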
Branch-Solve-Merge (BSM)
The "Branch-Solve-Merge" (BSM) protocol (Saha et al., 2023) instantiates a modular plan-and-solve pipeline where:
- Branch: Decompose the task into parallel subproblems (criteria, concept-sets, constraints) using a dedicated branch prompt.
- Solve: Independently solve all branches with isolated prompts.
- Merge: Fuse partial solutions into a coherent answer.
Formally: $\{b_1, \dots, b_k\} = \mathrm{Branch}(x)$, $s_i = \mathrm{Solve}(x, b_i)$ for each branch $i$, and $y = \mathrm{Merge}(x, s_1, \dots, s_k)$.
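A sketch of the pipeline, assuming hypothetical `branch`, `solve`, and `merge` prompt wrappers around an LLM:

```python
# Branch-Solve-Merge: parallel decomposition, isolated solving, and fusion.

def branch_solve_merge(task: str, branch, solve, merge) -> str:
    subproblems = branch(task)                            # plan: parallel sub-problems
    partials = [solve(task, sub) for sub in subproblems]  # solve each branch in isolation
    return merge(task, subproblems, partials)             # fuse into one coherent answer
```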
3. Prompt Engineering, Templates, and Best Practices
Effective plan-and-solve prompting depends on precise, well-calibrated instructions covering both planning and solving trajectories. Representative prompt structures include:
- Vanilla PS (Wang et al., 2023):
  ```text
  Q: [problem]
  A: Let's first understand the problem and devise a plan to solve the problem.
  Then, let's carry out the plan and solve the problem step by step.
  ```
- Enhanced PS+:
  ```text
  Q: [problem]
  A: Let's first understand the problem, extract relevant variables and their corresponding numerals, and devise a complete plan.
  Then, let's carry out the plan, calculate intermediate variables (pay attention to correct numerical calculation and commonsense), solve the problem step by step, and show the answer.
  ```
- CAAP Planning Prompt (Cho et al., 11 Jun 2024): Four-section prompt: human demonstrations, task/screen/history context, chain-of-thought induction, function-call output format.
- UPAR Plan/Act Phases (Geng et al., 2023):
  ```text
  [Preceded by Context/Understand phase]
  Let's make a brief plan to solve this question step by step:
  [Your plan here.]
  Now, let's execute the plan step by step:
  [Your solution here.]
  ```
- PEARL Action Planning (Sun et al., 2023):
  ```text
  [List of available actions]
  [Few-shot plan decomposition examples]
  [Question]
  ...
  My new actions: ...
  My sequence of actions: ...
  ```
Best practices refined by ablation include:
- Always require the explicit plan before solution steps.
- Emphasize variable grounding (listing concrete values before reasoning).
- Separate reasoning from answer extraction (e.g., via a follow-up prompt).
- Use self-consistency voting for reliability (e.g., sampling 10 solutions; see the sketch after this list).
- Consider prompt phrasing sensitivity and validate on held-out splits.
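For the self-consistency practice above, a minimal sketch, assuming a hypothetical `sample_solution` callable that returns one extracted answer string per stochastic (temperature > 0) sample:

```python
from collections import Counter

# Self-consistency: sample several PS solutions and majority-vote the answers.
def self_consistent_answer(problem: str, sample_solution, n_samples: int = 10) -> str:
    answers = [sample_solution(problem).strip() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]  # ties break by first occurrence
```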
4. Empirical Performance and Comparative Analysis
Empirical studies on mathematical, commonsense, symbolic, and real-world automation tasks show that plan-and-solve prompting yields substantial accuracy improvements compared to both zero-shot and few-shot CoT, as well as monolithic prompting baselines.
Mathematical and Symbolic Tasks
- PS+ (Wang et al., 2023) achieves up to +8.0% absolute improvement over zero-shot CoT on MultiArith and consistently closes the gap to or exceeds 8-shot manual CoT, reaching 91.8%/59.3%/76.7% on MultiArith/GSM8K/SVAMP.
- UPAR (Geng et al., 2023) raises GSM8K-H accuracy from 22.9% to 58.3% (UPAR-S) and causal-judgment accuracy from 67.9% to 75.4%.
- Successive Prompting (Dua et al., 2022) outperforms CoT by 3.5–4.3 F1 (DROP few-shot, in-context), and fine-tuned models show +5.4 F1 over strong symbolic baselines.
- Learning to Plan (Guo et al., 2023) delivers a +15% advantage over zero-shot CoT on AMPS, and demonstrates cross-model plan transfer (plans learned by ChatGPT improving GPT-4 performance).
Vision-based Automation
- CAAP (Cho et al., 11 Jun 2024) achieves 94.4% average success on MiniWoB++ (vision-only, no HTML), outperforming prior vision-based agents, while requiring only 1.48 demos/task (99 total). Ablation reveals −2.4% loss without CoT instructions and −8.7% without demos.
Long Document Reasoning
- PEARL (Sun et al., 2023) achieves 70.9% accuracy on “Long” QuALITY test cases (vs GPT-4 ZS 64.3%), demonstrating that programmatic, plan-execute decomposition is essential for complex, compositional document tasks.
Multi-faceted Evaluation and Generation
- Branch-Solve-Merge (Saha et al., 2023) improves human-LLM agreement in evaluation by +12 points over zero-shot, and increases constraint satisfaction in story generation by 12%, outperforming LLaMA2-chat and closing the gap to proprietary models.
5. Extensions, Variants, and Theoretical Considerations
The plan-and-solve paradigm composes with, and in some cases subsumes, numerous other prompting innovations:
- Hint-before-Solving (HSP) (Fu et al., 22 Feb 2024): Inserting a hint step before planning is neutral or slightly detrimental for strict plan-first methods, but offers improvements when integrated with CoT; excessive early biasing can suppress effective plan decomposition.
- Branching and Graph-of-Thoughts: BSM (Saha et al., 2023) extends linear plan-and-solve via parallel subproblem exploration, reducing bias and variance through aggregation/merging.
- Kantian Epistemological Foundations (UPAR (Geng et al., 2023)): Plan stage aligns with “Reason,” distinct from raw understanding and reflective self-correction.
- Synthetic Decomposition Supervision (Dua et al., 2022): Synthetic data can bootstrap question decomposition performance with minimal manual annotation.
The key theoretical motivations are reduction in myopic errors, improved interpretability (by making the plan inspectable), and decreased reliance on in-context demonstrations for generalization. Parallel or compositional planning mitigates “long-range drift” and context-length limitations.
6. Limitations, Failure Modes, and Future Directions
Plan-and-solve prompting is not universally optimal for all LLM architectures or all problem domains. Noted limitations include:
- Granularity trade-off: Overly coarse plans yield little benefit, while overly fine plans inflate inference cost by requiring multiple LLM passes (Dua et al., 2022).
- Plan Overfitting or Confounding: Upstream hints may suppress effective planning (Fu et al., 22 Feb 2024).
- Execution Error Propagation: In iterative/stepwise execution, errors in early plan steps can compound (Sun et al., 2023).
- Supervision cost: Manual decomposition annotation remains expensive despite synthetic alternatives.
Suggested directions for future research include:
- Integrating retrieval-based or memory-augmented plan representations.
- Hybridization with symbolic or algorithmic tool augmentation.
- Adaptive plan revision during execution (tree- or graph-of-thoughts).
- Standardized benchmarks for hierarchical or compositional multi-step tasks, especially for vision and multi-modal agents.
7. Summary Table of Representative Plan-and-Solve Prompting Methods
| Method | Planning Style | Execution Style | Domain | Key Results |
|---|---|---|---|---|
| Plan-and-Solve (Wang et al., 2023) | Linear, plain text plan | Sequential CoT, natural language | Math, regex, symbolic | +8% over ZS-CoT; matches few-shot |
| CAAP (Cho et al., 11 Jun 2024) | Multi-perspective, iterative | Plan-exec loop, function call | UI automation (vision) | 94.4% MiniWoB++, SOTA (vision-only) |
| UPAR (Geng et al., 2023) | Kantian “Reason,” multi-step | Plan→Act→Reflect | Math, causal, science | GSM8K-H: 22.9%→58.3% |
| Branch-Solve-Merge (Saha et al., 2023) | Parallel branch plan | Parallel solve, merge | Eval, gen., constraint | +12pp agreement, −34pp bias |
| PEARL (Sun et al., 2023) | Action mining, sequenced plan | Stepwise action execution | Long document reasoning | +6.6pp on “Long” QuALITY |
| Successive Prompting (Dua et al., 2022) | Iterative QD/QA, dynamic | Interleaved QD/QA | QA, compositional QA | +5.4 F1 over baselines (DROP) |
| Learning to Plan (Guo et al., 2023) | Learned, feedback-updated plan | CoT guided by plan | Math, logical, robotic | AMPS +15pp; plan transfer works |
Plan-and-solve prompting has established itself as a robust paradigm for improving LLM reasoning on multi-step, compositional, and high-stakes tasks through explicit, inspectable, and often modular planning and execution strategies.