Plan-and-Solve Prompting Overview
- Plan-and-Solve Prompting is a strategy that decomposes complex tasks into distinct planning and solving phases, enhancing multi-step consistency.
- It employs an explicit planning phase to identify subgoals and a systematic execution phase to reduce errors like missing steps.
- The framework spans diverse applications, from mathematical problem solving to vision-based UI automation, and is instantiated by methods such as CAAP and BSM.
Plan-and-solve prompting is a family of prompting strategies for LLMs that decomposes complex reasoning or action-generation tasks into explicit planning and solving/execution phases. Unlike monolithic prompt designs or basic chain-of-thought (CoT) prompting, plan-and-solve frameworks enforce a structured, multi-phase approach: first inducing a plan (task decomposition, subgoal identification, or high-level prescription), then guiding the model to execute that plan systematically. This paradigm improves multi-step consistency and interpretability and reduces common errors such as missing-step and semantic reasoning failures across diverse domains, including mathematical problem solving, complex question answering, vision-based UI automation, and document reasoning.
1. Core Principles and Taxonomy
All plan-and-solve prompting methods share the architectural motif of separating the generation of a problem-solving plan from its subsequent execution. The main stages can be categorized as:
- Planning Phase: The LLM is required to pause and produce a high-level plan, decomposition, or list of subgoals before attempting detailed solutions or actions.
- Solving/Execution Phase: Conditioned on the explicit plan, the LLM (and/or downstream components) executes each subtask, computes intermediate results, or issues actions in sequence or parallel; a minimal sketch of the two phases follows this list.
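The two phases can be chained as separate model calls. A minimal sketch, assuming a hypothetical `call_llm` callable (prompt string in, completion string out) standing in for any completion API:

```python
# Minimal plan-then-solve pipeline as two chained LLM calls.
# `call_llm` is a hypothetical callable: prompt string in, completion string out.

def plan_and_solve(problem: str, call_llm) -> str:
    # Planning phase: elicit an explicit, numbered plan before any solving.
    plan_prompt = (
        f"Q: {problem}\n"
        "A: Let's first understand the problem and devise a plan to solve it.\n"
        "List the plan as numbered steps; do not solve anything yet."
    )
    plan = call_llm(plan_prompt)

    # Solving/execution phase: condition the model on its own plan and execute it.
    solve_prompt = (
        f"Q: {problem}\n"
        f"Plan:\n{plan}\n"
        "Now carry out the plan and solve the problem step by step."
    )
    return call_llm(solve_prompt)
```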
Most plan-and-solve prompting frameworks differ along several methodological axes:
| Dimension | Variants | Notable Methods |
|---|---|---|
| Plan structure | Linear list, tree, or graph | Plan-and-Solve (Wang et al., 2023), BSM (Saha et al., 2023) |
| Decomposition modality | Manual, in-context/few-shot, zero-shot, or learned from feedback | Learning to Plan (Guo et al., 2023), UPAR (Geng et al., 2023) |
| Execution granularity | Stepwise monolithic, iterative with state update, parallel | CAAP (Cho et al., 11 Jun 2024), PEARL (Sun et al., 2023) |
| Task domain | Math, QA, action, vision, document, multi-agent | CAAP, PEARL, Successive Prompting (Dua et al., 2022) |
This taxonomy enables compositional extension and integration with related paradigms, such as program-of-thoughts, least-to-most, branching, and memory-augmented prompting.
2. Foundational Frameworks and Formalisms
Basic Plan-and-Solve Prompting
The canonical form, as introduced in "Plan-and-Solve Prompting" (Wang et al., 2023), divides the prompt in two:
- Produce a high-level plan:
- Instruction: "Let's first understand the problem and devise a plan to solve the problem."
- Output: natural-language enumeration of subgoals or operations: 1., 2., ...
- Execute the plan step by step:
- Instruction: "Then, let's carry out the plan and solve the problem step by step."
- Output: chain-of-thought reasoning, with explicit computation per plan step.
The process may be formalized as: $\text{Prompt:}\quad Q: x \qquad A:\ \text{(Plan) } p_1,\dots,p_k \qquad \text{(Solve) inference steps following } p_1,\dots,p_k.$ The decoupling of planning and solving reduces missing-step and semantic errors compared to direct CoT prompting.
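As a concrete illustration, a sketch of the canonical single-prompt form followed by a separate answer-extraction call; `call_llm` is a hypothetical completion callable and the extraction wording is only indicative:

```python
# Canonical single-prompt Plan-and-Solve, followed by answer extraction.
# `call_llm` is a hypothetical completion callable (see the earlier sketch).

PS_TRIGGER = (
    "Let's first understand the problem and devise a plan to solve the problem. "
    "Then, let's carry out the plan and solve the problem step by step."
)

def plan_and_solve_single_prompt(problem: str, call_llm) -> str:
    # One prompt elicits both the plan and the step-by-step solution.
    reasoning = call_llm(f"Q: {problem}\nA: {PS_TRIGGER}")
    # Separate reasoning from answer extraction via a follow-up prompt (see Section 3).
    return call_llm(f"Q: {problem}\nA: {reasoning}\nTherefore, the answer is").strip()
```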
Successive Prompting
"Successive Prompting for Decomposing Complex Questions" (Dua et al., 2022) operationalizes plan-and-solve as an iterative QD (Question Decomposition) and QA (Question Answering) loop:
- At each round, the model generates a simpler subquestion, solves it, and accumulates the history.
- Iteration halts upon emitting an "EOQ" (end-of-questions) marker; the last answer is the final solution.
Formally, for input $x$ and history $H_{k-1} = \{(q_1, a_1), \dots, (q_{k-1}, a_{k-1})\}$, stage $k$ produces $q_k = \mathrm{QD}(x, H_{k-1})$ and $a_k = \mathrm{QA}(x, H_{k-1}, q_k)$. The full generative process iterates these two steps, appending $(q_k, a_k)$ to the history, until $\mathrm{QD}$ emits the EOQ marker; the most recent $a_k$ is returned as the final answer. This strategy enables modular supervision, explicit decomposition annotation, and direct integration of symbolic solvers when necessary.
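A sketch of this loop, assuming hypothetical `qd` and `qa` callables wrapping the decomposition and answering components (LLM prompts or symbolic solvers):

```python
# Successive Prompting: alternate question decomposition (QD) and question
# answering (QA) until the decomposer emits the EOQ marker.

def successive_prompting(question: str, qd, qa, max_steps: int = 10) -> str:
    history = []  # accumulated (sub-question, sub-answer) pairs
    answer = ""
    for _ in range(max_steps):
        sub_q = qd(question, history)          # propose the next simpler sub-question
        if sub_q.strip() == "EOQ":             # end-of-questions: stop decomposing
            break
        answer = qa(question, history, sub_q)  # solve the sub-question in context
        history.append((sub_q, answer))
    return answer                              # last sub-answer is the final solution
```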
Context-Aware Action Planning Prompting (CAAP)
"CAAP" (Cho et al., 11 Jun 2024) adapts the plan-and-solve paradigm to vision-only software control by alternating between:
- Planning (Decision Maker): Uses screenshot-derived UI descriptions, task goals, and action histories to prompt LLM-based action planning.
- Execution (Action Executor): Parses planned actions, executes them (keyboard/mouse), captures new state, and repeats the planning.
At each time step $t$, context collection integrates three perspectives: the screenshot-derived UI description $s_t$, the task objective $g$, and the action history $h_t = (a_1, \dots, a_{t-1})$, with the next action $a_t$ generated by the LLM conditioned on the assembled context $c_t = (s_t, g, h_t)$.
Chain-of-thought and few-shot demonstrations are tightly interleaved in prompt templates, yielding high robustness in UI automation.
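A sketch of the alternation, with hypothetical `describe_screen`, `plan_actions`, and `execute` callables standing in for the screen parser, LLM decision maker, and action executor:

```python
# CAAP-style plan-execute loop for vision-only UI control (schematic).

def caap_loop(goal: str, describe_screen, plan_actions, execute, max_steps: int = 30):
    history = []  # previously executed actions
    for _ in range(max_steps):
        screen = describe_screen()                     # screenshot-derived UI description
        actions = plan_actions(goal, screen, history)  # LLM planning (CoT + demonstrations)
        if not actions:                                # empty plan: task judged complete
            break
        for action in actions:
            execute(action)                            # issue keyboard/mouse command
            history.append(action)
    return history
```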
Branch-Solve-Merge (BSM)
The "Branch-Solve-Merge" (BSM) protocol (Saha et al., 2023) instantiates a modular plan-and-solve pipeline where:
- Branch: Decompose the task into parallel subproblems (criteria, concept-sets, constraints) using a dedicated branch prompt.
- Solve: Independently solve all branches with isolated prompts.
- Merge: Fuse partial solutions into a coherent answer.
Formally: $\{b_1, \dots, b_k\} = \mathrm{Branch}(x)$, $s_i = \mathrm{Solve}(x, b_i)$ for each branch $i$, and $y = \mathrm{Merge}(x, s_1, \dots, s_k)$.
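A sketch of the pipeline, assuming hypothetical `branch`, `solve`, and `merge` prompt wrappers around an LLM:

```python
# Branch-Solve-Merge: parallel decomposition, isolated solving, and fusion.

def branch_solve_merge(task: str, branch, solve, merge) -> str:
    subproblems = branch(task)                            # plan: parallel sub-problems
    partials = [solve(task, sub) for sub in subproblems]  # solve each branch in isolation
    return merge(task, subproblems, partials)             # fuse into one coherent answer
```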
3. Prompt Engineering, Templates, and Best Practices
Effective plan-and-solve prompting depends on precise, well-calibrated instructions covering both planning and solving trajectories. Representative prompt structures include:
- Vanilla PS (Wang et al., 2023):
  ```text
  Q: [problem]
  A: Let's first understand the problem and devise a plan to solve the problem.
  Then, let's carry out the plan and solve the problem step by step.
  ```
- Enhanced PS+:
  ```text
  Q: [problem]
  A: Let's first understand the problem, extract relevant variables and their corresponding numerals, and devise a complete plan.
  Then, let's carry out the plan, calculate intermediate variables (pay attention to correct numerical calculation and commonsense), solve the problem step by step, and show the answer.
  ```
- CAAP Planning Prompt (Cho et al., 11 Jun 2024): Four-section prompt: human demonstrations, task/screen/history context, chain-of-thought induction, function-call output format.
- UPAR Plan/Act Phases (Geng et al., 2023):
  ```text
  [Preceded by Context/Understand phase]
  Let's make a brief plan to solve this question step by step:
  [Your plan here.]
  Now, let's execute the plan step by step:
  [Your solution here.]
  ```
- PEARL Action Planning (Sun et al., 2023):
  ```text
  [List of available actions]
  [Few-shot plan decomposition examples]
  [Question]
  ...
  My new actions: ...
  My sequence of actions: ...
  ```
Best practices refined by ablation include:
- Always require the explicit plan before solution steps.
- Emphasize variable grounding (listing concrete values before reasoning).
- Separate reasoning from answer extraction (e.g., via a follow-up prompt).
- Use self-consistency voting for reliability (e.g., sampling 10 solutions; see the sketch after this list).
- Consider prompt phrasing sensitivity and validate on held-out splits.
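For the self-consistency practice above, a minimal sketch, assuming a hypothetical `sample_solution` callable that returns one extracted answer string per stochastic (temperature > 0) sample:

```python
from collections import Counter

# Self-consistency: sample several PS solutions and majority-vote the answers.
def self_consistent_answer(problem: str, sample_solution, n_samples: int = 10) -> str:
    answers = [sample_solution(problem).strip() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]  # ties break by first occurrence
```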
4. Empirical Performance and Comparative Analysis
Empirical studies on mathematical, commonsense, symbolic, and real-world automation tasks show that plan-and-solve prompting yields substantial accuracy improvements compared to both zero-shot and few-shot CoT, as well as monolithic prompting baselines.
Mathematical and Symbolic Tasks
- PS+ (Wang et al., 2023) achieves up to +8.0% absolute improvement over zero-shot CoT on MultiArith and consistently closes the gap to or exceeds 8-shot manual CoT, reaching 91.8%/59.3%/76.7% on MultiArith/GSM8K/SVAMP.
- UPAR (Geng et al., 2023) raises GSM8K-H accuracy from 22.9% to 58.3% (UPAR-S) and causal-judgment accuracy from 67.9% to 75.4%.
- Successive Prompting (Dua et al., 2022) outperforms CoT by 3.5–4.3 F1 (DROP few-shot, in-context), and fine-tuned models show +5.4 F1 over strong symbolic baselines.
- Learning to Plan (Guo et al., 2023) delivers a +15% advantage over zero-shot CoT on AMPS, and demonstrates cross-model plan transfer (plans learned by ChatGPT improving GPT-4 performance).
Vision-based Automation
- CAAP (Cho et al., 11 Jun 2024) achieves 94.4% average success on MiniWoB++ (vision-only, no HTML), outperforming prior vision-based agents, while requiring only 1.48 demos/task (99 total). Ablation reveals −2.4% loss without CoT instructions and −8.7% without demos.
Long Document Reasoning
- PEARL (Sun et al., 2023) achieves 70.9% accuracy on “Long” QuALITY test cases (vs GPT-4 ZS 64.3%), demonstrating that programmatic, plan-execute decomposition is essential for complex, compositional document tasks.
Multi-faceted Evaluation and Generation
- Branch-Solve-Merge (Saha et al., 2023) improves human-LLM agreement in evaluation by +12 points over zero-shot, and increases constraint satisfaction in story generation by 12%, outperforming LLaMA2-chat and closing the gap to proprietary models.
5. Extensions, Variants, and Theoretical Considerations
The plan-and-solve paradigm composes with, and in some cases subsumes, numerous other prompting innovations:
- Hint-before-Solving (HSP) (Fu et al., 22 Feb 2024): Inserting a hint step before planning is neutral or slightly detrimental for strict plan-first methods, but offers improvements when integrated with CoT; excessive early biasing can suppress effective plan decomposition.
- Branching and Graph-of-Thoughts: BSM (Saha et al., 2023) extends linear plan-and-solve via parallel subproblem exploration, reducing bias and variance through aggregation/merging.
- Kantian Epistemological Foundations (UPAR (Geng et al., 2023)): Plan stage aligns with “Reason,” distinct from raw understanding and reflective self-correction.
- Synthetic Decomposition Supervision (Dua et al., 2022): Synthetic data can bootstrap question decomposition performance with minimal manual annotation.
The key theoretical motivations are reduction in myopic errors, improved interpretability (by making the plan inspectable), and decreased reliance on in-context demonstrations for generalization. Parallel or compositional planning mitigates “long-range drift” and context-length limitations.
6. Limitations, Failure Modes, and Future Directions
Plan-and-solve prompting is not universally optimal for all LLM architectures or all problem domains. Noted limitations include:
- Granularity trade-off: Overly coarse plans yield little benefit, while overly fine plans inflate inference cost by requiring multiple LLM passes (Dua et al., 2022).
- Plan Overfitting or Confounding: Upstream hints may suppress effective planning (Fu et al., 22 Feb 2024).
- Execution Error Propagation: In iterative/stepwise execution, errors in early plan steps can compound (Sun et al., 2023).
- Supervision cost: Manual decomposition annotation remains expensive despite synthetic alternatives.
Suggested directions for future research include:
- Integrating retrieval-based or memory-augmented plan representations.
- Hybridization with symbolic or algorithmic tool augmentation.
- Adaptive plan revision during execution (tree- or graph-of-thoughts).
- Standardized benchmarks for hierarchical or compositional multi-step tasks, especially for vision and multi-modal agents.
7. Summary Table of Representative Plan-and-Solve Prompting Methods
| Method | Planning Style | Execution Style | Domain | Key Results |
|---|---|---|---|---|
| Plan-and-Solve (Wang et al., 2023) | Linear, plain text plan | Sequential CoT, natural language | Math, regex, symbolic | +8% over ZS-CoT; matches few-shot |
| CAAP (Cho et al., 11 Jun 2024) | Multi-perspective, iterative | Plan-exec loop, function call | UI automation (vision) | 94.4% MiniWoB++, SOTA (vision-only) |
| UPAR (Geng et al., 2023) | Kantian “Reason,” multi-step | Plan→Act→Reflect | Math, causal, science | GSM8K-H: 22.9%→58.3% |
| Branch-Solve-Merge (Saha et al., 2023) | Parallel branch plan | Parallel solve, merge | Eval, gen., constraint | +12pp agreement, −34pp bias |
| PEARL (Sun et al., 2023) | Action mining, sequenced plan | Stepwise action execution | Long document reasoning | +6.6pp on “Long” QuALITY |
| Successive Prompting (Dua et al., 2022) | Iterative QD/QA, dynamic | Interleaved QD/QA | QA, compositional QA | +5.4 F1 over baselines (DROP) |
| Learning to Plan (Guo et al., 2023) | Learned, feedback-updated plan | CoT guided by plan | Math, logical, robotic | AMPS +15pp; plan transfer works |
Plan-and-solve prompting has established itself as a robust paradigm for improving LLM reasoning on multi-step, compositional, and high-stakes tasks through explicit, inspectable, and often modular planning and execution strategies.