Plan*RAG: Structured Planning in RAG Pipelines
- Plan*RAG is a family of frameworks that introduce symbolic planning into RAG pipelines, decomposing complex queries into atomic sub-problems.
- It employs methodologies like DAG-based planning and state-transition models to guide tailored retrieval and generation steps.
- Plan*RAG enhances performance in multi-hop QA, decision-making, medical planning, and code generation by reducing error propagation and improving interpretability.
Plan*RAG is a collective term for a family of frameworks and methodologies that explicitly introduce symbolic or structured planning into Retrieval-Augmented Generation (RAG) pipelines. Plan*RAG architectures decompose complex tasks into sequenced, often atomic, sub-problems before invoking retrieval, aiming to systematically mitigate error propagation and reasoning drift in knowledge-intensive tasks such as multi-hop question answering, code generation, decision-making, and domain-specific planning. Instantiations span a variety of domains (e.g., text QA, embodied AI, software engineering, medical planning) and share a characteristic “plan-then-retrieve” or “plan-augmented-retrieve” pattern, in contrast to the earlier “retrieve-then-generate” RAG paradigm. Recent Plan*RAG instantiations demonstrate improved accuracy, interpretability, and robustness across zero-shot and domain-adapted settings.
1. Core Principles and Motivations
Plan*RAG introduces planning as an explicit, structured intermediary between the user’s query and the retrieval-augmented generation steps. Whereas classic RAG pipelines condition generation directly on retrieved knowledge chunks, Plan*RAG approaches first produce a formalized plan—such as a reasoning DAG, sequential subgoals, or domain-specific pseudocode—that then guides or parameterizes retrieval:
- Query Decomposition: The input is decomposed into atomic, typically single-hop, sub-queries or reasoning steps that collectively define the high-level reasoning path required to answer the original query (Verma et al., 2024; Zhang et al., 25 Feb 2025; Zhang et al., 23 Apr 2025; Lyu et al., 2024).
- Planning Outside LM Context: The reasoning plan is generated and maintained external to the LLM’s context window, circumventing context-length limitations associated with in-context chain-of-thought prompting. For example, “Plan*RAG” formalizes the plan as a Directed Acyclic Graph (DAG) whose sub-nodes correspond to atomic queries or facts (Verma et al., 2024).
- Error Localization and Attribution: By explicitly separating planning, retrieval, and aggregation, Plan*RAG architectures enable targeted verification and multi-granularity consistency checks at each reasoning step (Zhang et al., 23 Apr 2025).
- Adaptivity and Efficiency: Modular design allows for plug-and-play integration with a range of retrievers, generators, and evaluators, and supports parallel execution of plan steps (Verma et al., 2024).
This paradigm addresses major RAG limitations: fragmented reasoning chains, context overflow, error compounding across multi-hop tasks, and an inability to attribute final outputs to discrete evidentiary supports.
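To ground the decomposition step, the sketch below shows one minimal way an LLM could be prompted to emit a machine-readable plan before any retrieval occurs. This is illustrative only: the `llm_complete` callable and the JSON schema are assumptions for exposition, not the interface of any cited system.

```python
import json

DECOMPOSE_PROMPT = (
    "Decompose the question into atomic, single-hop sub-queries. "
    "Return a JSON list of objects with keys 'id' (int), 'subquery' (str), "
    "and 'depends_on' (a list of ids whose answers this sub-query needs).\n"
    "Question: "
)

def decompose(question, llm_complete):
    """Ask the LLM for a structured plan, parsed into plain Python
    objects that live outside the model's context window."""
    raw = llm_complete(DECOMPOSE_PROMPT + question)  # hypothetical LLM call
    plan = json.loads(raw)
    ids = {step["id"] for step in plan}
    # Dependencies must reference sub-queries that exist in the plan.
    assert all(dep in ids for step in plan for dep in step["depends_on"])
    return plan
```

Because the parsed plan is an ordinary data structure, it can be stored, validated, and executed outside the model's context window, which is precisely the property Plan*RAG exploits (Verma et al., 2024).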
2. Representative Plan*RAG Architectures
A diverse range of Plan*RAG systems has been developed, all sharing a “plan–retrieve–generate” backbone with domain-specific extensions:
| System | Planning Mechanism | Retrieval Strategy | Domain/Application |
|---|---|---|---|
| Plan*RAG (Verma et al., 2024) | Test-time DAG generation, atomic subqueries | Per-node retrieval, Relevance/Critic experts | Multi-hop QA |
| PAR RAG (Zhang et al., 23 Apr 2025) | Top-down plan decomposition, JSON trace | Multi-granularity (coarse+fine) per sub-question | Multi-hop QA |
| LevelRAG (Zhang et al., 25 Feb 2025) | Symbolic high-level searcher, iterative logic planning | Hybrid (sparse/dense/web), query rewriting | QA (single/multi-hop) |
| RPG (Lyu et al., 2024) | Iterative plan-token prediction, plan–answer cycles | Fine-grained selection, multi-task prompt tuning | Knowledge-intensive QA |
| PlanRAG (Lee et al., 2024) | Explicit subgoal generation for decision analysis | Plan-driven SQL/Cypher queries | Decision-making QA |
| PERC (Yoo et al., 2024) | Pseudocode plan-based retrieval, plan as query | Semantic retrieval over plan representations | Code generation |
| ThreatLens (Saha et al., 11 May 2025) | Multi-agent LLM planners for threat/policy/test plan | Vector RAG, iterative user–agent loop | Hardware security |
| MedPlan (Hsu et al., 23 Mar 2025) | Strict SOAP-inspired (Assessment→Plan) planning | Plan- and history-level retrieval | Medical plan generation |
| P-RAG (Xu et al., 2024) | Progressive, iterative plan–retrieve cycles | Scene+goal similarity, growing DB | Embodied task planning |
| Plan+RAG-Code (Bassamzadeh et al., 2024) | DSL plan structuring, function- and few-shot retrieval | API/function metadata + example code | NL to DSL/Automation |
A common pattern is sequential or iterative execution: a plan is composed (by the LLM or auxiliary planner), each plan element triggers a tailored retrieval and generation step (often re-ranked or filtered for relevance), results are aggregated, and—when required—verifiers or critics introduce plan revision or error correction.
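This shared loop can be made concrete in a short sketch. The following is a minimal, illustrative Python rendering of the plan–retrieve–generate pattern, not any single cited system's implementation; `retrieve`, `generate`, and `verify` are hypothetical callables standing in for whatever retriever, generator, and critic a given framework plugs in.

```python
def plan_retrieve_generate(question, plan, retrieve, generate, verify,
                           max_revisions=2):
    """Generic plan-retrieve-generate loop: each plan step triggers a
    tailored retrieval and generation pass, a verifier can demand a
    retry, and per-step answers are aggregated into the final response."""
    answers = {}
    for step in plan:  # assumes steps arrive in dependency order
        # Inject the answers of prerequisite steps into the sub-query.
        known = "; ".join(str(answers[d]) for d in step["depends_on"])
        sub_q = step["subquery"] + (f" (known facts: {known})" if known else "")
        ans = None
        for _ in range(max_revisions + 1):
            docs = retrieve(sub_q)          # per-step, tailored retrieval
            ans = generate(sub_q, docs)     # sub-answer grounded in docs
            if verify(sub_q, docs, ans):    # consistency / citation check
                break                       # accept; otherwise revise
        answers[step["id"]] = ans
    # Final synthesis conditions only on the aggregated sub-answers,
    # keeping the LM context bounded regardless of plan length.
    evidence = [f"{k}: {v}" for k, v in answers.items()]
    return generate(question, evidence)
```

Systems differ mainly in how `verify` is realized (citation overlap in PAR RAG, critic experts in Plan*RAG) and in whether a failed check triggers revision of the sub-answer, the sub-query, or the plan itself.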
3. Formal Models and Algorithmic Structures
Most Plan*RAG systems model planning as a symbolic or partially symbolic process:
- DAG-Based Reasoning (Plan*RAG) (Verma et al., 2024): The plan is a DAG $\mathcal{G}=(V,E)$ with nodes corresponding to atomic queries. For each node $v \in V$, the system instantiates the sub-query $q_v$ by injecting the answers of its parent nodes, retrieves supporting documents $D_v$, and generates the sub-answer $a_v$. Parallelization is enabled for nodes at the same depth (a minimal execution sketch follows this list).
- State-Transition Model (LevelRAG) (Zhang et al., 25 Feb 2025): High-level planning states track both the set of resolved subqueries and their interim summaries; actions include decomposition (“decompose”), summarization, verification, and supplementation. Search is terminated when all subqueries’ summaries are judged sufficient for final answer synthesis.
- Plan-then-Act-and-Review (PAR RAG) (Zhang et al., 23 Apr 2025): The plan is a structured sequence of steps $(p_1, \dots, p_n)$; at each step $p_i$, coarse- and fine-grained retrievals are performed, followed by consistency checks. The Action module executes sub-queries; the Review module iteratively verifies or revises sub-answers via multi-passage citation overlap.
- Iterative Plan–Answer Cycles (RPG) (Lyu et al., 2024): At each cycle, a plan token specifies the next subtopic, guiding paragraph-level retrieval for the corresponding answer segment. Plan–answer iteration continues until output completion or early stopping.
- Explicit Plan for Data-Driven QA (PlanRAG) (Lee et al., 2024): The LLM emits a stepwise plan; each step is translated into SQL/Cypher database queries for observation, after which the LLM integrates the observations using business rules; re-planning is invoked as needed.
- Plan-as-Query Retrieval (PERC) (Yoo et al., 2024): Code examples are mapped to pseudocode plans, and retrieval is performed over plan representations; retrieved examples are converted as needed to the target programming language.
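As a concrete illustration of the DAG model, the following sketch executes a plan level by level, running independent nodes in parallel. It is a simplified reading of the Plan*RAG procedure rather than the authors' code; `solve_node` is an assumed callable performing the per-node retrieval and generation.

```python
from concurrent.futures import ThreadPoolExecutor

def run_dag(plan, solve_node):
    """Execute a reasoning DAG level by level: every node whose parents
    are already resolved runs in parallel with its siblings, mirroring
    the observation that same-depth sub-queries are independent."""
    answers = {}
    pending = {step["id"]: step for step in plan}
    with ThreadPoolExecutor() as pool:
        while pending:
            # A node is ready once every parent answer is available.
            ready = [s for s in pending.values()
                     if all(d in answers for d in s["depends_on"])]
            if not ready:
                raise ValueError("plan is not a DAG (cycle detected)")
            futures = {
                s["id"]: pool.submit(solve_node, s,
                                     {d: answers[d] for d in s["depends_on"]})
                for s in ready
            }
            for node_id, fut in futures.items():
                answers[node_id] = fut.result()  # sub-answer for this node
                del pending[node_id]
    return answers
```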
4. Domain-Specific Adaptations and Applications
Plan*RAG methodology has been adapted for a range of knowledge-intensive settings:
- Multi-hop and Long-form Question Answering: Direct evidence for improved multi-step reasoning fidelity, error localization, and overall accuracy. For example, Plan*RAG (Verma et al., 2024) improves HotpotQA accuracy from 25.49% (standard RAG) to 35.67% and F1 from 31.22 to 39.68, while PAR RAG (Zhang et al., 23 Apr 2025) achieves relative EM/F1 uplifts of +31.6% and +37.9% over state-of-the-art baselines on HotpotQA and MuSiQue.
- Decision-Making over Structured Data: PlanRAG outperforms prior iterative RAG by +15.8 pp in Locating and +7.4 pp in Building scenarios on the Decision QA benchmark (Lee et al., 2024).
- Medical Plan Generation: MedPlan's two-stage Assessment→Plan design mirrors clinician SOAP workflow: it first produces an assessment, then uses retrieved cross-patient and self-history SOAP records to generate personalized treatment plans, yielding BLEU up to 0.3183 and METEOR up to 0.5213 with Medical-Mixtral-7B-v2k (Hsu et al., 23 Mar 2025).
- Threat Modeling and Hardware Verification: ThreatLens employs multi-agent planners (threat, policy, plan generation) with RAG, reducing manual effort ~75% and achieving 92% precision in threat filtering on NEORV32 SoC (Saha et al., 11 May 2025).
- Code and DSL Generation: PERC’s plan-as-query retrieval outperforms code retrieval baselines in both in- and cross-language settings, e.g., in MultiPL-E, Ruby: 67.27% (RepoCoder) → 69.81%, Lua: 60.81% → 64.10% (Yoo et al., 2024); Plan+RAG for DSL generation matches fine-tuned baselines in-domain and exceeds them by +7 pts similarity on out-of-domain APIs (Bassamzadeh et al., 2024).
- Embodied AI: P-RAG's iterative, database-augmented planning improves unseen-task success rates on ALFRED: GPT-4 without retrieval achieves 7.05%, P-RAG reaches 14.11% after three iterations, and self-iteration pushes this to 27.4% (Xu et al., 2024).
5. Theoretical and Empirical Impact
Plan*RAG approaches advance RAG systems by:
- Reducing Error Propagation: Top-down planning, coupled with per-step verification, prevents local retrieval/generation failures from corrupting downstream reasoning.
- Improving Attribution: Atomic subqueries, each linked to a discrete retrieved document, provide strong evidence traceability—76% of answers are exact substrings of the retrieved doc in Plan*RAG (Verma et al., 2024).
- Enhancing Modular Integration: Plug-and-play design supports deployment atop arbitrary LLMs, retrievers (BM25, DPR, Contriever), and verification modules, requiring minimal or no model fine-tuning (Verma et al., 2024; Zhang et al., 25 Feb 2025).
- Enabling Scalability and Efficiency: Parallel plan-step execution and context-bounded per-node retrieval mitigate context-window overflow and reduce unnecessary retrievals, as with the Critic Expert in Plan*RAG, which cuts retrieval calls by 19% with negligible accuracy loss (Verma et al., 2024); a simplified sketch of such a retrieval gate follows this list.
- Performance Gains: Across domains, Plan*RAG variants match or exceed proprietary models (e.g., LevelRAG surpasses GPT-4o and ReSP (Zhang et al., 25 Feb 2025)), show significant uplift over vanilla one-pass RAG, and exhibit enhanced generalization in low-resource regimes.
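The retrieval-gating idea behind Plan*RAG's Critic Expert can be sketched as follows. This is an assumed, simplified formulation rather than the paper's implementation: a hypothetical `critic` callable decides whether inherited parent answers already suffice, so `retrieve` is only invoked when new evidence is actually needed.

```python
def solve_with_critic(sub_q, parent_answers, critic, retrieve, generate):
    """Gate retrieval behind a critic: when inherited parent answers
    already suffice, answer directly and skip the retrieval call."""
    context = "; ".join(parent_answers)
    if critic(sub_q, context):      # critic judges: no new evidence needed
        return generate(sub_q, [context])
    docs = retrieve(sub_q)          # otherwise pay for a retrieval call
    return generate(sub_q, docs + [context])
```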
6. Limitations, Challenges, and Future Directions
Notable limitations identified in Plan*RAG research include:
- Computational Overhead: Multi-step planning, per-step retrieval, review modules, and verification add latency (e.g., PAR RAG's average response time per query (RTPQ) is ≈ 26 s) and increase inference cost (Zhang et al., 23 Apr 2025).
- Planning Quality Sensitivity: Poor initial plan decomposition or specification can cause retrieval to miss critical evidence or narrow the search space excessively (Lyu et al., 2024, Yoo et al., 2024).
- Database Scalability and Memory: Progressive accumulation of trajectories or intermediate plans may cause database growth and potential retrieval efficiency degradation, as seen in P-RAG (Xu et al., 2024).
- Limits of Current LLM Reasoners: P-RAG and similar systems plateau as LLM reasoning capabilities (particularly for embodied, non-textual tasks) saturate (Xu et al., 2024).
- Automatic Plan Extraction: The quality of LLM-generated plans or pseudocode may be a failure point (e.g., erroneous plan steps, unreliable pseudocode extraction in PERC) (Yoo et al., 2024).
Ongoing work investigates learned retriever/reranker modules, adaptive granularity planning, plan critics or quality validators, efficient memory condensation, and cross-modal plan representations (e.g., integrating vision directly in embodied settings). A plausible implication is that tighter coupling between learned planning agents and retrieval subsystems, or joint end-to-end optimization as in trainable consistency/verifier modules, could further enhance accuracy and robustness.
7. Summary Table: Plan*RAG Systems and Key Features
| System | Planning | Retrieval | Result/Claim | Reference |
|---|---|---|---|---|
| Plan*RAG | Test-time DAG | Per-node, atomic | +2–6 Acc/F1 on multi-hop QA | (Verma et al., 2024) |
| LevelRAG | Symbolic searcher | Hybrid (sparse/web/dense) | Outperforms GPT-4o; F1 up to 69.33% | (Zhang et al., 25 Feb 2025) |
| PAR RAG | JSON plan, review | Multi-granular | +31.6% EM over baseline on HotpotQA | (Zhang et al., 23 Apr 2025) |
| MedPlan | SOAP plan | Patient+history | BLEU up to 0.3183, 66% ↑ clinical eval | (Hsu et al., 23 Mar 2025) |
| ThreatLens | Multi-agent plan | Vector, iterative | 75% manual effort ↓, 92% precision | (Saha et al., 11 May 2025) |
| PERC | Pseudocode plan | Plan-as-query | +1–5 pp Pass@1 on underrepresented PLs | (Yoo et al., 2024) |
| PlanRAG | Stepwise plan | SQL/Cypher gen | +15.8pp / +7.4pp accuracy on DQA | (Lee et al., 2024) |
| RPG | Plan token per step | Paragraph select | +8.5 F1 (2Wiki), +9.1 ROUGE (ASQA) | (Lyu et al., 2024) |
| P-RAG | Progressive planning | History+goal/scene | ALFRED success rate 7.05%→14.11%→27.4% | (Xu et al., 2024) |
| Plan+RAG-Code | DSL plan, function | Example+API meta | +7 pts sim on OOD API DSL generation | (Bassamzadeh et al., 2024) |
References
- Plan*RAG: Efficient Test-Time Planning for Retrieval Augmented Generation (Verma et al., 2024)
- LevelRAG: Enhancing Retrieval-Augmented Generation with Multi-hop Logic Planning over Rewriting Augmented Searchers (Zhang et al., 25 Feb 2025)
- Credible plan-driven RAG method for Multi-hop Question Answering (Zhang et al., 23 Apr 2025)
- MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation (Hsu et al., 23 Mar 2025)
- ThreatLens: LLM-guided Threat Modeling and Test Plan Generation for Hardware Security Verification (Saha et al., 11 May 2025)
- PERC: Plan-As-Query Example Retrieval for Underrepresented Code Generation (Yoo et al., 2024)
- PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative LLMs as Decision Makers (Lee et al., 2024)
- Retrieve-Plan-Generation: An Iterative Planning and Answering Framework for Knowledge-Intensive LLM Generation (Lyu et al., 2024)
- Progressive Retrieval Augmented Generation For Planning on Embodied Everyday Task (Xu et al., 2024)
- Plan with Code: Comparing approaches for robust NL to DSL generation (Bassamzadeh et al., 2024)