
Documentation Retrieval Improves Planning Language Generation (2509.19931v1)

Published 24 Sep 2025 in cs.IR

Abstract: Certain strong LLMs have shown promise for zero-shot formal planning by generating planning languages like PDDL. Yet, performance of most open-source models under 50B parameters has been reported to be close to zero due to the low-resource nature of these languages. We significantly improve their performance via a series of lightweight pipelines that integrate documentation retrieval with modular code generation and error refinement. With models like Llama-4-Maverick, our best pipeline improves plan correctness from 0% to over 80% on the common BlocksWorld domain. However, while syntactic errors are substantially reduced, semantic errors persist in more challenging domains, revealing fundamental limitations in current models' reasoning capabilities. Our code and data can be found at https://github.com/Nangxxxxx/PDDL-RAG

Summary

  • The paper presents a novel pipeline that integrates documentation retrieval with modular generation and iterative error refinement for reliable PDDL synthesis.
  • With the best-performing model on BlocksWorld, it raises syntactic accuracy from 0% to over 90% and semantic accuracy from 0% to over 80%, with smaller gains in more challenging domains.
  • The study underscores that targeted snippet retrieval and example-based documentation yield more precise corrections than whole-document strategies.

Documentation Retrieval for Low-Resource Planning Language Generation

The paper "Documentation Retrieval Improves Planning Language Generation" (2509.19931) presents a systematic paper of retrieval-augmented pipelines for generating formal planning languages, specifically PDDL, using open-source LLMs with fewer than 50B parameters. The work addresses the persistent challenge of low syntactic and semantic accuracy in LLM-generated PDDL, especially in low-resource settings, and demonstrates that targeted documentation retrieval, modular code generation, and iterative error refinement can yield substantial improvements in plan correctness.

Background: LLMs for Planning and Formalization

The paper distinguishes between two paradigms for LLM-based planning: LLM-as-Planner, where the model directly outputs action sequences, and LLM-as-Formalizer, where the model generates a formal domain/problem specification (e.g., PDDL) that is then solved by a classical planner. The latter approach is favored for its interpretability and verifiability, but prior work has shown that open-source LLMs under 100B parameters perform poorly as PDDL formalizers due to the scarcity of training data and the complexity of the language (Figure 1).

Figure 1: A simplified illustration of LLM-as-Planner and LLM-as-Formalizer on the BlocksWorld domain.

Pipeline Design: Modular Generation and Iterative Refinement

The core contribution is a set of lightweight pipelines that integrate documentation retrieval at multiple stages of the PDDL generation process. The pipelines are as follows:

  • Base: Zero-shot PDDL generation from domain and problem descriptions.
  • Once w/ Whole Doc: The LLM is provided with the entire PDDL documentation before generation.
  • Modular w/ Specific Doc: The LLM incrementally generates PDDL components (types, predicates, actions, etc.), each time guided by the most relevant documentation snippet.
  • Refinement w/o Doc: Error feedback from a PDDL solver is used to prompt the LLM for corrections.
  • Refinement w/ Retrieved Doc: Error feedback is used to retrieve the most relevant documentation (via BM25 or embedding-based retrieval), which is then provided to the LLM for targeted correction.

A key innovation is the use of error localization: the LLM identifies the code segment responsible for an error, which is then used as a query for documentation retrieval, leading to more precise and effective corrections (Figure 2).

Figure 2: Overview of a pipeline that retrieves documents based on error codes located by the LLM, using them as hints to correct the code.
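
To make the loop concrete, here is a minimal sketch of the code-retrieved refinement cycle. The `llm` and `solve_pddl` callables are caller-supplied placeholders (the paper's released code may be organized differently), `split_domain_problem` is an invented helper, and the `rank_bm25` package stands in for the paper's unspecified BM25 implementation.

```python
from rank_bm25 import BM25Okapi  # BM25 retriever; the paper does not name its implementation


def retrieve_snippet(query: str, doc_chunks: list[str]) -> str:
    """Return the documentation chunk that best matches the query under BM25."""
    bm25 = BM25Okapi([chunk.lower().split() for chunk in doc_chunks])
    return bm25.get_top_n(query.lower().split(), doc_chunks, n=1)[0]


def split_domain_problem(text: str) -> tuple[str, str]:
    """Naively split a combined LLM response into domain and problem files (illustrative only)."""
    idx = text.find("(define (problem")
    return (text[:idx].strip(), text[idx:].strip()) if idx != -1 else (text.strip(), "")


def refine_with_retrieved_doc(df, pf, doc_chunks, llm, solve_pddl, max_rounds=3):
    """Repair generated PDDL using solver feedback, error localization, and retrieved docs."""
    for _ in range(max_rounds):
        ok, feedback = solve_pddl(df, pf)  # run the planner; returns (success, error text)
        if ok:
            break
        # 1) Ask the model to localize the code segment responsible for the error.
        culprit = llm(
            f"Solver error:\n{feedback}\n\nPDDL files:\n{df}\n{pf}\n\n"
            "Quote only the code segment that causes this error."
        )
        # 2) Use the localized segment, not the raw error message, as the retrieval query.
        snippet = retrieve_snippet(culprit, doc_chunks)
        # 3) Ask the model to correct both files with the retrieved snippet as a hint.
        fixed = llm(
            f"Documentation:\n{snippet}\n\nSolver error:\n{feedback}\n\n"
            f"Faulty PDDL:\n{df}\n{pf}\n\nReturn corrected domain and problem files."
        )
        df, pf = split_domain_problem(fixed)
    return df, pf
```

Querying with the localized code segment rather than the raw solver message is what distinguishes the code-retrieved variant illustrated in Figure 2.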

Experimental Setup

Experiments are conducted on four open-source LLMs (8B–32B parameters) across four planning domains (BlocksWorld, Mystery BlocksWorld, Logistics, Barman) from the IPC benchmark. The evaluation uses syntactic accuracy (no syntax errors) and semantic accuracy (plan correctness) as metrics, with Muise's dual-bfws-ffparser as the planner and VAL4 for plan validation.
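
A rough sketch of how these two metrics can be computed is shown below; `run_planner` and `run_val` are hypothetical wrappers around dual-bfws-ffparser and VAL, not the paper's released harness.

```python
def evaluate(instances, run_planner, run_val):
    """Compute syntactic and semantic accuracy for generated PDDL instances.

    `instances` yields (generated_df, generated_pf, gold_df, gold_pf) tuples.
    `run_planner(df, pf)` -> (syntax_ok, plan_or_none); `run_val(df, pf, plan)` -> bool.
    Both wrappers are assumptions standing in for the planner and validator.
    """
    n = syntactic = semantic = 0
    for gen_df, gen_pf, gold_df, gold_pf in instances:
        n += 1
        syntax_ok, plan = run_planner(gen_df, gen_pf)
        if not syntax_ok:
            continue  # the solver reported a syntax error
        syntactic += 1
        # Semantic accuracy: a plan must be found and then validated as correct,
        # here against the gold domain/problem files as described in the paper.
        if plan is not None and run_val(gold_df, gold_pf, plan):
            semantic += 1
    return (syntactic / n, semantic / n) if n else (0.0, 0.0)
```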

Results: Documentation Retrieval Substantially Improves Syntactic Accuracy

The results demonstrate that documentation retrieval, when integrated with modular generation and error refinement, leads to dramatic improvements in syntactic accuracy for low-resource LLMs. For example, Llama-4-Maverick's syntactic accuracy on BlocksWorld increases from 0% (Base) to over 90% with documentation, and semantic accuracy rises from 0% to over 80%. Other models, such as Llama-4-Scout, also see significant gains, though absolute performance remains lower in more complex domains (Figure 3).

Figure 3: Syntactic accuracy (orange) and semantic accuracy (blue) on various planning domains.

The largest improvements are observed in the first round of refinement, with diminishing returns in subsequent iterations (Figure 4).

Figure 4: Syntactic accuracy on various rounds of Refinement w/ Code-Retrieved Doc.

Analysis: Documentation Type, Retrieval Method, and Model Proficiency

Several nuanced findings emerge from the ablation studies:

  • Specific documentation snippets are more effective than providing the entire documentation, especially for less proficient models, which are overwhelmed by large contexts.
  • Examples in documentation are more beneficial than textual descriptions for correcting syntax errors, indicating that LLMs leverage concrete patterns more effectively than abstract explanations (Figure 5).

    Figure 5: Syntactic accuracy of different models under various document conditions on BlocksWorld.

  • BM25 retrieval is robust for code-retrieved refinement, while embedding-based retrieval can outperform BM25 in feedback-retrieved settings but may degrade performance in others.
  • LLMs rely more on documentation during initial code generation than during error refinement, where internal representations and previously generated code play a larger role.
  • Models with higher base PDDL proficiency (e.g., QwQ-32B, Qwen3-8B) benefit more from whole-document approaches, while less proficient models require modular, stepwise guidance.

Case Studies: Error Correction via Documentation

The paper provides concrete examples of how documentation retrieval corrects both syntactic and semantic errors in PDDL generation. For instance, when an LLM produces an invalid action definition or predicate type assignment, the pipeline retrieves the relevant documentation section and guides the LLM to produce a corrected version, as shown in the BlocksWorld domain (Figure 6).

Figure 6: Domain Description (DD) for the BlocksWorld domain.

Figure 7: Problem Description (PD) for the BlocksWorld domain.

Figure 8: Corrected Domain File (DF) for the BlocksWorld domain.

Figure 9: Problem File (PF) for the BlocksWorld domain.

Prompt Engineering and Pipeline Implementation

The paper details prompt templates for each pipeline variant, emphasizing minimal, stepwise, and context-aware prompting. For modular generation, the LLM is prompted to generate only the next component of the domain file, while for refinement, the LLM is provided with the previous code, error feedback, and retrieved documentation (Figures 10–14).

Figure 10: Base Prompt.

Figure 11: Modular w/ Specific Doc Prompt.

Figure 12: Once w/ Whole Doc Prompt.

Figure 13: Refinement w/o Doc Prompt.

Figure 14: Refinement w/ Retrieved Doc Prompt.
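
The exact prompt wording is given in Figures 10–14; the templates below are only a hedged approximation of the modular-generation and refinement prompts, and the embedded documentation snippet is invented for illustration rather than quoted from the Planning Wiki.

```python
# Illustrative documentation snippet (invented for this sketch, not quoted from the Planning Wiki).
ACTION_DOC = """\
Actions are declared as:
(:action pickup
  :parameters (?b - block)
  :precondition (and (clear ?b) (ontable ?b) (handempty))
  :effect (and (holding ?b) (not (ontable ?b)) (not (clear ?b)) (not (handempty))))
"""

MODULAR_PROMPT = (
    "You are writing a PDDL domain file step by step.\n"
    "Relevant documentation for this step:\n{doc}\n"
    "Domain description:\n{domain_description}\n"
    "Code generated so far:\n{partial_df}\n"
    "Write ONLY the next component: {component}."
)

REFINEMENT_PROMPT = (
    "The planner reported this error:\n{feedback}\n"
    "Relevant documentation:\n{doc}\n"
    "Current PDDL files:\n{df}\n{pf}\n"
    "Return corrected versions of both files."
)

# Example: request the action definitions, guided by the snippet above.
prompt = MODULAR_PROMPT.format(
    doc=ACTION_DOC,
    domain_description="Blocks can be stacked, unstacked, picked up, and put down...",
    partial_df="(define (domain blocksworld) ...)",
    component="action definitions",
)
```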

The pseudocode for the retrieval-augmented, iterative correction pipeline is provided, highlighting the integration of error parsing, code localization, and documentation retrieval.
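
A compressed version of that pseudocode might look as follows. The component ordering is an assumption, and the helpers (`retrieve_snippet`, `MODULAR_PROMPT`, `refine_with_retrieved_doc`, plus the caller-supplied `llm` and `solve_pddl`) come from the sketches above rather than from the paper's released implementation.

```python
COMPONENTS = ["requirements", "types", "predicates", "actions"]  # assumed ordering


def generate_and_repair(domain_description, problem_description, doc_chunks,
                        llm, solve_pddl, max_rounds=3):
    """End-to-end sketch: modular generation with specific docs, then iterative refinement."""
    df = ""
    for component in COMPONENTS:
        doc = retrieve_snippet(component, doc_chunks)  # snippet most relevant to this step
        df += llm(MODULAR_PROMPT.format(doc=doc,
                                        domain_description=domain_description,
                                        partial_df=df,
                                        component=component)) + "\n"
    pf = llm(f"Write the PDDL problem file for:\n{problem_description}")
    # Hand both files to the code-retrieved refinement loop sketched earlier.
    return refine_with_retrieved_doc(df, pf, doc_chunks, llm, solve_pddl,
                                     max_rounds=max_rounds)
```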

Limitations and Implications

While the proposed pipelines yield substantial improvements in syntactic accuracy, semantic correctness remains limited by the LLM's reasoning and world modeling capabilities. Documentation retrieval cannot compensate for fundamental deficiencies in the model's understanding of the planning domain, especially in complex or under-specified environments. The approach also assumes the availability of high-quality, well-structured documentation, which may not generalize to all domains.

The findings have practical implications for deploying LLMs in low-resource, domain-specific code generation tasks. Retrieval-augmented generation and modular prompting can bridge the gap for smaller models, but further advances in model architecture and training are required to address semantic limitations.

Conclusion

This work demonstrates that documentation retrieval, when combined with modular code generation and iterative error refinement, can transform the performance of open-source LLMs on low-resource planning language generation tasks. The approach is particularly effective for syntactic correctness, enabling models that previously failed entirely to become functional as planning formalizers. However, semantic accuracy remains a bottleneck, underscoring the need for future research on LLM reasoning and world modeling in formal domains. The methodology and insights are broadly applicable to other low-resource, domain-specific language generation settings, provided that high-quality documentation is available.


Explain it Like I'm 14

Plain-English Summary: How Reading the “Rulebook” Helps AI Write Better Plans

1) What is this paper about?

This paper explores how to help AI models write correct step-by-step plans by having them read the right “rulebook” first. The plans are written in a special planning language called PDDL, which is like a coding language for describing actions, rules, and goals in a world (for example, stacking blocks). The authors show that giving AI the most relevant parts of the PDDL documentation (the rulebook) and guiding how it writes and fixes code can massively improve results—especially for smaller, open-source models.

2) What questions did the researchers ask?

In simple terms, they asked:

  • Can AIs that usually struggle with planning in PDDL do much better if we give them the right documentation at the right time?
  • Is it better to give all the documentation at once, or just the specific parts needed for each step?
  • When the AI makes mistakes, can we use error messages to retrieve the best part of the documentation to help it fix the code?
  • Does documentation mostly help with “grammar” mistakes (syntax) or also with deeper meaning mistakes (semantics)?
  • Do examples in the docs help more than plain descriptions?

3) How did they do it?

They tested AI models in text-based planning worlds like Blocks World (stacking blocks), Logistics (moving packages), and Barman (mixing drinks). The goal: given a description of the world and the goal, have the AI write two PDDL files:

  • Domain file: defines the actions and rules (like “pickup,” “stack,” and what must be true before and after each action).
  • Problem file: defines the starting state and the goal.

They tried several approaches:

  • Base: Ask the AI to write PDDL from scratch with just the task description.
  • Once w/ Whole Doc: Give the AI the entire PDDL documentation and then have it write everything.
  • Modular w/ Specific Doc: Break the job into smaller steps (like “write the list of actions,” then “write the types,” etc.) and give only the relevant documentation for each step.
  • Refinement (fixing mistakes): Run the AI’s code through a planner/validator (a program that checks if the code is valid and can produce a correct plan). If there’s an error, use that error to retrieve the most helpful doc snippet and ask the AI to fix its code.

To find the right doc snippet, they used a simple search method called BM25 (think: keyword-based library search). In some tests, they also tried an embedding-based retriever (a smarter, meaning-based search).

Here’s the basic cycle they used:

  • Step 1: AI writes draft PDDL.
  • Step 2: A solver checks the code and reports errors.
  • Step 3: Retrieve the most helpful documentation for that specific error.
  • Step 4: AI fixes the code using the error message and the doc.
  • Step 5: Repeat fixing a few times if needed.

Key terms explained:

  • Syntax errors: Like grammar mistakes in code (wrong keyword, missing parentheses).
  • Semantic errors: The code runs, but the rules don’t match the world correctly (so the plan doesn’t actually achieve the goal).
  • Retrieval: Finding the best parts of the documentation to help with the current task.

4) What did they find, and why is it important?

Main takeaways:

  • Documentation boosts performance a lot, especially on easier tasks.
    • On Blocks World, one model went from 0% correct plans to over 80% when the pipeline used documentation effectively; its syntax accuracy also rose above 90%.
  • Giving specific, targeted documentation helps fix syntax (grammar) errors.
    • Modular generation with the right doc snippet for each step works well.
    • Using the error message to find a matching doc snippet helps during fixing.
  • Examples in the documentation are extra helpful.
    • Docs with examples beat plain descriptions for reducing syntax mistakes.
  • Documentation helps most at the beginning.
    • Docs are very useful during the first code-writing phase, less so during later rounds of fixing.
  • Stronger models can handle the whole documentation at once.
    • More capable models do fine when you give them the entire doc. Less capable models get overwhelmed and do better with modular, bite-sized doc pieces.
  • Semantic errors remain hard.
    • Docs can fix grammar, but they don’t fully solve deeper reasoning problems. This becomes clear in harder worlds (like Logistics or Barman), where the AI still struggles to represent the world’s rules correctly.
  • A few rounds of fixing help, but gains drop off.
    • The first round of refinement gives the biggest improvement; later rounds help less.
  • Retrieval method matters.
    • For some settings, embedding-based retrieval found better doc snippets than BM25; in others, BM25 was more reliable. There’s no one-size-fits-all winner.

Why it matters:

  • It shows a practical way to make smaller, open-source AIs useful for planning by pairing them with the right documentation and a smart generation-and-fix pipeline.
  • It separates surface-level correctness (syntax) from real understanding (semantics), pointing to where future improvements are needed.

5) What does this mean for the future?

  • With smart use of documentation and careful step-by-step generation, smaller and cheaper AI models can write useful planning code—great news for teams that can’t afford huge models.
  • However, to truly solve harder problems, AIs need better reasoning, not just better grammar. Future work should focus on improving world modeling and logic, maybe with better training data, better tools, or hybrid systems that combine AI with classical planners.
  • Good, well-organized docs with examples are key. Building better documentation and better ways to retrieve just the right part can make AIs much more reliable.
  • The approach here could help with other niche or “low-resource” coding languages, not just PDDL, wherever clear rulebooks exist.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains missing, uncertain, or unexplored in the paper, framed to guide future research.

  • Robustness to naturalistic inputs: The pipeline is evaluated on heavily templated domain/problem descriptions; how does performance change with noisy, under-specified, or real-world narrative descriptions (e.g., ambiguous goals, incomplete constraints)?
  • Documentation coverage and quality: Only Planning Wiki is used; effects of outdated, incomplete, or inconsistent documentation are not quantified. How do doc source choice, versioning, and quality control affect outcomes?
  • Doc segmentation strategy: The modular approach partitions docs by PDDL components, but the partitioning method (chunk size, boundaries, overlap) is not examined. What segmentation policies maximize retrieval precision without overwhelming weaker models?
  • Error-to-doc retrieval mapping: BM25 or simple embeddings are used ad hoc. Can code-aware or structure-informed retrieval (grammar-aware, AST/CFG-guided, symbolic index of PDDL constructs) more reliably surface the right snippet for a given error?
  • Error localization reliability: The method relies on an LLM to “localize” error-causing code prior to retrieval, but localization accuracy and its impact on refinement success are not measured. How do static analyzers or rule-based parsers compare for pinpointing faulty lines?
  • Semantic error reduction strategies: Documentation substantially reduces syntax errors but not semantics. What additional mechanisms (domain invariants, state-space reachability checks, goal-condition completeness tests, counterexample-guided repair) best address semantic failures such as loops, unreachable goals, or insufficient action sets?
  • Evaluation design and metric fidelity: “Semantic accuracy” is defined via planning and validation with VAL, reportedly “against the gold DF and PF,” which may mismatch a plan derived from a different generated DF/PF. How should correctness be measured when the generated domain/problem differ from gold (e.g., goal satisfaction in the intended task independent of DF/PF idiosyncrasies), and how do metrics like plan optimality, length, and solver runtime change conclusions?
  • Planner and validator dependence: Only dual-bfws-ffparser and VAL are used. Do improvements transfer across diverse planners (e.g., FastDownward, FF, LPG) and validators, and are some tools more sensitive to certain PDDL features (typing, ADL, conditional effects)?
  • Domain coverage and PDDL feature breadth: Experiments cover Blocks World, Logistics, Barman (+ Mystery Blocks), but advanced PDDL features (quantifiers, conditional effects, negative preconditions, durative actions, numeric fluents) are not tested. How does performance scale with feature complexity?
  • Generalization beyond PDDL: The SMT/Z3 exploration is limited and negative. What pipeline adaptations (error taxonomies, solver feedback normalization, language-specific documentation) are needed for other low-resource formal languages (SMT-LIB, Alloy, TLA+, STRIPS variants)?
  • Model scale and prompting ablations: Only 8B–32B open-source models with zero-shot prompting are considered. How do few-shot prompts, tool-usage, self-consistency/tree-of-thought, or lightweight fine-tuning alter doc utility and semantic performance, and what are the scaling laws beyond 50B?
  • Adaptive refinement policies: Refinement is capped at three iterations, with gains concentrated in the first round. Can dynamic stopping criteria, error-class–aware policies, or curriculum-based refinement improve efficiency and semantics?
  • Example vs description design: Examples outperform descriptions, but “example” characteristics (granularity, diversity, coverage of edge cases) are not analyzed. Can synthetic example generation tailored to the current domain/problem systematically close semantic gaps?
  • Hybrid retrieval and reranking: Embeddings help in feedback-retrieved but hurt in code-retrieved settings. What hybrid schemes (BM25 first-pass + embedding reranker, or supervised LTR) best balance precision and robustness across error types and domains?
  • Cost, latency, and scalability: The computational overhead of retrieval, multiple solver calls, and iterative generation is not reported. How do time/cost trade-offs evolve with domain complexity, and what budget-aware pipeline variants preserve most gains?
  • Reproducibility and variance: Run-to-run variance, seed control, and sensitivity to sampling parameters are not discussed. How stable are results across restarts, and does documentation reduce variance?
  • Fine-grained error taxonomy: Errors are grouped as syntax vs semantic. A more granular taxonomy (requirements flags, predicate arity/type mismatches, object set inconsistencies, precondition/effect mis-specifications, initial-state errors) could guide targeted retrieval and repair. Can such a taxonomy improve triage and fix rates?
  • Mystery Blocks robustness: Keyword perturbations combat memorization, but impacts on retrieval (term mismatch between feedback and docs) are not analyzed. What query normalization or synonym mapping (e.g., ontology alignment) mitigates obfuscation?
  • Long-term constraint retention: Documentation influences initial generation more than later refinement. How can constraints learned from docs be explicitly tracked and enforced across iterations (e.g., constraint buffers, unit tests, invariants)?
  • Curriculum modularization: Modular generation helps weaker models, but the optimal step ordering and granularity are not studied. What curricula (requirements → types → predicates → actions → PF init/goal) maximize semantic correctness without overfitting syntax?
  • Solver feedback normalization: Using raw solver messages as queries is brittle. Can feedback be normalized into structured error schemas or enriched with failure traces to improve retrieval and repair?
  • Human-in-the-loop augmentation: The potential of minimal human hints (e.g., invariant annotations, goal decompositions) in one or two steps to close semantic gaps is unexplored. What is the smallest effective human intervention?
  • Documentation trust and safety: The pipeline assumes doc correctness. How can systems detect and mitigate misleading/conflicting documentation (e.g., cross-source verification, confidence scoring), especially when docs are community-maintained?
  • Cross-model doc utilization analysis: Models “vary in their ability to leverage documentation,” but no mechanism-level analysis exists. Can attention/tracing or token-level attribution quantify doc consumption and predict when docs help or hurt?
  • Transfer to real tasks and deployments: The paper’s scope is limited to simulated IPC-style tasks. What happens in integrated robotic or workflow planning settings where environment states are partially observed, stochastic, or evolving, and where plans must be executed?

Practical Applications

Practical Applications Overview

Below are actionable, real-world applications that follow directly from the paper’s findings and methods. They are grouped into Immediate Applications (deployable with today’s tools and the described pipelines) and Long-Term Applications (requiring further research, scaling, or model capability advances).

Immediate Applications

  • Documentation-augmented formal planning assistants
    • Sector: software, robotics, logistics
    • Use case: Natural-language task descriptions are transformed into PDDL domain/problem files via modular generation plus documentation retrieval, then solved/validated with planners (e.g., dual-bfws-ffparser) and VAL.
    • Tools/products/workflows: “PDDL Copilot” IDE plugin; CI step that runs solver/VAL and performs up to 1–2 refinement rounds; a service wrapper exposing the full pipeline: NL spec → retrieval → modular PDDL generation → solver → validation → error-localization → doc-retrieval → correction.
    • Assumptions/dependencies: Well-structured, accurate documentation (examples strongly preferred over descriptions); high-quality domain/problem descriptions; BM25 or embedding retriever tuned to the phase (BM25 for code-retrieved refinement); access to planners/validators; tasks not too complex semantically.
  • IDE extensions for low‑resource DSLs (beyond PDDL)
    • Sector: software engineering
    • Use case: Code editors that retrieve relevant documentation snippets and examples when users write DSL code, reducing syntax errors and lint warnings.
    • Tools/products/workflows: “DocPrompting for DSLs” extension; on-save syntactic validation; lightweight error localization that retrieves targeted doc snippets; 1–2 iteration fix suggestions.
    • Assumptions/dependencies: Documentation coverage and currency; supported solver/validator per DSL; smaller LLMs benefit most from modular, stepwise templates.
  • Documentation curation and example-first authoring
    • Sector: documentation engineering, education
    • Use case: Rewrite/augment DSL docs to maximize example density (since examples outperform descriptions), partition docs by code components (types, predicates, actions) for targeted retrieval.
    • Tools/products/workflows: Documentation pipelines that auto-segment content; “Example Index” for predicates/actions; retrieval-ready doc stores (BM25 and embeddings).
    • Assumptions/dependencies: Editorial capacity; doc hosting with search/index; governance over doc updates.
  • Training and onboarding for planning languages
    • Sector: education, academia, industry training
    • Use case: Interactive tutors that demonstrate modular PDDL generation; students solve Blocks World and Logistics-like scenarios with automatic syntax checks and feedback-driven doc retrieval.
    • Tools/products/workflows: Courseware that integrates the paper’s prompts/pipeline; microprojects emphasizing initial generation with specific docs and limited refinement iterations.
    • Assumptions/dependencies: Availability of benchmark tasks; classroom compute; alignment of curricula with PDDL/IPC basics.
  • Lab-scale robotics and simulated logistics formalization
    • Sector: robotics, logistics
    • Use case: In controlled environments, convert textual task specs into executable plans (e.g., pick-and-place procedures, simple routing) by generating validated PDDL.
    • Tools/products/workflows: Simulation-to-PDDL generator; lab orchestration system that runs the pipeline and deploys plans to a motion/planning stack; logging/VAL integration for auditability.
    • Assumptions/dependencies: Limited domain complexity; stable environment models; integration with downstream planners; tolerance for semantic errors (human-in-the-loop approval).
  • CI/CD gates for syntactic correctness in planning DSLs
    • Sector: software, MLOps
    • Use case: Automatically ensure DSL artifacts (PDDL domains/problems) are syntactically valid before merging/release, using doc-augmented generation and refinement.
    • Tools/products/workflows: CI job calling the pipeline on changed DSL files; report builder highlighting syntax fixes and retrieved doc rationale; “first-iteration fix” policy.
    • Assumptions/dependencies: Access to planners/validators in CI; stable retrieval indexes; small iteration budgets (most gains occur in the first refinement round).

Long-Term Applications

  • Robust semantic planning from natural language in complex domains
    • Sector: robotics, supply chain, healthcare operations, energy grid planning
    • Use case: Translate messy, real-world task descriptions into semantically correct formal models/plans (temporal/numeric PDDL, TAMP, hybrid constraints) that perform under uncertainty.
    • Tools/products/workflows: Next-gen LLM formalizers with improved reasoning/world modeling; richer doc stores (temporal/numeric examples, domain ontologies); multi-phase refinement with human-in-the-loop validation.
    • Assumptions/dependencies: Advances in LLM reasoning; better environment modeling; standardized domain descriptions; richer validation suites; safety approvals for deployment.
  • Generalized “RAG-for-DSLs” platform with phase-aware retrieval
    • Sector: software
    • Use case: A service that supports many low-resource languages (PDDL, SMT, robotics DSLs), automatically choosing BM25 vs embeddings depending on phase (initial generation vs feedback-driven vs code-retrieved refinement).
    • Tools/products/workflows: Retrieval orchestrator that switches strategies; code localization models; policy for examples-first retrieval; adapter prompts based on model capability (modular for weaker models, whole-doc for stronger).
    • Assumptions/dependencies: High-quality multi-DSL documentation; persistent benchmarks; routing policy backed by empirical performance.
  • Auditable, interpretable planning systems for regulated sectors
    • Sector: policy/governance, healthcare, finance, public-sector IT
    • Use case: Require interpretable formal planning artifacts (PDDL/DSL) for automated decision systems, with documentation-backed traceability for errors and corrections.
    • Tools/products/workflows: Compliance toolkits that store solver feedback, retrieved docs, and change rationales; certification processes anchored in plan validation (VAL) and doc references.
    • Assumptions/dependencies: Regulatory acceptance of formal planning artifacts; audit record standards; sustained documentation governance.
  • Automated documentation synthesis and maintenance for DSLs
    • Sector: documentation engineering, software
    • Use case: Generate and continuously update example-rich docs (from code corpora and validated plans) to improve downstream LLM formalization performance.
    • Tools/products/workflows: Doc synthesis pipelines that mine successful plans to produce examples; semantic clustering of examples; doc A/B testing for pipeline performance uplift.
    • Assumptions/dependencies: Access to plan repositories; legal rights to synthesize docs; tooling to measure doc impact on accuracy.
  • NL-to-plan agents with interactive correction loops
    • Sector: consumer assistants, enterprise productivity
    • Use case: Assistants that decompose user goals, produce candidate formal plans, run solvers, then ask targeted questions to resolve semantic gaps, leveraging retrieved docs and examples.
    • Tools/products/workflows: Conversational refinement (error localization explanations, doc snippets shown to users); adaptive iteration budgets; mixed-initiative edit proposals.
    • Assumptions/dependencies: UX for human-in-the-loop; privacy-safe logging; model advances for semantic correction beyond syntax.
  • Standardization of domain/problem description practices
    • Sector: policy, industry consortia, academia
    • Use case: Establish best-practice templates for domain/problem descriptions that are retrieval-friendly (segmentable, example-rich), improving consistency and model performance at scale.
    • Tools/products/workflows: Style guides; schema for domain/problem docs; validators that flag underspecified or noisy descriptions.
    • Assumptions/dependencies: Community adoption; cross-organization governance; demonstrated performance gains to motivate uptake.
  • Cross-language formalization bridges (PDDL ↔ SMT/TAMP/other DSLs)
    • Sector: software, robotics, OR
    • Use case: Translate between planning languages with documentation-aware prompts and formal validators, enabling heterogeneous planning stacks.
    • Tools/products/workflows: Multi-DSL converters; cross-validator orchestration; retrieval indexes spanning multiple language specs and examples.
    • Assumptions/dependencies: Mature documentation for target DSLs; tools to validate semantics across languages; advances beyond current SMT results (which underperform in complex tasks per the paper).
  • Domain-specific planning knowledge bases with example-centric retrieval
    • Sector: healthcare pathways, manufacturing process planning, energy dispatch
    • Use case: Curated KBs of domain actions/predicates and canonical examples, optimized for LLM retrieval during formalization and refinement.
    • Tools/products/workflows: Domain ontologies linked to PDDL fragments; example libraries; retrieval services co-designed with planners.
    • Assumptions/dependencies: Domain SME involvement; mapping from domain semantics to formal language constructs; sustained curation.

Notes on Feasibility and Design Choices Drawn from the Paper

  • Documentation is most effective for syntax; semantic correctness remains a bottleneck with smaller LLMs.
  • Examples outperform descriptions; prioritize example-rich documentation and retrieval.
  • Modular generation helps weaker models; whole-doc prompting suits more proficient models.
  • Code-retrieved doc (with BM25) is typically more effective than feedback-retrieved doc; embeddings can help in some feedback-retrieved settings—make retrieval strategy phase-aware.
  • Most gains occur in the first refinement iteration; cap iterations to 1–2 in production pipelines.
  • Performance depends on clean domain/problem descriptions and reliable solver/validator integration; noisy or underspecified inputs degrade outcomes.

Glossary

  • ADL: An expressive PDDL requirement extending STRIPS with advanced constructs such as quantifiers and conditional effects. "(:requirements :strips :adl :typing)"
  • BM25: A probabilistic information retrieval ranking function used to retrieve the most relevant documentation passages. "Table 1: Comparison of BM25 vs Embedding-base retriever results across domains, methods, and models."
  • Barman: An IPC benchmark planning domain involving cocktail mixing and bar tools. "Barman from the International Planning Competition (IPC, 1998)"
  • Blocks World: A classic AI planning domain where blocks are stacked and unstacked on a table. "over 80% on the common Blocks World domain."
  • dual-bfws-ffparser planner: A heuristic search-based planner used to solve and generate plans from PDDL domain/problem files. "We use the dual-bfws-ffparser planner Muise (2016) to solve for the plan"
  • Embedding-based retriever: A retrieval method that uses vector embeddings to fetch relevant documentation for code/planning refinement. "The embedding-based retriever exhibits diver- gent effects across refinement settings."
  • International Planning Competition (IPC): A long-running benchmark competition and suite for evaluating automated planners. "International Planning Competition (IPC, 1998)"
  • LLM-as-Formalizer: A paradigm where an LLM generates formal planning representations (e.g., PDDL) that a solver uses to compute plans. "In contrast, the LLM-as-Formalizer (Tang et al., 2024; Guo et al., 2024; Zhang et al., 2024) approach"
  • LLM-as-Planner: A paradigm where an LLM directly outputs action plans from environment descriptions without an external formal solver. "Figure 1: A simplified illustration of LLM-as-Planner and LLM-as-Formalizer on the Blocks World domain."
  • Logistics: An IPC transportation planning domain involving routing packages via trucks and planes. "Logistics"
  • Mystery Blocks World: A Blocks World variant with perturbed vocabulary to reduce memorization. "Mystery Blocks World (Valmeekam et al., 2023)"
  • PDDL (Planning Domain Definition Language): A formal language for specifying planning domains and problems for automated planners. "Planning Domain Definition Language (PDDL) as one of the predominant formal languages for LLM planning"
  • Planning Wiki: A community-maintained knowledge base used here as the documentation source for PDDL. "We crawl, process and use the Planning Wiki2 as the source of documentation of the PDDL language."
  • Satisfiability Modulo Theories (SMT): A class of decision problems and solvers that determine satisfiability with respect to background theories (e.g., arithmetic). "Satisfiability Modulo Theories (SMT) solvers—specifically Z3, a general-purpose solver for constraint satisfaction planning problems."
  • Semantic accuracy: An evaluation metric indicating whether a correct, goal-achieving plan is produced. "Semantic accuracy is the percentage of problems where a plan is not only found but also correct."
  • STRIPS: A foundational PDDL requirement/fragment characterized by simple preconditions and add/delete effects. "(:requirements :strips :adl :typing)"
  • Syntactic accuracy: An evaluation metric indicating whether generated PDDL files contain no syntax errors. "Syntactic accuracy is the percentage of problems where no syntax errors are returned by the planning solver."
  • VAL4: A plan validation tool that checks the correctness of plans against PDDL domain and problem specifications. "the VAL4 (Howey et al., 2004) to validate the plan against the gold DF and PF."
  • Z3: A widely used SMT solver capable of checking satisfiability for constraints in planning and other domains. "specifically Z3, a general-purpose solver for constraint satisfaction planning problems."
  • zero-shot: Performing a task without task-specific fine-tuning or in-context examples at inference time. "we begin with the most basic setting, referred to as Base, where a LLM zero-shot generates PDDL code."