- The paper proposes a test-time scaling framework that uses Best-of-N sampling for initialization and Instance Verbalized Machine Learning (iVML) for iterative refinement to generate PDDL symbolic world models from natural language.
- This approach achieves high success rates, such as 85.2% on the NL2Domain task with a 7B model, significantly improving upon baseline direct LLM generation methods.
- Generating formal PDDL models lets rigorous classical planning algorithms take over the search, avoiding the hallucination and rule-violation issues common in direct LLM-as-planner approaches.
The paper addresses critical limitations in formal planning using LLMs by generating explicit symbolic world models in Planning Domain Definition Language (PDDL) from natural language descriptions. The work presents a test-time compute scaling framework that bypasses the need for finetuning by leveraging two complementary strategies: Best-of-N (BoN) sampling and Instance Verbalized Machine Learning (iVML) for iterative refinement.
The approach is organized in two distinct phases:
- BoN Sampling for Initialization
The method first runs a stochastic exploration step in which the LLM generates multiple candidate PDDL domains at high temperature. Each candidate is scored by the sum of the log-likelihoods of its generated tokens, and the top-K candidates out of N parallel samples are retained, which mitigates the cold-start problem and yields high-quality initializations. Empirical results indicate that even moderate sample counts (e.g., BoN-8) give significant improvements over standard single-pass generation; a minimal sketch of the sampling-and-selection step follows.
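As a rough illustration, the selection step can be sketched with Hugging Face `transformers`; the checkpoint ID, the `best_of_n` helper, and the generation hyperparameters below are illustrative assumptions, not the paper's exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint: the paper evaluates the Qwen2.5-Coder family, but the
# exact model ID and settings here are assumptions for the sketch.
MODEL_ID = "Qwen/Qwen2.5-Coder-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token  # make padding after EOS detectable

def best_of_n(prompt: str, n: int = 8, k: int = 1) -> list[str]:
    """Sample n candidate PDDL domains, keep the top-k by summed token log-likelihood."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=1.0,              # high temperature -> diverse candidates
        num_return_sequences=n,
        max_new_tokens=2048,
        output_scores=True,
        return_dict_in_generate=True,
    )
    # Log-probability of each generated token under the model.
    logprobs = model.compute_transition_scores(
        out.sequences, out.scores, normalize_logits=True
    )
    gen = out.sequences[:, inputs["input_ids"].shape[1]:]
    valid = gen != tokenizer.pad_token_id        # ignore padding after EOS
    scores = torch.where(valid, logprobs, torch.zeros_like(logprobs)).sum(dim=-1)
    best = scores.topk(k).indices
    return [tokenizer.decode(gen[i], skip_special_tokens=True) for i in best]
```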
- iVML for Closed-Loop Refinement
Building on the BoN-based initialization, the method introduces a novel instance verbalized machine learning framework. iVML operates in a closed loop with two LLM functions: an optimizer LLM, `f_opt`, that critiques the generated PDDL domain against the natural-language specification, and an update LLM, `f_update`, that incorporates this feedback to produce a refined version. The iterative process minimizes a loss `L(G, D)`, where `G` is the input description and `D` the PDDL domain, with each update aiming to reduce logical inconsistencies and syntax errors. The authors provide prompt templates for both functions, ensuring that the model systematically identifies and corrects issues such as precondition conflicts and the syntax slips invited by PDDL's Lisp-like structure; a sketch of the loop appears below.
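A minimal sketch of the closed loop, assuming a generic `call_llm(prompt) -> str` helper; the two templates are simplified stand-ins for the paper's actual prompt templates:

```python
# Simplified stand-ins for the paper's prompt templates for f_opt and f_update.
OPT_TEMPLATE = (
    "Natural-language description:\n{g}\n\nCandidate PDDL domain:\n{d}\n\n"
    "List every logical inconsistency and syntax error in the domain."
)
UPDATE_TEMPLATE = (
    "Natural-language description:\n{g}\n\nCurrent PDDL domain:\n{d}\n\n"
    "Critique:\n{c}\n\nRewrite the domain so it resolves every point in the critique."
)

def ivml(g: str, d_init: str, call_llm, iterations: int = 80) -> str:
    """Closed-loop refinement: f_opt critiques the domain, f_update applies the fix."""
    d = d_init                                                      # BoN-selected start
    for _ in range(iterations):
        critique = call_llm(OPT_TEMPLATE.format(g=g, d=d))          # f_opt
        d = call_llm(UPDATE_TEMPLATE.format(g=g, d=d, c=critique))  # f_update
    return d
```

Each pass plays the role of one descent step on `L(G, D)`: the critique acts as a verbalized "gradient" and the rewrite as the update, except the "parameters" being optimized are the PDDL text itself.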
The framework is evaluated on two primary tasks:
- NL2Domain Task: Converting natural language descriptions to robust PDDL domain code.
- Prob2Domain Task: Deriving consistent PDDL domains from given PDDL problem files.
Across these benchmarks, the method demonstrates substantial gains. Using the Qwen2.5-Coder model at 7B parameters, the approach achieves an 85.2% success rate on NL2Domain and a 71.4% success rate on Prob2Domain, considerable improvements over baselines such as o1-mini, which scores 41.7% and 33.7%, respectively. Similar trends hold across model scales from 1.5B to 72B parameters and in comparisons with other closed-source models.
Additional analysis in the paper includes:
- Ablation Studies:
While BoN sampling alone improves quickly at first, it saturates and can even degrade beyond a certain sample size, whereas iVML improves monotonically as iterations increase (gains are observed up to 80 iterations). This underscores the method's robustness in navigating the non-convex optimization landscapes typical of formal synthesis tasks.
- Comparison to Direct LLM‑Based Planners:
The authors compare their PDDL abstraction approach against LLM-as-planner techniques, where plans are produced directly via text completion. The ambiguity of natural language in those approaches leads to rule violations, erroneous state transitions, and incorrect goal estimates. By translating the problem into a PDDL domain instead, the method can hand the search to classical planning algorithms such as A*, mitigating hallucination and ensuring adherence to domain constraints; a sketch of this handoff follows.
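For concreteness, here is one way the handoff could look, assuming Fast Downward is installed and its `fast-downward.py` driver is on PATH; the paper does not prescribe a particular planner or heuristic, so this is one common choice rather than the method's required backend:

```python
import subprocess

def solve(domain_path: str, problem_path: str) -> str:
    """Run A* with the admissible LM-cut heuristic over a generated PDDL pair."""
    result = subprocess.run(
        ["fast-downward.py", domain_path, problem_path,
         "--search", "astar(lmcut())"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout  # a found plan is also written to ./sas_plan
```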
- Domain-Specific Case Studies:
The method is validated across a variety of planning benchmarks, including Blocksworld, Termes, Floortile, Tyreworld, and Barman, demonstrating broad applicability. Detailed case studies show how iVML iteratively corrects errors such as invalid predicate usage and malformed preconditions, yielding logically consistent and syntactically correct domains.
- Limitations and Future Work:
The paper acknowledges that while syntactic validation is reliable with tools such as VAL, semantic verification remains an open challenge. Furthermore, simulations are performed under idealized conditions (e.g., fully observable, noise-free states), which may differ from real-world robotics applications. A hedged example of the syntax check appears below.
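As an illustration of the syntactic check, a generated domain/problem pair could be run through VAL's `Parser` binary; treating a nonzero exit code or an error message as a failed parse is an assumption about VAL's behavior, not something the paper specifies:

```python
import subprocess

def pddl_parses(domain_path: str, problem_path: str) -> bool:
    """Heuristic syntax check via VAL's Parser (assumed to be on PATH)."""
    proc = subprocess.run(
        ["Parser", domain_path, problem_path],
        capture_output=True, text=True,
    )
    # Assumption: a clean parse exits 0 and prints no error messages.
    return proc.returncode == 0 and "error" not in (proc.stdout + proc.stderr).lower()
```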
In summary, the paper offers a technically detailed and effective framework for turning ambiguous natural-language descriptions into explicit, formal models for planning. By coupling probabilistic exploration via BoN sampling with self-refinement through iVML, it significantly improves the quality and consistency of PDDL domain synthesis without any additional model training. These gains are corroborated by strong numerical results and thorough ablation studies, pointing to a promising direction for combining test-time compute scaling with formal symbolic reasoning.