- The paper proposes a test-time scaling framework that uses Best-of-N sampling for initialization and Instance Verbalized Machine Learning (iVML) for iterative refinement to generate PDDL symbolic world models from natural language.
- This approach achieves high success rates, such as 85.2% on the NL2Domain task with a 7B model, significantly improving upon baseline direct LLM generation methods.
- Generating formal PDDL models lets rigorous classical planning algorithms take over the search, avoiding the hallucination and rule-violation issues common in direct LLM-as-planner approaches.
The paper addresses critical limitations in formal planning using LLMs by generating explicit symbolic world models in Planning Domain Definition Language (PDDL) from natural language descriptions. The work presents a test-time compute scaling framework that bypasses the need for finetuning by leveraging two complementary strategies: Best-of-N (BoN) sampling and Instance Verbalized Machine Learning (iVML) for iterative refinement.
The approach is organized in two distinct phases:
- BoN Sampling for Initialization
The method first runs a stochastic exploration step in which the LLM generates multiple candidate PDDL domains at high temperature. Each candidate is scored by the sum of the log-likelihoods of its generated tokens, and the top-K candidates out of N parallel samples are retained, which mitigates the cold-start problem and yields high-quality initializations. Empirical results indicate that even moderate sample counts (e.g., BoN-8) give significant improvements over standard single-pass generation; a minimal sketch of the sampling-and-selection step follows.
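As a rough illustration, the selection step can be sketched with Hugging Face `transformers`; the checkpoint ID, the `best_of_n` helper, and the generation hyperparameters below are illustrative assumptions, not the paper's exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint: the paper evaluates the Qwen2.5-Coder family, but the
# exact model ID and settings here are assumptions for the sketch.
MODEL_ID = "Qwen/Qwen2.5-Coder-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token  # make padding after EOS detectable

def best_of_n(prompt: str, n: int = 8, k: int = 1) -> list[str]:
    """Sample n candidate PDDL domains, keep the top-k by summed token log-likelihood."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=1.0,              # high temperature -> diverse candidates
        num_return_sequences=n,
        max_new_tokens=2048,
        output_scores=True,
        return_dict_in_generate=True,
    )
    # Log-probability of each generated token under the model.
    logprobs = model.compute_transition_scores(
        out.sequences, out.scores, normalize_logits=True
    )
    gen = out.sequences[:, inputs["input_ids"].shape[1]:]
    valid = gen != tokenizer.pad_token_id        # ignore padding after EOS
    scores = torch.where(valid, logprobs, torch.zeros_like(logprobs)).sum(dim=-1)
    best = scores.topk(k).indices
    return [tokenizer.decode(gen[i], skip_special_tokens=True) for i in best]
```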
- iVML for Closed-Loop Refinement
Building on the BoN-based initialization, the method introduces a novel instance verbalized machine learning framework. iVML operates in a closed loop with two LLM functions: an optimizer LLM, `f_opt`, that critiques the generated PDDL domain against the natural-language specification, and an update LLM, `f_update`, that incorporates this feedback to produce a refined version. The iterative process minimizes a loss `L(G, D)`, where `G` is the input description and `D` the PDDL domain, with each update aiming to reduce logical inconsistencies and syntax errors. The authors provide prompt templates for both functions, ensuring that the model systematically identifies and corrects issues such as precondition conflicts and the syntax slips invited by PDDL's Lisp-like structure; a sketch of the loop appears below.
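A minimal sketch of the closed loop, assuming a generic `call_llm(prompt) -> str` helper; the two templates are simplified stand-ins for the paper's actual prompt templates:

```python
# Simplified stand-ins for the paper's prompt templates for f_opt and f_update.
OPT_TEMPLATE = (
    "Natural-language description:\n{g}\n\nCandidate PDDL domain:\n{d}\n\n"
    "List every logical inconsistency and syntax error in the domain."
)
UPDATE_TEMPLATE = (
    "Natural-language description:\n{g}\n\nCurrent PDDL domain:\n{d}\n\n"
    "Critique:\n{c}\n\nRewrite the domain so it resolves every point in the critique."
)

def ivml(g: str, d_init: str, call_llm, iterations: int = 80) -> str:
    """Closed-loop refinement: f_opt critiques the domain, f_update applies the fix."""
    d = d_init                                                      # BoN-selected start
    for _ in range(iterations):
        critique = call_llm(OPT_TEMPLATE.format(g=g, d=d))          # f_opt
        d = call_llm(UPDATE_TEMPLATE.format(g=g, d=d, c=critique))  # f_update
    return d
```

Each pass plays the role of one descent step on `L(G, D)`: the critique acts as a verbalized "gradient" and the rewrite as the update, except the "parameters" being optimized are the PDDL text itself.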
The framework is evaluated on two primary tasks:
- NL2Domain Task: Converting natural language descriptions to robust PDDL domain code.
- Prob2Domain Task: Deriving consistent PDDL domains from given PDDL problem files.
Across these benchmarks, the method demonstrates substantial gains. Using the Qwen2.5-Coder model at 7B parameters, the approach achieves an 85.2% success rate on NL2Domain and a 71.4% success rate on Prob2Domain, considerable improvements over baselines such as o1-mini, which scores 41.7% and 33.7%, respectively. Similar trends hold across model scales from 1.5B to 72B parameters and in comparisons with other closed-source models.
Additional analysis in the paper includes:
- Ablation Studies:
While BoN sampling alone improves quickly at first, it saturates and can even degrade beyond a certain sample size, whereas iVML improves monotonically as iterations increase (gains are observed up to 80 iterations). This underscores the method's robustness in navigating the non-convex optimization landscapes typical of formal synthesis tasks.
- Comparison to Direct LLM‑Based Planners:
The authors compare their PDDL abstraction approach against LLM-as-planner techniques, where plans are produced directly via text completion. The ambiguity of natural language in those approaches leads to rule violations, erroneous state transitions, and incorrect goal estimates. By translating the problem into a PDDL domain instead, the method can hand the search to classical planning algorithms such as A*, mitigating hallucination and ensuring adherence to domain constraints; a sketch of this handoff follows.
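For concreteness, here is one way the handoff could look, assuming Fast Downward is installed and its `fast-downward.py` driver is on PATH; the paper does not prescribe a particular planner or heuristic, so this is one common choice rather than the method's required backend:

```python
import subprocess

def solve(domain_path: str, problem_path: str) -> str:
    """Run A* with the admissible LM-cut heuristic over a generated PDDL pair."""
    result = subprocess.run(
        ["fast-downward.py", domain_path, problem_path,
         "--search", "astar(lmcut())"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout  # a found plan is also written to ./sas_plan
```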
- Domain-Specific Case Studies:
The method is validated across a variety of planning benchmarks, including Blocksworld, Termes, Floortile, Tyreworld, and Barman, demonstrating broad applicability. Detailed case studies show how iVML iteratively corrects errors such as invalid predicate usage and malformed preconditions, yielding logically consistent and syntactically correct domains.
- Limitations and Future Work:
The paper acknowledges that while syntactic validation is reliable with tools such as VAL, semantic verification remains an open challenge. Furthermore, simulations are performed under idealized conditions (e.g., fully observable, noise-free states), which may differ from real-world robotics applications. A hedged example of the syntax check appears below.
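As an illustration of the syntactic check, a generated domain/problem pair could be run through VAL's `Parser` binary; treating a nonzero exit code or an error message as a failed parse is an assumption about VAL's behavior, not something the paper specifies:

```python
import subprocess

def pddl_parses(domain_path: str, problem_path: str) -> bool:
    """Heuristic syntax check via VAL's Parser (assumed to be on PATH)."""
    proc = subprocess.run(
        ["Parser", domain_path, problem_path],
        capture_output=True, text=True,
    )
    # Assumption: a clean parse exits 0 and prints no error messages.
    return proc.returncode == 0 and "error" not in (proc.stdout + proc.stderr).lower()
```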
In summary, the paper offers a technically detailed and effective framework for turning ambiguous natural-language descriptions into explicit, formal models for planning. By coupling probabilistic exploration via BoN sampling with self-refinement through iVML, it significantly improves the quality and consistency of PDDL domain synthesis without any additional model training. These gains are corroborated by strong numerical results and thorough ablation studies, pointing to a promising direction for combining test-time compute scaling with formal symbolic reasoning.