Simulation-to-Rules (VLMFP)
- Simulation-to-Rules (VLMFP) is a method that translates visual and language inputs into formal, rule-based PDDL representations for automated planning.
- It employs a dual-model framework where SimVLM encodes scenarios and GenVLM generates and refines PDDL files through iterative feedback.
- The approach demonstrates robust generalization across unseen visual environments, reducing the need for human input in extracting planning rules.
Simulation-to-Rules (VLMFP) describes a vision–language model (VLM) guided paradigm for autonomously formalizing visual planning scenarios into symbolic, rule-based representations. The approach is instantiated in the Dual-VLM framework VLMFP, which achieves reliable translation of image- and language-conditioned tasks into Planning Domain Definition Language (PDDL) files, thereby enabling formal symbolic planning directly from visual inputs. This method addresses the challenge of automatically extracting all necessary planning rules, rather than relying solely on problem-specific files that require human input or environment access, and demonstrates generalized planning and simulation capabilities across diverse domains (Hao et al., 3 Oct 2025).
1. Formal Problem Setting and Notation
Simulation-to-Rules (VLMFP) is designed to convert visual and linguistic domain descriptions into executable formal planning rules. A visual planning scenario is specified by:
- A natural language description $L$ defining rules, actions, and constraints
- An image $I$ encoding the spatial configuration (e.g., a grid layout)
The target output consists of:
- $D$: the PDDL domain file (defining predicates, actions, and transition rules)
- $P$: the PDDL problem file (specifying the initial state and goal conditions)
A valid solution is a formal plan $\pi$ such that a symbolic planner operating on $(D, P)$ computes an action sequence that achieves the designated goal state. The central technical challenge lies in generating both $D$ and $P$ from $L$ and $I$, including accurate symbolic representations of the general rules, not just of individual problem instances.
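The inputs and outputs of this setting can be captured with a small set of types. The following is a sketch; the names `Scenario`, `PDDLPair`, and `Planner` are illustrative, not from the paper:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Scenario:              # the visual planning inputs
    description: str         # natural-language rules, actions, constraints
    image: bytes             # spatial configuration, e.g. a rendered grid

@dataclass
class PDDLPair:              # the target formal outputs
    domain: str              # predicates, actions, transition rules
    problem: str             # instance objects, initial state, goal

# A symbolic planner maps (domain, problem) to a plan, or None if unsolvable.
Planner = Callable[[str, str], Optional[List[str]]]
```

The point of the typing is that the domain file is shared across a problem class while the problem file varies per instance, so a planner consumes both together.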
The VLMFP architecture introduces two distinct VLMs:
- SimVLM: A vision–language model specialized for simulating environment dynamics, describing scenarios, and judging goal reachability from the current state and an action sequence.
- GenVLM: A generative vision–language model (e.g., GPT-4o) tasked with producing and refining the PDDL files interactively, using feedback derived from discrepancies between symbolic and simulated outcomes.
2. Architecture and Workflow
VLMFP employs a Dual-VLM workflow orchestrated through four key stages:
- Scenario Encoding: SimVLM generates a concise natural language description summarizing the spatial relationships and object configurations derived from the input image and task instructions.
- Initial PDDL Generation: GenVLM uses this description together with the task instructions to synthesize initial candidate domain and problem files.
- Simulation Consistency Check and Feedback Collection: Random action sequences are sampled; SimVLM reports simulated transition outcomes while the symbolic planner executes the same sequences on the candidate PDDL files. Discrepancies between the simulated and symbolic outcomes are collected as feedback for refinement.
- Iterative Refinement: GenVLM receives the mismatch feedback and updates the files. This loop continues until perfect alignment is reached (measured by the EW score, see Section 4), a valid plan is found, or the maximum number of iterations is reached.
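The stages above can be sketched as a single refinement loop. This is a minimal sketch, not the paper's implementation; the method names on `simvlm`, `genvlm`, and `planner` are hypothetical stand-ins for the corresponding components:

```python
import random

def refine_pddl(image, instructions, simvlm, genvlm, planner,
                num_walks=10, walk_len=5, max_iters=3):
    """Iteratively refine GenVLM's PDDL files against SimVLM's simulation."""
    description = simvlm.describe(image, instructions)       # scenario encoding
    domain, problem = genvlm.generate(description, instructions)
    for _ in range(max_iters):
        feedback = []
        for _ in range(num_walks):                           # random exploration walks
            actions = [random.choice(simvlm.actions) for _ in range(walk_len)]
            sim_out = simvlm.execute(image, actions)         # simulated outcome
            pddl_out = planner.execute(domain, problem, actions)  # symbolic outcome
            if sim_out != pddl_out:                          # mismatch -> feedback
                feedback.append((actions, sim_out, pddl_out))
        plan = planner.solve(domain, problem)
        if not feedback and plan is not None:                # aligned and solvable
            return domain, problem, plan
        domain, problem = genvlm.refine(domain, problem, feedback)
    return domain, problem, planner.solve(domain, problem)
```

The early return encodes the termination condition: no observed mismatches and a solvable formalized problem.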
The loop terminates when no mismatches are observed and the symbolic planner can solve the formalized problem, or after a preset iteration cap (Hao et al., 3 Oct 2025).
3. Output Representation: PDDL Domain and Problem Files
The PDDL domain file encodes type signatures, predicates, and action schemas, whereas the problem file introduces instance-specific objects, initial-state predicates, and goal formulations.
As an illustrative example, in the "FrozenLake" environment the domain file defines the cell types, hazard predicates, and movement actions shared by every instance, while the problem file lists the concrete grid cells, the agent's initial position, and the goal cell. The same domain file generalizes across all instances of a given problem class, while the problem file is adapted to the particular state configuration extracted from the visual input.
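A minimal sketch of what such files might look like for a 2×2 FrozenLake grid follows. This is illustrative only, not the files generated in the paper; the predicate names `at`, `hole`, and `adjacent` are assumptions:

```pddl
;; Hypothetical FrozenLake domain sketch (not the paper's generated file)
(define (domain frozenlake)
  (:requirements :strips :typing :negative-preconditions)
  (:types cell)
  (:predicates (at ?c - cell) (hole ?c - cell) (adjacent ?a ?b - cell))
  (:action move
    :parameters (?from ?to - cell)
    :precondition (and (at ?from) (adjacent ?from ?to) (not (hole ?to)))
    :effect (and (not (at ?from)) (at ?to))))

;; Matching problem file for one 2x2 instance
(define (problem frozenlake-2x2)
  (:domain frozenlake)
  (:objects c00 c01 c10 c11 - cell)
  (:init (at c00) (hole c10)
         (adjacent c00 c01) (adjacent c01 c00)
         (adjacent c00 c10) (adjacent c10 c00)
         (adjacent c01 c11) (adjacent c11 c01)
         (adjacent c10 c11) (adjacent c11 c10))
  (:goal (at c11)))
```

Here the domain file would be reused for every FrozenLake instance; only the problem file changes with the grid extracted from the image.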
4. Evaluation Metrics and Empirical Results
Multiple quantitative indicators are established for benchmarking VLMFP (Hao et al., 3 Oct 2025):
- SimVLM Metrics:
- Task description accuracy
- Execution reason accuracy
- Execution result accuracy
- Goal-reaching judgment accuracy
- For seen and unseen appearances, SimVLM achieves high performance, with rates ranging from 82% to 95.5%.
- Planning Validity:
The success rate is defined as the proportion of tested instances where the planner, operating on the generated domain and problem files, reaches the designated goal. VLMFP with GPT-4o yields a planning validity of 70.0% for seen appearances and 54.1% for unseen ones, markedly superior to the CodePDDL baseline (30.7% and 32.3%, respectively).
- EW (Exploration Walk) Score:
EW quantifies the bidirectional agreement between SimVLM simulations and PDDL execution across sampled action sequences:

$$\mathrm{EW} = \frac{1}{2}\left(r_{\mathrm{Sim}} + r_{\mathrm{PDDL}}\right)$$

where $r_{\mathrm{Sim}}$ and $r_{\mathrm{PDDL}}$ are the average valid-sequence rates under the SimVLM and PDDL models, respectively. A high EW score indicates close alignment between the simulated and formalized domains.
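One plausible empirical estimator of this bidirectional agreement can be sketched as follows; the four callables are hypothetical hooks for sampling walks from, and checking validity under, each model:

```python
def ew_score(sample_sim, sample_pddl, valid_in_sim, valid_in_pddl, n=100):
    """Estimate the Exploration Walk (EW) agreement between two models.

    sample_sim / sample_pddl return an action sequence that the respective
    model considers valid; valid_in_sim / valid_in_pddl test whether a
    sequence is valid under that model. All four are hypothetical stand-ins.
    """
    # Rate at which SimVLM-valid walks are also valid under the PDDL model.
    r_sim = sum(valid_in_pddl(sample_sim()) for _ in range(n)) / n
    # Rate at which PDDL-valid walks are also valid under SimVLM.
    r_pddl = sum(valid_in_sim(sample_pddl()) for _ in range(n)) / n
    return 0.5 * (r_sim + r_pddl)
```

An EW score of 1.0 means every sampled walk valid under one model is valid under the other; disagreement in either direction lowers the score.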
| Metric | Seen (%) | Unseen (%) |
|---|---|---|
| SimVLM TaskDesc | 95.5 | 92.6 |
| SimVLM ExecResult | 85.5 | 87.8 |
| SimVLM GoalReach | 82.4 | 85.6 |
| VLMFP Planning Validity | 70.0 | 54.1 |
| CodePDDL Baseline Validity | 30.7 | 32.3 |
This demonstrates robust generalization both to novel visual environments and to previously unseen instance configurations.
5. Generalization and System Limitations
The generalization capabilities of Simulation-to-Rules (VLMFP) are evidenced at multiple levels:
- Visual generalization: SimVLM accuracy remains above 82% when evaluated on domains rendered in unseen visual styles.
- Rule generalization: Novel FrozenLake rule variants (e.g., teleportation, skip-action after a hazard) show SimVLM reasoning and execution accuracies between 59% and 99% in most cases. Notably, rules requiring multi-step state resets present systematic challenges: the skip-action rule shows correct textual reasoning 71% of the time but 0% execution fidelity due to state-tracking errors.
- Scalability: The same domain file supports all instances within a domain, with the problem file automatically adapted to each instance, illustrating broad intra-domain generalization.
Limitations observed include:
- Occasional omission of required predicates (e.g., directional constraints) in the generated problem file, leading to incomplete or unexecutable plans.
- In complex domains requiring many object types and intricate preconditions (e.g., Sokoban, Printer), initial generation may lack necessary constraints, with planning accuracy collapsing in the absence of iterative refinement.
- SimVLM’s state-tracking capacity limits plan fidelity in rules requiring non-local state resets or novel dynamics.
A plausible implication is that while VLMFP can formalize and generalize many classes of planning tasks, reliance on vision–language models for environment simulation imposes an upper bound on achievable logical fidelity when rules depart structurally from the training data.
6. Related Research and Positioning
VLMFP is positioned directly in response to prior hybrid approaches in neuro-symbolic planning which leverage VLMs to convert visual problems into PDDL for downstream symbolic planning, yet require human-authored domain files or extensive interaction with the environment for verification. Unique to VLMFP is the end-to-end automation of both domain and problem file synthesis and the closure of the formalization loop by iterative simulation-based refinement mediated by a pair of specialized VLMs (Hao et al., 3 Oct 2025).
Simulation-to-Rules stands in contrast to classical model-free data-driven simulation in computational mechanics (Ciftci et al., 2021), where transition rules are constructed from structured (often physical) data and classification of behavioral regimes rather than extracted from visual-linguistic input. The unification of simulation and symbolic formalization within VLMFP reflects a general trend toward integrative, data-efficient neuro-symbolic planning frameworks.
7. Significance and Future Directions
Simulation-to-Rules (VLMFP) demonstrates viable automatic formalization of complex visual planning domains with generalization to new visual styles, object configurations, and varied rule sets. The approach moves beyond instance-specific reasoning, enabling symbolic execution at scale directly from perception data.
Open challenges and future potential include:
- Enhancing VLM reasoning for rule sets requiring non-local state dependencies and memory of complex transitions
- Robust extraction of implicit constraints in domains with high combinatorial complexity
- Scaling to more general classes of planning problems beyond grid-based environments
Continued development of dual-model simulation-to-rule pipelines promises to increase the autonomy, transferability, and transparency of visual-to-symbolic planning systems.