Analysis of Optimality of Large Language Models on Planning Problems

Published 3 Apr 2026 in cs.AI and cs.CL | (2604.02910v1)

Abstract: Classic AI planning problems have been revisited in the LLM era, with a focus of recent benchmarks on success rates rather than plan efficiency. We examine the degree to which frontier models reason optimally versus relying on simple, heuristic, and possibly inefficient strategies. We focus on the Blocksworld domain involving towers of labeled blocks which have to be moved from an initial to a goal configuration via a set of primitive actions. We also study a formally equivalent task, the generalized Path-Star ($P^*$) graph, in order to isolate true topological reasoning from semantic priors. We systematically manipulate problem depth (the height of block towers), width (the number of towers), and compositionality (the number of goal blocks). Reasoning-enhanced LLMs significantly outperform traditional satisficing planners (e.g., LAMA) in complex, multi-goal configurations. Although classical search algorithms hit a wall as the search space expands, LLMs track theoretical optimality limits with near-perfect precision, even when domain-specific semantic hints are stripped away. To explain these surprising findings, we consider (and find evidence to support) two hypotheses: an active Algorithmic Simulation executed via reasoning tokens and a Geometric Memory that allows models to represent the $P^*$ topology as a navigable global geometry, effectively bypassing exponential combinatorial complexity.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper demonstrates that advanced LLMs, particularly Gemini 3.0 Pro, can generate optimal plans in complex combinatorial domains via serial algorithmic simulation.
It employs a graph-rewrite abstraction to isolate genuine topological reasoning, validating plan optimality across multiple scaling axes such as width, depth, and compositionality.
Comparative analysis reveals that the LLM outperforms classical planners until computational limits induce an abrupt failure mode, underscoring its structured reasoning capabilities.

Analysis of Optimality of LLMs on Planning Problems

Introduction

This work rigorously examines the optimality and reasoning boundaries of LLMs in solving structured combinatorial planning problems. Departing from standard coverage-based benchmarks, the study investigates whether frontier LLMs pursue optimal strategies or resort to suboptimal heuristics when plan space complexity escalates. The principal focus is the Blocksworld domain and its generalized $P^*$ (Path-Star) graph formalization, enabling systematic manipulation of planning depth, width, and compositionality. Notably, the analysis isolates genuine topological reasoning by obfuscating semantic priors via a graph-rewrite abstraction.

Structural Formalization: Blocksworld as $P^*$ Graph

The Blocksworld domain is reformulated as a generalized $P^*$ graph, where the table corresponds to the root node, and each stack of blocks forms a disjoint branch. This structural isomorphism frames the planning challenge as a multibranch path-traversal problem requiring dependency chain reconstruction, branch identification, and sequential clearing/stacking operations.

Figure 1: Structural correspondence between Blocksworld's physical configuration and $P^*$ graph topology, establishing plan composition as structured traversal.

Experimental Methodology

A synthetic benchmark grid is constructed along four complexity axes:

Harvest: Increasing number of goal blocks (width scaling)
High Towers: Deepest tower extraction (depth scaling)
Interleaved Harvest: Entangled goal dependencies (compositionality)
Grand Challenge: Simultaneous, holistic scaling

To rigorously stress reasoning, the LLM is provided with only a single-instruction demonstration, the full PDDL domain, and the query (no chain-of-examples). Plans are validated using the VAL validator, and optimality is measured via analytical cost decompositions. Critically, graph-rewrite versions strip all semantic cues present in Blocksworld, isolating structure-driven reasoning.

Strong Empirical Results and Comparative Analysis

The reasoning-enhanced LLM (Gemini 3.0 Pro) is benchmarked against classical planners: A*/LM-Cut (guaranteed optimal, but computationally infeasible) and LAMA-2011 (satisficing, suboptimal in hard regimes).

Figure 2: Plan cost curves as goal block count and tower height increase, highlighting the LLM's ability to track theoretical optimality well beyond classical planner breakdown.

Across all scenarios, Gemini 3.0 Pro consistently generates optimal solutions up to complexity regimes where both A*/LM-Cut and LAMA either fail or become arbitrarily suboptimal. In the width-scaling Harvest curriculum, Gemini maintains optimality beyond 100 goal blocks/towers, unlike LAMA, which breaks at ~25. In depth-intensive High Towers and the combinatorially extreme Grand Challenge, the LLM sustains near-perfect optimality with a sharp transition to a "zone of instability," where it fails abruptly, rather than gradually degrading.

Presenting the results on compute allocation:

Figure 3: (a) Plan cost vs. problem volume—the LLM sustains optimality as classical planners collapse; (b) Linear scaling of reasoning token utilization with optimal plan length until inference-time failure boundary.

A striking linear proportionality is observed between the number of "thinking tokens" (inference-time reasoning steps) and optimal plan length—roughly 47 tokens per optimal step. No evidence is found that the model trades optimality for reduced reasoning: it either successfully simulates an optimal plan or fails catastrophically due to compute/service constraints.

Mechanistic Analysis: Serial Algorithmic Simulation versus Geometric Memory

Analysis of Gemini's internal thought traces reveals explicit enumeration of dependency chains for target blocks and systematic procedural decomposition—clear evidence of a stepwise Serial Algorithmic Simulation. The model identifies the correct branch of the $P^*$ structure, recursively traverses up the dependency chain, sequentially clears obstacles, and performs stacking operations, all represented in structured intermediate summaries.

Nonetheless, unexplained retrieval efficiency—how the correct dependency chain is extracted from a randomly permuted predicate context—hints at latent geometric or topological representations (Geometric Memory). While these may enable $O(1)$ lookup of target branches in latent space, current evidence predominantly supports a serial, simulation-driven modus operandi.

Robustness to Semantic Decoupling: Graph-Rewrite Experiments

A key aspect is obfuscating domain semantics. In the graph-rewrite formalization, where all physical/semantic cues are stripped, the LLM continues to operate near the optimal bound, albeit with increased instability at higher scales. The model demonstrates authentic topological reasoning, not mere template memorization.

Figure 4: Example of a Grand Challenge plan with scrambled predicate encoding, evidencing resilience to domain obfuscation.

Figure 5: Interleaved Harvest illustrates successful planning with deep cross-tower dependencies, far surpassing classical search feasibility.

Implications for Planning, LLM Capabilities, and AI

The research provides strong evidence that reasoning-augmented LLMs are capable of systematic symbolic plan search at scales where classical, even optimal, planners collapse under combinatorial pressure. The ability to decompose the planning graph, isolate independent subproblems, and simulate optimal strategies strictly serially blurs previous assumptions that LLMs cannot handle true reasoning or must rely on shallow pattern completion. The findings contradict earlier claims about the lack of generalizable planning skills in LLMs even when deprived of domain-specific training priors.

Practically, this work underwrites the feasibility of deploying LLMs in combinatorially intensive planning domains where symbolic search methods become intractable, given sufficient (and dynamically allocated) inference-time compute. Theoretically, it motivates re-examining the internal architectures of deep sequence models to interrogate the interplay of serial processing, latent topological representations, and memory bottlenecks in supporting advanced reasoning.

It also exposes the abrupt, rather than gradual, failure modes of LLMs: as serial simulation exhausts compute, the transition is from strict optimality to outright plan generation failure, not to decline into heuristic suboptimality.

Future Directions

Several new questions arise for the research community:

Extending the $P^*$ structural taxonomy to non-trivial, non-decomposable topologies (e.g., loopy, highly entangled graphs)
Generalizing findings to other state-of-the-art LLMs to ascertain universality of serial token scaling and structural failure boundaries
Probing internal attention, activation, or latent geometry for direct evidence of geometric navigation or $O(1)$ branch retrieval
Designing adaptive compute allocation strategies for planning during inference-time

Conclusion

This study demonstrates that advanced LLMs equipped for reasoning not only rival but often surpass classical planning methods on structurally complex domains, maintaining optimality until hard computational limits are met. The combination of topological generalization, robustness to semantic obfuscation, and systematic simulation provides new insights into the representational and computational capabilities of modern generative models in planning-centric contexts.

Figure 1: Structural isomorphism between Blocksworld and $P^*$ graph; planning reduces to path traversal in a generalized dependency graph.

Figure 2: Gemini 3 Pro matches or beats LAMA on both goal-count and tower-height scaling, with errors only at extreme instance sizes.

Figure 3: LLM sustains linear compute scaling with plan length, failing only at strict inference-step limits; classical planners become immediately intractable as complexity grows.

Figure 4: Grand Challenge initial/goal configuration—Gemini 3 Pro generates plans in regimes unreachable by classical planners.

Figure 5: Interleaved Harvest instance with extensive goal dependencies, highlighting complexity that is challenging for non-LLM planners.

Markdown Report Issue