
The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning

Published 7 Apr 2026 in cs.LG, cs.AI, and cs.CL | (2604.06427v1)

Abstract: The viability of chain-of-thought (CoT) monitoring hinges on models being unable to reason effectively in their latent representations. Yet little is known about the limits of such latent reasoning in LLMs. We test these limits by studying whether models can discover multi-step planning strategies without supervision on intermediate steps and execute them latently, within a single forward pass. Using graph path-finding tasks that precisely control the number of required latent planning steps, we uncover a striking limitation unresolved by massive scaling: tiny transformers trained from scratch discover strategies requiring up to three latent steps, fine-tuned GPT-4o and Qwen3-32B reach five, and GPT-5.4 attains seven under few-shot prompting. Although the maximum latent planning depth models can learn during training is five, the discovered strategy generalizes up to eight latent steps at test-time. This reveals a dissociation between the ability to discover a latent strategy under final-answer supervision alone and the ability to execute it once discovered. If similar limits hold more broadly, strategies requiring multiple coordinated latent planning steps may need to be explicitly taught or externalized, lending credence to CoT monitoring.

Authors (3)

Summary

  • The paper demonstrates that standard next-token supervision enables latent planning up to three steps but fails abruptly at four, indicating an inherent depth ceiling.
  • The study shows that successful strategy discovery is linked to high backtracking ratios, while increased branch complexity leads to notable learning challenges.
  • The paper reveals that scaling LLMs improves breadth but barely extends planning depth, and dense chain-of-thought supervision can overcome this limitation.

Depth Ceiling in Latent Planning: Limits of LLMs in Strategy Discovery

Problem Formulation and Task Design

The paper investigates the boundaries of latent planning within LLMs, focusing on their capacity to autonomously discover and execute multi-step strategies when supervised exclusively on the final answer. The chosen experimental domain is path-finding on star graphs, a topology with high symmetry and minimal heuristic shortcuts, where success requires genuine multi-step computation. The task demands internal propagation of information from a source to a target node, with the number of required planning steps tightly controlled via the graph depth parameter m and branch factor k. The core metric, Latent Planning Capacity (LPC), quantifies whether a model exhibits statistically significant planning above the random baseline at a given depth (Figure 1).

Figure 1: Star graph structure with controlled branch factor and depth; left bar plot illustrates latent planning capacity ceilings across model scales.
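As a concrete illustration of this task family, the sketch below generates a star-graph instance with branch factor k and depth m. The node numbering, edge-list serialization, and first-hop answer encoding are assumptions for illustration; the paper's exact format may differ.

```python
import random

def make_star_graph(k, m, seed=None):
    """Build a star graph with k branches ("spokes") of depth m.

    Node 0 is the centre; branch b consists of nodes
    1 + b*m, ..., b*m + m, chained outward from the centre.
    Returns a shuffled edge list, the source (centre), the target leaf,
    and the correct first hop, mirroring the setup where the model must
    emit the first node on the branch leading to the target.
    """
    rng = random.Random(seed)
    edges = []
    for b in range(k):
        prev = 0
        for step in range(m):
            node = 1 + b * m + step
            edges.append((prev, node))
            prev = node
    target_branch = rng.randrange(k)
    target = target_branch * m + m   # leaf of the chosen branch
    answer = 1 + target_branch * m   # correct first hop from the centre
    rng.shuffle(edges)               # remove positional cues in the input
    return edges, 0, target, answer
```

Because all branches are structurally identical and the edge list is shuffled, no local heuristic reveals the answer; the model must internally trace the path from target back to source.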

Experimental Results: From-Scratch Transformers

The study first demonstrates that small transformers (e.g., 1.6M-parameter GPT-2 variants) trained via standard next-token prediction can indeed discover latent planning strategies up to three internal steps (m = 3), contradicting prior work claiming complete failure in this regime. However, discovery abruptly fails at four steps and beyond, regardless of increases in model depth, number of heads, or hidden dimension. The failure mode diverges along two axes:

  • Breadth Limitation: When the branch factor k increases, the model fails to acquire even basic heuristics, stalling in early learning stages.
  • Depth Limitation: At higher depth m, local heuristics are mastered, but success is blocked at the stage of multi-step strategy discovery.

The learning dynamics reveal a two-phase process: an initial rise in accuracy from random guessing via neighbor prediction, followed by discrete jumps (or stagnation) associated with strategy acquisition (Figure 2).

Figure 2: Training loss and validation accuracy delineate two-stage learning: early local heuristic acquisition, late potential for strategy discovery or stagnation.
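The LPC criterion ("statistically significant planning above the random baseline") can be operationalized as a one-sided binomial test against the 1/k first-hop guessing baseline. The significance level and the exact test used in the paper are assumptions here; this is a sketch, not the authors' implementation.

```python
from math import comb

def above_chance(correct, n, k, alpha=1e-3):
    """One-sided binomial test: is accuracy significantly above the
    1/k random-guess baseline for picking the first hop?
    (alpha is our choice; the paper's threshold may differ.)"""
    p0 = 1.0 / k
    # P(X >= correct) under Binomial(n, p0)
    tail = sum(comb(n, i) * p0**i * (1 - p0)**(n - i)
               for i in range(correct, n + 1))
    return tail < alpha

def latent_planning_capacity(acc_by_depth, n, k):
    """LPC = largest depth m whose accuracy clears the significance
    test, with all shallower depths passing as well."""
    lpc = 0
    for m in sorted(acc_by_depth):
        if above_chance(round(acc_by_depth[m] * n), n, k):
            lpc = m
        else:
            break
    return lpc
```

For example, with n = 1000 evaluation instances and k = 5 branches, near-chance accuracy at depth 4 caps the LPC at 3 even if shallower depths are solved.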

Attention analysis shows that successful configurations exhibit a high backtracking ratio (BR), indicating the models implement a backtracking strategy that concentrates attention along the target-to-source path. Failed cases show uniformly distributed attention, matching blind search (Figure 3).

Figure 3: Attention maps in successful regimes (m = 3); concentration on path edges supports emergence of backtracking strategies.
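A minimal version of the backtracking-ratio probe, assuming BR is the share of attention mass landing on path-token positions (the paper's exact normalization may differ):

```python
import numpy as np

def backtracking_ratio(attn, path_positions):
    """Fraction of total attention mass on key positions belonging to
    the target-to-source path.

    attn: (num_queries, num_keys) attention weights, rows summing to 1.
    path_positions: key indices of path tokens.
    A ratio near 1 suggests attention concentrates along the planned
    path; a ratio near len(path)/num_keys suggests uniform
    (blind-search) attention.
    """
    on_path = attn[:, path_positions].sum()
    return float(on_path / attn.sum())
```

Under this definition, uniform attention over 8 keys with a 3-token path yields 3/8, while attention fully peaked on a path token yields 1.0.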

Scaling Laws and LLMs: The Depth Ceiling Persists

Scaling to frontier LLMs (Qwen, GPT-4o, GPT-5.4) improves breadth handling, enabling models to solve high-branch cases, but provides only marginal gains in depth. Fine-tuned LLMs reach an LPC of at most 5, and GPT-5.4 attains 7 under few-shot prompting; none discovers strategies for m ≥ 8. Intriguingly, models trained on m = 5 generalize to m = 8 at test time, establishing a gap between the discovery (training) ceiling and the execution (generalization) ceiling (Figure 4).

Figure 4: Out-of-distribution generalization reveals extrapolation capacity beyond training depth but systematic performance decay at increasing test depth.

Failure analysis in high-depth regimes shows that LLMs usually locate the correct branch but falter before completing all planning steps: errors concentrate "on-path" rather than at random, reflecting a strict depth ceiling in coherent execution (Figure 5).

Figure 5: Attention visualization for failed configurations (m = 4, across the tested branch factors k); no structured concentration on path edges, indicating absence of internalized planning.
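The on-path/off-path error decomposition can be sketched as a simple classifier over predicted paths, assuming the branch is identified by the first hop (the paper's exact error taxonomy is not spelled out here):

```python
def classify_error(pred_path, true_path):
    """Label a prediction as 'correct', 'on-path' (correct branch chosen
    but execution broke down partway along it), or 'off-path' (wrong
    branch entirely). Branch identity is assumed to be fixed by the
    first hop from the centre.
    """
    if pred_path == true_path:
        return "correct"
    if pred_path and pred_path[0] == true_path[0]:
        return "on-path"
    return "off-path"
```

A depth ceiling in execution shows up as a rising share of "on-path" labels with depth: the model commits to the right branch but cannot sustain the plan to the leaf.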

Dense Supervision Abolishes the Discovery Bottleneck

Explicit chain-of-thought (CoT) supervision drastically alters this landscape. When models are trained to produce the full backtracking trace as output, they solve tasks up to depth m = 20 with little difficulty and rapid convergence. This demonstrates that, in star-graph settings, the primary bottleneck arises not from intrinsic complexity or representational limits, but from the sparse nature of standard next-token supervision (Figure 6).

Figure 6: Loss convergence under explicit CoT supervision; dense step-wise feedback removes discovery ceiling.
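A dense CoT target of this kind can be constructed by walking parent pointers from the target back to the source; the trace serialization below is illustrative, not the paper's format.

```python
def backtracking_cot(edges, source, target):
    """Serialize a backtracking trace (target back to source) as a
    dense supervision target for CoT training.

    edges: list of (parent, child) pairs pointing outward from the
    centre, as in a star graph. The answer token is the first hop,
    i.e. the node just before the source in the reversed trace.
    """
    parent = {child: node for node, child in edges}
    trace = [target]
    while trace[-1] != source:
        trace.append(parent[trace[-1]])
    return " <- ".join(map(str, trace)) + f" | answer: {trace[-2]}"
```

Supervising on every token of such a trace gives the model a learning signal at each planning step, rather than only at the final answer, which is exactly the density contrast the paper isolates.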

The ICoT framework further shows that explicit reasoning sequences can be incrementally distilled back into a latent-only policy, with perfect performance at the tested depths and branch factors in small transformers, although the required capacity increases with graph complexity.

Implications and Future Directions

The discovery of a persistent depth ceiling in strategy discovery—even as model scale approaches frontier—is a strong result. It indicates that depth in latent reasoning is fundamentally shallow under standard training protocols based on next-token prediction and sparse supervision. This suggests that externalized reasoning, as in CoT, is not merely a behavioral artifact, but a necessity for complex multi-step tasks in current LLM architectures.

Practically, this strengthens the safety case for CoT monitoring: oversight strategies depending on explicit reasoning remain effective, as deep reasoning cannot be hidden in latent states. Theoretically, the results invite further investigation into computation and training signals required to overcome this depth bottleneck, including curriculum-based learning, architectural modifications, or alternative objectives. Mapping scaling laws for implicit search under unlimited compute/data and repeating these experiments in less symmetric or locally heuristic-rich domains remains an open challenge.

Conclusion

This paper presents a robust empirical examination of latent planning in LLMs, showing that autonomous discovery of deep internal strategies under sparse supervision is inherently bounded—scaling alone yields marginal improvements. By systematically isolating planning depth, the authors clarify that the “depth ceiling” represents a fundamental bottleneck in implicit strategy discovery, rather than mere representational or optimization limits. Dense supervision can bypass this bottleneck, implying that oversight mechanisms relying on explicit reasoning remain reliably enforceable for complex tasks. The broader implication is that models may be forced to externalize their reasoning to surmount the depth barrier, making chain-of-thought monitoring an enduringly viable safety lever for LLM governance.

Explain it Like I'm 14

What this paper is about (in simple terms)

This paper asks a straightforward question: Can today’s AI LLMs quietly plan several steps ahead “in their heads” without writing out the steps, and if so, how many steps can they handle? The authors find a clear limit: models can plan a few steps internally, but they hit a “depth ceiling” when the plan gets too long—unless they’re allowed to write down their reasoning.

The big idea and purpose

LLMs often do better when they “show their work” using chain-of-thought (CoT) reasoning. That’s helpful for performance and for safety, because humans can read and monitor those steps. But what if a model could secretly plan many steps without writing them? Then CoT monitoring would miss what the model is really doing. This paper tests how deep that hidden planning can go.

What questions did the researchers ask?

They focused on three easy-to-understand questions:

  • Can models figure out (discover) a multi-step planning strategy on their own if they only get told whether the final answer is right or wrong (no hints about the steps)?
  • How many “in-your-head” steps can they actually carry out correctly in one go?
  • Does making the model bigger or giving it examples help, and in what way?

How they tested it (with simple analogies)

The team used a very controlled puzzle that forces multi-step thinking:

  • Imagine a sun with a center and several equally long “spokes” (branches). One far end is secretly marked as the target. You start at the center and must pick the correct first step—the first node on the right spoke—to eventually reach the target.
  • Because all spokes look the same, there are no shortcuts or clues. You must “plan” along the path that leads from the target back to the center.
  • “Depth” means how many steps long a spoke is. “Breadth” means how many spokes there are.

They trained models only on the final answer (Which first step is correct?), with no step-by-step guidance. That means the model gets a learning signal only if it gets everything right—like being graded only on the final answer to a multi-step math problem.

They tested:

  • A tiny transformer trained from scratch (no prior knowledge).
  • Large pre-trained LLMs (like Qwen, GPT-4o), sometimes fine-tuned on the task, and sometimes given a few examples in the prompt (“few-shot”).
  • A newer, very capable model (GPT-5.4) in few-shot mode.

They also checked what happens if models are allowed to write out a backtracking strategy (like showing their work) and then learn to compress that back into hidden reasoning.

Key terms in everyday language:

  • Latent planning: the model plans internally, without writing out steps.
  • Depth: how many steps ahead the plan needs.
  • Breadth: how many choices you could make at each step (how many spokes).
  • Few-shot prompting: giving a handful of example problems and solutions in the prompt.

What they found (the main results)

Here are the main takeaways, written plainly:

  • Models can discover only shallow hidden plans:
    • A tiny transformer learned to plan up to 3 steps in its head.
    • After fine-tuning, models like GPT-4o and Qwen3-32B reached about 5 steps.
    • GPT-5.4 reached about 7 steps with a few examples in the prompt.
    • Going beyond that, discovery fails: performance collapses to guessing.
  • Bigger isn’t enough for deeper planning:
    • Scaling up (bigger models) helps handle more spokes (more choices)—that’s breadth.
    • But it barely increases how many steps ahead they can plan without writing it out—depth stays limited.
  • Discovery vs. execution:
    • Models trained on 5-step problems often could generalize their learned strategy to 6–8 steps at test time.
    • But when trained directly on 6–8 steps (with only final-answer feedback), they failed to discover the strategy.
    • This shows a split: discovering the hidden strategy is hard; once discovered, executing it for a few more steps is easier.
  • How models seem to solve it:
    • Attention patterns suggest a “backtracking” approach (mentally moving from the target back to the center).
    • When pushed to longer depths, models often chose the correct spoke but made mistakes partway along it—evidence they tried the right plan but couldn’t carry it through enough steps.
  • If you let them write out steps, the ceiling disappears:
    • When given chain-of-thought training (explicit backtracking written out), models quickly solved very deep problems (20 steps) with little training.
    • This proves the task isn’t inherently too hard—the main problem is discovering the hidden multi-step strategy from only final answers.

Why this matters

  • For performance: If a task needs many coordinated steps, current models struggle to discover and carry out that plan silently. They often need to write down their reasoning to succeed.
  • For safety and oversight: Because deep hidden planning is limited, requiring models to show their work (chain-of-thought) is not just helpful—it’s often necessary. That supports “CoT monitoring,” where humans review the model’s steps to understand and supervise it.

A short note on limits of the study

  • The tests used “star” graphs to remove shortcuts and precisely control difficulty. Real-world problems can offer hints or patterns that might help. Still, these results show a clear lower bound: even in this clean setup, hidden multi-step planning hits a ceiling.
  • The exact numbers (like 5 or 7 steps) may change with future models, but the overall pattern (discovery is hard, execution is easier, and scale alone does not buy depth) was consistent across the models tested.

Bottom line

Today’s LLMs can plan a few steps ahead silently, but they hit a “depth ceiling” when the plan gets longer. Making the model bigger helps it handle more choices but doesn’t solve the depth problem. If we want models to solve deeper, multi-step problems reliably, we usually need to let them write out their reasoning—or teach those strategies explicitly. That’s good news for using chain-of-thought as both a performance booster and a safety tool.

Knowledge Gaps

Below is a single, concrete list of unresolved gaps, limitations, and open questions that future work could address:

  • Generality beyond star graphs: test whether the depth ceiling persists on other topologies (e.g., balanced trees, grids/mazes, random graphs, directed/weighted graphs, graphs with cycles) where shortest-path structure and local cues differ.
  • Task formulation: evaluate whether requiring the full path (not just the first hop) or alternative outputs (e.g., next-hops at multiple depths, target verification) changes discovery and execution ceilings.
  • Heuristic-rich settings: introduce graphs with informative local heuristics to quantify how much heuristics reduce discovery difficulty versus genuinely deeper latent planning.
  • Graph encoding choices: assess sensitivity to input format (edge lists vs adjacency matrices, edge ordering/randomization, canonicalization), tokenization schemes, and positional embeddings (RoPE/ALiBi) on discoverability.
  • Architecture ablations at scale: systematically vary depth, width, number of heads, attention patterns (local/global), recurrence (e.g., Transformer-XL), memory modules, and MoE to isolate which components constrain latent planning depth.
  • Inference-time recurrence without external tokens: explore mechanisms that permit latent iteration across layers or time (e.g., iterative refinement with reused hidden states) while still prohibiting externalized tokens, and measure effects on depth.
  • Verification of “single forward pass” constraints: develop stronger guarantees/diagnostics that proprietary models (e.g., GPT-5.4 with “reasoning effort = none”) are not using hidden multi-pass computation.
  • Mechanistic evidence for strategy: go beyond attention-based BR by applying causal interpretability (activation patching, causal scrubbing, head/neuron ablations) to confirm whether backtracking (vs forward BFS) is actually implemented.
  • Failure localization: measure at which internal “step” execution breaks (e.g., identify the specific backtracking hop where errors concentrate) to pinpoint depth-dependent bottlenecks.
  • Breadth vs depth disentanglement: quantify memory/representation demands of branch indexing and track how breadth-related capacity limits interact with depth-related discovery limits.
  • Data scaling laws: map sample complexity vs planning depth (m) across orders of magnitude more data to test whether discovery ceilings shift with data alone and where returns diminish.
  • Optimization dynamics: probe sensitivity to optimizers, schedules, label smoothing, auxiliary regularizers, and initialization; characterize metastability and run-to-run variance in the two-stage learning dynamics.
  • Curriculum minimality: identify the minimal curriculum (fraction and distribution of shallow depths) required to unlock deep latent planning without explicit step supervision.
  • Alternative objectives under implicit-only constraint: test auxiliary losses that do not externalize reasoning (e.g., masked edge prediction, contrastive path consistency, latent-state consistency) for improving discovery.
  • RL without visible scaffold: design RL setups that enforce hidden-state-only planning (e.g., illegible intermediate tokens, auxiliary entropy minimization) and evaluate whether credit assignment improves discovery depth.
  • Few-shot prompting variables: study how demonstration diversity, ordering, and depth in context affect discovery, and whether “in-context CoT but no CoT output” alters the ceiling.
  • Pretraining influences: analyze whether code-heavy or algorithmic pretraining, larger token budgets, or synthetic curricula in pretraining shift discovery vs execution ceilings.
  • Position of the ceiling in stronger frontier models: when fine-tuning becomes available, test whether GPT-5.x-class models surpass the seven-step few-shot ceiling and whether gains come from discovery or only execution.
  • Domain transfer: probe depth ceilings in non-graph tasks with controlled latent-step counts (multi-hop QA without shortcuts, formal languages like a^n b^n, pointer-following, parity, algorithmic string transformations, math proof subgoals).
  • Multimodal latent planning: assess whether analogous depth ceilings emerge in vision-language planning tasks (e.g., instruction-following in image-based mazes) under implicit-only constraints.
  • Statistical thresholds and multiple testing: evaluate robustness of LPC conclusions to different α-levels, multiple-comparison corrections, and alternative skill metrics.
  • Sequence length and context effects: disentangle impacts of input length vs planning depth by fixing token budget and varying m, and test whether longer contexts or memory compression mitigate discovery limits.
  • Decoding strategies: test whether sampling + self-consistency (without external CoT) improves latent planning execution vs greedy decoding, and whether it shifts the ceiling.
  • Distillation and re-internalization at scale: extend ICoT-style compression to larger models and harder graphs to test whether explicit-to-implicit distillation can reliably surpass the discovery ceiling without sacrificing breadth.
  • Safety implications: assess whether adversarial incentives can push models to covertly compute more steps latently despite the ceiling, and develop monitors that can detect partial latent execution plus partial externalized CoT.

Practical Applications

Summary

This paper identifies a “depth ceiling” in current LLMs’ ability to discover and execute multi-step planning strategies entirely within a single forward pass (latent reasoning) when trained only with final-answer supervision. Small transformers discover up to three latent planning steps; fine-tuned GPT-4o/Qwen3-32B reach five; GPT-5.4 achieves seven with few-shot prompting. Once a latent strategy is discovered, models can generalize it a few steps further at test time (up to eight), but they fail to discover deeper strategies under sparse supervision alone. Externalizing reasoning via chain-of-thought (CoT) immediately circumvents this bottleneck (solving 20-step cases with minimal training). The work introduces practical probes and metrics (e.g., Latent Planning Capacity, attention-based backtracking ratio) and shows scaling helps planning breadth more than depth. These findings have direct implications for how to design, deploy, train, evaluate, and regulate LLM-based systems.

Below are practical applications, grouped by deployment horizon.

Immediate Applications

The following applications can be implemented now with existing tools and workflows.

  • CoT-first deployment policies for complex tasks
    • Sector: software, healthcare, finance, education, policy
    • Use: Default to step-by-step reasoning (CoT) or tool-based planning for tasks likely to require more than ~5–7 latent steps (e.g., long-horizon analysis, multi-stage decisions).
    • Tools/workflows: CoT-enabled prompts; “explain your steps” templates; enforced rationale fields in UIs.
    • Dependencies/assumptions: CoT logging is available and not deliberately suppressed; privacy/compliance processes for storing rationales.
  • Depth-aware routing in agent frameworks
    • Sector: software/AI agents, enterprise automation
    • Use: Route tasks predicted to require deeper multi-step planning to explicit planners (e.g., programmatic search, symbolic solvers) or to CoT mode; keep shallow tasks in single-pass mode.
    • Tools/workflows: Simple heuristics (task length/structure), pilot “depth” classifiers calibrated on LPC; integration with planning libraries (A*, BFS), program synthesis modules, or tool-calling workflows.
    • Dependencies/assumptions: Access to tool-execution environment; latency/cost tolerance for multi-step execution.
  • Adopt the paper’s metrics for evaluations
    • Sector: ML evaluation, MLOps
    • Use: Include Latent Planning Capacity (LPC) and empirical skill in eval dashboards; monitor planning breadth vs depth; add attention-based Backtracking Ratio (BR) probes for strategy audits.
    • Tools/workflows: Internal eval suites; implement star-graph tasks across k (breadth) and m (depth); simple attention inspection for smaller models.
    • Dependencies/assumptions: For proprietary LLMs, attention probes may not be available; rely on behavioral metrics and error structure instead.
  • Training curricula that mix depths to bootstrap strategies
    • Sector: model training/finetuning
    • Use: When latent execution is desired, train on a curriculum containing shallow depths alongside deeper instances to enable discovery and subsequent generalization (as prior work suggests).
    • Tools/workflows: Data schedulers; synthetic generators controlling planning depth; progressive difficulty pipelines.
    • Dependencies/assumptions: Access to finetuning; compute budget; curriculum does not leak shortcuts.
  • CoT-to-implicit compression for performance
    • Sector: model optimization
    • Use: Train with explicit backtracking CoT to teach the algorithm, then apply ICoT/implicit-CoT methods to compress parts of the reasoning back into latent space for speed.
    • Tools/workflows: CoT finetuning; ICoT or distillation frameworks; regression tests on LPC.
    • Dependencies/assumptions: Distillation success depends on model capacity and task structure; privacy considerations for CoT traces.
  • Error analytics to diagnose “depth ceiling” failures
    • Sector: MLOps, QA, research
    • Use: Use on-path vs off-path error decomposition to identify when models find the right branch but fail to complete all planning steps; prioritize CoT/tooling for those cases.
    • Tools/workflows: Structured error tagging; dashboards highlighting on-path error ratios as depth increases.
    • Dependencies/assumptions: Task representations that support structural error labeling.
  • Product design for transparency in high-stakes domains
    • Sector: healthcare, finance, legal, public sector
    • Use: Require structured CoT reports for differential diagnosis, risk analysis, or legal reasoning; gate single-pass outputs behind review if estimated depth is high.
    • Tools/workflows: Templates (e.g., differential diagnosis schemas, investment memo outlines); reviewer workflows; red-teaming with controlled-depth tasks.
    • Dependencies/assumptions: Human oversight capacity; regulatory constraints on logging and disclosure.
  • Data and compute prioritization
    • Sector: ML program management
    • Use: Avoid spending large budgets trying to learn deep latent planning at a single fixed depth under only final-answer supervision; instead, invest in CoT training and/or curricula.
    • Tools/workflows: Budget allocation policies; ablation plans that track discovery vs execution.
    • Dependencies/assumptions: Organizational alignment on training goals.
  • Domain-specific scaffolding for planning-heavy tasks
    • Sector: robotics/operations/logistics, software engineering
    • Use: Combine LLMs with explicit planners (task-and-motion planners, routing solvers) and require stepwise plans; in code assistants, enforce plan-of-action and unit-test scaffolds.
    • Tools/workflows: Agent toolchains (e.g., graph search, OR solvers, simulators), test-driven development scaffolds.
    • Dependencies/assumptions: Integration engineering; latency budgets.
  • Benchmarking and auditing kits
    • Sector: academia/industry benchmarks, safety
    • Use: Release star-graph and similar structured latent-depth tasks into eval suites to detect Clever Hans shortcuts and depth ceilings; use as admission tests for new models.
    • Tools/workflows: Public benchmark repos; CI hooks to prevent regressions.
    • Dependencies/assumptions: Community adoption; maintenance of task generators.
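The depth-aware routing idea above can be sketched as a small policy. The thresholds are illustrative: latent_ceiling=5 echoes the fine-tuned-LLM LPC reported here, and the extra CoT margin loosely reflects the discovery-vs-execution gap; both should be calibrated per model.

```python
def route_by_depth(depth_estimate, latent_ceiling=5):
    """Depth-aware routing: shallow tasks stay in fast single-pass
    mode, mid-depth tasks are escalated to chain-of-thought, and very
    deep tasks go to an explicit planner or tool call.
    Thresholds are illustrative assumptions, not prescriptions.
    """
    if depth_estimate <= latent_ceiling - 2:   # comfortable margin below the ceiling
        return "single_pass"
    if depth_estimate <= latent_ceiling + 3:   # near the ceiling, CoT is cheap insurance
        return "chain_of_thought"
    return "explicit_planner"
```

In practice the hard part is the depth estimator itself; absent a calibrated classifier, simple proxies (task length, number of dependent sub-decisions) can seed the policy.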

Long-Term Applications

These applications require further research, scaling, or development.

  • Architectures with deeper latent planning capacity
    • Sector: ML research
    • Use: Explore recurrence, memory-augmented transformers, neural algorithmic reasoning modules, or differentiable search components to extend latent planning depth without external tokens.
    • Tools/workflows: Recurrent decoding, scratchpad memories, graph neural modules, algorithmic inductive biases.
    • Dependencies/assumptions: Training stability; transfer to open-domain tasks.
  • Self-discovered curricula and latent-state supervision
    • Sector: ML research/training
    • Use: Develop training regimes that autonomously expose intermediate planning depths or latent subgoals (e.g., next-latent-state prediction) while preserving generality.
    • Tools/workflows: Unsupervised subgoal discovery, latent rollouts, contrastive objectives for multi-step consistency.
    • Dependencies/assumptions: Avoiding overfitting to scaffolds; compatibility with large-scale pretraining.
  • Depth-aware orchestration for autonomous agents
    • Sector: software/robotics/enterprise automation
    • Use: Create orchestrators that estimate required planning depth and dynamically choose between single-pass, CoT, search, or programmatic solvers; enforce autonomy gates based on depth.
    • Tools/workflows: Depth estimators, policy engines, safety thresholds, fallback trees.
    • Dependencies/assumptions: Reliable depth signals; acceptance of performance–latency trade-offs.
  • Formal CoT monitoring standards and regulation
    • Sector: policy, compliance, safety
    • Use: Establish standards for logging, storing, and auditing reasoning traces; specify minimum transparency levels for high-stakes deployments; define conformance tests using LPC thresholds.
    • Tools/workflows: Compliance toolkits; cryptographic signing of CoT; third-party audits.
    • Dependencies/assumptions: Regulatory will; managing privacy/IP concerns.
  • General-purpose planning coprocessors
    • Sector: AI infrastructure, software
    • Use: Develop reusable “planning coprocessors” (graph search, theorem provers, constraint solvers) that LLMs can call for deep planning—abstracted behind simple APIs.
    • Tools/workflows: Service-oriented planning modules; cost-aware routers; caching of subplans.
    • Dependencies/assumptions: Interoperability; monitoring of solver reliability.
  • Cross-domain latent-depth benchmarks
    • Sector: academia/benchmarks
    • Use: Build benchmarks that precisely control multi-step depth in domains like theorem proving, code synthesis, math word problems, and robotics, to generalize LPC beyond graphs.
    • Tools/workflows: Synthetic generators with guaranteed depth; evaluation harnesses for execution vs discovery.
    • Dependencies/assumptions: Agreement on measurement protocols; avoiding heuristic shortcuts.
  • Mechanistic interpretability of latent strategies
    • Sector: interpretability research
    • Use: Extend backtracking-ratio-style probes to other domains; identify circuit motifs for latent multi-step computation; develop universal “latent depth” diagnostics.
    • Tools/workflows: Attention flow analyses, probing classifiers, activation patching for depth-tracking.
    • Dependencies/assumptions: Access to model internals; robustness across scales.
  • Adaptive compute and token budgeting
    • Sector: inference systems, cloud AI
    • Use: Build systems that allocate more tokens (CoT) or multiple passes when predicted depth is high; otherwise keep low-latency single-pass generation.
    • Tools/workflows: Dynamic compute policies; budget-aware scheduling; early-exit criteria tied to depth estimates.
    • Dependencies/assumptions: Reliable depth predictors; predictable cost-performance curves.
  • High-assurance workflows for safety-critical decisions
    • Sector: healthcare, finance, critical infrastructure
    • Use: Design workflows that require explicit multi-step reasoning artifacts, tool-assisted plans, and human sign-off for deep-planning tasks; integrate LPC-based gating into SOPs.
    • Tools/workflows: Structured decision templates; plan verifiers; audit trails.
    • Dependencies/assumptions: Institutional buy-in; training of operators.
  • Consumer-facing “transparent mode”
    • Sector: consumer assistants, education
    • Use: Offer a user-selectable mode where the assistant reveals its reasoning for complex requests and allows step-by-step verification or editing.
    • Tools/workflows: UI for rationale viewing and editing; versioned plans; user education.
    • Dependencies/assumptions: UX research; privacy considerations.
  • Training data pipelines tuned to discovery vs execution
    • Sector: ML ops
    • Use: Separate phases for (1) strategy discovery with dense supervision/CoT and (2) execution generalization; schedule data to optimize both goals.
    • Tools/workflows: Two-stage finetuning; automated detection of discovery failure modes; evaluation gates aligned to LPC.
    • Dependencies/assumptions: Sufficient data diversity; careful avoidance of shortcut learning.
  • Robustness and red-teaming against latent-only shortcuts
    • Sector: safety, security
    • Use: Create adversarial tests that look for shallow latent heuristics; enforce CoT monitoring where latent-only success would be risky (e.g., jailbreak defenses, deceptive planning).
    • Tools/workflows: Attack suites that vary depth; detectors for hidden rationales; policy rules for mandatory transparency.
    • Dependencies/assumptions: Continuous updating as models evolve.
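
The depth-aware routing pattern above reduces to a simple gating rule: compare an estimated planning depth against the model's observed latent ceiling and fall back to explicit CoT (or multiple passes) only when the ceiling is exceeded. The sketch below is a hypothetical illustration, not from the paper; the ceiling value, the mode names, and the existence of a reliable depth estimator are all assumptions.

```python
# Hypothetical depth-aware compute routing. The ceiling value is an
# assumption (e.g., the ~5-step latent ceiling reported for fine-tuned
# LLMs in this paper); a real deployment would calibrate it per model.
LATENT_DEPTH_CEILING = 5

def choose_inference_mode(estimated_depth: int,
                          ceiling: int = LATENT_DEPTH_CEILING) -> str:
    """Route a request based on its estimated latent planning depth."""
    if estimated_depth <= ceiling:
        # Within the latent ceiling: low-latency single-pass generation.
        return "single_pass"
    # Beyond the ceiling: allocate explicit reasoning tokens (CoT).
    return "cot"
```

In practice the depth estimator is the hard part; the paper's benchmark controls depth exactly, but real tasks require a learned or heuristic predictor, and the cost-performance trade-off depends on how well it is calibrated.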

Notes on Assumptions and Dependencies

  • Generalization from star-graph findings: While the benchmark cleanly isolates planning depth, real-world tasks may contain heuristics that partially bypass deep planning. Many applications above assume the depth ceiling observed here extends (at least partially) to other domains.
  • Model and API constraints: Some proprietary models do not expose attention or allow finetuning; use behavioral metrics (LPC, error structure) and orchestration instead.
  • Privacy/compliance: CoT logging and storage raise privacy/IP concerns; deployments must implement appropriate governance controls.
  • Cost/latency trade-offs: CoT and tool-calling increase latency and cost; depth-aware routing helps optimize these trade-offs.
  • Organizational incentives: Applying curricula, CoT monitoring, and planning coprocessors requires alignment across product, engineering, and safety teams.

Glossary

  • Attention heads: Parallel attention subcomponents within a transformer layer that allow multiple representation subspaces to be attended concurrently. "Increasing model depth or the number of attention heads does not overcome this limitation (see Appendix~\ref{appendix:hyperparameters})."
  • Attention maps: Visualizations of attention weight patterns that reveal which tokens or graph elements a model focuses on. "We further conduct a qualitative analysis by visualizing the attention maps of the trained transformer (see Appendix~\ref{appendix:attention_visualization})."
  • Autoregressive transformer: A transformer that predicts each next token conditioned on all previous tokens in the sequence. "we train an autoregressive transformer following the standard GPT-2 architecture \citep{radford2019language} from-scratch, using GELU \citep{hendrycks2016gaussian} as the activation function."
  • Backtracking: A strategy that traces the path from target back to source through sequential steps. "Attention analysis on the transformer trained from scratch suggests that successful models learn a backtracking strategy that concentrates attention along the target-to-source path."
  • Backtracking ratio (BR): A metric quantifying how much attention mass is allocated to edges on the true target-to-source path. "we define a backtracking ratio (BR) that measures the fraction of edge-token attention allocated to edges on the path between $v_{\mathrm{target}}$ and $v_{\mathrm{source}}$ (formal definitions are provided in Appendix~\ref{appendix:attention_probing})."
  • Bootstrapping: Leveraging simpler learned behaviors to build and generalize to more complex strategies. "such curricula allow models to bootstrap simpler strategies to more complex cases."
  • Breadth-first approach: A forward strategy that expands all branches level-by-level in parallel when searching a graph. "The most natural variant is a parallel breadth-first approach, in which the model simultaneously tracks all $k$ branches, advancing one depth level per computational step and checking at each level whether any branch reaches $v_{\mathrm{target}}$."
  • Chain-of-thought (CoT): Explicit, step-by-step intermediate reasoning produced as text tokens. "Chain-of-thought (CoT) reasoning is one of the main drivers of progress in LLMs."
  • Clever Hans cheat: Exploiting superficial cues or shortcuts instead of performing the intended reasoning. "causing models to exploit superficial greedy shortcuts (the Clever Hans cheat), though their analysis does not examine shallow planning depths where success may still be possible."
  • Cross-entropy loss: A standard probabilistic loss for next-token prediction that penalizes divergence from the target distribution. "All models are trained via standard next-token prediction with cross-entropy loss on the final answer, without supervision on intermediate reasoning steps (see Appendix~\ref{appendix:ntp} for details)."
  • Critical threshold: The minimum empirical skill needed to statistically exceed chance performance at a given confidence level. "where $\tau_{\mathrm{crit}}(k, \hat{N}, \alpha)$ denotes the minimum empirical skill required to reject random guessing"
  • Depth-first variant: A search strategy that explores one branch fully before backtracking to try others. "A sequential depth-first variant, which fully traverses one branch before returning to try the next, is less plausible: it requires up to $O(k \cdot m)$ steps in the worst case"
  • Distillation: Transferring behavior or representations from a teacher model into a student model by matching internal signals. "CODI \citep{hao2024training} distills hidden states from a teacher model with full CoT."
  • Empirical skill: A normalized performance measure that adjusts accuracy relative to the chance baseline for a given branch factor. "we therefore first normalize accuracy into an empirical skill score $\text{Skill}(\pi_{\theta}, k, m)$, where a value of $1$ indicates perfect performance and $0$ corresponds to random guessing"
  • Externalize their reasoning: Produce intermediate reasoning steps as explicit tokens rather than keeping them latent. "in a control setting where models are allowed to externalize their reasoning by training with a backtracking strategy in the chain of thought, they successfully solve graphs requiring twenty lookahead steps"
  • Few-shot prompting: Providing a small number of in-context examples to guide model behavior without parameter updates. "attains seven under few-shot prompting."
  • Forward pass: A single execution of a model to produce outputs without iterative internal deliberation across tokens. "execute them latently, within a single forward pass."
  • GELU: Gaussian Error Linear Unit, a smooth nonlinearity used in neural network layers. "using GELU \citep{hendrycks2016gaussian} as the activation function."
  • Generalization ceiling: The maximum depth beyond training at which a discovered strategy continues to work at test time. "This leads to a slightly higher 'generalization ceiling' at eight steps than the 'discovery ceiling' at seven steps (Figure \ref{fig:lpc_bar})."
  • Greedy decoding: Decoding by selecting the highest-probability token at each step without search. "under greedy decoding."
  • Hidden reasoning trace: An internal chain-of-thought mode that some models can produce but which can be disabled. "To ensure a fair comparison, we disable the hidden reasoning trace of GPT-5.4 by setting the reasoning effort parameter to none"
  • ICoT framework: A method for compressing explicit chain-of-thought into latent representations. "using the ICoT framework \citep{deng2023implicit,deng2024explicit}"
  • Illegible intermediate tokens: Non-readable tokens inserted to stand in for hidden reasoning steps without exposing content. "inserting illegible intermediate tokens in place of explicit reasoning traces \citep{bachmann2024pitfalls}."
  • Induction heads: Learned attention patterns that copy or continue patterns from earlier in the sequence. "a shortcut achievable through shallow pattern-matching mechanisms such as induction heads \citep{olsson2022context}."
  • Latent planning: Multi-step planning performed within hidden representations without emitting intermediate tokens. "We study latent planning using path-finding on star graphs."
  • Latent planning capacity (LPC): A binary indicator of whether performance at a given depth exceeds chance by a statistically significant margin. "we define the latent planning capacity (LPC) to capture whether a model exhibits any statistically significant evidence of planning at a given depth $m$."
  • Latent reasoning: Reasoning carried out in a model’s internal representations rather than in explicit text. "Yet little is known about the limits of such latent reasoning in LLMs."
  • Lookahead steps: The number of future steps a model must mentally simulate to plan correctly. "models can solve instances requiring many latent lookahead steps"
  • Next-token prediction: Training objective where the model predicts the next token given previous tokens. "We train with next-token prediction because reinforcement learning settings typically permit reasoning to be externalized through intermediate tokens, making latent planning difficult to enforce."
  • Out-of-distribution (OOD) generalization: Performance on test cases that differ from the training distribution, such as longer depths. "Out-of-distribution (OOD) generalization of latent planning across depths."
  • Overfitting: Memorizing training data rather than learning a generalizable strategy. "whereas failure results in the model simply memorizing the training graphs and overfitting."
  • Random baseline: The chance level of accuracy given uniform guessing over available options. "Since tasks with different branch factors $k$ yield different random baselines, raw accuracy does not allow fair comparison across configurations."
  • Random guessing: Selecting answers uniformly at random, providing a chance-performance reference. "statistically distinguishable from random guessing"
  • Reinforcement learning: Training paradigm where behavior is optimized via reward signals rather than supervised targets. "reinforcement learning settings typically permit reasoning to be externalized through intermediate tokens"
  • Separator token: A special token used to delimit structural segments in the input encoding. "$s$ is a separator token between edges"
  • Significance level: The predefined probability threshold for rejecting a null hypothesis in statistical testing. "at significance level $\alpha = 10^{-5}$"
  • Star graphs: Graphs with a central hub and multiple equal-length branches emanating from it. "We study latent planning using path-finding on star graphs."
  • Strategy discovery depth ceiling: The maximum planning depth at which a model can learn a strategy under sparse final-answer supervision. "Strategy Discovery Depth Ceiling To efficiently identify the maximum depth at which each model can discover a planning strategy, we adopt a progressive training procedure."
  • Two-stage learning process: A learning dynamic where a simple heuristic emerges first, followed by attempted acquisition of the full multi-step strategy. "the model typically exhibits a two-stage learning process."
  • Zero-shot prompting: Prompting without any in-context examples. "We evaluate LLMs under zero-shot and few-shot prompting"
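
The "empirical skill", "random baseline", "significance level", and "latent planning capacity (LPC)" entries above can be made concrete with a small sketch. This is an illustrative reconstruction, not the paper's code: it assumes the chance baseline for branch factor $k$ is $1/k$ and uses a one-sided exact binomial test at significance $\alpha$; the paper's precise $\tau_{\mathrm{crit}}(k, \hat{N}, \alpha)$ construction is given in its appendix.

```python
import math

def empirical_skill(accuracy: float, k: int) -> float:
    """Normalize accuracy so that 1 = perfect and 0 = random guessing.

    Assumes the random baseline for branch factor k is 1/k.
    """
    chance = 1.0 / k
    return (accuracy - chance) / (1.0 - chance)

def binom_sf(successes: int, n: int, p: float) -> float:
    """P(X >= successes) for X ~ Binomial(n, p), via the exact tail sum."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(successes, n + 1))

def latent_planning_capacity(correct: int, n: int, k: int,
                             alpha: float = 1e-5) -> bool:
    """Binary LPC indicator: True iff observed accuracy exceeds the 1/k
    chance level at significance alpha under a one-sided binomial test."""
    return binom_sf(correct, n, 1.0 / k) < alpha
```

For example, 80/100 correct at branch factor $k=5$ is overwhelmingly significant at $\alpha = 10^{-5}$, while 20/100 is exactly chance level and yields LPC = False.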

