
ALPINE: Autoregressive Planning in Networks

Updated 12 February 2026
  • The paper introduces ALPINE, a framework that recasts planning as discrete token sequence generation on graphs to bridge next-token prediction with path-finding.
  • It demonstrates that even simple Transformer models can accurately encode observed adjacency and reachability, yet fail to infer the transitive closure of reachability.
  • Empirical evaluations on synthetic DAGs and Blocksworld confirm that while direct path prediction is robust, compositional planning remains a critical limitation.

Autoregressive Learning for Planning In NEtworks (ALPINE) is a theoretical and empirical investigation into how standard Transformer-based LLMs, trained via autoregressive next-token prediction, acquire and execute planning capabilities when such planning is cast as path-finding over networks. The ALPINE framework formalizes planning as network path-finding, characterizes precisely what is learned by Transformers under cross-entropy training, and reveals fundamental limitations—most notably, the inability to infer transitive reachability—from both theoretical and experimental perspectives (Wang et al., 2024).

1. Formalization of Path-Finding as Autoregressive Prediction

ALPINE models planning as a discrete token sequence generation problem: given a directed graph $\mathcal{G}$ with node set $\mathcal{V}$, adjacency matrix $A^{\mathrm{true}} \in \{0,1\}^{|\mathcal{V}| \times |\mathcal{V}|}$ (where $A^{\mathrm{true}}_{i,k} = 1$ iff edge $i \to k$ exists), and reachability matrix $R^{\mathrm{true}} \in \{0,1\}^{|\mathcal{V}| \times |\mathcal{V}|}$ (where $R^{\mathrm{true}}_{t,k} = 1$ iff a path $k \to \ldots \to t$ exists), the task is to generate a token sequence $u = (s, t, s, u_1, u_2, \ldots, u_p, t, \texttt{\textbackslash n})$ representing a path from source $s$ to target $t$. The vocabulary size is $M = |\mathcal{V}| + 1$.

Transformers are trained to autoregressively predict each next token using a standard architecture:

  • Input encoding: $H_0 = U W_t + W_p$, where $U$ is the one-hot token matrix and $W_t$, $W_p$ are learned token and positional embedding weights.
  • Layerwise propagation: $H_\ell = \mathrm{TransformerLayer}(H_{\ell-1})$.
  • Output: $\hat{u}_{n+1} = \mathrm{softmax}(\mathrm{LN}_t(H_L)_{(n,:)} W_o)$.
  • Training loss: $\ell = -\sum_{n=1}^{N-1} U_{(n+1)} \cdot \log \hat{u}_{n+1}$.

This formulation provides a direct bridge between next-token prediction and the mechanistic implementation of planning as path-finding.
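To make the sequence format concrete, here is a minimal Python sketch of the serialization described above; the helper names (`encode_path`, `next_token_pairs`) are illustrative, not from the paper.

```python
# Minimal sketch of ALPINE's sequence format; encode_path and
# next_token_pairs are illustrative helper names, not from the paper.

def encode_path(path, num_nodes):
    """Serialize a path [s, u1, ..., t] as (s, t, s, u1, ..., t, EOL)."""
    eol = num_nodes               # extra token, so vocabulary size M = |V| + 1
    s, t = path[0], path[-1]
    return [s, t] + list(path) + [eol]

def next_token_pairs(tokens):
    """(prefix length, next token) pairs scored by the cross-entropy loss."""
    return [(n, tokens[n]) for n in range(1, len(tokens))]

seq = encode_path([0, 3, 7], num_nodes=10)
print(seq)                        # [0, 7, 0, 3, 7, 10]
print(next_token_pairs(seq))
```

Note that the target $t$ appears immediately after the source, so the model can attend to it at every generation step.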

2. Representation of Graph Structure in Self-Attention and Output Probabilities

A core result (Theorem 3.1) demonstrates that a 1-layer, 1-head Transformer with embedding dimension $d = O(|\mathcal{V}|)$ can exactly implement optimal path-finding via appropriately chosen weights:

  • Attention: Weights $W^Q = \sqrt{d}\,I$ and $W^K$ ensure all attention mass focuses on the target-token position.
  • Values: Weights $W^V$ at the target token position embed the row $R^{\mathrm{true}}_{(t,:)}$ of the true reachability matrix.
  • Feedforward/MLP: Weights $W_1, W_2$ extract the row $A^{\mathrm{true}}_{(u_k,:)}$, representing outgoing edges from the current node $u_k$.
  • Final logits: At each next-token step,

$$\mathrm{logit}_k \propto c_1 R^{\mathrm{true}}_{t,k} + c_2 A^{\mathrm{true}}_{u_k,k},$$

where $c_1, c_2$ are scaling factors. The softmax over these logits concentrates essentially all probability mass on exactly those $k$ that are adjacent to the current node $u_k$ and from which the target $t$ remains reachable.

This construction unveils how Transformers, in principle, can internalize both adjacency ($A$) and reachability ($R$) information in their parameters and use these to make path-finding decisions stepwise.
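The decision rule above can be sketched directly in code. The following is an illustrative Python reconstruction on a toy graph with hand-picked constants $c_1 = c_2 = 10$; it mimics the Theorem 3.1 construction at the level of logits, not the actual weight matrices.

```python
import numpy as np

# Toy reconstruction of the Theorem 3.1 decision rule at the logit level
# (not the paper's explicit weight construction): next-token logits are
# c1 * R[t, :] + c2 * A[u, :], so softmax mass lands only on nodes adjacent
# to the current node u from which the target t is still reachable.

A = np.array([[0, 1, 1, 0],   # DAG edges: 0->1, 0->2, 1->3, 2->3
              [0, 0, 0, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)
# R[t, k] = 1 iff k can reach t (self-reachability included).
R = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 1, 1]], dtype=float)

def next_node_probs(u, t, c1=10.0, c2=10.0):
    logits = c1 * R[t] + c2 * A[u]
    z = np.exp(logits - logits.max())
    return z / z.sum()

def greedy_path(s, t):
    path = [s]
    while path[-1] != t:
        path.append(int(next_node_probs(path[-1], t).argmax()))
    return path

p = next_node_probs(0, 3)     # mass split evenly between valid moves 1 and 2
print(greedy_path(0, 3))      # [0, 1, 3]
```

With target 3 and current node 0, both successors 1 and 2 lie on valid paths, so the softmax splits its mass between them; greedy decoding then completes a correct path step by step.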

3. Learning Dynamics: Adjacency and Partial Reachability

The analysis of SGD-driven learning dynamics (Section 4) employs a simplified Transformer (single layer/head, no nonlinearity/normalization, input/output weights as identities) to clarify which graph-theoretic structures are reliably encoded:

  • For each observed triple $(i, j, k)$ (current node, target, next node), the final logit is $L_{i,j,k} = W^M_{i,k} + W^V_{j,k}$.
  • Gradients with respect to $W^M$ (adjacency) drive $W^M_{i,k}$ up if edge $i \to k$ is seen in the training paths, and down or unchanged otherwise, resulting precisely in $W^M$ encoding $A^{\mathrm{obs}}$, the adjacency matrix of the paths present in training.
  • For $W^V$ (reachability), the same dynamics cause $W^V_{j,k}$ to reflect $R^{\mathrm{obs}}$, that is, direct evidence of $k$ appearing on some training path with target $j$.

Crucially, concatenated or composite reachabilities, i.e., cases where $(t, k)$ is reachable through chaining sub-paths but never observed as a direct pair in training, are not learned. Thus, the trained model does not internalize the transitive closure required for inferential planning beyond observed data.
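A minimal simulation of these gradient dynamics makes the point concrete. All hyperparameters, the graph, and the variable names below are invented for this sketch, not taken from the paper:

```python
import numpy as np

# Illustrative simulation of the Section 4 dynamics (simplified linear model).
# A training triple (i, j, k) means: at node i, with target j, next node is k.
# Candidate logits are W_M[i] + W_V[j]; cross-entropy SGD raises only the
# (i, k) and (j, k) entries that actually occur in training triples.

V = 4
W_M = np.zeros((V, V))            # converges toward observed adjacency A_obs
W_V = np.zeros((V, V))            # converges toward observed reachability R_obs

# Observed paths: 0 -> 1 (target 1) and 1 -> 2 (target 2). The pair
# (target 2, node 0) holds only by composing the two, so it never
# appears as a training triple.
triples = [(0, 1, 1), (1, 2, 2)]

for _ in range(200):
    for i, j, k in triples:
        logits = W_M[i] + W_V[j]
        p = np.exp(logits - logits.max())
        p /= p.sum()
        grad = p.copy()
        grad[k] -= 1.0            # gradient of cross-entropy w.r.t. logits
        W_M[i] -= 0.5 * grad
        W_V[j] -= 0.5 * grad

print(W_V[2, 2] > 1.0)            # observed reachability entry: grows
print(W_V[2, 0] <= 0.0)           # compositional pair: never increases
```

The observed entries grow with every SGD step, while the entry for the compositional pair only ever receives negative gradient, exactly as the analysis predicts.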

4. Empirical Evaluation on Synthetic DAGs and Blocksworld

Extensive experiments validate the theoretical predictions using both synthetic graph environments and the Blocksworld planning benchmark:

  • Synthetic DAGs: Training on random graphs ($n = 100$–$500$ nodes, edge probability $p = 0.05$), with embeddings up to 120 dimensions and varying numbers of layers and heads, evaluated by exact-path accuracy.
    • For $n \leq 200$, 1-layer/1-head models achieve >95% accuracy; for larger graphs, accuracy falls modestly.
    • Attention consistently focuses on the target token.
    • The feedforward matrix $W^{M'}$ recovers $A^{\mathrm{true}}$ on seen edges.
    • The value matrix $W^{V'}$ correlates with $R^{\mathrm{obs}}$; pairs requiring transitive closure are not captured.
    • Partitioning test pairs by “degree” (the number of required path concatenations): accuracy is near 100% for degree 0 but falls to 20–50% for degree $\ge 2$.
  • Blocksworld: The state graph of 73 nodes (legal 4-block configurations) is encoded analogously, trained on 80% of $(s, t)$ pairs sampled over 50,000 paths.
    • A 1-layer, 1-head, $d = 120$ Transformer achieves near-100% accuracy on directly observed reachabilities; attention patterns and learned weights mirror the synthetic results.

These results confirm both the successful learning of adjacency and observed reachability, and the consistent failure to generalize to purely compositional inferences.
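The gap between observed and true reachability that drives these results can be reproduced with a small script (graph size and path counts shrunk from the paper's $n = 100$–$500$ for speed; all names are invented for this sketch):

```python
import numpy as np

# Illustrative reconstruction of the observed-vs-true reachability gap
# on a small random DAG (parameters shrunk from the paper's setup).

rng = np.random.default_rng(1)
n, p_edge = 30, 0.15
A = np.triu(rng.random((n, n)) < p_edge, k=1)   # DAG: edges only i -> j, i < j

# True reachability (self-pairs included) via Floyd-Warshall on booleans:
# R[k, t] = 1 iff k can reach t.
R = A | np.eye(n, dtype=bool)
for m in range(n):
    R |= np.outer(R[:, m], R[m, :])

def random_path(s):
    """Follow random outgoing edges from s until a sink is reached."""
    path = [int(s)]
    while A[path[-1]].any():
        path.append(int(rng.choice(np.flatnonzero(A[path[-1]]))))
    return path

observed = set()                  # (target, node) pairs witnessed on paths
for _ in range(200):
    path = random_path(rng.integers(n))
    for a in range(len(path)):
        for b in range(a, len(path)):
            observed.add((path[b], path[a]))

true_pairs = {(t, k) for t in range(n) for k in range(n) if R[k, t]}
print(f"witnessed {len(observed)} of {len(true_pairs)} reachable pairs")
```

Every witnessed pair is genuinely reachable, but the converse fails: the reachable pairs never covered by a sampled path are exactly those on which the theory predicts, and the experiments confirm, that the model will fail.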

5. Fundamental Limitation: No Transitive Closure

Theoretical analysis (Section 4) and empirical evidence (Figures 8, 9, 10) together demonstrate that autoregressive next-token training on path sequences fails to instantiate a mechanism for learning transitive closure:

  • The gradient for $W^V_{t,k}$ does not increase unless $(t, k)$ appears directly in the data; as such, the model has no way to “compose” two or more partial reachabilities and thus cannot infer $R^{\mathrm{true}}_{t,k} = 1$ for any pair $(t, k)$ not observed.
  • Empirically, entries of $W^{V'}$ corresponding to unobserved (but actually reachable) node pairs remain indistinguishable from those of truly unreachable pairs.
  • Accordingly, test accuracy collapses on pairs requiring nontrivial concatenation (degree $\ge 2$), even for large models and extensive training sets.
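The missing operation can be stated in one line of linear algebra: composing observed reachabilities is a boolean matrix product, $R^{\mathrm{obs}} \lor (R^{\mathrm{obs}} R^{\mathrm{obs}})$, which next-token gradient updates never compute. A toy illustration (matrices invented for this sketch, using the document's $R_{t,k}$ convention):

```python
import numpy as np

# Toy matrices using the convention R[t, k] = 1 iff k reaches t.
# Training paths witnessed 0 -> 1 and 1 -> 2, but never 0 -> 2 directly.
R_obs = np.array([[1, 0, 0],
                  [1, 1, 0],
                  [0, 1, 1]], dtype=bool)

# One boolean matrix product composes two observed hops -- the operation
# that next-token gradient updates never perform.
R_composed = R_obs | ((R_obs.astype(int) @ R_obs.astype(int)) > 0)

print(np.argwhere(R_composed & ~R_obs))   # [[2 0]]: 0 reaches 2 only by composition
```

The single entry in `R_composed & ~R_obs` is precisely the kind of pair on which trained models fail: true under transitive closure, invisible to the gradient dynamics.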

This limitation underscores that current Transformer architectures, when trained exclusively via next-token prediction, do not implement compositional or algorithmic reasoning required for generalizable planning.

6. Insights and Prospects for Enhanced Planning in LLMs

ALPINE’s synthesis of theory and experiment yields several insights for advancing the planning capabilities of LLMs:

  • Transformers can reliably learn and utilize direct adjacency and observed reachability within their weights under next-token cross-entropy training.
  • However, the absence of a compositional mechanism precludes the discovery of novel, transitive reachabilities; planning generalizes only across instances seen during training.
  • To instill robust, compositional planning, promising modifications include:
    • Algorithmic or compositional supervision (e.g., chain-of-thought traces incorporating path composition).
    • Architectural augmentation using explicit graph reasoning modules, such as GNN encoders capable of dynamic-programming-style inference.
    • Expansion of the training objective to include new reachability queries, beyond observed instances.
    • Memory or deduction layers designed for transitive inference.
    • Hybrid losses that encourage generalized reasoning across unseen node pairs.

Collectively, these directions point toward future LLMs with genuinely generalizable planning and path-finding abilities, contingent upon integrating architectural, training, or supervision innovations beyond standard autoregressive objectives (Wang et al., 2024).
