
ALPINE: Autoregressive Planning in Networks

Updated 12 February 2026
  • The paper introduces ALPINE, a framework that recasts planning as discrete token sequence generation on graphs to bridge next-token prediction with path-finding.
  • It demonstrates that even simple Transformer models can accurately encode observed adjacency and reachability, yet fail to infer the transitive closure of reachability.
  • Empirical evaluations on synthetic DAGs and Blocksworld confirm that while direct path prediction is robust, compositional planning remains a critical limitation.

Autoregressive Learning for Planning In NEtworks (ALPINE) is a theoretical and empirical investigation into how standard Transformer-based LLMs, trained via autoregressive next-token prediction, acquire and execute planning capabilities when such planning is cast as path-finding over networks. The ALPINE framework formalizes planning as network path-finding, characterizes precisely what is learned by Transformers under cross-entropy training, and reveals fundamental limitations—most notably, the inability to infer transitive reachability—from both theoretical and experimental perspectives (Wang et al., 2024).

1. Formalization of Path-Finding as Autoregressive Prediction

ALPINE models planning as a discrete token sequence generation problem: given a directed graph $\mathcal{G}$ with node set $\mathcal{V}$, adjacency matrix $A^{\mathrm{true}} \in \{0,1\}^{|\mathcal{V}| \times |\mathcal{V}|}$ (where $A^{\mathrm{true}}_{i,k} = 1$ iff edge $i \to k$ exists), and reachability matrix $R^{\mathrm{true}} \in \{0,1\}^{|\mathcal{V}| \times |\mathcal{V}|}$ (where $R^{\mathrm{true}}_{t,k} = 1$ iff a path $k \to \ldots \to t$ exists), the task is to generate a token sequence $u = (s, t, s, u_1, u_2, \ldots, u_p, t, \texttt{\textbackslash n})$ representing a path from source $s$ to target $t$. The vocabulary size is $M = |\mathcal{V}| + 1$.

Transformers are trained to autoregressively predict each next token using a standard architecture:

  • Input encoding: $H_0 = U W_t + W_p$, where $U$ is the one-hot token matrix and $W_t$, $W_p$ are learned token and positional embedding weights.
  • Layerwise propagation: $H_\ell = \mathrm{TransformerLayer}(H_{\ell-1})$.
  • Output: $\hat{u}_{n+1} = \mathrm{softmax}(\mathrm{LN}_t(H_L)_{(n,:)} W_o)$.
  • Training loss: $\ell = -\sum_{n=1}^{N-1} U_{(n+1)} \cdot \log \hat{u}_{n+1}$.

This formulation provides a direct bridge between next-token prediction and the mechanistic implementation of planning as path-finding.
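To make the sequence format concrete, here is a minimal Python sketch of the serialization described above; the helper names (`encode_path`, `next_token_pairs`) are illustrative, not from the paper.

```python
# Minimal sketch of ALPINE's sequence format; encode_path and
# next_token_pairs are illustrative helper names, not from the paper.

def encode_path(path, num_nodes):
    """Serialize a path [s, u1, ..., t] as (s, t, s, u1, ..., t, EOL)."""
    eol = num_nodes               # extra token, so vocabulary size M = |V| + 1
    s, t = path[0], path[-1]
    return [s, t] + list(path) + [eol]

def next_token_pairs(tokens):
    """(prefix length, next token) pairs scored by the cross-entropy loss."""
    return [(n, tokens[n]) for n in range(1, len(tokens))]

seq = encode_path([0, 3, 7], num_nodes=10)
print(seq)                        # [0, 7, 0, 3, 7, 10]
print(next_token_pairs(seq))
```

Note that the target $t$ appears immediately after the source, so the model can attend to it at every generation step.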

2. Representation of Graph Structure in Self-Attention and Output Probabilities

A core result (Theorem 3.1) demonstrates that a 1-layer, 1-head Transformer with embedding dimension $d = O(|\mathcal{V}|)$ can exactly implement optimal path-finding via appropriately chosen weights:

  • Attention: Weights $W^Q = \sqrt{d}\,I$ and $W^K$ ensure all attention mass focuses on the target-token position.
  • Values: Weights $W^V$ at the target token position embed the row $R^{\mathrm{true}}_{(t,:)}$ of the true reachability matrix.
  • Feedforward/MLP: Weights $W_1, W_2$ extract the row $A^{\mathrm{true}}_{(u_k,:)}$, representing outgoing edges from the current node $u_k$.
  • Final logits: At each next-token step,

$$\mathrm{logit}_k \propto c_1 R^{\mathrm{true}}_{t,k} + c_2 A^{\mathrm{true}}_{u_k,k},$$

where $c_1, c_2$ are scaling factors. The softmax over these logits concentrates essentially all probability mass on exactly those $k$ that are adjacent to the current node $u_k$ and from which the target $t$ remains reachable.

This construction unveils how Transformers, in principle, can internalize both adjacency ($A$) and reachability ($R$) information in their parameters and use these to make path-finding decisions stepwise.
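The decision rule above can be sketched directly in code. The following is an illustrative Python reconstruction on a toy graph with hand-picked constants $c_1 = c_2 = 10$; it mimics the Theorem 3.1 construction at the level of logits, not the actual weight matrices.

```python
import numpy as np

# Toy reconstruction of the Theorem 3.1 decision rule at the logit level
# (not the paper's explicit weight construction): next-token logits are
# c1 * R[t, :] + c2 * A[u, :], so softmax mass lands only on nodes adjacent
# to the current node u from which the target t is still reachable.

A = np.array([[0, 1, 1, 0],   # DAG edges: 0->1, 0->2, 1->3, 2->3
              [0, 0, 0, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)
# R[t, k] = 1 iff k can reach t (self-reachability included).
R = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 1, 1]], dtype=float)

def next_node_probs(u, t, c1=10.0, c2=10.0):
    logits = c1 * R[t] + c2 * A[u]
    z = np.exp(logits - logits.max())
    return z / z.sum()

def greedy_path(s, t):
    path = [s]
    while path[-1] != t:
        path.append(int(next_node_probs(path[-1], t).argmax()))
    return path

p = next_node_probs(0, 3)     # mass split evenly between valid moves 1 and 2
print(greedy_path(0, 3))      # [0, 1, 3]
```

With target 3 and current node 0, both successors 1 and 2 lie on valid paths, so the softmax splits its mass between them; greedy decoding then completes a correct path step by step.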

3. Learning Dynamics: Adjacency and Partial Reachability

The analysis of SGD-driven learning dynamics (Section 4) employs a simplified Transformer (single layer/head, no nonlinearity/normalization, input/output weights as identities) to clarify which graph-theoretic structures are reliably encoded:

  • For each observed triple $(i, j, k)$ (current node, target, next node), the final logit is $L_{i,j,k} = W^M_{i,k} + W^V_{j,k}$.
  • Gradients with respect to $W^M$ (adjacency) drive $W^M_{i,k}$ up if edge $i \to k$ is seen in the training paths, and down or unchanged otherwise, resulting precisely in $W^M$ encoding $A^{\mathrm{obs}}$, the adjacency matrix of the paths present in training.
  • For $W^V$ (reachability), the same dynamics cause $W^V_{j,k}$ to reflect $R^{\mathrm{obs}}$, that is, direct evidence of $k$ appearing on some training path with target $j$.

Crucially, concatenated or composite reachabilities, i.e., cases where $(t, k)$ is reachable through chaining sub-paths but never observed as a direct pair in training, are not learned. Thus, the trained model does not internalize the transitive closure required for inferential planning beyond observed data.
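A minimal simulation of these gradient dynamics makes the point concrete. All hyperparameters, the graph, and the variable names below are invented for this sketch, not taken from the paper:

```python
import numpy as np

# Illustrative simulation of the Section 4 dynamics (simplified linear model).
# A training triple (i, j, k) means: at node i, with target j, next node is k.
# Candidate logits are W_M[i] + W_V[j]; cross-entropy SGD raises only the
# (i, k) and (j, k) entries that actually occur in training triples.

V = 4
W_M = np.zeros((V, V))            # converges toward observed adjacency A_obs
W_V = np.zeros((V, V))            # converges toward observed reachability R_obs

# Observed paths: 0 -> 1 (target 1) and 1 -> 2 (target 2). The pair
# (target 2, node 0) holds only by composing the two, so it never
# appears as a training triple.
triples = [(0, 1, 1), (1, 2, 2)]

for _ in range(200):
    for i, j, k in triples:
        logits = W_M[i] + W_V[j]
        p = np.exp(logits - logits.max())
        p /= p.sum()
        grad = p.copy()
        grad[k] -= 1.0            # gradient of cross-entropy w.r.t. logits
        W_M[i] -= 0.5 * grad
        W_V[j] -= 0.5 * grad

print(W_V[2, 2] > 1.0)            # observed reachability entry: grows
print(W_V[2, 0] <= 0.0)           # compositional pair: never increases
```

The observed entries grow with every SGD step, while the entry for the compositional pair only ever receives negative gradient, exactly as the analysis predicts.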

4. Empirical Evaluation on Synthetic DAGs and Blocksworld

Extensive experiments validate the theoretical predictions using both synthetic graph environments and the Blocksworld planning benchmark:

  • Synthetic DAGs: Training on random graphs ($n = 100$–$500$ nodes, edge probability $p = 0.05$), with embeddings up to 120 dimensions and varying numbers of layers and heads, evaluated by exact-path accuracy.
    • For $n \leq 200$, 1-layer/1-head models achieve >95% accuracy; for larger graphs, accuracy falls modestly.
    • Attention consistently focuses on the target token.
    • The feedforward matrix $W^{M'}$ recovers $A^{\mathrm{true}}$ on seen edges.
    • The value matrix $W^{V'}$ correlates with $R^{\mathrm{obs}}$; pairs requiring transitive closure are not captured.
    • Partitioning test pairs by “degree” (the number of required path concatenations): accuracy is near 100% for degree 0 but falls to 20–50% for degree $\ge 2$.
  • Blocksworld: The state graph of 73 nodes (legal 4-block configurations) is encoded analogously, trained on 80% of $(s, t)$ pairs sampled over 50,000 paths.
    • A 1-layer, 1-head, $d = 120$ Transformer achieves near-100% accuracy on directly observed reachabilities; attention patterns and learned weights mirror the synthetic results.

These results confirm both the successful learning of adjacency and observed reachability, and the consistent failure to generalize to purely compositional inferences.
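The gap between observed and true reachability that drives these results can be reproduced with a small script (graph size and path counts shrunk from the paper's $n = 100$–$500$ for speed; all names are invented for this sketch):

```python
import numpy as np

# Illustrative reconstruction of the observed-vs-true reachability gap
# on a small random DAG (parameters shrunk from the paper's setup).

rng = np.random.default_rng(1)
n, p_edge = 30, 0.15
A = np.triu(rng.random((n, n)) < p_edge, k=1)   # DAG: edges only i -> j, i < j

# True reachability (self-pairs included) via Floyd-Warshall on booleans:
# R[k, t] = 1 iff k can reach t.
R = A | np.eye(n, dtype=bool)
for m in range(n):
    R |= np.outer(R[:, m], R[m, :])

def random_path(s):
    """Follow random outgoing edges from s until a sink is reached."""
    path = [int(s)]
    while A[path[-1]].any():
        path.append(int(rng.choice(np.flatnonzero(A[path[-1]]))))
    return path

observed = set()                  # (target, node) pairs witnessed on paths
for _ in range(200):
    path = random_path(rng.integers(n))
    for a in range(len(path)):
        for b in range(a, len(path)):
            observed.add((path[b], path[a]))

true_pairs = {(t, k) for t in range(n) for k in range(n) if R[k, t]}
print(f"witnessed {len(observed)} of {len(true_pairs)} reachable pairs")
```

Every witnessed pair is genuinely reachable, but the converse fails: the reachable pairs never covered by a sampled path are exactly those on which the theory predicts, and the experiments confirm, that the model will fail.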

5. Fundamental Limitation: No Transitive Closure

Theoretical analysis (Section 4) and empirical evidence (Figures 8, 9, 10) together demonstrate that autoregressive next-token training on path sequences fails to instantiate a mechanism for learning transitive closure:

  • The gradient for $W^V_{t,k}$ does not increase unless $(t, k)$ appears directly in the data; as such, the model has no way to “compose” two or more partial reachabilities and thus cannot infer $R^{\mathrm{true}}_{t,k} = 1$ for any pair $(t, k)$ not observed.
  • Empirically, entries of $W^{V'}$ corresponding to unobserved (but actually reachable) node pairs remain indistinguishable from those of truly unreachable pairs.
  • Accordingly, test accuracy collapses on pairs requiring nontrivial concatenation (degree $\ge 2$), even for large models and extensive training sets.
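The missing operation can be stated in one line of linear algebra: composing observed reachabilities is a boolean matrix product, $R^{\mathrm{obs}} \lor (R^{\mathrm{obs}} R^{\mathrm{obs}})$, which next-token gradient updates never compute. A toy illustration (matrices invented for this sketch, using the document's $R_{t,k}$ convention):

```python
import numpy as np

# Toy matrices using the convention R[t, k] = 1 iff k reaches t.
# Training paths witnessed 0 -> 1 and 1 -> 2, but never 0 -> 2 directly.
R_obs = np.array([[1, 0, 0],
                  [1, 1, 0],
                  [0, 1, 1]], dtype=bool)

# One boolean matrix product composes two observed hops -- the operation
# that next-token gradient updates never perform.
R_composed = R_obs | ((R_obs.astype(int) @ R_obs.astype(int)) > 0)

print(np.argwhere(R_composed & ~R_obs))   # [[2 0]]: 0 reaches 2 only by composition
```

The single entry in `R_composed & ~R_obs` is precisely the kind of pair on which trained models fail: true under transitive closure, invisible to the gradient dynamics.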

This limitation underscores that current Transformer architectures, when trained exclusively via next-token prediction, do not implement compositional or algorithmic reasoning required for generalizable planning.

6. Insights and Prospects for Enhanced Planning in LLMs

ALPINE’s synthesis of theory and experiment yields several insights for advancing the planning capabilities of LLMs:

  • Transformers can reliably learn and utilize direct adjacency and observed reachability within their weights under next-token cross-entropy training.
  • However, the absence of a compositional mechanism precludes the discovery of novel, transitive reachabilities; planning generalizes only across instances seen during training.
  • To instill robust, compositional planning, promising modifications include:
    • Algorithmic or compositional supervision (e.g., chain-of-thought traces incorporating path composition).
    • Architectural augmentation using explicit graph reasoning modules, such as GNN encoders capable of dynamic-programming-style inference.
    • Expansion of the training objective to include new reachability queries, beyond observed instances.
    • Memory or deduction layers designed for transitive inference.
    • Hybrid losses that encourage generalized reasoning across unseen node pairs.

Collectively, these directions point toward future LLMs with genuinely generalizable planning and path-finding abilities, contingent upon integrating architectural, training, or supervision innovations beyond standard autoregressive objectives (Wang et al., 2024).
