All-for-One Subgraph (AF1)

Updated 4 July 2026

AF1 is a design pattern where a sparse, targeted subgraph funnels information from all prior tokens into the last token, enabling efficient decision-making in transformer models.
Closely related one-subgraph-for-many formulations appear in transformer mechanistic interpretability, inductive knowledge graph completion, and zero-shot graph reasoning, in some cases as the proposed design and in others as the baseline to be improved on.
Empirical and theoretical studies use attention-routing restrictions and ablations to test how much context or computation can be shared before reuse starts to cost relevance.

All-for-One Subgraph (AF1) denotes, in its most explicit current usage, a sparse causal subgraph of transformer computation in which information from multiple preceding tokens is funneled to the last token, and the decisive input-specific computation is then carried out primarily at that last position (Mamidanna et al., 11 Sep 2025). Closely related one-subgraph-for-many-target formulations also appear, sometimes without the AF1 name, in inductive knowledge graph completion, zero-shot LLM-based graph reasoning, graphlet counting, subgraph-universal graph theory, and parity-constrained induced subgraph theory. Across these domains, the recurring technical question is whether one shared subgraph, host structure, or subgraph policy can efficiently support many downstream decisions, or whether such sharing becomes a bottleneck that must be relaxed or replaced.

1. Terminological scope and conceptual variants

The term AF1 is explicit in mechanistic interpretability, but several adjacent literatures instantiate structurally similar ideas. In some cases the relevant paper presents AF1-like behavior as a target architecture; in others it appears as a baseline to be surpassed; in still others the connection is best understood as an inferred all-for-one pattern rather than the paper’s own terminology.

Domain	AF1 formulation	Status
Mechanistic interpretability	Sparse transformer subgraph funneling information to the last token	Explicit (Mamidanna et al., 11 Sep 2025)
Inductive KGC	One opening subgraph for all candidate answers of a query	AF1-style (Xie et al., 2024)
Zero-shot graph reasoning	Fixed task-agnostic $k$-hop extraction used for every instance	AF1-like bottleneck (Li et al., 3 Mar 2026)
Universal graph theory	One host graph containing every graph in a target family	Subgraph-universal formulation (Bergold et al., 2024)
Graphlet counting	One indexed system and shared transforms for many subgraph statistics	AF1 connection inferred (Floros et al., 2021)
Odd-degree induced subgraphs	One induced subgraph whose vertices all satisfy odd-degree parity	AF1 connection inferred (Ferber et al., 2020)

This distribution of meanings suggests that AF1 is less a single theorem than a family of design patterns organized around common reuse of structural context. The crucial distinctions concern the unit of sharing. In transformer circuits, “all” means all relevant prior tokens and “one” means the last token’s residual stream. In inductive KGC, “all” means all candidate answers for one query and “one” means the opening subgraph centered on the query entity. In universal graph theory, “all” means all graphs in a family and “one” means a single host graph. In zero-shot graph reasoning, by contrast, the AF1-like object is the rejected baseline: one fixed subgraph-extraction recipe for every instance (Xie et al., 2024, Li et al., 3 Mar 2026, Bergold et al., 2024).

2. AF1 as a sparse transformer computation graph

In "All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens" (Mamidanna et al., 11 Sep 2025), AF1 is defined as an intervention-specified subgraph of a transformer sufficient to preserve high performance on direct arithmetic next-token prediction without explicit chain-of-thought. The architecture is organized into three stages: early “waiting” layers, in which tokens do not use input-specific information from other tokens; middle “transfer” layers, in which information from earlier positions is sent to the last token through a restricted set of attention routes; and late layers, in which meaningful computation is forced to continue essentially only at the last token. With input sequence $\vec x=\{x_1,\dots,x_T\}$, residual states $x_t^{(l)}$, and prefix computation $m(\vec x,t,l)$, the central intervention for the waiting phase is Context-Aware Mean Ablation (CAMA),

$\tilde x_t^{(wait)}=\mathbb{E}_{\vec{x}'\sim \mathbb P(\vec{x}\mid x_t)} \!\bigl[\,m(\vec{x}',\,t,\,wait)\bigr],$

which preserves task-general contextual processing conditional on the token itself while averaging out prompt-specific information from other tokens. Communication control in the transfer and late stages is imposed by Attention-Based Peeking (ABP): given an allowed key set $K_t\subseteq\{1,\dots,t\}$ for each position $t$, pre-softmax attention entries are masked as $M_{q,k}\leftarrow -\infty$ whenever $k\notin K_q$, with the first token always retained because the paper reports that removing attention to BOS is devastating. The resulting AF1 is evaluated by faithfulness,

$\mathrm{faith.}(s)=\mathbb E_{x,y}[s(x)=y\mid m(x)=y],$

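which measures conditional agreement with the full model: $s$ denotes the AF1 subgraph's prediction and $m$ the full model's prediction, so faithfulness is the subgraph's accuracy restricted to examples the full model already solves.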
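The two definitions above can be sketched in plain Python. This is a minimal illustration, not the paper's code: the function names `cama_states` and `faithfulness`, the list-based data layout, and the use of an empirical sample of prompts in place of the conditional distribution are all assumptions made for the example.

```python
from collections import defaultdict


def cama_states(prompts, states, position):
    """Context-aware mean ablation at one position (illustrative sketch).

    prompts[i] is a list of tokens; states[i] is the hidden vector (a list of
    floats) that prompt i produces at `position`. For each token observed at
    that position, return the mean state over all sampled prompts that share
    the token, which averages out what the other tokens contributed.
    """
    groups = defaultdict(list)
    for tokens, vec in zip(prompts, states):
        groups[tokens[position]].append(vec)
    return {
        tok: [sum(col) / len(vecs) for col in zip(*vecs)]
        for tok, vecs in groups.items()
    }


def faithfulness(subgraph_preds, model_preds, labels):
    """Subgraph accuracy restricted to examples the full model gets right."""
    solved = [(s, y) for s, m, y in zip(subgraph_preds, model_preds, labels) if m == y]
    if not solved:
        raise ValueError("the full model solves no examples; faithfulness is undefined")
    return sum(s == y for s, y in solved) / len(solved)
```

Conditioning on the token at the ablated position is what distinguishes this from a plain mean ablation: prompts are grouped by that token before averaging, so the replacement state still reflects the token itself.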

The mechanistic claim is correspondingly specific. Attention serves as the communication channel that moves operand information across positions, while MLPs remain token-local. If, after a brief transfer window, only the last token continues to receive meaningful information and later even it is reduced to self-peeking, then any successful remaining arithmetic must be encoded in that token’s residual stream and processed primarily by its own later blocks. The paper does not provide a symbolic arithmetic algorithm; its claim is causal and intervention-based rather than a full analytic decomposition of the transformer.
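The masking rule behind ABP can be illustrated with a small, self-contained sketch for a single attention head. The function name `abp_attention`, the dictionary layout for allowed key sets, and the toy per-layer schedules are assumptions for illustration; in particular, what non-final positions are allowed to see during the transfer stage is a guess here, not something the text above specifies.

```python
import math


def abp_attention(scores, allowed, bos=0):
    """Apply attention-based peeking to one head's pre-softmax scores.

    scores[q][k] is the raw score of query position q on key position k.
    allowed[q] is the set of key positions q may attend to. Causality
    (k <= q) is enforced, and the BOS position is always kept.
    """
    weights = []
    for q, row in enumerate(scores):
        keep = set(allowed[q]) | {bos}
        masked = [
            row[k] if (k <= q and k in keep) else -math.inf
            for k in range(len(row))
        ]
        top = max(masked)  # finite, because BOS is always kept
        exps = [math.exp(v - top) for v in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy schedules for four positions with uniform raw scores (hypothetical).
scores = [[0.0] * 4 for _ in range(4)]
transfer_layer = {0: {0}, 1: {1}, 2: {2}, 3: {0, 1, 2, 3}}  # last token peeks at all
late_layer = {q: {q} for q in range(4)}  # self-peeking only, plus BOS
```

Under the late-layer schedule the last position can only mix its own state with BOS, which is the sense in which any remaining arithmetic has to be carried out in that token's residual stream.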

3. Empirical structure, sufficiency, and necessity

For Llama-3-8B on $A+B+C$, the paper identifies a minimal AF1 in which the model can wait for […] layers, transfer for 2 layers, and then continue with last-token self-attention only. The experimentally read constraints are […] and […]. Within the two transfer layers, the graph is also head-sparse: accuracy remains at […] after removing […] of the […] heads in layers 15 and 16, and the remaining important heads include L16H1, L15H13, L15H3, and L16H21 (Mamidanna et al., 11 Sep 2025).

The reported faithfulness values show that the same AF1 structure is highly effective for several direct arithmetic tasks and transfers from Llama-3-8B to Llama-3.1-8B.

Task	Llama-3-8B	Llama-3.1-8B
[…]	0.995	0.995
[…]	0.944	0.974
[…]	0.312	0.967
[…]	0.995	0.983
[…]	0.854	0.771
[…]	0.987	0.889
[…]	0.710	0.503
[…]	0.887	0.779

Necessity is supported by several ablations. Removing the last token’s incoming attention to non-BOS earlier tokens at a single layer produces large drops, especially at layers […] and […]; the paper states that layer […] removal causes a large performance drop on all tasks, while layer […] removal affects all but two tasks. The head-level ablation is similarly sharp: after the […]-head reduction, removing L15H31 leaves […] accuracy, removing L16H1 leaves […], removing L15H13 leaves […], removing L15H3 leaves […], and removing L16H21 leaves […]. Logit-lens analysis further shows that top-3 answer accuracy emerges around layer […] in both the full model and AF1, which the paper interprets as evidence that AF1 preserves the native prediction pathway rather than replacing it with a qualitatively different shortcut. At the same time, the scope is limited: AF1 works on verbal-math, question-answering, and instruction variants of direct arithmetic, but fails on math word problems and Python-program prompts, where reported faithfulness can fall near zero.

4. Query-level AF1 in inductive knowledge graph completion

In inductive KGC, "One Subgraph for All: Efficient Reasoning on Opening Subgraphs for Inductive Knowledge Graph Completion" (Xie et al., 2024) presents a clear AF1-style construction at query level. Instead of extracting one enclosing subgraph per candidate triple […], the method GLAR uses one shared opening subgraph […], defined as the […]-hop induced neighborhood around the query entity […], for all candidate answers to a query […]. The paper is explicit that the sharing unit is the query rather than the full dataset. This opening subgraph is then combined with entity-independent local anchors and global anchors to produce transferable structure-aware features in the full-inductive setting where […].

The local anchor set is

[equation missing]

where […] is the query entity and […] abstracts one-hop neighbors by relation-to-center type. Structure-aware features are built from anchor reachability, distance from the center, and relational features; global anchors are obtained by clustering relational feature vectors on the training KG and selecting degree-based representatives. The computational motivation is explicit: if a query has […] candidates, enclosing-subgraph methods incur […], whereas GLAR reasons once on the opening subgraph with overall cost simplified to […], where […]. Empirically, average Hits@10 scores are […] on WN18RR-ind, […] on FB15k237-ind, and […] on NELL995-ind; compared with ConGLR, the paper highlights average gains of about […], […], and […], and compared with QAAR it reports improvements of […], […], and […]. The runtime contrast is especially direct: on FB15k237-ind v1 with […] negatives, GraIL takes […] and GLAR takes […]; with […] negatives, GraIL takes […] while GLAR remains at […].
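The query-level sharing described above can be sketched as one breadth-first extraction followed by scoring every candidate on the same subgraph. This is a minimal illustration under stated assumptions, not GLAR itself: the function names, the triple format `(head, relation, tail)`, and the pluggable `score` callable are hypothetical, and the anchor-based features are omitted entirely.

```python
from collections import deque


def opening_subgraph(triples, center, hops):
    """Return the triples induced by entities within `hops` of `center`,
    together with each retained entity's distance from the center."""
    neighbors = {}
    for h, _, t in triples:
        neighbors.setdefault(h, set()).add(t)
        neighbors.setdefault(t, set()).add(h)
    dist = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if dist[node] == hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    kept = [(h, r, t) for h, r, t in triples if h in dist and t in dist]
    return kept, dist


def rank_candidates(triples, query_entity, candidates, hops, score):
    """Extract once per query, then score every candidate on the shared subgraph."""
    subgraph, dist = opening_subgraph(triples, query_entity, hops)
    return sorted(candidates, key=lambda c: score(subgraph, dist, c), reverse=True)
```

The point of the design is that `opening_subgraph` runs once per query regardless of how many candidates are ranked, whereas an enclosing-subgraph method would repeat an extraction for each candidate triple.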

A common misconception is to read this as one universal subgraph for the entire emerging graph. The paper does not claim that. It claims one opening subgraph for all candidates of a single query. That distinction is essential to the AF1 interpretation.

5. AF1 as a bottleneck in zero-shot LLM graph reasoning

"Beyond One-Size-Fits-All: Adaptive Subgraph Denoising for Zero-Shot Graph Learning with LLMs" (Li et al., 3 Mar 2026) does not use the term AF1 explicitly, but it is directly organized around the same underlying question: whether zero-shot LLM-based graph reasoning should rely on one universal subgraph extraction rule for every instance and task. The paper formulates the graph as

[equation missing]

converts the task into conditional text generation,

[equation missing]

and argues that the extracted subgraph defines the LLM’s effective receptive field. Its critique of prior work, especially Graph-R1, is that such methods use a uniform and task-agnostic subgraph extraction setting, typically all $k$-hop neighbors, thereby introducing structural noise through irrelevant neighbors, distracting edges, and oversized semantically mixed neighborhoods.

GraphSSR is proposed as the alternative. Its Sample-Select-Reason pipeline first samples a set of candidate subgraphs […], then selects the best-suited subgraph […], and finally reasons only on […]. Candidate diversity during data synthesis is scored by

[equation missing]

and the supervised stage optimizes

[equation missing]

The reinforcement-learning stage adds Authenticity-Reinforced and Denoising-Reinforced RLVR. In Stage I,

[equation missing]

and in Stage II,

[equation missing]

so smaller correct authentic selections receive extra reward.

The empirical message is a direct rejection of one-size-fits-all extraction. GraphSSR improves over Graph-R1 on all reported node-classification benchmarks: Cora […], WikiCS […], Products-47 […], Products-10 […], Products-5 […], Cora-2 […], and WikiCS-5 […]. Table 4 reports substantial reasoning-time subgraph compression with improved accuracy: Cora […], WikiCS […], Products […], and FB15K237 […]. The paper’s conclusion is not that the smallest subgraph is always best. Its RL case studies show that authenticity-only training keeps overly large subgraphs, denoising-only training can over-prune to the target node alone, and the full model succeeds by selecting a small but sufficiently informative subgraph.
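The control flow of a Sample-Select-Reason pipeline can be sketched on a toy citation graph. Everything concrete here is a hypothetical stand-in: the data, the keyword-overlap selection heuristic, and the majority-vote reasoner replace the LLM-driven steps of the paper and are not its method. The sketch only shows how a fixed full-neighborhood extraction and an instance-specific selection can lead to different answers.

```python
from collections import Counter

# Toy data (hypothetical): one target paper and its one-hop neighbors.
ADJ = {"p0": ["p1", "p2", "p3", "p4", "p5"]}
KEYWORDS = {
    "p0": {"graph", "neural"},
    "p1": {"graph", "kernel"},
    "p2": {"graph", "neural"},
    "p3": {"protein"},
    "p4": {"protein", "fold"},
    "p5": {"protein"},
}
LABELS = {"p1": "ML", "p2": "ML", "p3": "Bio", "p4": "Bio", "p5": "Bio"}


def jaccard(a, b):
    return len(a & b) / len(a | b)


def sample(target):
    """Sample: propose several candidate subgraphs (here, neighbor sets)."""
    full = tuple(ADJ[target])
    related = tuple(n for n in full if KEYWORDS[n] & KEYWORDS[target])
    return [full, related, ()]


def select(target, candidates):
    """Select: keep the candidate whose nodes are most similar to the target."""
    def score(nodes):
        if not nodes:
            return 0.0
        return sum(jaccard(KEYWORDS[n], KEYWORDS[target]) for n in nodes) / len(nodes)
    return max(candidates, key=score)


def reason(target, nodes):
    """Reason: majority vote over the selected neighbors' labels."""
    votes = Counter(LABELS[n] for n in nodes)
    return votes.most_common(1)[0][0] if votes else None


def sample_select_reason(target):
    chosen = select(target, sample(target))
    return reason(target, chosen), chosen
```

In this toy case, reasoning over the full neighborhood is dominated by the three unrelated neighbors, while the selected two-node subgraph recovers the label shared by the topically related papers; the empty candidate illustrates the over-pruned extreme.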

6. Graph-theoretic and counting-theoretic analogues

In graph theory, AF1-like constructions arise most cleanly as subgraph universality. "Subgraph-universal planar graphs for trees" (Bergold et al., 2024) studies a single host graph […] that contains every graph in a target family as a subgraph. For every […], it constructs an outerplanar host on […] vertices, with […], containing every […]-vertex tree as a subgraph. The same paper proves that any three […]-vertex trees have a common planar host on […] vertices, with asymptotically matching lower bounds even for caterpillars; it gives an exponential lower bound […] for planar hosts universal for all […]-vertex planar graphs; and it proves that an outerplanar host containing all […]-vertex outerplanar graphs must have at least […] vertices, while a planar host for all […]-vertex outerplanar graphs can be constructed on […] vertices. In this literature, AF1 is exact rather than metaphorical: one host serves an entire graph family.
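Subgraph universality is easy to check by brute force on very small instances. The sketch below is an illustration only and has nothing to do with the paper's constructions or bounds: the function names and the five-vertex example host are assumptions, and the search is exponential, so it is usable only for toy graphs.

```python
from itertools import permutations


def contains_subgraph(host_edges, guest_edges):
    """True if the guest embeds in the host as a (not necessarily induced) subgraph."""
    host = {frozenset(e) for e in host_edges}
    host_nodes = sorted({v for e in host_edges for v in e})
    guest_nodes = sorted({v for e in guest_edges for v in e})
    # Try every injective map from guest vertices to host vertices.
    for image in permutations(host_nodes, len(guest_nodes)):
        phi = dict(zip(guest_nodes, image))
        if all(frozenset((phi[u], phi[v])) in host for u, v in guest_edges):
            return True
    return False


def is_universal(host_edges, family):
    """True if one host contains every graph in the family as a subgraph."""
    return all(contains_subgraph(host_edges, g) for g in family)


# The two trees on four vertices: a path and a star.
PATH4 = [(0, 1), (1, 2), (2, 3)]
STAR4 = [(0, 1), (0, 2), (0, 3)]
# A five-vertex host: a star with one leg extended by an extra edge.
HOST = [(0, 1), (0, 2), (0, 3), (3, 4)]
```

Here `HOST` contains both four-vertex trees, while the path alone does not contain the star, so a universal host for this family needs more than the four vertices of any single member.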

A different inferred AF1 pattern appears in "A systematic association of subgraph counts over a network" (Floros et al., 2021). The paper does not name AF1, but it explicitly organizes many subgraph counting tasks into one graphlet representation system indexed by […], with exact familywise gross-to-net conversion

[equation missing]

and inverse

[equation missing]

This suggests an AF1-like counting regime in which many local subgraph statistics are derived from one indexed family, one lattice of inclusion relations, and one shared algebraic transform rather than one isolated counting task at a time. The same paper’s G-SURF framework outputs […], and on the NotreDame_www graph it reports that about […] of local systems are reduced for quad-node graphlets and about […] for penta-node graphlets.
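The smallest instance of a gross-to-net conversion is the relation between two-edge paths (wedges) and triangles. The sketch below assumes that "gross" means non-induced counts and "net" means induced counts; it uses the standard identity that every triangle contributes three non-induced wedges, and it is not the paper's general transform, which is not reproduced above. Function names are hypothetical.

```python
from itertools import combinations
from math import comb


def _adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def three_node_counts(edges):
    """Return (gross wedges, triangles, net wedges) for an undirected simple graph."""
    adj = _adjacency(edges)
    # Gross: every pair of neighbors of a vertex forms a wedge centered there.
    gross_wedges = sum(comb(len(nb), 2) for nb in adj.values())
    triangles = sum(
        1
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )
    # Net: remove the three wedges hidden inside each triangle.
    net_wedges = gross_wedges - 3 * triangles
    return gross_wedges, triangles, net_wedges


def induced_wedges_brute_force(edges):
    """Count vertex triples that span exactly two edges (induced wedges)."""
    adj = _adjacency(edges)
    count = 0
    for a, b, c in combinations(sorted(adj), 3):
        present = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        count += present == 2
    return count
```

The conversion is a fixed linear map between the two count vectors, so the net counts for all three-node patterns follow from one shared set of gross counts instead of a separate enumeration per pattern.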

A further inferred analogue occurs in parity-constrained induced subgraphs. "Every graph contains a linearly sized induced subgraph with all degrees odd" (Ferber et al., 2020) proves that every graph […] on […] vertices with […] satisfies

[equation missing]

equivalently, that every such graph contains an induced subgraph on at least […] vertices in which every vertex has odd degree. This is not AF1 in the naming sense, but it is an exact all-for-one structural guarantee: one induced subgraph simultaneously satisfies the same local parity constraint at all retained vertices.
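The object in this theorem can be computed directly for tiny graphs. The following is a brute-force illustration of the definition only, with a hypothetical function name; it says nothing about the linear bound itself and runs in exponential time.

```python
from itertools import combinations


def largest_odd_induced_subgraph(nodes, edges):
    """Return a largest vertex set whose induced subgraph has all degrees odd."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Search from the largest subsets downward and stop at the first hit.
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            chosen = set(subset)
            # Degrees are measured inside the induced subgraph only.
            if all(len(adj[v] & chosen) % 2 == 1 for v in chosen):
                return chosen
    return set()
```

For a triangle the best possible set is a single edge, since all three vertices together have even degree, whereas a complete graph on four vertices already has all degrees odd and is returned whole.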

7. Misconceptions, limits, and open questions

A first misconception is to treat AF1 as a single settled concept across fields. The explicit term belongs to the transformer-circuit setting of mental math (Mamidanna et al., 11 Sep 2025). In graph learning and graph theory, several papers instantiate closely related ideas but often under different names and with different mathematical objects. A second misconception is to equate AF1 with maximal sharing in every setting. In GLAR, the sharing unit is one query, not the entire KG (Xie et al., 2024). In GraphSSR, the one-size-fits-all policy is presented as the central weakness rather than the solution (Li et al., 3 Mar 2026). In universal graph theory, containment means containment as a not-necessarily-induced subgraph unless an induced variant is specified (Bergold et al., 2024).

Another recurring issue is the trade-off between reuse and relevance. GraphSSR shows an inverted-U behavior in denoising intensity: too little denoising leaves structural noise in place, while too much removes critical evidence (Li et al., 3 Mar 2026). The mechanistic AF1 paper makes an analogous point in a different modality: the discovered AF1 preserves direct arithmetic behavior but fails on word problems and Python-like prompts, indicating that one sparse computation subgraph does not capture the full semantic machinery of general mathematical language understanding (Mamidanna et al., 11 Sep 2025). GLAR likewise benefits when candidate sets are large and query-centered neighborhoods are informative, but the paper notes poorer behavior for low-degree entities and a memory/runtime trade-off as […] grows (Xie et al., 2024).

Open questions remain correspondingly heterogeneous. The mechanistic literature has not yet extended AF1 cleanly to models whose tokenizers split numbers into multiple tokens or to richer reasoning tasks. Zero-shot graph reasoning has not yet replaced prompted candidate generation with a fully explicit combinatorial optimization over subgraphs. Universal graph theory leaves open whether the polynomial outerplanar host for all […]-vertex trees can be improved further and whether polynomial-size planar universal hosts exist for all […]-vertex outerplanar graphs (Bergold et al., 2024). These unresolved points reinforce a general conclusion: AF1 is most informative when the unit of sharing, the admissible host or subgraph class, and the failure mode of over-sharing are all specified precisely.