Graph Tokenization for Transformers Overview

Updated 23 May 2026

Graph tokenization for Transformers is the process of converting non-Euclidean graph data into structured tokens that encode nodes, edges, and subgraphs.
It enables Transformer models to effectively capture complex topological relations by preserving multiscale substructures and relational information.
Recent advances such as quantized and BPE-based tokenizers offer scalable and adaptive methods that balance expressivity, memory efficiency, and computational cost.

Graph tokenization for Transformers is the process of converting a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ into a structured collection of discrete tokens, enabling the ingestion of non-Euclidean graph data by sequence-oriented Transformer architectures. Unlike NLP tokenization, where tokens are words or subwords, graph tokenization must encode node and edge attributes, multiscale substructures, and global topology in ways compatible with Transformer self-attention. The design and choice of tokenization directly govern the expressivity, scalability, and task-specific suitability of graph Transformers. This article surveys foundational principles, main classes of graph tokenization, explicit algorithms and theoretical trade-offs, and recent developments in quantized and adaptive tokenizers.

1. Definitions, Principles, and Theoretical Foundations

Graph tokenization is formally a mapping $T: \mathcal{G} \to \{t_1, \ldots, t_T\}$ where each token $t_i$ encodes a node, edge, subgraph, or multiscale structure (Yuan et al., 23 Feb 2025). The central goal is to expose both attributes and relational information, so that the vanilla Transformer self-attention mechanism can model pairwise (and, in some schemes, higher-order) topological interactions. The input “sequence” $H^{(0)} \in \mathbb{R}^{T \times d}$, where $T$ (as a dimension) is the number of tokens and $d$ the embedding dimension, forms the backbone for subsequent position- or relation-aware attention.

Criteria for a robust graph tokenization include:

One-to-one or many-to-one semantic mapping between graph elements and tokens
Retention of key structural relations (adjacency, substructure memberships, distances)
Sufficient expressivity to match graph isomorphism tests up to $k$-WL as needed for the downstream task
Compatibility with Transformer computation (fixed or bounded token count, embedding dimension, and positional encodings)

Expressivity is typically analyzed in terms of the Weisfeiler-Lehman (WL) hierarchy: node-level tokenizations often correspond to 1-WL (neighbor-aggregation), node+edge tokenizations to 2-WL, and k-tuple or subgraph-level tokenizations to k-WL (Yuan et al., 23 Feb 2025, Kim et al., 2022). The interplay between tokenization and the depth of the Transformer has fundamental implications for the complexity of various graph functions (Bechler-Speicher et al., 21 May 2026).
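To make the 1-WL bound concrete, the following minimal sketch runs color refinement on featureless graphs. It is an illustration only, not code from any cited work; the function name `wl_histogram` and the three toy graphs are assumptions introduced here. A 6-cycle and two disjoint triangles are both 2-regular, so 1-WL assigns them identical color histograms even though they are not isomorphic.

```python
from collections import Counter


def wl_histogram(adj, rounds=3):
    """Multiset of 1-WL colors after a fixed number of refinement rounds.

    Colors are kept as nested tuples (no hashing or relabeling), so
    histograms computed on different graphs can be compared directly.
    """
    colors = {v: () for v in adj}  # uniform start: no node features
    for _ in range(rounds):
        # New color = (own color, sorted multiset of neighbor colors).
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())


# Hypothetical toy graphs as adjacency lists.
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}
path6 = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}

# 1-WL cannot separate the two 2-regular graphs, but it separates the path.
indistinguishable = wl_histogram(hexagon) == wl_histogram(two_triangles)
distinguishable = wl_histogram(hexagon) != wl_histogram(path6)
```

This is one reason node-level tokenizations are usually paired with positional or structural encodings: without them, tokens built from neighbor aggregation alone inherit this blind spot.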

2. Taxonomy of Tokenization Schemes

Tokenization methods are categorized by the granularity and semantics of their tokens (Yuan et al., 23 Feb 2025, Yu et al., 2 Jan 2025):

Tokenization Type	Token Definition	Structural Scope
Node-level	$T_{\mathrm{node}}(\mathcal{G})=\{v_i\}$	Single node
Edge-level	$T_{\mathrm{edge}}(\mathcal{G})=\{(u,v)\}$	Pairwise (edge)
Subgraph-level	$T_{\mathrm{sub}}(\mathcal{G})=\{\mathcal{N}^k[v_i]\}$	k-hop ego- or functional subgraph
Hop-level	$T_{\mathrm{hop}}(\mathcal{G})=\{\{u : d(u, v_i) = k\}\}$	All nodes at fixed distance $k$ from $v_i$
Quantized/discrete	Tokens via codebook or vector quantization	Node or subgraph; discrete set
BPE/fragment-based	Learned merges of frequent substructures	Subgraph motifs, multi-scale
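
As a rough illustration of the first four rows of the table, the sketch below enumerates node, edge, ego-subgraph, and hop token sets for a small path graph. All function names are hypothetical and chosen for this example; real tokenizers attach learned embeddings to these sets rather than returning raw node lists.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def node_tokens(adj):
    """One token per node."""
    return sorted(adj)


def edge_tokens(adj):
    """One token per undirected edge (u, v) with u < v."""
    return sorted((u, v) for u in adj for v in adj[u] if u < v)


def ego_tokens(adj, k):
    """Closed k-hop neighborhood of each node: all nodes within distance k."""
    tokens = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        tokens[v] = sorted(u for u, d in dist.items() if d <= k)
    return tokens


def hop_tokens(adj, v, max_hop):
    """For anchor v, one token per hop: nodes at distance exactly k."""
    dist = bfs_distances(adj, v)
    return [sorted(u for u, d in dist.items() if d == k)
            for k in range(max_hop + 1)]


# Path graph 0 - 1 - 2 - 3 (toy example).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
nodes = node_tokens(path)
edges = edge_tokens(path)
ego_of_1 = ego_tokens(path, k=1)[1]
hops_from_0 = hop_tokens(path, v=0, max_hop=2)
```

On this graph the ego token of node 1 covers nodes 0, 1, and 2, while the hop tokens of node 0 are the disjoint shells at distance 0, 1, and 2.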

More broadly, advanced tokenization includes:

Hierarchical quantized tokens via residual vector quantization towers (e.g., QUIET, GQT) (Xiang et al., 14 Oct 2025, Wang et al., 2024)
BPE-based subgraph tokens (e.g., BiScale-GTR, BPE-serialization frameworks) (Yang et al., 7 Apr 2026, Guo et al., 11 Mar 2026)
Patch/fragment tokens from graph partitioning or structure-guided serialization (Biparva et al., 2024, Guo et al., 11 Mar 2026)
Dynamic, instruction-adaptive tokenizations for LLMs (GraphTokenLLMs) (Zhang et al., 5 May 2026, Yu et al., 2 Jan 2025)

3. Algorithms, Positioning, and Implementation

Token embedding pipelines follow steps that depend on the tokenization granularity. A non-exhaustive catalog of key algorithmic patterns includes:

Node-level Tokenization:

For each node $v_i \in \mathcal{V}$ with features $x_i$, define $h_i = \phi(x_i)$ via a learnable linear or MLP layer $\phi$; positional encodings (ORF, Laplacian, degree) are often added (Yuan et al., 23 Feb 2025, Kim et al., 2022). Standard Transformers treat the full node list as sequence input, optionally interleaved with edge or virtual tokens.
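
A minimal sketch of this step, assuming a fixed linear projection and a degree-indexed lookup table as the positional encoding (the simplest of the options listed). The weights, the table, and the names `linear` and `node_level_tokens` are illustrative placeholders; in practice both the projection and the table would be learned.

```python
def linear(x, weight, bias):
    """Plain matrix-vector product plus bias, one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def node_level_tokens(adj, features, weight, bias, degree_table):
    """Build H0 with one row per node: projected features + degree encoding."""
    order = sorted(adj)
    h0 = []
    for v in order:
        projected = linear(features[v], weight, bias)
        # Clip large degrees to the last row of the (hypothetical) table.
        degree = min(len(adj[v]), len(degree_table) - 1)
        h0.append([p + e for p, e in zip(projected, degree_table[degree])])
    return order, h0


# Toy path graph 0 - 1 - 2; nodes 0 and 1 share the same raw features.
adj = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # maps 2 dims to d = 3
bias = [0.0, 0.0, 0.0]
degree_table = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.2, 0.0]]

order, H0 = node_level_tokens(adj, features, weight, bias, degree_table)
```

Nodes 0 and 1 have identical features but different degrees, so the added encoding is what separates their tokens.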

Edge-level (TokenGT):

For each edge $(u, v) \in \mathcal{E}$, form $h_{(u,v)} = \phi_e(x_{uv})$; input is $\{h_v\}_{v \in \mathcal{V}} \cup \{h_{(u,v)}\}_{(u,v) \in \mathcal{E}}$, a sequence of $|\mathcal{V}| + |\mathcal{E}|$ tokens (Kim et al., 2022). When coupled with strong node identifier encodings, this scheme achieves 2-WL expressivity and surpasses message-passing GNNs.
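
The sketch below shows one way such a node-plus-edge sequence could be assembled, in the spirit of TokenGT: each token concatenates its features with a pair of node identifiers and a type identifier. One-hot vectors stand in for the orthonormal identifiers, and the two-dimensional type vectors replace what would normally be trainable embeddings; both are simplifying assumptions, as is the function name `tokengt_sequence`.

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def tokengt_sequence(node_feats, edge_feats):
    """Return |V| + |E| tokens of the form [features, P_u, P_v, type].

    Node v uses identifiers (P_v, P_v); edge (u, v) uses (P_u, P_v).
    Node and edge features are assumed to share one dimension.
    """
    nodes = sorted(node_feats)
    n = len(nodes)
    # One-hot vectors are a trivially orthonormal set of node identifiers.
    ident = {v: one_hot(i, n) for i, v in enumerate(nodes)}
    node_type, edge_type = [1.0, 0.0], [0.0, 1.0]

    tokens = []
    for v in nodes:
        tokens.append(node_feats[v] + ident[v] + ident[v] + node_type)
    for (u, v) in sorted(edge_feats):
        tokens.append(edge_feats[(u, v)] + ident[u] + ident[v] + edge_type)
    return tokens


# Toy triangle with 2-dimensional node and edge features.
node_feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
edge_feats = {(0, 1): [0.5, 0.5], (1, 2): [0.2, 0.8], (0, 2): [0.9, 0.1]}
sequence = tokengt_sequence(node_feats, edge_feats)
```

Because the identifiers of an edge token match those of its endpoint node tokens, attention can recover incidence from dot products between identifier blocks, which is the role the node identifier encodings play in the expressivity claim above.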

Subgraph and Hop-level (Hop2Token, NAGphormer):

Extract for each node or “anchor” a local subgraph $\mathcal{S}_i$ or $k$-hop neighborhood $\mathcal{N}^k[v_i]$. Map it via a GNN or pooling function $f$ to construct a token embedding (Yuan et al., 23 Feb 2025, Chen et al., 2022). For hop-level, per-node sequences of $K+1$ tokens correspond to successive $k$-hop aggregations ($k = 0, \ldots, K$). Local Transformers operate per node’s token sequence for scalability (Chen et al., 2022).
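
A hop-level tokenizer of this kind can be sketched as repeated propagation of node features through a normalized adjacency matrix, keeping every intermediate result as one token. The code below assumes symmetric normalization with self-loops; that choice, and the names `normalized_adjacency` and `hop2token`, are assumptions for illustration rather than a reproduction of any published implementation.

```python
import math


def normalized_adjacency(adj):
    """Sparse rows of D^{-1/2} (A + I) D^{-1/2} as {v: {u: weight}}."""
    deg = {v: len(adj[v]) + 1 for v in adj}  # +1 accounts for the self-loop
    return {
        v: {u: 1.0 / math.sqrt(deg[v] * deg[u]) for u in list(adj[v]) + [v]}
        for v in adj
    }


def hop2token(adj, features, num_hops):
    """Return, for each node, a sequence of num_hops + 1 feature vectors.

    Token 0 is the raw feature; token k is the k-times propagated feature.
    """
    a_hat = normalized_adjacency(adj)
    dim = len(next(iter(features.values())))
    current = {v: list(features[v]) for v in adj}
    sequences = {v: [current[v]] for v in adj}
    for _ in range(num_hops):
        propagated = {}
        for v in adj:
            acc = [0.0] * dim
            for u, w in a_hat[v].items():
                for j in range(dim):
                    acc[j] += w * current[u][j]
            propagated[v] = acc
        current = propagated
        for v in adj:
            sequences[v].append(current[v])
    return sequences


# Toy path graph 0 - 1 - 2 with one-hot features.
adj = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}
sequences = hop2token(adj, features, num_hops=2)
```

Each node ends up with a short, fixed-length sequence that can be precomputed once, so a Transformer can attend over these per-node sequences in mini-batches instead of over the whole graph.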

Quantized and Hierarchical:

Embed nodes using a frozen GNN encoder, then apply residual vector quantization (RVQ), yielding a stack of codebook assignments per node. A lightweight gating network learns task-adaptive affinity scores and assembles the final token (Xiang et al., 14 Oct 2025, Wang et al., 2024). Discrete token ids are mapped into trainable embedding tables.
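
The residual quantization step can be sketched as follows. The example assumes node embeddings and codebooks are already given (here as small hand-written lists) and omits the gating network and codebook training entirely; `nearest_code` and `residual_vector_quantize` are hypothetical helper names.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to vector (squared L2)."""
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_vector_quantize(vector, codebooks):
    """Quantize vector level by level, each level encoding the residual.

    Returns the per-level code ids, the summed reconstruction, and the
    residual left after the last level.
    """
    residual = list(vector)
    reconstruction = [0.0] * len(vector)
    ids = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        code = codebook[idx]
        ids.append(idx)
        reconstruction = [r + c for r, c in zip(reconstruction, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return ids, reconstruction, residual


# Toy example: a coarse codebook followed by a fine one.
codebooks = [
    [[0.0, 0.0], [1.0, 2.0], [3.0, 3.0]],
    [[0.0, 0.0], [0.1, 0.0], [-0.1, 0.0]],
]
embedding = [1.1, 2.0]  # stand-in for a frozen-GNN node embedding
ids, reconstruction, residual = residual_vector_quantize(embedding, codebooks)
```

The list of ids is the discrete token stack for the node: the first id captures coarse structure, and later ids refine it, so each node is stored as a few small integers rather than a dense vector.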

Fragment/BPE-based Tokens:

Graphs are serialized by deterministic edge-traversal (e.g., frequency-guided Eulerian walk), then Byte Pair Encoding (BPE) iteratively merges common pairs, forming a vocabulary of subgraph tokens. These represent functional motifs or fragments, and are used in conjunction with GNN atom pooling to construct the final embedding (Guo et al., 11 Mar 2026, Yang et al., 7 Apr 2026).
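
The merge loop at the heart of this approach can be sketched on already-serialized graphs. The example assumes each graph has been turned into a list of symbols (here, atom labels along some walk); the traversal itself is not shown, and the corpus, the "-" joining convention, and the names `merge_pair` and `learn_bpe` are all illustrative assumptions.

```python
from collections import Counter


def merge_pair(seq, pair, new_symbol):
    """Replace each non-overlapping occurrence of pair, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(corpus, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = [list(seq) for seq in corpus]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for seq in corpus:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        # Ties are broken by the pair itself to keep the result deterministic.
        pair = max(counts, key=lambda p: (counts[p], p))
        if counts[pair] < 2:
            break  # nothing repeats, so further merges would not compress
        merges.append(pair)
        corpus = [merge_pair(seq, pair, pair[0] + "-" + pair[1])
                  for seq in corpus]
    return merges, corpus


# Hypothetical serialized molecules (atom labels along a walk).
corpus = [["C", "C", "O", "C", "C", "O"], ["C", "C", "N"]]
merges, tokenized = learn_bpe(corpus, num_merges=2)
```

After two merges the first sequence shrinks from six symbols to two fragment tokens, which is the sequence compression effect described above; each learned symbol corresponds to a small connected fragment only if the serialization keeps adjacent symbols adjacent in the graph.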

Composite/Multi-stream (Tokenphormer, NTFormer):

Multiple token types (walk-based, hop-based, global pre-trained, attribute-similarity, topology-similarity) are constructed and embedded per node, then collaboratively processed by Transformer architectures (Zhou et al., 2024, Chen et al., 2024).

4. Expressivity, Trade-offs, and Practical Considerations

Explicit theoretical results delineate the regimes and costs of various tokenization choices (Bechler-Speicher et al., 21 May 2026, Yuan et al., 23 Feb 2025):

Scheme	Expressivity	Attention Cost	Practical Notes
Node-level	1-WL	$O(|\mathcal{V}|^2)$	Needs strong PE for topology
Node+edge (TokenGT)	2-WL	$O((|\mathcal{V}|+|\mathcal{E}|)^2)$	Maximizes local pairwise info
Hop-/subgraph-level	k-WL for tuple/subgraph order $k$	$O(|\mathcal{V}| \cdot K^2)$ for $K$ tokens per node (local)	Preserves multi-hop context
Quantized/fragment	Task-dependent	$O(T^2)$ in retained tokens $T$	Memory/efficiency gains
BPE/motif	Subgraph motifs	Sequence compressed	Multiscale, interpretable

Fundamental trade-offs include:

Expressivity vs. cost: Higher-order tokenization (subgraphs, fragments) matches higher WL tests, but at the price of steep computational complexity or heavy overlap between tokens.
Depth separations: Certain computations (e.g., closed walk detection, connectivity, triangle counting) that need only shallow depth under one tokenization (e.g., random-walk, spectral) may require substantially more layers or be ill-conditioned under others (adjacency, truncation) (Bechler-Speicher et al., 21 May 2026).
Lossiness and ill-conditioning: Random-walk tokenizations are provably lossy (cannot decide planarity); spectral truncation loses local structure, is ill-conditioned for edge queries; adjacency is lossless but expensive (Bechler-Speicher et al., 21 May 2026).
Scalability: Token counts in node/edge-level schemes scale with $|\mathcal{V}|$ or $|\mathcal{V}| + |\mathcal{E}|$; BPE or quantization achieve sequence and memory compression.
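
For a sense of scale, the arithmetic below counts query-key pairs per attention layer for three of the schemes, assuming full attention over node-level and node-plus-edge sequences and per-node local attention over hop sequences. The graph size and hop count are made-up numbers, and `attention_pairs` is a hypothetical helper.

```python
def attention_pairs(num_nodes, num_edges, hops):
    """Number of query-key pairs per attention layer under each scheme."""
    return {
        # Full attention over one token per node.
        "node": num_nodes ** 2,
        # Full attention over node tokens plus edge tokens.
        "node+edge": (num_nodes + num_edges) ** 2,
        # Local attention: each node attends within its own hops + 1 tokens.
        "hop-local": num_nodes * (hops + 1) ** 2,
    }


costs = attention_pairs(num_nodes=1000, num_edges=5000, hops=3)
```

On this hypothetical graph the node-plus-edge sequence costs 36 times more pairs than the node-level one, while the local hop scheme needs 16,000 pairs in total, which is why hop-based models are the ones reported on very large graphs.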

Empirical evidence confirms these aspects: node+edge tokens (TokenGT) outperform GNN baselines on molecular regression; hop-based models (NAGphormer, Tokenphormer) achieve strong results on million-scale node classification (Kim et al., 2022, Chen et al., 2022, Zhou et al., 2024).

5. Position Encodings, Attention Biases, and Structural Integration

Because self-attention is permutation-equivariant and graphs have no canonical node order, positional or structural encodings are required to inject topology. Strategies include:

Absolute encodings: node degrees, Laplacian eigenvectors, ORF (Kim et al., 2022, Yuan et al., 23 Feb 2025)
Relative encodings: shortest path distance, random-walk distance, personalized PageRank, incorporated as attention bias
Token-type or hierarchy encodings: distinguishing node/edge/subgraph, embedding codebook or quantization stage (Xiang et al., 14 Oct 2025)
BPE token positions and substructure statistics in serialized sequences (Guo et al., 11 Mar 2026, Yang et al., 7 Apr 2026)

After tokenization, encoded tokens are embedded as $H^{(0)} \in \mathbb{R}^{T \times d}$ and, if appropriate, structural biases $b_{ij}$ (e.g., $b_{ij} = \phi(d_{ij})$, where $d_{ij}$ is the shortest-path distance between tokens $i$ and $j$ and $\phi$ is an MLP or lookup) are injected into self-attention logits (Yuan et al., 23 Feb 2025).
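
A minimal sketch of bias injection, assuming a single attention head, no learned query/key projections, and a lookup table indexed by shortest-path distance as the bias function. The table values and the names `shortest_path_matrix` and `biased_attention` are assumptions made for the example.

```python
import math
from collections import deque


def shortest_path_matrix(adj):
    """All-pairs hop distances via BFS; None marks unreachable pairs."""
    nodes = sorted(adj)
    index = {v: i for i, v in enumerate(nodes)}
    spd = [[None] * len(nodes) for _ in nodes]
    for s in nodes:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for v, d in dist.items():
            spd[index[s]][index[v]] = d
    return spd


def biased_attention(h, spd, bias_table):
    """Row-wise softmax of (h_i . h_j) / sqrt(d) + bias_table[distance(i, j)]."""
    n, d = len(h), len(h[0])
    last = len(bias_table) - 1  # bucket for far or unreachable pairs
    weights = []
    for i in range(n):
        logits = []
        for j in range(n):
            score = sum(a * b for a, b in zip(h[i], h[j])) / math.sqrt(d)
            dist = spd[i][j]
            bucket = last if dist is None or dist > last else dist
            logits.append(score + bias_table[bucket])
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Path graph 0 - 1 - 2 - 3 with identical token embeddings, so that any
# difference in attention comes from the structural bias alone.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
H = [[1.0, 0.0] for _ in range(4)]
bias_table = [0.0, -1.0, -2.0, -3.0]  # hypothetical lookup values
attn = biased_attention(H, shortest_path_matrix(adj), bias_table)
```

With identical embeddings the content scores cancel out, and node 0 attends most to itself and progressively less to nodes farther along the path, showing how the bias alone carries the topology into the attention pattern.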

Edge- and subgraph-tokens can also carry their own internal positional or structure-level context (e.g., start/end atom, fragment adjacency, ring type) (Yang et al., 7 Apr 2026).

6. Recent Advances: Quantized, BPE, and Task-Adaptive Tokenizations

Recent trends have substantially extended the toolkit:

Quantized tokenizers decouple GNN encoder training from Transformer fine-tuning, allow memory-efficient scaling, and optimize end-to-end quantized codebooks (GQT (Wang et al., 2024), QUIET (Xiang et al., 14 Oct 2025)).
BPE/fragment tokenization (BiScale-GTR (Yang et al., 7 Apr 2026), data-driven sequence tokenization (Guo et al., 11 Mar 2026)) enables the discovery of interpretable, multi-node functional motifs aligned with domain structure, with substantial sequence compression and empirical accuracy gains for molecular tasks.
Composite and multi-token designs (Tokenphormer (Zhou et al., 2024), NTFormer (Chen et al., 2024)) integrate diverse topological and semantic signals, showing state-of-the-art performance across heterophilous and homophilous graphs.
Instruction-oriented LLM tokenization (GraphTokenLLMs (Zhang et al., 5 May 2026)) compresses structural graph information to learned tokens for LLM input, but reveals susceptibility to over-sensitivity and limited semantic grounding unless extensively instruction-tuned.

7. Open Challenges and Prospective Directions

Active challenges and frontiers for graph tokenization research include (Yuan et al., 23 Feb 2025, Zhang et al., 5 May 2026, Xiang et al., 14 Oct 2025):

Scalability and efficiency: Reducing cost of subgraph/hop tokenization on large and dense graphs (requiring downsampling, patchification, or local attention).
Information redundancy and overlap: Heavy token redundancy due to subgraph overlap, especially with ego-centric or motif-based schemes.
Dynamic and heterogeneous graphs: Tokenizers that adapt to temporal, evolving, or typed (multi-relational) structures.
Automated/adaptive tokenization: Learning the optimal token granularity—rather than committing to node, edge, or subgraph a priori—potentially via differentiable token selection or contrastive objectives.
Interpretability and robustness: Designing tokens that map back to human-comprehensible graph features and preserve task-robustness against instruction or structural perturbations.
Foundational graph models: Universal pretraining regimes for graph Transformers (akin to LLMs), necessitating expressive yet efficient tokenizers.

Future work is aimed at integrating task-adaptive quantization, learnable multi-scale fragmentization, meta-instruction tuning for LLMs, and hybrid structure-text token inputs. Addressing these issues is central to unlocking robust, scalable, and general-purpose graph inductive biases in large Transformer architectures.

References:

(Yuan et al., 23 Feb 2025) A Survey of Graph Transformers: Architectures, Theories and Applications
(Kim et al., 2022) Pure Transformers are Powerful Graph Learners
(Yang et al., 7 Apr 2026) BiScale-GTR: Fragment-Aware Graph Transformers for Multi-Scale Molecular Representation Learning
(Guo et al., 11 Mar 2026) Graph Tokenization for Bridging Graphs and Transformers
(Xiang et al., 14 Oct 2025) A Hierarchical Quantized Tokenization Framework for Task-Adaptive Graph Representation Learning
(Wang et al., 2024) Learning Graph Quantized Tokenizers
(Zhou et al., 2024) Tokenphormer: Structure-aware Multi-token Graph Transformer for Node Classification
(Chen et al., 2022) NAGphormer: A Tokenized Graph Transformer for Node Classification in Large Graphs
(Yu et al., 2 Jan 2025) Graph2text or Graph2token: A Perspective of LLMs for Graph Learning
(Bechler-Speicher et al., 21 May 2026) Lost in Tokenization: Fundamental Trade-offs in Graph Tokenization for Transformers
(Chen et al., 12 Feb 2025) Rethinking Tokenized Graph Transformers for Node Classification
(Biparva et al., 2024) Todyformer: Towards Holistic Dynamic Graph Transformers with Structure-Aware Tokenization
(Chen et al., 2024) NTFormer: A Composite Node Tokenized Graph Transformer for Node Classification
(Zhang et al., 5 May 2026) Revisiting Graph-Tokenizing LLMs: A Systematic Evaluation of Graph Token Understanding