TokenGT: Tokenized Graph Transformer
- TokenGT is a framework that tokenizes graph elements to convert nodes and edges into Transformer tokens enriched with structural encodings.
- It leverages standard Transformer encoders with self-attention to process graph tokens, enabling robust graph representation and prediction.
- Specialized variants like SwapGT and Brain TokenGT achieve state-of-the-art results in molecular property prediction, node classification, and biomedical graph analysis.
TokenGT denotes a family of models and principles for applying Transformers to graph-structured data using explicit tokenization of graph elements. The concept originated in "Pure Transformers are Powerful Graph Learners" (Kim et al., 2022), which established the canonical Tokenized Graph Transformer (TokenGT) architecture for both theoretical and empirical graph learning. Follow-up work extends TokenGT to specialized domains and enhances its core mechanisms, with application areas spanning large-scale molecular property prediction, node classification, and spatio-temporal biomedical graphs (Chen et al., 12 Feb 2025, Dong et al., 2023). TokenGT frameworks operate by converting graph nodes and edges into Transformer tokens, enriching them with structural encodings, and performing self-attention over the resulting token sequence. This approach enables the use of standard Transformer architectures with minimal graph-specific modifications and provably exceeds the representational power of message-passing GNNs.
1. Core Principle: Tokenization of Graph Elements
The fundamental operation in TokenGT is the conversion of a graph $G=(V,E)$, with node features $X^V \in \mathbb{R}^{n \times d}$ and optionally edge features $X^E \in \mathbb{R}^{m \times d}$, into a sequence of tokens. Each node $v \in V$ and each edge $(u,v) \in E$ maps to a token embedding. These embeddings concatenate raw feature vectors with two structural components: node identifiers ($P_v$), which are orthogonal random features or Laplacian eigenvectors acting as positional encodings, and type identifiers ($E^{\mathcal{V}}, E^{\mathcal{E}}$), which denote the token's nature (node or edge) (Kim et al., 2022).
For example:
- Node token: $[x_v,\ P_v,\ P_v,\ E^{\mathcal{V}}]$
- Edge token: $[x_{(u,v)},\ P_u,\ P_v,\ E^{\mathcal{E}}]$
Node identifiers may be constructed via orthogonal random features or as spectral encodings from the eigenvectors of the graph Laplacian, satisfying the orthonormality condition $P P^{\top} = I_n$.
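The following minimal sketch illustrates this tokenization step in PyTorch. The helper names (`orthonormal_node_ids`, `build_tokens`) and the dimension choices are illustrative assumptions, not the reference implementation; only the token layout mirrors the description above.

```python
import torch

def orthonormal_node_ids(n: int, d_p: int) -> torch.Tensor:
    """Orthogonal random features: rows of a random orthonormal matrix (assumes d_p >= n)."""
    q, _ = torch.linalg.qr(torch.randn(d_p, n))   # q: (d_p, n) with orthonormal columns
    return q.T                                    # P: (n, d_p) with P @ P.T == I_n

def build_tokens(x_v, x_e, edge_index, d_p):
    """Concatenate raw features, node identifiers, and type identifiers.

    x_v: (n, d) node features, x_e: (m, d) edge features,
    edge_index: (2, m) long tensor of (u, v) endpoint indices.
    """
    n, m = x_v.size(0), x_e.size(0)
    P = orthonormal_node_ids(n, d_p)                        # node identifiers
    type_ids = torch.eye(2)                                 # row 0 = node type, row 1 = edge type
    # Node token: [x_v, P_v, P_v, E^V]
    node_tok = torch.cat([x_v, P, P, type_ids[0].expand(n, -1)], dim=-1)
    # Edge token: [x_(u,v), P_u, P_v, E^E]
    u, v = edge_index
    edge_tok = torch.cat([x_e, P[u], P[v], type_ids[1].expand(m, -1)], dim=-1)
    return torch.cat([node_tok, edge_tok], dim=0)           # (n + m, d + 2*d_p + 2)
```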
2. Transformer Backbone and Readout Mechanisms
TokenGT utilizes standard Transformer encoders in the Vaswani et al. (2017) style, optionally with Pre-LayerNorm. A special [graph] token is prepended to the token list to facilitate graph-level tasks. The token sequence is projected to the model dimension and processed by layers of multi-head self-attention and two-layer feedforward MLPs, using the canonical Transformer update rules. The attention mechanism is unchanged:

$$\mathrm{Attn}(X) = \mathrm{softmax}\!\left(\frac{(XW_Q)(XW_K)^{\top}}{\sqrt{d_k}}\right) XW_V,$$

where $X$ is the input token matrix and $W_Q, W_K, W_V$ are learned projection parameters. For prediction tasks, the final [graph] token or specific token subsets are read out by a linear classifier.
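A compact sketch of this backbone and readout is shown below, using PyTorch's built-in encoder modules. The class name, hyperparameter defaults, and the single-graph batching convention are assumptions for illustration; real implementations additionally pad and mask variable-length token sequences.

```python
import torch
import torch.nn as nn

class TokenGTBackbone(nn.Module):
    """Standard Transformer encoder over graph tokens with a [graph] readout token."""
    def __init__(self, token_dim, d_model=256, n_heads=8, n_layers=6, n_classes=1):
        super().__init__()
        self.proj = nn.Linear(token_dim, d_model)            # project tokens to model dimension
        self.graph_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)                # Pre-LayerNorm variant
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)             # linear readout

    def forward(self, tokens):                                # tokens: (B, n+m, token_dim)
        h = self.proj(tokens)
        g = self.graph_token.expand(h.size(0), -1, -1)        # prepend the [graph] token
        h = self.encoder(torch.cat([g, h], dim=1))
        return self.head(h[:, 0])                             # graph-level prediction
```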
No graph-specific inductive biases are injected other than the identifiers. This decoupling allows TokenGT to be flexibly adapted to both graph- and non-graph domains and simplifies implementation.
3. Theoretical Expressivity and Inductive Bias
TokenGT is proven to be strictly more expressive than message-passing GNNs. Formally, any function computed by a 2-IGN (second-order Invariant Graph Network) can be represented by a sufficiently parameterized TokenGT:

$$\forall f \in 2\text{-IGN},\ \exists\,\theta:\ \mathrm{TokenGT}_{\theta}(G) = f(G)\ \text{for all graphs } G.$$

This implies TokenGT achieves at least the power of the 2-Weisfeiler–Lehman (2-WL) test and surpasses GNNs limited to first-order message passing (Kim et al., 2022). The critical factors are (i) rich, discriminative identifiers and (ii) flexible self-attention over all tokens: attention heads can simulate the permutation-equivariant basis operations present in invariant graph models.
4. Specialized Variants and Extensions
TokenGT’s foundational paradigm has been extended in several directions:
- SwapGT: For node classification, SwapGT constructs token sequences by sampling $k$-NN neighborhoods on attribute- and topology-induced similarity graphs, then applies multi-step token swapping to incorporate higher-order semantic locality. The Transformer backbone learns from multiple such augmented sequences per node, and a center-alignment loss penalizes divergence among the learned node representations across augmentations (a toy version of this loss is sketched after this list). SwapGT achieves state-of-the-art performance on diverse benchmarks, with pronounced gains under label scarcity (Chen et al., 12 Feb 2025).
- Brain TokenGT: Applies TokenGT to dynamic fMRI connectome graphs, implementing specialized modules: (i) a Graph Invariant and Variant Embedding (GIVE) pipeline for spatio-temporal tokenization, and (ii) a Brain-Informed Graph Transformer Readout (BIGTR) using categorical type and node identifiers. Node and edge embeddings are computed via recurrent GCN and dual-hypergraph convolutions, preserving both spatial and temporal connectome information. This structure yields superior performance for Alzheimer’s-related classification tasks, with built-in interpretability mechanisms (Dong et al., 2023).
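To make the center-alignment idea concrete, here is a toy sketch rather than SwapGT's exact formulation: each node's embedding from every augmented token sequence is pulled toward the per-node mean ("center") across views. The function name and the mean-squared penalty are illustrative assumptions.

```python
import torch

def center_alignment_loss(views: list[torch.Tensor]) -> torch.Tensor:
    """Toy center-alignment loss over multiple augmented views.

    views: list of (n, d) node-embedding tensors, one per augmented token sequence.
    Each view is pulled toward the per-node mean embedding across all views.
    """
    center = torch.stack(views).mean(dim=0)                    # (n, d) per-node center
    # Detaching treats the center as a fixed target for each view (one possible design choice).
    return sum(((z - center.detach()) ** 2).mean() for z in views) / len(views)
```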
5. Implementation Considerations and Optimization
TokenGT in its vanilla form incurs $O((n+m)^2)$ complexity due to global attention over all $n+m$ node and edge tokens. Several optimizations and implementation details have emerged:
- Linear-cost attention via kernelized methods such as Performer reduces scaling to $O(n+m)$, enabling application to graphs with tens of thousands of nodes (Kim et al., 2022); a simplified sketch of this kernel trick appears at the end of this section.
- Pre-LayerNorm is used for stability; regularization strategies include eigenvector dropout and random sign flipping for Laplacian-based identifiers.
- For large-scale or specialized graph tasks, sparse equivariant residuals may be introduced to preserve certain basis elements exactly.
The orthonormal identifier requirement poses scalability constraints for extremely large graphs, motivating research into approximate or learned positional encodings.
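The sketch below shows the kernelized attention trick in its simplest form: the softmax kernel is approximated by a positive random-feature map $\phi$, so attention becomes $\phi(Q)\,(\phi(K)^{\top}V)$ and cost grows linearly in the token count. This is a simplified stand-in for Performer's FAVOR+ (plain Gaussian rather than orthogonal random features, no feature redrawing); the function and parameter names are assumptions.

```python
import torch

def linear_attention(q, k, v, n_features=64):
    """Kernelized (Performer-style) attention: softmax(QK^T / sqrt(d)) V is approximated
    by phi(Q) (phi(K)^T V), costing O(L * n_features * d) instead of O(L^2 * d)."""
    d = q.size(-1)
    w = torch.randn(n_features, d)                        # random features (FAVOR+ uses orthogonal w)

    def phi(x):                                           # positive random-feature map for the softmax kernel
        xs = x / d ** 0.25                                # fold in the 1/sqrt(d) attention temperature
        return torch.exp(xs @ w.T - xs.pow(2).sum(-1, keepdim=True) / 2) / n_features ** 0.5

    qf, kf = phi(q), phi(k)                               # (L, n_features) each
    kv = kf.T @ v                                         # (n_features, d_v), computed once
    normalizer = qf @ kf.sum(dim=0, keepdim=True).T       # (L, 1) approximate softmax denominator
    return (qf @ kv) / (normalizer + 1e-6)
```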
6. Empirical Evaluation and Benchmark Performance
TokenGT demonstrates strong empirical performance across diverse benchmarks:
- On PCQM4Mv2 molecular property prediction, TokenGT achieves validation MAE of 0.0910 (Laplacian identifiers) and 0.0935 with linear attention, outperforming all GNN baselines and closely matching the best specialized Graph Transformers (Kim et al., 2022).
- In node classification, SwapGT consistently outperforms both GNNs and previous transformers across eight datasets, especially under sparsely-labeled regimes (e.g., on the Photo dataset, SwapGT: 92.93% vs. 92.24% for best GNN) (Chen et al., 12 Feb 2025).
- In longitudinal connectomics, Brain TokenGT yields AUC = 90.48% for MCI detection and AUC = 87.14% for AD conversion, outperforming shallow and temporal GNN baselines (Dong et al., 2023).
Ablation studies confirm the essential roles of both identifier components (node and type) as well as the variant-specific architectural innovations (multi-view center-alignment loss, semantic-swap augmentation).
7. Open Directions and Future Work
Active research addresses remaining limitations and open theoretical questions:
- Reducing quadratic complexity via sparse, structured, or kernel-based attention mechanisms.
- Generalization to hypergraphs with higher-order expressivity (beyond 2-IGN).
- Enhanced identifier schemes (e.g., sign-invariant spectral signatures, PDE-based codes).
- Integration with in-context learning workflows, pre-training routines, and hybrid forms (e.g., CoAtNet-like convolution-attention pipelines).
- Application to autoregressive graph decoding and multi-modal graph representations.
A plausible implication is that continued progress in identifying discriminative, scalable token encodings and efficient attention mechanisms will further consolidate TokenGT’s position as a unifying foundation for graph-transformer models.