
Graph Transformer Models: A Survey

Updated 22 February 2026
  • Graph transformer models are neural architectures that extend self-attention from sequences to graphs by incorporating explicit structural encodings and graph inductive biases.
  • They leverage mechanisms like sparse attention, edge embeddings, and positional encodings to capture local and global interactions, improving scalability and expressivity.
  • These models are applied across numerous domains—ranging from molecular property prediction to node and edge classification—delivering enhanced analytical and predictive capabilities.

Graph transformer models are neural architectures that generalize the self-attention principle of Transformers from sequences to graph-structured data. These models integrate graph inductive biases and various mechanisms for representing, propagating, and aggregating information over nodes, edges, and higher-order structures, thereby enabling expressivity beyond classical message-passing GNNs or standard sequence transformers. The field is rapidly advancing, producing both general-purpose frameworks and highly specialized architectures for applications across chemistry, NLP, vision, heterogeneous data, and beyond.

1. Architectural Principles and Inductive Biases

The defining feature of graph transformers is their adaptation of self-attention to graphs by explicitly incorporating graph topology, structural relationships, and node/edge attributes (Shehzad et al., 2024). In canonical transformer models, self-attention is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V,$$

where $Q$, $K$, and $V$ are learned projections of the token representations. When applied to a graph $G=(V,E)$ with feature matrix $X \in \mathbb{R}^{n \times d}$, graph transformer models typically adjust the self-attention mechanism through structural modifications such as attention masks, learned attention-bias terms, edge embeddings, and positional encodings.

Key inductive biases employed include enforcing locality (via attention masks or cluster assignment), modeling long-range dependencies (via global tokens, centroid attention, or high-order path features), and reflecting node or edge diversity (by multi-element tokenization or multi-relational attention).
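As a concrete illustration, the masked-attention adjustment described above can be sketched in a few lines of NumPy. This is a minimal single-head sketch, not any specific model's implementation: the function name `graph_attention` and the dense-matrix formulation (including the optional `edge_bias` term standing in for, e.g., a shortest-path or edge-type bias) are illustrative assumptions.

```python
import numpy as np

def graph_attention(X, A, Wq, Wk, Wv, edge_bias=None):
    """Single-head self-attention restricted to graph edges (illustrative sketch).

    X : (n, d) node features
    A : (n, n) adjacency matrix (1 = edge; self-loops included)
    edge_bias : optional (n, n) learned structural bias added to the logits
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    dk = K.shape[-1]
    logits = (Q @ K.T) / np.sqrt(dk)           # standard scaled dot-product
    if edge_bias is not None:
        logits = logits + edge_bias            # soft structural inductive bias
    logits = np.where(A > 0, logits, -1e9)     # hard locality: attend only along edges
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Removing the mask (all-ones `A`) recovers standard global self-attention; the mask and bias are exactly where the graph inductive biases enter.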

2. Variants of Graph Transformer Blocks

Several principled variants of graph-transformer blocks have emerged, typically differentiated by the mechanism(s) used to encode graph structure and improve scalability:

  • Sparse or sampled attention: Deformable Graph Transformer (DGT) uses dynamically sampled, ordered node sequences per query node—by BFS distance, PPR, or feature similarity—restricting attention to a small task-relevant window and maintaining linear complexity (Park et al., 2022). VCR-Graphormer similarly builds each node's token set from top-$k$ PPR and virtual connections, enabling efficient mini-batch training (Fu et al., 2024).
  • Explicit edge and relation injection: Models such as SynG2G-Tr employ learned relation embeddings to bias attention in both key and query spaces, introducing a soft inductive bias for syntactic or semantic edges at every layer and head (Mohammadshahi et al., 2021). Graph-to-Graph Transformers systematically inject explicit input and (optionally) latent/target graphs into attention, allowing for graph refinement and non-autoregressive prediction (Henderson et al., 2023).
  • Multi-path and propagation attention: GPTrans proposes a three-path Graph Propagation Attention (GPA) scheme, with explicit node-to-node, node-to-edge, and edge-to-node propagation within each transformer block (Chen et al., 2023). TIGT employs dual-path message passing—over both the original adjacency and the cycle-enriched universal cover—coupled with standard global attention and channel-wise recalibration (Choi et al., 2024).
  • Clustering and coarsening: PatchGT first applies non-trainable spectral clustering to form patches (clustered node sets), runs GNNs on both nodes and patches, and finally uses a transformer over patches—improving expressivity (beyond 1-WL) while cutting quadratic attention costs (Gao et al., 2022).
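The sparse/sampled-attention variant can be sketched as follows: each node attends only to a fixed-size token set chosen offline by a relevance score. The function `topk_token_attention` and the dense `scores` matrix are hypothetical simplifications—DGT and VCR-Graphormer use more elaborate samplers (BFS ordering, PPR, virtual connections)—but the cost structure is the same.

```python
import numpy as np

def topk_token_attention(X, scores, k, Wq, Wk, Wv):
    """Per-node attention over a fixed-size token set (sketch of DGT/VCR-Graphormer).

    scores : (n, n) relevance matrix (e.g. personalized PageRank or feature
    similarity) used offline to pick each node's k attention targets.
    """
    n = X.shape[0]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    dk = K.shape[-1]
    out = np.empty_like(V)
    for i in range(n):
        idx = np.argsort(-scores[i])[:k]           # top-k token set for node i
        logits = (Q[i] @ K[idx].T) / np.sqrt(dk)   # attention over k tokens only
        w = np.exp(logits - logits.max())
        w /= w.sum()
        out[i] = w @ V[idx]                        # O(n*k) total instead of O(n^2)
    return out
```

Because the token sets are computed offline, each mini-batch touches only `k` rows of `K` and `V` per query node, which is what makes mini-batch training on large graphs feasible.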

3. Position and Structure Encoding

Position and structural encoding is essential for breaking permutation symmetry and endowing the model with a notion of a node's role in graph topology (Shehzad et al., 2024). Approaches include:

  • Spectral methods: Laplacian eigenvector encodings are widely used, particularly in pre-training and molecular benchmarks, but suffer from cubic cost in graph size and are non-local (Chen et al., 2023, Gao et al., 2022).
  • Random walk and diffusion: Personalized PageRank, random walk return probabilities, and truncated Katz indices provide multi-scale information with local computation and support for large graphs (Park et al., 2022, Fu et al., 2024).
  • Motif and universal cover: TIGT's topological positional encoding uses clique adjacency derived from basis cycles, preserving isomorphism information lost by random-walk and spectral methods (Choi et al., 2024).
  • Multi-element tokenization: For heterogeneous or relational graphs, per-node tokens may concatenate type, hop count, timestamp, and a local GNN-PE, enabling rich, scalable, and schema-aware representations (Dwivedi et al., 16 May 2025).

These encodings are often combined or adaptively selected for the target domain.
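The two most common families above can be sketched directly. The names `laplacian_pe` and `random_walk_pe` are illustrative; practical implementations use sparse eigensolvers rather than the dense `eigh` shown here, which exhibits exactly the cubic cost noted above.

```python
import numpy as np

def laplacian_pe(A, k):
    """First k non-trivial eigenvectors of the symmetric normalized Laplacian.
    Dense eigendecomposition is O(n^3) -- the cost noted in the text."""
    d = A.sum(axis=1)
    Dinv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(A)) - Dinv_sqrt @ A @ Dinv_sqrt
    vals, vecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    return vecs[:, 1:k + 1]               # skip the trivial constant eigenvector

def random_walk_pe(A, steps):
    """Return probabilities of a random walk after 1..steps hops (RWPE-style).
    Purely local computation, so it scales to large graphs."""
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # row-stochastic
    pe, Pk = [], np.eye(len(A))
    for _ in range(steps):
        Pk = Pk @ P
        pe.append(np.diag(Pk))            # probability of returning to the start node
    return np.stack(pe, axis=1)           # (n, steps)
```

On a cycle, for example, every node has return probability 0 after one step and 1/2 after two, so RWPE encodes local ring structure without any global computation.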

4. Scalability and Efficiency

Quadratic attention cost ($O(n^2)$ per layer) remains a primary barrier for scaling graph transformers to large graphs. Thus, advanced architectures exploit:

  • Neighborhood sampling: Limiting attention to k-hop, top-PPR, cluster, or subgraph neighborhoods, supporting mini-batch processing while retaining expressivity (Park et al., 2022, Fu et al., 2024, Gao et al., 2022).
  • Token-list strategies: Assigning a fixed-size token list to each node (via offline computation of PPR, virtual connections, or eigenvectors) decouples online computation from graph size (Fu et al., 2024).
  • Global compression: Using centroid tokens or global queries (by K-means, EM, or virtual tokens), nodes may efficiently pool and attend to database- or batch-wide context (Dwivedi et al., 16 May 2025, Chen et al., 2023).
  • Clustering and patching: Patch-based transformers build a two-level (node-patch) hierarchy, reducing attention cost to $O(k^2)$ where $k \ll n$ (Gao et al., 2022).
  • Hybrid approaches: Some models, such as Contextual Graph Transformer (CGT), combine GNN layers for local structure with light transformer layers for global sequence context, achieving parameter efficiency and domain adaptation (Reddy et al., 4 Aug 2025).
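The clustering-and-patching strategy can be sketched as a two-level attention pass. The function `patch_attention` is a hypothetical simplification of PatchGT: mean pooling here stands in for its GNN patch encoder, and the patch assignment is assumed to come from offline spectral clustering.

```python
import numpy as np

def patch_attention(X, patch_of, Wq, Wk, Wv):
    """Two-level sketch: pool nodes into patches, run full attention over the
    k patch tokens (O(k^2), k << n), then broadcast patch context back to nodes.

    patch_of : (n,) patch index per node, e.g. from offline spectral clustering.
    """
    k = patch_of.max() + 1
    P = np.zeros((k, X.shape[1]))
    for p in range(k):
        P[p] = X[patch_of == p].mean(axis=0)   # mean pooling stands in for a GNN
    Q, K, V = P @ Wq, P @ Wk, P @ Wv
    logits = (Q @ K.T) / np.sqrt(K.shape[-1])
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    ctx = w @ V                                # patch-level attention, cost O(k^2)
    return ctx[patch_of]                       # each node receives its patch's context
```

The quadratic term now depends only on the number of patches, so doubling the graph size while keeping `k` fixed leaves the attention cost unchanged.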

Empirical studies confirm that such modifications allow state-of-the-art accuracy while vastly improving memory, speed, and scalability (Park et al., 2022, Chen et al., 2023, Fu et al., 2024, Dwivedi et al., 16 May 2025).

5. Application Domains

Graph transformer models have been successfully applied to a broad range of domains (Shehzad et al., 2024, He et al., 2023, Dwivedi et al., 16 May 2025, Yang et al., 2023):

| Task Type | Examples / Domains | Reference |
|---|---|---|
| Node Classification | Citation networks, ogbn-arxiv, proteins, relational DBs | (Park et al., 2022, Dwivedi et al., 16 May 2025) |
| Edge Prediction | Link prediction, knowledge graph completion, DTI | (Shehzad et al., 2024) |
| Graph Classification | Molecular property (ZINC, MolHIV), vision graphs | (Chen et al., 2023, Gao et al., 2022) |
| Graph Generation | Molecule synthesis, layout planning, graph GANs | (Yoo et al., 2020, Tang et al., 2024) |
| RL/Decision Making | Trajectory planning, offline RL, causal decision making | (Hu et al., 2023) |
| NLP/Linguistics | Dependency parsing, SRL, AMR-to-text, coreference | (Mohammadshahi et al., 2021, Cai et al., 2019) |
| Multimodal QA | Language–vision fusion, scene graphs | (He et al., 2023) |

In molecular property prediction and structure-based design, Graphormer, EGT, GPTrans, PatchGT, and TIGT set state-of-the-art benchmarks (e.g. ZINC, PCQM4M, MolHIV, MolPCBA) (Chen et al., 2023, Gao et al., 2022, Choi et al., 2024). In industrial relational learning, RelGT achieves up to 18% gain over GNNs on the RelBench suite (Dwivedi et al., 16 May 2025). For multimodal QA and scene understanding, Multimodal Graph Transformers leverage joint graph masks for superior reasoning (He et al., 2023). In graph generation, GTGAN, Gransformer, and GRAT demonstrate the feasibility of autoregressive, globally coherent synthesis (Tang et al., 2024, Khajenezhad et al., 2022, Yoo et al., 2020).

6. Theoretical Expressivity and Limitations

Recent work provides rigorous analysis of the expressivity of graph transformers:

  • Beyond 1-Weisfeiler-Lehman (1-WL): PatchGT and TIGT employ non-learned spectral clustering and topological covers to distinguish non-isomorphic graphs that no 1-WL–bounded GNN can separate (Gao et al., 2022, Choi et al., 2024).
  • Limits of positional encodings: TIGT shows that standard random walk or spectral positional encodings can collapse and fail to distinguish cycle-rich graphs, whereas topological covers introduce strictly stronger invariants (Choi et al., 2024).
  • Global vs. local aggregation: Fully global attention enables efficient modeling of long-range dependencies and reentrant structures (e.g., in AMR or syntax graphs) in a small number of layers; in contrast, deep local GNNs suffer over-smoothing and signal squashing (Cai et al., 2019, Chen et al., 2023).
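The 1-WL boundary referenced above is easy to demonstrate: color refinement assigns identical color histograms to a 6-cycle and to two disjoint triangles (both 2-regular), so any 1-WL–bounded GNN maps them to the same representation, while cycle-aware encodings (as in TIGT) or spectral patches (as in PatchGT) can tell them apart. A minimal sketch:

```python
from collections import Counter

def wl_hash(adj, rounds=3):
    """1-WL color refinement: returns the multiset of node colors after refinement.
    Different hashes imply non-isomorphic graphs; equal hashes are inconclusive."""
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        # each node's new signature: own color + sorted multiset of neighbor colors
        new = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        ids = {s: i for i, s in enumerate(sorted(set(new.values())))}
        colors = {v: ids[new[v]] for v in adj}   # compress signatures to compact ids
    return Counter(colors.values())

# C6 (one 6-cycle) vs 2 x C3 (two disjoint triangles): both 2-regular,
# so 1-WL refinement never separates them.
c6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_c3 = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
```

Running `wl_hash` on both graphs yields the same histogram, even though one graph contains triangles and the other does not—precisely the cycle information that topological covers recover.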

However, quadratic cost remains a challenge for large graphs (Shehzad et al., 2024). Local or mini-batch methods depend on the quality of precomputed neighborhoods or clusters (e.g., node-sorting criteria in DGT), and performance may drop if these do not capture task-relevant structure (Park et al., 2022). For domains with complex, dynamic, or evolving graphs, the choice and update of structural priors remains an open problem.

7. Open Challenges and Future Directions

Based on recent surveys and the latest research trajectories, active challenges and future topics include (Shehzad et al., 2024):

  • Further scalability: Linear, sparse, or kernelized attention mechanisms and streaming tokenization strategies are needed for million-node graphs.
  • Dynamic and heterogeneous graphs: Methods for temporal-causal attention, continual learning on evolving graphs, and extension to multi-relational or federated data are emerging (Dwivedi et al., 16 May 2025).
  • Interpretability and controllability: Improving explainability of attention weights, developing graph-specific attribution, and integrating symbolic or constraint-based priors into transformers.
  • Isomorphism-invariance and topological completeness: Incorporating topological and homological invariants (universal covers, persistent homology) directly into architectural modules, as initiated by TIGT (Choi et al., 2024).
  • Robustness to data quality: Handling noise, incomplete labels, and distributional shifts via robust loss functions and self-supervised or contrastive learning.
  • Automated design and domain adaptation: Systematic hybridization of GNNs, transformers, and clustering for rapid deployment across new graph-structured domains.

Graph transformers mark a convergence of deep learning, spectral graph theory, relational modeling, and topological data analysis. Ongoing work continues to expand their scalability, expressivity, and applicability across scientific, industrial, and multi-modal settings.
