
Transformers as Graph Neural Networks

Updated 1 July 2025
  • Transformers as graph neural networks are adaptations that restrict self-attention to graph neighborhoods and employ Laplacian-based positional encodings to capture graph topology.
  • They integrate edge features into the attention mechanism and use batch normalization to enhance convergence and performance over traditional message-passing approaches.
  • Empirical studies on molecular and network tasks show that these models effectively combine global dependencies with local structures for competitive graph learning.

Transformers, originally developed for natural language processing, have been generalized into a robust class of Graph Neural Networks (GNNs). This direction combines the global, permutation-invariant aggregation capacity of the transformer architecture with the inductive biases required for learning on arbitrary graph-structured data. Recent developments formalize and operationalize a spectrum of approaches, from restricting attention to graph neighborhoods to extending tokenization, positional encoding, and edge modeling, in order to surpass traditional message-passing neural network (MPNN) formulations.

1. Foundations and Motivations

The foundational transformer, as introduced by Vaswani et al., employs a self-attention mechanism wherein each token in a sequence attends to all others, equivalent to defining a fully connected graph on the inputs. This mechanism provides strong expressivity for modeling pairwise, and implicitly higher-order, dependencies among sequence elements. However, in domains where data is naturally structured as graphs—such as molecules, social networks, or knowledge graphs—relying on a fully connected attention scheme neglects the sparsity and inductive biases encoded by actual graph connectivity.
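As a concrete illustration of this equivalence, the short PyTorch sketch below (not from the referenced paper; names and shapes are illustrative) shows that attention masked by an all-ones adjacency matrix, i.e. treating every token as every other token's neighbor, reproduces ordinary dense self-attention exactly:

```python
# Illustrative sketch: dense self-attention is graph attention over a complete graph.
# Masking with an all-ones adjacency (every token is a neighbor of every other token)
# reproduces unmasked attention exactly; a sparse adjacency restricts it to real edges.
import torch
import torch.nn.functional as F

def masked_attention(h, adj, Wq, Wk, Wv):
    """Single-head attention restricted to pairs (i, j) with adj[i, j] == 1."""
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)
    scores = scores.masked_fill(adj == 0, float('-inf'))  # drop non-neighbors
    return F.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
n, d = 5, 8
h = torch.randn(n, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

dense = F.softmax((h @ Wq) @ (h @ Wk).T / d ** 0.5, dim=-1) @ (h @ Wv)  # standard attention
complete_graph = masked_attention(h, torch.ones(n, n), Wq, Wk, Wv)      # clique mask
print(torch.allclose(dense, complete_graph))  # True: the two views coincide
```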

To address this shortcoming, research has focused on adapting the transformer architecture to arbitrary graphs by constraining attention, refining positional encodings beyond sequential signals, replacing normalization strategies, and leveraging edge features. This line of work closes the gap between transformers (which are GNNs operating on cliques or line graphs) and MPNNs, yielding architectures compatible with arbitrary graph topologies and features (2012.09699).

2. Neighborhood-Aware Attention Mechanism

The core adaptation in graph transformers is the restriction of the attention mechanism to graph neighborhoods. Rather than attending globally, each node $i$ computes attention solely over its immediate graph neighbors $j \in \mathcal{N}_i$, integrating both learned and structural inductive biases:

$\hat{h}_{i}^{\ell+1} = O_h^{\ell} \, \big\Vert_{k=1}^{H} \Big( \sum_{j \in \mathcal{N}_i} w_{ij}^{k,\ell} V^{k,\ell} h_j^{\ell} \Big)$

$w_{ij}^{k,\ell} = \mathrm{softmax}_j \left( \frac{Q^{k,\ell} h_i^{\ell} \cdot K^{k,\ell} h_j^{\ell}}{\sqrt{d_k}} \right)$

Here, $w_{ij}^{k,\ell}$ represents the attention from $i$ to $j$ at head $k$ and layer $\ell$, focused exclusively on valid graph neighbors. This modification introduces sparsity reflecting actual topology, enabling linear scaling with the number of edges instead of quadratic scaling with node count, and provides strong inductive bias to exploit the underlying graph structure.
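The following is a minimal PyTorch sketch of this neighborhood-restricted multi-head attention. It illustrates the equations above rather than reproducing the reference implementation; a production version would use sparse message passing (e.g. via DGL or PyTorch Geometric) instead of a dense adjacency mask, and the class and variable names here are assumptions.

```python
# Minimal sketch of neighborhood-restricted multi-head attention (dense mask version).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphMultiHeadAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.dk = num_heads, dim // num_heads
        self.Q = nn.Linear(dim, dim)
        self.K = nn.Linear(dim, dim)
        self.V = nn.Linear(dim, dim)
        self.O = nn.Linear(dim, dim)   # O_h^ell: projection of the concatenated heads

    def forward(self, h, adj):
        # h: (n, dim) node features, adj: (n, n) 0/1 adjacency (self-loops recommended)
        n = h.shape[0]
        q = self.Q(h).view(n, self.num_heads, self.dk)
        k = self.K(h).view(n, self.num_heads, self.dk)
        v = self.V(h).view(n, self.num_heads, self.dk)
        # scores[a, i, j] = <Q^a h_i, K^a h_j> / sqrt(d_k) for head a
        scores = torch.einsum('ihd,jhd->hij', q, k) / self.dk ** 0.5
        # keep only real edges: the softmax runs over j in N_i only
        scores = scores.masked_fill(adj.unsqueeze(0) == 0, float('-inf'))
        w = F.softmax(scores, dim=-1)
        out = torch.einsum('hij,jhd->ihd', w, v).reshape(n, -1)  # concatenate heads
        return self.O(out)

# toy usage: 4 nodes on a path graph with self-loops
adj = torch.tensor([[1., 1., 0., 0.],
                    [1., 1., 1., 0.],
                    [0., 1., 1., 1.],
                    [0., 0., 1., 1.]])
layer = GraphMultiHeadAttention(dim=16, num_heads=4)
print(layer(torch.randn(4, 16), adj).shape)  # torch.Size([4, 16])
```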

3. Laplacian-Based Positional Encodings

Traditional transformers use sinusoidal positional encodings to capture the linear order of sequences. Graphs, in contrast, lack such order; therefore, node positional information is introduced by leveraging spectral properties of the graph Laplacian. Each node $i$ receives a positional encoding derived from the components of the $k$ smallest nontrivial Laplacian eigenvectors:

$\Delta = I - D^{-1/2} A D^{-1/2} = U^T \Lambda U$

$h_i^0 = \hat{h}_i^0 + C^0 \lambda_i + c^0$

Here, $A$ is the adjacency matrix, $D$ the degree matrix, $U$ the matrix of Laplacian eigenvectors, and $\lambda_i$ the vector of eigenvector components assigned to node $i$. This encoding generalizes sequential positional encoding and provides both local and global structural context, essential for distinguishing node roles within arbitrary graphs. To deal with eigenvector sign ambiguity, random sign flipping is employed during training.
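The NumPy sketch below is a simplified illustration rather than the authors' code: it builds the symmetric normalized Laplacian, takes the $k$ smallest nontrivial eigenvectors, and applies the random sign flip mentioned above. The learned projection $C^0 \lambda_i + c^0$ into the hidden dimension is omitted.

```python
# Illustrative Laplacian positional encodings with random sign flipping.
import numpy as np

def laplacian_pe(A, k, rng=None):
    """k-dimensional Laplacian positional encodings for each node of adjacency A."""
    if rng is None:
        rng = np.random.default_rng()
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(A.shape[0]) - d_inv_sqrt @ A @ d_inv_sqrt   # Delta = I - D^{-1/2} A D^{-1/2}
    eigval, eigvec = np.linalg.eigh(L)                     # eigenvalues in ascending order
    pe = eigvec[:, 1:k + 1]                                # k smallest nontrivial eigenvectors
    signs = rng.choice([-1.0, 1.0], size=k)                # random sign flip per eigenvector
    return pe * signs

# toy usage: 5-node cycle graph, 3-dimensional encoding per node
A = np.roll(np.eye(5), 1, axis=1) + np.roll(np.eye(5), -1, axis=1)
print(laplacian_pe(A, k=3).shape)  # (5, 3)
```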

4. Normalization: BatchNorm in Place of LayerNorm

While layer normalization (LayerNorm) is standard in NLP transformers, empirical results suggest that batch normalization (BatchNorm) is preferable for graph learning tasks:

$\hat{\hat{h}}_{i}^{\ell+1} = \mathrm{Norm}\big(h_{i}^{\ell} + \hat{h}_{i}^{\ell+1}\big)$

Here, “Norm” is implemented via BatchNorm. Batch normalization was found to provide faster convergence and improved generalization performance, consistently outperforming LayerNorm for graph benchmarks, particularly when paired with Laplacian positional encoding.
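As a small illustration (assumed shapes, not the reference code), the residual-plus-BatchNorm update can be written as follows, with the node features of a batch of graphs flattened along the first dimension so that `nn.BatchNorm1d` normalizes each feature channel across nodes:

```python
# Residual connection followed by BatchNorm over node features.
import torch
import torch.nn as nn

dim, num_nodes = 16, 10
norm = nn.BatchNorm1d(dim)            # "Norm" in the update above

h = torch.randn(num_nodes, dim)       # h_i^ell for all nodes in the (batched) graph
h_hat = torch.randn(num_nodes, dim)   # \hat{h}_i^{ell+1}, output of the graph attention
h_next = norm(h + h_hat)              # residual connection, then per-channel normalization
print(h_next.mean(dim=0).abs().max() < 1e-5)  # per-channel mean is ~0 in training mode
```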

5. Edge Feature Integration and Parallel Edge Processing

Many real-world graphs contain edge features (e.g., bond type in molecules, relationship type in knowledge graphs). The graph transformer integrates edge features directly into the attention computation:

$\hat{w}_{ij}^{k,\ell} = \left( \frac{Q^{k,\ell} h_i^{\ell} \cdot K^{k,\ell} h_j^{\ell}}{\sqrt{d_k}} \right) \cdot E^{k,\ell} e_{ij}^{\ell}$

Here, $e_{ij}^{\ell}$ are the edge features at layer $\ell$, projected by $E^{k,\ell}$, so the attention coefficients are modulated by edge attributes. For comprehensive edge feature learning, the architecture maintains a parallel edge feature pipeline, updating edge representations through independent feed-forward and normalization layers.

This dual-pipeline design enables the model to learn node and edge representations jointly, a mechanism critical for tasks such as molecular property prediction and link prediction, where edge semantics are essential.
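A hypothetical single-head PyTorch sketch of this edge-aware attention with a parallel edge pipeline is shown below. It follows the $\hat{w}_{ij}^{k,\ell}$ expression above, but the way scores are collapsed and edge features are updated is a simplification, and the layer and variable names are illustrative rather than the reference code.

```python
# Edge-modulated attention with a parallel edge-feature update (single head).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAwareAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.Q = nn.Linear(dim, dim)
        self.K = nn.Linear(dim, dim)
        self.V = nn.Linear(dim, dim)
        self.E = nn.Linear(dim, dim)                      # projects edge features e_ij
        self.edge_ffn = nn.Sequential(                    # parallel edge-feature pipeline
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h, e, adj):
        # h: (n, dim) node features, e: (n, n, dim) edge features, adj: (n, n) 0/1 mask
        q, k, v = self.Q(h), self.K(h), self.V(h)
        scores = (q @ k.T) / h.shape[-1] ** 0.5           # pairwise dot-product scores, (n, n)
        w_hat = scores.unsqueeze(-1) * self.E(e)          # scores modulated by projected e_ij
        logits = w_hat.sum(-1).masked_fill(adj == 0, float('-inf'))
        w = F.softmax(logits, dim=-1)                     # attention only over existing edges
        h_out = w @ v                                     # node-feature update
        e_out = self.edge_ffn(w_hat)                      # edge-feature update in parallel
        return h_out, e_out

# toy usage on a fully connected 4-node graph with 8-dimensional features
n, dim = 4, 8
layer = EdgeAwareAttention(dim)
h_out, e_out = layer(torch.randn(n, dim), torch.randn(n, n, dim), torch.ones(n, n))
print(h_out.shape, e_out.shape)  # torch.Size([4, 8]) torch.Size([4, 4, 8])
```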

6. Empirical Evaluation and Comparative Results

The architecture was benchmarked on tasks such as molecular property regression (ZINC, containing edge features) and stochastic block model-derived node classification (PATTERN, CLUSTER). The following table summarizes representative performance metrics (lower MAE is better, higher accuracy is better):

| Model | ZINC (MAE) | CLUSTER (Accuracy, %) | PATTERN (Accuracy, %) |
|---|---|---|---|
| GCN | 0.367 | 68.5 | 71.9 |
| GAT | 0.384 | 70.6 | 78.3 |
| GatedGCN | 0.214 | 76.1 | 86.5 |
| Graph Transformer | 0.226 | 73.2 | 84.8 |

The results demonstrate that the graph transformer outperforms or matches leading message-passing models (GCN, GAT) and approaches strong edge-feature-aware architectures (GatedGCN), especially when leveraging graph topology and edge features. Crucially, using the original, sparse graph connectivity was essential for achieving optimal performance, underscoring the importance of including inductive bias for topology.

7. Applications and Broader Implications

The model is adaptable to diverse domains requiring joint modeling of node and edge structures:

  • Molecular property prediction, where atomic and bond attributes must be learned for chemistry and materials science applications.
  • Social and information networks, or any system where relationship semantics and structural dependencies play a central role.
  • Knowledge graph modeling and link prediction, where edge types encode multifaceted relations.

The use of Laplacian eigenvector encodings and explicit edge-modulated attention paves the way for further research in spectral/diffusion-based encodings for graph transformers, more expressive edge-aware architectures, and universal “black box” graph learning models that operate across arbitrary domains.


The following table contrasts the standard NLP transformer with its graph generalization:

| Component | Transformer (NLP) | Graph Transformer (Generalized) |
|---|---|---|
| Attention mechanism | Global (all pairs) | Sparse (neighbors / graph structure only) |
| Positional encoding | Sinusoidal (sequence) | Laplacian eigenvectors (graph topology) |
| Normalization | LayerNorm | BatchNorm |
| Edge feature handling | None | Explicit in attention + parallel update |
| Empirical performance | SOTA on text | SOTA/competitive on graph tasks |

This generalization of the transformer yields an architecture that captures both global and local graph relationships, enabling competitive performance across a range of graph learning problems, while providing a flexible foundation for future research into expressive, scalable, and structure-aware neural models for graphs.

References

1. Dwivedi, V. P., & Bresson, X. (2020). A Generalization of Transformer Networks to Graphs. arXiv:2012.09699.