
A Generalization of Transformer Networks to Graphs (2012.09699v2)

Published 17 Dec 2020 in cs.LG

Abstract: We propose a generalization of transformer neural network architecture for arbitrary graphs. The original transformer was designed for NLP, which operates on fully connected graphs representing all connections between the words in a sequence. Such architecture does not leverage the graph connectivity inductive bias, and can perform poorly when the graph topology is important and has not been encoded into the node features. We introduce a graph transformer with four new properties compared to the standard model. First, the attention mechanism is a function of the neighborhood connectivity for each node in the graph. Second, the positional encoding is represented by the Laplacian eigenvectors, which naturally generalize the sinusoidal positional encodings often used in NLP. Third, the layer normalization is replaced by a batch normalization layer, which provides faster training and better generalization performance. Finally, the architecture is extended to edge feature representation, which can be critical to tasks s.a. chemistry (bond type) or link prediction (entity relationship in knowledge graphs). Numerical experiments on a graph benchmark demonstrate the performance of the proposed graph transformer architecture. This work closes the gap between the original transformer, which was designed for the limited case of line graphs, and graph neural networks, that can work with arbitrary graphs. As our architecture is simple and generic, we believe it can be used as a black box for future applications that wish to consider transformer and graphs.

A Generalization of Transformer Networks to Graphs

The paper by Vijay Prakash Dwivedi and Xavier Bresson presents an adaptation of the Transformer architecture for processing arbitrary graphs, diverging from its predominant use in NLP. This work inspects and extends several key elements of the Transformer to effectively manage graph-structured data.

Key Contributions

The paper proposes a number of significant modifications to adapt the Transformer architecture for graph data:

  1. Attention Mechanism: The attention mechanism is shifted from a globally attentive framework to one that respects the local sparse connectivity of graph structures, thereby integrating neighborhood connectivity as a critical aspect of the attention computation (a minimal layer sketch follows this list).
  2. Positional Encoding: The authors leverage Laplacian eigenvectors for positional encodings, offering a natural generalization of the sinusoidal positional embeddings used in traditional Transformers for sequences (an eigenvector computation sketch also follows the list).
  3. Normalization Layers: Layer normalization typically used in Transformers is substituted with batch normalization, offering improved training stability and generalization performance for graph-based tasks.
  4. Edge Features: The architecture is expanded to incorporate edge features, which is particularly relevant for applications like chemistry and link prediction where edge attributes (e.g., bond types or relationship types) provide essential information.
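
To make the layer-level changes concrete, here is a minimal single-head sketch in PyTorch. It is an illustrative simplification, not the authors' DGL implementation: the class name is hypothetical, multi-head attention and the per-layer edge-feature update from the paper are omitted, and attention is restricted to the edges supplied in edge_index, with scores modulated by projected edge features and BatchNorm applied after the residual connections.

```python
# Illustrative sketch (not the authors' released code) of one graph transformer
# layer: sparse attention over existing edges, edge-feature gating of the
# attention scores, and BatchNorm in place of LayerNorm.
import math
import torch
import torch.nn as nn

class GraphTransformerLayerSketch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.e = nn.Linear(dim, dim)            # projects edge features
        self.out = nn.Linear(dim, dim)
        self.bn_attn = nn.BatchNorm1d(dim)      # BatchNorm replaces LayerNorm
        self.ffn = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(),
                                 nn.Linear(2 * dim, dim))
        self.bn_ffn = nn.BatchNorm1d(dim)

    def forward(self, h, edge_index, e):
        # h: (N, dim) node features, e: (E, dim) edge features,
        # edge_index: (2, E) long tensor of (src, dst) pairs; attention is
        # computed only over these edges, i.e. over each node's neighborhood.
        src, dst = edge_index
        q, k, v = self.q(h), self.k(h), self.v(h)
        # Score each existing edge, modulated by its projected edge features.
        scores = (q[dst] * k[src] * self.e(e)).sum(-1) / math.sqrt(h.size(-1))
        alpha = torch.exp(scores - scores.max())          # numerically stable
        denom = torch.zeros(h.size(0), device=h.device).index_add_(0, dst, alpha)
        alpha = alpha / (denom[dst] + 1e-9)               # softmax per destination node
        msg = alpha.unsqueeze(-1) * v[src]
        agg = torch.zeros_like(h).index_add_(0, dst, msg) # aggregate neighbor messages
        h = self.bn_attn(h + self.out(agg))               # residual + BatchNorm
        h = self.bn_ffn(h + self.ffn(h))                  # feed-forward sub-layer
        return h
```

In the paper's full architecture, the Laplacian positional encodings are added to the input node features before the first layer, attention is multi-headed, and edge representations are updated alongside node representations at every layer.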

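As a companion, here is a small sketch of how the Laplacian eigenvector positional encodings can be computed with NumPy/SciPy: take the eigenvectors of the symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}, drop the trivial one associated with eigenvalue zero, and use the next k as per-node positional features. The function name is illustrative; this mirrors the idea described in the paper rather than reproducing its preprocessing code.

```python
# Illustrative sketch (not the paper's preprocessing code): Laplacian
# eigenvector positional encodings.
import numpy as np
import scipy.sparse as sp

def laplacian_positional_encoding(adj: sp.csr_matrix, k: int) -> np.ndarray:
    """Return the k smallest non-trivial eigenvectors of L = I - D^{-1/2} A D^{-1/2}."""
    n = adj.shape[0]
    deg = np.asarray(adj.sum(axis=1)).ravel().astype(float)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    lap = sp.eye(n) - sp.diags(d_inv_sqrt) @ adj @ sp.diags(d_inv_sqrt)
    # Dense eigendecomposition is acceptable for benchmark-sized graphs.
    _, eigvecs = np.linalg.eigh(lap.toarray())
    # np.linalg.eigh returns eigenvalues in ascending order; skip the first
    # (trivial, eigenvalue ~0) eigenvector and keep the next k.
    return eigvecs[:, 1:k + 1]

# Example: a 4-node path graph with a 2-dimensional encoding per node.
A = sp.csr_matrix(np.array([[0, 1, 0, 0],
                            [1, 0, 1, 0],
                            [0, 1, 0, 1],
                            [0, 0, 1, 0]], dtype=float))
print(laplacian_positional_encoding(A, k=2).shape)  # (4, 2)
```

Because eigenvectors are only defined up to sign, the paper randomly flips their signs during training; the resulting vectors are added to the node features at the input of the network.
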
Implications of Using Graph Transformers

The careful incorporation of graph structure and positional embeddings, along with the other modifications mentioned, addresses several challenges that arise when applying Transformers to arbitrary graphs. The result is a model that effectively leverages the intrinsic properties of graph data:

  • Improved Inductive Bias: By respecting the sparsity inherent in graph data, the model obtains an inductive bias that aids in learning better representations.
  • Enhanced Numerical Performance: Benchmarking on datasets like ZINC, PATTERN, and CLUSTER shows performance competitive with, and in certain cases superior to, established graph neural networks (GNNs) such as GCN and GAT.
  • Versatility: The extension to handle edge features broadens potential applications, making it suitable for domains requiring detailed relational information.

Numerical Results and Analysis

Empirical results demonstrate the efficacy of the proposed model:

  • ZINC: Incorporating edge features, the Graph Transformer achieves a competitive mean absolute error (MAE) of 0.226, showing near-parity with state-of-the-art models like GatedGCN.
  • PATTERN and CLUSTER: The model exhibits strong performance on node classification tasks, significantly outperforming isotropic and anisotropic GNNs, specifically when using BatchNorm and Laplacian positional encodings.

These outcomes underscore the practical applicability and computational benefits of the proposed changes. The Laplacian positional encoding shows marked improvements over the alternative encodings evaluated, confirming its suitability for graph structures.

Future Directions

The implications of this research extend into several promising directions:

  • Scalability: Future work could explore scaling these techniques to larger graphs, optimizing for efficiency in both computation and memory usage.
  • Heterogeneous Graphs: Expanding the framework to handle heterogeneous graphs, which involve inherently more complex structures and varied node/edge types.
  • Dynamic Graphs: Adapting the architecture to handle temporal changes within graphs could significantly benefit fields such as network analysis and dynamic recommender systems.

Conclusion

The adaptation of Transformers to accommodate arbitrary graph structures as outlined in this paper successfully bridges a critical gap between NLP-centric models and graph neural networks. The proposed model leverages graph-specific inductive biases, such as local connectivity and Laplacian positional encodings, to deliver strong numerical performance while maintaining simplicity and generality. As such, the Graph Transformer stands as a robust baseline for future research exploring the intersections of Transformer architectures and graph data processing.

Authors (2)
  1. Vijay Prakash Dwivedi (15 papers)
  2. Xavier Bresson (40 papers)
Citations (650)