Scaling Transformers to very large graphs

Determine how to adapt the Transformer architecture used for graph representation learning to scale effectively to large graphs with potentially millions of nodes while addressing the quadratic time and memory complexity of all-to-all self-attention.

Background

The paper motivates graph transformers as a remedy for limitations of message-passing GNNs in modeling long-range dependencies, but notes that standard full self-attention incurs quadratic complexity in the number of nodes, which prevents application to graphs beyond a few thousand nodes.
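To make the quadratic cost concrete, here is a back-of-the-envelope calculation (not from the paper) of the memory needed just to materialize the N × N attention-score matrix for a single head in float32:

```python
def dense_attention_bytes(num_nodes: int, bytes_per_float: int = 4) -> int:
    # Dense all-to-all self-attention stores an N x N score matrix;
    # memory grows quadratically with the number of nodes.
    return num_nodes * num_nodes * bytes_per_float

for n in (1_000, 100_000, 1_000_000):
    gib = dense_attention_bytes(n) / 2**30
    print(f"{n:>9} nodes -> {gib:,.1f} GiB per head per layer")
```

At a few thousand nodes the score matrix still fits on a GPU, but at a million nodes it would require terabytes per head per layer, which is why full self-attention cannot be applied directly.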

Although prior work proposes linearized attention, engineered attention patterns, or subsampling, these approaches can degrade performance or restrict expressive power. The authors introduce k-MIP attention as one approach to improve scalability, but explicitly acknowledge that the broader challenge of adapting Transformers to truly large graphs remains open.
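As an illustration of what "linearized attention" means here, the following NumPy sketch contrasts dense attention with a kernel-feature-map linearization (in the style of Katharopoulos et al.). This is a generic example, not the paper's k-MIP attention; the feature map `phi` is an assumption chosen for simplicity:

```python
import numpy as np

def full_attention(Q, K, V):
    # Dense attention: materializes an N x N score matrix (quadratic in N).
    S = np.exp(Q @ K.T)
    return (S / S.sum(axis=1, keepdims=True)) @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Linearized attention: replace softmax with a positive feature map phi,
    # so the (d x d) summary K'^T V can be formed first. Cost is O(N d^2)
    # time and O(N) memory in the number of nodes N, never O(N^2).
    Qp, Kp = phi(Q), phi(K)                    # shapes (N, d)
    KV = Kp.T @ V                              # (d, d_v) summary, independent of N
    Z = Qp @ Kp.sum(axis=0, keepdims=True).T   # per-row normalizer, shape (N, 1)
    return (Qp @ KV) / Z
```

The linearized variant computes a different (approximate) attention distribution than softmax, which is one source of the performance degradation the paragraph above refers to.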

References

From the paper: "However, this comes at the cost of quadratic complexity, and it remains an open question how to adapt the Transformer architecture to scale effectively to large graphs with potentially millions of nodes."