Scaling Transformers to very large graphs
Determine how to adapt the Transformer architecture used for graph representation learning to scale effectively to large graphs with potentially millions of nodes while addressing the quadratic time and memory complexity of all-to-all self-attention.
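To make the complexity issue concrete, the sketch below (not from the referenced paper; all names are illustrative) shows one common mitigation: restricting attention to each node's graph neighborhood, so cost scales with the number of edges |E| rather than quadratically in the number of nodes N. Full all-to-all self-attention would instead materialize an N x N score matrix, which is infeasible at millions of nodes.

```python
import torch

def neighborhood_attention(x, edge_index, w_q, w_k, w_v):
    """Single-head attention restricted to graph edges: O(|E|) instead of O(N^2).

    Illustrative sketch only. Assumptions:
      x:          [N, d] node features
      edge_index: [2, E] (source, target) pairs, assumed to include self-loops
      w_q, w_k, w_v: [d, d] projection matrices
    """
    src, dst = edge_index
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.size(-1)

    # One unnormalized attention score per edge (dst attends to src).
    scores = (q[dst] * k[src]).sum(-1) / d ** 0.5

    # Numerically stable softmax over each target node's incoming edges.
    score_max = torch.full((x.size(0),), float("-inf")).index_reduce_(
        0, dst, scores, reduce="amax", include_self=False
    )
    exp = torch.exp(scores - score_max[dst])
    denom = torch.zeros(x.size(0)).index_add_(0, dst, exp)
    alpha = exp / denom[dst].clamp_min(1e-12)

    # Weighted aggregation of neighbor values into each target node.
    out = torch.zeros_like(v).index_add_(0, dst, alpha.unsqueeze(-1) * v[src])
    return out
```

The trade-off this sketch illustrates is exactly the open question above: sparsifying attention to neighborhoods recovers scalability but gives up the all-to-all interactions that motivate graph Transformers in the first place.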
References
However, this comes at the cost of quadratic complexity, and it remains an open question how to adapt the Transformer architecture to scale effectively to large graphs with potentially millions of nodes.
— k-Maximum Inner Product Attention for Graph Transformers and the Expressive Power of GraphGPS
(2604.03815 - Schouwer et al., 4 Apr 2026) in Section 1 (Introduction)