
Generalized PageRank Attention (GPA)

Updated 6 May 2026
  • GPA is a graph representation mechanism that extends Transformer self-attention using a multi-hop, learnable message-passing paradigm inspired by Generalized PageRank.
  • It employs polynomial filtering with adjustable propagation coefficients to balance local and global signal preservation, countering over-smoothing in deep graph models.
  • GPA enables scalable and efficient node and graph-level tasks across diverse datasets, as demonstrated in the ParaFormer framework with significant performance improvements.

Generalized PageRank Attention (GPA) is a graph representation learning mechanism that extends Transformer self-attention by integrating a multi-hop message-passing paradigm inspired by Generalized PageRank (GPR). In this framework, aggregation over a soft, learned attention graph is achieved through polynomial filtering, where the weights governing the propagation across multiple hops are learnable parameters. This approach is specifically designed to mitigate the over-smoothing phenomenon found in deep graph neural networks (GNNs) and global-attention graph Transformers, allowing both low- and high-frequency signals to be preserved in node embeddings. The principal instantiation of this mechanism appears in ParaFormer, a Graph Transformer architecture employing GPA for scalable and efficient node and graph-level tasks across homophilic and heterophilic domains (Yuan et al., 16 Dec 2025).

1. Mathematical Definition and Formulation

GPA generalizes standard self-attention by replacing the single-hop aggregation with a GPR-style multi-hop filter over the attention matrix. Given a node feature matrix $H \in \mathbb{R}^{n \times d}$, the query/key/value transformations

$$Q = HW_Q,\quad K = HW_K,\quad V = HW_V$$

yield the soft attention matrix

$$\hat{A} = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d}}\right) \in [0,1]^{n \times n}.$$

Instead of aggregating with $\hat{A}V$ alone, GPA computes

$$Z = \sum_{k=0}^{K} \gamma_k \hat{A}^k V,$$

where $\gamma_k \in \mathbb{R}$ are learnable propagation coefficients and $\hat{A}^k$ denotes the $k$-th power of the attention matrix. The summation limit $K$ is the maximum hop count and is distinct from the key matrix $K$ inside the softmax. This operation is interpretable as a polynomial graph filter, with the profile of frequency filtering controlled by $\{\gamma_k\}_{k=0}^{K}$.
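The definition above can be sketched in a few lines of NumPy. This is a minimal dense illustration, not the ParaFormer implementation: the function names, the square $d \times d$ projection shapes, the random inputs, and the coefficient values are all assumptions chosen for the example, and the coefficients are fixed here rather than learned.

```python
import numpy as np


def softmax_rows(x):
    """Numerically stable row-wise softmax."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)


def gpa(H, W_q, W_k, W_v, gamma):
    """Dense GPA: Z = sum_k gamma[k] * A_hat^k @ V, for k = 0..K."""
    Q, K_mat, V = H @ W_q, H @ W_k, H @ W_v
    d = Q.shape[1]
    A_hat = softmax_rows(Q @ K_mat.T / np.sqrt(d))
    Z = np.zeros_like(V)
    P = V.copy()  # holds A_hat^k @ V, starting at k = 0
    for g in gamma:
        Z += g * P
        P = A_hat @ P  # advance one hop without forming A_hat^k explicitly
    return Z, A_hat


# Hypothetical toy inputs (assumed shapes and values, for illustration only).
rng = np.random.default_rng(0)
n, d = 5, 4
H = rng.normal(size=(n, d))
W_q, W_k, W_v = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
gamma = np.array([0.5, 0.3, 0.2])  # K = 2
Z, A_hat = gpa(H, W_q, W_k, W_v, gamma)
```

Setting `gamma = [0, 1]` recovers ordinary single-hop self-attention, and `gamma = [1]` returns the value matrix unchanged, which makes the role of each coefficient easy to inspect.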

The theoretical basis derives from the GPR framework, which generalizes Personalized PageRank propagation schemes: $h = \sum_{k=0}^{\infty} \gamma_k \tilde{A}^k h^{(0)}$, where $\tilde{A}$ is a normalized adjacency or transition matrix. GPA applies this propagation to the soft, data-dependent attention graph $\hat{A}$ instead of the static input adjacency.
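As a concrete special case of the GPR series, Personalized PageRank corresponds to the fixed coefficients $\gamma_k = \alpha(1-\alpha)^k$, for which the series has the closed form $\alpha\,(I - (1-\alpha)\tilde{A})^{-1} h^{(0)}$. The sketch below checks this numerically on a small path graph; the graph, the value of `alpha`, and the truncation length are assumptions made for the example.

```python
import numpy as np

# Small undirected path graph 0-1-2-3 with self-loops (illustrative only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float) + np.eye(4)
deg = A.sum(axis=1)
A_tilde = A / np.sqrt(np.outer(deg, deg))  # D^{-1/2} (A + I) D^{-1/2}

alpha, K = 0.2, 200                        # assumed teleport prob. and truncation
gamma = alpha * (1 - alpha) ** np.arange(K + 1)  # PPR coefficients

h0 = np.array([1.0, 0.0, 0.0, 0.0])
h_series = np.zeros_like(h0)
p = h0.copy()                              # holds A_tilde^k @ h0
for g in gamma:
    h_series += g * p
    p = A_tilde @ p

# Closed form: alpha * (I - (1 - alpha) * A_tilde)^{-1} h0
h_closed = alpha * np.linalg.solve(np.eye(4) - (1 - alpha) * A_tilde, h0)
```

GPR replaces these fixed coefficients with free parameters, which is what allows the filter shape to depart from the purely low-pass Personalized PageRank profile.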

2. Spectral Properties and Over-Smoothing Analysis

GPA acts as an adaptive-pass filter in the graph spectral domain. In contrast to the low-pass behavior of classical GNNs and vanilla self-attention, where stacking many layers or repeated propagation causes node representations to become nearly identical, GPA can preserve high-frequency components when the coefficients $\gamma_k$ are configured appropriately. The smoothing rate, defined as the minimum ratio of high-frequency component (HC) energy across layers, is strictly lower for GPA (under a suitable parameter choice) than for vanilla self-attention. As $k$ increases, $\hat{A}^k$ converges to a rank-one, over-smoothed output. The gradient of the classification loss with respect to $\gamma_k$ at high $k$ is nonzero, causing such terms to be down-weighted during training.
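Two of the claims above can be illustrated numerically. The first part of the sketch evaluates the polynomial response $g(\lambda) = \sum_k \gamma_k \lambda^k$ for two hand-picked coefficient profiles, under the simplifying assumption of a symmetric propagation operator with real eigenvalues in $[-1, 1]$ (a softmax attention matrix is generally not symmetric, so this is only an analogy). The second part shows powers of a small positive row-stochastic matrix collapsing to identical rows. All numeric values are assumptions for illustration.

```python
import numpy as np


def response(gamma, lam):
    """Polynomial filter response g(lam) = sum_k gamma[k] * lam**k."""
    return sum(g * lam ** k for k, g in enumerate(gamma))


K = 10
low_pass = 0.5 ** np.arange(K + 1)      # positive, decaying coefficients
high_pass = (-0.5) ** np.arange(K + 1)  # sign-alternating coefficients

# lam = 1 is the smoothest mode, lam = -1 the most oscillatory one.
for name, gamma in [("low-pass", low_pass), ("high-pass", high_pass)]:
    print(name, response(gamma, 1.0), response(gamma, -1.0))

# Over-smoothing: powers of a positive row-stochastic matrix approach a
# matrix whose rows are all identical, i.e. a rank-one limit.
A_hat = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.2, 0.5]])
A_pow = np.linalg.matrix_power(A_hat, 30)
row_spread = np.abs(A_pow - A_pow.mean(axis=0, keepdims=True)).max()
```

The positive decaying profile emphasizes the smooth mode, while the sign-alternating profile emphasizes the oscillatory one; learnable coefficients let training move between such profiles instead of committing to one in advance.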

3. Scalable Implementation and Complexity

Direct computation of the multi-hop aggregation scales at least quadratically in the number of nodes $n$ for dense graphs. ParaFormer introduces an efficient linear-attention approximation that factorizes the attention matrix, so the powers of $\hat{A}$ never need to be formed explicitly.

This decomposition allows precomputation of key projections, reducing the cost to linear in $n$. An auxiliary two-layer GNN module, whose cost scales with the number of edges, is optionally fused to further enhance local modeling capacity without substantially increasing total complexity.
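The general idea of evaluating multi-hop propagation through a factorized attention matrix can be sketched as follows. This is a generic kernelized linear-attention construction, not ParaFormer's S-GPA formula: the feature map `phi` (elu + 1), the row normalization, and all names are assumptions made for the example. With $\hat{A}_{\text{lin}} = D^{-1}\phi(Q)\phi(K)^T$, each hop multiplies by the two thin factors in turn, so no $n \times n$ matrix is ever built.

```python
import numpy as np


def phi(x):
    """Positive feature map elu(x) + 1 (an assumed choice)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_gpa(Q, K_mat, V, gamma):
    """Z = sum_k gamma[k] * A_lin^k @ V with A_lin = D^{-1} phi(Q) phi(K)^T,
    evaluated without forming the n x n matrix A_lin."""
    fq, fk = phi(Q), phi(K_mat)          # (n, m) feature maps
    norm = fq @ fk.sum(axis=0)           # row sums of phi(Q) phi(K)^T
    Z = np.zeros_like(V)
    P = V.copy()                         # holds A_lin^k @ V
    for g in gamma:
        Z += g * P
        # One hop: project onto keys first (m x d), then expand by queries.
        P = (fq @ (fk.T @ P)) / norm[:, None]
    return Z


# Hypothetical toy inputs (assumed shapes and values).
rng = np.random.default_rng(0)
n, d = 6, 3
Q, K_mat, V = (rng.normal(size=(n, d)) for _ in range(3))
gamma = np.array([0.4, 0.3, 0.2, 0.1])  # K = 3
Z = linear_gpa(Q, K_mat, V, gamma)
```

Each hop costs a number of operations proportional to $n$ times the product of the feature and value dimensions, so the total cost grows linearly with $n$ and with the number of hops.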

4. Empirical Evaluation and Results

ParaFormer and its GPA module have been empirically validated across diverse benchmarks:

  • Node classification: On datasets such as Cora, CiteSeer, PubMed (homophily), and Film, Squirrel, Chameleon, Deezer (heterophily), ParaFormer with GPA demonstrates an average relative improvement of 3.4% over strong GNNs (APPNP, GPRGNN, SIGN) and 1.9% over Graph Transformer baselines (SGFormer, NodeFormer, Polynormer). Gains are more pronounced on heterophilic datasets.
  • Graph classification: On KNN-constructed graphs for STL-10 (images) and 20News (text), GPA achieves the highest accuracy compared to conventional and GNN-based methods, especially in low-label regimes (+1.1% at 100 labels).
  • Scalability: ParaFormer matches or outperforms state-of-the-art scalable baselines on ogbn-arxiv, Amazon2M, pokec, and arXiv-year (>1M nodes). Linear scaling and the scalable attention approximation (S-GPA) enable training on graphs with tens of millions of edges within standard hardware constraints.

Ablation studies confirm that learnable $\gamma_k$ parameters are crucial; fixing $\gamma_k$ reduces performance by 0.3–0.5%. The scalable attention approximation (S-GPA) is essential for large-scale efficiency, and GPA provides more robust smoothing than vanilla self-attention.

5. Hyperparameters, Model Variants, and Integration

Experimental settings fix the maximum hop $K$ and search over fusion weights, learning rates, hidden dimensions, and dropout rates. For ParaFormer_GPRGNN, both soft-attention and adjacency-based GPR contributions are summed. Parameter sharing between attention and adjacency paths leverages both soft global and hard local propagation patterns.
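One possible reading of the summed variant is sketched below: the same coefficient vector drives a GPR expansion over the attention matrix and over a normalized adjacency, and the two results are added. Which quantities are actually shared in ParaFormer_GPRGNN, and whether fusion weights enter the sum, is not specified here, so the sharing of `gamma`, the unweighted sum, and the function names are all assumptions.

```python
import numpy as np


def gpr_propagate(op, V, gamma):
    """sum_k gamma[k] * op^k @ V for a dense (n, n) operator `op`."""
    Z, P = np.zeros_like(V), V.copy()
    for g in gamma:
        Z += g * P
        P = op @ P
    return Z


def fused_gpr(A_hat, A_tilde, V, gamma):
    """Attention-based plus adjacency-based GPR terms, sharing gamma
    (assumed form, not confirmed by the source)."""
    return gpr_propagate(A_hat, V, gamma) + gpr_propagate(A_tilde, V, gamma)


# Toy operators: uniform attention and a normalized 4-node ring with self-loops.
n = 4
A_hat = np.full((n, n), 1.0 / n)
ring = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1) + np.eye(n)
deg = ring.sum(axis=1)
A_tilde = ring / np.sqrt(np.outer(deg, deg))

V = np.eye(n)[:, :2]
gamma = np.array([0.5, 0.5])  # illustrative coefficients, K = 1
Z = fused_gpr(A_hat, A_tilde, V, gamma)
```

The attention path mixes information across all nodes, while the adjacency path only reaches graph neighbors within $K$ hops, which matches the soft global versus hard local distinction drawn in the text.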

6. Interpretation and Significance

GPA unifies Transformer-based global attention with the spectral flexibility of graph polynomial filtering. Adaptive propagation coefficients allow the model to tune the mix of locality and globality in each layer, efficiently counteracting over-smoothing while supporting expressive, multi-scale representation learning. The linear-attention formulation and lightweight fusion module make GPA practical for production on large-scale graphs, without manual tuning of propagation depth or fixed filter coefficients.

7. Open Resources and Implementation

Full pseudocode, hyperparameter settings, and implementation details for GPA and ParaFormer, including all ablations and scalability tests, are publicly available at https://github.com/chaohaoyuan/ParaFormer. The formulation, proofs, spectral analysis, and experimental protocol are detailed in the ParaFormer manuscript (Yuan et al., 16 Dec 2025).

