PageRank Transformer (ParaFormer)
- The paper presents the PageRank Transformer (ParaFormer), which integrates PageRank-inspired filtering into Transformers to control information propagation and mitigate over-smoothing.
- It employs a Generalized PageRank Attention block and PPR tokenization schemes to enhance efficiency, scalability, and expressivity in graph and long-form text tasks.
- Experimental results on diverse benchmarks demonstrate improved accuracy, efficiency gains, and robustness over conventional global attention mechanisms.
The PageRank Transformer, also known as ParaFormer, refers to a family of models that integrate PageRank or PageRank-inspired mechanisms into the Transformer architecture to address the specific challenges of graph and long-sequence representation learning. Architecturally, these models seek to improve efficiency, expressivity, and robustness—particularly in large graphs or lengthy texts—by using PageRank to select, weight, or propagate information among nodes or tokens. The PageRank Transformer paradigm encompasses a spectrum of implementations, including the Generalized PageRank Attention of ParaFormer for graph learning (Yuan et al., 16 Dec 2025), the PPR tokenization and attention schemes of VCR-Graphormer (Fu et al., 24 Mar 2024), and the match-ignition noise-filtering in long-form text matching (Pang et al., 2021).
1. Motivation and Theoretical Foundations
The primary motivation for incorporating PageRank into Transformers over graphs is to control information propagation in a principled manner and mitigate over-smoothing. Over-smoothing is the phenomenon wherein node representations in deep GNNs and vanilla Graph Transformers become indistinguishable as depth increases; this is due to the inherent low-pass filtering nature of global attention, which rapidly damps high-frequency (heterophilous) signals. Empirical and theoretical evidence demonstrates that standard self-attention exacerbates over-smoothing relative to sparse graph attention mechanisms, due to its all-to-all connectivity (Yuan et al., 16 Dec 2025).
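To make the low-pass claim concrete, the toy NumPy sketch below (an illustration, not code from the cited papers) applies a dense row-stochastic matrix, standing in for a global attention map, to random node features and shows the feature spread collapsing with depth:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 8 "nodes" with 4-dimensional features and a dense,
# row-stochastic matrix standing in for a global attention map.
X = rng.normal(size=(8, 4))
A = rng.random((8, 8))
A = A / A.sum(axis=1, keepdims=True)  # row-normalize: each row sums to 1

# Repeated propagation X <- A X mimics stacking attention layers.
# The spread of the rows (distance to their mean) shrinks geometrically,
# which is the over-smoothing / rank-collapse effect described above.
for layer in range(1, 11):
    X = A @ X
    spread = np.linalg.norm(X - X.mean(axis=0, keepdims=True))
    print(f"layer {layer:2d}: feature spread = {spread:.4f}")
```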
PageRank, classically formulated as

$$\pi \;=\; \alpha\,\mathbf{p} + (1-\alpha)\,\tilde{A}\,\pi,$$

with $\tilde{A}$ the column-normalized adjacency matrix, $\mathbf{p}$ the (personalized) teleport distribution, and $\alpha$ the teleport probability, yields a stationary distribution reflecting long-range, multi-hop node influence with exponentially decaying weights. Its expansion via a Neumann series, $\pi = \alpha \sum_{k \ge 0} (1-\alpha)^k \tilde{A}^k \mathbf{p}$, links it to polynomial graph filters, a direct connection to the spectral interpretation of GNNs and Transformers.
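As a minimal illustration of this recursion and its Neumann-series form, the NumPy sketch below computes personalized PageRank both by power iteration and as a truncated polynomial filter; the teleport probability and the toy graph are arbitrary choices, not values from the cited papers.

```python
import numpy as np

def personalized_pagerank(A, p, alpha=0.15, iters=100):
    """Power iteration for r = alpha * p + (1 - alpha) * A_tilde @ r."""
    A_tilde = A / A.sum(axis=0, keepdims=True)  # column-normalized adjacency
    r = p.copy()
    for _ in range(iters):
        r = alpha * p + (1 - alpha) * A_tilde @ r
    return r

def ppr_neumann(A, p, alpha=0.15, K=50):
    """Truncated Neumann series: r ~ alpha * sum_k (1-alpha)^k A_tilde^k p,
    i.e. a fixed polynomial filter with exponentially decaying weights."""
    A_tilde = A / A.sum(axis=0, keepdims=True)
    term, r = p.copy(), np.zeros_like(p)
    for k in range(K):
        r += alpha * (1 - alpha) ** k * term
        term = A_tilde @ term
    return r

# Tiny undirected example graph (illustrative only).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
p = np.array([1.0, 0.0, 0.0, 0.0])   # teleport to node 0
print(personalized_pagerank(A, p))
print(ppr_neumann(A, p))              # agrees up to truncation error
```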
2. PageRank-Enhanced Attention in ParaFormer
ParaFormer (“PageRank Transformer”) replaces or augments the standard global attention module in a Transformer with a Generalized PageRank Attention (GPA) block (Yuan et al., 16 Dec 2025). The strategy is as follows:
- Compute multi-head attention as usual to obtain the dense attention matrix $A$.
- Instead of applying a single-step propagation $AV$, propagate features through a generalized PageRank-inspired filter (see the sketch after this list):

$$Z \;=\; \sum_{k=0}^{K} \gamma_k\, A^{k} V,$$

where the $\gamma_k$ are learnable coefficients and $V$ is the value matrix.
- This polynomial operator can be an adaptive-pass filter, simultaneously preserving low- and high-frequency (heterophilous) signals, unlike the purely low-pass vanilla self-attention. Theorem 1 in (Yuan et al., 16 Dec 2025) specifies coefficient regimes under which this adaptivity occurs.
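A condensed PyTorch-style sketch of the GPA propagation step is given below. The polynomial order, the coefficient initialization, and the module layout are assumptions for illustration; only the formula $Z = \sum_k \gamma_k A^k V$ reflects the description above.

```python
import torch
import torch.nn as nn

class GeneralizedPageRankAttention(nn.Module):
    """Sketch of a GPA-style block: propagate the value matrix through a
    learnable polynomial in the attention matrix, Z = sum_k gamma_k A^k V."""

    def __init__(self, dim, order=4):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # One learnable coefficient per power of A (initialization is a guess;
        # PPR-style decaying weights are one natural choice).
        self.gamma = nn.Parameter(0.5 ** torch.arange(order + 1, dtype=torch.float))

    def forward(self, x):                        # x: (n, dim) node features
        A = torch.softmax(self.q(x) @ self.k(x).T / x.shape[-1] ** 0.5, dim=-1)
        V = self.v(x)
        out = self.gamma[0] * V                  # k = 0 term
        prop = V
        for k in range(1, self.gamma.numel()):
            prop = A @ prop                      # A^k V by repeated propagation
            out = out + self.gamma[k] * prop
        return out
```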
With appropriately initialized coefficients, GPA’s smoothing rate is strictly smaller than that of conventional attention, thus resisting over-smoothing (Theorem 2 in (Yuan et al., 16 Dec 2025)). Gradient dynamics further ensure that, in deep models, coefficients associated with excessively smoothed components decay, automatically safeguarding against rank collapse.
For scalability, ParaFormer implements a linearized power-iteration scheme, using a kernel trick to avoid the cubic cost in $n$ (the number of nodes) of explicit attention-matrix powers, yielding $O(n)$ memory and runtime complexity.
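One way such a linearization can look is sketched below: a positive kernel feature map (the common elu-based choice is assumed here, not taken from the paper) replaces the softmax, so each power of the implicit attention matrix is applied to $V$ through $(n \times d)$-sized intermediates only.

```python
import torch

def phi(x):
    # Positive kernel feature map (a common linear-attention choice).
    return torch.nn.functional.elu(x) + 1.0

def linearized_power_propagation(Q, K, V, gamma):
    """Compute sum_k gamma_k * A^k V with A = row-normalized phi(Q) phi(K)^T,
    using only (n x d)-sized intermediates (never the n x n attention matrix)."""
    Qf, Kf = phi(Q), phi(K)                          # (n, d) feature maps
    denom = Qf @ Kf.sum(dim=0, keepdim=True).T       # (n, 1) row normalizer
    out = gamma[0] * V
    prop = V
    for g in gamma[1:]:
        # A @ prop = D^{-1} Qf (Kf^T prop), with D = diag(Qf (Kf^T 1)).
        prop = (Qf @ (Kf.T @ prop)) / denom
        out = out + g * prop
    return out

# Illustrative shapes only.
n, d = 1024, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
Z = linearized_power_propagation(Q, K, V, gamma=[0.5, 0.25, 0.125, 0.0625])
```

Each power costs $O(n d^2)$ time and $O(n d)$ memory, which is the linear-in-$n$ regime referred to above.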
3. PPR Tokenization and Mini-batch Attention (VCR-Graphormer)
VCR-Graphormer introduces a distinct PageRank Transformer paradigm suitable for large-scale, mini-batch training on graphs (Fu et al., 24 Mar 2024). The key mechanism is PPR tokenization:
- For each node $i$, compute its personalized PageRank vector $r_i$.
- Select the top-$k$ entries of $r_i$, $\mathcal{R}_i^k = \{j : r_i(j) \text{ is top-}k\}$.
- Compose a token list for node $i$: $T_i = \{(x_j, p_{ij}) \mid j \in \mathcal{R}_i^k\}$, with $x_j$ the feature of node $j$ and $p_{ij} = r_i(j)$ its PPR score.
- Offline precompute and save these lists for every node, so that at training time, mini-batches of token lists suffice—decoupling topology from message passing.
Within each token list $T_i$, standard multi-head self-attention operates only on the tokens associated with node $i$, with each token embedding derived from the feature $x_j$ and its PPR weight $p_{ij}$. This approach allows mini-batch training, greatly reducing computational expense from $O(n^2)$ attention over the whole graph to attention over length-$k$ token lists, i.e. $O(k^2)$ per node.
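A minimal NumPy sketch of this offline tokenization is given below, using the same illustrative teleport probability as earlier; the returned data structure (a list of (feature, score) pairs per node) follows the description above rather than the authors’ exact implementation.

```python
import numpy as np

def ppr_token_lists(A, X, k=8, alpha=0.15, iters=50):
    """Offline PPR tokenization: for every node i, keep the top-k PPR
    neighbors j with their scores p_ij, paired with the features x_j."""
    n = A.shape[0]
    A_tilde = A / A.sum(axis=0, keepdims=True)     # column-normalized adjacency
    token_lists = []
    for i in range(n):
        p = np.zeros(n); p[i] = 1.0                # teleport vector e_i
        r = p.copy()
        for _ in range(iters):                     # power iteration for r_i
            r = alpha * p + (1 - alpha) * A_tilde @ r
        top = np.argsort(-r)[:k]                   # indices of top-k PPR mass
        token_lists.append([(X[j], r[j]) for j in top])
    return token_lists  # precomputed once; mini-batches index into this list
```

At training time a mini-batch simply gathers the stored $(x_j, p_{ij})$ pairs for its nodes and runs self-attention over each length-$k$ list, with no further access to the graph topology.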
The PPR tokenization admits a polynomial filter interpretation: inputting successive random walk aggregates with corresponding weights is algebraically equivalent to a fixed-polynomial-filter GCN with jumping knowledge (multiscale aggregation), as formalized in Theorem 3.1 of (Fu et al., 24 Mar 2024).
4. Virtual Connections and Expressivity Enhancement
Pure PPR token lists may be insufficient to encode long-range or heterophilous signals. VCR-Graphormer overcomes this by introducing Virtual Connection Ranking (VCR) through two forms of super-nodes:
- Structure-aware super-nodes: Partition the original graph into clusters, introduce a super-node for each, and connect original nodes to their cluster super-node; then recompute PPR in the augmented graph. Top-$k$ entries form structure-aware tokens.
- Content-aware super-nodes: For each class or pseudo-class label, add a super-node connected to all nodes of that label; PPR in this augmented topology yields content-aware tokens.
Each node’s final token list thus includes the center token, random walk aggregates, structure super-node tokens, and content super-node tokens, giving a fixed total token-list length per node (Fu et al., 24 Mar 2024). The PPR scores rank these virtual neighbors by relevance, expanding the receptive field to encompass local, global, and heterophilous contexts.
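A sketch of the super-node augmentation is given below; the `labels` array (cluster assignments for structure-aware super-nodes, or (pseudo-)class labels for content-aware ones) is a hypothetical input, and the PPR tokenization from the previous section would then be rerun on the returned adjacency.

```python
import numpy as np

def add_super_nodes(A, labels):
    """Append one super-node per distinct label and connect each original
    node to the super-node of its group (cluster or class).
    Returns the augmented adjacency matrix."""
    n = A.shape[0]
    groups = np.unique(labels)                     # labels: length-n array
    m = len(groups)
    A_aug = np.zeros((n + m, n + m))
    A_aug[:n, :n] = A
    for s, g in enumerate(groups):
        members = np.where(labels == g)[0]
        A_aug[members, n + s] = 1.0                # node -> super-node edges
        A_aug[n + s, members] = 1.0                # super-node -> node edges
    return A_aug

# Structure-aware tokens: labels come from a graph partitioner; rerun the
# PPR tokenization on A_aug and keep the top entries. Content-aware tokens
# use (pseudo-)class labels in the same way.
```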
5. Applications: Graph Representation and Long-form Text Matching
PageRank Transformer architectures have demonstrated utility across several modalities:
- Graph representation and node classification: ParaFormer achieves consistent gains on canonical benchmarks (Cora, Citeseer, PubMed, Film, Chameleon, Squirrel, Deezer) and large graphs (Amazon2M, Pokec, ogbn-arxiv) (Yuan et al., 16 Dec 2025). Ablations confirm the necessity of learnable PageRank coefficients and the superiority of the GPA block over baselines.
- Inductive graph classification: KNN-constructed text/image graph datasets (20News, STL-10) further demonstrate ParaFormer’s performance edge, with gains up to 2% over specialized GCN/GAT/Transformer architectures.
- Long-form text matching: Match-Ignition integrates PageRank at both sentence and token levels: pre-Transformer, a PageRank sentence-similarity graph filters salient sentences; intra-Transformer, PageRank is repeatedly run on self-attention graphs to rank and prune low-importance tokens, reducing computational load and improving accuracy (Pang et al., 2021). PageRank-based word and sentence filters outperform attention-weight and embedding-norm criteria. Experiments show an efficiency gain (≈23% speed-up at 10% layerwise pruning) and absolute accuracy improvements of several points over BERT and other strong matchers (a minimal sketch of the token-level filter follows below).
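A sketch of the Match-Ignition-style token-level filter: a layer’s self-attention weights are treated as a weighted graph, PageRank scores the tokens, and only the top-ranked fraction is kept. Averaging over heads and the keep-ratio parameter are assumptions for illustration, not the authors’ exact procedure.

```python
import numpy as np

def pagerank_token_pruning(attn, keep_ratio=0.9, alpha=0.15, iters=50):
    """attn: (heads, n, n) self-attention weights for one layer.
    Returns indices of tokens to keep, ranked by PageRank importance."""
    P = attn.mean(axis=0)                       # average heads -> (n, n) graph
    P = P / P.sum(axis=0, keepdims=True)        # column-stochastic transition
    n = P.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(iters):                      # standard PageRank iteration
        r = alpha / n + (1 - alpha) * P @ r
    n_keep = max(1, int(round(keep_ratio * n)))
    return np.sort(np.argsort(-r)[:n_keep])     # keep top tokens, in sequence order
```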
6. Computational Complexity and Scalability
A core advantage of PageRank Transformer regimes is improved scalability relative to dense global attention:
| Model | Complexity per forward pass | Memory/Batch |
|---|---|---|
| Dense Global Attention | $O(n^2)$ | Full $n \times n$ attention map |
| ParaFormer w/ GPA | $O(n)$ (linearized) | $O(n)$ |
| VCR-Graphormer | $O(b\,k^2)$ at train time (mini-batch size $b$, token lists of length $k$) | Per mini-batch, independent of $n$ |
| Match-Ignition (text) | Varies with pruning ratio | Reduced in proportion to pruned tokens |
Offline tokenization (PPR computation) is near-linear in the number of edges for sparse graphs, far lower than quadratic in $n$ for large graphs. Mini-batch regimes and attention-reduction techniques further widen the gap for massive data.
7. Limitations and Future Research
While the PageRank Transformer family alleviates over-smoothing, captures both heterophilous and global structure, and achieves linear scalability, the approach is not without caveats:
- Hyperparameter sensitivity: performance depends on the number of PPR hops, the choice and weighting of super-nodes, and the fusion ratio in ParaFormer’s GNN block.
- Linear attention approximations, while accurate in practice, could be further improved for greater faithfulness and efficiency.
- Applicability across domains: future investigations aim to systematically benchmark over-smoothing and inductive capacity across multiple Transformer recipes, with special interest in graphs requiring complex multi-hop reasoning (e.g., protein interaction networks).
- Additional research directions include learning PageRank coefficients per head/layer and exploring kernelized approximations for repeated attention-matrix powers (Yuan et al., 16 Dec 2025).
A plausible implication is that more generalized forms of adaptive-pass attention, parameter-efficient propagation, and hybrid spectral-spatial architectures will continue to build on the mathematical framework established by PageRank Transformer models.