
GLFormer: Efficient Dynamic Graph Modeling

Updated 23 November 2025
  • GLFormer is an attention-free Transformer-style architecture designed for dynamic graph temporal link prediction using adaptive local token mixing and hierarchical aggregation.
  • It introduces a learnable local token mixer with positional and temporal components to efficiently fuse recent neighbor information, reducing computational cost.
  • Empirical evaluations show improved average precision and AUC with 3–10x speedups over traditional self-attention methods on benchmark datasets.

GLFormer is an attention-free, Transformer-style architecture designed for efficient dynamic graph modeling, particularly temporal link prediction. It advances over traditional self-attention-based dynamic graph Transformers by introducing adaptive local token mixing and hierarchical aggregation for scalable modeling of temporally evolving relationships in large or high-frequency graphs. GLFormer’s architecture challenges the presumption that global, full self-attention is essential for state-of-the-art predictive performance in dynamic graphs, instead leveraging context-aware local aggregation mechanisms that fuse information from temporally ordered interactions with high computational efficiency and robustness to noise (Zou et al., 16 Nov 2025).

1. Motivation and Architectural Rationale

Transformer-style models for dynamic graphs such as DyGFormer and TGAT rely on self-attention to capture long-term temporal dependencies. However, self-attention incurs $\mathcal{O}(N^2)$ computational and memory complexity per sequence of length $N$ and may indiscriminately aggregate noisy or irrelevant distant events. Recent analyses in the "MetaFormer" paradigm indicate that the expressive power of Transformers can largely be attributed to their macro-architectural traits (residual connections, layer normalization, and feed-forward subnetworks) rather than strictly to the attention operator.

Empirical studies within dynamic graph contexts show that local mixing strategies, such as pooling or MLP-based token mixers, can match or outperform full attention with substantially lower computational cost. GLFormer therefore adopts attention-free local mixers within a classic Transformer skeleton, combining efficiency with architectural expressivity. The core is a stack of layers, each comprising (i) a learnable local token-mixing sub-block and (ii) a channel-mixing feed-forward module, interleaved with residual connections and layer normalization. Two key innovations distinguish GLFormer: an adaptive, context-aware token mixer, and a hierarchical aggregation scheme enabling progressively enlarged temporal receptive fields in a causal, efficient manner.
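To make the layer layout concrete, here is a minimal PyTorch-style sketch of one such block; the pre-norm placement, GELU activation, and expansion ratio are assumptions rather than details taken from the paper, and the token mixer is left pluggable (see Section 2).

```python
import torch
import torch.nn as nn

class GLFormerStyleBlock(nn.Module):
    """One layer: an attention-free local token-mixing sub-block followed by a
    channel-mixing feed-forward module, each with layer normalization and a
    residual connection (assumed pre-norm ordering)."""

    def __init__(self, dim: int, token_mixer: nn.Module, ffn_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mixer = token_mixer                 # e.g. the adaptive mixer of Section 2
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(                      # channel-mixing feed-forward module
            nn.Linear(dim, ffn_ratio * dim),
            nn.GELU(),
            nn.Linear(ffn_ratio * dim, dim),
        )

    def forward(self, x: torch.Tensor, timestamps: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim) chronologically ordered neighbor embeddings
        x = x + self.token_mixer(self.norm1(x), timestamps)   # local token mixing
        x = x + self.ffn(self.norm2(x))                       # channel mixing
        return x
```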

2. Adaptive Token Mixer

For each node $u$ with chronological neighbor embeddings $I_u = [X_{u_1}, \ldots, X_{u_N}] \in \mathbb{R}^{N \times d}$ (associated with timestamps $t_1 \leq \cdots \leq t_N$), the adaptive token mixer aggregates the $M$ most recent neighbors for every position $i$:

$$H_{i,:} = \sum_{p=0}^{M-1} \alpha_p^i \cdot I_{i-p,:}$$

Mixing weights $\alpha_p^i$ combine two factors:

  • Positional importance $w_p$: a learned per-token-offset weight capturing ordinal significance.
  • Temporal proximity $\theta_p^i$: a softmax over exponentially decayed timestamp intervals $\Delta t_p^i = t_i - t_{i-p}$, favoring temporally proximate events.

The composite mixing coefficient is given by:

$$\alpha_p^i = \beta \cdot w_p + (1-\beta) \cdot \theta_p^i$$

$$\theta_p^i = \frac{\exp(-\Delta t_p^i)}{\sum_{q=0}^{M-1} \exp(-\Delta t_q^i)}$$

where $\beta \in [0,1]$ is a learnable scalar. This context-aware token mixing adaptively fuses position and timing, enabling the model to privilege relevant recent information without attending globally over all $N^2$ token pairs.
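A minimal sketch of this adaptive token mixer in PyTorch follows; the sigmoid parametrisation of $\beta$, the softmax normalisation of the positional weights over valid offsets, and the masking convention are our assumptions, and the class name AdaptiveTokenMixer is illustrative.

```python
import torch
import torch.nn as nn

class AdaptiveTokenMixer(nn.Module):
    def __init__(self, window: int):
        super().__init__()
        self.M = window                                    # number of recent neighbors mixed
        self.w = nn.Parameter(torch.zeros(window))         # positional weights w_p
        self.beta_logit = nn.Parameter(torch.zeros(()))    # sigmoid(.) keeps beta in [0, 1]

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x: (B, N, d) chronologically ordered neighbor embeddings, t: (B, N) timestamps
        B, N, _ = x.shape
        # Offsets p = 0..M-1 relative to each position i; i - p < 0 is masked for causality.
        idx = (torch.arange(N, device=x.device).unsqueeze(1)
               - torch.arange(self.M, device=x.device).unsqueeze(0))          # (N, M)
        valid = idx >= 0
        idx = idx.clamp(min=0)
        x_win, t_win = x[:, idx], t[:, idx]            # (B, N, M, d) and (B, N, M)
        dt = t.unsqueeze(-1) - t_win                   # Δt_p^i = t_i - t_{i-p}
        # Temporal proximity θ_p^i: softmax over exponentially decayed intervals.
        theta = torch.softmax((-dt).masked_fill(~valid, float("-inf")), dim=-1)
        # Positional importance w_p (normalising over valid offsets is an assumption).
        w = torch.softmax(self.w.expand(B, N, -1).masked_fill(~valid, float("-inf")), dim=-1)
        beta = torch.sigmoid(self.beta_logit)
        alpha = beta * w + (1.0 - beta) * theta        # α_p^i = β w_p + (1-β) θ_p^i
        return torch.einsum("bnm,bnmd->bnd", alpha, x_win)   # H_{i,:}
```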

3. Hierarchical Aggregation and Temporal Receptive Field

To efficiently enlarge the temporal receptive field and capture longer-term patterns, GLFormer stacks $L$ token-mixer layers with dilated offset ranges:

$$R_l = \{ p \in \mathbb{Z} \mid s^{l-1} \leq p \leq s^l \}$$

with kernel size $K_l = |R_l|$ and layer-wise offset boundaries $s^0 = 0 < s^1 < \ldots < s^L$. The $l$-th layer mixer processes the previous layer's outputs $H_{TA}^{(l-1)}$ as:

$$H_{i,:}^{(l)} = \sum_{p \in R_l} (\alpha_p^i)^{(l)} \cdot H_{TA,\,i-p,:}^{(l-1)}$$

Causality is preserved by masking out cases where $i - p < 1$. The hierarchical stacking yields a dilated, causal temporal receptive field reaching up to $s^L$, while operations remain local within each layer.
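As a concrete illustration of how these offset ranges partition the history, the short sketch below enumerates $R_l$ and the kernel sizes $K_l$ for a hypothetical set of boundaries (the values of $s^l$ are ours, not reported settings).

```python
def offset_ranges(boundaries):
    """boundaries = [s_0, s_1, ..., s_L] with s_0 = 0 < s_1 < ... < s_L."""
    return [list(range(boundaries[l - 1], boundaries[l] + 1))
            for l in range(1, len(boundaries))]

s = [0, 4, 16, 64]                                   # hypothetical layer-wise boundaries
for l, R in enumerate(offset_ranges(s), start=1):
    print(f"layer {l}: offsets {R[0]}..{R[-1]}, kernel size K_l = {len(R)}")
# layer 1: offsets 0..4,   K_1 = 5
# layer 2: offsets 4..16,  K_2 = 13
# layer 3: offsets 16..64, K_3 = 49
```

Each layer mixes only its own $K_l$ offsets of the previous layer's outputs, so the stack reaches far back in time while every individual operation stays local and causal.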

4. Computational and Parameter Efficiency

The computational and space complexity of GLFormer compares favorably against self-attention:

Operation | Complexity | Parameters
Self-attention (per layer) | $\mathcal{O}(N^2 d)$ | $\mathcal{O}(d^2)$
GLFormer token mixer (layer $l$) | $\mathcal{O}(N K_l d)$ | $K_l$ for $w_p$, plus a few scalars ($\beta$)

Stacking $L$ layers, the total cost is $\sum_{l=1}^{L} \mathcal{O}(N K_l d)$. As $\sum_l K_l \ll N$ in practice, overall complexity approaches quasi-linear scaling in $N$. Memory and parameter requirements are also reduced, as GLFormer does not require projection matrices or pairwise attention maps. This confers notable speedups during both training and inference.
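For intuition, here is a back-of-envelope comparison of per-layer token-mixing cost, using illustrative values (the choices of $N$, $d$, and $K_l$ below are assumptions matching the hypothetical boundaries above, not the paper's settings):

```python
N, d = 512, 172                               # sequence length and embedding dimension
K = [5, 13, 49]                               # kernel sizes K_l with sum(K) << N

attention_ops = N * N * d                     # O(N^2 d) pairwise attention
glformer_ops = sum(N * K_l * d for K_l in K)  # sum_l O(N K_l d) local mixing

print(attention_ops / glformer_ops)           # = N / sum(K) ≈ 7.6x fewer operations
```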

5. Empirical Performance and Experimental Protocol

GLFormer was evaluated on six benchmark dynamic-graph datasets—Wikipedia, Reddit, MOOC, LastFM, SocialEvo, and Enron—for transductive temporal link prediction. Data consists of timestamped user-item or user-user sequences, partitioned 70%/15%/15% chronologically. Comparison was performed across five prominent backbone encoders: TGN, TCL, TGAT, CAWN, and DyGFormer. Four token mixing strategies were compared for each backbone:

  1. Vanilla Transformer (self-attention)
  2. Pooling over the $s$ most recent neighbors
  3. MLP Mixer over tokens
  4. GLFormer’s adaptive mixer with hierarchical aggregation

Metrics included Average Precision (AP) and AUC-ROC. GLFormer achieved the best average rank across all datasets, yielding AP improvements of 0.2–1.5 over vanilla attention. For DyGFormer, replacing attention with GLFormer attained +0.35 AP on MOOC and +0.24 AP on Reddit. Inference-time analyses demonstrated 3–10× speedups over vanilla Transformers and 1.5–3× over MLP Mixers. Ablation studies confirmed the necessity of the learnable positional and temporal components, residual connections, and the choice of non-linearity (GELU preferred over ReLU): removing any of these degrades AP by up to 1.5.
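For reference, AP and AUC-ROC in temporal link prediction are typically computed by scoring held-out positive edges against sampled negative edges; a minimal sketch with placeholder scores (not the paper's evaluation code) follows.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
pos_scores = rng.random(1000)                # model scores for observed (positive) test edges
neg_scores = rng.random(1000)                # scores for sampled negative edges

labels = np.concatenate([np.ones_like(pos_scores), np.zeros_like(neg_scores)])
scores = np.concatenate([pos_scores, neg_scores])

print("AP :", average_precision_score(labels, scores))
print("AUC:", roc_auc_score(labels, scores))
```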

6. Limitations and Future Directions

A principal limitation is the fixed nature of the layerwise offset boundaries $s^l$, which must be preselected; dynamically learning these or employing attention-like sparsification mechanisms could augment adaptivity. GLFormer is tailored to first-order neighbor sequences, and extension to multi-hop or heterogeneous neighbor types (e.g., relations with attribute-rich edges) remains open. While the present work concentrates on transductive link prediction, additional axes for future research include inductive settings, dynamic node classification, and continuous hypergraph forecasting.

7. Impact and Significance

GLFormer demonstrates that attention-free, local, adaptive token mixing architectures can rival or surpass global self-attention in dynamic graph settings both in accuracy and computational efficiency. Its architectural paradigm, grounded in residual learning, layer normalization, and hierarchical local aggregation, supports robust modeling of evolving networked systems while facilitating scalability to long sequences and high-frequency interaction data. This development questions the necessity of expensive full self-attention in dynamic graphs and signals a movement toward lighter, more scalable temporal graph models (Zou et al., 16 Nov 2025).
