
Recurrent Structure-Reinforced Graph Transformer

Updated 2 January 2026
  • The paper introduces a recurrent Transformer that explicitly models edge temporal states via a two-stage process and structure-aware attention, significantly advancing dynamic link prediction.
  • The methodology employs a structure-reinforced design combining global self-attention with topological and path-based feature encoding to effectively capture both local and global graph dynamics.
  • Empirical results demonstrate that RSGT outperforms state-of-the-art baselines on various dynamic graph datasets by mitigating over-smoothing and incorporating historical structural cues.

The Recurrent Structure-reinforced Graph Transformer (RSGT) is a framework for discrete dynamic graph representation learning designed to capture the evolving structural and temporal properties of time-evolving graphs. It addresses limitations in previous approaches that combine recurrent neural networks (RNNs) and graph neural networks (GNNs)—notably their inability to adequately encode edge temporal states and their susceptibility to over-smoothing, which collectively hinder the modeling of dynamic node relationships and the extraction of global structural features. RSGT introduces explicit edge temporal-state modeling and an advanced structure-reinforced transformer architecture within a recurrent paradigm, enabling superior local and global feature integration over discrete graph snapshots (Hu et al., 2023).

1. Edge Temporal States Modeling

At the core of RSGT is a two-stage process at each time step $t$. First, it converts the current graph snapshot $G_t = (V_t, E_t, W_t)$ together with the previous snapshot $G_{t-1}$ into a weighted multi-relation "difference" graph $\hat G_t = (V_t, \hat E_t, TP_t, \hat W_t)$. Here, $\hat E_t = E_{t-1} \cup E_t$, so that even vanished edges are retained for their residual effects. Each edge $(i,j) \in \hat E_t$ receives a temporal type $tp_{ij}^t$ among emerging ($\mathrm{e}$), persisting ($\mathrm{p}$), or disappearing ($\mathrm{d}$):

$$tp_{ij}^t = \begin{cases} \mathrm{e} & \text{if } e_{ij} \in E_t \setminus E_{t-1} \\ \mathrm{p} & \text{if } e_{ij} \in E_t \cap E_{t-1} \\ \mathrm{d} & \text{if } e_{ij} \in E_{t-1} \setminus E_t \end{cases}$$

Edge weights $\omega_{ij}^t$ encode long-term interaction memory:

$$\omega_{ij}^t = \begin{cases} \alpha k^{\beta} & \text{if } tp_{ij}^t \in \{\mathrm{e},\mathrm{p}\} \\ \omega_{ij}^{t-1} & \text{if } tp_{ij}^t = \mathrm{d} \end{cases}$$

where $k$ is the consecutive persistence count and $\alpha, \beta$ are hyperparameters. This construction yields a multi-relation weighted graph whose topology integrates both dynamic and structural cues, addressing the insufficient edge-state modeling of prior methods.
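The following Python sketch illustrates this construction; it is our illustration rather than the released implementation, and the reset of the persistence counter $k$ for newly emerging edges is an assumption.

```python
# Hedged sketch of the difference-graph construction: each edge in E_{t-1} ∪ E_t
# gets a temporal type (emerging / persisting / disappearing) and a weight; the
# weight alpha * k**beta uses the consecutive persistence count k from Section 1.
from typing import Dict, Set, Tuple

Edge = Tuple[int, int]

def build_difference_graph(
    E_prev: Set[Edge],
    E_curr: Set[Edge],
    persist_count: Dict[Edge, int],   # consecutive persistence count k per edge
    w_prev: Dict[Edge, float],        # previous weights, reused for disappearing edges
    alpha: float = 1.0,
    beta: float = 0.5,
):
    tp, w = {}, {}
    for e in E_prev | E_curr:
        if e in E_curr and e not in E_prev:        # emerging: reset k to 1 (assumption)
            tp[e] = "e"
            persist_count[e] = 1
            w[e] = alpha * persist_count[e] ** beta
        elif e in E_curr and e in E_prev:          # persisting: increment k
            tp[e] = "p"
            persist_count[e] = persist_count.get(e, 1) + 1
            w[e] = alpha * persist_count[e] ** beta
        else:                                      # disappearing: keep residual weight
            tp[e] = "d"
            w[e] = w_prev.get(e, alpha)
    return tp, w

# Example: (0, 1) persists, (1, 2) disappears (keeps its old weight), (2, 3) emerges.
tp, w = build_difference_graph({(0, 1), (1, 2)}, {(0, 1), (2, 3)}, {}, {(1, 2): 1.0})
```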

2. Structure-reinforced Graph Transformer Design

The Structure-reinforced Graph Transformer (SGT) operates at each time step $t$ on the current $\hat G_t$ and the previous hidden node embeddings $H^{t-1} \in \mathbb{R}^{|V| \times d}$. SGT stacks $l$ identical encoding layers with the following components:

(a) Global Self-Attention: Standard Transformer attention is computed as $A^t = \mathrm{softmax}\left(\frac{Q^t (K^t)^{\top}}{\sqrt d}\right)$, with query, key, and value projections $Q^t$, $K^t$, $V^t$ obtained from learnable weights.

(b) Graph Structural Encoding: For every ordered node pair $(i,j)$, two sets of features are extracted:

  • Topological attributes: $attr_s^{ij} = [\mathrm{outdeg}(v_i), \mathrm{indeg}(v_j), \mathrm{spath}(i,j)]$
  • Temporal path features along the shortest path $p$ from $i \to j$: $ATTR_p^{ij} = [tp(p); \omega(p)]$, embedded and encoded with a 1D convolution after positional encoding.

These are concatenated to yield the pairwise structural feature $r_{ij}$ (a minimal sketch of the topological part follows).
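A hedged sketch of the topological attributes using networkx; the function name and the capping of unreachable pairs at the shortest-path horizon are our assumptions.

```python
import networkx as nx

def topo_attrs(G: nx.DiGraph, i, j, max_spd: int = 5):
    """[outdeg(v_i), indeg(v_j), spath(i, j)] for the ordered node pair (i, j)."""
    try:
        spd = nx.shortest_path_length(G, source=i, target=j)
    except nx.NetworkXNoPath:
        spd = max_spd                      # unreachable pairs capped at the horizon
    return [G.out_degree(i), G.in_degree(j), min(spd, max_spd)]

# Example on a tiny directed graph: node 0 reaches node 2 in two hops.
G = nx.DiGraph([(0, 1), (1, 2)])
print(topo_attrs(G, 0, 2))   # [1, 1, 2]
```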

(c) Structure-aware Attention Reinforcement: Raw self-attention scores $a_{ij}^t$ are modulated by an affine map conditioned on $r_{ij}$:

$$\hat a_{ij}^t = \lambda_{ij}^t \cdot a_{ij}^t + \sigma_{ij}^t, \quad \text{where } \lambda_{ij}^t, \sigma_{ij}^t \text{ are affine transforms of } r_{ij}$$

(d) Update and Residuals: Updated node representations $\hat H^t$ are produced by normalizing $\hat A^t$ and multiplying by $V^t$; standard residual and feed-forward connections apply. After $l$ layers, an outer residual is added: $H^t = \hat H^t + H^{t-1}$.

This architecture enables the transformer to capture both semantic and structure/path-aware dependencies, directly incorporating dynamic edge information into the self-attention mechanism.
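The sketch below shows how the structure-aware reinforcement can be wired into a single-head attention layer in PyTorch. It is a simplified illustration (one head, dense pairwise features `r_pair` assumed precomputed), and the class and argument names are ours, not the authors'.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructAwareAttention(nn.Module):
    """One simplified SGT-style layer: attention scores modulated by pairwise structure."""
    def __init__(self, d_model: int, d_struct: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.to_lambda_sigma = nn.Linear(d_struct, 2)   # affine map r_ij -> (lambda, sigma)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor, r_pair: torch.Tensor) -> torch.Tensor:
        # h: (N, d_model) node embeddings; r_pair: (N, N, d_struct) pairwise features r_ij
        Q, K, V = self.q(h), self.k(h), self.v(h)
        a = Q @ K.T / Q.size(-1) ** 0.5                 # raw scores a_ij
        lam, sig = self.to_lambda_sigma(r_pair).unbind(-1)
        attn = F.softmax(lam * a + sig, dim=-1)         # structure-reinforced scores
        h = self.norm1(h + attn @ V)                    # residual + normalization
        return self.norm2(h + self.ffn(h))              # feed-forward block

# Usage: layer = StructAwareAttention(64, 8); h_new = layer(torch.randn(50, 64), torch.randn(50, 50, 8))
```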

3. Recurrent Learning Over Snapshots

RSGT models dynamic graph representation learning as a shallow recurrence across $T$ discrete graph snapshots. With $H^0 = X$ (initial node features), the recurrence is:

$$\hat H^t = \mathcal{F}_{\mathrm{SGT}}(\hat G_t, H^{t-1}), \quad H^t = H^{t-1} + \hat H^t$$

This sum accumulates past structural-temporal updates, allowing each $H^t$ to encode the full dynamic context up to snapshot $t$. The approach ensures both historical persistence and adaptation to new graph structures.
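A toy sketch of this recurrence; the linear map is only a stand-in for $\mathcal{F}_{\mathrm{SGT}}$ so the loop runs on its own, whereas in RSGT the stacked structure-reinforced layers would consume the difference graph as well.

```python
import torch

T, num_nodes, d = 5, 100, 16
sgt = torch.nn.Linear(d, d)          # stand-in for F_SGT (ignores the graph here)
X = torch.randn(num_nodes, d)        # H^0: initial node features
snapshots = range(T)                 # placeholders for the difference graphs G_hat_t

H = X
for G_hat in snapshots:
    H_hat = sgt(H)                   # in RSGT: F_SGT(G_hat_t, H^{t-1})
    H = H + H_hat                    # outer residual: H^t = H^{t-1} + H_hat^t
```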

4. Training Objective, Algorithm, and Complexity

The primary supervised task is dynamic link prediction. For each candidate edge $(i,j)$ at step $t+1$, its feature vector is $h_{ij}^{t+1} = |h_i^t - h_j^t|$, with the prediction produced by a shallow MLP:

$$\hat p_{ij}^{t+1} = \sigma( W_o h_{ij}^{t+1} + b_o )$$

and a binary cross-entropy loss with $L_2$ regularization:

$$J = \mathbb{E}_{(i,j)} \left[ -p_{ij} \log \hat p_{ij} - (1 - p_{ij}) \log (1 - \hat p_{ij}) \right] + \lambda\|\Theta\|_2^2$$

The optimization uses AdamW over all parameters, including $\alpha, \beta$ if they are learned.
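A hedged PyTorch sketch of this objective: the single linear scorer matches $\sigma(W_o h_{ij} + b_o)$, BCE-with-logits folds in the sigmoid, and the AdamW weight decay plays the role of the $\lambda\|\Theta\|_2^2$ term; the negative sampling and batch sizes shown are assumptions.

```python
import torch
import torch.nn as nn

d, num_nodes, batch = 16, 100, 256
scorer = nn.Linear(d, 1)                                  # W_o, b_o
opt = torch.optim.AdamW(scorer.parameters(), lr=1e-3, weight_decay=1e-4)

H_t = torch.randn(num_nodes, d)                           # node embeddings after snapshot t
pos = torch.randint(0, num_nodes, (batch, 2))             # observed edges at t+1
neg = torch.randint(0, num_nodes, (batch, 2))             # sampled non-edges
pairs = torch.cat([pos, neg])
labels = torch.cat([torch.ones(batch), torch.zeros(batch)])

h_ij = (H_t[pairs[:, 0]] - H_t[pairs[:, 1]]).abs()        # |h_i^t - h_j^t|
logits = scorer(h_ij).squeeze(-1)
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)

opt.zero_grad()
loss.backward()
opt.step()
```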

Computational Complexity: For one snapshot, the cost of the $l$ encoding layers is $O(l \cdot (|V|^2 d + |V| \cdot spd \cdot d_e))$, dominated by global attention ($|V|^2 d$) and path encoding ($|V| \cdot spd \cdot d_e$), where $spd$ is the shortest-path length horizon and $d_e$ the edge embedding dimension. Total runtime grows linearly with the number of snapshots $T$, and practical scalability is maintained by constraining $spd$, the history window $W$, and $|V|$.

5. Empirical Performance and Ablation Results

RSGT has been empirically validated on four real-world dynamic graphs:

Dataset      $|V|$     Edges      Train/Test
twi-Tennis   1,000     40,839     100/20
CollegeMsg   1,899     59,835     25/63
cit-HepTh    7,577     51,315     77/1
sx-MathOF    24,818    506,550    64/15

On dynamic link prediction, RSGT outperforms ten strong baselines (DeepWalk, node2vec, GraphSAGE, EvolveGCN, CoEvoSAGE, ROLAND, CTDNE, TGAT, CAW, TREND):

  • twi-Tennis: Accuracy 87.6% vs TREND 74.0% (+18.3% relative)
  • CollegeMsg: 86.8% vs 74.6% (+16.4%)
  • cit-HepTh: 87.2% vs 80.4% (+8.5%)
  • sx-MathOF: 87.9% vs 79.8% (+10.1%)

F1 scores demonstrate commensurate improvements.

Ablation analysis confirms two architectural choices as essential: (a) explicit edge temporal-state modeling (types and weights) and (b) structure-aware attention (pairwise topological and path-based features). Removing either causes performance drops of up to 15%. RSGT remains robust across variations in window size, number of Transformer layers, attention heads, and shortest-path horizon.

6. Significance, Limitations, and Context

RSGT addresses critical shortcomings of existing dynamic graph embedding algorithms by providing a unified, recurrent, and structure-aware Transformer architecture with explicit modeling of edge temporal states. The integration of dynamic edge types, long-term edge weights, and structure-conditioned attention distinguishes RSGT in both representation quality and downstream task performance. The design mitigates GNN over-smoothing, enables extraction of global graph structure, and remains scalable to graphs of moderate to large size.

By consistently outperforming contemporary baselines on dynamic link prediction and verifying through ablation that its design choices are necessary, RSGT substantiates the importance of fine-grained temporal-state modeling and structure-aware attention in dynamic graph learning. A plausible implication is that further refinements of Transformer-based recurrent paradigms, potentially with deeper recurrence, online inference, or continuous-time extensions, could continue to advance state-of-the-art performance on evolving graph data (Hu et al., 2023).

References (1)
