
Recurrent Structure-Reinforced Graph Transformer

Updated 2 January 2026
  • The paper introduces a recurrent Transformer that explicitly models edge temporal states via a two-stage process and structure-aware attention, significantly advancing dynamic link prediction.
  • The methodology employs a structure-reinforced design combining global self-attention with topological and path-based feature encoding to effectively capture both local and global graph dynamics.
  • Empirical results demonstrate that RSGT outperforms state-of-the-art baselines on various dynamic graph datasets by mitigating over-smoothing and incorporating historical structural cues.

The Recurrent Structure-reinforced Graph Transformer (RSGT) is a framework for discrete dynamic graph representation learning designed to capture the evolving structural and temporal properties of time-evolving graphs. It addresses limitations in previous approaches that combine recurrent neural networks (RNNs) and graph neural networks (GNNs)—notably their inability to adequately encode edge temporal states and their susceptibility to over-smoothing, which collectively hinder the modeling of dynamic node relationships and the extraction of global structural features. RSGT introduces explicit edge temporal-state modeling and an advanced structure-reinforced transformer architecture within a recurrent paradigm, enabling superior local and global feature integration over discrete graph snapshots (Hu et al., 2023).

1. Edge Temporal States Modeling

At the core of RSGT is a two-stage process at each time step $t$. First, it converts the current graph snapshot $G_t = (V_t, E_t, W_t)$ together with the previous snapshot $G_{t-1}$ into a weighted multi-relation "difference" graph $\hat G_t = (V_t, \hat E_t, TP_t, \hat W_t)$. Here, $\hat E_t = E_{t-1} \cup E_t$, so that even vanished edges are retained for their residual effects. Each edge $(i,j) \in \hat E_t$ receives a temporal type $tp_{ij}^t$ among emerging ($\mathrm{e}$), persisting ($\mathrm{p}$), or disappearing ($\mathrm{d}$):

$$tp_{ij}^t = \begin{cases} \mathrm{e} & \text{if } e_{ij} \in E_t \setminus E_{t-1} \\ \mathrm{p} & \text{if } e_{ij} \in E_t \cap E_{t-1} \\ \mathrm{d} & \text{if } e_{ij} \in E_{t-1} \setminus E_t \end{cases}$$

Edge weights $\omega_{ij}^t$ encode long-term interaction memory:

$$\omega_{ij}^t = \begin{cases} \alpha k^{\beta} & \text{if } tp_{ij}^t \in \{\mathrm{e},\mathrm{p}\} \\ \omega_{ij}^{t-1} & \text{if } tp_{ij}^t = \mathrm{d} \end{cases}$$

where $k$ is the consecutive persistence count and $\alpha, \beta$ are hyperparameters. This construction yields a multi-relation weighted graph whose topology integrates both dynamic and structural cues, addressing the insufficient edge-state modeling of prior methods.
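The following Python sketch illustrates this construction; it is our illustration rather than the released implementation, and the reset of the persistence counter $k$ for newly emerging edges is an assumption.

```python
# Hedged sketch of the difference-graph construction: each edge in E_{t-1} ∪ E_t
# gets a temporal type (emerging / persisting / disappearing) and a weight; the
# weight alpha * k**beta uses the consecutive persistence count k from Section 1.
from typing import Dict, Set, Tuple

Edge = Tuple[int, int]

def build_difference_graph(
    E_prev: Set[Edge],
    E_curr: Set[Edge],
    persist_count: Dict[Edge, int],   # consecutive persistence count k per edge
    w_prev: Dict[Edge, float],        # previous weights, reused for disappearing edges
    alpha: float = 1.0,
    beta: float = 0.5,
):
    tp, w = {}, {}
    for e in E_prev | E_curr:
        if e in E_curr and e not in E_prev:        # emerging: reset k to 1 (assumption)
            tp[e] = "e"
            persist_count[e] = 1
            w[e] = alpha * persist_count[e] ** beta
        elif e in E_curr and e in E_prev:          # persisting: increment k
            tp[e] = "p"
            persist_count[e] = persist_count.get(e, 1) + 1
            w[e] = alpha * persist_count[e] ** beta
        else:                                      # disappearing: keep residual weight
            tp[e] = "d"
            w[e] = w_prev.get(e, alpha)
    return tp, w

# Example: (0, 1) persists, (1, 2) disappears (keeps its old weight), (2, 3) emerges.
tp, w = build_difference_graph({(0, 1), (1, 2)}, {(0, 1), (2, 3)}, {}, {(1, 2): 1.0})
```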

2. Structure-reinforced Graph Transformer Design

The Structure-reinforced Graph Transformer (SGT) operates at each time step $t$ on the current $\hat G_t$ and the previous hidden node embeddings $H^{t-1} \in \mathbb{R}^{|V| \times d}$. SGT stacks $l$ identical encoding layers with the following components:

(a) Global Self-Attention: Standard Transformer attention is computed as $A^t = \mathrm{softmax}\left(\frac{Q^t (K^t)^{\top}}{\sqrt d}\right)$, with query, key, and value projections $Q^t$, $K^t$, $V^t$ obtained from learnable weights.

(b) Graph Structural Encoding: For every ordered node pair $(i,j)$, two sets of features are extracted:

  • Topological attributes: $attr_s^{ij} = [\mathrm{outdeg}(v_i), \mathrm{indeg}(v_j), \mathrm{spath}(i,j)]$
  • Temporal path features along the shortest path $p$ from $i \to j$: $ATTR_p^{ij} = [tp(p); \omega(p)]$, embedded and encoded with a 1D convolution after positional encoding.

These are concatenated to yield the pairwise structural feature $r_{ij}$ (a minimal sketch of the topological part follows).
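A hedged sketch of the topological attributes using networkx; the function name and the capping of unreachable pairs at the shortest-path horizon are our assumptions.

```python
import networkx as nx

def topo_attrs(G: nx.DiGraph, i, j, max_spd: int = 5):
    """[outdeg(v_i), indeg(v_j), spath(i, j)] for the ordered node pair (i, j)."""
    try:
        spd = nx.shortest_path_length(G, source=i, target=j)
    except nx.NetworkXNoPath:
        spd = max_spd                      # unreachable pairs capped at the horizon
    return [G.out_degree(i), G.in_degree(j), min(spd, max_spd)]

# Example on a tiny directed graph: node 0 reaches node 2 in two hops.
G = nx.DiGraph([(0, 1), (1, 2)])
print(topo_attrs(G, 0, 2))   # [1, 1, 2]
```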

(c) Structure-aware Attention Reinforcement: Raw self-attention scores $a_{ij}^t$ are modulated by an affine map conditioned on $r_{ij}$:

$$\hat a_{ij}^t = \lambda_{ij}^t \cdot a_{ij}^t + \sigma_{ij}^t, \quad \text{where } \lambda_{ij}^t, \sigma_{ij}^t \text{ are affine transforms of } r_{ij}$$

(d) Update and Residuals: Updated node representations $\hat H^t$ are produced by normalizing $\hat A^t$ and multiplying by $V^t$; standard residual and feed-forward connections apply. After $l$ layers, an outer residual is added: $H^t = \hat H^t + H^{t-1}$.

This architecture enables the transformer to capture both semantic and structure/path-aware dependencies, directly incorporating dynamic edge information into the self-attention mechanism.
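The sketch below shows how the structure-aware reinforcement can be wired into a single-head attention layer in PyTorch. It is a simplified illustration (one head, dense pairwise features `r_pair` assumed precomputed), and the class and argument names are ours, not the authors'.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructAwareAttention(nn.Module):
    """One simplified SGT-style layer: attention scores modulated by pairwise structure."""
    def __init__(self, d_model: int, d_struct: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.to_lambda_sigma = nn.Linear(d_struct, 2)   # affine map r_ij -> (lambda, sigma)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor, r_pair: torch.Tensor) -> torch.Tensor:
        # h: (N, d_model) node embeddings; r_pair: (N, N, d_struct) pairwise features r_ij
        Q, K, V = self.q(h), self.k(h), self.v(h)
        a = Q @ K.T / Q.size(-1) ** 0.5                 # raw scores a_ij
        lam, sig = self.to_lambda_sigma(r_pair).unbind(-1)
        attn = F.softmax(lam * a + sig, dim=-1)         # structure-reinforced scores
        h = self.norm1(h + attn @ V)                    # residual + normalization
        return self.norm2(h + self.ffn(h))              # feed-forward block

# Usage: layer = StructAwareAttention(64, 8); h_new = layer(torch.randn(50, 64), torch.randn(50, 50, 8))
```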

3. Recurrent Learning Over Snapshots

RSGT models dynamic graph representation learning as a shallow recurrence across $T$ discrete graph snapshots. With $H^0 = X$ (initial node features), the recurrence is:

$$\hat H^t = \mathcal{F}_{\mathrm{SGT}}(\hat G_t, H^{t-1}), \quad H^t = H^{t-1} + \hat H^t$$

This sum accumulates past structural-temporal updates, allowing each $H^t$ to encode the full dynamic context up to snapshot $t$. The approach ensures both historical persistence and adaptation to new graph structures.
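A toy sketch of this recurrence; the linear map is only a stand-in for $\mathcal{F}_{\mathrm{SGT}}$ so the loop runs on its own, whereas in RSGT the stacked structure-reinforced layers would consume the difference graph as well.

```python
import torch

T, num_nodes, d = 5, 100, 16
sgt = torch.nn.Linear(d, d)          # stand-in for F_SGT (ignores the graph here)
X = torch.randn(num_nodes, d)        # H^0: initial node features
snapshots = range(T)                 # placeholders for the difference graphs G_hat_t

H = X
for G_hat in snapshots:
    H_hat = sgt(H)                   # in RSGT: F_SGT(G_hat_t, H^{t-1})
    H = H + H_hat                    # outer residual: H^t = H^{t-1} + H_hat^t
```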

4. Training Objective, Algorithm, and Complexity

The primary supervised task is dynamic link prediction. For each candidate edge $(i,j)$ at step $t+1$, its feature vector is $h_{ij}^{t+1} = |h_i^t - h_j^t|$, with the prediction produced by a shallow MLP:

$$\hat p_{ij}^{t+1} = \sigma( W_o h_{ij}^{t+1} + b_o )$$

and a binary cross-entropy loss with $L_2$ regularization:

$$J = \mathbb{E}_{(i,j)} \left[ -p_{ij} \log \hat p_{ij} - (1 - p_{ij}) \log (1 - \hat p_{ij}) \right] + \lambda\|\Theta\|_2^2$$

The optimization uses AdamW over all parameters, including $\alpha, \beta$ if they are learned.
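A hedged PyTorch sketch of this objective: the single linear scorer matches $\sigma(W_o h_{ij} + b_o)$, BCE-with-logits folds in the sigmoid, and the AdamW weight decay plays the role of the $\lambda\|\Theta\|_2^2$ term; the negative sampling and batch sizes shown are assumptions.

```python
import torch
import torch.nn as nn

d, num_nodes, batch = 16, 100, 256
scorer = nn.Linear(d, 1)                                  # W_o, b_o
opt = torch.optim.AdamW(scorer.parameters(), lr=1e-3, weight_decay=1e-4)

H_t = torch.randn(num_nodes, d)                           # node embeddings after snapshot t
pos = torch.randint(0, num_nodes, (batch, 2))             # observed edges at t+1
neg = torch.randint(0, num_nodes, (batch, 2))             # sampled non-edges
pairs = torch.cat([pos, neg])
labels = torch.cat([torch.ones(batch), torch.zeros(batch)])

h_ij = (H_t[pairs[:, 0]] - H_t[pairs[:, 1]]).abs()        # |h_i^t - h_j^t|
logits = scorer(h_ij).squeeze(-1)
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)

opt.zero_grad()
loss.backward()
opt.step()
```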

Computational Complexity: For one snapshot, the cost of the $l$ encoding layers is $O(l \cdot (|V|^2 d + |V| \cdot spd \cdot d_e))$, dominated by global attention ($|V|^2 d$) and path encoding ($|V| \cdot spd \cdot d_e$), where $spd$ is the shortest-path length horizon and $d_e$ the edge embedding dimension. Total runtime grows linearly with the number of snapshots $T$, and practical scalability is maintained by constraining $spd$, the history window $W$, and $|V|$.

5. Empirical Performance and Ablation Results

RSGT has been empirically validated on four real-world dynamic graphs:

Dataset      $|V|$     Edges      Train/Test
twi-Tennis   1,000     40,839     100/20
CollegeMsg   1,899     59,835     25/63
cit-HepTh    7,577     51,315     77/1
sx-MathOF    24,818    506,550    64/15

On dynamic link prediction, RSGT outperforms ten strong baselines (DeepWalk, node2vec, GraphSAGE, EvolveGCN, CoEvoSAGE, ROLAND, CTDNE, TGAT, CAW, TREND):

  • twi-Tennis: Accuracy 87.6% vs TREND 74.0% (+18.3% relative)
  • CollegeMsg: 86.8% vs 74.6% (+16.4%)
  • cit-HepTh: 87.2% vs 80.4% (+8.5%)
  • sx-MathOF: 87.9% vs 79.8% (+10.1%)

F1 scores demonstrate commensurate improvements.

Ablation analysis confirms two architectural choices as essential: (a) explicit edge temporal-state modeling (types and weights) and (b) structure-aware attention (pairwise topological and path-based features). Removing either causes performance drops of up to 15%. RSGT remains robust across variations in window size, number of Transformer layers, attention heads, and shortest-path horizon.

6. Significance, Limitations, and Context

RSGT addresses critical shortcomings of existing dynamic graph embedding algorithms by providing a unified, recurrent, and structure-aware Transformer architecture with explicit modeling of edge temporal states. The integration of dynamic edge types, long-term edge weights, and structure-conditioned attention distinguishes RSGT in both representation quality and downstream task performance. The design mitigates GNN over-smoothing, enables extraction of global graph structure, and remains scalable to graphs of moderate to large size.

By consistently outperforming contemporary baselines on dynamic link prediction and verifying through ablation that its design choices are necessary, RSGT substantiates the importance of fine-grained temporal-state modeling and structure-aware attention in dynamic graph learning. A plausible implication is that further refinements of Transformer-based recurrent paradigms, potentially with deeper recurrence, online inference, or continuous-time extensions, could continue to advance state-of-the-art performance on evolving graph data (Hu et al., 2023).

References (1)
