
Dynamic Graph Transformer

Updated 5 February 2026
  • Dynamic Graph Transformer is a neural architecture that applies Transformer-based self-attention to model evolving graphs by converting interactions into sequential tokens.
  • The approach uses specialized tokens for history delimitation and time binning, enabling precise temporal alignment without traditional recurrent or message-passing methods.
  • Empirical evaluations show that this method overcomes over-smoothing and vanishing gradients, providing scalable and robust predictions compared to standard RNNs and GNNs.

A Dynamic Graph Transformer is a neural architecture that uses Transformer-based self-attention mechanisms to model time-evolving graphs without relying on recurrent neural networks (RNNs) or traditional message-passing graph neural networks (GNNs). These models convert the temporal evolution and topological structure of a dynamic graph into sequential (or sequence-like) input tokens, enabling the Transformer to directly attend across time and structure. Dynamic Graph Transformers are motivated by the need to handle long-range dependencies, model non-trivial temporal patterns, and provide scalable alternatives to RNN/GNN pipelines, which can suffer from over-smoothing, vanishing gradients, and limited scalability. When formulated carefully, Dynamic Graph Transformers capture both evolution and structure using autoregressive or sequence-prediction approaches, with temporal alignment, special tokenization strategies, and masking that respect the non-Euclidean and non-stationary nature of dynamic graphs (Wu et al., 2024).

1. Problem Formulation and Sequence Mapping

The fundamental input to a Dynamic Graph Transformer is a dynamic graph $G=(V,E,T,X)$ represented as a chronologically ordered sequence of timestamped interactions:

  • $E=\{(v_i, v_j, \tau_n)\}_{n=1}^{|E|}$, with $\tau_n \in T$;
  • each $(v_i, v_j, \tau)$ is interpreted as a "token" corresponding to an edge event in the evolving structure.

The modeling goal is to learn parameters $\theta$ so that, for any node $v_i$, given its entire interaction history up to a cutoff time $\bar\tau$, the model predicts which nodes it will interact with in a future interval. This is achieved by transforming the temporal ego-network of each node, that is, all of $v_i$'s edges up to $\bar\tau$, into a single sequence $w_i$:

$$w_i = \langle v_i^1, v_i^2, \ldots, v_i^m \rangle$$

where each $v_i^k$ denotes a node interacted with at time $\tau^k \leq \bar\tau$, ordered by increasing $\tau^k$.

To enable effective learning and prediction, the implementation uses special tokens to demarcate history, prediction intervals, and temporal alignment:

  • $\langle\mathrm{hist}\rangle$ and $\langle\mathrm{endofhist}\rangle$ delimit the history section;
  • $\langle\mathrm{pred}\rangle$ and $\langle\mathrm{endofpred}\rangle$ delimit future-prediction targets;
  • temporal bins (e.g., weeks, months, dialogue turns) are assigned discrete, learnable $\langle\mathrm{time}\; t\rangle$ tokens that partition the sequence into uniform time intervals, providing temporal alignment across different nodes and histories.

This sequence construction converts dynamic graph modeling into standard sequence modeling, leveraging the Transformer’s capability to process arbitrarily long, ordered sequences (Wu et al., 2024).
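The construction above can be sketched in plain Python. The token names (`<hist>`, `<time t>`, etc.) and the `bin_width` parameter follow the paper's scheme only loosely and are illustrative assumptions, not identifiers from the original implementation:

```python
# Sketch: convert one node's timestamped interactions into a
# SimpleDyG-style token sequence. Token spellings and bin_width
# are illustrative assumptions, not names from the paper's code.

def tokenize_history(node, events, t_start, bin_width, n_bins):
    """events: list of (u, v, tau) edges; returns the history token list."""
    # Keep only this node's ego-network, ordered by timestamp.
    neighbors = sorted(
        (tau, v if u == node else u)
        for (u, v, tau) in events
        if node in (u, v)
    )
    tokens = ["<hist>", str(node)]
    current_bin = -1
    for tau, other in neighbors:
        b = min(int((tau - t_start) / bin_width), n_bins - 1)
        if b != current_bin:                # entering a new temporal bin
            tokens.append(f"<time {b}>")    # discrete, learnable time token
            current_bin = b
        tokens.append(str(other))
    tokens.append("<endofhist>")
    return tokens

events = [(0, 1, 0.5), (0, 2, 1.5), (3, 0, 1.7), (1, 2, 2.0)]
print(tokenize_history(0, events, t_start=0.0, bin_width=1.0, n_bins=3))
# ['<hist>', '0', '<time 0>', '1', '<time 1>', '2', '3', '<endofhist>']
```

Because the time tokens are keyed to a global grid, the same bin index appears in every node's sequence for events in the same interval, which is what aligns different histories temporally.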

2. Model Architecture: SimpleDyG

All components utilize standard, off-the-shelf modules from the Transformer framework:

  • Embedding Layer: Each node ID, special token, and time-bin token is mapped to a learnable embedding in $\mathbb{R}^d$. If node side features $x_v \in \mathbb{R}^f$ are available, they are linearly projected to $\mathbb{R}^d$ and added to the node embedding. Position encodings are optional, and most of the temporal signal comes from the explicit time tokens.
  • Transformer Stack: For $L$ layers,

$$Q = H^{l-1}W_Q,\quad K = H^{l-1}W_K,\quad V = H^{l-1}W_V$$
$$\text{head}_h = \operatorname{softmax}\!\left( \frac{Q_h K_h^\top}{\sqrt{d/H}} \right)V_h,\quad h=1,\ldots,H$$
$$\operatorname{MHA}(H^{l-1}) = \operatorname{Concat}(\text{head}_1, \ldots, \text{head}_H)\,W_O$$
$$\tilde H^l = \operatorname{LayerNorm}\big( H^{l-1} + \operatorname{MHA}(H^{l-1}) \big)$$
$$H^l = \operatorname{LayerNorm}\big( \tilde H^l + \operatorname{FFN}(\tilde H^l) \big)$$

where FFN is a standard two-layer MLP with ReLU.

  • Decoding & Output Projection: The input and prediction tokens are concatenated, and at each position $i$ the conditional token probability is computed as

$$p(r_i \mid R_{<i}) = \operatorname{softmax}\big( \operatorname{LayerNorm}(H^{L}_{<i})\, W_{\mathrm{vocab}} \big)$$

where $W_{\mathrm{vocab}} \in \mathbb{R}^{d \times (|V| + \#\text{special tokens})}$.
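The layer equations translate almost line for line into NumPy. The following is a minimal single-head sketch ($H=1$, no dropout, no learnable LayerNorm scale/shift) with random placeholder weights, intended only to make the shapes and the residual/normalization order concrete:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_layer(H, Wq, Wk, Wv, Wo, W1, W2, mask):
    # Self-attention: Q = H W_Q, K = H W_K, V = H W_V
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)    # causal mask: attend to past only
    A = softmax(scores) @ V
    H1 = layer_norm(H + A @ Wo)              # residual + LayerNorm
    ffn = np.maximum(H1 @ W1, 0.0) @ W2      # two-layer MLP with ReLU
    return layer_norm(H1 + ffn)              # second residual + LayerNorm

rng = np.random.default_rng(0)
n, d = 6, 8                                  # sequence length, model width
H0 = rng.normal(size=(n, d))                 # token embeddings (H^0)
mask = np.tril(np.ones((n, n), dtype=bool))  # autoregressive masking
params = [rng.normal(scale=0.1, size=s) for s in
          [(d, d)] * 4 + [(d, 4 * d), (4 * d, d)]]
H1 = transformer_layer(H0, *params, mask=mask)
print(H1.shape)  # (6, 8)
```

The lower-triangular mask is what makes the stack autoregressive: position $i$ can only attend to positions $\le i$, matching the conditional factorization $p(r_i \mid R_{<i})$.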

3. Temporal Alignment and Special Token Ablation

Temporal alignment in SimpleDyG is realized by partitioning the event sequence into equal-length bins and inserting discrete time-binning tokens at the start of each temporal segment. This ensures that histories for all nodes are aligned on a global time grid. Special tokens are empirically found to be critical:

  • Removal of all special tokens leads to catastrophic performance collapse (NDCG@5 drops by 60–90%).
  • Collapsing all $\langle\mathrm{time}\rangle$ tokens into a single generic "time" marker substantially degrades performance on datasets with bursty event distributions, but is less detrimental on datasets with smoother, more uniform evolution.

4. Training Objective and Implementation

The principal training loss is:

$$L(\theta) = -\sum_{R \in \text{train}} \sum_{i=1}^{|R|} \log p_\theta(r_i \mid R_{<i})$$

which is the negative log-likelihood of the joint token sequence. This is a standard autoregressive sequence modeling loss with no additional regularization beyond conventional dropout or weight decay in the optimizer. Training proceeds in batches over tokenized node histories:

```
for batch in loader:
    loss = 0
    for R in batch:                          # R: tokenized history of one node
        H = EmbedTokens(R) + PosEnc          # H^0
        for l in range(1, L + 1):
            H = TransformerLayer_l(H)        # H^l, with causal masking
        loss += -sum(log p(r_i | R_<i) for position i in R)
    backprop(loss)
    optimizer.step()
```
The approach is end-to-end, and the complexity per sequence is $O(n^2)$, where $n$ is the length of each node's ego-history (typically small relative to the global graph, thus avoiding quadratic scaling in $|V|$).
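As a concrete toy instance of the loss above, the snippet below evaluates the next-token negative log-likelihood from a table of per-position logits. The logit values are arbitrary illustrative numbers, not model outputs:

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over one vocabulary-sized row.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def sequence_nll(logits, targets):
    # L(theta) = -sum_i log p(r_i | R_<i): logits[i] are the scores the
    # model assigns to every vocabulary token at position i.
    return -sum(log_softmax(row)[t] for row, t in zip(logits, targets))

# Toy example: vocabulary of 3 tokens, a 2-token target sequence.
logits = [[2.0, 0.0, 0.0],   # position 1: model favours token 0
          [0.0, 0.0, 2.0]]   # position 2: model favours token 2
print(round(sequence_nll(logits, [0, 2]), 4))
```

The loss is lower when the targets coincide with the high-logit tokens, which is exactly the signal the autoregressive training loop optimizes.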

5. Empirical Evaluation and Comparative Baselines

Experiments span four diverse, real-world datasets:

  • UCI (social messages): 1,781 nodes, 16,743 edges, 13 weekly bins.
  • ML-10M (user–tag ratings): 15,841 nodes, 48,561 edges, 13 monthly bins.
  • Hepth (citations): 4,737 nodes, 14,831 edges, 12 bi-monthly bins.
  • MMConv (multi-turn dialogues): 7,415 entities, 91,986 turns, 16 bins.

Baselines include discrete-time methods (DySAT, EvolveGCN) and continuous-time methods (DyRep, JODIE, TGAT, TGN, TREND, GraphMixer), all adapted to a ranking-based (BPR) loss for comparability.

Key results:

  • On UCI, NDCG@5 = 0.104 for both SimpleDyG and GraphMixer (tie); Jaccard = 0.092 vs. 0.042 (next best).
  • On ML-10M, SimpleDyG achieves NDCG@5 = 0.092 vs. 0.042 (next best).
  • On Hepth, inductive NDCG@5 = 0.035 for SimpleDyG vs. 0.034 for the best baseline.
  • On MMConv, NDCG@5 = 0.184 vs. 0.172, Jaccard = 0.169 vs. 0.095.

Ablations demonstrate that both special and time-binning tokens are critical for stability and effectiveness. Additionally, SimpleDyG outperforms temporal GNNs in multistep (T+1, T+2, T+3 horizon) forecasting—degrading more smoothly and maintaining state-of-the-art performance at each step. Per-epoch training time is competitive: 6.2s for SimpleDyG vs. 12.2s for DySAT, 18.5s for TGAT, and 6.9s for GraphMixer on UCI (NVIDIA L40).

6. Advantages, Limitations, and Scalability

Advantages

  • Long-range dependency modeling: Self-attention directly connects arbitrary events in the sequence, enabling robust capture of temporal dependencies across hundreds of steps without the truncation effects of RNNs.
  • No GNN over-smoothing: By eschewing message-passing on the global graph, the architecture avoids feature collapse prevalent in deep GNNs as graph diameter increases.
  • Scalability: By constructing per-node “documents” (ego-sequences), the attention cost is localized to the node, not the entire system, allowing efficient parallelization and training even on large graphs.
  • Temporal alignment: Transforming the dynamic-graph prediction task into a time-binned sequence learning problem allows standard Transformers to model both fine and global timescales.

Limitations

  • Bin size hyperparameter: Time discretization (bin size) must be tuned per dataset, as too coarse or too fine binning can harm learning.
  • Dense graphs: For nodes with extremely long histories (e.g., millions of interactions), the per-node sequence length $n$ becomes prohibitively large, requiring subsampling or the use of sliding windows.
  • No native edge attribute encoding: The base model does not handle edge features or edge types unless further embedded as additional tokens.

7. Implementation Guidelines and Practical Insights

  • Sequence Construction: For each node, gather its interaction history, sort it by timestamp, group it by time bin, and demarcate it with time markers and special tokens.
  • Model Build: Use standard, off-the-shelf Transformer stacks; tune architecture depth (2–4 layers), number of heads (2–8), and dimensionality (128–512) based on dataset scale.
  • Tokenization: Node IDs, time bins, and special markers must all be provided with unique learnable embeddings.
  • Prediction & Training: For future link prediction, decode the sequence autoregressively. Loss is next-token likelihood but can be replaced by ranking losses if desired.
  • Performance Profile: Expect that long-term memory or precise temporal alignment requirements favor this approach over heavier GNN structures.
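The autoregressive decoding step in the guidelines can be sketched as a greedy loop. Here `score_next` is a hypothetical stand-in for the trained model's next-token distribution, and the stub below is purely illustrative:

```python
def greedy_decode(score_next, history, end_token="<endofpred>", max_new=5):
    """Autoregressively extend `history` until the end token or max_new steps.

    score_next(seq) -> dict mapping candidate tokens to scores; it stands in
    for the softmax output of the trained Transformer (an assumption here).
    """
    seq = list(history) + ["<pred>"]
    predicted = []
    for _ in range(max_new):
        scores = score_next(seq)
        nxt = max(scores, key=scores.get)   # greedy: highest-scoring token
        if nxt == end_token:
            break
        predicted.append(nxt)
        seq.append(nxt)
    return predicted

# Stub model: predicts node "7" once, then closes the prediction span.
def stub_model(seq):
    return ({"7": 0.9, "<endofpred>": 0.6} if seq[-1] == "<pred>"
            else {"7": 0.1, "<endofpred>": 0.9})

print(greedy_decode(stub_model, ["<hist>", "3", "<time 0>", "7", "<endofhist>"]))
# ['7']
```

Swapping `max(...)` for a top-$k$ ranking over node tokens recovers the ranking-style evaluation (NDCG@5, Jaccard) used in the experiments.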

This paradigm demonstrates that, for a significant class of dynamic graph modeling problems, a structurally minimalist Transformer with appropriate time-aware tokenization and temporal alignment suffices to match or outperform more complex architectures, both in prediction quality and efficiency (Wu et al., 2024).

References

 1. Wu, Y., Fang, Y., & Liao, L. (2024). On the Feasibility of Simple Transformer for Dynamic Graph Modeling. In Proceedings of the ACM Web Conference 2024 (WWW).