
DyGFormer Architecture

Updated 19 January 2026
  • DyGFormer is a transformer-based architecture that integrates a neighbor co-occurrence encoding scheme and a patching mechanism to capture temporal interactions and reduce sequence lengths.
  • It supports both the original sinusoidal time encoder and a later linear time encoder; the linear variant models temporal recency with fewer parameters and, in most settings, equal or better accuracy.
  • Empirical evaluations demonstrate state-of-the-art performance in dynamic link prediction and node classification, achieving runtime and memory reductions on large-scale datasets.

DyGFormer is a Transformer-based deep learning architecture developed for continuous-time dynamic graph learning, particularly targeting dynamic link prediction and node classification in temporal networks. The architecture builds upon two innovations: a neighbor co-occurrence encoding (NCoE) scheme that captures source-destination node correlations and a patching mechanism that enables scalable modeling of long interaction histories. DyGFormer achieves state-of-the-art performance on several benchmarks, while a recently proposed linear time encoder simplifies the model and improves efficiency and accuracy in most scenarios (Yu et al., 2023, Chung et al., 10 Apr 2025).

1. Model Architecture and Workflow

DyGFormer processes a query edge event $(i, j, t)$ by extracting the first-hop temporal interaction histories of nodes $i$ and $j$ up to time $t$:

$$\tilde{\mathcal S}_i^t = \{(i, v, t') : (i, v, t') \in \mathcal E,\ t' < t\} \cup \{(i, j, t)\}$$
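The set definition above can be sketched in a few lines of Python. This is only an illustration: the event list, the `Event` tuple layout, and the function name `interaction_history` are assumptions made for the example, and events are assumed to be stored as directed `(source, destination, timestamp)` triples.

```python
from typing import List, Tuple

# Hypothetical event layout: (source, destination, timestamp).
Event = Tuple[int, int, float]


def interaction_history(events: List[Event], i: int, j: int, t: float) -> List[Event]:
    """Collect events (i, v, t') with t' < t, then append the query event (i, j, t)."""
    past = [(u, v, ts) for (u, v, ts) in events if u == i and ts < t]
    past.sort(key=lambda e: e[2])  # keep chronological order
    return past + [(i, j, t)]


events = [(0, 1, 1.0), (2, 3, 1.5), (0, 2, 2.0), (0, 3, 5.0)]
history = interaction_history(events, i=0, j=1, t=4.0)
print(history)
```

The event at time 5.0 is excluded because it happens after the query time, and the event between nodes 2 and 3 is excluded because node 0 is not its source.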

Each event in the sequence is encoded along four parallel “channels”:

  • Neighbor-node attributes $\mathbf{x}_v \in \mathbb{R}^{d_V}$
  • Edge attributes $\mathbf{x}_{i, v}^{t'} \in \mathbb{R}^{d_E}$
  • Co-occurrence counts $\mathbf{c} \in \mathbb{N}^2$, representing the frequencies of the neighbor in the histories of both $i$ and $j$
  • Time encoding $\Phi(t - t')$ (dimension $d_T$), representing recency

To improve scalability, DyGFormer aggregates temporally contiguous events into non-overlapping “patches” of size $P$, reducing the effective sequence length from $|\tilde{\mathcal S}_i^t|$ to $\ell_i^t = \lceil |\tilde{\mathcal S}_i^t| / P \rceil$. Each patch across the four modalities is linearly projected to a channel embedding dimension $d_{\mathrm{ch}}$, and concatenated to form an input matrix $\mathbf{X}_i^t \in \mathbb{R}^{\ell_i^t \times 4d_{\mathrm{ch}}}$. Node $j$ is processed analogously.
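A minimal sketch of the patching step for a single channel follows. The function name `make_patches` is hypothetical, and zero-padding at the front of the sequence (so that the length becomes a multiple of the patch size) is an assumption of this example, not something the text above specifies. The subsequent linear projection to the channel embedding dimension is omitted.

```python
import math
from typing import List


def make_patches(seq: List[List[float]], patch_size: int) -> List[List[float]]:
    """Group consecutive event vectors into non-overlapping patches and flatten each patch."""
    if not seq:
        return []
    dim = len(seq[0])
    num_patches = math.ceil(len(seq) / patch_size)
    pad = num_patches * patch_size - len(seq)
    # Assumption: pad with zero vectors at the front (the oldest end of the history).
    padded = [[0.0] * dim for _ in range(pad)] + seq
    return [
        [x for event in padded[k * patch_size:(k + 1) * patch_size] for x in event]
        for k in range(num_patches)
    ]


# Five events with two features each, patch size 2 -> ceil(5 / 2) = 3 patches.
seq = [[float(n), float(n) + 0.5] for n in range(5)]
patches = make_patches(seq, patch_size=2)
print(len(patches), [len(p) for p in patches])
```

Each patch holds `patch_size * dim` values, so the sequence seen by the Transformer shrinks by roughly a factor of the patch size while each token becomes wider.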

The combined patch sequence $[\mathbf{X}_i^t; \mathbf{X}_j^t]$ is passed through $L$ layers of a standard pre-norm Transformer encoder (multi-head self-attention, LayerNorm, MLP, residuals). Outputs corresponding to $i$ and $j$ are pooled, concatenated, and fed through an output MLP to predict link probabilities (Chung et al., 10 Apr 2025, Yu et al., 2023).

2. Time Encoding Modules: Sinusoidal and Linear

The time encoding $\Phi(\Delta t)$ initially followed the learnable sinusoidal encoder of TGAT, mapping time offsets into a multi-frequency cosine embedding:

$$\Phi_{\mathrm{sin}}(\Delta t) = \left[\cos(\omega_1 \Delta t + \varphi_1), \ldots, \cos(\omega_{d_T} \Delta t + \varphi_{d_T})\right]$$

with trainable frequencies $\omega_k$ and phases $\varphi_k$. The dot product of time encodings depends only on the time difference, yielding a periodic structure.
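The periodic structure can be illustrated with a small sketch of the cosine encoder. The frequencies and phases below are arbitrary illustrative values, not trained parameters or the initialization used by any particular implementation.

```python
import math
from typing import List


def sinusoidal_time_encoding(delta_t: float, omegas: List[float], phases: List[float]) -> List[float]:
    """Return [cos(w_k * delta_t + phi_k)] for each frequency/phase pair."""
    return [math.cos(w * delta_t + p) for w, p in zip(omegas, phases)]


# Illustrative (assumed) parameters with d_T = 2.
omegas = [1.0, 0.5]
phases = [0.0, 0.3]

# 4*pi is a common period of both components (2*pi/1.0 and 2*pi/0.5).
common_period = 4 * math.pi
enc_a = sinusoidal_time_encoding(3.0, omegas, phases)
enc_b = sinusoidal_time_encoding(3.0 + common_period, omegas, phases)
print(enc_a, enc_b)
```

Two quite different time offsets produce the same encoding here, which is the kind of ambiguity a low-dimensional sinusoidal encoder can introduce.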

A subsequent innovation is the linear time encoder:

$$\Phi_{\mathrm{lin}}(\Delta t) = \mathbf{W}_{\mathrm{t}} \Delta t + \mathbf{b}_{\mathrm{t}}$$

where $\mathbf{W}_{\mathrm{t}}, \mathbf{b}_{\mathrm{t}} \in \mathbb{R}^{d_T}$ are parameters and $\Delta t$ is standardized using train-set statistics. This form bypasses periodicity, reducing temporal information loss and allowing significant dimensionality and parameter savings. Empirical results demonstrate that self-attention layers can effectively compare and interpret linear time differences within the Transformer.
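A minimal sketch of the linear encoder with standardization is shown below. The helper names and the use of the population standard deviation are assumptions of this example, and the weight and bias values are arbitrary stand-ins for learned parameters.

```python
import statistics
from typing import List, Tuple


def fit_standardizer(train_deltas: List[float]) -> Tuple[float, float]:
    """Compute mean and (population) standard deviation of training time offsets."""
    return statistics.fmean(train_deltas), statistics.pstdev(train_deltas)


def linear_time_encoding(delta_t: float, mean: float, std: float,
                         weights: List[float], biases: List[float]) -> List[float]:
    """Standardize delta_t, then apply an element-wise affine map w_k * z + b_k."""
    z = (delta_t - mean) / std
    return [w * z + b for w, b in zip(weights, biases)]


train_deltas = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, std 2
mean, std = fit_standardizer(train_deltas)

# d_T = 1 with illustrative (not learned) parameters.
enc_near = linear_time_encoding(5.0, mean, std, weights=[0.5], biases=[0.1])
enc_far = linear_time_encoding(9.0, mean, std, weights=[0.5], biases=[0.1])
print(enc_near, enc_far)
```

Because the map is affine with a nonzero weight, distinct time offsets always receive distinct encodings, even with a single dimension.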

3. Encoding Schemes and Sequence Preparation

The four feature channels are constructed for each event as follows:

| Channel | Representation | Purpose |
| --- | --- | --- |
| Neighbor attributes | $\mathbf{X}_{i,V}^t$ | Node-level semantics |
| Edge attributes | $\mathbf{X}_{i,E}^t$ | Edge-specific information |
| Co-occurrence counts | $\mathbf{C}_i^t$ | Captures overlap in histories |
| Time encoding | $\mathbf{X}_{i,T}^t$ | Temporal recency |

To facilitate long-term context, DyGFormer’s patching operation segments sequences into fixed-size blocks. Each channel’s patch is then linearly projected to $d_{\mathrm{ch}}$, followed by concatenation into composite patch embeddings input to the Transformer. This structuring restricts the quadratic cost of self-attention to $(\ell_i^t + \ell_j^t)^2$ and decouples memory usage from long raw histories (Yu et al., 2023).
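As a rough worked example of the cost reduction, the number of query-key pairs in one self-attention pass can be compared with and without patching. The history lengths and patch size are made-up numbers chosen for illustration, and the function name is hypothetical.

```python
import math


def attention_pairs(len_i: int, len_j: int, patch_size: int) -> int:
    """Number of query-key pairs when both histories are patched and concatenated."""
    tokens = math.ceil(len_i / patch_size) + math.ceil(len_j / patch_size)
    return tokens * tokens


raw = attention_pairs(512, 512, patch_size=1)       # (512 + 512)^2
patched = attention_pairs(512, 512, patch_size=16)  # (32 + 32)^2
print(raw, patched, raw // patched)
```

In this idealized count the attention matrix shrinks by a factor of $P^2$; end-to-end savings are smaller because each patch token is wider and other layers do not scale quadratically.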

4. Neighbor Co-Occurrence Encoding and Modality Fusion

The Neighbor Co-occurrence Encoding (NCoE) encodes, for each neighbor in a node's sequence, the number of times it appears in the source history and in the destination history as a vector $\mathbf{c} = [n_{i}, n_{j}]$. A shared two-layer perceptron (with ReLU activations) is applied to each of the two counts individually, and the two outputs are summed, yielding a per-event co-occurrence embedding. The presence of high co-neighbor ratios correlates strongly with correct link prediction, and ablation studies reveal that removing NCoE produces the largest performance degradation among model components (Yu et al., 2023).
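The counting step of NCoE can be sketched as follows. The function name and the string node labels are illustrative, and the perceptron that turns each count into an embedding is left out.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def cooccurrence_counts(neigh_i: Sequence[str], neigh_j: Sequence[str]) -> Tuple[List[List[int]], List[List[int]]]:
    """For every neighbor in each history, return [count in i's history, count in j's history]."""
    count_i, count_j = Counter(neigh_i), Counter(neigh_j)
    c_i = [[count_i[v], count_j[v]] for v in neigh_i]
    c_j = [[count_i[v], count_j[v]] for v in neigh_j]
    return c_i, c_j


# Historical neighbors of the source i and the destination j (made-up example).
c_i, c_j = cooccurrence_counts(["a", "b", "a"], ["b", "b", "a", "c"])
print(c_i)
print(c_j)
```

Neighbor `c` gets the vector `[0, 1]` because it appears only in the destination history, while `a` and `b` receive nonzero counts on both sides, which signals shared neighbors between the two nodes.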

After modality-specific projections, patch embeddings are concatenated, forming a joint modality tensor. This fusion ensures structural, temporal, and interaction information are aligned for attention across modalities.

5. Transformer Layering, Output, and Training Objectives

The stacked patch embeddings for $i$ and $j$ are processed by an $L$-layer Transformer. Each layer computes:

$$\mathbf{Q}^{(\ell)} = \mathrm{LN}(\mathbf{Z}^{(\ell - 1)})\mathbf{W}_Q^{(\ell)},\;\; \mathbf{K}^{(\ell)} = \mathrm{LN}(\mathbf{Z}^{(\ell - 1)})\mathbf{W}_K^{(\ell)},\;\; \mathbf{V}^{(\ell)} = \mathrm{LN}(\mathbf{Z}^{(\ell - 1)})\mathbf{W}_V^{(\ell)}$$

This is followed by multi-head attention, an MLP, and residual updates. Transformer outputs corresponding to $i$ and $j$ are mean (or autoregressively) pooled, concatenated, and passed through an MLP to predict the link probability via a sigmoid activation.
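For reference, the layer update described above is usually written as below for a pre-norm Transformer. This is the generic textbook form, not a transcription from the cited papers; the per-head key dimension $d_k$ and the intermediate state $\mathbf{H}^{(\ell)}$ are notation introduced here for illustration.

```latex
\begin{aligned}
\mathrm{Attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) &= \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\mathbf{V} \\
\mathbf{H}^{(\ell)} &= \mathbf{Z}^{(\ell-1)} + \mathrm{MHA}\!\left(\mathrm{LN}(\mathbf{Z}^{(\ell-1)})\right) \\
\mathbf{Z}^{(\ell)} &= \mathbf{H}^{(\ell)} + \mathrm{MLP}\!\left(\mathrm{LN}(\mathbf{H}^{(\ell)})\right)
\end{aligned}
```

Normalization is applied before each sublayer and the residual path carries the unnormalized state, which is what distinguishes pre-norm from post-norm designs.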

For link prediction, binary cross-entropy loss is computed across positive and negative pairs, with negative examples sampled via random, historical, and inductive strategies (Yu et al., 2023). For node classification, multi-class cross-entropy is used.
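The link prediction objective can be sketched as below, assuming the model outputs one raw score (logit) per candidate pair. The function names and the example logits are illustrative, and the negative sampling strategy itself is not shown.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def link_bce(pos_logits: List[float], neg_logits: List[float], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy: positives have label 1, sampled negatives have label 0."""
    terms = [-math.log(max(sigmoid(z), eps)) for z in pos_logits]
    terms += [-math.log(max(1.0 - sigmoid(z), eps)) for z in neg_logits]
    return sum(terms) / len(terms)


loss_uninformed = link_bce([0.0], [0.0])   # every probability is 0.5
loss_confident = link_bce([4.0], [-4.0])   # positives scored high, negatives low
print(loss_uninformed, loss_confident)
```

A model that outputs probability 0.5 everywhere incurs a loss of $\ln 2$, and the loss falls as positives and negatives are separated.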

6. Empirical Performance and Computational Considerations

On benchmarks such as Wikipedia, Reddit, LastFM, Enron, UCI, and others, DyGFormer achieves leading average ranks for both transductive and inductive link prediction; for node classification, it is highly competitive. Integrating NCoE into other sequence-based baselines (e.g., the Transformer-based TCL and the MLP-based GraphMixer) yields improvements, indicating the scheme’s generality.

The linear time encoder, as a drop-in replacement for the sinusoidal encoder, achieves statistically superior or comparable average precision in most cases (19/24 model×dataset settings under random negative sampling; 18/24 under historical negative sampling), with gains of 5–15 AP points on some datasets and parameter reductions of 34–43% when reducing encoder dimensions from 100 to 1 or 2. By contrast, low-dimensional sinusoidal encodings lead to substantial performance drops (Chung et al., 10 Apr 2025).

Patching offers 2–5× runtime and memory reductions for long histories. Its efficacy grows when nodes maintain long, dense interaction trails. DyGFormer scales to datasets with tens of millions of edges, where baseline recurrent or GNN-based approaches fail to train due to memory or time constraints.

7. Architectural Hyperparameters and Ablation Analyses

Key hyperparameters include patch size ($P \in \{1, 2, \ldots, 128\}$), channel embedding dimension ($d_{\mathrm{ch}} = 50$), output dimension ($d_{\text{out}} = 172$), Transformer depth ($L = 2$), and number of attention heads ($I = 2$). For the linear encoder, time embedding dimension $d_T = 1$ suffices. Dropout rates and Adam optimizer settings are tuned per dataset.

Ablations consistently show that the largest performance drop arises from removing NCoE, followed by the elimination of time encoding or patch mixing. Empirical analysis confirms that DyGFormer corrects prior false-positive cases with high Common Neighbor Ratio (CNR), validating the design’s pairwise correlation emphasis (Yu et al., 2023).


DyGFormer constitutes a modular, scalable architecture for dynamic graph learning, with a flexible time encoder design. The replacement of the sinusoidal time encoding with a linear alternative demonstrates that temporal self-attention mechanisms remain effective, and sometimes superior, when provided simple, standardized time cues, especially under resource constraints (Chung et al., 10 Apr 2025, Yu et al., 2023).
