DyGFormer Architecture
- DyGFormer is a transformer-based architecture that integrates a neighbor co-occurrence encoding scheme and a patching mechanism to capture temporal interactions and reduce sequence lengths.
- It supports both sinusoidal and linear time encoders for modeling temporal recency; the linear encoder matches or improves performance in most settings while reducing parameters.
- Empirical evaluations show leading performance in dynamic link prediction and competitive results in node classification, with runtime and memory reductions from patching on large-scale datasets.
DyGFormer is a Transformer-based deep learning architecture developed for continuous-time dynamic graph learning, particularly targeting dynamic link prediction and node classification in temporal networks. The architecture builds upon two innovations: a neighbor co-occurrence encoding (NCoE) scheme that captures source-destination node correlations and a patching mechanism that enables scalable modeling of long interaction histories. DyGFormer achieves state-of-the-art performance on several benchmarks, while a recently proposed linear time encoder simplifies the model and improves efficiency and accuracy in most scenarios (Yu et al., 2023, Chung et al., 10 Apr 2025).
1. Model Architecture and Workflow
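DyGFormer processes a query edge event $(u, v, t)$ by extracting the first-hop temporal interaction histories of nodes $u$ and $v$ up to time $t$:
$$S_u^t = \{(u, u', t') \mid t' < t\} \cup \{(u', u, t') \mid t' < t\},$$
with $S_v^t$ defined analogously.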
Each event in the sequence is encoded along four parallel “channels”:
- Neighbor-node attributes $X_{u,V}^t$
- Edge attributes $X_{u,E}^t$
- Co-occurrence counts $C_u^t$, representing how often each neighbor appears in the histories of both $u$ and $v$
- Time encoding $X_{u,T}^t$ (dimension $d_T$), representing recency
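The history extraction described at the start of this section can be sketched in a few lines. This is a minimal illustration, assuming events are stored as `(source, destination, timestamp)` tuples; the function name `first_hop_history` and the toy event list are hypothetical and not taken from the original implementation.

```python
from typing import List, Tuple

Event = Tuple[int, int, float]  # (source, destination, timestamp); assumed layout


def first_hop_history(events: List[Event], node: int, t: float) -> List[Tuple[int, float]]:
    """Return (neighbor, timestamp) pairs for interactions of `node` strictly before time t."""
    history = []
    for src, dst, ts in events:
        if ts >= t:
            continue  # keep only interactions that happened before the query time
        if src == node:
            history.append((dst, ts))
        elif dst == node:
            history.append((src, ts))
    history.sort(key=lambda pair: pair[1])  # chronological order
    return history


# Toy data (hypothetical): five interactions among nodes 1, 2 and 3.
events = [(1, 2, 0.5), (3, 1, 1.0), (2, 3, 1.5), (1, 3, 2.0), (1, 2, 3.0)]
hist_u = first_hop_history(events, node=1, t=3.0)
print(hist_u)  # [(2, 0.5), (3, 1.0), (3, 2.0)]
```

The event at time 3.0 is excluded because only interactions strictly before the query time are part of the history.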
To improve scalability, DyGFormer aggregates temporally contiguous events into non-overlapping “patches” of size $P$, reducing the effective sequence length from $|S_u^t|$ to $\lceil |S_u^t| / P \rceil$. Each patch in each of the four channels is linearly projected to a channel embedding dimension $d$, and the four projections are concatenated to form an input matrix $Z_u^t \in \mathbb{R}^{\lceil |S_u^t| / P \rceil \times 4d}$. Node $v$ is processed analogously.
The combined patch sequence is passed through $L$ layers of a standard pre-norm Transformer encoder (multi-head self-attention, LayerNorm, MLP, residuals). Outputs corresponding to $u$ and $v$ are pooled, concatenated, and fed through an output MLP to predict link probabilities (Chung et al., 10 Apr 2025, Yu et al., 2023).
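A rough sketch of the patching step for a single channel follows. It assumes zero-padding at the end of the sequence when the length is not a multiple of the patch size; the padding side and the helper name `make_patches` are assumptions made for illustration.

```python
import math
from typing import List


def make_patches(seq: List[List[float]], patch_size: int) -> List[List[float]]:
    """Group consecutive feature vectors into non-overlapping patches and flatten each patch."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    if not seq:
        return []
    dim = len(seq[0])
    num_patches = math.ceil(len(seq) / patch_size)
    pad = num_patches * patch_size - len(seq)
    padded = seq + [[0.0] * dim for _ in range(pad)]  # assumed: zero-pad at the end
    patches = []
    for i in range(num_patches):
        chunk = padded[i * patch_size:(i + 1) * patch_size]
        patches.append([x for vec in chunk for x in vec])  # flatten patch_size x dim values
    return patches


# Five events with 2 features each, patch size 2 -> 3 patches of 4 values.
seq = [[float(i), i + 0.5] for i in range(5)]
patches = make_patches(seq, patch_size=2)
print(len(patches), len(patches[0]))  # 3 4
```

Each flattened patch would then be passed through a per-channel linear projection, so the Transformer sees 3 tokens instead of 5 in this toy case.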
2. Time Encoding Modules: Sinusoidal and Linear
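The time encoding initially followed the learnable sinusoidal encoder of TGAT, mapping time offsets $\Delta t = t - t'$ into a multi-frequency cosine embedding:
$$z(\Delta t) = \sqrt{\tfrac{1}{d_T}} \, \big[\cos(\omega_1 \Delta t + \varphi_1), \ldots, \cos(\omega_{d_T} \Delta t + \varphi_{d_T})\big],$$
with trainable frequencies $\omega_i$ and phases $\varphi_i$. In TGAT's construction, the dot product of two time encodings is designed to depend only on their time difference, which gives the encoding a periodic structure.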
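The sinusoidal encoder can be illustrated with a small sketch. The frequencies and phases below are fixed, hypothetical values standing in for trainable parameters, and the $\sqrt{1/d_T}$ scaling is an assumption about the exact normalization.

```python
import math
from typing import List


def sinusoidal_time_encoding(dt: float, freqs: List[float], phases: List[float]) -> List[float]:
    """Map a time offset to scaled cosines, one per (frequency, phase) pair."""
    scale = math.sqrt(1.0 / len(freqs))  # assumed normalization
    return [scale * math.cos(w * dt + p) for w, p in zip(freqs, phases)]


# Hypothetical parameter values; in the model these would be learned.
freqs = [1.0, 0.1, 0.01, 0.001]
phases = [0.0, 0.0, 0.0, 0.0]

enc_now = sinusoidal_time_encoding(0.0, freqs, phases)
enc_a = sinusoidal_time_encoding(1.0, freqs, phases)
enc_b = sinusoidal_time_encoding(1.0 + 2 * math.pi, freqs, phases)

# The component with frequency 1.0 cannot distinguish offsets that differ by 2*pi.
print(abs(enc_a[0] - enc_b[0]) < 1e-9)  # True
```

The last line shows the periodicity in a single component: two different offsets receive the same value, which is one way temporal information can be lost when few dimensions are used.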
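A subsequent innovation is the linear time encoder:
$$z(\Delta t) = \mathbf{w} \, \widetilde{\Delta t} + \mathbf{b}, \qquad \widetilde{\Delta t} = \frac{\Delta t - \mu}{\sigma},$$
where $\mathbf{w}, \mathbf{b} \in \mathbb{R}^{d_T}$ are parameters and $\Delta t$ is standardized using train-set statistics (mean $\mu$ and standard deviation $\sigma$). This form is not periodic, which reduces temporal information loss and allows significant dimensionality and parameter savings. Empirical results indicate that self-attention layers can effectively compare and interpret linear time differences within the Transformer.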
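A minimal sketch of a linear time encoder with train-set standardization is shown below. The weight and bias values are hypothetical placeholders for learned parameters, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_standardizer(train_dts: List[float]) -> Tuple[float, float]:
    """Compute mean and standard deviation of time offsets on the training set."""
    mu = mean(train_dts)
    sigma = pstdev(train_dts) or 1.0  # guard against a zero standard deviation
    return mu, sigma


def linear_time_encoding(dt: float, mu: float, sigma: float,
                         weights: List[float], biases: List[float]) -> List[float]:
    """Standardize the offset, then apply one affine map per output dimension."""
    z = (dt - mu) / sigma
    return [w * z + b for w, b in zip(weights, biases)]


train_dts = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy training offsets
mu, sigma = fit_standardizer(train_dts)
weights, biases = [0.7], [0.1]          # hypothetical learned values, d_T = 1

enc_small = linear_time_encoding(1.0, mu, sigma, weights, biases)
enc_mean = linear_time_encoding(3.0, mu, sigma, weights, biases)
enc_large = linear_time_encoding(5.0, mu, sigma, weights, biases)
print(enc_small[0] < enc_mean[0] < enc_large[0])  # True
```

Unlike the cosine form, the output is monotone in the offset, so distinct offsets always map to distinct values even with a single dimension.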
3. Encoding Schemes and Sequence Preparation
The four feature channels are constructed for each event as follows:
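| Channel | Representation | Purpose |
|---|---|---|
| Neighbor attributes | $X_{u,V}^t \in \mathbb{R}^{\lvert S_u^t \rvert \times d_V}$ | Node-level semantics |
| Edge attributes | $X_{u,E}^t \in \mathbb{R}^{\lvert S_u^t \rvert \times d_E}$ | Edge-specific information |
| Co-occurrence counts | $C_u^t \in \mathbb{R}^{\lvert S_u^t \rvert \times 2}$ | Captures overlap in histories |
| Time encoding | $X_{u,T}^t \in \mathbb{R}^{\lvert S_u^t \rvert \times d_T}$ | Temporal recency |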
To facilitate long-term context, DyGFormer’s patching operation segments sequences into fixed-size blocks of $P$ events. Each channel’s patch is then linearly projected to $\mathbb{R}^{d}$, followed by concatenation into composite patch embeddings that are input to the Transformer. This structuring restricts the quadratic cost of self-attention to $O\big((|S_u^t| / P)^2\big)$ rather than $O(|S_u^t|^2)$, and decouples memory usage from long raw histories (Yu et al., 2023).
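The effect of patching on attention cost can be checked with simple arithmetic. The history length of 512 and patch size of 16 below are hypothetical, chosen only to make the numbers concrete; the function counts attention score pairs and ignores projection and MLP costs.

```python
import math


def attention_pairs(history_len: int, patch_size: int = 1) -> int:
    """Number of query-key pairs scored by full self-attention over the patched sequence."""
    tokens = math.ceil(history_len / patch_size)
    return tokens * tokens


raw = attention_pairs(512)          # no patching: 512 tokens
patched = attention_pairs(512, 16)  # 32 tokens after patching
print(raw, patched, raw // patched)  # 262144 1024 256
```

The reduction in score pairs scales with the square of the patch size. End-to-end savings are smaller than this ratio, since each patch token carries more input features and the rest of the network is unchanged.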
4. Neighbor Co-Occurrence Encoding and Modality Fusion
The Neighbor Co-occurrence Encoding (NCoE) encodes, for each neighbor in a node's sequence, counts of its appearance in both the source and destination histories as a two-element row of $C_u^t \in \mathbb{R}^{|S_u^t| \times 2}$. A shared two-layer perceptron (with ReLU activations) is applied to each of the two counts individually and the results are summed, yielding a per-event co-occurrence embedding. High co-neighbor ratios correlate strongly with correct link prediction, and ablation studies reveal that removing NCoE produces the largest performance degradation among model components (Yu et al., 2023).
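The counting and embedding steps of NCoE can be sketched as follows. The neighbor sequences are a toy example, the MLP weights are hypothetical placeholders for learned parameters, and the helper names are introduced here for illustration only.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def cooccurrence_counts(neigh_u: Sequence[str], neigh_v: Sequence[str]):
    """For every neighbor in each sequence, count its occurrences in u's and v's histories."""
    freq_u, freq_v = Counter(neigh_u), Counter(neigh_v)
    c_u = [(freq_u[n], freq_v[n]) for n in neigh_u]
    c_v = [(freq_u[n], freq_v[n]) for n in neigh_v]
    return c_u, c_v


def mlp(count: float, w1: List[float], b1: List[float],
        w2: List[List[float]], b2: List[float]) -> List[float]:
    """Two-layer perceptron with ReLU applied to a single scalar count."""
    hidden = [max(0.0, w * count + b) for w, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


def ncoe_embedding(pair: Tuple[int, int], params) -> List[float]:
    """Apply the shared MLP to each count and sum the two outputs."""
    from_u = mlp(pair[0], *params)
    from_v = mlp(pair[1], *params)
    return [x + y for x, y in zip(from_u, from_v)]


c_u, c_v = cooccurrence_counts(["a", "b", "a"], ["b", "b", "a", "c"])
print(c_u)  # [(2, 1), (1, 2), (2, 1)]
print(c_v)  # [(1, 2), (1, 2), (2, 1), (0, 1)]

# Hypothetical weights: 2 hidden units, 2 output dimensions.
params = ([0.5, -0.3], [0.0, 1.0], [[1.0, 0.2], [0.4, -1.0]], [0.0, 0.1])
emb = ncoe_embedding(c_u[0], params)
```

Because the same MLP is applied to both counts and the outputs are added, the embedding of `(2, 1)` equals that of `(1, 2)`; the encoding captures how strongly a neighbor is shared, not which side it came from.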
After modality-specific projections, patch embeddings are concatenated, forming a joint modality tensor. This fusion ensures structural, temporal, and interaction information are aligned for attention across modalities.
5. Transformer Layering, Output, and Training Objectives
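The stacked patch embeddings for $u$ and $v$ are processed by an $L$-layer Transformer. With $H^{(0)}$ denoting the stacked input, each pre-norm layer applies multi-head self-attention (MHA) and an MLP, each with a residual connection:
$$\tilde{H}^{(l)} = H^{(l-1)} + \mathrm{MHA}\big(\mathrm{LN}(H^{(l-1)})\big), \qquad H^{(l)} = \tilde{H}^{(l)} + \mathrm{MLP}\big(\mathrm{LN}(\tilde{H}^{(l)})\big).$$
Transformer outputs corresponding to $u$ and $v$ are mean (or autoregressively) pooled, concatenated, and passed through an MLP to predict the link probability via a sigmoid activation.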
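The pooling and prediction step can be sketched as below. For brevity, a single linear layer stands in for the output MLP, which is a simplification; the weights and the toy Transformer outputs are hypothetical.

```python
import math
from typing import List


def mean_pool(rows: List[List[float]]) -> List[float]:
    """Average the patch-level outputs of one node into a single vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def link_probability(out_u: List[List[float]], out_v: List[List[float]],
                     weights: List[float], bias: float) -> float:
    """Pool each node's outputs, concatenate, and score with a linear layer plus sigmoid."""
    pooled = mean_pool(out_u) + mean_pool(out_v)  # list concatenation
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return sigmoid(logit)


out_u = [[1.0, 2.0], [3.0, 4.0]]  # two patches for u, hidden size 2 (toy values)
out_v = [[0.0, 1.0]]              # one patch for v
prob = link_probability(out_u, out_v, weights=[0.5, -0.5, 1.0, 0.25], bias=0.0)
print(0.0 < prob < 1.0)  # True
```

Mean pooling makes the prediction independent of how many patches each node has, so nodes with short and long histories produce vectors of the same size.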
For link prediction, binary cross-entropy loss is computed across positive and negative pairs, with negative examples sampled via random, historical, and inductive strategies (Yu et al., 2023). For node classification, multi-class cross-entropy is used.
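A minimal sketch of the link prediction loss with random negative sampling follows. Only the random strategy is shown; the function names, the candidate node set, and the probability values are hypothetical.

```python
import math
import random
from typing import List, Sequence


def sample_random_negatives(num: int, candidates: Sequence[int],
                            rng: random.Random) -> List[int]:
    """Draw destination nodes uniformly at random to form negative pairs."""
    return [rng.choice(candidates) for _ in range(num)]


def bce_loss(pos_probs: List[float], neg_probs: List[float], eps: float = 1e-12) -> float:
    """Binary cross-entropy averaged over positive (label 1) and negative (label 0) pairs."""
    terms = [-math.log(max(p, eps)) for p in pos_probs]
    terms += [-math.log(max(1.0 - p, eps)) for p in neg_probs]
    return sum(terms) / len(terms)


rng = random.Random(0)
negatives = sample_random_negatives(3, candidates=[10, 11, 12, 13], rng=rng)

uninformed = bce_loss([0.5], [0.5])            # model outputs 0.5 everywhere
confident = bce_loss([0.9, 0.8], [0.1, 0.2])   # model separates the two classes
print(confident < uninformed)  # True
```

An uninformed model that outputs 0.5 for every pair incurs a loss of $\ln 2$, which serves as a simple reference point when monitoring training.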
6. Empirical Performance and Computational Considerations
On benchmarks such as Wikipedia, Reddit, LastFM, Enron, UCI, and others, DyGFormer achieves leading average ranks for both transductive and inductive link prediction; for node classification, it is highly competitive. Integrating NCoE components into alternative Transformer baselines (e.g., TCL, GraphMixer) yields improvements, indicating the scheme’s generality.
The linear time encoder, as a drop-in replacement for the sinusoidal encoder, achieves statistically superior or comparable average precision in most cases (19/24 model×dataset settings under random negative sampling; 18/24 under historical negative sampling), with gains of 5–15 AP points on some datasets and parameter reductions of 34–43% when reducing encoder dimensions from 100 to 1 or 2. By contrast, low-dimensional sinusoidal encodings lead to substantial performance drops (Chung et al., 10 Apr 2025).
Patching offers 2–5× runtime and memory reductions for long histories. Its efficacy grows when nodes maintain long, dense interaction trails. DyGFormer scales to datasets with tens of millions of edges, where baseline recurrent or GNN-based approaches fail to train due to memory or time constraints.
7. Architectural Hyperparameters and Ablation Analyses
Key hyperparameters include the patch size $P$, channel embedding dimension $d$, output dimension $d_{out}$, Transformer depth $L$, and number of attention heads. For the linear encoder, a time embedding dimension of 1 or 2 suffices, compared with 100 for the sinusoidal encoder. Dropout rates and Adam optimizer settings are tuned per dataset.
Ablations consistently show that the largest performance drop arises from removing NCoE, followed by the elimination of time encoding or patch mixing. Empirical analysis confirms that DyGFormer corrects prior false-positive cases with high Common Neighbor Ratio (CNR), validating the design’s pairwise correlation emphasis (Yu et al., 2023).
DyGFormer constitutes a modular, scalable architecture for dynamic graph learning, with a flexible time encoder design. The replacement of the sinusoidal time encoding with a linear alternative demonstrates that temporal self-attention mechanisms remain effective, and sometimes superior, when provided simple, standardized time cues, especially under resource constraints (Chung et al., 10 Apr 2025, Yu et al., 2023).