Temporal Graph Benchmark (TGB)
- Temporal Graph Benchmark (TGB) is a comprehensive, open-source suite for evaluating dynamic graph models on link and node prediction tasks.
- It offers large-scale real-world datasets with strict chronological splits, standardized metrics like MRR and NDCG@K, and automated pipelines for reproducibility.
- Extensible to multi-relational graphs and sequential dynamics, TGB addresses challenges such as negative sampling and model scalability, and serves as a reference benchmark for advancing dynamic graph learning.
The Temporal Graph Benchmark (TGB) is a comprehensive, open-source benchmarking suite designed to facilitate fair, realistic, and reproducible evaluation of machine learning models on temporal (dynamic) graphs. Developed in response to the lack of large, domain-diverse, and protocol-standardized resources for evolving-graph learning, TGB offers curated datasets, rigorous evaluation tasks, and an automated experimental pipeline, catalyzing progress in dynamic graph representation learning and inference.
1. Objectives, Scope, and Evolution
TGB aims to provide a unified framework for benchmarking temporal graph learning methods, particularly temporal graph neural networks (TGNNs), across edge prediction (“dynamic link property prediction,” DLPP) and node prediction (“dynamic node property prediction,” DNPP) tasks (Huang et al., 2023, Yu, 2023). The benchmark addresses critical gaps in prior work (insufficient scale, domain homogeneity, inconsistent splits, and overoptimistic protocols) by introducing:
- Large-scale, real-world datasets from varied domains (social, transactional, transportation, interaction).
- Strict temporal splits for train/validation/test to avoid future leakage.
- Standardized evaluation metrics and negative sampling.
- Automated pipelines for loading, experimentation, and reproducibility.
Subsequent TGB releases have extended this foundation: TGB 2.0 adds multi-relational heterogeneous graphs for temporal knowledge graph link extrapolation, while TGB-Seq challenges sequence modeling by suppressing edge repetition and emphasizing sequential dynamics (Gastinger et al., 2024, Yi et al., 5 Feb 2025).
2. Dataset Suite, Task Formulation, and Protocols
TGB data are represented as chronological streams of timestamped edges, optionally with edge/node features. The suite initially comprised nine large datasets, split into tasks as follows (Huang et al., 2023, Yu, 2023, Huang et al., 2023):
| Dataset | Nodes | Edges | Domain | Task(s) | Time span, split |
|---|---|---|---|---|---|
| tgbl-wiki | 9,227 | 157,474 | Wikipedia edits | DLPP | Monthly, 70/15/15 |
| tgbl-review | 352K | 4.87M | Amazon/Electronics | DLPP | 21y, 70/15/15 |
| tgbl-coin | 638K | 22.8M | Crypto-transfers | DLPP | 8mo, 70/15/15 |
| tgbl-comment | 995K | 44.3M | Reddit replies | DLPP | 5y, 70/15/15 |
| tgbl-flight | 18K | 67.1M | Airline bookings | DLPP | 3y, 70/15/15 |
| tgbn-trade | 255 | 468,245 | UN trade flows | DNPP | 32y, 70/15/15 |
| tgbn-genre | 1,505 | 17.8M | User-genre | DNPP | 1mo, 70/15/15 |
| tgbn-reddit | 11.8K | 27.2M | User-subreddit | DNPP | 14y, 70/15/15 |
| tgbn-token | 61.8K | 72.9M | ERC-20 tokens | DNPP | 1wk, 70/15/15 |
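The dynamic link property prediction task is formalized as follows: given a stream of timestamped events $\{(u_i, v_i, t_i)\}_{i=1}^{n}$, predict, for each source $u$ at time $t$, which of several candidate targets (the true destination plus a pool of sampled negatives) forms the true next edge. Evaluation is by Mean Reciprocal Rank (MRR), computed as:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}$$

where $Q$ is the set of evaluation queries and $\mathrm{rank}_q$ is the position of the true edge for query $q$ among its candidates (Huang et al., 2023).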
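The MRR computation described above can be sketched in a few lines. This is a minimal illustration and not the TGB evaluator: the function names are hypothetical, and the pessimistic tie-breaking rule is an assumption of this sketch.

```python
from typing import Sequence


def reciprocal_rank(pos_score: float, neg_scores: Sequence[float]) -> float:
    """Reciprocal rank of the true edge among its negative candidates.

    Assumption: ties are broken pessimistically, so a negative whose score
    equals the positive score is counted as ranked above it.
    """
    rank = 1 + sum(1 for s in neg_scores if s >= pos_score)
    return 1.0 / rank


def mean_reciprocal_rank(pos_scores, neg_scores_per_query) -> float:
    """Average the reciprocal ranks over all queries."""
    rr = [reciprocal_rank(p, n) for p, n in zip(pos_scores, neg_scores_per_query)]
    return sum(rr) / len(rr)


# Three toy queries whose true edges land at ranks 1, 2, and 4.
pos = [0.9, 0.5, 0.1]
negs = [[0.2, 0.3, 0.1], [0.7, 0.4, 0.1], [0.5, 0.6, 0.2]]
mrr = mean_reciprocal_rank(pos, negs)
```

With ranks 1, 2, and 4, the result is the mean of 1, 1/2, and 1/4.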
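The dynamic node property prediction task requires, for each node-time pair $(v, t)$, prediction of a real-valued property (e.g., trade volume, subscriber count) of the node; evaluation is by Normalized Discounted Cumulative Gain (NDCG@K), defined as:

$$\mathrm{NDCG@}K = \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K}, \qquad \mathrm{DCG@}K = \sum_{i=1}^{K} \frac{\mathrm{rel}_i}{\log_2(i+1)}$$

where $\mathrm{DCG@}K$ measures gain for the $K$ top-ranked predictions, $\mathrm{rel}_i$ is the ground-truth value of the item ranked $i$-th, and $\mathrm{IDCG@}K$ is the DCG of the ideal ordering (Yu, 2023).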
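A small sketch of NDCG@K follows, using the standard linear-gain definition with a base-2 logarithmic discount. It assumes each query's ground truth is a vector of non-negative values over candidate items; the function names are illustrative and not part of any TGB API.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k entries (positions 1..k)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(y_true, y_pred, k):
    """NDCG@K: DCG of the predicted ordering divided by DCG of the ideal one."""
    # Order item indices by predicted score, highest first.
    order = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    dcg = dcg_at_k([y_true[i] for i in order], k)
    ideal = dcg_at_k(sorted(y_true, reverse=True), k)
    return dcg / ideal if ideal > 0 else 0.0


# Toy query: true values over four items and an imperfect prediction.
y_true = [0.6, 0.3, 0.1, 0.0]
y_pred = [0.5, 0.1, 0.4, 0.0]
score = ndcg_at_k(y_true, y_pred, k=2)
```

Here the prediction ranks the third item above the second, so the score falls below 1; a prediction that reproduces the true ordering scores exactly 1.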
All splits are strictly chronological: first 70% train, next 15% validation, final 15% test, forbidding any test-label leakage.
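One way to realize such a strictly chronological 70/15/15 split is sketched below. The helper name and the choice to cut on timestamps (so that events sharing a timestamp never straddle two splits) are assumptions of this sketch and not a documented TGB rule.

```python
def chronological_split(events, train_pct=70, val_pct=15):
    """Split (src, dst, t) events into train/val/test by timestamp.

    Integer percentages avoid floating-point rounding at the cut points.
    """
    ordered = sorted(events, key=lambda e: e[2])
    n = len(ordered)
    if n == 0:
        return [], [], []
    # Timestamps of the last events that fall inside the train and val quotas.
    t_train = ordered[max(n * train_pct // 100, 1) - 1][2]
    t_val = ordered[max(n * (train_pct + val_pct) // 100, 1) - 1][2]
    train = [e for e in ordered if e[2] <= t_train]
    val = [e for e in ordered if t_train < e[2] <= t_val]
    test = [e for e in ordered if e[2] > t_val]
    return train, val, test


# Twenty toy events with timestamps 0..19.
events = [(i % 5, (i + 1) % 5, t) for i, t in enumerate(range(20))]
train, val, test = chronological_split(events)
```

Because every validation timestamp is later than every training timestamp, and every test timestamp later still, no future information reaches the model during training.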
TGB 2.0 generalizes this to multi-relational temporal knowledge graphs and heterogeneous graphs, in which each timestamped edge additionally carries a relation type, and adapts sampling and metric filtration for edge-type and node-type heterogeneity (Gastinger et al., 2024).
3. Benchmarking Infrastructure and Methodological Standards
TGB is distributed as a unified Python package (“py-tgb”), providing:
- Data loader: Efficient downloading, preprocessing, and conversion to standard PyTorch Geometric or numpy formats.
- Streamed negative sampling: Supports historical and random negatives for filtered-MRR evaluation; inductive scenario handling.
- Consistent trainer and evaluator: Modular training/evaluation loop for all model types, with in-batch vectorized MRR/NDCG computation.
- Config-driven experiment management: Supports YAML/dict-based hyperparameter search, random seed control, and reproducibility.
- Public leaderboard: Standardized result reporting and online comparison.
The DyGLib_TGB fork further standardizes architectural primitives—neighbor sampling, early stopping, and time encoding—enabling exactly comparable cross-method evaluations (Yu, 2023). This addresses substantial “implementation drift” that previously weakened baseline comparisons.
4. Algorithms and Empirical Insights
TGB benchmarks a broad array of models, including memory-based, attention-based, hybrid, and non-parametric heuristics (Yu, 2023). Representative methods and their distinguishing mechanisms include:
- JODIE: Node-specific recurrent memory, updated at each interaction.
- DyRep: Joint self-attention plus temporal point process updates.
- TGAT, TCL: Temporal self-attention and contrastive learning with sinusoidal time encoding.
- TGN: GRU-updated memory plus small GNN over temporal neighbors.
- CAWN: Causal Anonymous Walks aggregated by attention over position and scaled time.
- GraphMixer, DyGFormer: MLP-Mixer or Transformer blocks over neighbor sequences.
- EdgeBank: Memorizes most recent u–v interaction timestamp (“non-parametric”).
- Persistent Forecast, Moving Average: Baselines that predict the last or moving-average observed value for each node.
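The EdgeBank entry in the list above is simple enough to sketch directly. The class below is an illustrative reimplementation and not the reference code: the optional `window` parameter and the binary 0/1 score are assumptions of this sketch.

```python
class EdgeBank:
    """Non-parametric link predictor that memorizes observed (u, v) pairs.

    A candidate edge scores 1.0 if it was seen before (optionally within a
    time window), and 0.0 otherwise. Edges are treated as directed.
    """

    def __init__(self, window=None):
        self.window = window   # None means unlimited memory
        self.last_seen = {}    # (u, v) -> most recent interaction timestamp

    def update(self, u, v, t):
        self.last_seen[(u, v)] = t

    def predict(self, u, v, t):
        last = self.last_seen.get((u, v))
        if last is None:
            return 0.0
        if self.window is not None and t - last > self.window:
            return 0.0
        return 1.0


bank = EdgeBank(window=10)
bank.update(1, 2, t=0)
recent = bank.predict(1, 2, t=5)     # seen 5 time units ago, inside the window
stale = bank.predict(1, 2, t=20)     # seen 20 time units ago, outside the window
unseen = bank.predict(2, 1, t=5)     # the reverse direction was never observed
```

Since the model has no trainable parameters, its score reflects only edge recurrence, which is what makes it a useful floor for learned methods.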
Experimental results consistently show strong model-dataset interaction. Models with deep temporal processing (TCL, DyGFormer) excel on dense and strongly temporal datasets; walk- and memory-based models (CAWN, TGN) outperform on sparse graphs. Notably, on dynamic node property prediction, trivial non-parametric models (persistence, moving average) often exceed complex TGNNs, signaling insufficient focus on node-centric regression design.
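To make concrete how little machinery the persistence and moving-average baselines need, here is a minimal sketch. It assumes a scalar target per node for brevity (a vector-valued target would be handled element-wise), and the class names and `window` parameter are illustrative.

```python
from collections import defaultdict, deque


class PersistentForecast:
    """Predict the most recently observed value for each node."""

    def __init__(self):
        self.last = {}

    def update(self, node, value):
        self.last[node] = value

    def predict(self, node, default=0.0):
        return self.last.get(node, default)


class MovingAverage:
    """Predict the mean of the last `window` observed values for each node."""

    def __init__(self, window=3):
        self.history = defaultdict(lambda: deque(maxlen=window))

    def update(self, node, value):
        self.history[node].append(value)

    def predict(self, node, default=0.0):
        values = self.history.get(node)  # .get avoids creating empty entries
        return sum(values) / len(values) if values else default


pf, ma = PersistentForecast(), MovingAverage(window=3)
for value in [1.0, 2.0, 3.0, 4.0]:
    pf.update("a", value)
    ma.update("a", value)

last_value = pf.predict("a")    # most recent observation
mean_value = ma.predict("a")    # mean of the last three observations
```

Both baselines keep only per-node state and require no training, which is why their strong results are a meaningful signal about the task.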
5. Protocols, Evaluation Metrics, and Model Pathologies
TGB mandates evaluation using ranking-based metrics directly aligned with end-application decisions, such as filtered MRR (for link prediction) and NDCG@K (for node affinity) (Huang et al., 2023). Negative sampling protocols are tightly controlled, with historical (previously observed) and random negatives mixed; splits ensure strict chronological partitioning.
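The mix of historical and random negatives can be illustrated as follows. This is a sketch under stated assumptions and not TGB's sampler: the function name, the `hist_ratio` parameter, and the per-call seeding are all hypothetical choices made for reproducibility of the example.

```python
import random


def sample_negatives(src, true_dst, history, all_nodes, num_neg,
                     hist_ratio=0.5, seed=0):
    """Draw negative destinations for one (src, true_dst) query.

    history: dict mapping each source to the set of destinations it has
    interacted with in the past. Up to hist_ratio * num_neg negatives come
    from that set; the rest are drawn uniformly from the remaining nodes.
    """
    rng = random.Random(seed)
    # Historical negatives: past partners of src, excluding the true target.
    hist_pool = sorted(history.get(src, set()) - {true_dst})
    n_hist = min(int(num_neg * hist_ratio), len(hist_pool))
    negatives = rng.sample(hist_pool, n_hist)
    # Random negatives: any node not already chosen and not the true target.
    chosen = set(negatives) | {true_dst}
    rand_pool = [v for v in all_nodes if v not in chosen]
    n_rand = min(num_neg - n_hist, len(rand_pool))
    negatives += rng.sample(rand_pool, n_rand)
    return negatives


history = {0: {1, 2, 3}}
negs = sample_negatives(src=0, true_dst=1, history=history,
                        all_nodes=range(10), num_neg=4)
```

Fixing the seed makes the negative set reproducible across runs, which matters because ranking metrics are sensitive to which negatives are drawn.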
Critical work has identified that standard negative sampling inflates MRR and enables pathological output saturation (all popular nodes ranked identically). Proposed remedies include a stricter MRR variant that fully ranks the true target among the most recently popular targets, and “Recently Popular Negative Sampling” (RP-NS) (Daniluk et al., 2023). These improved protocols reveal that “recently popular nodes” baselines (PopTrack: exponential smoothing of node frequencies) can outperform sophisticated models when global temporal dynamics (e.g., mode shifts, social trends) dominate.
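A popularity baseline based on exponential smoothing of node frequencies can be sketched as below. This is an illustration of the idea only and is not claimed to match the published PopTrack procedure; the class name, the per-batch decay schedule, and the `decay` value are assumptions.

```python
class PopularityTracker:
    """Score destinations by an exponentially smoothed interaction count."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.score = {}   # destination node -> smoothed popularity

    def update_batch(self, destinations):
        # Fade all existing scores, then credit the newly observed destinations.
        for node in self.score:
            self.score[node] *= self.decay
        for dst in destinations:
            self.score[dst] = self.score.get(dst, 0.0) + 1.0

    def predict(self, candidate):
        # The score ignores the source node entirely: it is a global signal.
        return self.score.get(candidate, 0.0)

    def top_k(self, k):
        return sorted(self.score, key=self.score.get, reverse=True)[:k]


tracker = PopularityTracker(decay=0.5)
tracker.update_batch([1, 1, 2])   # node 1 is popular early on
tracker.update_batch([2, 3])      # popularity then shifts toward node 2
most_popular = tracker.top_k(1)
```

After the second batch, node 1's score has decayed while node 2's has grown, so the ranking follows the most recent trend.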
Additionally, recent theoretical developments prove that standard permutation-invariant message passing architectures such as TGN cannot recover pairwise statistics required for persistent forecasting or moving-average baselines. This motivates augmentations (TGNv2) that include source–target identification in messages, closing the expressivity gap and substantially improving node-affinity prediction (Tjandra et al., 2024).
6. Limitations, Extensions, and Future Directions
Several extensions underscore the dynamic evolution of TGB:
- TGB-Seq (Yi et al., 5 Feb 2025): Focuses on sequential dynamics, suppressing repeated edges to force models to learn multi-step, higher-order interaction sequences. Most extant TGNNs exhibit dramatic performance degradation in this regime, revealing an overreliance on edge repetition memorization and 1-hop aggregation.
- TGB 2.0 (Gastinger et al., 2024): Addresses the multi-relational and heterogeneous graph regime at scale. Most state-of-the-art methods fail to scale to 10M+ nodes/edges or to utilize edge-type semantics effectively; simple heuristics (EdgeBank, RecB) remain competitive on high-recurrence subgraphs.
- BenchTemp and DynBenchmark (Huang et al., 2023, Brisson et al., 3 Oct 2025): Offer further standardization for efficiency, inductive regime stress-testing, node/event classification tasks, and community detection tracking.
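The reliance on edge repetition noted for TGB-Seq above can be quantified with a simple diagnostic: the fraction of events whose (source, destination) pair already occurred earlier in the stream. The function below is a hypothetical helper for illustration and is not taken from any of the listed benchmarks.

```python
def repeat_ratio(events):
    """Fraction of (src, dst, t) events whose (src, dst) pair was seen before.

    Events are assumed to be given in chronological order.
    """
    seen = set()
    repeats = 0
    for src, dst, _t in events:
        if (src, dst) in seen:
            repeats += 1
        seen.add((src, dst))
    return repeats / len(events) if events else 0.0


stream = [(1, 2, 0), (1, 2, 1), (2, 3, 2), (1, 2, 3)]
ratio = repeat_ratio(stream)
```

A high ratio indicates that memorization heuristics can score well on the stream; a ratio near zero means a model has to generalize to unseen pairs.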
Persistent challenges include scalable multi-relational modeling, negative sampling for rare-event and low-recurrence relations, and modeling of fine-grained sequential intention. The empirical dominance of simple heuristics on many TGB tasks, as well as the exposure of architectural limitations (e.g., expressivity barriers and oversquashing), suggests that temporal graph learning remains an open and rapidly evolving field.
7. Software, Resources, and Community Practices
All code, datasets, and public leaderboards for TGB and its extensions are available at the canonical repositories and their associated documentation pages:
- TGB: https://tgb.complexdatalab.com/, PyPI package py-tgb, GitHub repo (Huang et al., 2023).
- DyGLib_TGB: https://github.com/yule-BUAA/DyGLib_TGB (benchmarks, implementations, configuration files) (Yu, 2023).
- TGB 2.0: https://github.com/JuliaGast/TGB2, dataset scripts, negative samples, and evaluation protocols (Gastinger et al., 2024).
- TGB-Seq: https://tgb-seq.github.io (datasets, code, leaderboards) (Yi et al., 5 Feb 2025).
- BenchTemp: https://github.com/qianghuangwhu/benchtemp (modular evaluation pipeline) (Huang et al., 2023).
Adherence to TGB protocols, chronological splitting, recorded negative sampling seeds, and transparent leaderboard tracking are emerging as best practices for dynamic graph learning assessment. The open-source and community-driven nature of TGB ensures ongoing updates, expansion to new domains, and rapid dissemination of benchmark-driven insights.