Temporal Graph Benchmark 2.0 (TGB 2.0)
- Temporal Graph Benchmark 2.0 (TGB 2.0) is a comprehensive framework that provides eight novel, large-scale datasets spanning five domains for temporal link prediction.
- It mitigates longstanding bottlenecks like dataset scarcity, experimental reproducibility, and scalability by offering unified evaluation protocols and curated baselines.
- Empirical insights indicate that while deep models struggle on massive datasets, simple heuristic methods often deliver competitive performance in temporal graph analysis.
Multi-relational temporal graphs are core representations for capturing the time-evolving, heterogeneous relationships among entities in complex real-world systems. The Temporal Graph Benchmark 2.0 (TGB 2.0) provides an extensive, reproducible framework for benchmarking future link prediction on large-scale temporal knowledge graphs (TKGs) and temporal heterogeneous graphs (THGs). This resource addresses longstanding bottlenecks—including dataset scarcity, experimental reproducibility, and scalability challenges—by contributing eight novel datasets spanning five domains, together with a unified evaluation protocol, curated baselines, and comprehensive public artifacts (Gastinger et al., 2024).
1. Dataset Collection and Properties
TGB 2.0 consolidates eight large-scale datasets formulated as temporal multi-relational graphs. Each dataset comprises a set of nodes $V$, a set of relation types $R$, and a set of temporal edges $E$, where each edge is a quadruple $(s, r, o, t)$ giving a subject, relation, object, and timestamp.
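As a minimal illustration of the quadruple format, the sketch below stores temporal edges as named tuples and derives the node, relation, and timestamp sets from them. The `Quad` type, its field names, and the integer ids are assumptions made for this example; they are not taken from the TGB 2.0 codebase.

```python
from typing import NamedTuple


class Quad(NamedTuple):
    """One temporal edge: subject, relation, object, timestamp (hypothetical type)."""
    subject: int
    relation: int
    obj: int
    timestamp: int


# Toy graph: the fact (0, 0, 1) recurs at two different timestamps.
edges = [
    Quad(0, 0, 1, 10),
    Quad(1, 1, 2, 10),
    Quad(0, 0, 1, 20),
]

# Node, relation, and timestamp sets derived from the edge list.
nodes = {e.subject for e in edges} | {e.obj for e in edges}
relations = {e.relation for e in edges}
timestamps = sorted({e.timestamp for e in edges})
```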
Dataset Summary
| Dataset | Domain | Nodes | Edges | Relation types | Timestamps |
|---|---|---|---|---|---|
| tkgl-smallpedia | Knowledge | 47,433 | 0.55M | ≈ 37 | 1,826 |
| tkgl-polecat | Political | 150,931 | 1.78M | 16 | 10,224 |
| tkgl-icews | Political | 87,856 | 15.51M | 391 | 2,025 |
| tkgl-wikidata | Knowledge | 1,226,440 | 9.86M | ≈ 3,000 | 689,549 |
| thgl-software | Software | 681,927 | 1.49M | 14 | 2,558,457 |
| thgl-forum | Social | 152,816 | 23.76M | 2 | 2,510,415 |
| thgl-github | Software | 5,856,765 | 17.50M | 14 | 14,828,090 |
| thgl-myket | Interaction | 1,530,835 | 53.63M | 2 | 14,828,090 |
TGB 2.0 datasets are substantially larger than prior TKG/THG collections: tkgl-wikidata provides 25× more nodes and 6× more edges than previous TKG datasets, while thgl-github has 500× more nodes than previous THG datasets (Gastinger et al., 2024).
Application Domains
- Knowledge Graphs: Real-world concepts linked by time-qualified factual relations (tkgl-smallpedia, tkgl-wikidata). Task: prediction of future statements or property changes.
- Political Event Graphs: Socio-political actors with temporally coded cooperative/hostile events (tkgl-polecat, tkgl-icews). Task: forecasting political interactions.
- Software and Online-Forum Networks: User and repository/event traces in GitHub (thgl-software, thgl-github); Reddit user and subreddit interactions (thgl-forum). Task: predicting subsequent actions or conversation edges.
- Mobile-App Marketplace: User-app install/update events (thgl-myket). Task: predicting next app interaction.
2. Preprocessing, Statistics, and Temporal Characteristics
Data Splitting and Filtering
All datasets follow a 70/15/15% chronological split for train/validation/test, with all edges at a given timestamp assigned to exactly one split (no temporal leakage). For thgl-github and thgl-software, low-degree nodes are removed using dataset-specific degree thresholds (Gastinger et al., 2024).
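One way to realize such a split is sketched below, assuming edges are `(s, r, o, t)` tuples. Each timestamp is assigned as a whole to train, validation, or test according to the share of edges that precede it, so the 70/15/15 proportions are only approximate when many edges share a timestamp. The function name and the assignment rule are illustrative assumptions, not the benchmark's actual preprocessing code.

```python
from collections import Counter


def chronological_split(edges, train_frac=0.70, val_frac=0.15):
    """Split (s, r, o, t) tuples by time; no timestamp spans two splits."""
    per_timestamp = Counter(t for _, _, _, t in edges)
    total = len(edges)
    split_of = {}
    seen = 0
    for t in sorted(per_timestamp):
        # Share of all edges that occur strictly before timestamp t.
        share_before = seen / total
        if share_before < train_frac:
            split_of[t] = "train"
        elif share_before < train_frac + val_frac:
            split_of[t] = "val"
        else:
            split_of[t] = "test"
        seen += per_timestamp[t]
    splits = {"train": [], "val": [], "test": []}
    for edge in edges:
        splits[split_of[edge[3]]].append(edge)
    return splits


# Toy data: 40 edges over 20 timestamps, two edges per timestamp.
edges = [(i % 5, 0, (i + 1) % 5, i // 2) for i in range(40)]
splits = chronological_split(edges)
```

Because the assignment is made per timestamp rather than per edge, the no-leakage property holds by construction.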
Negative Sampling and Recurrence
Negative samples are pre-generated to ensure identical test sets across runs. Two temporal recurrence statistics are reported:
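- Recurrency Degree: the fraction of test triples $(s, r, o)$ that already occurred at some earlier timestamp.
- Direct Recurrence: the fraction of test triples $(s, r, o)$ that occurred at the immediately preceding timestamp.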
Edge distributions reveal episodic burstiness in THGs and steady, long-term growth in TKGs such as tkgl-wikidata. Recurrence statistics are closely associated with per-relation Mean Reciprocal Rank (MRR) in TKGs.
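The two recurrence statistics can be computed in a single pass over the data. The sketch below assumes that the recurrency degree is the fraction of test triples seen at any earlier timestamp, that direct recurrence is the fraction seen at the immediately preceding observed timestamp, and that the test edges are a subset of the full edge list. The function name is hypothetical.

```python
def recurrence_stats(all_edges, test_edges):
    """Return (recurrency degree, direct recurrence) for the test edges."""
    all_set = set(all_edges)
    timestamps = sorted({t for _, _, _, t in all_edges})
    # Map each timestamp to the one observed immediately before it.
    previous = dict(zip(timestamps[1:], timestamps[:-1]))
    # Earliest timestamp at which each (s, r, o) triple appears.
    first_seen = {}
    for s, r, o, t in all_edges:
        key = (s, r, o)
        if key not in first_seen or t < first_seen[key]:
            first_seen[key] = t
    recurring = 0
    direct = 0
    for s, r, o, t in test_edges:
        if first_seen.get((s, r, o), t) < t:
            recurring += 1
        if t in previous and (s, r, o, previous[t]) in all_set:
            direct += 1
    n = len(test_edges)
    return recurring / n, direct / n


all_edges = [
    (0, 0, 1, 1), (0, 0, 1, 2), (2, 1, 3, 2),
    (0, 0, 1, 3), (2, 1, 3, 4), (4, 0, 5, 4),
]
test_edges = [e for e in all_edges if e[3] >= 3]
rec_degree, direct_rec = recurrence_stats(all_edges, test_edges)
```

In this toy example two of the three test triples recur from earlier timestamps, and one of them recurs from the immediately preceding timestamp.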
3. Benchmarking Task Formulation and Evaluation
Task Definition
The core task is future link prediction (a.k.a. dynamic link-property prediction). Given a query of the form $(s, r, ?, t)$ or $(?, r, o, t)$, the model assigns a real-valued score $f(s, r, o, t)$ to each candidate, and candidates are ranked by this score.
Objective and Loss Function
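Training uses a pointwise cross-entropy loss comparing observed positives against sampled negatives. In its standard form, for an object query:
$$\mathcal{L} = -\sum_{(s, r, o, t) \in E_{\text{train}}} \Big[ \log \sigma\big(f(s, r, o, t)\big) + \sum_{o' \in \mathcal{N}} \log\big(1 - \sigma(f(s, r, o', t))\big) \Big]$$
where $f$ is the model's score function, $\sigma$ is the logistic sigmoid, and $\mathcal{N}$ is the set of sampled negatives.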
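A minimal sketch of a pointwise (binary) cross-entropy for one positive and its sampled negatives follows. It assumes the model outputs raw logits that are passed through a logistic sigmoid; the function names are illustrative, and a real training loop would use a tensor library rather than plain Python.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pointwise_cross_entropy(pos_score, neg_scores):
    """Loss for one observed edge and the scores of its sampled negatives."""
    loss = -log_sigmoid(pos_score)
    # log(1 - sigmoid(s)) equals log(sigmoid(-s)).
    loss -= sum(log_sigmoid(-s) for s in neg_scores)
    return loss


# A confident, correct model incurs a much lower loss than a confidently wrong one.
good = pointwise_cross_entropy(5.0, [-5.0, -4.0])
bad = pointwise_cross_entropy(-5.0, [5.0, 4.0])
neutral = pointwise_cross_entropy(0.0, [0.0])
```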
Evaluation Protocol
- Splits: All edges with a given timestamp are assigned to only one split.
- Negative Sampling:
  - 1-vs-all protocol (ranking against all candidates) for the smaller graphs.
  - 1-vs-$q$ for larger graphs, with $q$ negatives sampled from nodes previously seen as objects for the same relation $r$ ("edge-type–aware sampling").
- Metrics: Time-aware filtered MRR and Hits@K; filtering removes temporally conflicting positives from the ranking pool.
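The protocol can be sketched in a few lines. The code below assumes candidate scores are held in a dictionary keyed by object id, resolves ties optimistically, and draws negatives uniformly from the objects already seen with the query relation. All function names are hypothetical and the details may differ from the benchmark's evaluator.

```python
import random


def sample_negatives(true_obj, seen_objects, q, rng):
    """Draw up to q negatives from objects already seen with the query relation."""
    pool = sorted(seen_objects - {true_obj})
    return rng.sample(pool, min(q, len(pool)))


def filtered_rank(scores, true_obj, other_true_objs):
    """Rank of the true object after removing other objects valid at the same timestamp."""
    target = scores[true_obj]
    rank = 1
    for cand, score in scores.items():
        if cand == true_obj or cand in other_true_objs:
            continue  # time-aware filter: skip competing positives
        if score > target:  # ties resolved optimistically (an assumption)
            rank += 1
    return rank


def mrr(ranks):
    """Mean reciprocal rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of queries whose true object is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Toy query (s, r, ?, t): "c" is the true object and "a" is also true at time t.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
rank = filtered_rank(scores, "c", other_true_objs={"a"})
negatives = sample_negatives("c", {"a", "b", "c", "d"}, q=2, rng=random.Random(0))
ranks = [rank, 1, 4]
```

Without the filter, "c" would be ranked third. Removing the competing positive "a" moves it to second, which is the point of time-aware filtering.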
4. Baseline Models and Modeling Methodology
Baseline Families
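- Heuristic Baselines:
  - Recurrency Baseline (RecB): uses strict/relaxed recurrence patterns.
  - EdgeBank: memorizes historical edges, with either windowed (tw) or full (all) memory.
- Temporal Knowledge Graph Methods:
  - RE-GCN: relation-aware graph convolutions over a sequence of graph snapshots, with recurrent evolution of embeddings.
  - CEN: snapshot-based evolutional encoder combined with a length-aware CNN over histories of different lengths.
  - TLogic: temporal logical rules mined from temporal random walks.
- Temporal Heterogeneous Graph Methods:
  - TGN: Continuous-time memory, message passing.
  - TGNₑ: TGN with learned edge-type embeddings.
  - STHN: Link-encoder architecture with semantic "patch" fusion.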
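To make the memorization idea behind EdgeBank concrete, here is a minimal sketch with both memory modes. It assumes edges are keyed by the (subject, object) pair and that relation types are ignored; adding the relation to the key would give a relation-aware variant. The class, its method names, and the fixed-length window are assumptions for illustration, not the benchmark's implementation.

```python
class EdgeBank:
    """Memorization baseline: score 1.0 for remembered edges, 0.0 otherwise."""

    def __init__(self, window=None):
        # window=None keeps the full history ("all");
        # a number keeps only edges seen within that many time units ("tw").
        self.window = window
        self.last_seen = {}

    def update(self, s, o, t):
        """Record that the pair (s, o) was observed at time t."""
        self.last_seen[(s, o)] = t

    def score(self, s, o, t):
        """Return 1.0 if (s, o) is in memory at query time t, else 0.0."""
        seen_at = self.last_seen.get((s, o))
        if seen_at is None:
            return 0.0
        if self.window is not None and t - seen_at > self.window:
            return 0.0
        return 1.0


bank_all = EdgeBank()
bank_tw = EdgeBank(window=10)
for bank in (bank_all, bank_tw):
    bank.update(0, 1, t=1)

recent_all = bank_all.score(0, 1, t=100)  # still remembered with full memory
recent_tw = bank_tw.score(0, 1, t=100)    # outside the 10-unit window
unseen = bank_all.score(2, 3, t=100)      # never observed
```

Each update and lookup is a single dictionary operation, which is why this kind of baseline remains usable on the largest graphs.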
Key Model Components
- Relational Message Passing (e.g., RE-GCN):
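$$h_o^{(l+1)} = \sigma\Big(\frac{1}{c_o} \sum_{(s, r)\,:\,(s, r, o, t) \in E} W_1^{(l)}\big(h_s^{(l)} + h_r\big) + W_2^{(l)} h_o^{(l)}\Big)$$
where $h_s^{(l)}$ and $h_o^{(l)}$ are layer-$l$ node embeddings, $h_r$ is the relation embedding, $c_o$ is a normalization constant (the in-degree of $o$), and $W_1^{(l)}$, $W_2^{(l)}$ are learned weights.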
- Continuous-Time Memory Update (TGN):
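$$m_i(t) = \mathrm{msg}\big(s_i(t^-), s_j(t^-), \Delta t, e_{ij}(t)\big), \qquad s_i(t) = \mathrm{mem}\big(\bar{m}_i(t), s_i(t^-)\big)$$
where $s_i(t^-)$ denotes the memory state of node $i$ just before time $t$ (not a subject entity), $e_{ij}(t)$ is the edge feature of the interaction, $\bar{m}_i(t)$ is the aggregate of the messages received by node $i$, and $\mathrm{mem}$ is a recurrent update such as a GRU.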
- Time Encoding (STHN):
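$$\phi(\Delta t) = \cos(\Delta t \cdot \boldsymbol{\omega}), \qquad \boldsymbol{\omega} = \big\{\alpha^{-(i-1)/\beta}\big\}_{i=1}^{d}$$
where $\Delta t$ is the time difference to the query timestamp and $\boldsymbol{\omega}$ is a fixed (non-learned) vector of $d$ frequencies.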
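A fixed cosine time encoding of this kind can be sketched in a few lines. The code assumes the form $\cos(\Delta t \cdot \omega_i)$ with geometrically spaced, non-learned frequencies $\omega_i = \alpha^{-i/\beta}$; the function name and the default values of `dim`, `alpha`, and `beta` are illustrative choices rather than settings taken from STHN.

```python
import math


def time_encoding(delta_t, dim=4, alpha=10.0, beta=10.0):
    """Map a time difference to a dim-sized vector of cosines at fixed frequencies."""
    # Frequencies decay geometrically: alpha ** (-i / beta) for i = 0 .. dim - 1.
    return [math.cos(delta_t * alpha ** (-i / beta)) for i in range(dim)]


zero_gap = time_encoding(0.0)
hour_gap = time_encoding(3600.0)
```

Because the frequencies are fixed, the encoding adds no trainable parameters, and every component stays in the range [-1, 1] regardless of how large the time gap is.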
Scalability
Heuristic baselines (RecB, EdgeBank) have cost roughly linear in the number of edges; EdgeBank runs on all eight datasets, while RecB times out on some of the largest. The learned models (RE-GCN, CEN, TLogic, TGN, STHN) mostly cannot scale to the largest graphs because of memory requirements (RE-GCN, CEN) or time complexity (TLogic), and out-of-memory (OOM) or out-of-time (OOT) failures are reported on datasets such as tkgl-wikidata, thgl-github, and thgl-myket.
5. Experimental Results and Empirical Insights
Observations
- Edge-type (relation) information is essential for high predictive performance; TGNₑ (with edge-type embeddings) outperforms TGN.
- Simple heuristic approaches (RecB, EdgeBank) are often competitive with deep models. On tkgl-smallpedia and tkgl-icews, RecB achieves the highest MRR.
- Deep learning models generally do not scale to the largest TGB 2.0 datasets, suggesting a research gap for highly scalable methods.
- On THGs, STHN achieves the best result (MRR 0.731 on thgl-software) but is not executable on larger datasets. EdgeBank remains a consistent and scalable baseline.
- In TKGs, relation recurrence degree strongly predicts per-relation MRR.
Performance Summaries
Temporal Knowledge Graphs (MRR, 1-vs-all or 1-vs-q)
| Method | Smallpedia | Polecat | ICEWS | Wikidata |
|---|---|---|---|---|
| EdgeBank (tw) | 0.457 | 0.058 | 0.020 | 0.633 |
| EdgeBank (all) | 0.401 | 0.048 | 0.009 | 0.632 |
| RecBₒₚₜ | 0.694 | 0.203 | OOT | OOT |
| RecB₀ₚₜ₁ | 0.640 | 0.170 | 0.206 | OOT |
| RE-GCN | 0.631 | 0.191 | 0.182 | OOM |
| CEN | 0.646 | 0.204 | 0.187 | OOM |
| TLogic | 0.631 | 0.236 | 0.287 | OOT |
Temporal Heterogeneous Graphs (MRR, 1-vs-q)
| Method | Software | Forum | GitHub | Myket |
|---|---|---|---|---|
| EdgeBank (tw) | 0.279 | 0.534 | 0.355 | 0.248 |
| EdgeBank (all) | 0.399 | 0.612 | 0.403 | 0.430 |
| RecB₀ₚₜ₁ | 0.099 | 0.561 | OOT | OOT |
| TGN | 0.324 | 0.649 | OOM | OOM |
| TGNₑ | 0.424 | 0.729 | OOM | OOM |
| STHN | 0.731 | OOM | OOM | OOM |
6. Reproducibility, Artifacts, and Community Resources
The end-to-end TGB 2.0 evaluation pipeline is fully automated, including dataset download, preprocessing, chronological splitting, negative-sample generation, training, evaluation, and leaderboard serving. Identical negative samples and fixed random seeds across all splits and model initializations ensure strict reproducibility.
Public research artifacts are hosted as follows:
- Source Code & Docker: https://github.com/JuliaGast/TGB2
- Datasets & Metadata: https://huggingface.co/datasets/andrewsleader/TGB
- Live Leaderboard & API: https://tgb.complexdatalab.com/
- Permanent Data Storage: Digital Research Alliance of Canada (DOI forthcoming)
TGB 2.0 constitutes a significant step toward robust, fair, and reproducible evaluation of temporal graph learning techniques at previously unattainable scale (Gastinger et al., 2024).