Temporal Graph Benchmark 2.0 (TGB 2.0)

Updated 18 May 2026

Temporal Graph Benchmark 2.0 (TGB 2.0) is a comprehensive framework that provides eight novel, large-scale datasets spanning five domains for temporal link prediction.
It addresses longstanding problems of dataset scarcity, limited experimental reproducibility, and poor scalability by offering unified evaluation protocols and curated baselines.
Empirical results indicate that while deep models struggle on massive datasets, simple heuristic methods often deliver competitive performance in temporal graph analysis.

Multi-relational temporal graphs are core representations for capturing the time-evolving, heterogeneous relationships among entities in complex real-world systems. TGB 2.0 provides an extensive, reproducible framework for benchmarking future link prediction on large-scale temporal knowledge graphs (TKGs) and temporal heterogeneous graphs (THGs). It addresses dataset scarcity, limited experimental reproducibility, and scalability challenges by contributing eight novel datasets spanning five domains, together with a unified evaluation protocol, curated baselines, and comprehensive public artifacts (Gastinger et al., 2024).

1. Dataset Collection and Properties

TGB 2.0 consolidates eight large-scale datasets formulated as temporal multi-relational graphs. Each dataset $G = (V, R, \mathcal{E})$ comprises nodes $V$, relation types $R$, and temporal edges $\mathcal{E}$, where each edge is a quadruple $(s, r, o, t)$ indicating a subject, relation, object, and timestamp.
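As a minimal sketch of this representation, the quadruples can be held in a plain Python list and the four size statistics derived from it. The names `Quadruple`, `graph_stats`, and `toy_edges` are hypothetical and chosen for illustration only; they are not part of the TGB 2.0 code base.

```python
from typing import NamedTuple


class Quadruple(NamedTuple):
    """One temporal edge (s, r, o, t)."""
    s: int  # subject node id
    r: int  # relation type id
    o: int  # object node id
    t: int  # timestamp


def graph_stats(edges):
    """Return (|V|, |E|, |R|, |T|) for a list of quadruples."""
    nodes = {e.s for e in edges} | {e.o for e in edges}
    relations = {e.r for e in edges}
    timestamps = {e.t for e in edges}
    return len(nodes), len(edges), len(relations), len(timestamps)


# Toy data (illustrative only, not taken from any TGB 2.0 dataset).
toy_edges = [
    Quadruple(0, 0, 1, 10),
    Quadruple(1, 1, 2, 10),
    Quadruple(0, 0, 1, 11),  # the same triple recurring at a later time
    Quadruple(2, 0, 3, 12),
]
stats = graph_stats(toy_edges)
```

Note that a recurring triple counts once per occurrence in $|\mathcal{E}|$ but adds no new node or relation type.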

Dataset Summary

Dataset	Domain	Nodes $|V|$	Edges $|\mathcal{E}|$	Relation types $|R|$	Timestamps $|T|$
tkgl-smallpedia	Knowledge	47,433	0.55M	≈ 37	1,826
tkgl-polecat	Political	150,931	1.78M	16	10,224
tkgl-icews	Political	87,856	15.51M	391	2,025
tkgl-wikidata	Knowledge	1,226,440	9.86M	≈ 3,000	689,549
thgl-software	Software	681,927	1.49M	14	2,558,457
thgl-forum	Social	152,816	23.76M	2	2,510,415
thgl-github	Software	5,856,765	17.50M	14	14,828,090
thgl-myket	Interaction	1,530,835	53.63M	2	14,828,090

TGB 2.0 datasets are substantially larger than prior TKG and THG collections: tkgl-wikidata provides 25× more nodes and 6× more edges than previous TKGs, while thgl-github has 500× more nodes than previous THGs (Gastinger et al., 2024).

Application Domains

Knowledge Graphs: Real-world concepts linked by time-qualified factual relations (tkgl-smallpedia, tkgl-wikidata). Task: prediction of future statements or property changes.
Political Event Graphs: Socio-political actors with temporally coded cooperative/hostile events (tkgl-polecat, tkgl-icews). Task: forecasting political interactions.
Software and Online-Forum Networks: User and repository/event traces in GitHub (thgl-software, thgl-github); Reddit user and subreddit interactions (thgl-forum). Task: predicting subsequent actions or conversation edges.
Mobile-App Marketplace: User-app install/update events (thgl-myket). Task: predicting next app interaction.

2. Preprocessing, Statistics, and Temporal Characteristics

Data Splitting and Filtering

All datasets follow a 70/15/15% chronological split for train/validation/test, with all edges that share a timestamp assigned to exactly one split, so no timestamp straddles a split boundary (no temporal leakage). THGs are filtered to remove low-degree nodes (degree $\leq 2$ for GitHub, with a separate threshold for Software) (Gastinger et al., 2024).
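The splitting rule can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's own preprocessing code: the function name `chronological_split` is hypothetical, and the choice to assign a timestamp that falls exactly on a cut position to the earlier split is an assumption made here.

```python
def chronological_split(edges, train_frac=0.70, val_frac=0.15):
    """Split (s, r, o, t) tuples by time so that no timestamp spans two splits."""
    if not edges:
        return [], [], []
    ordered = sorted(edges, key=lambda e: e[3])
    n = len(ordered)
    # Timestamps found at the nominal 70% and 85% cut positions.
    # Assumption: a timestamp sitting on a cut goes to the earlier split.
    train_end_t = ordered[min(n - 1, int(train_frac * n))][3]
    val_end_t = ordered[min(n - 1, int((train_frac + val_frac) * n))][3]
    train = [e for e in ordered if e[3] <= train_end_t]
    val = [e for e in ordered if train_end_t < e[3] <= val_end_t]
    test = [e for e in ordered if e[3] > val_end_t]
    return train, val, test


# Toy data: 20 edges, two edges per timestamp (illustrative only).
toy = [(i, 0, i + 1, i // 2) for i in range(20)]
train, val, test = chronological_split(toy)
```

Because cuts snap to timestamp boundaries, the realised proportions are only approximately 70/15/15 when many edges share a timestamp.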

Negative Sampling and Recurrence

Negative samples are pre-generated to ensure identical test sets across runs. Two temporal recurrence statistics are reported:

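Recurrency Degree: the fraction of test quadruples $(s, r, o, t)$ whose triple already occurred earlier, i.e. for which some $k < t$ exists with $(s, r, o, k) \in \mathcal{E}$.
Direct Recurrence: the fraction of test quadruples $(s, r, o, t)$ whose triple occurred at the immediately preceding time step, i.e. $(s, r, o, t-1) \in \mathcal{E}$.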
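A rough sketch of how two such recurrence statistics could be computed is given below. It assumes the recurrency degree is the fraction of test quadruples whose triple $(s, r, o)$ occurred at any earlier timestamp, and the direct variant is the fraction whose triple occurred at the immediately preceding step; it further assumes timestamps are consecutive integer indices, so that "preceding" means $t - 1$. The function name `recurrency_degrees` is hypothetical.

```python
def recurrency_degrees(all_edges, test_edges):
    """Return (recurrency degree, direct recurrence) over the test edges.

    all_edges:  every (s, r, o, t) in the graph, including the test edges.
    test_edges: the (s, r, o, t) tuples the statistics are computed over.
    """
    earliest = {}    # (s, r, o) -> earliest timestamp at which it occurs
    present = set()  # all (s, r, o, t) quadruples
    for s, r, o, t in all_edges:
        present.add((s, r, o, t))
        key = (s, r, o)
        if key not in earliest or t < earliest[key]:
            earliest[key] = t

    rec = direct = 0
    for s, r, o, t in test_edges:
        if earliest.get((s, r, o), t) < t:   # seen at some k < t
            rec += 1
        if (s, r, o, t - 1) in present:      # seen at exactly t - 1 (assumption)
            direct += 1
    n = len(test_edges)
    return (rec / n, direct / n) if n else (0.0, 0.0)


# Toy data (illustrative only).
history = [(0, 0, 1, 0), (0, 0, 1, 1), (2, 1, 3, 0)]
test_q = [(0, 0, 1, 2), (2, 1, 3, 2), (4, 0, 5, 2), (4, 0, 5, 3)]
rec_deg, direct_rec = recurrency_degrees(history + test_q, test_q)
```

In the toy data, three of the four test triples were seen before and two of them at the immediately preceding step, giving 0.75 and 0.5.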

Edge distributions reveal episodic burstiness in THGs and steady, long-term growth in TKGs such as tkgl-wikidata. Recurrence statistics are closely associated with per-relation Mean Reciprocal Rank (MRR) in TKGs.

3. Benchmarking Task Formulation and Evaluation

Task Definition

The core task is future link prediction (a.k.a. dynamic link-property prediction). The model, given a query of the form $(s, r, ?, t)$ or $(?, r, o, t)$, assigns a real-valued score $f(s, r, o, t)$ to each candidate entity, and candidates are ranked by this score.

Objective and Loss Function

Training uses pointwise cross-entropy loss comparing observed positives against sampled negatives:

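$\mathcal{L} = -\sum_{(s, r, o, t) \in \mathcal{E}_{\text{train}}} \Big[ \log \sigma\big(f(s, r, o, t)\big) + \sum_{o' \in \mathcal{N}} \log\big(1 - \sigma(f(s, r, o', t))\big) \Big]$

where $f$ is the model's scoring function, $\sigma$ is the logistic sigmoid, and $\mathcal{N}$ is the set of sampled negatives.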
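A pointwise cross-entropy objective of this kind can be sketched in a few lines. The sketch assumes the standard binary cross-entropy form over raw scores (logits), with label 1 for the observed edge and label 0 for each sampled negative; the exact form used by any given baseline may differ. The helper names `softplus` and `pointwise_bce` are illustrative.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pointwise_bce(pos_score, neg_scores):
    """Binary cross-entropy for one positive score and its sampled negatives.

    Uses the identities -log(sigmoid(x)) = softplus(-x) and
    -log(1 - sigmoid(x)) = softplus(x), so large scores do not overflow.
    """
    loss = softplus(-pos_score)      # positive should score high
    for s in neg_scores:
        loss += softplus(s)          # negatives should score low
    return loss


# A well-separated query incurs a smaller loss than a confused one.
good = pointwise_bce(4.0, [-3.0, -5.0])
bad = pointwise_bce(-1.0, [2.0, 0.5])
```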

Evaluation Protocol

Splits: All edges with a given timestamp are assigned to only one split.
Negative Sampling:
- 1-vs-all protocol (all candidates) for the smaller graphs.
- 1-vs-$q$ for larger graphs, with $q$ negatives sampled from nodes previously seen as objects for the same relation $r$ ("edge-type-aware sampling").
Metrics: Time-aware filtered MRR and Hits@K; filtering removes temporally conflicting positives from the ranking pool.
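The protocol above can be illustrated with a small sketch that draws edge-type-aware negatives with a fixed seed, applies time-aware filtering, and computes MRR and Hits@K. All function names are hypothetical, and two details are assumptions made here rather than facts from the source: ties are resolved by averaging the optimistic and pessimistic rank, and the negative pool is built from the training edges only.

```python
import random


def sample_negatives(query, train_edges, known_true, q, seed=0):
    """Sample up to q negatives among objects seen with the query's relation."""
    s, r, o, t = query
    pool = sorted({o2 for (_, r2, o2, _) in train_edges if r2 == r})
    # Time-aware filtering: drop the positive and any object that is also
    # a true answer for (s, r, ?, t).
    pool = [c for c in pool if c != o and (s, r, c, t) not in known_true]
    rng = random.Random(seed)  # fixed seed gives identical negatives per run
    return rng.sample(pool, min(q, len(pool)))


def rank_of_positive(pos_score, neg_scores):
    """Rank of the positive among its negatives (ties averaged; assumption)."""
    higher = sum(1 for x in neg_scores if x > pos_score)
    ties = sum(1 for x in neg_scores if x == pos_score)
    return 1 + higher + ties / 2.0


def mrr_and_hits(ranks, k=10):
    """Mean reciprocal rank and Hits@k over a list of ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits


# Toy data (illustrative only).
train_edges = [(0, 0, 1, 0), (0, 0, 2, 0), (3, 0, 4, 1), (5, 1, 6, 1)]
known_true = {(0, 0, 1, 5), (0, 0, 2, 5)}
negs = sample_negatives((0, 0, 1, 5), train_edges, known_true, q=3)
metrics = mrr_and_hits([1.0, 2.0], k=1)
```

In the toy data only node 4 survives as a negative: node 6 was never an object of relation 0, and node 2 is itself a true answer at the query time.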

4. Baseline Models and Modeling Methodology

Baseline Families

Heuristic Baselines:
- Recurrency Baseline (RecB): uses strict/relaxed recurrence patterns.
- EdgeBank: memorizes historical edges, with either windowed (tw) or full (all) memory.
Temporal Knowledge Graph Methods:
- RE-GCN: R-GCN message passing over snapshots with RNN-based historical summary.
- CEN: Evolving GCNs with a curriculum schedule from short-to-long temporal patterns.
- TLogic: Backward random walks for temporal-logic rule learning.
Temporal Heterogeneous Graph Methods:
- TGN: Continuous-time memory, message passing.
- TGNₑ: TGN with learned edge-type embeddings.
- STHN: Link-encoder architecture with semantic "patch" fusion.
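To make the simplest of these baselines concrete, here is a minimal EdgeBank-style sketch. It assumes the memory is keyed on (subject, object) pairs as in the original EdgeBank formulation, and it models the windowed variant as "seen within the last `window` time units before the query", which is a simplification chosen for illustration. The class and method names are hypothetical and do not mirror the benchmark's implementation.

```python
class EdgeBank:
    """Memorisation baseline: predict 1.0 for edges seen before, else 0.0."""

    def __init__(self, window=None):
        # window=None keeps the full history ("all" variant);
        # a number keeps only recent edges ("tw" variant, simplified).
        self.window = window
        self.last_seen = {}  # (s, o) -> latest timestamp observed

    def update(self, s, o, t):
        """Store an observed edge."""
        key = (s, o)
        self.last_seen[key] = max(t, self.last_seen.get(key, t))

    def score(self, s, o, t):
        """Score a candidate edge at query time t."""
        last = self.last_seen.get((s, o))
        if last is None:
            return 0.0
        if self.window is not None and t - last > self.window:
            return 0.0
        return 1.0


# Toy stream of (s, o, t) edges (illustrative only).
bank_all = EdgeBank()
bank_tw = EdgeBank(window=5)
for s, o, t in [(0, 1, 1), (2, 3, 8)]:
    bank_all.update(s, o, t)
    bank_tw.update(s, o, t)
```

At query time 10, the full-memory bank scores both stored pairs as 1.0, while the windowed bank has already forgotten the pair last seen at time 1.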

Key Model Components

Relational Message Passing (e.g., RE-GCN):

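$h_o^{(l+1)} = \sigma\Big( \sum_{r \in R} \sum_{s \in \mathcal{N}_o^r} \frac{1}{c_{o,r}} W_r^{(l)} h_s^{(l)} + W_0^{(l)} h_o^{(l)} \Big)$

(standard R-GCN layer, on which RE-GCN builds: $\mathcal{N}_o^r$ is the set of neighbours of $o$ under relation $r$, $c_{o,r}$ is a normalisation constant, and $W_r^{(l)}$, $W_0^{(l)}$ are learned weights)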

Continuous-Time Memory Update (TGN):

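$\mathbf{s}_i(t) = \mathrm{mem}\big(\bar{\mathbf{m}}_i(t), \mathbf{s}_i(t^-)\big)$

(standard TGN form: $\mathbf{s}_i(t^-)$ is the memory of node $i$ just before time $t$, $\bar{\mathbf{m}}_i(t)$ is its aggregated message, and $\mathrm{mem}$ is a learnable recurrent cell such as a GRU)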

Time Encoding (STHN):

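$\phi(\Delta t) = \cos(\Delta t \, \boldsymbol{\omega})$

(a cosine encoding of the time difference $\Delta t$ over a vector of frequencies $\boldsymbol{\omega}$)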

Scalability

Heuristic baselines (RecB, EdgeBank) run in time linear in the number of edges $|\mathcal{E}|$; EdgeBank completes on every dataset, whereas RecB exceeds the time limit on several of the largest. Most learned methods (RE-GCN, CEN, TLogic, TGN, STHN) cannot scale to the largest graphs because of memory requirements (RE-GCN, CEN, TGN, STHN) or running time (TLogic), and out-of-memory (OOM) or out-of-time (OOT) failures are reported on datasets such as tkgl-wikidata, thgl-github, and thgl-myket.

5. Experimental Results and Empirical Insights

Observations

Edge-type (relation) information is essential for high predictive performance; TGNₑ (with edge-type embeddings) outperforms TGN.
Simple heuristic approaches (RecB, EdgeBank) are often competitive with deep models. On tkgl-smallpedia and tkgl-icews, RecB achieves the highest MRR.
Deep learning models generally do not scale to the largest TGB 2.0 datasets, suggesting a research gap for highly scalable methods.
On THGs, STHN achieves the best result (MRR 0.731 on thgl-software) but runs out of memory on the three larger THG datasets. EdgeBank remains a consistent and scalable baseline.
In TKGs, relation recurrence degree strongly predicts per-relation MRR.

Performance Summaries

Temporal Knowledge Graphs (MRR, 1-vs-all or 1-vs-q)

Method	Smallpedia	Polecat	ICEWS	Wikidata
EdgeBankₜʷ	0.457	0.058	0.020	0.633
EdgeBankₐₗₗ	0.401	0.048	0.009	0.632
RecBₒₚₜ	0.694	0.203	OOT	OOT
RecB₀ₚₜ₁	0.640	0.170	0.206	OOT
RE-GCN	0.631	0.191	0.182	OOM
CEN	0.646	0.204	0.187	OOM
TLogic	0.631	0.236	0.287	OOT

Temporal Heterogeneous Graphs (MRR, 1-vs-q)

Method	Software	Forum	GitHub	Myket
EdgeBankₜʷ	0.279	0.534	0.355	0.248
EdgeBankₐₗₗ	0.399	0.612	0.403	0.430
RecB₀ₚₜ₁	0.099	0.561	OOT	OOT
TGN	0.324	0.649	OOM	OOM
TGNₑ	0.424	0.729	OOM	OOM
STHN	0.731	OOM	OOM	OOM

6. Reproducibility, Artifacts, and Community Resources

The end-to-end TGB 2.0 evaluation pipeline is fully automated, including dataset download, preprocessing, chronological splitting, negative-sample generation, training, evaluation, and leaderboard serving. Identical negative samples and fixed random seeds across all splits and model initializations ensure strict reproducibility.

Public research artifacts are hosted as follows:

Source Code & Docker: https://github.com/JuliaGast/TGB2
Datasets & Metadata: https://huggingface.co/datasets/andrewsleader/TGB
Live Leaderboard & API: https://tgb.complexdatalab.com/
Permanent Data Storage: Digital Research Alliance of Canada (DOI forthcoming)

TGB 2.0 constitutes a significant step toward robust, fair, and reproducible evaluation of temporal graph learning techniques at previously unattainable scale (Gastinger et al., 2024).

References

Gastinger et al. (2024). TGB 2.0: A Benchmark for Learning on Temporal Knowledge Graphs and Heterogeneous Graphs.
