Temporal Graph Benchmark 2.0 (TGB 2.0)

Updated 18 May 2026
  • Temporal Graph Benchmark 2.0 (TGB 2.0) is a comprehensive framework that provides eight novel, large-scale datasets spanning five domains for temporal link prediction.
  • It mitigates longstanding bottlenecks like dataset scarcity, experimental reproducibility, and scalability by offering unified evaluation protocols and curated baselines.
  • Empirical insights indicate that while deep models struggle on massive datasets, simple heuristic methods often deliver competitive performance in temporal graph analysis.

Multi-relational temporal graphs are core representations for capturing the time-evolving, heterogeneous relationships among entities in complex real-world systems. The Temporal Graph Benchmark 2.0 (TGB 2.0) provides an extensive, reproducible framework for benchmarking future link prediction on large-scale temporal knowledge graphs (TKGs) and temporal heterogeneous graphs (THGs). This resource addresses longstanding bottlenecks—including dataset scarcity, experimental reproducibility, and scalability challenges—by contributing eight novel datasets spanning five domains, together with a unified evaluation protocol, curated baselines, and comprehensive public artifacts (Gastinger et al., 2024).

1. Dataset Collection and Properties

TGB 2.0 consolidates eight large-scale datasets formulated as temporal multi-relational graphs. Each dataset $G = (V, R, \mathcal{E})$ comprises nodes $V$, relation types $R$, and temporal edges $\mathcal{E}$, where each edge is a quadruple $(s, r, o, t)$ indicating a subject, relation, object, and timestamp.

Dataset Summary

| Dataset | Domain | Nodes | Edges | Relation types | Timestamps |
| --- | --- | --- | --- | --- | --- |
| tkgl-smallpedia | Knowledge | 47,433 | 0.55M | ≈ 37 | 1,826 |
| tkgl-polecat | Political | 150,931 | 1.78M | 16 | 10,224 |
| tkgl-icews | Political | 87,856 | 15.51M | 391 | 2,025 |
| tkgl-wikidata | Knowledge | 1,226,440 | 9.86M | ≈ 3,000 | 689,549 |
| thgl-software | Software | 681,927 | 1.49M | 14 | 2,558,457 |
| thgl-forum | Social | 152,816 | 23.76M | 2 | 2,510,415 |
| thgl-github | Software | 5,856,765 | 17.50M | 14 | 14,828,090 |
| thgl-myket | Interaction | 1,530,835 | 53.63M | 2 | 14,828,090 |

TGB 2.0 datasets are substantially larger than prior TKG/THG collections: for example, tkgl-wikidata provides 25× more nodes and 6× more edges than previous TKGs, while thgl-github reaches 500× more nodes than previous THGs (Gastinger et al., 2024).

Application Domains

  • Knowledge Graphs: Real-world concepts linked by time-qualified factual relations (tkgl-smallpedia, tkgl-wikidata). Task: prediction of future statements or property changes.
  • Political Event Graphs: Socio-political actors with temporally coded cooperative/hostile events (tkgl-polecat, tkgl-icews). Task: forecasting political interactions.
  • Software and Online-Forum Networks: User and repository/event traces in GitHub (thgl-software, thgl-github); Reddit user and subreddit interactions (thgl-forum). Task: predicting subsequent actions or conversation edges.
  • Mobile-App Marketplace: User-app install/update events (thgl-myket). Task: predicting next app interaction.

2. Preprocessing, Statistics, and Temporal Characteristics

Data Splitting and Filtering

All datasets follow a 70/15/15% chronological split for train/validation/test, with all edges of a given timestamp assigned to exactly one split (no temporal leakage). THGs are filtered to remove low-degree nodes (degree $\leq 2$ for thgl-github, with a separate threshold for thgl-software) (Gastinger et al., 2024).
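
A chronological split of this kind can be sketched in a few lines. The function below is an illustration only: the name `chronological_split`, the toy data, and the rule for timestamps that straddle a boundary (the whole timestamp goes to the split in which its first edge falls) are assumptions, not details taken from the benchmark code.

```python
from collections import Counter

def chronological_split(edges, train_frac=0.70, val_frac=0.15):
    """Split (s, r, o, t) quadruples by time; each timestamp lands in one split."""
    counts = Counter(t for _, _, _, t in edges)  # edges per timestamp
    total = len(edges)
    assignment, seen = {}, 0
    for t in sorted(counts):
        # Assumed boundary rule: decide by the fraction of edges strictly before t.
        frac_before = seen / total
        if frac_before < train_frac:
            assignment[t] = "train"
        elif frac_before < train_frac + val_frac:
            assignment[t] = "val"
        else:
            assignment[t] = "test"
        seen += counts[t]
    splits = {"train": [], "val": [], "test": []}
    for edge in edges:
        splits[assignment[edge[3]]].append(edge)
    return splits

# Toy data (hypothetical): 7 timestamps with 3 edges each.
toy_edges = [(f"s{k}", "rel", f"o{k}", t) for t in range(7) for k in range(3)]
splits = chronological_split(toy_edges)
split_times = {name: {e[3] for e in part} for name, part in splits.items()}
```

Because whole timestamps are assigned at once, the realized proportions only approximate 70/15/15 when timestamps carry many edges.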

Negative Sampling and Recurrence

Negative samples are pre-generated to ensure identical test sets across runs. Two temporal recurrence statistics are reported:

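  • Recurrency Degree: the fraction of test triples $(s, r, o)$ that already occurred at some earlier timestamp.
  • Direct Recurrence: the fraction of test triples $(s, r, o)$ that occurred at the immediately preceding timestamp.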

Edge distributions reveal episodic burstiness in THGs and steady, long-term growth in TKGs such as tkgl-wikidata. Recurrence statistics are closely associated with per-relation Mean Reciprocal Rank (MRR) in TKGs.
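
The two recurrence statistics can be computed with a single pass over the data. The sketch below assumes that the recurrency degree counts test triples $(s, r, o)$ seen at any earlier timestamp and that direct recurrence counts those seen at the immediately preceding timestamp; the function name and toy quadruples are hypothetical.

```python
def recurrence_stats(history, test):
    """Return (recurrency degree, direct recurrence) for the test quadruples."""
    all_edges = history + test
    first_seen = {}        # earliest timestamp of each (s, r, o) triple
    occurrences = set()    # every (s, r, o, t) quadruple
    for s, r, o, t in all_edges:
        key = (s, r, o)
        first_seen[key] = min(t, first_seen.get(key, t))
        occurrences.add((s, r, o, t))
    # Map each timestamp to the previous distinct timestamp in the data.
    times = sorted({t for *_, t in all_edges})
    prev_time = {t: p for p, t in zip(times, times[1:])}
    rec = sum(first_seen[(s, r, o)] < t for s, r, o, t in test)
    drec = sum(
        t in prev_time and (s, r, o, prev_time[t]) in occurrences
        for s, r, o, t in test
    )
    return rec / len(test), drec / len(test)

history = [("a", "likes", "b", 1), ("a", "likes", "c", 2), ("d", "likes", "b", 3)]
test = [("a", "likes", "b", 4), ("d", "likes", "b", 4),
        ("a", "likes", "d", 4), ("a", "likes", "c", 5)]
rec_degree, direct_rec = recurrence_stats(history, test)
```

In this toy example three of the four test triples recur from some earlier time, but only one of them recurs from the timestamp directly before it.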

3. Benchmarking Task Formulation and Evaluation

Task Definition

The core task is future link prediction (a.k.a. dynamic link-property prediction). Given a query of the form $(s, r, ?, t)$ or $(?, r, o, t)$, the model assigns a real-valued score $f(s, r, o, t)$ to each candidate completion, and candidates are ranked by this score.

Objective and Loss Function

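Training uses a pointwise cross-entropy loss that compares observed positives against sampled negatives:

$$\mathcal{L} = -\sum_{(s, r, o, t) \in \mathcal{E}_{\text{train}}} \Big[ \log \sigma\big(f(s, r, o, t)\big) + \sum_{o' \in \mathcal{N}(s, r, o, t)} \log\big(1 - \sigma\big(f(s, r, o', t)\big)\big) \Big]$$

where $f$ is the model's scoring function, $\sigma$ is the logistic sigmoid, and $\mathcal{N}(s, r, o, t)$ is the set of sampled negatives.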
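
For a single positive edge, a pointwise cross-entropy of this kind reduces to a sum of log-sigmoid terms. The following is a minimal sketch in plain Python, assuming a sigmoid link on raw scores; the helper names are hypothetical and a real training loop would use a tensor library instead.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pointwise_ce_loss(pos_score, neg_scores):
    """Binary cross-entropy for one positive edge and its sampled negatives."""
    # Positive term: push sigmoid(pos_score) toward 1.
    loss = -log_sigmoid(pos_score)
    # Negative terms: log(1 - sigmoid(x)) equals log_sigmoid(-x).
    loss -= sum(log_sigmoid(-s) for s in neg_scores)
    return loss

good = pointwise_ce_loss(4.0, [-3.0, -2.5])   # positive ranked well above negatives
bad = pointwise_ce_loss(-1.0, [2.0, 0.5])     # positive ranked below negatives
```

A model that scores the observed edge above its negatives (`good`) incurs a smaller loss than one that does the opposite (`bad`).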

Evaluation Protocol

  • Splits: All edges with a given timestamp are assigned to only one split.
  • Negative Sampling:
    • 1-vs-all protocol (all candidates) for the smaller graphs.
    • 1-vs-$q$ for larger graphs, with $q$ negatives sampled from nodes previously seen as objects for the same relation $r$ ("edge-type–aware sampling").
  • Metrics: Time-aware filtered MRR and Hits@K; filtering removes temporally conflicting positives from the ranking pool.
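
The metrics in the last bullet can be illustrated with a small ranking routine. This is a sketch under stated assumptions: filtering is taken to mean dropping other objects known to be correct for the same query at the same timestamp, ties are averaged, and all names and scores are made up for the example.

```python
def filtered_rank(scores, true_obj, known_objs):
    """Rank of true_obj among candidates, ignoring other known-correct objects."""
    true_score = scores[true_obj]
    # Time-aware filter: other true answers at this timestamp are not rivals.
    rivals = [v for cand, v in scores.items()
              if cand != true_obj and cand not in known_objs]
    higher = sum(v > true_score for v in rivals)
    ties = sum(v == true_score for v in rivals)
    # Ties are averaged between the optimistic and pessimistic rank.
    return 1 + higher + ties / 2

def mrr_and_hits(ranks, k=10):
    """Mean reciprocal rank and Hits@k over a list of ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(r <= k for r in ranks) / len(ranks)
    return mrr, hits

# One query (s, r, ?, t) with two correct objects, "b" and "d".
scores = {"b": 0.9, "c": 0.8, "d": 0.4, "e": 0.1}
known = {"b", "d"}
ranks = [filtered_rank(scores, "d", known), filtered_rank(scores, "b", known)]
mrr, hits_at_1 = mrr_and_hits(ranks, k=1)
```

Without the filter, "d" would be ranked third because "b" outscores it; with the filter it is ranked second, since "b" is itself a correct answer.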

4. Baseline Models and Modeling Methodology

Baseline Families

  • Heuristic Baselines:
    • Recurrency Baseline (RecB): uses strict/relaxed recurrence patterns.
    • EdgeBank: memorizes historical edges, with either windowed (tw) or full (all) memory.
  • Temporal Knowledge Graph Methods:
    • RE-GCN: R-GCN message passing over snapshots with RNN-based historical summary.
    • CEN: Evolving GCNs with a curriculum schedule from short-to-long temporal patterns.
    • TLogic: Backward random walks for temporal-logic rule learning.
  • Temporal Heterogeneous Graph Methods:
    • TGN: Continuous-time memory, message passing.
    • TGNₑ: TGN with learned edge-type embeddings.
    • STHN: Link-encoder architecture with semantic "patch" fusion.
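
To make the memorization idea behind EdgeBank concrete, here is a minimal sketch. It assumes edges are keyed by the full $(s, r, o)$ triple and that the windowed variant forgets edges last seen more than `window` time units ago; the class layout and parameter names are illustrative and do not mirror the benchmark's implementation.

```python
class EdgeBank:
    """Memorization baseline: score 1.0 if the edge was seen before, else 0.0."""

    def __init__(self, window=None):
        self.window = window    # None keeps unlimited memory ("all")
        self.last_seen = {}     # (s, r, o) -> most recent timestamp

    def update(self, s, r, o, t):
        self.last_seen[(s, r, o)] = t

    def score(self, s, r, o, t):
        seen_at = self.last_seen.get((s, r, o))
        if seen_at is None:
            return 0.0
        # Windowed memory ("tw"): ignore edges that are too old.
        if self.window is not None and t - seen_at > self.window:
            return 0.0
        return 1.0

bank_all = EdgeBank()
bank_tw = EdgeBank(window=5)
for bank in (bank_all, bank_tw):
    bank.update("user1", "push", "repoA", 10)
```

Both variants score the stored edge as 1.0 shortly after it occurs; only the unlimited-memory variant still does so long after the window has passed.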

Key Model Components

  • Relational Message Passing (e.g., RE-GCN):

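$$h_o^{(l+1)} = \sigma\Big( \frac{1}{c_o} \sum_{(s, r)\,:\,(s, r, o, t) \in \mathcal{E}} W_r^{(l)} h_s^{(l)} + W_0^{(l)} h_o^{(l)} \Big)$$

where $h_o^{(l)}$ is the layer-$l$ embedding of node $o$, $W_r^{(l)}$ is a relation-specific weight matrix, $W_0^{(l)}$ is the self-loop weight, and $c_o$ is a normalization constant (for example, the in-degree of $o$ in the snapshot at time $t$).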

  • Continuous-Time Memory Update (TGN):

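$$\mathbf{s}_i(t) = \mathrm{mem}\big(\bar{\mathbf{m}}_i(t),\, \mathbf{s}_i(t^-)\big)$$

where $\mathbf{s}_i(t^-)$ is node $i$'s memory just before time $t$, $\bar{\mathbf{m}}_i(t)$ is the aggregated message from its new interactions, and $\mathrm{mem}$ is a learnable update function such as a GRU.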

  • Time Encoding (STHN):

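$$\phi(t) = \cos(t\,\boldsymbol{\omega}), \qquad \boldsymbol{\omega} = \{\alpha^{-(i-1)/\beta}\}_{i=1}^{d}$$

where $d$ is the encoding dimension and $\alpha$, $\beta$ are fixed constants, so the encoding has no learned parameters.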
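
Two of these components are simple enough to sketch without a deep-learning library. The code below is an illustration under assumptions, not the baselines' actual code: `relational_layer` follows the generic R-GCN pattern (mean of relation-specific messages plus a self-loop, then ReLU), and `time_encoding` uses a fixed cosine encoding with hypothetical constants `alpha` and `beta`.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relational_layer(h, edges, W_rel, W_self):
    """One R-GCN-style layer over a single snapshot of (s, r, o) edges."""
    dim_out = len(W_self)
    agg = {v: [0.0] * dim_out for v in h}
    indeg = {v: 0 for v in h}
    for s, r, o in edges:
        msg = matvec(W_rel[r], h[s])            # relation-specific transform
        agg[o] = [a + m for a, m in zip(agg[o], msg)]
        indeg[o] += 1
    out = {}
    for v in h:
        self_term = matvec(W_self, h[v])        # self-loop
        norm = max(indeg[v], 1)                 # mean over incoming messages
        out[v] = [max(0.0, a / norm + b) for a, b in zip(agg[v], self_term)]
    return out

def time_encoding(dt, dim=4, alpha=10.0, beta=10.0):
    """Fixed features cos(dt * omega_i) with omega_i = alpha ** (-i / beta)."""
    return [math.cos(dt * alpha ** (-i / beta)) for i in range(dim)]

identity = [[1.0, 0.0], [0.0, 1.0]]
h = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
W_rel = {"r1": identity, "r2": [[2.0, 0.0], [0.0, 2.0]]}
out = relational_layer(h, [("a", "r1", "c"), ("b", "r2", "c")], W_rel, identity)
enc = time_encoding(3.5, dim=6)
```

Node "c" receives one message per relation type, each transformed by its own matrix, which is the mechanism that lets relation information influence the embedding.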

Scalability

Heuristic baselines (RecB, EdgeBank) operate linearly in the number of edges $|\mathcal{E}|$. EdgeBank runs on all eight datasets, while RecB times out on several of the largest ones. Most learned baselines (RE-GCN, CEN, TLogic, TGN, STHN) cannot scale to the largest graphs because of memory cost (RE-GCN, CEN) or running time (TLogic); out-of-memory (OOM) or out-of-time (OOT) failures are reported on datasets such as tkgl-wikidata, thgl-github, and thgl-myket.

5. Experimental Results and Empirical Insights

Observations

  • Edge-type (relation) information is essential for high predictive performance; TGNₑ (with edge-type embeddings) outperforms TGN.
  • Simple heuristic approaches (RecB, EdgeBank) are often competitive with deep models. On tkgl-smallpedia and tkgl-icews, RecB achieves the highest MRR.
  • Deep learning models generally do not scale to the largest TGB 2.0 datasets, suggesting a research gap for highly scalable methods.
  • On THGs, STHN achieves the best result (MRR 0.731 on thgl-software) but is not executable on larger datasets. EdgeBank remains a consistent and scalable baseline.
  • In TKGs, relation recurrence degree strongly predicts per-relation MRR.

Performance Summaries

Temporal Knowledge Graphs (MRR, 1-vs-all or 1-vs-q)

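| Method | Smallpedia | Polecat | ICEWS | Wikidata |
| --- | --- | --- | --- | --- |
| EdgeBankₜʷ | 0.457 | 0.058 | 0.020 | 0.633 |
| EdgeBankₐₗₗ | 0.401 | 0.048 | 0.009 | 0.632 |
| RecBₒₚₜ | 0.694 | 0.203 | OOT | OOT |
| RecB₀ₚₜ₁ | 0.640 | 0.170 | 0.206 | OOT |
| RE-GCN | 0.631 | 0.191 | 0.182 | OOM |
| CEN | 0.646 | 0.204 | 0.187 | OOM |
| TLogic | 0.631 | 0.236 | 0.287 | OOT |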

Temporal Heterogeneous Graphs (MRR, 1-vs-q)

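| Method | Software | Forum | GitHub | Myket |
| --- | --- | --- | --- | --- |
| EdgeBankₜʷ | 0.279 | 0.534 | 0.355 | 0.248 |
| EdgeBankₐₗₗ | 0.399 | 0.612 | 0.403 | 0.430 |
| RecB₀ₚₜ₁ | 0.099 | 0.561 | OOT | OOT |
| TGN | 0.324 | 0.649 | OOM | OOM |
| TGNₑ | 0.424 | 0.729 | OOM | OOM |
| STHN | 0.731 | OOM | OOM | OOM |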

6. Reproducibility, Artifacts, and Community Resources

The end-to-end TGB 2.0 evaluation pipeline is fully automated, including dataset download, preprocessing, chronological splitting, negative-sample generation, training, evaluation, and leaderboard serving. Identical negative samples and fixed random seeds across all splits and model initializations ensure strict reproducibility.
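
One way to obtain identical negatives across runs is to generate them once with a fixed seed and store them. The sketch below combines that idea with edge-type-aware sampling (negatives drawn from objects previously seen with the same relation). The function name, the choice of `q`, and the dictionary layout are assumptions made for illustration.

```python
import random

def pregenerate_negatives(train_edges, test_edges, q=3, seed=0):
    """Sample up to q negative objects per test edge, reproducibly."""
    rng = random.Random(seed)   # local generator, so global state is untouched
    objs_by_rel = {}
    for s, r, o, t in train_edges:
        objs_by_rel.setdefault(r, set()).add(o)
    negatives = {}
    for s, r, o, t in test_edges:
        # Sorting makes the result independent of set iteration order.
        pool = sorted(objs_by_rel.get(r, set()) - {o})
        negatives[(s, r, o, t)] = rng.sample(pool, min(q, len(pool)))
    return negatives

train = [("u1", "install", "app1", 1), ("u2", "install", "app2", 2),
         ("u3", "install", "app3", 3), ("u1", "update", "app1", 4),
         ("u2", "update", "app4", 5)]
test = [("u1", "install", "app2", 6), ("u3", "update", "app1", 7)]
run_a = pregenerate_negatives(train, test, q=2, seed=42)
run_b = pregenerate_negatives(train, test, q=2, seed=42)
```

Two calls with the same seed return the same negatives, and the true object is never among them.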

The benchmark's research artifacts are publicly available.

TGB 2.0 constitutes a significant step toward robust, fair, and reproducible evaluation of temporal graph learning techniques at previously unattainable scale (Gastinger et al., 2024).
