Link Prediction Benchmarks

Updated 9 May 2026

Link prediction benchmarks are systematically constructed datasets with clear protocols for graph splits, negative sampling, and evaluation, enabling rigorous comparisons.
They highlight key pitfalls such as suboptimal baseline tuning, inconsistent data splits, and unrealistic negative sampling that skew performance metrics.
Recent advances introduce personalized, heuristic-informed approaches like HeaRT and domain-specific benchmarks to improve robustness and real-world relevance.

Link prediction benchmarks are systematically constructed datasets and evaluative protocols enabling rigorous, reproducible, and meaningful comparison of link prediction algorithms across network types, modalities, and tasks. A high-quality benchmark defines clear protocols for graph splitting, negative sampling, metric reporting, and baseline tuning, and exposes the subtleties—structural, statistical, and algorithmic—that influence model rankings and real-world generalization. Recent research highlights the need for unified standards, challenging negative sampling protocols, domain-specific benchmarks, and open evaluation practices in order to advance methodologically robust link prediction research.

1. Fundamental Principles and Pitfalls in Link Prediction Benchmarking

Link prediction benchmarking hinges on rigorous comparison between models, requiring strict control of experimental variables. Historical practice has suffered from three key pitfalls:

Suboptimal Baseline Tuning: Many “classical” methods (e.g., GCN, GAE) have been reported with sub-par performance because of inconsistent or inadequate hyperparameter search. When all methods are retuned over a unified grid, these baselines often match or outperform newer, more complex methods, and some retuned models (e.g., SAGE, Neo-GNN) improve by up to 8–9 points in Hits@K on ogbl-collab (Li et al., 2023).
Inconsistent Data Splits and Evaluation Metrics: Variations in splitting schemes (e.g., 85/5/10% vs. 70/10/20% train/val/test for Cora/Citeseer), non-standard inclusion/exclusion of validation edges (e.g., ogbl-collab), and nonuniform use of metrics (AUC, MRR, Hits@K) make cross-paper comparisons unreliable (Li et al., 2023).
Unrealistic Negative Sampling: Evaluations typically draw easy negatives: node pairs sampled uniformly at random from the whole graph, most of which share no common neighbors. This scenario is not representative of realistic recommendation or completion queries, and it makes heuristic baselines (e.g., Common Neighbors) appear artificially strong (Li et al., 2023).

These flaws confound the ranking of methods, overstate progress, and frequently mask fundamental failure modes or data artifacts.
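The "easy negatives" pitfall can be illustrated with a small experiment. The sketch below is an illustration only, not taken from the cited work: it builds a toy sparse random graph (an assumption; real benchmark graphs have more clustering) and measures what fraction of uniformly sampled non-edges have zero common neighbors. The helper names `random_sparse_graph` and `fraction_zero_cn` are hypothetical.

```python
import random

def random_sparse_graph(num_nodes, num_edges, seed=0):
    """Build an undirected toy graph with uniformly random edges (assumed setup)."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(num_nodes)}
    count = 0
    while count < num_edges:
        u, v = rng.sample(range(num_nodes), 2)  # two distinct nodes
        if v not in adj[u]:
            adj[u].add(v)
            adj[v].add(u)
            count += 1
    return adj

def fraction_zero_cn(adj, num_samples, seed=1):
    """Fraction of uniformly sampled non-edges that share no common neighbor."""
    rng = random.Random(seed)
    nodes = list(adj)
    zero = 0
    drawn = 0
    while drawn < num_samples:
        u, v = rng.sample(nodes, 2)
        if v in adj[u]:
            continue  # skip existing edges: we only want negatives
        drawn += 1
        if not (adj[u] & adj[v]):
            zero += 1
    return zero / num_samples

adj = random_sparse_graph(num_nodes=2000, num_edges=5000)
frac = fraction_zero_cn(adj, num_samples=5000)
print(f"random negatives with zero common neighbors: {frac:.1%}")
```

On a graph this sparse, nearly all random negatives score zero under Common Neighbors, so any positive with at least one shared neighbor is trivially ranked above them.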

2. Best Practices: Splits, Metrics, Negative Sampling

To enable fair, unified comparison, the following benchmark design conventions are recommended:

Splitting Protocols: Use dataset-specific, fixed splits: for Planetoid graphs, a single 85/5/10% split; for OGB graphs, the official splits (e.g., ogbl-collab 92/4/4, ogbl-ppa 70/20/10). Always include or exclude validation edges consistently across methods for each dataset (Li et al., 2023).
Evaluation Metrics:
- AUC:

$\text{AUC} = \frac{1}{|\mathcal{D}^+||\mathcal{D}^-|}\sum_{i\in\mathcal{D}^+}\sum_{j\in\mathcal{D}^-}\mathbf{1}[f(x_i) > f(x_j)]$

where $f$ is the predictor and $\mathcal{D}^+$ / $\mathcal{D}^-$ are the positive/negative test edges.
- MRR:

$\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^N \frac{1}{\mathrm{rank}_i}$

with $\mathrm{rank}_i$ the position of the true edge among all negatives for the $i$-th query.
- Hits@K:

$\mathrm{Hits@}K = \frac{1}{N}\sum_{i=1}^N\mathbf{1}[\mathrm{rank}_i\le K]$

Report all three, with MRR and Hits@K emphasized for highly imbalanced settings where AUC may be uninformative (Li et al., 2023, Mara et al., 2020).
Negative Sampling: Avoid random, global pools. Adopt personalized, hard negative sets, as in HeaRT, which draws negatives from the corruption set of each test edge and uses multiple heuristics (Resource Allocation, Personalized PageRank, cosine similarity) to select those most likely to be confused with positives (Li et al., 2023).
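The three metrics defined above can be computed in a few lines of NumPy. This is a minimal sketch under stated assumptions: each positive query comes with its own fixed-size array of negative scores, and ties are resolved pessimistically (a negative with an equal score counts as ranked ahead). The function names are illustrative, not from any cited library.

```python
import numpy as np

def auc_score(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive scores strictly higher."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float).ravel()[None, :]
    return float((pos > neg).mean())

def ranks_from_scores(pos_scores, neg_scores):
    """Rank of each positive among its own negatives.

    pos_scores: shape (N,); neg_scores: shape (N, M), row i holds the
    negatives for query i. Ties are counted against the positive (assumption).
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)
    return 1 + (neg >= pos).sum(axis=1)

def mrr(ranks):
    """Mean reciprocal rank."""
    return float((1.0 / np.asarray(ranks)).mean())

def hits_at_k(ranks, k):
    """Fraction of queries whose true edge is ranked within the top k."""
    return float((np.asarray(ranks) <= k).mean())

# Toy example: 3 queries, 3 negatives each.
pos = [0.9, 0.4, 0.7]
neg = [[0.1, 0.8, 0.3],
       [0.5, 0.6, 0.2],
       [0.1, 0.2, 0.3]]
ranks = ranks_from_scores(pos, neg)
print(ranks, mrr(ranks), hits_at_k(ranks, 1), auc_score(pos, neg))
```

In this toy example the second query is outranked by two of its negatives, which lowers MRR and Hits@1 noticeably while the pooled AUC stays comparatively high; this is the imbalance effect the recommendation above refers to.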

3. Advances in Realism: Hard Negative Sampling (HeaRT) and Modern Evaluation

A step-change in benchmark robustness is realized via protocols such as HeaRT ("Heuristic Related Sampling Technique"). For each positive edge $(u,v)$, the protocol constructs a personalized corruption set $S(u,v)$ of all edges obtained by replacing one endpoint of $(u,v)$, then ranks the candidate negatives using a combination (e.g., Borda count) of strong local (Resource Allocation), global (PPR), and feature-driven heuristics (Li et al., 2023). Empirically, this shift dramatically lowers reported state-of-the-art test scores and changes model rankings:

Dataset	Standard AUC (avg top-3)	HeaRT MRR (avg top-3)
Cora	∼96%	∼17%
Citeseer	∼97%	∼25%
Pubmed	∼99%	∼10%

The HeaRT approach decreases run-to-run metric variance by 50–90% relative to random sampling, thus yielding both more difficult and more stable benchmarks (Li et al., 2023).
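The selection step described in this section can be sketched as follows. This is a simplified illustration, not the HeaRT reference implementation: it uses Resource Allocation and Common Neighbors as the two heuristics (Common Neighbors stands in here for the PPR and feature-similarity heuristics named above), draws from a single pooled corruption set, and combines rankings with a tie-aware Borda count. All function names are hypothetical.

```python
from bisect import bisect_left

def resource_allocation(adj, a, b):
    """Sum of 1/degree over the common neighbors of a and b."""
    return sum(1.0 / len(adj[z]) for z in adj[a] & adj[b])

def common_neighbors(adj, a, b):
    return float(len(adj[a] & adj[b]))

def corruption_set(adj, u, v):
    """Non-edges obtained by replacing exactly one endpoint of (u, v)."""
    cands = set()
    for w in adj:
        if w in (u, v):
            continue
        if w not in adj[u]:
            cands.add((u, w))  # corrupt the second endpoint
        if w not in adj[v]:
            cands.add((w, v))  # corrupt the first endpoint
    return sorted(cands)

def borda_points(scores):
    """Points = number of candidates with a strictly lower score (ties share points)."""
    ordered = sorted(scores)
    return [bisect_left(ordered, s) for s in scores]

def hard_negatives(adj, u, v, k, heuristics):
    """Pick the k corrupted edges ranked highest by the combined heuristics."""
    cands = corruption_set(adj, u, v)
    total = [0] * len(cands)
    for h in heuristics:
        pts = borda_points([h(adj, a, b) for a, b in cands])
        total = [t + p for t, p in zip(total, pts)]
    ranked = sorted(zip(cands, total), key=lambda ct: -ct[1])
    return [c for c, _ in ranked[:k]]

# Toy training graph; (0, 1) is a held-out positive test edge.
edges = [(0, 2), (1, 2), (2, 3), (3, 4), (4, 5), (1, 3)]
adj = {n: set() for n in range(6)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

negs = hard_negatives(adj, 0, 1, k=2, heuristics=[resource_allocation, common_neighbors])
print(negs)
```

The selected negatives are the corrupted pairs that share a neighbor with the fixed endpoint, which are exactly the ones a Common Neighbors baseline can no longer separate from positives for free.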

4. Specialized and Emerging Domains: Hierarchical, Bipartite, Hypergraph, and Fairness-Aware Benchmarks

Domain-specific challenges have driven the creation and adoption of benchmarks with distinct topology, attribute, and task properties:

Hierarchical/Treelike Graphs: The TeleGraph benchmark tests link prediction in sparse, nearly acyclic hierarchies (density ≈ $f$ 0, Gromov hyperbolicity ≈ 0, nodes per level $f$ 1– $f$ 2), with attributes of dimension 240. Heuristic baselines (CN, AA) collapse to random, while GNNs leveraging node attributes and subgraph structure (SEAL, GCN-AE) show significant gains (Zhou et al., 2022).
Bipartite Networks: Protocols designed for bipartite graphs (e.g., Setur, MovieLens, LastFM) must avoid per-link symmetry assumptions and apply per-node stratified train/test splits. GCN-based recommenders (DiffRec, NGCF, LightGCN) excel in these settings, well above random-walk or structural heuristics, but speed-accuracy tradeoffs favor Local Paths or Katz for large-scale, denser datasets (Özer et al., 2024).
Relational Hypergraphs: In link prediction with $f$ 3-ary (with $f$ 4) interactions, the gold standard is a filtered ranking protocol (MRR, Hits@K) with hard-masked positives, paired with models that capture higher-order symmetries. The Hypergraph Conditional Network (HCNet) architecture reports the highest MRR and Hits@K on both inductive and transductive benchmarks, outperforming prior GNN and tensor-factorization baselines (e.g., on WP-IND, HCNet: 0.414 MRR vs. best comparator 0.200) (Huang et al., 2024).
Fairness and Topological Bias: The TopoFair suite explicitly quantifies topological bias (assortativity, neighborhood heterogeneity, structural flow) in link-prediction benchmarks and enables testing of classical/fairness-aware models (e.g., Node2Vec, Fairwalk, CrossWalk) in settings with controlled variations of structural disparities. This exposes sensitivity of fairness metrics (statistical parity, equal opportunity) to biases beyond homophily, and demonstrates the need for systematic reporting of structural-bias statistics as part of benchmark evaluation (Marey et al., 12 Feb 2026).
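The filtered ranking protocol mentioned for relational hypergraphs can be sketched for a single query. This is a generic illustration under assumptions: candidates are scored by some model (scores are supplied as a plain dictionary here), other known correct answers are masked out before ranking, and ties count against the true answer. The function name `filtered_rank` is hypothetical.

```python
def filtered_rank(scores, true_entity, known_positives):
    """Rank of the true answer after masking other known correct answers.

    scores: dict mapping candidate entity -> model score for this query.
    known_positives: other entities that also complete the query correctly
    (from train/validation/test); they are excluded from the ranking.
    """
    true_score = scores[true_entity]
    rank = 1
    for cand, s in scores.items():
        if cand == true_entity or cand in known_positives:
            continue  # masked: never counted as outranking the target
        if s >= true_score:  # pessimistic tie handling (assumption)
            rank += 1
    return rank

scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
raw = filtered_rank(scores, "c", known_positives=set())
filt = filtered_rank(scores, "c", known_positives={"a"})
print(raw, filt)
```

Without filtering, the model is penalized for ranking another correct answer ("a") above the target; with filtering, only genuinely wrong candidates can lower the rank. The resulting ranks feed directly into MRR and Hits@K.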

5. Statistical Rigor, Dataset Hygiene, and Reproducibility

Benchmark reliability depends on avoiding artifacts such as out-of-vocabulary (OOV) contamination, split leakage, and batch effects:

OOV Entities: Benchmarks such as WN18RR, FB15K-237, YAGO3-10 often included test/validation triples with entities missing from training, giving unpredictable advantages/disadvantages to different models. Removing OOV triples (yielding starred datasets, e.g., WN18RR★) increased mean MRR by 3.29% absolute, and altered reported relative model rankings, with statistical significance $f$ 5 (Demir et al., 2021).
Publishing and Standardization: Modern practice requires release of cleaned datasets, split scripts, negative-sampling routines, and hyperparameter grids for every reported experiment. Examples include the HeaRT code/data (Li et al., 2023) and starred OOV-cleaned KGs (Demir et al., 2021).
Variance Reporting: Reliable methods report mean ± std over seeds/runs and make metric variance a first-class figure of merit, not just mean values (Li et al., 2023).
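An automated OOV filter of the kind described above can be very small. The sketch below assumes triples are `(head, relation, tail)` tuples and drops any held-out triple whose head or tail never appears in training; it illustrates the idea and is not the cleaning script used to produce the starred datasets.

```python
def remove_oov_triples(train, heldout):
    """Keep only held-out triples whose head and tail both occur in training."""
    seen = set()
    for head, _, tail in train:
        seen.add(head)
        seen.add(tail)
    kept = [(h, r, t) for h, r, t in heldout if h in seen and t in seen]
    removed = len(heldout) - len(kept)
    return kept, removed

train = [("a", "likes", "b"), ("b", "likes", "c")]
test = [("a", "likes", "c"), ("a", "likes", "z"), ("y", "likes", "b")]
clean_test, n_removed = remove_oov_triples(train, test)
print(clean_test, n_removed)
```

Reporting the number of removed triples alongside the cleaned split makes it easy for others to verify that they are evaluating on the same data.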

6. Cross-Benchmark Insights, Progress, and Open Questions

Meta-analyses of network embedding and GNN link-prediction methods have demonstrated that:

Unified, Standardized Pipelines (e.g., EvalNE (Mara et al., 2020)) are essential for controlling implementation, hyperparameter, and negative-sampling sources of variation, enabling “apples-to-apples” assessment.
No Single Method is Universally Superior: Heuristic or matrix-factorization methods often match or surpass deep neural designs, especially under robust negative sampling or on certain graph structures (Mara et al., 2020, Vlaskin et al., 2024).
True Progress is Thin: Apparent state-of-the-art gains can often be traced to narrow hyperparameter searches for baselines, evaluation on easy negatives, or OOV contamination. When robust protocols (personalized negatives, OOV-cleaned data, strict filtered splits) are applied, many complex methods lose their edge (Li et al., 2023, Demir et al., 2021).
Synthetic Benchmarks with Theoretical Predictability Bounds enable principled comparison of algorithmic efficiency and task hardness. For motif- and block-structured graphs, theoretical AUC upper bounds allow contextualization of any method's performance relative to the information-theoretic limit (Vlaskin et al., 2024).
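To make the idea of a theoretical AUC ceiling concrete, consider the simplest block-structured case. The sketch below is an illustrative derivation, not the bound from the cited work: it assumes a two-type stochastic block model in which a fraction `frac_intra` of node pairs are intra-block with edge probability `p_in`, the rest are inter-block with probability `p_out`, the ideal predictor scores each pair by its true edge probability, and ties earn half credit (unlike the strict-inequality AUC defined earlier). All probabilities are assumed to lie strictly between 0 and 1.

```python
def optimal_auc_two_block(p_in, p_out, frac_intra):
    """Expected AUC of the ideal scorer on a two-type block model (assumed setup)."""
    w = frac_intra
    pos_mass = w * p_in + (1 - w) * p_out              # P(pair is an edge)
    neg_mass = w * (1 - p_in) + (1 - w) * (1 - p_out)  # P(pair is a non-edge)
    a = w * p_in / pos_mass        # P(intra | edge)
    b = w * (1 - p_in) / neg_mass  # P(intra | non-edge)
    if p_in == p_out:
        return 0.5  # every pair gets the same score: pure ties
    # The positive wins when it falls in the higher-probability pair type
    # and the negative falls in the other one.
    wins = a * (1 - b) if p_in > p_out else (1 - a) * b
    ties = a * b + (1 - a) * (1 - b)  # both pairs of the same type
    return wins + 0.5 * ties

bound = optimal_auc_two_block(p_in=0.5, p_out=0.1, frac_intra=0.5)
print(round(bound, 4))
```

Under these assumptions no scorer that only knows block membership can exceed this value in expectation, so a method reporting an AUC close to it on such a synthetic graph has little headroom left, whatever its absolute score.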

7. Recommendations for Future Link Prediction Benchmarks

Research synthesized from leading works yields the following practice guidelines:

Standardize Splits and Metrics: Single fixed public splits (with scripts), strict filtered MRR, Hits@K, and AUC; always specify inclusion of validation edges (Li et al., 2023, Demir et al., 2021).
Challenging Negative Sampling: Use personalized, heuristic-informed negatives (e.g., HeaRT), or, in knowledge-graph settings, type-matched or full-entity ranking (Ott et al., 2024); release code/sets.
Baselines and Variance: Tune all methods, baselines included, over the same hyperparameter search grid; report mean ± std over seeds; include classical and trivial baselines to expose protocol flaws (Li et al., 2023, Mara et al., 2020).
OOV/Leakage Checks: Guarantee that all test and validation entities appear in training; implement automated OOV filters (Demir et al., 2021).
Domain Coverage: Actively develop and evaluate on hierarchical, bipartite, higher-order, and fairness-focused benchmarks in addition to classical social/citation graphs (Zhou et al., 2022, Özer et al., 2024, Huang et al., 2024, Marey et al., 12 Feb 2026).
Transparency and Reproducibility: Publicly release full code, pre/postprocessed data, split policies, and negative-sampling modules for every experiment (Li et al., 2023, Vlaskin et al., 2024).
Statistical Reporting: Report per-run variation, ablation/sensitivity studies (especially for hard-negative protocols), and, where possible, compare against theoretical performance limits (Li et al., 2023, Vlaskin et al., 2024).
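The "single fixed public split with scripts" guideline can be met with a deterministic split function. This is a minimal sketch assuming an undirected edge list and the 85/5/10 proportions mentioned earlier for Planetoid graphs; the function name and default seed are illustrative choices, and official OGB splits should be used as released rather than regenerated this way.

```python
import random

def fixed_edge_split(edges, val_frac=0.05, test_frac=0.10, seed=0):
    """Deterministically split undirected edges into train/val/test lists."""
    # Canonicalize so (u, v) and (v, u) are one edge and input order is irrelevant.
    canonical = sorted({tuple(sorted(e)) for e in edges})
    rng = random.Random(seed)  # local RNG: does not touch global random state
    rng.shuffle(canonical)
    n = len(canonical)
    n_test = int(round(n * test_frac))
    n_val = int(round(n * val_frac))
    test = canonical[:n_test]
    val = canonical[n_test:n_test + n_val]
    train = canonical[n_test + n_val:]
    return train, val, test

edges = [(i, i + 1) for i in range(100)]
train, val, test = fixed_edge_split(edges)
print(len(train), len(val), len(test))
```

Because the edges are canonicalized and sorted before shuffling, the same seed reproduces the same split regardless of how the input edge list was ordered, which is what makes a published seed plus script sufficient for others to recover the exact partition.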

A benchmark adhering to these principles more accurately illuminates the strengths and shortcomings of link prediction algorithms, provides a high-fidelity assessment of real-world readiness, and lays a reproducible foundation for robust research advances.