
RelBench Datasets Benchmark

Updated 9 October 2025
  • RelBench Datasets are a standardized benchmark suite for machine learning on multi-table, temporally rich relational databases, ensuring reproducible evaluations.
  • It defines 30 predictive tasks across varied domains using clear metrics like AUROC, MAE, and MAP@K to benchmark performance rigorously.
  • The framework supports diverse input representations—tabular, homogeneous and heterogeneous graphs—for comparing GNNs, RDL methods, LLMs, and relational transformers.

RelBench is a standardized benchmark suite and testbed for machine learning on relational databases, with particular emphasis on predictive tasks encountered in multi-table, temporally structured, real-world data. Its central function is to enable rigorous, reproducible, and interpretable evaluation of deep learning models—especially graph neural networks (GNNs), relational deep learning (RDL) methods, LLMs, and emerging foundation models—on tasks that fully leverage the relational and temporal signal inherent in interconnected databases. RelBench provides a suite of real-world datasets spanning diverse domains, precisely defined predictive tasks, and standardized evaluation metrics, thus serving as a foundational infrastructure for research on deep learning in relational data environments.

1. Composition and Domain Coverage of RelBench Datasets

RelBench comprises seven publicly available relational databases from domains such as e-commerce (rel-amazon, rel-hm, rel-avito), social/activity platforms (rel-stack, rel-event), clinical trials (rel-trial), and sports analytics (rel-f1). The databases are characterized by:

  • Entity scales ranging from approximately 74,000 objects (rel-f1, motorsport) to over 41 million entities (rel-amazon, rel-hm, e-commerce scenarios).
  • Relational complexity comprising 3–15 tables and 15–140 columns per dataset, with rich foreign–primary key linkages modeling real transactional or event histories.
  • Temporal richness: Data spans from periods of weeks (e.g., social events) to multiple decades (e.g., longitudinal e-commerce or medical data), supporting temporally rigorous forecasting and recommendation tasks.
  • Heterogeneous data types: Including numerical, categorical, textual, and datetime fields, mirroring the diversity and non-uniformity typical of operational databases.

Each dataset is released in a fully normalized relational format, preserving original schema semantics and all inter-table dependencies needed to perform multi-hop predictive modeling.

2. Predictive Task Definitions and Metric Suite

RelBench defines thirty benchmarking tasks over its databases, classified into three principal types:

  • Entity-wise binary classification: For flagging future occurrence of discrete events (e.g., customer churn detection), evaluated using Area Under the Receiver Operating Characteristic Curve (AUROC).
  • Entity-wise regression: For predicting continuous targets (e.g., customer/item lifetime value, race placements), assessed with mean absolute error (MAE).
  • Link prediction / recommendation: Where the objective is to produce a ranked list of target entities given a source entity—canonical in event, recommendation, or matching systems—evaluated via mean average precision at cutoff K (MAP@K).
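
The MAP@K metric used for the link-prediction tasks can be sketched as follows. This is a minimal illustrative implementation, not RelBench's own evaluation code; in particular, normalizing each entity's average precision by min(|relevant|, K) is a common convention that RelBench's exact definition may or may not share.

```python
def average_precision_at_k(ranked, relevant, k):
    """Average precision at cutoff k for one source entity.

    ranked: predicted target IDs, best first; relevant: set of true targets.
    Normalization by min(|relevant|, k) is an assumed convention.
    """
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for i, item in enumerate(ranked[:k]):
        if item in relevant:
            hits += 1
            score += hits / (i + 1)  # precision at this rank
    return score / min(len(relevant), k)

def map_at_k(rankings, ground_truths, k):
    """Mean of per-entity average precision over all prediction instances."""
    aps = [average_precision_at_k(r, g, k) for r, g in zip(rankings, ground_truths)]
    return sum(aps) / len(aps)
```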

All tasks are aligned to a seed time t_seed for each prediction instance, enforcing temporal splits (train/val/test) that prevent leakage—i.e., only information strictly prior to t_seed may be used as input. Task definitions are anchored to a “training table” that exposes the entity ID, temporal context, and label(s), while all relational information is accessed via deterministic graph traversal linked to the seed entity.
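
The seed-time restriction amounts to a strict temporal filter over every input table. A minimal pandas sketch (column name `timestamp` is a hypothetical stand-in for a dataset's actual time column):

```python
import pandas as pd

def temporal_input(table: pd.DataFrame, t_seed: pd.Timestamp,
                   time_col: str = "timestamp") -> pd.DataFrame:
    """Keep only rows strictly before the seed time, so no information
    from t_seed or later can leak into the model's input."""
    return table[table[time_col] < t_seed]
```

Applying this filter to every table before feature computation or subgraph extraction is what makes the train/val/test splits leakage-free.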

3. Data Representation and Graph Construction

RelBench explicitly supports multiple machine learning paradigms by representing each relational database using three inter-compatible modalities:

| Interface Type | Input Representation | Supported Methods |
| --- | --- | --- |
| Tabular | Single-table (“flattened”) | Tabular models (e.g., LightGBM) |
| Homogeneous graph | Uniform node/edge types | Plain GNNs, basic message passing |
| Heterogeneous graph | Typed nodes and edges | RDL, relational GNNs, graph-based LLMs |

  • Heterogeneous graph construction: Each table row is mapped to a node, and all foreign–primary key relationships are mapped to typed, directed edges. This allows GNNs and RDL models to propagate information according to schema-defined structure.
  • Temporally aware subgraph extraction: For each prediction, the input subgraph contains only entities and events preceding t_seed.
  • Tabular “flattening”: For comparison with classical ML, denormalization is performed via relational joins, though this process may lose multi-hop relational dependencies.
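
The row-to-node, foreign-key-to-edge mapping can be sketched with plain Python structures (in practice RelBench materializes such graphs in a framework such as PyTorch Geometric; the dict-based schema below is purely illustrative):

```python
def build_hetero_graph(tables, fkeys):
    """Build a typed graph from relational tables.

    tables: {table_name: list of row dicts, each with a primary key "id"}.
    fkeys:  {(src_table, column): dst_table} foreign-key links.
    Returns typed nodes keyed by (table, id) and typed directed edges
    (src_node, edge_type, dst_node), one edge per foreign-key reference.
    """
    nodes = {(t, row["id"]): row for t, rows in tables.items() for row in rows}
    edges = []
    for (src, col), dst in fkeys.items():
        for row in tables[src]:
            if row.get(col) is not None:
                edges.append(((src, row["id"]), f"{src}.{col}", (dst, row[col])))
    return nodes, edges
```

Because edges carry the (table, column) type, downstream models can apply relation-specific message passing rather than treating all links uniformly.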

This flexibility enables extensive comparison between approaches that exploit relational structure and those that merely aggregate features.

4. Modeling Approaches and Baseline Evaluations

RelBench enables comparative analysis of both traditional and deep learning algorithms:

  • Feature engineering + tabular models: As a gold-standard prior, extensive manual feature engineering (via SQL and pandas pipelines) is combined with models such as LightGBM.
  • Relational Deep Learning (RDL): RelBench is designed to evaluate GNN–deep tabular model hybrids. Initial node embeddings are generated via deep tabular architectures (e.g., ResNet-style encoders), then refined using heterogeneous GNNs (e.g., GraphSAGE with sum aggregation) across the constructed relational graph.
  • LLMs: Tasks can be recast by serializing relational context as JSON or text for LLMs. Metric-aware inference—e.g., using token probabilities for the positive class (y = 1) and training lightweight MLP heads for regression—adapts the scoring procedure for AUROC and MAE.
  • Relational Transformers: Emerging architectures, such as the Relational Transformer (RT), leverage cell-level tokenization, schema-aware embeddings, and multiple relational attention modes.
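
The LLM recipe above starts by serializing an entity and its pre-seed-time context into a document. A toy sketch (field names and the ISO-string timestamp convention are illustrative assumptions, not RelBench's format):

```python
import json

def serialize_entity(entity, neighbors, t_seed):
    """Serialize an entity plus its related rows into a JSON document an
    LLM can consume; rows at or after the seed time are dropped.
    Timestamps are assumed to be ISO-8601 strings, so string comparison
    matches chronological order."""
    doc = {
        "entity": entity,
        "context": [n for n in neighbors if n.get("timestamp", "") < t_seed],
        "as_of": t_seed,
    }
    return json.dumps(doc, indent=2, sort_keys=True)
```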

A detailed user study demonstrates that RDL achieves competitive or superior predictive accuracy to expert-engineered tabular pipelines (winning on 11/15 tasks), reduces human workload by >96%, and minimizes code overhead by >94% (Robinson et al., 29 Jul 2024). For LLMs, competitive or state-of-the-art results are achieved on entity classification and regression when documents are appropriately denormalized and metric-aware inference is applied (Wydmuch et al., 18 Nov 2024). RT achieves 94% of fully supervised AUROC in zero-shot transfer with a compact 22M parameter model, substantially outperforming massive LLMs in zero-shot resource efficiency (Ranjan et al., 7 Oct 2025).
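
The core of the RDL baseline—per-row embeddings refined by neighborhood aggregation—can be sketched in miniature. This toy layer omits the learned weight matrices and nonlinearities of a real GraphSAGE layer and keeps only the sum-aggregation structure:

```python
def sage_sum_layer(h, edges):
    """One GraphSAGE-style layer with sum aggregation (toy sketch).

    h: {node: embedding as list[float]} (e.g., from a tabular row encoder);
    edges: iterable of (src, dst) pairs; messages flow src -> dst.
    Each node's new embedding is its own vector plus the sum of its
    in-neighbors' vectors. Learned weights/nonlinearities are omitted.
    """
    dim = len(next(iter(h.values())))
    agg = {v: [0.0] * dim for v in h}
    for src, dst in edges:
        agg[dst] = [a + x for a, x in zip(agg[dst], h[src])]
    return {v: [x + a for x, a in zip(h[v], agg[v])] for v in h}
```

Stacking such layers over the heterogeneous graph lets an entity's representation absorb multi-hop relational context before a task head produces the prediction.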

5. Technical Innovations in Representation and Attention

RelBench facilitates research into advanced architectures via its explicit and detailed data representations:

  • Cell-level tokenization (Editor’s term): In RT, each cell is a token embedding combining value, column, and table identifiers, with semantic enrichment by frozen text encodings. This supports fine-grained schema generalization and facilitates masked token prediction objectives for both pretraining and finetuning.
  • Relational attention operators: RT extends vanilla self-attention to include:
    • Column attention: Token attends over same-column tokens, encoding intra-attribute statistics.
    • Feature attention: Token attends over co-row tokens and parent-entity feature clusters (i.e., across foreign-key links).
    • Neighbor attention: Token attends over child entities (reverse links).
    • These attention modes are encoded via binary masks M within scaled dot-product attention:

$$\text{Attention}(Q, K, V; M) = \operatorname{Softmax}\left( \frac{QK^\top \odot M}{\sqrt{d_K}} \right) V$$

facilitating schema-structured message passing and supporting transfer across heterogeneous databases.

  • Temporal subgraphs: All graph-based methods are explicitly restricted to use only information preceding t_seed to prevent any causal leakage.
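
The masked attention formula above can be implemented literally in a few lines of NumPy. Note that the formula as stated multiplies scores elementwise by the binary mask, so masked pairs get a score of 0 (not -inf) before the softmax; the more common additive-mask variant would set them to -inf instead, and which convention RT uses in practice may differ from this literal reading:

```python
import numpy as np

def masked_attention(Q, K, V, M):
    """Scaled dot-product attention with a binary relational mask M,
    implementing Softmax((QK^T ⊙ M) / sqrt(d_K)) V literally:
    disallowed pairs (M == 0) have their scores zeroed before softmax."""
    d_k = K.shape[-1]
    scores = (Q @ K.T * M) / np.sqrt(d_k)
    # numerically stable softmax over the last axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

By supplying different masks M (same-column pairs, co-row pairs, foreign-key parents, reverse links), the same operator realizes column, feature, and neighbor attention.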

6. Practical Significance and Research Directions

The release of RelBench establishes a rigorous, reproducible foundation for evaluating advances in deep relational modeling:

  • Elimination of manual feature engineering: End-to-end learned RDL and RT remove the need for expert-crafted SQL, drastically lowering the barrier to experimentation on new datasets and tasks.
  • Sample-efficient adaptation: Pretrained RT models fine-tune to new tasks and schemas with 10–100× fewer samples and steps than baseline methods (Ranjan et al., 7 Oct 2025).
  • User impact studies: Extensive user studies demonstrate both labor reduction and the ability to match/exceed expert baselines.
  • Extensibility: RelBench facilitates multi-task representation learning, pretraining, and potentially federated benchmarking on complex and diverse relational sources.

Current research includes optimizing output heads for regression, developing alternative temporal and contextual sampling regimes, integrating LLMs with graph and tabular representations, and systematizing the evaluation of cross-domain generalization.

7. Comparative Tools and Impact on the Broader Benchmark Ecosystem

While RelBench is purpose-built for relational databases and predictive modeling in operational business and science settings, there exist complementary tools and benchmarks:

  • BenchMake (Barnard, 29 Jun 2025) applies nonnegative matrix factorization to partition statistical “edge cases” into test sets for robust evaluation across modalities; its methodology improves reproducibility and test-set difficulty, and could complement RelBench in stress-testing deep relational models.
  • RecBench-MD (Liu et al., 29 Aug 2025) evaluates foundation models for recommendation across domain-diverse datasets, providing an orthogonal framework focusing on user–item matching and cross-domain generalization in less normalized, recommendation-oriented benchmarks.

RelBench’s rigorously defined splits, clear temporal ordering, and relational representation standards distinguish it for foundational research, especially where schema, temporal context, and multi-table reasoning are critical.


RelBench datasets represent a major advance in the standardization and comprehensiveness of benchmarks for predictive modeling on relational databases, supporting both novel deep learning architectures and classical methods, and catalyzing research at the intersection of relational data science and machine learning.
