Graph Classification Benchmark (GCB)

Updated 31 May 2026

Graph Classification Benchmark (GCB) is a suite of datasets, protocols, and baseline models designed for rigorous evaluation of graph-level attributes and representation learning.
It integrates diverse real-world and synthetic graphs, offering standardized splits and strong baselines from simple MLPs to advanced GNNs with pooling and skip-connections.
GCB emphasizes hybrid methodologies that combine engineered features with neural embeddings, guiding best practices for hyperparameter tuning and robust performance assessment.

The Graph Classification Benchmark (GCB) is a suite of datasets, protocols, and baseline models established for rigorous evaluation and comparison of graph classification architectures. GCB frameworks are central in the development and assessment of methods ranging from simple structure-blind multilayer perceptrons to advanced graph neural networks (GNNs) with pooling, skip-connections, and hybrid feature integration. Unlike traditional node-level tasks, these benchmarks foreground the learning and discrimination of graph-level attributes and inter-graph structure, providing controlled experimental settings with both real-world and synthetic data sources (Luzhnica et al., 2019, Dyer et al., 20 Dec 2025, Ferber et al., 2019).

1. Reference Datasets and Benchmark Structure

The primary GCB suite, as defined by Luzhnica et al. (2019), uses established real-world datasets with fixed, shared cross-validation splits so that results are directly comparable:

REDDIT-BINARY: Social interaction graphs, with two classes corresponding to different types of Reddit discussion threads.
DD: Protein structure graphs for enzyme/non-enzyme classification.
COLLAB: Co-authorship graphs across three scientific fields.
PROTEINS: Protein function graphs distinguishing enzymes by amino-acid interaction.
Additional synthetic and large-scale datasets: Recent benchmarks extend this suite to controlled synthetic families such as Erdős–Rényi, Watts-Strogatz, Barabási–Albert, Holme–Kim, and Stochastic Block Model graphs with thousands of nodes and edges, providing a setting for stress-testing model generality (Dyer et al., 20 Dec 2025, Ferber et al., 2019).
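As a rough illustration of how such synthetic families can be produced, the sketch below generates Erdős–Rényi and Barabási–Albert graphs with the Python standard library and labels each graph by its family. The function names, the parameters (`p=0.05`, `m=3`), and the tiny graph sizes are assumptions chosen for readability; they are not the generation pipeline or the settings of the cited benchmarks, which use graphs with thousands of nodes.

```python
import random


def erdos_renyi(n, p, rng):
    """G(n, p): keep each of the n*(n-1)/2 possible undirected edges with probability p."""
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]


def barabasi_albert(n, m, rng):
    """Preferential attachment: each new node links to m distinct existing nodes,
    sampled with probability proportional to their current degree."""
    edges = []
    targets = list(range(m))  # the first new node attaches to the m seed nodes
    repeated = []             # every node appears once per incident edge
    for new in range(m, n):
        edges.extend((new, t) for t in targets)
        repeated.extend(targets)
        repeated.extend([new] * m)
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return edges


def make_dataset(graphs_per_class, n, seed=0):
    """Return (edge_list, family_label) pairs; label 0 = ER, label 1 = BA (hypothetical encoding)."""
    rng = random.Random(seed)
    data = []
    for _ in range(graphs_per_class):
        data.append((erdos_renyi(n, 0.05, rng), 0))
        data.append((barabasi_albert(n, 3, rng), 1))
    return data
```

The quadratic loop in `erdos_renyi` is acceptable for a sketch but would need a sparse sampler at benchmark scale.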

Label information varies: deterministic classes (graph family/type), regression targets (e.g., planner runtimes on IPC), and multi-output or one-vs-all binary settings.

Canonical dataset statistics are:

| Dataset | Example Size (n) | Example Size (\|E\|) | Labels |
|-----------------|------------------------|----------------------------|------------------------|
| REDDIT-BINARY | tens–hundreds | up to thousands | 2 |
| DD | hundreds–thousands | up to 10⁴ | 2 |
| COLLAB | hundreds | up to 10³ | 3 |
| PROTEINS | hundreds | up to 10³ | 2 |
| IPC (PDG/ASG) | up to 2.4 × 10⁵ | up to 3.7 × 10⁵ | regression, multi-class |
| GCB-synthetic | 5,000–10,000 | up to 1.1 × 10⁵ | 5 |

These datasets include highly imbalanced, large-scale, and directed/directed-acyclic settings (Ferber et al., 2019), challenging typical GNN and kernel-based approaches.

2. Baseline Architectures and Evaluation Protocols

GCB mandates reporting on strong, simple baselines alongside advanced architectures. Required models include:

Structure-blind MLP: Aggregates node features by summation/mean, discards graph topology entirely, and feeds the resulting vector through deep MLP layers. Serves as a lower bound for topology-agnostic attribute modeling (Luzhnica et al., 2019).
Single-layer GCN with Jumping Knowledge (JK-MLP): Incorporates one GCN layer (using $\tilde{D}^{-1/2}\hat{A}\tilde{D}^{-1/2}$ adjacency normalization), with concatenation of layerwise representations and the MLP output head. Tests the efficacy of shallow message propagation plus global pooling or JK aggregation (Luzhnica et al., 2019).
GCN(R)-MLP: As above, but with frozen random weights, highlighting the representational bias imposed solely by the architecture.
Deep Coarsening Architectures (GCN-POOL-JK): Stacks several (L=3) GCN and top-k pooling layers, with layerwise pooling and JK sum, testing depth and hierarchical abstraction, particularly for large graphs (Luzhnica et al., 2019, Dyer et al., 20 Dec 2025).
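The first two baselines can be sketched in a few lines of dependency-free Python. The code below assumes the usual convention that the normalized adjacency includes self-loops (A + I) and that D̃ is its degree matrix; the helper names, the ReLU nonlinearity, and the sum readout are illustrative choices, and the MLP head is omitted. It is a minimal sketch, not the reference implementation of the cited baselines.

```python
import math


def normalized_adjacency(n, edges):
    """Dense D^{-1/2} (A + I) D^{-1/2} for an undirected graph (self-loops assumed)."""
    a = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    for i in range(n):
        a[i][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(x, y):
    """Plain dense matrix product on lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def relu(x):
    return [[max(0.0, v) for v in row] for row in x]


def sum_readout(h):
    """Structure-blind pooling: sum node vectors into one graph vector."""
    return [sum(col) for col in zip(*h)]


def gcn_jk_embedding(n, edges, x, w):
    """One GCN layer, then a JK-style concatenation of the pooled input
    features and the pooled layer output."""
    h = relu(matmul(matmul(normalized_adjacency(n, edges), x), w))
    return sum_readout(x) + sum_readout(h)  # list concatenation
```

Calling `sum_readout(x)` alone gives the input to the structure-blind MLP, while `gcn_jk_embedding` gives the input to a JK-style head; holding `w` at its random initial value corresponds to the frozen-weight variant.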

Evaluation is performed via 10-fold cross-validation with shared splits:

$\mathrm{Acc} = \frac{1}{|\mathcal{D}_\mathrm{test}|} \sum_{i\in\mathcal{D}_\mathrm{test}} \mathbf{1}(\hat{y}_i = y_i)$

Mean and standard deviation over the folds should be reported. For certain synthetic benchmarks, balanced train/validation/test splits are provided (e.g., 80/10/10%) (Dyer et al., 20 Dec 2025). No AUC or F₁ metrics are mandatory in the original GCB, but recent extensions include these for multi-class, imbalanced, or regression settings.
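A minimal sketch of this protocol follows, assuming that "shared splits" means fold assignments fixed once by a seed and reused for every model. The function names and the majority-class baseline are hypothetical illustrations, and the folds here are not stratified.

```python
import random
import statistics
from collections import Counter


def shared_folds(num_graphs, k=10, seed=0):
    """Shuffle indices once with a fixed seed so every model sees the same k folds."""
    idx = list(range(num_graphs))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def accuracy(y_true, y_pred):
    """Fraction of test graphs whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate(fit_predict, labels, folds):
    """fit_predict(train_idx, test_idx) must return predictions for test_idx.
    Returns (mean accuracy, sample standard deviation) over the folds."""
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        preds = fit_predict(train_idx, test_idx)
        scores.append(accuracy([labels[j] for j in test_idx], preds))
    return statistics.mean(scores), statistics.stdev(scores)


def majority_class_baseline(labels):
    """Trivial reference model: always predict the most frequent training label."""
    def fit_predict(train_idx, test_idx):
        counts = Counter(labels[j] for j in train_idx)
        return [counts.most_common(1)[0][0]] * len(test_idx)
    return fit_predict
```

Passing the same `folds` object to `cross_validate` for every architecture keeps the per-fold scores paired, which is what makes the reported means comparable.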

3. Extended Benchmark Methodologies and Feature Engineering

Recent advances in GCB methodology integrate both classical graph metrics and neural embeddings:

Feature Extraction: Includes per-node and per-graph statistics such as degree, eigenvector centrality, closeness centrality, degree variance, clustering coefficient, and assortativity. Features are selected via Random Forest ensemble importance to maximize discriminative power while reducing redundancy (Dyer et al., 20 Dec 2025).
Hybrid Models: GNN representations (via GCN, GAT, GraphSAGE, GIN, GTN) are concatenated with selected global features before final classification using a multilayer perceptron. This approach systematically outperforms pure message-passing or feature-only baselines (Dyer et al., 20 Dec 2025).
Hyperparameter Optimization: Automated search (e.g., via Optuna) is recommended, sweeping hidden dimensionalities, learning rates, and dropout ratios, with strict early stopping criteria (Dyer et al., 20 Dec 2025).
Visualization: Embedding separability is confirmed via t-SNE or UMAP, with confusion matrices elucidating family or class separation and error patterns.
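The feature-extraction and hybrid steps above can be illustrated as follows. This sketch computes only three of the listed statistics (mean degree, degree variance, average clustering coefficient) and concatenates them with a placeholder embedding; the function names are assumptions, and the Random Forest selection step and the actual GNN encoder are deliberately left out.

```python
import statistics


def adjacency_sets(n, edges):
    """Undirected neighbor sets; self-loops and duplicate edges are ignored."""
    nbrs = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:
            nbrs[u].add(v)
            nbrs[v].add(u)
    return nbrs


def average_clustering(nbrs):
    """Mean local clustering coefficient; nodes with degree < 2 contribute 0."""
    total = 0.0
    for nu in nbrs:
        k = len(nu)
        if k < 2:
            continue
        # Count edges among the neighbors of this node (each pair once).
        links = sum(1 for v in nu for w in nu if v < w and w in nbrs[v])
        total += 2.0 * links / (k * (k - 1))
    return total / len(nbrs)


def global_features(n, edges):
    """Graph-level feature vector: [mean degree, degree variance, avg clustering]."""
    nbrs = adjacency_sets(n, edges)
    degrees = [len(s) for s in nbrs]
    return [statistics.mean(degrees), statistics.pvariance(degrees), average_clustering(nbrs)]


def hybrid_input(gnn_embedding, features):
    """Concatenate a pooled GNN embedding with engineered features for the MLP head."""
    return list(gnn_embedding) + list(features)
```

For a triangle, `global_features(3, [(0, 1), (1, 2), (0, 2)])` yields a mean degree of 2, zero degree variance, and a clustering coefficient of 1.0.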

4. Key Empirical Findings and Model Comparison

Empirical results on GCB reveal nuanced modeling dynamics:

| Architecture | Accuracy (%) | F1 (mean/class) | Notable Failure Modes |
|-----------------|------------------------|----------------------------|------------------------|
| GraphSAGE | 98.5 | 0.9858 | Some confusion ER vs SBM |
| GTN | 98.5 | 0.9858 | Matches SAGE with higher computational cost |
| GCN | 97.5 | 0.9759 | Comparable, but slightly lower accuracy |
| GIN | 96.3 | 0.9643 | Misses modular SBM structure |
| GATv2/GAT | 92.3/82.0 | 0.926/0.822 | Struggles with global patterns, high ER↔SBM error |
| SVM (features) | 64.7 | 0.6293 | Weak separation except for Holme–Kim |

Hybrid models combining GNN message-passing with selected features consistently surpass SVMs (pure features) and vanilla GNNs, especially on large, difficult-to-separate synthetic families (Dyer et al., 20 Dec 2025). Trade-offs appear in training time and parameter count: GraphSAGE achieves GTN-level accuracy with approximately one-third the resource footprint.
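For readers reproducing such a comparison, the sketch below shows one way to obtain a confusion matrix and a class-averaged F1 from integer predictions. Treating the "F1 (mean/class)" column as an unweighted mean of per-class F1 scores (macro F1) is an assumption, and the function names are illustrative.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts graphs of true class t that were predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_f1(cm):
    """F1 = 2*TP / (2*TP + FP + FN) for each class; 0.0 if the class never occurs."""
    scores = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(len(cm))) - tp
        fn = sum(cm[c]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    f1 = per_class_f1(cm)
    return sum(f1) / len(f1)
```

Off-diagonal entries of the matrix expose systematic confusions, such as the ER versus SBM errors noted in the table.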

Deep architectures benefit primarily from skip-connections/jumping knowledge, which act as bypasses for both activations and gradients. Deeper stacks of GCN/pooling only improve performance if initialization schemes (e.g., REINIT) explicitly maintain activation/gradient flow; otherwise, most learning occurs via shallow sub-networks or direct input features (Luzhnica et al., 2019).
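A short derivation may make the bypass argument concrete. It assumes, purely for illustration, layers of the form $h^{(l+1)} = f_l(h^{(l)})$, a per-layer readout $r$, and a JK-sum graph embedding $z$; these symbols are not taken from the cited papers. Applying the chain rule gives:

$z = \sum_{l=0}^{L} r\big(h^{(l)}\big), \qquad \frac{\partial z}{\partial h^{(0)}} = \frac{\partial r(h^{(0)})}{\partial h^{(0)}} + \sum_{l=1}^{L} \frac{\partial r(h^{(l)})}{\partial h^{(l)}}\,\frac{\partial h^{(l)}}{\partial h^{(0)}}$

The first term does not pass through any $f_l$, so it survives even when the deep Jacobians $\partial h^{(l)} / \partial h^{(0)}$ shrink toward zero. Under these assumptions, a poorly initialized deep stack can still train, but mostly through its shallow terms, which is consistent with the observation above.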

5. Dataset-Specific Characteristics and Challenges

Modern GCB datasets, notably the IPC (AI Planning) corpus, introduce several technical challenges (Ferber et al., 2019):

Heavy-tailed size distribution: Up to 238,909 nodes (ASG), with ∼40–60% of graphs in the thousands. This exceeds the scope of conventional datasets (tens–hundreds) and complicates batching, memory utilization, and graph-to-graph similarity metrics.
Directed and acyclic structure: IPC’s grounded (PDG) and lifted (ASG) graphs are both directed; only ASG is guaranteed acyclic. Many standard GNN architectures implicitly assume undirected graphs, motivating DAG-aware models.
Varied sparsity, connectivity, and diameter: The range in mean degree (≪5 in ASG, ≈12 in PDG) and connectivity (dominance of largest connected component) tests model robustness to graph fragmentation and long-range dependency.
Automated, scalable labeling: No human labeling—targets are planner runtimes or solver success/failure, facilitating scale but introducing multi-task, regression, and multi-output classification modes.

For synthetic benchmarks, class distributions and parameter overlaps drive confusion (notably Erdős–Rényi vs. SBM), setting an upper bound on separability even for expressive models (Dyer et al., 20 Dec 2025).
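One way to see why parameter overlap limits separability: in a stochastic block model, when the within-block and between-block edge probabilities coincide, every node pair is connected with the same probability, which is exactly the Erdős–Rényi model. The generator below is a minimal sketch with hypothetical names and parameters; it does not reproduce the settings of the cited benchmark.

```python
import random


def stochastic_block_model(block_sizes, p_in, p_out, rng):
    """Undirected SBM: pairs in the same block connect with probability p_in,
    pairs in different blocks with probability p_out."""
    block = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if block[u] == block[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges


def edge_density(n, edges):
    """Fraction of the n*(n-1)/2 possible undirected edges that are present."""
    return len(edges) / (n * (n - 1) / 2)
```

With `p_in == p_out` the block assignment has no effect on the edge distribution, so no classifier can distinguish such graphs from ER graphs of the same density; as the two probabilities move apart, community structure becomes detectable.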

6. Recommendations and Best Practices

Best practices for GCB construction and reporting include (Luzhnica et al., 2019, Dyer et al., 20 Dec 2025):

Inclusion of strong, simple baselines (MLP, single-layer GCN-JK, fixed-weight GCN-JK) for fair comparison.
Clear reporting of cross-validation splits, mean ± standard deviation, and—when possible—statistical significance tests.
Hybrid architectures leveraging both GNN embeddings and engineered features, especially when global structure or heterogeneity matters.
Automated hyperparameter optimization, unified training scaffolds, and explicit reporting of model capacity/cost.
Embedding visualization (t-SNE/UMAP) and confusion matrix inclusion for interpretability.
Public release of code and data (with generation pipelines) to facilitate reproducibility and extension to new settings.
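For the significance-testing recommendation, one simple option (an assumption here, since the source does not name a specific test) is a paired t-statistic over per-fold accuracies obtained on the same shared folds. The sketch below uses only the standard library and compares the statistic against the two-sided 5% critical value for 9 degrees of freedom, which matches 10 folds.

```python
import math
import statistics

# Two-sided 5% critical value of Student's t with 9 degrees of freedom (10 folds).
T_CRIT_DF9 = 2.262


def paired_t_statistic(scores_a, scores_b):
    """t = mean(d) / (stdev(d) / sqrt(k)) for per-fold differences d = a - b."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("all fold differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


def differs_at_5_percent(scores_a, scores_b):
    """Valid only for exactly 10 paired folds, because the critical value is hard-coded."""
    if len(scores_a) != 10 or len(scores_b) != 10:
        raise ValueError("expected 10 paired fold scores")
    return abs(paired_t_statistic(scores_a, scores_b)) > T_CRIT_DF9
```

Because cross-validation folds share training data, this test tends to be optimistic, so a significant result is better read as suggestive than conclusive.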

A plausible implication is that many advances in deep GNN architecture are only marginally meaningful without careful control of initialization (e.g., REINIT), skip connectivity (e.g., JK), and strong baselines. Discriminative power often emerges more from algorithmic orchestration (pooling, skips, hybridization) and feature synthesis than from depth per se.

7. Extensions and Open Directions

Contemporary GCB extensions incorporate larger and more heterogeneous graphs, such as malicious DNS datasets (PDNS-Net, 447K nodes, 900K edges), and propose further work on stratified sampling, hierarchical pooling, relation-parameter sharing for heterogeneous graphs, and adversarial robustness (Kumarasinghe et al., 2022). The scalability and domain-adaptivity of GCBs are being stressed by new real-world and synthetic benchmarks, joint regression–classification tasks, and temporal graph structures.

Continued empirical rigor—anchored in the guidelines and datasets defined by GCB—remains fundamental for measuring progress in graph-level representation learning and generalization.


References

Luzhnica et al. (2019). On Graph Classification Networks, Datasets and Baselines.

Dyer et al. (2025). Feature-Enhanced Graph Neural Networks for Classification of Synthetic Graph Generative Models: A Benchmarking Study.

Ferber et al. (2019). IPC: A Benchmark Data Set for Learning with Graph-Structured Data.

Kumarasinghe et al. (2022). PDNS-Net: A Large Heterogeneous Graph Benchmark Dataset of Network Resolutions for Graph Learning.
