
ABCD Dataset Overview

Updated 27 November 2025
  • ABCD Dataset is a multifaceted term referring to benchmark resources that include CAD models for geometric deep learning, synthetic random graphs for community detection, and datasets for Bayesian HEP background estimation.
  • In geometric deep learning, the ABC-Dataset provides detailed CAD models with precise analytic parameterizations to support tasks such as surface segmentation and normal estimation.
  • In network science and high-energy physics, the ABCD datasets offer controlled benchmarking via tunable parameters and Bayesian inference, enabling robust community detection and improved background estimation.

The term "ABCD Dataset" refers to multiple distinct objects in contemporary scientific literature, each with specific technical significance in its respective field. It principally denotes either (1) the "ABC-Dataset," a large-scale CAD model repository for geometric deep learning, or (2) the class of datasets underpinning the ABCD (Artificial Benchmark for Community Detection) family of random graph models in network science, as well as (3) event-mixture datasets for ABCD/Bayesian background estimation in high-energy physics. This article addresses these leading contexts, grounded in precise methodology, parameterization, and benchmark conventions from the primary sources.

1. The ABC-Dataset: Composition and Representation

The "ABC-Dataset" (Koch et al., 2018) comprises over one million distinct Computer-Aided Design (CAD) models, principally encoded as B-Rep constructs harvested from Onshape’s public repository. Each model is described by a set of explicitly parametrized surfaces and curves, supporting analytic generation of geometric ground-truth for different tasks.

  • Surface Patch Types: plane, cylinder, cone, sphere (quadric patches), torus, surfaces of revolution, extrusions, trimmed/full bicubic NURBS.
  • Curve Types: line, circle, ellipse, parabola, hyperbola, NURBS (rational/non-rational).
  • Parametric Forms: Each element is represented analytically; for example (see the code sketch after this list):
    • Plane: $p(u,v) = l + u\,x + v\,y$, where $l, x, y \in \mathbb{R}^3$.
    • Cylinder: $p(u,v) = l + r\cos u\,x + r\sin u\,y + v\,z$.
    • NURBS surface: $p(u,v) = \frac{\sum_{ij} N_i^p(u)\, M_j^q(v)\, w_{ij}\, P_{ij}}{\sum_{ij} N_i^p(u)\, M_j^q(v)\, w_{ij}}$.
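To make these forms concrete, here is a minimal NumPy sketch (not part of the dataset's official tooling) that evaluates the plane and cylinder parameterizations above; the frame vectors `l`, `x`, `y`, `z` and the radius `r` are illustrative placeholders.

```python
import numpy as np

def plane_point(l, x, y, u, v):
    """Evaluate the plane patch p(u, v) = l + u*x + v*y."""
    return l + u * x + v * y

def cylinder_point(l, x, y, z, r, u, v):
    """Evaluate the cylinder patch p(u, v) = l + r*cos(u)*x + r*sin(u)*y + v*z."""
    return l + r * np.cos(u) * x + r * np.sin(u) * y + v * z

# Example: a unit-radius cylinder in the standard frame at the origin.
l = np.zeros(3)        # frame origin
x, y, z = np.eye(3)    # orthonormal frame vectors
p = cylinder_point(l, x, y, z, r=1.0, u=np.pi / 4, v=0.5)
```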

Each model exposes ground-truth differential quantities (normals, curvatures, patch and feature labels), enabling robust controlled comparison between geometric processing algorithms.

2. Sampling, Annotation and Data Representation

Surface and curve parameterizations are loaded from STEP files via Open Cascade, then discretized by meshing or point sampling (using Gmsh) at varying resolutions; modes include uniform edge-length control and curvature-adaptive density.
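As a rough illustration of this stage, the snippet below drives the Gmsh Python API to mesh a STEP file with either uniform edge-length control or curvature-adaptive sizing; the file path and size values are placeholders, and exact option names can vary across Gmsh versions.

```python
import gmsh

gmsh.initialize()
gmsh.open("model.step")  # placeholder path; B-Rep geometry is loaded via OpenCASCADE

# Uniform edge-length control: clamp the target element size.
gmsh.option.setNumber("Mesh.CharacteristicLengthMin", 0.5)
gmsh.option.setNumber("Mesh.CharacteristicLengthMax", 0.5)

# Curvature-adaptive density: let local curvature drive element size (0 disables).
gmsh.option.setNumber("Mesh.CharacteristicLengthFromCurvature", 1)

gmsh.model.mesh.generate(2)  # generate a surface (2D) mesh
gmsh.write("model.stl")      # export; STL is among the raw formats listed below
gmsh.finalize()
```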

  • Sampling strategies (see the code sketch below):
    • Surface patches: for $\{(u_i, v_i)\} \subset U \times V$, sample $x_i = p(u_i, v_i)$ (uniform or adaptive).
    • Curves: sample $t_i \in [t_\mathrm{min}, t_\mathrm{max}]$ and evaluate $c(t_i)$.
  • Ground-truth labels:
    • Surface normal: $n(u,v) = \frac{p_u \times p_v}{\|p_u \times p_v\|}$.
    • Patch and triangle vertex correspondence, curvature, and feature sharpness tags.
  • Formats: STEP, Parasolid, STL (raw), OBJ (remeshed, near-equilateral), YAML (patch/curve metadata), point clouds.

Generated datasets can be extracted at multiple resolutions ($N \in \{512, 1024, 2048\}$ samples per patch/model) for benchmark scalability.
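The sampling and ground-truth rules above reduce to a few lines of code. The sketch below uniformly samples a patch's parameter domain and evaluates the normal $n = (p_u \times p_v)/\|p_u \times p_v\|$; note that the dataset's annotations come from exact analytic derivatives, whereas this illustrative version falls back to central differences for an arbitrary callable patch.

```python
import numpy as np

def sample_patch(p, u_range, v_range, n=512):
    """Uniformly sample x_i = p(u_i, v_i) over a rectangular parameter domain."""
    u = np.random.uniform(*u_range, size=n)
    v = np.random.uniform(*v_range, size=n)
    return np.array([p(ui, vi) for ui, vi in zip(u, v)]), u, v

def surface_normal(p, u, v, h=1e-6):
    """n(u, v) = (p_u x p_v) / ||p_u x p_v||, partials via central differences."""
    pu = (p(u + h, v) - p(u - h, v)) / (2 * h)
    pv = (p(u, v + h) - p(u, v - h)) / (2 * h)
    nrm = np.cross(pu, pv)
    return nrm / np.linalg.norm(nrm)

# Example: unit cylinder p(u, v) = (cos u, sin u, v) sampled at N = 512 points.
cyl = lambda u, v: np.array([np.cos(u), np.sin(u), v])
pts, us, vs = sample_patch(cyl, (0.0, 2 * np.pi), (0.0, 1.0), n=512)
n0 = surface_normal(cyl, us[0], vs[0])  # radial unit vector, up to sign
```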

3. Benchmarking and Use Cases in Geometric Deep Learning

The ABC-Dataset supports a battery of geometric learning benchmarks:

  • Patch Segmentation/Decomposition: Each CAD patch acts as a canonical segment.
  • Differential Quantity Regression: Estimation of normals and curvatures with analytic ground truth.
  • Sharp Feature Detection: Edge tags for curve/vertex singularities.
  • Shape Reconstruction: Error evaluation for mesh/pointcloud-to-surface fitting.
  • Normal Estimation Benchmark:
    • Point cloud or mesh-based methods.
    • Dataset sizes: $M \in \{10\text{k}, 50\text{k}, 100\text{k}, 250\text{k}\}$; 80% train / 20% test splits.
    • Loss: $L(n, \hat n) = 1 - (n^\top \hat n)^2$; median angle deviation $\theta = \arccos\left|n \cdot \hat n\right|$ (both implemented in the sketch after this list).
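The loss and the angular metric translate directly into code. The following is a small self-contained sketch with synthetic unit normals, not the benchmark's official evaluation script.

```python
import numpy as np

def normal_loss(n, n_hat):
    """Unoriented normal loss L = 1 - (n . n_hat)^2, averaged over points."""
    dots = np.sum(n * n_hat, axis=-1)
    return np.mean(1.0 - dots ** 2)

def median_angle_deg(n, n_hat):
    """Median unoriented angular deviation theta = arccos |n . n_hat|, in degrees."""
    dots = np.clip(np.abs(np.sum(n * n_hat, axis=-1)), 0.0, 1.0)
    return np.degrees(np.median(np.arccos(dots)))

# Toy check: predictions mildly perturbed around the ground truth.
rng = np.random.default_rng(0)
n = rng.normal(size=(1000, 3))
n /= np.linalg.norm(n, axis=1, keepdims=True)
n_hat = n + 0.05 * rng.normal(size=n.shape)
n_hat /= np.linalg.norm(n_hat, axis=1, keepdims=True)
print(normal_loss(n, n_hat), median_angle_deg(n, n_hat))
```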

Key findings:

  • Point-based learning methods outperform classical estimators on raw point clouds, but mesh connectivity enables simple analytic algorithms to outperform all existing deep network methods; e.g., uniformly weighted face normals (sketched after this list) achieve near-zero angle error at high sampling densities.
  • There is a performance gap for deep nets in leveraging connectivity, indicating a challenge for graph-based geometric learning architectures.
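To illustrate the connectivity-based baseline behind the first finding, here is a minimal sketch of per-vertex normals obtained by uniformly averaging the unit normals of incident faces; the tetrahedron mesh is a placeholder, whereas real evaluations would run over the dataset's remeshed OBJ models.

```python
import numpy as np

def vertex_normals_uniform(V, F):
    """Per-vertex normals as the uniformly weighted average of incident
    unit face normals (V: #V x 3 positions, F: #F x 3 triangle indices)."""
    fn = np.cross(V[F[:, 1]] - V[F[:, 0]], V[F[:, 2]] - V[F[:, 0]])
    fn /= np.linalg.norm(fn, axis=1, keepdims=True)  # unit face normals
    vn = np.zeros_like(V)
    for f, n in zip(F, fn):
        vn[f] += n                                   # accumulate on each corner vertex
    return vn / np.linalg.norm(vn, axis=1, keepdims=True)

# Placeholder mesh: a regular tetrahedron with outward-wound faces.
V = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float)
F = np.array([[0, 1, 2], [0, 3, 1], [0, 2, 3], [1, 3, 2]])
print(vertex_normals_uniform(V, F))
```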

4. ABCD Random Graph Dataset: Network Science Benchmarking

In network science, the "ABCD dataset" designates synthetic random graphs with tunable community structure and heterogeneity, providing precise experimental control over degree and community-size distributions (Kaminski et al., 2022, Barrett et al., 5 Jun 2025).

  • Core Model Parameters:
    • $n$: number of nodes;
    • $\gamma$: degree power-law exponent;
    • $\beta$: community-size power-law exponent;
    • $\delta$, $D$: min/max degree;
    • $s$, $S$: min/max community size;
    • $\xi$: mixing parameter (fraction of "background" or inter-community edges).
  • Variants:
    • ABCD+$o^2$ allows overlapping communities, outliers, and a tunable overlap parameter $\eta$ (average communities per node), plus geometric latent space and degree-community correlation parameter $\rho$ (Barrett et al., 5 Jun 2025).
  • Generation Pipeline (see the code sketch after this list):
    • (1) Draw node degrees $X_i \sim \mathrm{TP}(\gamma, \delta, D)$;
    • (2) Assign community sizes $Y_j \sim \mathrm{TP}(\beta, s, S)$;
    • (3) Assign nodes to communities, enforcing explicit admissibility constraints;
    • (4) Split stubs into internal (community) and external (background);
    • (5) Generate edges via the configuration model per community and background, rewiring to remove loops/multiedges;
    • (6) Optionally, overlay structure for overlaps/outliers via geometric sampling and degree correlation tuning.
  • Properties:
    • The degree and community-size sequences concentrate tightly around their target distributions.
    • Global modularity $q(C)$ of the ground-truth partition satisfies $q(C) = (1 \pm o(1))(1 - \xi)$ in the large-graph limit; $\xi$ directly tunes community detectability.
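A highly simplified, self-contained sketch of steps (1)-(5) follows, using a truncated discrete power law and NetworkX's configuration model; it replaces power-law community sizes and admissibility checks with fixed equal-size blocks, and omits the careful stub bookkeeping and rewiring of reference implementations such as ABCDe.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(42)

def trunc_power_law(exponent, lo, hi, size):
    """Sample integers k in [lo, hi] with P(k) proportional to k^(-exponent)."""
    ks = np.arange(lo, hi + 1)
    w = ks.astype(float) ** (-exponent)
    return rng.choice(ks, size=size, p=w / w.sum())

n, gamma, delta, D, xi = 1000, 2.5, 5, 50, 0.2
deg = trunc_power_law(gamma, delta, D, n)   # (1) node degrees
ext = rng.binomial(deg, xi)                 # (4) background (inter-community) stubs
internal = deg - ext                        #     internal (community) stubs
comm = np.repeat(np.arange(10), n // 10)    # (2)-(3) toy equal-size communities

G = nx.Graph()                              # simple graph: multi-edges collapse on insert
G.add_nodes_from(range(n))
for c in np.unique(comm):                   # (5) configuration model per community...
    nodes = np.flatnonzero(comm == c)
    stubs = internal[nodes]
    if stubs.sum() % 2:
        stubs[0] += 1                       # degree sum must be even
    H = nx.configuration_model(stubs, seed=1)
    G.add_edges_from((nodes[a], nodes[b]) for a, b in H.edges())
bg = ext.copy()                             # ...plus a global background graph
if bg.sum() % 2:
    bg[0] += 1
G.add_edges_from(nx.configuration_model(bg, seed=2).edges())
G.remove_edges_from(list(nx.selfloop_edges(G)))  # drop self-loops
```

With a generator like this, `nx.community.modularity(G, [set(np.flatnonzero(comm == c)) for c in np.unique(comm)])` gives a rough empirical check of how $\xi$ controls the planted partition's modularity; note that the $(1 - \xi)$ asymptotics assume many relatively small communities, which this toy block structure only approximates.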

Practical utility: These models provide fast, flexible benchmarks for scalable community detection, sharp control over mixing (via $\xi$), and naturalistic power-law node-degree and community-size distributions.

5. Datasets for ABCD Method in High-Energy Physics

In high-energy physics, particularly at the LHC, "ABCD dataset" colloquially refers to datasets supporting the ABCD method for background estimation and its Bayesian generalizations (Alvarez et al., 12 Feb 2024). Datasets in this context typically encode mixtures of signal and background events, indexed by multiple observables.

  • ABCD approach: Hard partitioning of data into four regions (A-D) using two approximately independent observables $\mathcal{O}_1, \mathcal{O}_2$, with the background expectation in signal-dominated region A estimated as $N_A^\mathrm{bkg} = \frac{N_B N_C}{N_D}$.
  • Bayesian generalization: Replacing hard region assignment with eventwise soft assignments in a $K$-component mixture model (see the sketch after this list):
    • Each event $x_n \in \mathbb{R}^D$ is assigned a latent class indicator $z_{nk}$ with mixture weights $\pi_k$;
    • Likelihood: $p(x_n \mid \pi, \theta) = \sum_{k=1}^K \pi_k \prod_{d=1}^D f_{kd}(x_{nd} \mid \theta_{kd})$;
    • Priors: Dirichlet on $\pi$, weakly informative on component parameters;
    • Posterior: $p(\pi, \theta \mid \mathbf{X}) \propto \mathcal{L}(\pi, \theta)\, p(\pi)\, p(\theta)$;
    • Soft assignments: $r_{nk} = \frac{\pi_k f_k(x_n \mid \theta_k)}{\sum_{j=1}^K \pi_j f_j(x_n \mid \theta_j)}$;
    • Inference by variational Bayes or MCMC.
  • Performance advantage: Bayesian mixture modeling exploits full mutual information among observables, generalizes to $D > 2$, improves robustness at low signal fractions, and avoids unphysical signal estimates.
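To illustrate eventwise soft assignments, the following is a minimal NumPy/SciPy sketch that evaluates the responsibilities $r_{nk}$ for a toy two-component, two-observable Gaussian mixture with diagonal covariances (so the per-dimension factorized likelihood above applies). The parameter values are placeholders standing in for fitted posteriors; a real analysis would infer them via variational Bayes or MCMC (e.g., in Pyro, as noted in Section 6).

```python
import numpy as np
from scipy.stats import norm

def responsibilities(X, pi, mu, sigma):
    """r[n, k] = pi_k f_k(x_n) / sum_j pi_j f_j(x_n), computed in log space."""
    # log f_k(x_n) = sum_d log Normal(x_nd | mu_kd, sigma_kd)
    log_f = np.stack(
        [norm.logpdf(X, loc=mu[k], scale=sigma[k]).sum(axis=1) for k in range(len(pi))],
        axis=1,
    )                                            # shape (N, K)
    log_r = np.log(pi) + log_f
    log_r -= log_r.max(axis=1, keepdims=True)    # subtract max for numerical stability
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

# Placeholder "fitted" parameters: background-like vs. signal-like component.
pi = np.array([0.95, 0.05])                      # mixture weights
mu = np.array([[0.2, 100.0], [0.8, 125.0]])      # per-dimension means (score, mass)
sigma = np.array([[0.15, 20.0], [0.05, 5.0]])    # per-dimension standard deviations

X = np.array([[0.75, 124.0], [0.10, 90.0]])      # two example events
r = responsibilities(X, pi, mu, sigma)           # soft class assignments per event
# Background/signal yields follow by summing the relevant column of r over events.
```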

6. Implementation, Annotation, and Availability

  • The ABC-Dataset (CAD) provides open-source code for loading, meshing, point sampling, and extracting ground-truth at arbitrary resolutions. Data access, processing, and workflow pipelines are provided under MIT (dataset) and GPL (processing code) licenses. The dataset is continuously updated as new public models become available (Koch et al., 2018).
  • The ABCD random-graph generators have efficient implementations (e.g., in Julia and Python) built on configuration-model and geometric/nearest-neighbor primitives; codebases such as ABCDe facilitate high-throughput experimentation (Kaminski et al., 2022, Barrett et al., 5 Jun 2025).
  • In experimental physics contexts, producing an "ABCD dataset" involves careful selection and annotation of observables (e.g., jet scores, invariant masses), modeling of class-conditional densities, and Bayesian inference via Pyro or other probabilistic programming frameworks (Alvarez et al., 12 Feb 2024).

7. Significance, Limitations, and Future Directions

The ABC-Dataset for geometric deep learning has established itself as a principal resource for robust, controlled benchmarking of geometric learning pipelines, particularly for tasks requiring analytic ground truth. Its role in highlighting the advantages and limits of learning-based versus analytic geometric estimators is notable.

In network science, ABCD/ABCD+$o^2$ models have provided theoretical clarity regarding the interplay between network heterogeneity and community detectability, modularity, and mixing, while enabling analytic and large-scale experimental reproducibility.

Bayesian dataset constructions for ABCD background estimation in HEP offer an avenue for principled signal extraction, outperforming classical region-based estimates especially in low-signal/high-background regimes and high-dimensional observable space.

A plausible implication is that these datasets, via explicit annotation, flexible parameterization, and scalable infrastructure, will continue to drive methodological advances in geometric deep learning, network science, and statistical inference for experimental physics. Systematic exploration of deep models capable of leveraging mesh and graph connectivity, or higher-order network overlaps, stands as an outstanding open direction across these fields.
