LHC Olympics Benchmark Datasets
- LHC Olympics Benchmark Datasets are standardized, open-access resources that mimic real LHC event topologies and detector effects for anomaly detection.
- They incorporate both low-level particle features and high-level jet observables, enabling model-agnostic evaluation across multiple simulation sets.
- The datasets support real-time and offline analyses, fostering fair comparisons among unsupervised, weakly supervised, and semi-supervised search strategies.
The LHC Olympics benchmark datasets are standardized, open-access resources constructed to catalyze the development and fair evaluation of novel machine learning algorithms for anomaly detection in collider physics, with direct application to the search for new phenomena at the Large Hadron Collider (LHC). These datasets, including the LHC Olympics 2020 suite and its successors, are specifically designed to mimic the statistical complexity, detector effects, and event topologies encountered in real LHC new-physics searches. They support both pointwise and group anomaly detection and are referenced extensively in recent literature as crucial community benchmarks for model-agnostic searches and the evaluation of real-time analysis strategies (Kasieczka et al., 2021, Govorkova et al., 2021, Kasieczka et al., 2021, Araz et al., 24 Jun 2025).
1. Rationale and Historical Context
The identification of statistically significant anomalies in high-dimensional collider data is central to the search for physics beyond the Standard Model (BSM). Traditional cut-based or supervised search strategies are necessarily model-dependent. The LHC Olympics datasets were established in response to the need for realistic, challenging benchmarks that could enable the development and fair cross-comparison of unsupervised and weakly supervised anomaly detection methods, unconstrained by specific BSM models (Kasieczka et al., 2021, Kasieczka et al., 2021). This initiative emerged after the discovery of the Higgs boson, reflecting a community-wide effort to move beyond searches for "expected" signatures toward methods that can reveal unexpected, rare signals embedded in a complex, high-background environment.
2. Dataset Structure and Composition
LHC Olympics 2020 Datasets
The primary LHC Olympics 2020 benchmark suite consists of:
- R&D ("Science") set: 1,100,000 events (1,000,000 QCD dijet background and 100,000 signal events from W′ → XY with X → qq̄ and Y → qq̄; mW′ = 3.5 TeV, mX = 500 GeV, mY = 100 GeV) (Kasieczka et al., 2021, Kasieczka et al., 2021, Araz et al., 24 Jun 2025).
- Black Box 1: 1,000,000 events, including 834 rare signal events (same topology as R&D set with different resonance masses), resulting in a signal fraction of approximately 0.08%.
- Black Box 2: 1,000,000 events of pure QCD background generated with Herwig++, serving as a control for assessing the false-positive rate.
- Black Box 3: 1,000,000 events, including 3,200 signal events (1,200 in one decay channel of the hidden resonance and 2,000 in a second), for a signal fraction of approximately 0.32%.
Each event may be represented both by low-level particle features (up to 700 particles per event, with (pT, η, φ) per particle and zero-padding as required) and by high-level jet features (dijet invariant mass mJJ, jet masses, N-subjettiness, etc.). A variety of file formats are provided, all with detailed metadata and provenance (Kasieczka et al., 2021, Kasieczka et al., 2021).
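The fixed-size low-level representation described above can be sketched in a few lines: variable-length particle lists are zero-padded to 700 slots of (pT, η, φ). The helper name and toy event values are illustrative, not the official loader.

```python
MAX_PARTICLES = 700
N_FEATURES = 3  # (pT, eta, phi) per particle

def pad_event(particles):
    """Zero-pad a list of (pT, eta, phi) tuples to a fixed 700 x 3 table."""
    if len(particles) > MAX_PARTICLES:
        raise ValueError("event exceeds the 700-particle limit")
    padded = [list(p) for p in particles]
    padded += [[0.0] * N_FEATURES for _ in range(MAX_PARTICLES - len(particles))]
    return padded

# Toy two-particle event
event = [(1250.3, -0.42, 1.57), (980.1, 0.88, -2.10)]
table = pad_event(event)
```

Masking the padded slots downstream (rather than treating zeros as physical particles) is the usual companion to this layout.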
40 MHz Real-Time Dataset
The "LHC physics dataset for unsupervised New Physics detection at 40 MHz" adapts these principles to the context of real-time event selection. Each event is summarized by up to 19 reconstructed objects (MET, 4 electrons, 4 muons, 10 jets, zero-padded), stored as a fixed-size array with (pT, η, φ) per object and an explicit "particle-type-ID" channel. This structure is tailored for streaming, low-latency processing and encodes the constraints of hardware-based triggers (Govorkova et al., 2021).
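A minimal sketch of the fixed-slot object array follows (MET, 4 electrons, 4 muons, 10 jets = 19 slots). The slot ordering, per-object features, and integer type-ID codes below are assumptions for illustration, not the dataset's exact encoding.

```python
SLOTS = [("MET", 1), ("electron", 4), ("muon", 4), ("jet", 10)]  # 19 slots total
TYPE_ID = {"MET": 1, "electron": 2, "muon": 3, "jet": 4}  # 0 marks padding

def build_event_array(objects):
    """objects: dict mapping type name -> pT-ordered list of (pT, eta, phi)."""
    rows = []
    for name, n_slots in SLOTS:
        found = objects.get(name, [])[:n_slots]  # truncate overflow
        for pt, eta, phi in found:
            rows.append([pt, eta, phi, TYPE_ID[name]])
        # zero-pad unused slots so every event has the same shape
        rows += [[0.0, 0.0, 0.0, 0] for _ in range(n_slots - len(found))]
    return rows  # 19 rows x 4 features

evt = build_event_array({"muon": [(30.0, 1.1, 0.3)], "jet": [(120.0, 0.4, -2.9)]})
```

The fixed shape is what makes the format compatible with streaming pipelines and hardware triggers, where variable-length inputs are impractical.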
3. Simulation, Feature Engineering, and Signal Injection
Event generation relies on established tools:
- Simulation: Pythia 8.219 and Herwig++ for hard-scatter and hadronization steps; Delphes 3.4.1 simulates a generic LHC-like detector response, with controlled perturbations in reconstruction/acceptance to emulate systematics (e.g., changes to tracking, calorimeter resolutions) (Kasieczka et al., 2021). No pileup or multiple parton interactions (MPI) are included by construction.
- Signal models: Resonant production (e.g., W′ → XY with fully hadronic decays) injects rare, sharply localized overdensities in mJJ, often at sub-percent levels; the resonance masses are shifted between the R&D and black box sets to discourage overfitting.
Feature sets include both low-level and high-level physics inputs. Jet clustering employs the anti-kT algorithm (typically with radius R = 1.0), and N-subjettiness is included to probe substructure. Derived features such as invariant masses, N-subjettiness ratios (e.g., τ21), and multiplicities are common. In the 40 MHz dataset, "Physics Object Arrays" use fixed slots and zero padding, with normalization (standardization or min-max scaling) and masking recommended for effective training (Govorkova et al., 2021).
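The central derived feature, the dijet invariant mass, follows from standard relativistic kinematics given each jet as (pT, η, φ, m); the helper names below are illustrative.

```python
import math

def four_vector(pt, eta, phi, m):
    """(E, px, py, pz) from collider coordinates (pT, eta, phi, mass)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    e = math.sqrt(m * m + (pt * math.cosh(eta)) ** 2)  # |p| = pT cosh(eta)
    return e, px, py, pz

def dijet_mass(jet1, jet2):
    """Invariant mass of the pair: sqrt((E1 + E2)^2 - |p1 + p2|^2)."""
    e1, *p1 = four_vector(*jet1)
    e2, *p2 = four_vector(*jet2)
    e = e1 + e2
    p = [a + b for a, b in zip(p1, p2)]
    return math.sqrt(max(e * e - sum(c * c for c in p), 0.0))

# Two back-to-back massless jets at pT = 100: mJJ = 200
m_jj = dijet_mass((100.0, 0.0, 0.0, 0.0), (100.0, 0.0, math.pi, 0.0))
```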
4. Benchmark Tasks, Evaluation Metrics, and Protocols
Benchmarking is carried out in two phases:
- Development: Tune and validate candidate anomaly-detection algorithms on the labeled R&D set; methods are expected to recover the injected signal and quantify sensitivity.
- Blind Testing: Apply tuned algorithms to the three black box datasets; report anomalies, estimated significance, and inferred signal parameters, without knowledge of ground truth.
Key metrics and protocols include:
- ROC AUC: the area under the receiver operating characteristic curve of signal efficiency εS versus background efficiency εB, traced out by scanning the anomaly-score threshold.
- Significance Improvement Characteristic (SIC): εS/√εB, evaluated at various signal/background efficiencies.
- Precision@k: Fraction of true signal events among the top-k events ranked by anomaly score.
- Null-testing: p-value for the no-anomaly hypothesis in the signal region; typically a Poisson tail probability based on sideband estimates.
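The first three metrics above can be computed directly from anomaly scores; this sketch uses illustrative function names and toy values.

```python
import math

def efficiencies(scores_sig, scores_bg, threshold):
    """Signal and background efficiencies at a given score threshold."""
    eff_s = sum(s >= threshold for s in scores_sig) / len(scores_sig)
    eff_b = sum(s >= threshold for s in scores_bg) / len(scores_bg)
    return eff_s, eff_b

def sic(eff_s, eff_b):
    """Significance Improvement Characteristic: eff_s / sqrt(eff_b)."""
    return eff_s / math.sqrt(eff_b) if eff_b > 0 else float("inf")

def precision_at_k(labeled_scores, k):
    """Fraction of true signal among the top-k events by anomaly score.

    labeled_scores: list of (label, score) with label 1 = signal, 0 = background.
    """
    top = sorted(labeled_scores, key=lambda ls: ls[1], reverse=True)[:k]
    return sum(label for label, _ in top) / k
```

Scanning the threshold over its full range traces out the ROC curve, whose integral gives the AUC; SIC is usually plotted against εS.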
A variety of algorithmic approaches are supported:
- Unsupervised: Autoencoders, normalizing flows, density estimation ("ANODE"), topic modeling, graph-based message-passing architectures (Araz et al., 24 Jun 2025).
- Weakly Supervised: Classification Without Labels (CWoLa), Tag N Train, and related density-ratio or sideband techniques (Kasieczka et al., 2021).
- (Semi-)Supervised: Classifier performance using truth labels as a baseline (not a viable search strategy at the LHC).
Algorithmic best practices, especially for real-time applications, include the use of shallow, quantized networks (e.g., with int8 parameters), FPGA-friendly activations (e.g., ReLU), and resource-pipeline optimization for trigger applications (Govorkova et al., 2021).
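As a minimal sketch of the int8 compression mentioned above, symmetric post-training quantization maps a float weight vector to signed 8-bit integers with a shared scale; the per-tensor scale and rounding choices here are illustrative assumptions, not the cited trigger recipe.

```python
def quantize_int8(weights):
    """Map floats to [-128, 127] integers with one shared (per-tensor) scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard against all-zero input
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error is bounded by about scale / 2 per weight."""
    return [v * scale for v in q]

w = [0.52, -1.27, 0.003]
q, s = quantize_int8(w)
```

In practice such quantization is combined with FPGA-friendly activations and aggressive pipelining to meet the latency budget of a hardware trigger.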
5. Group Anomaly Detection and Resonant Searches
The LHCO2020 dataset is explicitly constructed to benchmark group anomaly (collective anomaly) detection, where the goal is to identify local overdensities, especially those resonant in a particular invariant mass variable such as mJJ. The group anomaly paradigm is formalized as a mixture model,
pSR(x) = f · psig(x) + (1 − f) · pbg(x), with pSB(x) ≈ pbg(x),
with "signal region" (SR) and "sideband" (SB) windows dynamically defined on mJJ and signal fraction f. All major evaluation methods, including sliding-window scans, sideband interpolation, and density-ratio approaches, are supported. The design enforces rarity, overlap, and smoothness assumptions on background distributions to rigorously challenge new algorithms (Kasieczka et al., 2021).
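A sliding-window scan of this kind can be sketched as a simple "bump hunt": count events in a signal region on the resonant variable, estimate the background b from flanking sidebands, and report a Poisson tail probability. The flat-background scaling and window choices below are illustrative assumptions.

```python
import math

def poisson_tail_p(n_obs, b):
    """p-value P(N >= n_obs) for N ~ Poisson(b)."""
    cdf = sum(math.exp(-b) * b ** k / math.factorial(k) for k in range(n_obs))
    return 1.0 - cdf

def bump_hunt(masses, sr_lo, sr_hi, sb_width):
    """Count events in the SR and estimate b from two flanking sidebands."""
    n_sr = sum(sr_lo <= m < sr_hi for m in masses)
    n_sb = sum(sr_lo - sb_width <= m < sr_lo for m in masses) \
         + sum(sr_hi <= m < sr_hi + sb_width for m in masses)
    # assume a locally flat background: scale the SB yield to the SR width
    b = n_sb * (sr_hi - sr_lo) / (2.0 * sb_width)
    return n_sr, b, poisson_tail_p(n_sr, b)
```

Real analyses replace the flat-background assumption with a smooth fit or interpolation across the sidebands, but the SR/SB logic is the same.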
6. Advanced Methodologies and Recent Developments
Contemporary research has leveraged the LHC Olympics datasets to develop and test graph-theory–motivated architectures, notably:
- Sparse, physically-motivated graph autoencoders: Construction of Laman (locally rigid) and unique (globally rigid) graphs among jet constituents or subjets, controlling inductive bias and interpretability (Araz et al., 24 Jun 2025).
- Subjet clustering interpolation: Systematic variation of the number of subjets N via exclusive kT clustering to interpolate between low- and high-level representations; anomaly sensitivity peaks at an intermediate number of subjets.
- Architecture insights: Unique-6 (globally rigid) graphs outperform both fully connected and Laman graphs in the anomaly-detection autoencoder task, as quantified by SIC and AUC. Overly dense graphs suffer from redundancy, while locally rigid graphs underperform due to degeneracies in embedding (Araz et al., 24 Jun 2025).
- Real-time compatibility: The 40 MHz benchmark, by providing fixed-size object arrays and latency constraints, allows exploration of model architectures suitable for FPGA implementation and streaming analyses (Govorkova et al., 2021).
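The graph constructions above can be made concrete with a small sketch: a Laman (minimally rigid in 2D) graph can be grown by the Henneberg type-I move, attaching each new node to two existing nodes, which yields exactly 2n − 3 edges. The attachment order used here (each node connects to the two preceding nodes, e.g., in pT order) is an assumption for illustration.

```python
def laman_edges(n):
    """Edge list of a Henneberg-I graph on nodes 0..n-1.

    A Laman graph on n >= 2 nodes has exactly 2n - 3 edges
    (minimal rigidity in two dimensions).
    """
    if n < 2:
        return []
    edges = [(0, 1)]
    for k in range(2, n):
        edges += [(k, k - 1), (k, k - 2)]  # attach to the two previous nodes
    return edges
```

For comparison, a fully connected graph on the same nodes has n(n − 1)/2 edges, so the Laman construction is far sparser for large n, which is the point of the inductive-bias argument above.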
7. Impact, Community Best Practices, and Recommendations
The LHC Olympics benchmarks are fully FAIR (findable, accessible, interoperable, reusable) and accompanied by documentation, code, and metadata, enabling rigorous cross-method and cross-domain comparisons (Kasieczka et al., 2021). They are unique in combining low-level, high-level, and topology-variable representations under realistic detector and generator systematics. Community recommendations include:
- Using these datasets as definitive benchmarks wherever data are indexed by a physical variable (e.g., invariant mass, time) amenable to resonant anomaly searches.
- Explicitly comparing density-estimation, classification-without-labels, and generative-model techniques in the prescribed sideband-resonance framework.
- Encouraging reproducibility through code and model checkpoint releases.
- Adapting architectural choices (e.g., graph sparsity, feature normalization, batch normalization avoidance) to maximize both interpretability and anomaly sensitivity within the constraints of real experiment workflows.
The LHC Olympics benchmark datasets represent the present standard for the rigorous development, evaluation, and comparison of anomaly-detection algorithms in high-energy physics, and serve as model repositories for analogous challenges in other data-intensive scientific domains (Kasieczka et al., 2021, Govorkova et al., 2021, Kasieczka et al., 2021, Araz et al., 24 Jun 2025).