JetClass Dataset: HEP ML Benchmark
- JetClass is a large-scale, publicly available multi-class jet tagging dataset built from simulated proton-proton collisions, covering ten balanced Standard Model processes.
- It provides detailed constituent-level representations and engineered features, enabling advanced transformer-based architectures, anomaly detection, and generative modeling.
- Benchmark results demonstrate high accuracy and improved background rejection, establishing JetClass as a standard for transfer learning and model pre-training in high-energy physics.
JetClass is a large-scale, publicly available multi-class jet tagging dataset specifically designed to advance ML techniques in high-energy physics (HEP). Comprising 100 million simulated jets equally distributed over ten physically motivated categories, JetClass presents a foundational benchmark for transformer-based and set-based approaches to jet classification, anomaly detection, and generative modeling. The dataset’s construction, physics content, high-level feature engineering, and role in model pre-training have set new standards for empirical rigor and statistical power in the HEP ML domain (Qu et al., 2022).
1. Dataset Construction and Physics Processes
JetClass was generated by simulating proton-proton collisions, primarily using MadGraph5_aMC@NLO for hard-process generation, Pythia 8 for parton showering and hadronization, and Delphes for fast detector simulation with CMS-like parameters (Qu et al., 2022, Usman et al., 9 Jun 2024). Jets are clustered from stable final-state particles using the anti-$k_T$ algorithm with radius parameter $R = 0.8$. Typical jet selection requires $500~\mathrm{GeV} < p_T < 1000~\mathrm{GeV}$ and $|\eta| < 2$, ensuring high-purity, high-$p_T$ jets (Qu et al., 2022, Birk et al., 2023).
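As a concrete illustration, the clustering step can be reproduced with the scikit-hep fastjet bindings; the particle contents below are toy placeholders, not JetClass events:

```python
import awkward as ak
import fastjet  # scikit-hep FastJet bindings
import vector

vector.register_awkward()  # lets awkward records behave as four-vectors

# Toy stand-in for the stable final-state particles of one event
# (px, py, pz, E in GeV); the real inputs come from the Delphes output.
particles = ak.Array(
    [[
        {"px": 310.0, "py": 120.0, "pz": 450.0, "E": 570.0},
        {"px": 295.0, "py": 110.0, "pz": 430.0, "E": 545.0},
    ]],
    with_name="Momentum4D",
)

# Anti-kT clustering with R = 0.8, as used for JetClass
jet_def = fastjet.JetDefinition(fastjet.antikt_algorithm, 0.8)
cluster = fastjet.ClusterSequence(particles, jet_def)

# Keep jets in the JetClass-like kinematic window
jets = cluster.inclusive_jets(min_pt=500.0)
selected = jets[abs(jets.eta) < 2.0]
```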
The dataset’s ten classes correspond to key Standard Model processes observed at the LHC:
- Background (QCD): light-quark- and gluon-initiated jets (q/g)
- Higgs boson decays: $H \to b\bar{b}$, $H \to c\bar{c}$, $H \to gg$, $H \to 4q$, $H \to \ell\nu qq'$
- Top quark decays: $t \to bqq'$, $t \to b\ell\nu$
- W and Z boson decays: $W \to qq'$, $Z \to q\bar{q}$
Each class contributes exactly 10 million jets, ensuring strict balance across splits for unbiased training and benchmarking (Usman et al., 9 Jun 2024). Jets are labeled at generator level by matching final-state decay products within $\Delta R = 0.8$ of the jet axis, except for the q/g background (Qu et al., 2022).
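For orientation, the ten classes map onto one-hot label branches in the released files; the branch names below follow the public release's conventions but should be treated as illustrative:

```python
# Class-index -> label-branch mapping for JetClass
# (branch names assumed from the public release; verify against your files).
JETCLASS_LABELS = [
    "label_QCD",   # q/g background
    "label_Hbb",   # H -> b bbar
    "label_Hcc",   # H -> c cbar
    "label_Hgg",   # H -> g g
    "label_H4q",   # H -> 4q
    "label_Hqql",  # H -> l nu q q'
    "label_Zqq",   # Z -> q qbar
    "label_Wqq",   # W -> q q'
    "label_Tbqq",  # t -> b q q'
    "label_Tbl",   # t -> b l nu
]
```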
2. Input Features, Representation, and Preprocessing
A defining aspect of JetClass is its granular, constituent-level representation. Each jet is encoded as an unordered set (or "cloud") of up to $N$ constituents, where $N$ typically ranges from 10 to 100, depending on event structure. For each constituent, an array of real-valued and discrete features captures spatial, kinematic, and identification information (Qu et al., 2022, Birk et al., 2023).
Per-particle features (a minimal feature-building sketch follows this list):
- Kinematic: $\Delta\eta$, $\Delta\phi$, $\log p_T$, $\log E$, $\log(p_T / p_T^{\mathrm{jet}})$, $\log(E / E^{\mathrm{jet}})$, $\Delta R$
- Charge and PID: electric charge $q$, plus one-hot flags for electron, muon, photon, charged hadron, and neutral hadron
- Impact parameter: $d_0$, $\sigma_{d_0}$, $d_z$, $\sigma_{d_z}$ (transverse and longitudinal displacements and their uncertainties, set to zero for neutrals)
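A minimal sketch of the per-particle kinematic inputs, assuming NumPy arrays of constituent and jet kinematics (the function name and signature are ours, not part of any JetClass API):

```python
import numpy as np

def particle_kinematic_features(pt, eta, phi, energy,
                                jet_pt, jet_eta, jet_phi, jet_energy):
    """Build the seven ParT-style kinematic inputs per constituent.

    Constituent arrays have shape (n_particles,); jet quantities are scalars.
    """
    deta = eta - jet_eta
    dphi = (phi - jet_phi + np.pi) % (2 * np.pi) - np.pi  # wrap to (-pi, pi]
    dr = np.hypot(deta, dphi)
    return np.stack([
        deta,                         # relative pseudorapidity
        dphi,                         # relative azimuth
        np.log(pt),                   # log transverse momentum
        np.log(energy),               # log energy
        np.log(pt / jet_pt),          # pT fraction of the jet
        np.log(energy / jet_energy),  # energy fraction of the jet
        dr,                           # angular distance to the jet axis
    ], axis=-1)
```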
Per-pair features (Lund plane parametrization):
For each pair of constituents $(a, b)$, features include $\ln \Delta$ (with $\Delta = \sqrt{(y_a - y_b)^2 + (\phi_a - \phi_b)^2}$), $\ln k_T$ (with $k_T = \min(p_{T,a}, p_{T,b})\,\Delta$), $\ln z$ (with $z = \min(p_{T,a}, p_{T,b}) / (p_{T,a} + p_{T,b})$), and $\ln m^2$ (with $m^2 = (E_a + E_b)^2 - \lVert \vec{p}_a + \vec{p}_b \rVert^2$). These features are crucial for transformer-based architectures incorporating pairwise attention (Qu et al., 2022, Usman et al., 9 Jun 2024).
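A sketch of the pairwise feature computation under the definitions above (vectorized over all constituent pairs; diagonal entries are regularized with a small epsilon rather than masked, which a production pipeline would handle explicitly):

```python
import numpy as np

def pairwise_lund_features(pt, y, phi, energy, px, py, pz, eps=1e-8):
    """Compute (ln Delta, ln kT, ln z, ln m^2) for every constituent pair.

    Inputs are 1-D arrays over a jet's constituents; output shape is (n, n, 4).
    """
    dy = y[:, None] - y[None, :]
    dphi = (phi[:, None] - phi[None, :] + np.pi) % (2 * np.pi) - np.pi
    delta = np.hypot(dy, dphi)

    pt_min = np.minimum(pt[:, None], pt[None, :])
    kt = pt_min * delta
    z = pt_min / (pt[:, None] + pt[None, :])

    e2 = (energy[:, None] + energy[None, :]) ** 2
    p2 = ((px[:, None] + px[None, :]) ** 2
          + (py[:, None] + py[None, :]) ** 2
          + (pz[:, None] + pz[None, :]) ** 2)
    m2 = e2 - p2

    return np.stack([
        np.log(delta + eps),
        np.log(kt + eps),
        np.log(z + eps),
        np.log(np.clip(m2, eps, None)),
    ], axis=-1)
```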
Preprocessing:
Most features are used in raw or log-transformed form, with no dataset-level normalization; activations are normalized within network architectures via LayerNorm or BatchNorm (Usman et al., 9 Jun 2024, Qu et al., 2022). For generative modeling, all continuous features can be standardized to zero mean and unit variance, and constituents are zero-padded or masked to a fixed maximum multiplicity (Birk et al., 2023).
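A minimal sketch of that generative-modeling preprocessing, assuming each jet arrives as an (n_i, n_features) array; the cap of 128 constituents is an assumption, not a dataset requirement:

```python
import numpy as np

def standardize_and_pad(jets, max_particles=128):
    """Standardize continuous features and zero-pad to fixed multiplicity.

    `jets` is a list of (n_i, n_features) arrays; returns the padded tensor
    plus a boolean validity mask.
    """
    flat = np.concatenate(jets, axis=0)
    mean, std = flat.mean(axis=0), flat.std(axis=0) + 1e-8  # avoid div-by-zero

    padded = np.zeros((len(jets), max_particles, flat.shape[1]), dtype=np.float32)
    mask = np.zeros((len(jets), max_particles), dtype=bool)
    for i, jet in enumerate(jets):
        n = min(len(jet), max_particles)
        padded[i, :n] = (jet[:n] - mean) / std
        mask[i, :n] = True
    return padded, mask
```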
3. Dataset Splits, Storage, and Access
JetClass is distributed as ROOT TTree objects with 41 branches, each encoding per-jet, per-particle, or per-pair features (Usman et al., 9 Jun 2024). The standardized division is:
| Split | Jets per class | Total jets | Proportion |
|---|---|---|---|
| Training | 10 million | 100 million | 80% |
| Validation | 0.5 million | 5 million | 4% |
| Test | 2 million | 20 million | 16% |
Class frequencies are identical in all splits, obviating the need for further weighting or oversampling (Usman et al., 9 Jun 2024, Qu et al., 2022). Data ingestion for ML frameworks is enabled by direct loading into (x, U) tensor pairs (per-particle, per-pair), or as fixed-size tensors plus masks for set-based equivariant models (Wu et al., 11 Jul 2024, Birk et al., 2023).
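A hedged loading sketch using uproot and awkward-array; the tree and branch names are assumed from the public release and should be checked against the actual files:

```python
import awkward as ak
import numpy as np
import uproot

# Branch names assumed from the public JetClass release; adjust as needed.
FEATURES = ["part_deta", "part_dphi", "part_px", "part_py", "part_pz", "part_energy"]

def load_jetclass_file(path, max_particles=128):
    """Load one JetClass ROOT file into a fixed-size tensor plus validity mask."""
    tree = uproot.open(path)["tree"]  # tree name assumed
    arrays = tree.arrays(FEATURES)

    # Pad/clip each ragged branch to max_particles, then stack to (n_jets, N, F)
    padded = [
        ak.to_numpy(ak.fill_none(
            ak.pad_none(arrays[f], max_particles, axis=1, clip=True), 0.0))
        for f in FEATURES
    ]
    x = np.stack(padded, axis=-1).astype(np.float32)

    # Validity mask from the true constituent multiplicities
    counts = ak.to_numpy(ak.num(arrays[FEATURES[0]], axis=1))
    mask = counts[:, None] > np.arange(max_particles)
    return x, mask
```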
4. Applications: Discriminative and Generative Modeling
JetClass underpins multiple research directions in jet tagging:
- Supervised classification: The original Particle Transformer (ParT) achieved accuracy 0.861 and AUC 0.9877 on the JetClass test set, outperforming ParticleNet and plain transformer baselines. Performance scales with training size, with notable gains when moving from 2M to 100M jets (Qu et al., 2022).
- Transfer learning: Pre-training on JetClass followed by downstream fine-tuning (e.g., on top-tagging or quark–gluon discrimination) delivers superior accuracy, rejection power, and data efficiency compared to training from scratch (Qu et al., 2022, Wu et al., 11 Jul 2024).
- Self-supervised and contrastive learning: JetClass can serve as an unlabeled corpus for training encoders via contrastive losses (e.g., JetCLR) with physics-motivated augmentations (random translation in $\eta$–$\phi$, particle dropout, momentum jitter, global rotation); see the augmentation sketch after this list (Zhao et al., 18 Aug 2024).
- Generative modeling: Permutation-equivariant continuous normalizing flows (CNFs), trained with flow-matching loss and conditioned on jet type and axis kinematics, produce high-fidelity synthetic jets capturing both kinematics and discrete PID/displacement information (Birk et al., 2023).
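A minimal augmentation sketch in the spirit of those contrastive setups; the transformation scales below are illustrative choices, not values from the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_jet(eta, phi, pt, drop_prob=0.1, jitter=0.01):
    """Apply physics-motivated augmentations to one jet's constituents."""
    # Global translation in eta and rotation about the beam axis (phi shift)
    eta = eta + rng.normal(scale=0.1)
    phi = (phi + rng.uniform(-np.pi, np.pi) + np.pi) % (2 * np.pi) - np.pi

    # Per-particle relative momentum jitter
    pt = pt * (1.0 + rng.normal(scale=jitter, size=pt.shape))

    # Random particle dropout
    keep = rng.random(pt.shape) > drop_prob
    return eta[keep], phi[keep], pt[keep]
```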
5. Rationale, Selection, and Robustness
The design of JetClass addresses several key goals for large-scale HEP ML benchmarks:
- Inclusive physics coverage: Ten classes sample the main LHC resonance processes, allowing studies of both signal vs background and subtype discrimination.
- Realistic detector modeling: Fast detector simulation with Delphes and CMS-like settings approximates the tracker resolution, calorimeter smearing, and efficiency effects of a real collider environment (Qu et al., 2022, Usman et al., 9 Jun 2024).
- Feature completeness: Particle flow–like feature vectors include all information needed for state-of-the-art ML models, from kinematics to particle type and trajectory.
- Balanced splits and strict selection: Jets are not only class-balanced, but also pass standard $p_T$ and $|\eta|$ cuts and containment requirements to ensure statistical control and reproducibility.
- Augmentation for self-supervised robustness: Physics-inspired perturbations are critical for learning representations invariant to pileup, detector resolution, and calibration uncertainties, as demonstrated in contrastive SSL setups (Zhao et al., 18 Aug 2024).
- No hand-tuned normalization: Instead of dataset-level standardization, architecture-level normalization (LayerNorm, BatchNorm) lets models adapt to the wide dynamic range of the input features (Usman et al., 9 Jun 2024); a minimal sketch follows this list.
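As a minimal PyTorch sketch of that design choice (the module and dimensions are illustrative; 17 matches the per-particle feature count listed in Section 2):

```python
import torch
import torch.nn as nn

class ParticleEmbedding(nn.Module):
    """Normalize inside the network instead of standardizing the dataset."""

    def __init__(self, n_features=17, dim=128):
        super().__init__()
        self.proj = nn.Linear(n_features, dim)
        self.norm = nn.LayerNorm(dim)  # adapts to the wide dynamic range of inputs

    def forward(self, x):  # x: (batch, n_particles, n_features)
        return self.norm(self.proj(x))
```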
6. Benchmarks and Empirical Performance
Quantitative performance of supervised architectures trained on JetClass demonstrates the utility of both the dataset’s scale and its physics-driven structure:
| Model | Accuracy | AUC |
|---|---|---|
| PFN (DeepSets) | 0.772 | 0.9714 |
| P-CNN (DeepAK8) | 0.809 | 0.9789 |
| ParticleNet | 0.844 | 0.9849 |
| ParT | 0.861 | 0.9877 |
ParT also improves background rejection substantially over ParticleNet across the signal classes, and increasing training set size consistently boosts both accuracy and AUC for all benchmark architectures (Qu et al., 2022).
For downstream transfer, pre-trained ParT models attain higher performance in top, Higgs, and quark/gluon discrimination compared to state-of-the-art alternatives, including LorentzNet, particularly when leveraging all PID inputs (Qu et al., 2022, Wu et al., 11 Jul 2024).
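A hedged sketch of that transfer recipe, with a stub encoder standing in for a real pre-trained backbone (architecture, paths, and learning rates are illustrative):

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in for a JetClass-pre-trained encoder (placeholder architecture)."""

    def __init__(self, n_features=17, dim=128, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, dim), nn.GELU(), nn.Linear(dim, dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):  # x: (batch, n_particles, n_features)
        return self.head(self.encoder(x).mean(dim=1))  # mean-pool over particles

model = TinyBackbone()
# model.load_state_dict(torch.load("jetclass_pretrained.pt"))  # path assumed

# Swap the 10-class head for a binary downstream task (e.g., top vs QCD)
model.head = nn.Linear(model.head.in_features, 2)

# Fine-tune: small LR for pre-trained weights, larger LR for the fresh head
optimizer = torch.optim.AdamW([
    {"params": model.encoder.parameters(), "lr": 1e-5},
    {"params": model.head.parameters(), "lr": 1e-3},
])
```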
A plausible implication is that JetClass’s scale and constituent-level detail render it suitable not only for jet tagging, but also for advanced representation learning and anomaly detection in collider physics (Qu et al., 2022, Birk et al., 2023, Zhao et al., 18 Aug 2024).
7. Position Relative to Other Jet Datasets
JetClass occupies a foundational position in the modern ML-for-HEP landscape:
- It is approximately two orders of magnitude larger than earlier public datasets (e.g., JetNet).
- Its constituent-level information (including impact parameters and PID) surpasses the limited feature sets of prior datasets.
- JetClass-II (Li et al., 21 May 2024), developed subsequently, expands the taxonomy to 188 classes, includes more exotic signatures and refined reweighting protocols, and is designed for foundation-model multi-class pre-training (e.g., in the Sophon architecture). JetClass, however, remains the standard for large-scale, balanced, 10-class classification and transfer learning tasks in collider jet physics.
JetClass has thus become the reference standard for both benchmarking and developing new ML models (transformers, equivariant flows, contrastive SSL) that target realistic, high-granularity collider environments. Since its release, JetClass has catalyzed progress in model design, empirical jet physics, and scientific discovery pipelines (Qu et al., 2022, Zhao et al., 18 Aug 2024, Usman et al., 9 Jun 2024).