Masked Particle Modeling in High Energy Physics
- Masked Particle Modeling is a self-supervised technique that reconstructs masked elements in unordered particle sets using permutation-invariant learning.
- It leverages transformer-based encoders, volumetric tokenization, and tailored loss functions to capture complex physical dependencies in scientific data.
- MPM achieves state-of-the-art results in HEP tasks such as jet tagging and track classification, enabling significant data efficiency with minimal supervision.
Masked Particle Modeling (MPM) refers to a class of self-supervised learning (SSL) frameworks in which a model is trained to recover masked or missing constituents of unordered sets of particle data, primarily in high energy physics (HEP) and scientific domains characterized by non-Euclidean, permutation-invariant, and often sparse data. MPM extends the masked modeling paradigm from language and vision (e.g., BERT, MAE) to sets, point clouds, and trajectories, aiming to learn powerful, generic representations without labeled supervision. Recent advances integrate transformer-based architectures, permutation invariance, volumetric tokenization, and sophisticated generative objectives tailored to the underlying scientific tasks (Young et al., 4 Feb 2025, Leigh et al., 19 Sep 2024, Golling et al., 24 Jan 2024).
1. Foundational Principles and Objectives
Masked Particle Modeling generalizes the principle of masked language modeling to unordered sets, where the domain elements are individual particles, point cloud constituents, or trajectory points. Formally, the input is a set {x_1, …, x_N}, where each x_i is a feature vector of kinematic, spatial, or energy information. A random subset of elements is replaced by a learned mask token, and the learning objective is to reconstruct the features or discrete identities of the masked elements using the remaining (unmasked) context.
The core pre-training objective is permutation-invariant masked reconstruction, fostering representations that capture the statistical dependencies and physical laws governing particle configurations. In HEP, this enables the creation of foundation models able to transfer across tasks such as jet tagging, track/shower classification, secondary vertex detection, and anomaly searches, with high data efficiency and minimal supervision. MPM naturally accommodates the absence of canonical ordering intrinsic to set/point cloud data (Golling et al., 24 Jan 2024, Leigh et al., 19 Sep 2024).
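As a concrete illustration of this objective, the following minimal PyTorch sketch masks a random subset of an unordered particle set, replaces the masked entries with a learned mask token, encodes the set with a transformer that uses no positional embeddings, and regresses the original features at the masked positions. The module sizes, the 40% mask fraction, and the plain feature-regression loss are illustrative assumptions rather than the exact setup of any cited model.

```python
import torch
import torch.nn as nn

class MaskedSetReconstructor(nn.Module):
    def __init__(self, feat_dim=4, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(feat_dim, d_model)              # per-particle embedding
        self.mask_token = nn.Parameter(torch.zeros(d_model))   # learned mask embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # no positional encoding: order does not matter
        self.head = nn.Linear(d_model, feat_dim)               # regress masked particle features

    def forward(self, x, mask_frac=0.4):
        # x: (batch, n_particles, feat_dim) unordered particle sets
        mask = torch.rand(x.shape[:2], device=x.device) < mask_frac   # True = masked
        tokens = self.embed(x)
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        pred = self.head(self.encoder(tokens))
        # reconstruction loss evaluated only at the masked positions
        return ((pred - x) ** 2)[mask].mean()

# toy usage: 8 events, 30 particles each, 4 kinematic features per particle
loss = MaskedSetReconstructor()(torch.randn(8, 30, 4))
```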
2. Framework Design: Tokenization, Masking, and Losses
Multiple instantiations of Masked Particle Modeling are distinguished by how input elements are tokenized or grouped, the masking protocol, and the reconstruction loss:
- Tokenization and Grouping:
- Flat sets: In jet physics, the input is a set {x_i} of particle-level feature vectors with no explicit grouping structure (Golling et al., 24 Jan 2024).
- Volumetric tokenization: For 3D point cloud data, as in LArTPC detector events, points are grouped into local volumetric patches using farthest point sampling (FPS) and Centrality-based Non-Maximum Suppression (C-NMS), with each patch embedded via a local PointNet and then treated as a “token” (Young et al., 4 Feb 2025); a minimal FPS sketch follows this list.
- Masking Strategy:
- A fixed fraction of tokens or particles (typically 40–60%) is randomly selected per event.
- The mask is implemented as a binary indicator vector m ∈ {0, 1}^N over the set, with masked inputs replaced by a learned mask token.
- Reconstruction Tasks:
- Discrete token classification: Masked elements are reconstructed by predicting discrete codes from a pre-trained VQ-VAE codebook (Golling et al., 24 Jan 2024; the original MPMv1 setup in Leigh et al., 19 Sep 2024).
- Continuous generative modeling: More recent MPM variants (MPMv2) forgo tokenization and operate directly on continuous kinematic/energy targets using multi-task losses: regression, classification, K-means clustering, conditional normalizing flows (CNF), and conditional flow-matching (CFM) (Leigh et al., 19 Sep 2024).
- Chamfer distance reconstruction: For 3D masked point cloud groups, the per-patch output is scored using the Chamfer distance between predicted and true local point sets, optionally augmented with auxiliary per-point energy regression (Young et al., 4 Feb 2025).
- Loss Aggregation:
- Losses from each task are combined, often with task-specific weights λ_t to ensure balanced gradients: L_total = Σ_t λ_t · L_t, where L_t denotes the loss for task t (e.g., ID, regression, CNF, CFM, energy infilling) (Leigh et al., 19 Sep 2024, Young et al., 4 Feb 2025).
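As a sketch of the volumetric tokenization step above, greedy farthest point sampling can be written in a few lines; the radius-based grouping, C-NMS deduplication, and PointNet embedding that follow it in the full pipeline are omitted. This is an illustrative PyTorch sketch, not the cited implementation.

```python
import torch

def farthest_point_sampling(points, n_centers):
    """points: (N, 3) event point cloud -> indices of n_centers well-spread points."""
    n = points.size(0)
    chosen = torch.zeros(n_centers, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    chosen[0] = torch.randint(n, (1,)).item()            # arbitrary seed point
    for i in range(1, n_centers):
        # distance of every point to its nearest already-chosen centre
        dist = torch.minimum(dist, (points - points[chosen[i - 1]]).pow(2).sum(-1))
        chosen[i] = dist.argmax()                         # pick the farthest remaining point
    return chosen

centers_idx = farthest_point_sampling(torch.randn(1024, 3), n_centers=32)
```

Similarly, the Chamfer reconstruction criterion named above and the weighted aggregation L_total = Σ_t λ_t · L_t can be sketched as follows; the task names and weight values are placeholder assumptions.

```python
import torch

def chamfer_distance(pred_pts, true_pts):
    """Symmetric Chamfer distance between (N, 3) and (M, 3) point sets of one masked patch."""
    d = torch.cdist(pred_pts, true_pts)                   # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def aggregate_losses(task_losses, task_weights):
    """L_total = sum_t lambda_t * L_t over the active pre-training tasks."""
    return sum(task_weights[t] * task_losses[t] for t in task_losses)

# toy usage with placeholder task names and weights
losses = {"chamfer": chamfer_distance(torch.randn(16, 3), torch.randn(20, 3)),
          "energy": torch.tensor(0.2)}                    # e.g. a per-point energy infilling term
total = aggregate_losses(losses, {"chamfer": 1.0, "energy": 0.5})
```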
3. Model Architectures and Permutation Equivariance
Masked Particle Modeling leverages transformer-based encoders and decoders that are explicitly designed to respect the set or point cloud structure:
- Permutation-Invariant Encoders:
- Transformers with no input positional embeddings (“Normformer” style) maintain full set permutation equivariance, critical for unordered particle sets (Golling et al., 24 Jan 2024).
- For point clouds, each volumetric token is associated with a learned absolute 3D positional embedding; however, the transformer backbone itself remains invariant to the ordering of the input tokens (Young et al., 4 Feb 2025).
- “Register” tokens can be introduced into the encoder to allow aggregation of global context while preserving individual element information (Leigh et al., 19 Sep 2024).
- Decoder Design:
- A “light” ViT (Vision Transformer) decoder processes the visible tokens and mask placeholders jointly, using cross-attention to inform the reconstruction of masked tokens (Young et al., 4 Feb 2025, Leigh et al., 19 Sep 2024); a minimal sketch follows this list.
- The decoder architecture has evolved from shallow MLPs (MPMv1) to deeper multi-headed transformers supporting rich conditional generation.
- Prediction Head Symmetry Breaking:
- To avoid collapse in permutation-equivariant networks, the masked prediction head is fed additional ordering information or positional encodings within the masked subset (a common choice is a rank-based ordering), without breaking equivariance in the backbone (Golling et al., 24 Jan 2024, Leigh et al., 19 Sep 2024).
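The encoder/decoder split described in this section can be sketched MAE-style: the encoder attends only among visible tokens, while a lighter decoder receives the encoded visible tokens together with learned mask placeholders, each tagged with a positional embedding of its 3D patch centre. Module depths and sizes below are illustrative assumptions, and key-padding is used in place of the variable-length gathering a full implementation would perform.

```python
import torch
import torch.nn as nn

class MaskedSetAutoencoder(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)   # heavier encoder
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=1)   # "light" decoder
        self.mask_token = nn.Parameter(torch.zeros(d_model))            # learned placeholder
        self.pos_embed = nn.Linear(3, d_model)                          # embed 3D patch centres

    def forward(self, tokens, centers, visible):
        # tokens: (B, N, d) patch embeddings; centers: (B, N, 3); visible: (B, N) bool
        pos = self.pos_embed(centers)
        # encoder attends only among visible tokens (masked positions are padded out)
        enc = self.encoder(tokens + pos, src_key_padding_mask=~visible)
        # decoder sees encoded visible tokens plus mask placeholders, each carrying its
        # positional embedding, and predicts content at the masked locations
        dec_in = torch.where(visible.unsqueeze(-1), enc, self.mask_token + pos)
        return self.decoder(dec_in)

# toy usage: 2 events, 10 patch tokens each, roughly half masked
tok, ctr = torch.randn(2, 10, 64), torch.randn(2, 10, 3)
vis = torch.rand(2, 10) > 0.5
vis[:, 0] = True            # ensure at least one visible token per event
out = MaskedSetAutoencoder()(tok, ctr, vis)
```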
4. Applications and Experimental Validation
MPM frameworks have achieved state-of-the-art results in an array of scientific and HEP contexts:
- LArTPC Particle Trajectory Analysis:
- PoLAr-MAE applies volumetric tokenization and masked modeling to sparse 3D point clouds from liquid argon TPC detectors, achieving 99.4% F₁ for tracks and 97.7% F₁ for showers with no labeled data, matching supervised benchmarks (Young et al., 4 Feb 2025).
- Fine-tuning on only 10,000 labels produces segmentation metrics comparable to fully supervised models requiring 10x more labels.
- Jet Physics Foundation Models:
- MPM enables SSL pre-training on large unlabeled jet datasets, supporting both in-domain transfer (10-way JetClass classification) and out-of-domain transfer (BTag, RODEM) (Golling et al., 24 Jan 2024, Leigh et al., 19 Sep 2024).
- Quantitative gains were observed in jet classification (up to 85.3% accuracy in MPMv2 K-means), secondary vertex identification (ARI ~0.45), and heavy-track detection (balanced accuracy ~0.72).
- Ablative studies show all non-tokenized MPMv2 variants outperform VQ-VAE-tokenized MPMv1 by 1–3% on key metrics (Leigh et al., 19 Sep 2024).
- Mask Efficiency in Biophysical Systems:
- The term “masked-particle modeling” has also been applied to analytic and simulation models of droplet permeation driven through mask fiber networks, focusing on Arrhenius-activated transport and energy-barrier tuning (Karmakar et al., 2022).
| Domain | Architecture | Downstream Supervision | Key Results |
|---|---|---|---|
| LArTPC point clouds | PoLAr-MAE (volumetric + ViT) | None / few labels | F₁: track 99.4%, shower 97.7% |
| Jet sets | MPM/MPMv2 (set transformer) | None / few labels | Acc: up to 85.3% |
| Mask permeation | Langevin+polymer network model | N/A | Efficiency ≥95% for optimal ΔE |
5. Advantages, Innovations, and Ablations
Recent developments have yielded several empirical and architectural advantages:
- Continuous Modeling Without Tokenization:
- Removing VQ-VAE tokenization increases representation expressiveness by eliminating the information loss caused by codebook quantization, and supports direct regression or generative modeling of physical variables (Leigh et al., 19 Sep 2024); a conditional flow-matching sketch follows this list.
- A more powerful transformer decoder (as opposed to shallow MLP heads) leverages full cross-attention between masked and observed inputs, markedly improving downstream performance.
- Auxiliary Objectives:
- Adding auxiliary tasks such as energy infilling (per-point) improves physical fidelity of point cloud reconstructions, though it does not necessarily boost high-level classification performance (Young et al., 4 Feb 2025).
- Task Weighting and Hyperparameter Ablations:
- Ablations on mask ratio, overlap fraction in volumetric grouping, and decoder architecture highlight optimal hyperparameter regimes (e.g., a 60% mask ratio and a tuned patch-overlap fraction) and stability with respect to architecture depth (Young et al., 4 Feb 2025, Leigh et al., 19 Sep 2024).
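As referenced above, a conditional flow-matching (CFM) objective of the kind used for continuous masked-target generation can be sketched as follows: a small velocity network is trained to match the straight-line interpolation velocity between a noise sample and the true masked features, conditioned on the backbone's per-particle context. The linear probability path and the network shape are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, feat_dim=4, cond_dim=64, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + cond_dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, feat_dim))

    def forward(self, x_t, t, cond):
        # predict the velocity at time t, conditioned on per-particle context
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def cfm_loss(v_net, x1, cond):
    # x1: (M, feat_dim) true features of the masked particles
    # cond: (M, cond_dim) per-particle context from the backbone/decoder
    x0 = torch.randn_like(x1)                  # noise sample
    t = torch.rand(x1.size(0), 1)              # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                # straight-line probability path
    target_v = x1 - x0                         # its constant velocity
    return ((v_net(x_t, t, cond) - target_v) ** 2).mean()

# toy usage: 32 masked particles with 4 features and 64-dim context each
loss = cfm_loss(VelocityNet(), torch.randn(32, 4), torch.randn(32, 64))
```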
6. Challenges, Limitations, and Open Directions
Key limitations and outstanding challenges remain:
- Sub-Token and Fine-Scale Structures:
- Phenomena such as short-lived or overlapping trajectories (e.g., Michel electrons, delta rays) are often absorbed into single or few tokens, limiting the model’s ability to disambiguate them given global context (Young et al., 4 Feb 2025).
- Overlapping ionization signals and insufficient local context in mask grouping make fine-grained reconstruction fundamentally ambiguous with current methods.
- Scalability and Data Efficiency:
- MPM frameworks natively support training on massive, unlabeled data pools, but memory and computation scale with set cardinality and detector complexity.
- Remedies such as smaller group radii, hierarchical tokenization, or more expressive (diffusion-based) decoders are identified as directions for further investigation (Young et al., 4 Feb 2025, Leigh et al., 19 Sep 2024).
- Set Generative Modeling:
- Recent proposals for set-to-set flow-matching permit joint generation of masked sets via permutation-invariant diffusion models, further mitigating reliance on arbitrary tokenization (Leigh et al., 19 Sep 2024).
A plausible implication is that as datasets and model sizes increase (e.g., to HL-LHC scale), MPM and its generalizations will become the default paradigm for data-efficient scientific foundation models in set- and point-cloud-based domains (Golling et al., 24 Jan 2024, Leigh et al., 19 Sep 2024).
7. Broader Context and Theoretical Perspectives
Masked Particle Modeling is positioned as a unifying self-supervised pre-training scheme for unordered, continuous scientific data, substantially reducing the reliance on heavy supervision and manually engineered features in several HEP and biophysical modeling applications. By directly modeling the conditional structure of particle sets, MPM frameworks provide a principled path toward generic, transferable representations, set-permutation invariance, and effective use of unlabeled scientific data (Young et al., 4 Feb 2025, Leigh et al., 19 Sep 2024, Golling et al., 24 Jan 2024).
The development of PILArNet-M, the largest open labeled LArTPC particle cloud dataset (over 1 million events with 5.2 billion points), is poised to standardize benchmarks and catalyze further advances in masked particle modeling and self-supervised set modeling for physical sciences (Young et al., 4 Feb 2025).