Classification Without Labels (CWoLa)
- Classification Without Labels (CWoLa) is a weakly supervised learning method that trains classifiers on mixed samples with varying class proportions to asymptotically recover the optimal decision boundary.
- It employs standard binary classification techniques and loss functions like cross-entropy, eliminating the need for explicit event-level labels or mixture fraction calibration.
- CWoLa has shown robust performance in fields such as particle physics and astrophysics, achieving high ROC-AUC and effectively addressing challenges in unlabeled or imperfect data environments.
Classification Without Labels (CWoLa) is a statistical machine learning paradigm in which classifiers are trained using only population-level information by differentiating between samples with differing class proportions, absent any event-level (pointwise) labels or explicit mixture fractions. This weak supervision method is especially valuable in domains where high-fidelity, fully labeled data is unavailable or unreliable, but mixed samples with variable class prevalence can be constructed—a scenario common in experimental particle physics, astronomy, and large-scale scientific surveys. The central result underlying CWoLa is that, under mild assumptions, a classifier trained to distinguish mixed samples asymptotically recovers the optimal decision boundary for classifying the underlying pure classes, matching fully-supervised classifiers in the infinite-data limit, all without requiring labels or mixture weights (Metodiev et al., 2017).
1. Theoretical Foundations and Optimality
CWoLa exploits the simple yet powerful structure of statistical mixtures. Suppose two pure classes $S$ (signal) and $B$ (background) have unknown densities $p_S(x)$ and $p_B(x)$ over a feature space $\mathcal{X}$. Two mixed samples, $M_1$ and $M_2$, are drawn according to

$$p_{M_1}(x) = f_1\, p_S(x) + (1 - f_1)\, p_B(x), \qquad p_{M_2}(x) = f_2\, p_S(x) + (1 - f_2)\, p_B(x),$$

for unknown mixture fractions $f_1 \neq f_2$. In the CWoLa protocol, data from $M_1$ and $M_2$ are treated as pseudo-classes, training a classifier by assigning the "class" label $1$ to $M_1$ and $0$ to $M_2$.
Critically, the likelihood ratio for distinguishing $M_1$ from $M_2$,

$$L_{M_1/M_2}(x) = \frac{p_{M_1}(x)}{p_{M_2}(x)} = \frac{f_1\, L_{S/B}(x) + (1 - f_1)}{f_2\, L_{S/B}(x) + (1 - f_2)}, \qquad L_{S/B}(x) = \frac{p_S(x)}{p_B(x)},$$

is a strictly increasing function of $L_{S/B}(x)$ provided $f_1 > f_2$. Thus, thresholding the output of any classifier trained to distinguish $M_1$ from $M_2$ produces the same ordering as the optimal likelihood-ratio test between $S$ and $B$. No explicit label or mixture-fraction knowledge is required (Metodiev et al., 2017, Klein et al., 19 Mar 2025).
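A minimal numerical check of this monotonicity, using illustrative fractions $f_1 = 0.7$ and $f_2 = 0.3$:

```python
import numpy as np

def mixture_lr(L_sb, f1, f2):
    """Likelihood ratio between mixtures M1 and M2, expressed
    in terms of the pure signal/background ratio L_sb = p_S/p_B."""
    return (f1 * L_sb + (1 - f1)) / (f2 * L_sb + (1 - f2))

# Sweep the pure likelihood ratio over several orders of magnitude.
L_vals = np.logspace(-3, 3, 200)
ratios = mixture_lr(L_vals, f1=0.7, f2=0.3)

# Strictly increasing whenever f1 > f2 ...
assert np.all(np.diff(ratios) > 0)
# ... and strictly decreasing when the ordering of the fractions flips.
assert np.all(np.diff(mixture_lr(L_vals, f1=0.3, f2=0.7)) < 0)
```

Either ordering preserves the ranking induced by $L_{S/B}$, which is why the sign of $f_1 - f_2$ need not be known in advance.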
2. Training Methodologies and Loss Functions
CWoLa is implemented via standard binary classification pipelines. The empirical risk minimized is the binary cross-entropy

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[\, y_i \log h_\theta(x_i) + (1 - y_i) \log\big(1 - h_\theta(x_i)\big) \,\right],$$

with $h_\theta$ parameterized by a neural network or other flexible model class.
Key practical steps:
- Assign label $y = 1$ to events from $M_1$ and $y = 0$ to events from $M_2$.
- Train $h_\theta$ to minimize $\mathcal{L}(\theta)$ by stochastic gradient descent or another suitable optimizer (Adam is common practice).
- No calibration of mixture fractions or instance-wise labels is needed during training.
- Model choice is flexible and includes fully connected networks, convolutional neural networks for image-like data, transformers for high-dimensional sequences, or boosted decision trees for tabular features (Amaro et al., 28 Jan 2026, Klein et al., 19 Mar 2025, Collins et al., 2021).
The guarantee of optimality is asymptotic—statistical efficiency and convergence depend on having a sufficient difference in mixture fractions $|f_1 - f_2|$, adequate sample sizes, and enough model capacity.
3. Typical Experimental and Data Domains
CWoLa is widely used in experimental high-energy physics, astrophysics, and other fields facing imperfect simulation or unlabeled real data:
- Collider Physics: Quark/gluon jet tagging using quark- and gluon-enriched mixed control samples (Metodiev et al., 2017), resonance searches using signal-enriched windows and background-dominated sidebands (Collins et al., 2021, Klein et al., 19 Mar 2025).
- Direct Detection Experiments: Nuclear recoil identification in optical TPC dark matter detectors using neutron-source (signal-enriched) and background samples (Amaro et al., 28 Jan 2026).
- Astrophysics: Stellar stream detection in Gaia via population splits on kinematic coordinates, with sidebands providing control populations (Pettee et al., 2023).
Recent expansions include cross-modal learning (2D/3D vision) with pseudo-label assignment based on teacher confidence in multi-view settings (Dharmasiri et al., 2024), and scenarios with multiple reference samples (“Multi-CWoLa”) for improved anomaly detection (Chen et al., 2022).
4. Extensions: Strong and Multi-CWoLa
Strong CWoLa (sCWoLa)
In sCWoLa, one mixture is a fully simulated pure-signal sample ($f_1 = 1$), and the other is an unlabeled mixture from real data, possibly containing signal at an unknown (small) contamination fraction $\epsilon$:
- Assign “signal” label to the simulated sample, “background” label to real data (Klein et al., 19 Mar 2025).
- Enables fully data-driven background modeling, eliminating the need for unreliable background simulation.
- Architectures include boosted decision trees on high-level features or transformers on low-level representations.
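A toy sketch of the sCWoLa label assignment follows; the Gaussian stand-ins and the logistic-regression model are illustrative choices, not the architectures used in the cited work:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)

# Toy stand-ins: simulated pure signal vs. real data with small signal leakage.
sim_signal = rng.normal(1.0, 1.0, (5000, 1))   # f1 = 1 (pure simulation)
eps = 0.02                                     # unknown contamination in data
is_sig = rng.random(20000) < eps
real_data = np.where(is_sig, rng.normal(1.0, 1.0, 20000),
                     rng.normal(-1.0, 1.0, 20000)).reshape(-1, 1)

# sCWoLa labels: "signal" for simulation, "background" for real data.
X = np.vstack([sim_signal, real_data])
y = np.concatenate([np.ones(len(sim_signal)), np.zeros(len(real_data))])
clf = LogisticRegression().fit(X, y)

# Score against truth on an independent balanced test set.
truth = rng.random(10000) < 0.5
X_test = np.where(truth, rng.normal(1.0, 1.0, 10000),
                  rng.normal(-1.0, 1.0, 10000)).reshape(-1, 1)
auc = roc_auc_score(truth, clf.predict_proba(X_test)[:, 1])
```

The small leakage `eps` barely shifts the pseudo-background density, which is why sCWoLa tolerates modest contamination.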
Multi-CWoLa
For $K$ mixtures, label each event by its mixture index. Relative to the two-mixture case, this can improve statistical efficiency and furnish finite-sample guarantees when signal enrichment varies over multiple resonance/sideband regions (Chen et al., 2022). The theoretical extension leverages a graphical model over the noisy mixture-indicator “labels,” using joint agreement patterns to reduce estimation bias as $K$ grows.
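One way to sketch the multi-mixture labeling, under toy Gaussian assumptions and with a multinomial logistic model standing in for the classifier (the scoring heuristic below is illustrative, not the estimator of Chen et al.):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)

def mixture(n, f):
    """Toy mixture: signal ~ N(+1,1), background ~ N(-1,1), signal fraction f."""
    sig = rng.random(n) < f
    return np.where(sig, rng.normal(1.0, 1.0, n),
                    rng.normal(-1.0, 1.0, n)).reshape(-1, 1)

fractions = [0.8, 0.5, 0.2]  # signal fraction per mixture (unknown in practice)
X = np.vstack([mixture(10000, f) for f in fractions])
y = np.repeat(np.arange(len(fractions)), 10000)  # label = mixture index

clf = LogisticRegression().fit(X, y)  # multinomial softmax over mixture indices

# Probability assigned to the most signal-enriched mixture acts as a signal score.
truth = rng.random(5000) < 0.5
X_test = np.where(truth, rng.normal(1.0, 1.0, 5000),
                  rng.normal(-1.0, 1.0, 5000)).reshape(-1, 1)
auc = roc_auc_score(truth, clf.predict_proba(X_test)[:, 0])
```

Each additional mixture contributes extra constraints on the decision boundary without any per-event labels.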
5. Practical Considerations and Limitations
CWoLa’s assumptions and caveats include:
- Mixture Difference Requirement: Performance degrades as $f_1 \to f_2$; the discriminative power is maximized for large signal-fraction differentials.
- Sample Consistency: The class-conditional distributions $p_S(x)$ and $p_B(x)$ must be stable across mixtures; systematic differences in feature distributions (other than changing class prevalence) can violate optimality.
- No Instance-level Labeling: Working points (classification thresholds) for given target efficiencies require calibration (mixture fraction estimation, or a small labeled sample) (Metodiev et al., 2017).
- Robustness to Signal Contamination: sCWoLa is robust to small signal contamination in the real-data mixture but can degrade with larger leakage (Klein et al., 19 Mar 2025).
- Domain Shifts: CWoLa avoids simulation mismodeling for backgrounds but remains dependent on accurate signal modeling in sCWoLa or in applications requiring simulation.
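For the working-point caveat above, a minimal sketch of threshold calibration from a small labeled signal sample; the beta-distributed scores are a toy stand-in for real classifier outputs:

```python
import numpy as np

rng = np.random.default_rng(2)

# Scores from a trained CWoLa classifier on a small labeled signal sample
# (toy stand-in: real scores would come from clf.predict_proba).
signal_scores = rng.beta(5, 2, size=500)

target_eff = 0.80  # desired signal efficiency at the working point
# Threshold = (1 - eff) quantile of signal scores: 80% of signal lies above it.
threshold = np.quantile(signal_scores, 1 - target_eff)
achieved = np.mean(signal_scores >= threshold)
```

Because CWoLa only guarantees the *ordering* of scores, mapping a threshold to an absolute efficiency always needs such a calibration step.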
6. Empirical Performance and Metrics
Empirical studies consistently demonstrate that:
- CWoLa achieves ROC-AUC near the fully supervised ceiling when the conditions above are met.
- In HEP resonance searches, CWoLa delivered AUC values approaching the fully supervised benchmark at moderate signal fractions; in CYGNO, performance approached the “mixture ceiling” as the signal fraction of the neutron-enriched sample increased (Amaro et al., 28 Jan 2026, Collins et al., 2021).
- CWoLa can outperform unsupervised autoencoder anomaly detection at moderate signal fractions, whereas autoencoders provide complementary reach at very low signal fractions (Collins et al., 2021).
- In astrophysics, CWoLa yielded purity and completeness competitive with more complex methods, with training time and compute requirements remaining modest (Pettee et al., 2023).
Table: CWoLa versus Related Methods in Collider Physics
| Approach | Labels Required | AUC (typical, moderate signal fraction) | Comments |
|---|---|---|---|
| Fully supervised | Full eventwise labels | 0.85 | Requires simulation/truth |
| CWoLa (2 mixtures) | Mixture membership only | 0.84–0.85 | Needs $f_1 \neq f_2$; robust to unknown fractions |
| sCWoLa | Simulated S + real mix | 0.84–0.86 | No background simulation needed |
| Autoencoder | None | 0.75–0.80 | Independent of mixture fractions; less sensitive for moderate signals |
7. Generalization and Recent Developments
CWoLa is naturally extensible to diverse domains:
- Multi-class classification: Given $K$ pure classes and $K$ mixtures with linearly independent fraction vectors, multiclass CWoLa yields identifiability.
- Multiple reference datasets (Multi-CWoLa): Combining information from several sidebands increases sample efficiency and provides finite-sample theoretical guarantees (Chen et al., 2022).
- Cross-modal and open-vocabulary classification: CWoLa is integrated into pseudo-label self-training with joint confidence selection strategies, enabling open-world 3D vision alignment without class annotations (Dharmasiri et al., 2024).
- Anomaly detection: CWoLa forms the foundation for weakly-supervised anomaly search in both high-dimensional collider and astronomical datasets (Collins et al., 2021, Pettee et al., 2023).
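The multi-class identifiability point above can be illustrated with a toy linear-algebra sketch: if the $K \times K$ matrix of mixture fractions has full rank, the pure-class densities are recoverable from the mixed ones (all numbers below are illustrative):

```python
import numpy as np

# Rows: mixtures, columns: pure-class fractions (each row sums to 1).
F = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
assert np.linalg.matrix_rank(F) == 3  # linearly independent fraction vectors

# Toy pure-class densities evaluated on a grid of x values.
x = np.linspace(-3, 3, 7)
pure = np.stack([np.exp(-(x - m) ** 2) for m in (-1.0, 0.0, 1.0)])  # (3, 7)

mixed = F @ pure                      # what is actually observed
recovered = np.linalg.solve(F, mixed)  # invert the mixing
assert np.allclose(recovered, pure)
```

With a rank-deficient `F` (two mixtures sharing the same fractions), `np.linalg.solve` fails, mirroring the $f_1 \neq f_2$ requirement in the two-class case.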
Limitations remain, including the need for class-conditional distribution stability, the necessity for sufficient sample size and mixture separation, and (in sCWoLa) reliance on pure signal simulation. Recent work seeks to further reduce dependency on simulation (via data-driven estimation), extend calibration procedures, and combine CWoLa with foundation model pretraining for robust performance in high-dimensional, open-ended scientific data (Klein et al., 19 Mar 2025, Dharmasiri et al., 2024, Amaro et al., 28 Jan 2026).
References
- "Classification without labels: Learning from mixed samples in high energy physics" (Metodiev et al., 2017)
- "Strong CWoLa: Binary Classification Without Background Simulation" (Klein et al., 19 Mar 2025)
- "Trigger Optimization and Event Classification for Dark Matter Searches in the CYGNO Experiment Using Machine Learning" (Amaro et al., 28 Jan 2026)
- "Weakly-Supervised Anomaly Detection in the Milky Way" (Pettee et al., 2023)
- "Comparing Weak- and Unsupervised Methods for Resonant Anomaly Detection" (Collins et al., 2021)
- "Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels" (Dharmasiri et al., 2024)
- "Resonant Anomaly Detection with Multiple Reference Datasets" (Chen et al., 2022)