
Classification Without Labels (CWoLa)

Updated 4 February 2026
  • Classification Without Labels (CWoLa) is a weakly supervised learning method that trains classifiers on mixed samples with varying class proportions to asymptotically recover the optimal decision boundary.
  • It employs standard binary classification techniques and loss functions like cross-entropy, eliminating the need for explicit event-level labels or mixture fraction calibration.
  • CWoLa has shown robust performance in fields such as particle physics and astrophysics, achieving high ROC-AUC and effectively addressing challenges in unlabeled or imperfect data environments.

Classification Without Labels (CWoLa) is a statistical machine learning paradigm in which classifiers are trained using only population-level information by differentiating between samples with differing class proportions, absent any event-level (pointwise) labels or explicit mixture fractions. This weak supervision method is especially valuable in domains where high-fidelity, fully labeled data is unavailable or unreliable, but mixed samples with variable class prevalence can be constructed—a scenario common in experimental particle physics, astronomy, and large-scale scientific surveys. The central result underlying CWoLa is that, under mild assumptions, a classifier trained to distinguish mixed samples asymptotically recovers the optimal decision boundary for classifying the underlying pure classes, matching fully-supervised classifiers in the infinite-data limit, all without requiring labels or mixture weights (Metodiev et al., 2017).

1. Theoretical Foundations and Optimality

CWoLa exploits the simple yet powerful structure of statistical mixtures. Suppose two pure classes $A$ (signal) and $B$ (background) have unknown densities $p(x|A)$ and $p(x|B)$ over a feature space $x \in \mathcal{X}$. Two mixed samples, $M_1$ and $M_2$, are drawn according to

$$p(x|M_1) = f_1\, p(x|A) + (1-f_1)\, p(x|B), \qquad p(x|M_2) = f_2\, p(x|A) + (1-f_2)\, p(x|B),$$

for unknown mixture fractions $0 \leq f_2 < f_1 \leq 1$. In the CWoLa protocol, data from $M_1$ and $M_2$ are treated as pseudo-classes: a classifier $h(x)$ is trained by assigning the “class” label $y=1$ to events from $M_1$ and $y=0$ to events from $M_2$.

Critically, the likelihood ratio for distinguishing $M_1$ from $M_2$,

$$L_{1/2}(x) = \frac{p(x|M_1)}{p(x|M_2)} = \frac{f_1 R(x) + (1-f_1)}{f_2 R(x) + (1-f_2)} \quad \text{with} \quad R(x) = \frac{p(x|A)}{p(x|B)},$$

is a strictly increasing function of $R(x)$ provided $f_1 > f_2$. Thus, thresholding the output of any classifier trained to distinguish $M_1$ from $M_2$ produces the same ordering of events as the optimal likelihood-ratio test between $A$ and $B$. No explicit labels or mixture-fraction knowledge are required (Metodiev et al., 2017, Klein et al., 19 Mar 2025).
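The monotonicity is easy to verify directly: differentiating $L_{1/2}$ with respect to $R$ gives

$$\frac{\partial L_{1/2}}{\partial R} = \frac{f_1\left(f_2 R + 1 - f_2\right) - f_2\left(f_1 R + 1 - f_1\right)}{\left(f_2 R + 1 - f_2\right)^2} = \frac{f_1 - f_2}{\left(f_2 R + 1 - f_2\right)^2},$$

which is strictly positive whenever $f_1 > f_2$.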

2. Training Methodologies and Loss Functions

CWoLa is implemented via standard binary classification pipelines. The empirical risk minimized is the binary cross-entropy

$$L(\theta) = -\mathbb{E}_{x\sim M_1}\left[\log h_\theta(x)\right] - \mathbb{E}_{x \sim M_2}\left[\log\left(1 - h_\theta(x)\right)\right]$$

with $h_\theta$ parameterized by a neural network or another flexible model class.

In practice, the optimality guarantee is asymptotic: statistical efficiency and rate of convergence depend on the size of the mixture-fraction gap $f_1 - f_2$, on the available sample sizes, and on model capacity.
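As a minimal sketch of the recipe above (toy 1-D Gaussians standing in for $p(x|A)$ and $p(x|B)$, a logistic-regression $h_\theta$ instead of a neural network; all names and settings here are illustrative, not from any reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pure(n, cls):
    # Toy pure classes: signal A ~ N(+1, 1), background B ~ N(-1, 1).
    return rng.normal(1.0 if cls == "A" else -1.0, 1.0, size=n)

def sample_mixture(n, f):
    # Mixed sample whose signal fraction f is unknown to the training step.
    n_sig = rng.binomial(n, f)
    return np.concatenate([sample_pure(n_sig, "A"), sample_pure(n - n_sig, "B")])

# Two mixtures with different signal fractions, f1 > f2.
f1, f2 = 0.7, 0.3
M1, M2 = sample_mixture(20_000, f1), sample_mixture(20_000, f2)
x = np.concatenate([M1, M2])
y = np.concatenate([np.ones(M1.size), np.zeros(M2.size)])  # pseudo-labels only

# Logistic regression h(x) = sigmoid(w*x + b), fit by gradient descent
# on the binary cross-entropy between the two mixtures.
w, b = 0.0, 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w -= 0.5 * np.mean((p - y) * x)
    b -= 0.5 * np.mean(p - y)

# Evaluate on held-out PURE samples: the mixture-trained classifier
# should rank true signal above true background.
score_A = (1.0 / (1.0 + np.exp(-(w * sample_pure(5_000, "A") + b)))).mean()
score_B = (1.0 / (1.0 + np.exp(-(w * sample_pure(5_000, "B") + b)))).mean()
print(w, score_A, score_B)
```

Although the training labels distinguish only $M_1$ from $M_2$, the learned score orders pure $A$ above pure $B$, as guaranteed by the monotonicity of $L_{1/2}$ in $R(x)$.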

3. Typical Experimental and Data Domains

CWoLa is widely used in experimental high-energy physics, astrophysics, and other fields facing imperfect simulation or unlabeled real data:

  • Collider Physics: Quark/gluon jet tagging using mixed-enriched control samples (Metodiev et al., 2017), resonance searches using signal-enriched windows and background-dominated sidebands (Collins et al., 2021, Klein et al., 19 Mar 2025).
  • Direct Detection Experiments: Nuclear recoil identification in optical TPC dark matter detectors using neutron-source (signal-enriched) and background samples (Amaro et al., 28 Jan 2026).
  • Astrophysics: Stellar stream detection in Gaia via population splits on kinematic coordinates, with sidebands providing control populations (Pettee et al., 2023).

Recent expansions include cross-modal learning (2D/3D vision) with pseudo-label assignment based on teacher confidence in multi-view settings (Dharmasiri et al., 2024), and scenarios with multiple reference samples (“Multi-CWoLa”) for improved anomaly detection (Chen et al., 2022).

4. Extensions: Strong and Multi-CWoLa

Strong CWoLa (sCWoLa)

In sCWoLa, one mixture is a fully simulated pure-signal sample ($f_1 = 1$), and the other is an unlabeled mixture from real data, possibly containing signal at an unknown (small) contamination fraction $f_2$:

  • Assign “signal” label to the simulated sample, “background” label to real data (Klein et al., 19 Mar 2025).
  • Enables fully data-driven background modeling, eliminating the need for unreliable background simulation.
  • Architectures include supervised training on high-level features (with boosted decision trees) or low-level (transformer) representations.

Multi-CWoLa

For $k \geq 2$ mixtures, label each event by its mixture index. For the two-class case, this can improve statistical efficiency and furnish finite-sample guarantees when signal enrichment varies over multiple resonance/sideband regions (Chen et al., 2022). The theoretical extension leverages a graphical model over the noisy mixture-indicator “labels,” using joint agreement patterns to reduce estimation bias as the sample size $n$ grows.
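One simple way to realize the “label each event by its mixture index” step (a plain softmax-classifier sketch on toy data, not the graphical-model estimator of Chen et al., 2022) is:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three toy 1-D mixtures with decreasing signal fraction
# (signal ~ N(+1, 1), background ~ N(-1, 1), as in the earlier sketch).
fracs = [0.8, 0.5, 0.2]

def sample_mixture(n, f):
    n_sig = rng.binomial(n, f)
    return np.concatenate([rng.normal(1.0, 1.0, n_sig),
                           rng.normal(-1.0, 1.0, n - n_sig)])

xs = [sample_mixture(10_000, f) for f in fracs]
x = np.concatenate(xs)
y = np.repeat(np.arange(3), [a.size for a in xs])  # mixture index as label

# Softmax regression over mixture indices, fit by gradient descent.
W = np.zeros(3)   # per-mixture slope
B = np.zeros(3)   # per-mixture bias
onehot = np.eye(3)[y]
for _ in range(500):
    logits = np.outer(x, W) + B
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.5 * ((p - onehot) * x[:, None]).mean(axis=0)
    B -= 0.5 * (p - onehot).mean(axis=0)

print(W)  # slopes track the mixtures' signal enrichment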

5. Practical Considerations and Limitations

CWoLa’s assumptions and caveats include:

  • Mixture Difference Requirement: Performance degrades as $f_1 \to f_2$; discriminative power is maximized for large differences in signal fraction.
  • Sample Consistency: The class-conditional distributions $p(x|A)$ and $p(x|B)$ must be stable across mixtures; systematic differences in feature distributions (other than changing class prevalence) can violate optimality.
  • No Instance-level Labeling: Working points (classification thresholds) for given target efficiencies require calibration (mixture fraction estimation, or a small labeled sample) (Metodiev et al., 2017).
  • Robustness to Signal Contamination: sCWoLa is robust up to signal contamination of $f_2 \sim 10^{-3}$–$10^{-2}$ in real data but can degrade with larger leakage (Klein et al., 19 Mar 2025).
  • Domain Shifts: CWoLa avoids simulation mismodeling for backgrounds but remains dependent on accurate signal modeling in sCWoLa or in applications requiring simulation.
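The working-point calibration caveat above can be illustrated with a hypothetical snippet: given classifier scores on a small labeled holdout, pick the threshold achieving a target signal efficiency. The quantile approach and the toy Beta-distributed scores are assumptions for illustration, not a prescribed CWoLa procedure.

```python
import numpy as np

def threshold_for_efficiency(signal_scores, target_eff):
    """Cut value keeping approximately `target_eff` of labeled signal events."""
    # Keeping events with score >= t gives efficiency P(score >= t),
    # so t is the (1 - target_eff) quantile of the signal scores.
    return float(np.quantile(np.asarray(signal_scores), 1.0 - target_eff))

rng = np.random.default_rng(1)
sig = rng.beta(5, 2, size=2000)   # toy scores for a small labeled signal sample
bkg = rng.beta(2, 5, size=2000)   # toy background scores, for rejection readout

t = threshold_for_efficiency(sig, 0.80)
sig_eff = (sig >= t).mean()       # realized signal efficiency at the cut
bkg_rej = (bkg < t).mean()        # realized background rejection at the cut
print(t, sig_eff, bkg_rej)
```

Without such a labeled holdout (or an estimate of the mixture fractions), CWoLa yields a score ordering but not an absolute efficiency for any given cut.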

6. Empirical Performance and Metrics

Empirical studies consistently demonstrate that:

  • CWoLa achieves ROC-AUC near the fully supervised ceiling when the conditions above are met.
  • In HEP resonance searches, CWoLa delivered AUC $> 0.90$ for $S/B \gtrsim 4 \times 10^{-3}$; in CYGNO, performance approached the “mixture ceiling” $\mathrm{AUC}_{\max} = 0.5 + \alpha/2$ for signal fraction $\alpha$ (Amaro et al., 28 Jan 2026, Collins et al., 2021).
  • CWoLa can outperform unsupervised autoencoder anomaly detection at moderate signal fractions, whereas autoencoders provide complementary reach at very low $S/B$ (Collins et al., 2021).
  • In astrophysics, CWoLa yielded purity and completeness competitive with more complex methods, with training time and compute requirements remaining modest (Pettee et al., 2023).
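The mixture ceiling quoted above has a one-line justification: if a fraction $\alpha$ of the nominally-signal test sample is true signal and the remaining $1-\alpha$ is statistically identical to background, then even a perfect discriminator ranks the indistinguishable events at chance, giving

$$\mathrm{AUC}_{\max} = \alpha \cdot 1 + (1 - \alpha) \cdot \tfrac{1}{2} = 0.5 + \alpha/2.$$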

Table: CWoLa versus Related Methods in Collider Physics

Approach           | Labels required         | AUC (typical, moderate $S/B$) | Comments
-------------------|-------------------------|-------------------------------|---------
Fully supervised   | Full event-wise labels  | 0.85                          | Requires simulation/truth
CWoLa (2 mixtures) | Mixture membership only | 0.84–0.85                     | Needs $f_1 > f_2$; robust to unknown fractions
sCWoLa             | Simulated S + real mix  | 0.84–0.86                     | No background simulation needed
Autoencoder        | None                    | 0.75–0.80                     | Independent of $S/B$; less sensitive for moderate signals

7. Generalization and Recent Developments

CWoLa is naturally extensible to diverse domains:

  • Multi-class classification: Given $K$ pure classes and $K$ mixtures with linearly independent mixture-fraction vectors, multiclass CWoLa yields identifiability.
  • Multiple reference datasets (Multi-CWoLa): Combining information from several sidebands increases sample efficiency and provides finite-sample theoretical guarantees (Chen et al., 2022).
  • Cross-modal and open-vocabulary classification: CWoLa is integrated into pseudo-label self-training with joint confidence selection strategies, enabling open-world 3D vision alignment without class annotations (Dharmasiri et al., 2024).
  • Anomaly detection: CWoLa forms the foundation for weakly-supervised anomaly search in both high-dimensional collider and astronomical datasets (Collins et al., 2021, Pettee et al., 2023).
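The multi-class identifiability condition above can be stated concretely (this is standard mixture-inversion algebra, sketched here for intuition): writing the mixtures as $p(x|M_i) = \sum_{k=1}^{K} F_{ik}\, p(x|C_k)$ for a fraction matrix $F$, linear independence of the rows of $F$ means $F$ is invertible, so the pure-class densities are recovered in principle as

$$p(x|C_k) = \sum_{i=1}^{K} \left(F^{-1}\right)_{ki}\, p(x|M_i),$$

which is exactly what makes the pure classes identifiable from mixture-level information alone.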

Limitations remain, including the need for class-conditional distribution stability, the necessity for sufficient sample size and mixture separation, and (in sCWoLa) reliance on pure signal simulation. Recent work seeks to further reduce dependency on simulation (via data-driven estimation), extend calibration procedures, and combine CWoLa with foundation model pretraining for robust performance in high-dimensional, open-ended scientific data (Klein et al., 19 Mar 2025, Dharmasiri et al., 2024, Amaro et al., 28 Jan 2026).
