
Calibrated Discovery in ML-Driven Surveys

Updated 29 September 2025
  • Calibrated Discovery is the integration of well-calibrated probabilistic modeling and uncertainty quantification in large-scale, ML-driven scientific workflows.
  • It employs calibration transformations, active learning, and feature-space anomaly detection to align model outputs with true empirical frequencies.
  • This framework enhances resource allocation, supports rigorous population studies, and drives systematic identification of novel phenomena in massive datasets.

Calibrated discovery refers to the integration of well-calibrated probabilistic modeling and principled uncertainty quantification within large-scale machine learning–driven scientific or exploratory workflows. It aims not only to provide accurate class or structure assignments, but crucially to ensure that assigned probabilities can be directly interpreted as the true frequency with which those assignments are correct. This property enables optimal resource allocation, robust population studies, and rigorous identification of anomalies or outliers deserving further investigation. The concept is motivated by challenges in modern synoptic surveys, where the volume of data precludes individual inspection and necessitates automated, scalable frameworks that faithfully propagate and represent prediction uncertainty.

1. Principles of Calibrated Probabilistic Classification

A calibrated probabilistic classifier is one for which the assigned probabilities correspond closely to empirical frequencies: for every assigned probability $p$, among the items assigned probability $p$ for a given class, a fraction $p$ truly belong to that class. The practical realization in the MACC catalog (Richards et al., 2012) begins with the extraction of a comprehensive feature set (71 total: 66 light-curve features, 5 color features) from multi-epoch astrophysical data. A random forest (RF) classifier is trained on labeled data spanning 28 variable star classes, yielding posterior class probability vectors for each object. However, raw output probabilities from tree-based ensembles are often over-conservative and do not reflect true frequencies, manifesting as deviations in reliability diagrams.
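The reliability check described above can be sketched with a random forest on synthetic data (a stand-in for the 71 MACC features, which are not reproduced here): train the classifier, take the posterior probability vectors, and compare binned top-class probabilities against empirical accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the light-curve/color feature matrix (hypothetical data).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)          # posterior class-probability vectors

# Crude reliability diagram: among objects whose top probability falls in a
# bin, how often is the top-class prediction actually correct?
top_p = proba.max(axis=1)
correct = (proba.argmax(axis=1) == y_te)
for lo in (0.5, 0.7, 0.9):
    mask = (top_p >= lo) & (top_p < lo + 0.2)
    if mask.any():
        print(f"p in [{lo:.1f},{lo+0.2:.1f}): empirical accuracy "
              f"{correct[mask].mean():.2f} over {mask.sum()} objects")
```

A well-calibrated model would show empirical accuracy close to the bin center in each row; the over-conservatism the paper describes appears as systematic deviation between the two.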

To correct these systematic biases, the authors apply a parametric calibration transformation reminiscent of Boström's method, where, given object $i$ and class $j$,

$$\hat{p}_{ij} = \begin{cases} p_{ij} + r(1 - p_{ij}) & \text{if } p_{ij} = \max_{k} p_{ik} \\ p_{ij}(1 - r) & \text{otherwise} \end{cases}$$

The scalar $r$ is not fixed, but is parameterized as a sigmoid of the classifier margin $\Delta$ (the difference between the top two predicted class probabilities): $r(\Delta) = 1 / (1 + e^{A\Delta + B})$. The parameters $A$ and $B$ are chosen by minimizing the Brier score $$B(\hat{p}) = \frac{1}{N}\sum_{i=1}^N \sum_{j=1}^{C} \left(\mathbb{I}(y_i = j) - \hat{p}_{ij}\right)^2$$ This tuning produces probabilities that are approximately frequency-aligned over the validation set, enabling direct thresholding for purity–efficiency trade-offs.
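A minimal sketch of this transformation, with a brute-force grid search over $A$ and $B$ standing in for whatever optimizer the authors used (the grid and its range are assumptions for illustration):

```python
import numpy as np

def calibrate(P, A, B):
    """Apply the margin-dependent calibration map to raw RF probabilities P (N x C)."""
    top = P.argmax(axis=1)
    s = np.sort(P, axis=1)
    margin = s[:, -1] - s[:, -2]                 # Delta: top-two probability gap
    r = 1.0 / (1.0 + np.exp(A * margin + B))     # r(Delta), sigmoid of the margin
    out = P * (1.0 - r[:, None])                 # shrink all non-winning classes
    rows = np.arange(len(P))
    out[rows, top] = P[rows, top] + r * (1.0 - P[rows, top])  # boost the winner
    return out                                   # rows still sum to 1

def brier(P_hat, y, C):
    """Multiclass Brier score against one-hot labels y."""
    Y = np.eye(C)[y]
    return np.mean(np.sum((Y - P_hat) ** 2, axis=1))

def fit_AB(P, y, grid=np.linspace(-5.0, 5.0, 21)):
    """Pick (A, B) by Brier-score minimization over a coarse grid (illustrative)."""
    C = P.shape[1]
    best = min((brier(calibrate(P, A, B), y, C), A, B)
               for A in grid for B in grid)
    return best[1], best[2]
```

Note that the piecewise map conserves probability mass: moving $r(1 - p_{\mathrm{top}})$ onto the winning class exactly offsets the factor $(1-r)$ applied to the others, so the output remains a valid probability vector.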

2. Active Learning, Sample Selection, and Feature Engineering

To minimize sample-selection bias and extend class coverage, the training data are actively expanded. After an initial round of supervised labeling using cross-survey sources (Hipparcos, OGLE, SIMBAD), an active learning loop queries the current RF classifier for cases with high uncertainty or ambiguous predictions, followed by manual vetting. Class definitions are also refined to resolve subclasses (e.g., classical vs. weak-line T Tauri stars). The feature set includes careful discrimination of aliased periods, with a period–quality decision rule in the $(P,\ \mathrm{period\ significance})$ space used to flag common photometric artifacts: $$s_P(x) = \alpha_{1,P}\,|x - P|^{1/4} + \alpha_{2,P}$$ This preprocessing ensures that learned features reflect true astrophysical variability, not systematic survey effects.
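The query step of such an active learning loop can be sketched as a margin-based ranking: objects whose top two class probabilities are nearly tied are the most ambiguous and are sent for manual vetting (the function name and batch size here are illustrative, not from the paper):

```python
import numpy as np

def query_uncertain(proba, k=10):
    """Rank unlabeled objects by classifier margin (gap between the top two
    class probabilities); return the k most ambiguous for manual vetting."""
    s = np.sort(proba, axis=1)
    margin = s[:, -1] - s[:, -2]        # small margin = ambiguous prediction
    return np.argsort(margin)[:k]
```

After vetting, the newly labeled objects are appended to the training set and the RF is retrained, iterating until the ambiguous pool is exhausted or label budget runs out.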

3. Feature-Space Anomaly Detection

Beyond calibrated assignment to known classes, the framework provides systematic anomaly detection—central to discovery in massive, unlabeled domains. Using the RF's tree-based structure, a feature-space proximity $\rho_{ij}$ between objects $i$ and $j$ is defined as the fraction of trees in which they co-terminate at leaf nodes. The associated discrepancy metric

$$d(\mathbf{x}_i, \mathbf{x}_j) = \frac{1 - \rho_{ij}}{\rho_{ij}}$$

leads to a sample-specific anomaly score: for each object, its distance to the second-nearest labeled training sample in this metric. The empirical distribution of these scores is used to set a cross-validated threshold $t^* \approx 10.0$, above which an object can be flagged as an outlier or "anomalous." This unsupervised technique is critical for identifying potential novel types or problematic cases not anticipated in the initial taxonomy.
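A minimal sketch of this proximity-based score, assuming scikit-learn's `RandomForestClassifier.apply` to obtain per-tree leaf indices (the synthetic data and helper name are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the labeled training set (hypothetical data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

leaves_train = rf.apply(X)              # (n_train, n_trees) leaf indices

def anomaly_score(x_new):
    """Distance to the 2nd-nearest training object under the RF proximity
    rho_ij = fraction of trees in which i and j land in the same leaf."""
    leaves_new = rf.apply(x_new.reshape(1, -1))          # (1, n_trees)
    rho = (leaves_new == leaves_train).mean(axis=1)      # proximity to each train obj
    d = (1.0 - rho) / np.maximum(rho, 1e-12)             # discrepancy (1-rho)/rho
    return np.sort(d)[1]                                 # 2nd-nearest distance
```

In the catalog setting, this score would be computed for every unlabeled object and compared against the cross-validated threshold $t^*$ to flag candidates for inspection.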

4. Probabilistic Catalog Construction and Results

The methodology is applied to the All Sky Automated Survey (ASAS), yielding the Machine-learned ASAS Classification Catalog (MACC) of 50,124 sources with the following properties for each source:

  • Posterior class probabilities for 28 scientifically defined variability classes.
  • Calibrated probabilities per the transformation above, empirically validated by reliability diagrams to match observed frequencies.
  • Feature-space anomaly score as a discovery marker.

The cross-validated classification error is sub-20%. Validation shows that MACC recovers many variable star assignments cited in the ACVS (ASAS Catalog of Variable Stars, which had classified only 24% of the sample into 12 classes) and also identifies new candidates—in particular, for underrepresented classes such as Mira variables—across different probability threshold regimes. Researchers can select thresholds (e.g., probability $> 0.95$ for high-purity samples) for targeted follow-up, or maximize sample yield at lower purity depending on scientific goals.
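Threshold-based selection from such a catalog reduces to a simple cut on the calibrated probability column; with calibrated outputs, $1 - \tau$ bounds the expected contamination of the selected sample (function and class names here are illustrative):

```python
import numpy as np

def select_at_threshold(proba, classes, target_class, tau):
    """Return indices of objects whose calibrated probability for
    `target_class` exceeds tau. With calibrated probabilities, the
    expected contamination of this sample is at most 1 - tau."""
    j = classes.index(target_class)
    return np.flatnonzero(proba[:, j] > tau)
```

Lowering `tau` trades purity for completeness, which is exactly the purity–efficiency dial the catalog exposes.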

5. Population Recalibration and Bayesian Updating

To facilitate accurate population studies and post hoc reinterpretation, MACC exposes calibrated posterior vectors and documents the prior implicitly imposed by the training-set class frequencies. When external information or population priors are available (e.g., spatial demographics or known sky-region statistics), users can analytically adjust the class probabilities via $$P(c_j \mid \mathbf{x}_i) = \frac{P(\mathbf{x}_i \mid c_j)\, P_{\mathrm{tr}}(c_j)}{\sum_k P(\mathbf{x}_i \mid c_k)\, P_{\mathrm{tr}}(c_k)}$$ Because this form allows re-weighting by alternative priors, the recalibrated posteriors preserve statistical interpretability across varying scientific use cases.
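The re-weighting itself follows directly from Bayes' rule: dividing out the training prior recovers something proportional to the likelihood, and multiplying by the new prior and renormalizing gives the updated posterior. A minimal sketch (the function name is an assumption for illustration):

```python
import numpy as np

def reweight_posterior(p, prior_train, prior_new):
    """Swap the training-set prior out of a calibrated posterior vector p
    and replace it with an externally supplied population prior."""
    w = p * (np.asarray(prior_new) / np.asarray(prior_train))
    return w / w.sum()   # renormalize to a valid probability vector
```

When the new prior equals the training prior the posterior is unchanged, which is a useful sanity check when applying this to catalog columns.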

6. Scalability and Prospects for Next-Generation Surveys

The architecture—modular random-forest classification, margin-based calibration, and proximity-based anomaly detection—directly supports scalability to larger and more complex surveys. The design is agnostic to the particular set of extracted features or taxonomic scope, making it suitable for adaptation to datasets such as LSST, ZTF, or multi-band time-domain photometry. The anomaly detection apparatus is particularly pertinent for new discovery in high-dimensional feature spaces as survey depth and sample size increase.

Open directions include extending the framework for richer anomaly interpretation, joint modeling of multi-modal data, and integrating more sophisticated semi-supervised or transfer learning strategies to better cope with sparsely labeled regimes or evolving taxonomies.

7. Impact and Implications

By combining calibrated probabilistic outputs with robust anomaly detection, MACC enables:

  • Optimized allocation of limited follow-up resources (e.g., ground-based telescopes) to high-value candidates.
  • Rigorous estimation of sample purity and contamination for population studies.
  • Systematic identification of outliers, enabling first-principles "discovery" in large, heterogeneous datasets.

This approach exemplifies the formal concept of “calibrated discovery,” in which all algorithmic outputs are endowed with interpretable uncertainty quantification, and both the main cataloging and the search for novel phenomena are approached within a single calibrated probabilistic framework. The strategy is foundational for turning massive, automated synoptic surveys from data-rich repositories into engines of scientifically robust discovery.
