PineForest Active Anomaly Detection

Updated 14 July 2025
  • PineForest is an active anomaly detection algorithm that uses expert feedback to iteratively refine tree ensembles for identifying rare events.
  • It leverages a selective tree pruning strategy that enhances anomaly ranking by discarding trees misaligned with expert-labelled data.
  • The integration of external features like real–bogus scores significantly reduces false positives and improves discovery in data-rich applications.

The PineForest Active Anomaly Detection Algorithm is an active learning-based framework for identifying rare and scientifically interesting anomalies within large, high-dimensional datasets, particularly in time-domain astronomy and other data-rich scientific domains. Developed in the context of the SNAD (Supernova Hunters’ Anomaly Detection) project and further implemented in the coniferest Python package, PineForest builds upon principles of tree ensemble methods—most notably Isolation Forest—by incorporating expert feedback in an iterative, “active” fashion. Its distinguishing feature is the selective refinement or discarding of trees in the ensemble to enhance anomaly ranking based on user critique, enabling flexible adaptation to specific definitions of interesting phenomena and mitigating the prevalence of false positives associated with artifacts and spurious detections (2410.17142, Pruzhinskaya et al., 8 Jul 2025).

1. Mathematical Foundation and Algorithmic Structure

PineForest adapts the Isolation Forest paradigm to the active anomaly detection setting. In standard Isolation Forest, anomalies are identified as points that require shorter path lengths to "isolate" in randomly built decision trees. The anomaly score for a point $\mathbf{x}$ in a dataset of $n$ points is

$s(\mathbf{x}, n) = 2^{-E(h(\mathbf{x}))/c(n)}$

where $E(h(\mathbf{x}))$ is the expected path length for $\mathbf{x}$ across the forest, and $c(n)$ is a normalization factor, typically $c(n) = 2H(n-1) - 2(n-1)/n$ with $H$ the harmonic number (Pruzhinskaya et al., 8 Jul 2025).
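The score and its normalization are easy to verify numerically. A minimal NumPy sketch (not the coniferest implementation; `harmonic`, `c`, and `anomaly_score` are illustrative names):

```python
import numpy as np

def harmonic(n):
    # n-th harmonic number H(n) = 1 + 1/2 + ... + 1/n
    return float(np.sum(1.0 / np.arange(1, n + 1)))

def c(n):
    # Normalization factor c(n) = 2 H(n-1) - 2 (n-1)/n: the average path
    # length of an unsuccessful search in a binary tree built on n points.
    if n <= 1:
        return 0.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    # s(x, n) = 2^(-E(h(x)) / c(n)): scores approach 1 for points isolated
    # near the root and fall below 0.5 for points buried deep in the trees.
    return 2.0 ** (-mean_path_length / c(n))
```

A point whose mean path length equals the average, $E(h(\mathbf{x})) = c(n)$, scores exactly 0.5; shorter paths push the score toward 1.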

PineForest introduces a tree selection and refinement procedure driven by expert feedback:

  • Each data point $\mathbf{x}_j$ is assigned a label score $y_j$:

$y_j = \begin{cases} -1, & \text{if } \mathbf{x}_j \text{ is labeled as an anomaly} \\ 0, & \text{if } \mathbf{x}_j \text{ is unlabeled} \\ 1, & \text{if } \mathbf{x}_j \text{ is labeled as a regular point} \end{cases}$

  • Each tree $\mathbf{t}_i$ is then scored by:

$s(\mathbf{t}_i) = \sum_{j=1}^N y_j \cdot d(l_i(\mathbf{x}_j))$

where $d(l_i(\mathbf{x}_j))$ is the depth of the leaf $l_i(\mathbf{x}_j)$ into which $\mathbf{x}_j$ falls in tree $\mathbf{t}_i$.

Trees with low s(ti)s(\mathbf{t}_i) scores—those that do not place regulars at high depths or anomalies at low depths—are iteratively discarded, refining the forest to align more closely with expert-determined notions of anomaly (2410.17142).
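The tree score and the pruning step fit in a few lines of NumPy. A toy sketch, not the coniferest implementation; `depths`, `keep_fraction`, and the function names are hypothetical:

```python
import numpy as np

def tree_scores(depths, labels):
    # depths[i, j] = d(l_i(x_j)): leaf depth of point j in tree i.
    # labels[j] = y_j in {-1, 0, +1} (anomaly / unlabeled / regular).
    # Returns s(t_i) = sum_j y_j * d(l_i(x_j)) for every tree at once.
    return depths @ labels

def prune_forest(depths, labels, keep_fraction=0.1):
    # Keep the trees that best agree with the expert: those that isolate
    # labeled anomalies near the root (small depth) and bury labeled
    # regulars deep, i.e. the highest-scoring trees.
    scores = tree_scores(depths, labels)
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    return np.argsort(scores)[::-1][:n_keep]
```

With two trees and two labeled points (one regular, one anomaly), the tree that places the regular deep and the anomaly shallow gets the positive score and survives pruning.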

2. Iterative Active Learning and Refinement Process

PineForest operates within an active learning loop:

  1. Initialization: Build an Isolation Forest (randomized tree ensemble) on the data.
  2. Candidate Selection: Present the highest-scoring anomalies to the expert for labeling.
  3. Label Aggregation: The expert assigns labels (anomaly, regular, or abstain) to sampled points.
  4. Tree Scoring and Pruning: Score each tree according to labeled points; discard a fraction (often up to 90%) of trees with the poorest agreement with the labeled data.
  5. Re-Training/Score Recalculation: Remaining trees are used to recompute anomaly scores for all points; candidate ranking is updated.
  6. Iteration: Repeat the query-label-update cycle until a stopping criterion is met (e.g., labeling budget exhausted or convergence in anomaly detection rates).

This approach emphasizes computational efficiency: PineForest does not change the branch structure or weights of trees, but refines the ensemble composition. The iterative removal of discordant trees allows rapid, label-efficient adaptation of the model (2410.17142).
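The six steps above can be simulated end to end in a toy script. Here each "tree" is reduced to a precomputed vector of leaf depths and the expert is a stub function; the forest construction, the labels, and the 10%-per-round pruning rate are all illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n_trees, n_points, budget = 100, 50, 5
# Toy stand-in for step 1: depths[i, j] = depth of point j in tree i.
depths = rng.integers(1, 10, size=(n_trees, n_points)).astype(float)

def oracle(j):
    # Hypothetical expert for step 3: pretend points 0-4 are true anomalies.
    return -1.0 if j < 5 else 1.0

labels = np.zeros(n_points)
active = np.arange(n_trees)  # indices of trees still in the forest

for _ in range(budget):
    # Step 2: rank candidates; shorter mean depth = more isolated = more anomalous.
    mean_depth = depths[active].mean(axis=0)
    candidate = int(np.argmin(np.where(labels == 0, mean_depth, np.inf)))
    # Step 3: the expert labels the queried point.
    labels[candidate] = oracle(candidate)
    # Steps 4-5: score trees against the labels and drop the worst 10%
    # (the pruning fraction is a free parameter; the text reports up to 90%).
    scores = depths[active] @ labels
    keep = max(1, int(0.9 * len(active)))
    active = active[np.argsort(scores)[::-1][:keep]]
```

Note that only the set `active` changes between iterations; the depth table, i.e. the tree structures, is never recomputed, which is the source of the method's label efficiency.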

3. Integration with Feature Augmentation and Artifact Filtering

Recent implementations of PineForest have been enhanced with the integration of external classifier scores, such as real–bogus predictions from supervised learning models (e.g., Random Forest classifiers trained to separate genuine events from artifacts in astronomical data). The “real–bogus” score is appended to the feature set before anomaly ranking, allowing the active anomaly detector to deprioritize artifacts effectively.
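Mechanically, the augmentation is just one extra feature column appended before the forest is built. A minimal sketch with hypothetical array names and shapes:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical inputs: light-curve features for 1000 objects and a
# real-bogus score in [0, 1] from a separate supervised classifier
# (values near 1 meaning "likely a genuine astrophysical event").
features = rng.normal(size=(1000, 42))
rb_score = rng.uniform(size=1000)

# Append the classifier output as one extra column; the tree ensemble
# can then split on it, and the active loop can down-rank artifacts.
augmented = np.hstack([features, rb_score[:, None]])
```

The detector itself is unchanged; it simply sees one more dimension to split on, and the expert feedback then selects trees that exploit it.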

Experiments with 67 million ZTF DR17 light curves demonstrated that incorporating a real–bogus feature into the active pipeline reduced artifact contamination from 27% to 3% among top-ranked candidates, with no adverse effect on the discovery of astrophysically valuable anomalies (2409.10256). The static Isolation Forest, lacking iterative feedback, showed much less benefit from this feature augmentation, highlighting the advantage of the active, feedback-driven loop in PineForest.

4. Relation to Other Active Anomaly Detection Methods

PineForest shares conceptual lineage with other active learning anomaly detection algorithms, particularly Active Anomaly Discovery (AAD) (Ishida et al., 2019, 2409.10256), which reweights tree leaves based on hinge loss optimization and explicit expert feedback. The critical distinction is that PineForest does not adjust leaf weights but instead prunes whole trees, selecting for trees that most closely reflect the expert’s separation of regulars and anomalies (2410.17142).

Other ensemble and tree refinement approaches—such as those that update tree or node weights (Bodor et al., 2022), employ compact subspace descriptions for query diversity (Das et al., 2018), or track concept drift via KL divergence—can be integrated with or complement the PineForest workflow.

A plausible implication is that PineForest’s architecture may be beneficial in settings where the underlying geometry of anomalies is not easily adapted via local weight changes, but can be efficiently modulated through ensemble selection and global tree pruning.

5. Applications and Empirical Validation

PineForest has been tested on synthetic benchmarks as well as real-world astronomical time-domain datasets:

  • In the SNAD VIII Workshop, PineForest detected multiple previously undiscovered variable stars and refined classifications for known variables using light curve data from ZTF fields overlapping with the LSSTComCam footprint (Pruzhinskaya et al., 8 Jul 2025). Experts inspected ~400 candidate light curves in multiple fields, and the algorithm’s ranking, refined by expert feedback, systematically elevated scientifically interesting objects.
  • In the broader SNAD anomaly detection pipeline and the coniferest package, PineForest has contributed to discoveries including binary microlensing events, new variable stars, and optical counterparts to radio sources (2410.17142).
  • Use cases extend to any massive, feature-rich dataset where rare anomalies are valuable and precise, label-efficient discovery is required—such as in survey astronomy, industrial monitoring, or fraud detection.

6. Limitations and Future Directions

PineForest relies on the iterative provision of high-quality expert feedback to reach its full potential. Its effectiveness can diminish if labeled samples are too few, unrepresentative, or highly noisy. Stability and sensitivity to diverse anomaly types also depend on the initial randomization and size of the ensemble, and on how aggressively trees are pruned at each iteration.

Future work may explore integration with preference embeddings (2505.10441), functional or spectral methods for handling richer data modalities (Staerman et al., 2019), scaling to even larger tree ensembles, and more formal analysis of tree pruning strategies relative to sample complexity and discovery rate.

7. Summary Table: PineForest in Context

| Aspect | PineForest Approach | Closest Comparator(s) |
|---|---|---|
| Active learning modality | Tree selection/pruning based on expert-labeled data | AAD (tree weight tuning) |
| Feedback utilization | Iterative, label-driven discarding of poor trees | Leaf/node weighting |
| Artifact handling | Feature augmentation (e.g., real–bogus scores) | Feature-informed, or not |
| Computational strategy | Fast re-scoring via forest pruning; no tree edits | Weight or structure mods |
| Empirical scope | Astronomy (ZTF/LSST), synthetic, monitoring data | Similar |
| Key strengths | Label-efficient, interpretable, minimal recomputation | Variable (see above) |

PineForest’s iterative, feedback-driven ensemble refinement has demonstrated substantial potential for efficient, adaptable anomaly detection in demanding real-world settings, offering a mathematically transparent, interpretable, and computationally tractable solution to problems where conventional unsupervised methods face high false positive rates and inadequate alignment with domain-specific notions of interest.