PineForest Active Anomaly Detection

Updated 14 July 2025

PineForest is an active anomaly detection algorithm that uses expert feedback to iteratively refine tree ensembles for identifying rare events.
It leverages a selective tree pruning strategy that enhances anomaly ranking by discarding trees misaligned with expert-labelled data.
The integration of external features like real–bogus scores significantly reduces false positives and improves discovery in data-rich applications.

The PineForest Active Anomaly Detection Algorithm is an active learning-based framework for identifying rare and scientifically interesting anomalies within large, high-dimensional datasets, particularly in time-domain astronomy and other data-rich scientific domains. Developed in the context of the SNAD (Supernova Hunters’ Anomaly Detection) project and further implemented in the coniferest Python package, PineForest builds upon principles of tree ensemble methods—most notably Isolation Forest—by incorporating expert feedback in an iterative, “active” fashion. Its distinguishing feature is the selective refinement or discarding of trees in the ensemble to enhance anomaly ranking based on user critique, enabling flexible adaptation to specific definitions of interesting phenomena and mitigating the prevalence of false positives associated with artifacts and spurious detections (Kornilov et al., 22 Oct 2024, Pruzhinskaya et al., 8 Jul 2025).

1. Mathematical Foundation and Algorithmic Structure

PineForest adapts the Isolation Forest paradigm to the active anomaly detection setting. In standard Isolation Forest, anomalies are identified as points that require shorter path lengths to “isolate” in randomly built decision trees. The anomaly score for a point $\mathbf{x}$ in a dataset of $n$ points is: $s(\mathbf{x}, n) = 2^{-E(h(\mathbf{x}))/c(n)}$ where $E(h(\mathbf{x}))$ is the expected path length for $\mathbf{x}$ across the forest, and $c(n)$ is a normalization factor (typically $c(n) = 2H(n-1) - 2(n-1)/n$ with $H$ the harmonic number) (Pruzhinskaya et al., 8 Jul 2025).
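As a worked illustration of the score formula above, the following minimal Python sketch evaluates $c(n)$ and $s(\mathbf{x}, n)$ directly. The function names (`harmonic`, `c`, `anomaly_score`) are chosen here for illustration and do not come from any particular library; the harmonic number is summed exactly rather than approximated.

```python
def harmonic(k: int) -> float:
    """Exact k-th harmonic number H(k) = 1 + 1/2 + ... + 1/k, with H(0) = 0."""
    return sum(1.0 / i for i in range(1, k + 1))


def c(n: int) -> float:
    """Normalization factor c(n) = 2 H(n-1) - 2 (n-1)/n, used here for n >= 2."""
    if n < 2:
        raise ValueError("c(n) is only used for n >= 2 in this sketch")
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length: float, n: int) -> float:
    """Isolation Forest score s(x, n) = 2 ** (-E(h(x)) / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))


n = 256
print(anomaly_score(c(n), n))                          # 0.5: path length equal to c(n)
print(anomaly_score(2.0, n) > anomaly_score(12.0, n))  # True: shorter paths score higher
```

A point whose expected path length equals $c(n)$ scores exactly 0.5, and shorter paths push the score toward 1, which is why easily isolated points rank as more anomalous.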

PineForest introduces a tree selection and refinement procedure driven by expert feedback:

Each data point $\mathbf{x}_j$ is assigned a label score $y_j$:

$y_j = \begin{cases} -1, & \text{if } \mathbf{x}_j \text{ is labeled as an anomaly} \\ 0, & \text{if } \mathbf{x}_j \text{ is unlabeled} \\ 1, & \text{if } \mathbf{x}_j \text{ is labeled as a regular point} \end{cases}$

Each tree $\mathbf{t}_i$ is then scored by:

$s(\mathbf{t}_i) = \sum_{j=1}^N y_j \cdot d(l_i(\mathbf{x}_j))$

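where $N$ is the number of data points, $l_i(\mathbf{x}_j)$ is the leaf of tree $\mathbf{t}_i$ that contains $\mathbf{x}_j$, and $d(l_i(\mathbf{x}_j))$ is the depth of that leaf. Unlabeled points have $y_j = 0$ and therefore do not contribute to the sum.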

Trees with low $s(\mathbf{t}_i)$ scores—those that do not place regulars at high depths or anomalies at low depths—are iteratively discarded, refining the forest to align more closely with expert-determined notions of anomaly (Kornilov et al., 22 Oct 2024).
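The tree score can be illustrated with a small Python sketch. It assumes the leaf depths have already been collected into a nested list `depths[i][j]` (tree $i$, point $j$); that layout and the helper names `tree_scores` and `keep_best_trees` are hypothetical and used only for illustration.

```python
from typing import List, Sequence


def tree_scores(depths: Sequence[Sequence[int]], labels: Sequence[int]) -> List[int]:
    """s(t_i) = sum_j y_j * d(l_i(x_j)) for every tree i.

    depths[i][j] is the leaf depth of point j in tree i;
    labels[j] is -1 (anomaly), 0 (unlabeled) or +1 (regular).
    """
    return [sum(y * d for y, d in zip(labels, row)) for row in depths]


def keep_best_trees(depths: Sequence[Sequence[int]], labels: Sequence[int],
                    n_keep: int) -> List[int]:
    """Return the indices of the n_keep highest-scoring trees, in index order."""
    scores = tree_scores(depths, labels)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Four points: x0 labeled anomaly, x1 unlabeled, x2 and x3 labeled regular.
labels = [-1, 0, 1, 1]
depths = [
    [2, 5, 8, 7],  # anomaly shallow, regulars deep -> agrees with the labels
    [6, 4, 3, 4],  # anomaly deep, regulars shallow -> disagrees
    [3, 6, 6, 6],
]
print(tree_scores(depths, labels))         # [13, 1, 9]
print(keep_best_trees(depths, labels, 2))  # [0, 2]
```

The second tree isolates the labeled anomaly late and the regular points early, so it receives the lowest score and is the one discarded.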

2. Active Learning Loop

PineForest operates within an active learning loop:

Initialization: Build an Isolation Forest (randomized tree ensemble) on the data.
Candidate Selection: Present the highest-scoring anomalies to the expert for labeling.
Label Aggregation: The expert assigns labels (anomaly, regular, or abstain) to sampled points.
Tree Scoring and Pruning: Score each tree according to labeled points; discard a fraction (often up to 90%) of trees with the poorest agreement with the labeled data.
Re-Training/Score Recalculation: Remaining trees are used to recompute anomaly scores for all points; candidate ranking is updated.
Iteration: Repeat the query-label-update cycle until a stopping criterion is met (e.g., labeling budget exhausted or convergence in anomaly detection rates).
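The steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the coniferest implementation: the tiny isolation-tree builder, the `oracle` callback standing in for the expert, the `keep_fraction` parameter, and the choice to re-select trees from the full initial pool at every round (so that repeated pruning does not exhaust the forest) are all assumptions made here. Candidates are ranked by mean leaf depth over the kept trees, which orders points the same way as the anomaly score.

```python
import random


def build_tree(points, rng, depth=0, max_depth=8):
    """Grow one isolation tree with random features and random split values."""
    if len(points) <= 1 or depth >= max_depth:
        return {"depth": depth}
    f = rng.randrange(len(points[0]))
    lo, hi = min(p[f] for p in points), max(p[f] for p in points)
    if lo == hi:
        return {"depth": depth}
    cut = rng.uniform(lo, hi)
    return {
        "f": f,
        "cut": cut,
        "left": build_tree([p for p in points if p[f] < cut], rng, depth + 1, max_depth),
        "right": build_tree([p for p in points if p[f] >= cut], rng, depth + 1, max_depth),
    }


def leaf_depth(tree, x):
    """Depth of the leaf that point x falls into."""
    while "f" in tree:
        tree = tree["left"] if x[tree["f"]] < tree["cut"] else tree["right"]
    return tree["depth"]


def pineforest_loop(data, oracle, n_trees=100, keep_fraction=0.1, budget=5, seed=0):
    """Toy query-label-prune loop; oracle(j) returns -1 (anomaly) or +1 (regular)."""
    rng = random.Random(seed)
    sample_size = min(64, len(data))
    pool = [build_tree(rng.sample(data, sample_size), rng) for _ in range(n_trees)]
    # depths[i][j]: leaf depth of point j in tree i, computed once and reused.
    depths = [[leaf_depth(t, x) for x in data] for t in pool]
    n_keep = max(1, int(n_trees * keep_fraction))
    kept = list(range(n_trees))  # start from the full forest
    labels = {}                  # point index -> expert label
    for _ in range(budget):
        candidates = [j for j in range(len(data)) if j not in labels]
        if not candidates:
            break
        # Smaller mean depth means easier to isolate, i.e. more anomalous.
        mean_depth = {j: sum(depths[i][j] for i in kept) / len(kept) for j in candidates}
        query = min(candidates, key=mean_depth.get)
        labels[query] = oracle(query)
        # Score every tree in the pool against all labels gathered so far.
        scores = [sum(y * depths[i][j] for j, y in labels.items()) for i in range(n_trees)]
        kept = sorted(range(n_trees), key=lambda i: scores[i], reverse=True)[:n_keep]
    return labels, kept


gen = random.Random(1)
data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(60)] + [(8.0, 8.0), (-9.0, 7.5)]
true_anomalies = {60, 61}
labels, kept = pineforest_loop(data, lambda j: -1 if j in true_anomalies else 1)
print(labels, len(kept))
```

Because the leaf depths are computed once, each round only re-scores and re-selects trees; no tree is rebuilt or modified.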

This approach emphasizes computational efficiency: PineForest does not change the branch structure or weights of trees, but refines the ensemble composition. The iterative removal of discordant trees allows rapid, label-efficient adaptation of the model (Kornilov et al., 22 Oct 2024).

3. Integration with Feature Augmentation and Artifact Filtering

Recent implementations of PineForest have been enhanced with the integration of external classifier scores, such as real–bogus predictions from supervised learning models (e.g., Random Forest classifiers trained to separate genuine events from artifacts in astronomical data). The “real–bogus” score is appended to the feature set before anomaly ranking, allowing the active anomaly detector to deprioritize artifacts effectively.
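Appending the score amounts to adding one column to the feature matrix before the forest is built. The sketch below shows this with plain Python lists; the helper name `append_real_bogus` and the convention that the score lies in [0, 1] with higher values meaning "more likely real" are assumptions for illustration, not details taken from the cited work.

```python
from typing import List, Sequence


def append_real_bogus(features: Sequence[Sequence[float]],
                      rb_scores: Sequence[float]) -> List[List[float]]:
    """Return a copy of the feature rows with the real-bogus score as a last column."""
    if len(features) != len(rb_scores):
        raise ValueError("need exactly one real-bogus score per object")
    return [list(row) + [float(score)] for row, score in zip(features, rb_scores)]


light_curve_features = [[0.12, 3.4], [0.98, 1.1]]  # made-up feature values
rb_scores = [0.95, 0.05]                           # made-up classifier outputs
augmented = append_real_bogus(light_curve_features, rb_scores)
print(augmented)  # [[0.12, 3.4, 0.95], [0.98, 1.1, 0.05]]
```

The anomaly detector then treats the score as one more feature, so trees can split on it and expert labels on artifacts can steer tree selection away from them.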

Experiments with 67 million ZTF DR17 light curves demonstrated that incorporating a real–bogus feature into the active pipeline reduced artifact contamination from 27% to 3% among top-ranked candidates, with no adverse effect on the discovery of astrophysically valuable anomalies (Semenikhin et al., 16 Sep 2024). The static Isolation Forest, lacking iterative feedback, showed much less benefit from this feature augmentation, highlighting the advantage of the active, feedback-driven loop in PineForest.

4. Relationship to Other Active Anomaly Detection Methods

PineForest shares conceptual lineage with other active learning anomaly detection algorithms, particularly Active Anomaly Discovery (AAD) (Ishida et al., 2019, Semenikhin et al., 16 Sep 2024), which reweights tree leaves based on hinge loss optimization and explicit expert feedback. The critical distinction is that PineForest does not adjust leaf weights but instead prunes whole trees, selecting for trees that most closely reflect the expert’s separation of regulars and anomalies (Kornilov et al., 22 Oct 2024).

Other ensemble and tree refinement approaches—such as those that update tree or node weights (Bodor et al., 2022), employ compact subspace descriptions for query diversity (Das et al., 2018), or track concept drift via KL divergence—can be integrated with or complement the PineForest workflow.

A plausible implication is that PineForest’s architecture may be beneficial in settings where the underlying geometry of anomalies is not easily adapted via local weight changes, but can be efficiently modulated through ensemble selection and global tree pruning.

5. Applications and Empirical Validation

PineForest has been tested on synthetic benchmarks as well as real-world astronomical time-domain datasets:

In the SNAD VIII Workshop, PineForest detected multiple previously undiscovered variable stars and refined classifications for known variables using light curve data from ZTF fields overlapping with the LSSTComCam footprint (Pruzhinskaya et al., 8 Jul 2025). Experts inspected ~400 candidate light curves in multiple fields, and the algorithm’s ranking, refined by expert feedback, systematically elevated scientifically interesting objects.
In the broader SNAD anomaly detection pipeline and the coniferest package, PineForest has contributed to discoveries including binary microlensing events, new variable stars, and optical counterparts to radio sources (Kornilov et al., 22 Oct 2024).
Use cases extend to any massive, feature-rich dataset where rare anomalies are valuable and precise, label-efficient discovery is required—such as in survey astronomy, industrial monitoring, or fraud detection.

6. Limitations and Future Directions

PineForest relies on the iterative provision of high-quality expert feedback to reach its full potential. Its effectiveness can diminish if labeled samples are too few, unrepresentative, or highly noisy. The method’s dependence on the initial randomization and size of the ensemble, and the choice of how aggressively to prune trees at each iteration, may influence stability and sensitivity to diverse anomaly types.

Future work may explore integration with preference embeddings (Leveni et al., 15 May 2025), functional or spectral methods for handling richer data modalities (Staerman et al., 2019), scaling to even larger tree ensembles, and more formal analysis of tree pruning strategies relative to sample complexity and discovery rate.

7. Summary Table: PineForest in Context

Aspect	PineForest Approach	Closest Comparator(s)
Active learning modality	Tree selection/pruning based on expert-labeled data	AAD (leaf reweighting)
Feedback utilization	Iterative, label-driven discarding of poor trees	Leaf/node weighting
Artifact handling	Feature augmentation (e.g., real–bogus scores)	Varies; may or may not use such features
Computational strategy	Fast re-scoring via forest pruning; no tree edits	Weight or structure modifications
Empirical scope	Astronomy (ZTF/LSST), synthetic, monitoring data	Similar
Key strengths	Label-efficient, interpretable, minimal recomputation	Varies by method

PineForest’s iterative, feedback-driven ensemble refinement offers a mathematically transparent, interpretable, and computationally tractable approach to anomaly detection in demanding real-world settings. It is aimed at problems where conventional unsupervised methods suffer from high false positive rates and poor alignment with domain-specific notions of interest.