Extended Isolation Forest (1811.02141v3)

Published 6 Nov 2018 in cs.LG, astro-ph.IM, and stat.ML

Abstract: We present an extension to the model-free anomaly detection algorithm, Isolation Forest. This extension, named Extended Isolation Forest (EIF), resolves issues with assignment of anomaly score to given data points. We motivate the problem using heat maps for anomaly scores. These maps suffer from artifacts generated by the criteria for branching operation of the binary tree. We explain this problem in detail and demonstrate the mechanism by which it occurs visually. We then propose two different approaches for improving the situation. First we propose transforming the data randomly before creation of each tree, which results in averaging out the bias. Second, which is the preferred way, is to allow the slicing of the data to use hyperplanes with random slopes. This approach results in remedying the artifact seen in the anomaly score heat maps. We show that the robustness of the algorithm is much improved using this method by looking at the variance of scores of data points distributed along constant level sets. We report AUROC and AUPRC for our synthetic datasets, along with real-world benchmark datasets. We find no appreciable difference in the rate of convergence nor in computation time between the standard Isolation Forest and EIF.

Summary

The paper introduces hyperplane-based splits, replacing axis-aligned cuts to mitigate bias in anomaly scoring.
The method yields lower score variance and cleaner anomaly score maps, and it outperforms the standard Isolation Forest on benchmark datasets.
The enhanced detection performance is achieved without extra computational cost, bolstering its use in high-stakes applications like cybersecurity.

Extended Isolation Forest: Advancements in Anomaly Detection

The paper, "Extended Isolation Forest," by Sahand Hariri, Matias Carrasco Kind, and Robert J. Brunner, presents a significant enhancement to the Isolation Forest algorithm for anomaly detection. Isolation Forest is a widely recognized technique due to its ability to efficiently detect outliers without a predefined model of normal data. The authors introduce the Extended Isolation Forest (EIF) to address specific issues related to the anomaly scoring process within the original algorithm, particularly those arising from the limitations of axis-aligned splits inherent to the Isolation Forest methodology.

Key Contributions

The primary contribution of this work is the introduction of hyperplane-based splits, as opposed to axis-aligned cuts, during the construction of the ensemble of binary trees. This modification enhances the robustness of the algorithm by mitigating bias introduced by axis-parallel cuts, which can result in artifacts in anomaly score maps. These artifacts are undesirable as they can lead to inconsistent scoring of points that should share similar anomaly characteristics.
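For context on what an "anomaly score" means here, the following is a minimal sketch of the score as defined in the original Isolation Forest formulation, which this summary does not spell out: the mean path length of a point over all trees is normalized by the expected path length c(n) for a sample of size n, and shorter paths give scores closer to 1. The function names are illustrative and not taken from the authors' code.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant, used to approximate harmonic numbers


def average_path_length(n):
    """c(n): expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # H(n - 1), approximated
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """s = 2 ** (-E[h(x)] / c(n)); values near 1 suggest an anomaly, values near 0.5 or below suggest a normal point."""
    return 2.0 ** (-mean_path_length / average_path_length(n))
```

Because the score depends only on path lengths, any bias in how the branching cuts are placed carries straight through to the score map, which is where the artifacts described above come from.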

The authors propose two methods to enhance the Isolation Forest:

Rotated Trees: This method involves performing a random transformation on the feature space before constructing each tree, effectively averaging out the biases over multiple trees.
Extended Isolation Forest: Instead of cutting along a single coordinate axis, this method splits the data with hyperplanes of random slope. The authors prefer this approach because the branching cuts are no longer tied to the coordinate axes, which removes the artifacts seen in the anomaly score maps and leads to more robust isolation of anomalies.
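The difference between an axis-aligned cut and a random-slope cut can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the normal vector is drawn component-wise from a standard normal distribution, the intercept is drawn uniformly from the bounding box of the points in the node, and a point goes left when (x - p) · n <= 0. The function names `random_hyperplane` and `split` are hypothetical.

```python
import random


def random_hyperplane(points, rng, axis_aligned=False):
    """Draw a random cut for the points held by one tree node.

    Returns (normal, intercept). A point x is sent to the left child
    when dot(x - intercept, normal) <= 0.
    """
    dim = len(points[0])
    if axis_aligned:
        # Standard Isolation Forest: the normal is one coordinate axis,
        # so the cut is always parallel to the remaining axes.
        axis = rng.randrange(dim)
        normal = [1.0 if j == axis else 0.0 for j in range(dim)]
    else:
        # Extended variant (assumed sampling scheme): every component of
        # the normal is random, so the cut has a random slope.
        normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Intercept drawn uniformly inside the bounding box of the node's points.
    intercept = [
        rng.uniform(min(p[j] for p in points), max(p[j] for p in points))
        for j in range(dim)
    ]
    return normal, intercept


def split(points, normal, intercept):
    """Partition points into (left, right) according to the hyperplane."""
    left, right = [], []
    for p in points:
        side = sum((p[j] - intercept[j]) * normal[j] for j in range(len(p)))
        (left if side <= 0 else right).append(p)
    return left, right
```

A call such as `split(points, *random_hyperplane(points, random.Random(0)))` produces one branching step; applying it recursively until each point is isolated or a depth limit is reached gives one tree. Setting `axis_aligned=True` recovers the standard behavior, which makes side-by-side comparisons straightforward.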

Empirical Results

The authors perform extensive empirical analysis using synthetic 2-D datasets and higher-dimensional real-world benchmarks. They compare the performance of the standard Isolation Forest against their proposed EIF.

Anomaly Score Maps: In 2-D synthetic datasets, EIF demonstrates a significant reduction in the undesirable artifacts present in the standard approach, providing a more coherent representation of anomaly likelihood.
Variance and Convergence: The study of the variance in anomaly scores along concentric level sets shows that EIF produces more stable scores with lower variance, especially in regions where the standard Isolation Forest is prone to inaccuracies due to its axis-constrained splits.
Benchmark Analysis: Measured by AUC-ROC and AUC-PRC, EIF consistently outperforms the standard Isolation Forest, particularly on complex datasets where the structure in the data is not aligned with the coordinate axes.
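As a reminder of what the AUC-ROC metric measures in this setting, here is a minimal sketch that computes it directly from anomaly scores and ground-truth labels using the pairwise-ranking definition. It is a generic illustration, not the evaluation code used in the paper; the label convention (1 for anomaly, 0 for normal) is an assumption made for the example.

```python
def auroc(scores, labels):
    """Probability that a random anomaly receives a higher score than a
    random normal point, with ties counted as one half."""
    anomalies = [s for s, y in zip(scores, labels) if y == 1]
    normals = [s for s, y in zip(scores, labels) if y == 0]
    if not anomalies or not normals:
        raise ValueError("need at least one anomaly and one normal point")
    wins = 0.0
    for a in anomalies:
        for b in normals:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(anomalies) * len(normals))
```

A value of 1.0 means every anomaly is scored above every normal point, and 0.5 corresponds to a ranking no better than chance. The quadratic loop is fine for small examples; a rank-based formulation would be preferable for large benchmarks.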

Computational Implications

A notable finding is that EIF shows no appreciable difference from the standard Isolation Forest in either computation time or rate of convergence, where convergence refers to how quickly the scores stabilize as the number of trees grows. The improved anomaly detection therefore comes at no meaningful additional computational cost, making EIF an attractive option for practitioners.

Conclusion and Future Work

The Extended Isolation Forest offers a substantial improvement over the standard approach by eliminating bias introduced by axis-aligned cuts and yielding a more accurate and reliable anomaly scoring system. This enhancement significantly increases the robustness of outlier detection, which is crucial in domains requiring high-stakes decision-making based on anomaly detection algorithms, such as cybersecurity and fraud detection.

Future developments could explore further optimizations in hyperplane selection strategies or the integration of EIF within broader anomaly detection frameworks, potentially combining its strengths with other model-free approaches for even greater efficacy in diverse application areas.
