Learning Representations of Ultrahigh-dimensional Data for Random Distance-based Outlier Detection

Published 13 Jun 2018 in cs.LG, cs.AI, cs.DB, and stat.ML | (1806.04808v1)

Abstract: Learning expressive low-dimensional representations of ultrahigh-dimensional data, e.g., data with thousands/millions of features, has been a major way to enable learning methods to address the curse of dimensionality. However, existing unsupervised representation learning methods mainly focus on preserving the data regularity information and learning the representations independently of subsequent outlier detection methods, which can result in suboptimal and unstable performance of detecting irregularities (i.e., outliers). This paper introduces a ranking model-based framework, called RAMODO, to address this issue. RAMODO unifies representation learning and outlier detection to learn low-dimensional representations that are tailored for a state-of-the-art outlier detection approach - the random distance-based approach. This customized learning yields more optimal and stable representations for the targeted outlier detectors. Additionally, RAMODO can leverage little labeled data as prior knowledge to learn more expressive and application-relevant representations. We instantiate RAMODO to an efficient method called REPEN to demonstrate the performance of RAMODO. Extensive empirical results on eight real-world ultrahigh dimensional data sets show that REPEN (i) enables a random distance-based detector to obtain significantly better AUC performance and two orders of magnitude speedup; (ii) performs substantially better and more stably than four state-of-the-art representation learning methods; and (iii) leverages less than 1% labeled data to achieve up to 32% AUC improvement.

Summary

The paper presents RAMODO, a ranking model-based framework for outlier detection in ultrahigh-dimensional data. RAMODO unifies representation learning with outlier detection, which improves both detection accuracy and computational efficiency.

Summary of the Research

The study is motivated by the limitations of existing unsupervised representation learning methods, which focus primarily on preserving data regularity information and learn representations independently of the downstream outlier detector. This can lead to suboptimal and unstable detection of outliers in ultrahigh-dimensional data. RAMODO instead takes a unified approach in which representation learning is tailored specifically to random distance-based outlier detection.
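
To make the idea of a ranking objective tied to a distance-based detector more concrete, here is a minimal sketch in plain Python. It is an illustration only: the hinge form, the margin value, and the names nn_distance, hinge_ranking_loss, and query_set are assumptions introduced here, not details taken from this summary. The loss is zero once an outlier candidate lies at least a margin farther from a random query set than its paired inlier candidate.

```python
import math


def nn_distance(point, query_set):
    """Euclidean distance from `point` to its nearest neighbor in `query_set`."""
    return min(math.dist(point, q) for q in query_set)


def hinge_ranking_loss(inliers, outliers, query_set, margin=1.0):
    """Average hinge loss over paired (inlier, outlier) candidates.

    A pair contributes zero once the outlier candidate is at least `margin`
    farther from the query set than the inlier candidate (assumed loss form).
    """
    losses = [
        max(0.0, margin + nn_distance(x_in, query_set) - nn_distance(x_out, query_set))
        for x_in, x_out in zip(inliers, outliers)
    ]
    return sum(losses) / len(losses)


# Hypothetical 2-D "representations" used only to exercise the loss.
query = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5)]
well_separated = hinge_ranking_loss([(0.1, 0.0)], [(5.0, 0.0)], query)
poorly_separated = hinge_ranking_loss([(0.1, 0.0)], [(0.3, 0.0)], query)
```

In a representation learner, a loss of this kind would be minimized with respect to the parameters of the mapping that produces the low-dimensional points, so that the mapping is shaped by the detector's own scoring rule rather than by a generic reconstruction objective.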

Key Contributions

Customized Representation Learning Framework: The RAMODO framework integrates representation learning with outlier detection by tailoring the representations for a random distance-based outlier detection method. This integration ensures the learned representations are more effective in preserving the critical information required for detecting outliers.
Introducing REPEN: The paper introduces REPEN, a practical instantiation of RAMODO, aimed at learning representations for the state-of-the-art random nearest neighbor distance-based detector, Sp. REPEN effectively maps data into a lower-dimensional space while retaining the information necessary for accurate outlier detection.
Empirical Evaluation: Extensive experiments on eight real-world ultrahigh-dimensional datasets demonstrate the framework's effectiveness. With REPEN, a random distance-based detector achieves significantly better AUC than when it works directly on the original features, and it runs two orders of magnitude faster.
Utilizing Labeled Data: REPEN also capitalizes on a small amount (<1%) of labeled data to improve representation quality, achieving up to a 32% improvement in AUC performance. This showcases RAMODO's flexibility in integrating application-specific knowledge to enhance performance.
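
The contributions above refer to a random nearest neighbor distance-based detector and to AUC as the evaluation measure. The following plain-Python sketch shows the general shape of such a detector: each point is scored by its distance to the nearest member of a small random subsample, and AUC is computed from pairwise comparisons of outlier and inlier scores. The function names, the default subsample size of 20, the exclusion of a point from its own subsample, and the toy data are all assumptions for illustration; this is not the authors' implementation, and in the setting described above the detector would run on learned low-dimensional representations rather than raw toy points.

```python
import math
import random


def random_nn_scores(data, sample_size=20, seed=0):
    """Score each point by its distance to the nearest point in a random subsample.

    A point that happens to be in the subsample is not compared with itself
    (a choice made in this sketch so that sampled points do not score zero).
    """
    rng = random.Random(seed)
    sample_idx = rng.sample(range(len(data)), min(sample_size, len(data)))
    return [
        min(math.dist(x, data[j]) for j in sample_idx if j != i)
        for i, x in enumerate(data)
    ]


def auc(scores, labels):
    """Probability that a random outlier (label 1) outscores a random inlier
    (label 0), counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 200 inliers around the origin and 5 isolated, far-away outliers.
rng = random.Random(1)
inliers = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outliers = [(30.0, 30.0), (-30.0, 30.0), (30.0, -30.0), (-30.0, -30.0), (0.0, 40.0)]
data = inliers + outliers
labels = [0] * len(inliers) + [1] * len(outliers)

scores = random_nn_scores(data)
toy_auc = auc(scores, labels)
```

Because each point is compared only with the subsample rather than with the full dataset, scoring costs on the order of n times the subsample size distance computations, and each distance computation is cheaper when the points live in a low-dimensional learned space.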

Implications and Future Directions

The integration of representation learning with outlier detection offers a pathway to more stable and efficient anomaly detection in ultrahigh-dimensional datasets. Practically, this can enhance applications in fields like fraud detection, medical diagnosis, and cybersecurity, where identifying outliers from large volumes of data is critical.

On the theoretical side, the paper's analysis suggests an upper error bound for the representation learning, which indicates that the approach remains robust to inaccuracies introduced by dimensionality reduction. Future research could extend RAMODO by learning representations optimized for other outlier detection strategies, or by exploring deeper architectures to capture complex data distributions.

Overall, this study provides a systematic approach to effectively handle the curse of dimensionality in outlier detection tasks, offering a promising direction for researchers and practitioners dealing with high-dimensional datasets.