
A Robust Autoencoder Ensemble-Based Approach for Anomaly Detection in Text (2405.13031v2)

Published 16 May 2024 in cs.CL and cs.LG

Abstract: Anomaly detection (AD) is a fast-growing and popular domain, with a rich literature for established applications such as vision and time series; anomaly detection in text, however, is only starting to blossom. Recently, self-supervised methods with a self-attention mechanism have been the most popular choice. While recent works have proposed a working ground for building and benchmarking state-of-the-art approaches, we propose two principal contributions in this paper: contextual anomaly contamination and a novel ensemble-based approach. Our method, Textual Anomaly Contamination (TAC), allows us to contaminate inlier classes with either independent or contextual anomalies; this distinction does not appear to be made in the literature. For finding contextual anomalies, we propose RoSAE, a Robust Subspace Local Recovery Autoencoder Ensemble, in which each autoencoder presents a different latent representation obtained through local manifold learning. Benchmarks show that our approach outperforms recent works on both independent and contextual anomalies while being more robust. We also provide a comparison over eight datasets instead of relying only on the Reuters and 20 Newsgroups corpora.
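The abstract's core ensemble idea — members with different latent representations, anomalies scored by aggregating per-member reconstruction error — can be sketched independently of the paper. The snippet below is a minimal, hypothetical illustration only: it substitutes PCA members of varying latent dimension for RoSAE's robust local-recovery autoencoders, and the toy embeddings and the `ensemble_scores` helper are assumptions for the example, not the paper's implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy "document embeddings": inliers lie near a 2-D subspace of a 16-D space;
# a few outliers are scattered through the full space.
inliers = rng.normal(0.0, 1.0, (200, 2)) @ rng.normal(0.0, 1.0, (2, 16))
outliers = rng.normal(0.0, 3.0, (5, 16))
X = np.vstack([inliers, outliers])

def ensemble_scores(X, latent_dims=(1, 2, 3, 4)):
    """Anomaly score: median reconstruction error over ensemble members,
    where each member reconstructs from a different latent dimension."""
    errs = []
    for d in latent_dims:
        pca = PCA(n_components=d).fit(X)
        recon = pca.inverse_transform(pca.transform(X))
        errs.append(np.linalg.norm(X - recon, axis=1))
    return np.median(np.stack(errs), axis=0)

scores = ensemble_scores(X)
```

Points far from every member's recovered subspace receive high scores regardless of which latent size a member uses, which is the intuition behind aggregating an ensemble rather than trusting a single reconstruction model.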

