Oddballness: universal anomaly detection with language models
Abstract: We present a new method for detecting anomalies in text (and, more generally, in sequences of any data) using LLMs in a fully unsupervised manner. The method uses the probabilities (likelihoods) generated by an LLM, but instead of focusing on low-likelihood tokens, it relies on a new metric introduced in this paper: oddballness. Oddballness measures how "strange" a given token is according to the LLM. On grammatical error detection tasks (a specific case of text anomaly detection), we demonstrate that oddballness outperforms simply considering low-likelihood events when a fully unsupervised setup is assumed.
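To illustrate the contrast the abstract draws, here is a minimal Python sketch. The abstract does not state the oddballness formula, so the definition used here is an assumption made for illustration: the sum, over all candidate tokens, of how much their probability exceeds that of the observed token. The function names, the threshold value, and the toy distributions are likewise hypothetical.

```python
def is_low_likelihood(dist, i, threshold=0.01):
    """Baseline: flag token i when its probability falls below a fixed threshold."""
    return dist[i] < threshold


def oddballness(dist, i):
    """Assumed oddballness-style score (not taken from the abstract):
    total probability mass by which other tokens exceed token i."""
    p_i = dist[i]
    return sum(max(0.0, p_j - p_i) for p_j in dist)


# Toy case 1: a flat distribution over 1000 tokens. Every token is unlikely,
# yet none is stranger than any other.
flat = [1 / 1000] * 1000

# Toy case 2: a peaked distribution where the observed token (index 1)
# competes with a strongly preferred alternative (index 0).
peaked = [0.9, 0.05, 0.05]

print(is_low_likelihood(flat, 0), oddballness(flat, 0))
print(is_low_likelihood(peaked, 1), oddballness(peaked, 1))
```

Under this assumed definition, the flat case is flagged by the likelihood threshold but scores zero, while the peaked case passes the threshold but scores high. This is the kind of disagreement between the two criteria that the abstract refers to.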