Maximizing Information Gain in Privacy-Aware Active Learning of Email Anomalies (2405.07440v2)
Abstract: Redacted emails satisfy most privacy requirements, but they make it more difficult to detect anomalous emails that may be indicative of data exfiltration. In this paper we develop an enhanced method of Active Learning using an information-gain-maximizing heuristic, and we evaluate its effectiveness in a real-world setting where only redacted versions of emails could be labeled by human analysts due to privacy concerns. In the first case study we examined how Active Learning should be carried out. We found that model performance was best when a single analyst who was highly skilled at the labeling task provided the labels. In the second case study we used confidence ratings to estimate the labeling uncertainty of analysts and then prioritized instances for labeling based on the expected information gain (the difference between model uncertainty and analyst uncertainty) that labeling each instance would provide. We found that the information-gain-maximizing heuristic improved model performance over existing sampling methods for Active Learning. Based on these results, we recommend that analysts be screened, and possibly trained, prior to implementation of Active Learning in cybersecurity applications. We also recommend that the information-gain-maximizing sampling method (based on expert confidence) be used in the early stages of Active Learning, provided that well-calibrated confidence ratings can be obtained. Finally, we note that the expertise of analysts should be assessed prior to Active Learning, as we found that analysts with lower labeling skill had poorly calibrated (over-)confidence in their labels.
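The prioritization rule described in the abstract (expected information gain as the difference between model uncertainty and analyst uncertainty) can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so several choices here are assumptions: both uncertainties are measured as binary Shannon entropy, the analyst's confidence is expressed as an estimated probability of labeling an instance correctly, and the names `binary_entropy`, `expected_information_gain`, and `rank_for_labeling` are hypothetical. How a per-instance confidence estimate is obtained before labeling is outside the scope of this sketch.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of a two-outcome event with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def expected_information_gain(model_prob, analyst_conf):
    """Model uncertainty minus analyst uncertainty (assumed entropy-based).

    model_prob:   model's predicted probability that the email is anomalous.
    analyst_conf: assumed estimate of the probability that the analyst
                  labels this email correctly.
    """
    return binary_entropy(model_prob) - binary_entropy(analyst_conf)


def rank_for_labeling(candidates, k):
    """Return the ids of the k candidates with the highest expected gain.

    candidates: iterable of (instance_id, model_prob, analyst_conf) tuples.
    """
    scored = sorted(
        candidates,
        key=lambda c: expected_information_gain(c[1], c[2]),
        reverse=True,
    )
    return [instance_id for instance_id, _, _ in scored[:k]]


# Illustrative values only, not data from the paper.
pool = [
    ("a", 0.50, 0.95),  # model unsure, analyst confident: high gain
    ("b", 0.50, 0.55),  # model unsure, analyst also unsure: little gain
    ("c", 0.90, 0.95),  # model already fairly sure: modest gain
]
print(rank_for_labeling(pool, 2))
```

Under these assumptions, an instance is worth sending to an analyst when the model is uncertain about it and the analyst is expected to label it confidently. Plain uncertainty sampling would treat instances "a" and "b" as equally valuable, whereas this heuristic ranks "b" last.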