Improve Cost Efficiency of Active Learning over Noisy Dataset (2403.01346v1)
Abstract: Active learning is a strategy whereby a machine learning algorithm actively selects which data points to label in order to optimize its learning. This strategy is particularly effective in domains where unlabeled data is abundant but labeling is prohibitively expensive. In this paper, we consider binary classification problems in which acquiring a positive instance incurs a significantly higher cost than acquiring a negative one. For example, in the money-lending business, a defaulted loan constitutes a positive event that leads to substantial financial loss. To address this issue, we propose a shifted normal distribution sampling function that samples from a wider range than typical uncertainty sampling. Our simulations show that the proposed sampling function limits the selection of both noisy and positive labels, delivering a 20% to 32% improvement in cost efficiency across different test datasets.
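The abstract does not give the exact form of the shifted sampling function, so the sketch below is only a plausible reading of the idea: instead of classic uncertainty sampling, which queries the points whose predicted positive-class probability is closest to 0.5, each unlabeled point is weighted by a normal density whose center is shifted below 0.5 (away from likely positives) and whose width covers a broader probability range. The function name `shifted_normal_sampling` and the hyperparameter values `mu` and `sigma` are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.stats import norm

def shifted_normal_sampling(pred_pos_proba, n_queries, mu=0.35, sigma=0.15,
                            rng=None):
    """Sample indices of unlabeled points to query.

    Each point is weighted by a normal density, over its predicted
    positive-class probability, centered at `mu` with width `sigma`.
    Assumptions for illustration: mu < 0.5 biases queries away from
    likely positive (costly) instances, and a larger sigma samples from
    a wider probability range than uncertainty sampling centered at 0.5.
    """
    rng = np.random.default_rng(rng)
    # Unnormalized sampling weight for every unlabeled point.
    weights = norm.pdf(pred_pos_proba, loc=mu, scale=sigma)
    weights /= weights.sum()
    # Stochastic selection, so queries cover a range of probabilities
    # rather than only the top of the acquisition score.
    return rng.choice(len(pred_pos_proba), size=n_queries,
                      replace=False, p=weights)

# Usage: query 10 points from simulated model scores on 1000 points.
scores = np.random.default_rng(0).uniform(size=1000)
query_idx = shifted_normal_sampling(scores, n_queries=10, rng=0)
```

Drawing queries stochastically from the weighted distribution, rather than taking the top-scored points deterministically, is one way to realize "sampling from a wider range"; the paper may implement this differently.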