Verifying the Selected Completely at Random Assumption in Positive-Unlabeled Learning (2404.00145v1)
Abstract: The goal of positive-unlabeled (PU) learning is to train a binary classifier on the basis of training data containing positive and unlabeled instances, where unlabeled observations may belong either to the positive or to the negative class. Modeling PU data requires certain assumptions on the labeling mechanism, which describes which positive observations are assigned a label. The simplest assumption, considered in early works, is SCAR (Selected Completely At Random), under which the propensity score function, defined as the probability of assigning a label to a positive observation, is constant. A much more realistic assumption is SAR (Selected At Random), which states that the propensity function depends solely on the observed feature vector. SCAR-based algorithms are much simpler and computationally faster than SAR-based algorithms, which usually require a challenging estimation of the propensity score. In this work, we propose a relatively simple and computationally fast test that can be used to determine whether the observed data satisfy the SCAR assumption. The test is based on generating artificial labels that conform to the SCAR case, which in turn makes it possible to mimic the distribution of the test statistic under the null hypothesis of SCAR. We justify the method theoretically. In experiments, we demonstrate that the test successfully detects various deviations from the SCAR scenario while effectively controlling the type I error. The proposed test can be recommended as a pre-processing step for deciding which final PU algorithm to choose when the nature of the labeling mechanism is unknown.
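The abstract only outlines the idea of mimicking the null distribution by generating artificial SCAR labels. The sketch below is a minimal, hypothetical Python illustration of a resampling test of this kind; the choice of test statistic (cross-validated AUC of a labeled-vs-unlabeled classifier), the use of an externally estimated posterior P(y=1|x) to draw artificial labels, and all function names are assumptions made for illustration, not the authors' exact construction.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def labeled_vs_unlabeled_auc(X, s, seed=0):
    """Test statistic (assumed here): cross-validated AUC of a classifier
    separating labeled (s=1) from unlabeled (s=0) observations."""
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    return cross_val_score(clf, X, s, cv=5, scoring="roc_auc").mean()

def scar_test(X, s, posterior, n_boot=200, seed=0):
    """Hypothetical resampling test of the SCAR assumption.

    `posterior` is an estimate of P(y=1 | x) for each observation, assumed
    to be supplied by some external PU method.  Under H0 (SCAR) the labeled
    set is a completely random subsample of the positives, so the null
    distribution of the statistic is approximated by repeatedly drawing
    artificial labels with probability proportional to the estimated
    posterior, keeping the number of labeled examples fixed.
    """
    rng = np.random.default_rng(seed)
    n_labeled = int(s.sum())
    t_obs = labeled_vs_unlabeled_auc(X, s, seed)

    # Mimic the distribution of the statistic under SCAR.
    p = posterior / posterior.sum()
    t_null = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.choice(len(s), size=n_labeled, replace=False, p=p)
        s_art = np.zeros_like(s)
        s_art[idx] = 1
        t_null[b] = labeled_vs_unlabeled_auc(X, s_art, seed + b + 1)

    # One-sided Monte Carlo p-value.
    p_value = (1 + np.sum(t_null >= t_obs)) / (1 + n_boot)
    return t_obs, p_value
```

In this sketch, a small p-value means the observed labeling is more strongly tied to the features than SCAR-style resampling can reproduce, i.e., evidence against a constant propensity score; a large p-value is consistent with SCAR and suggests a simpler SCAR-based PU algorithm may suffice.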