Positive Unlabeled Learning Selected Not At Random (PULSNAR): class proportion estimation when the SCAR assumption does not hold (2303.08269v3)
Abstract: Positive and Unlabeled (PU) learning is a form of semi-supervised binary classification in which a machine learning algorithm differentiates between a set of positive instances (labeled) and a set of both positive and negative instances (unlabeled). PU learning has broad applications in settings where confirmed negatives are unavailable or difficult to obtain, and there is value in discovering positives among the unlabeled (e.g., viable drugs among untested compounds). Most PU learning algorithms make the \emph{selected completely at random} (SCAR) assumption, namely that positives are selected independently of their features. However, in many real-world applications, such as healthcare, positives are not selected completely at random (e.g., severe cases are more likely to be diagnosed), leading to a poor estimate of the proportion, $\alpha$, of positives among unlabeled examples, poor model calibration, and hence an uncertain decision threshold for selecting positives. Existing PU learning algorithms vary in scope: some estimate only $\alpha$, others calculate the probability that each individual unlabeled instance is positive, and some do both. We propose two PU learning algorithms that estimate $\alpha$, compute calibrated probabilities for PU instances, and improve classification metrics: i) PULSCAR (positive unlabeled learning selected completely at random), and ii) PULSNAR (positive unlabeled learning selected not at random). PULSNAR takes a divide-and-conquer approach: it clusters the SNAR positives into subtypes and estimates $\alpha$ for each subtype by applying PULSCAR to that cluster's positives together with all unlabeled examples. In our experiments, PULSNAR outperformed state-of-the-art approaches on both synthetic and real-world benchmark datasets.
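To make the divide-and-conquer idea concrete, below is a minimal Python sketch of the scheme the abstract describes, not the authors' implementation. All names (`pulscar_alpha`, `pulsnar_alpha`, `n_subtypes`) are hypothetical: the `pulscar_alpha` stand-in uses the simple SCAR-based correction of Elkan and Noto (2008) in place of PULSCAR's actual density-based estimator, and a fixed-`k` k-means step replaces the paper's automatic selection of the number of clusters.

```python
# Hedged sketch of the PULSNAR divide-and-conquer scheme described in the
# abstract; function names and parameters are illustrative, not the paper's API.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def pulscar_alpha(X_pos, X_unl):
    """Placeholder alpha estimator under SCAR (hypothetical stand-in).

    Uses the Elkan-Noto correction: train a positive-vs-unlabeled classifier,
    take c = mean score on the labeled positives (no held-out split here, for
    brevity), and estimate alpha ~= mean(score on unlabeled) / c.
    PULSCAR itself estimates alpha differently; this only mirrors its interface.
    """
    X = np.vstack([X_pos, X_unl])
    y = np.r_[np.ones(len(X_pos)), np.zeros(len(X_unl))]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    c = clf.predict_proba(X_pos)[:, 1].mean()           # ~= P(labeled | positive)
    alpha = clf.predict_proba(X_unl)[:, 1].mean() / c   # proportion of positives
    return float(np.clip(alpha, 0.0, 1.0))

def pulsnar_alpha(X_pos, X_unl, n_subtypes=2):
    """Cluster the SNAR positives into subtypes, estimate alpha for each
    subtype against ALL unlabeled examples, and sum the per-subtype estimates.
    (The paper chooses the number of clusters automatically; a fixed
    n_subtypes is assumed here.)"""
    labels = KMeans(n_clusters=n_subtypes, n_init=10,
                    random_state=0).fit_predict(X_pos)
    alphas = [pulscar_alpha(X_pos[labels == k], X_unl) for k in range(n_subtypes)]
    return float(np.clip(sum(alphas), 0.0, 1.0))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic demo: two positive subtypes plus a negative background;
    # the unlabeled set mixes 300 positives with 1400 negatives.
    pos_a = rng.normal(2.0, 1.0, size=(300, 5))
    pos_b = rng.normal(-2.0, 1.0, size=(300, 5))
    unl_pos = np.vstack([rng.normal(2.0, 1.0, size=(150, 5)),
                         rng.normal(-2.0, 1.0, size=(150, 5))])
    unl_neg = rng.normal(0.0, 1.0, size=(1400, 5))
    X_pos = np.vstack([pos_a, pos_b])
    X_unl = np.vstack([unl_pos, unl_neg])
    print(f"estimated alpha: {pulsnar_alpha(X_pos, X_unl):.3f}")
    print(f"true alpha:      {300 / len(X_unl):.3f}")
```

Summing per-cluster estimates follows the abstract's description; the per-subtype $\alpha$ estimator, the cluster-count selection, and the probability calibration are the places where the actual PULSNAR algorithm departs from this toy sketch.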
Authors: Praveen Kumar, Christophe G. Lambert