OTLP: Output Thresholding Using Mixed Integer Linear Programming
Abstract: Output thresholding is the technique of searching for the best decision threshold to use during inference, for any classifier that can produce probability estimates on training and testing datasets. It is particularly useful in highly imbalanced classification problems, where the default threshold does not account for the imbalance in class distributions and fails to give the best performance. This paper proposes OTLP, a thresholding framework based on mixed integer linear programming that is model agnostic and supports different objective functions and different sets of constraints for a diverse set of problems, including both balanced and imbalanced classification. It is particularly useful in real-world applications, where theoretical thresholding techniques cannot address the product-related requirements and complexity of applications that use machine learning models. We evaluate the usefulness of the framework on the Credit Card Fraud Detection dataset.
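To make the idea of output thresholding concrete, the following is a minimal sketch of a brute-force threshold search on held-out probability estimates. It is not the OTLP formulation: the abstract does not give the mixed integer linear program, so this example swaps in plain enumeration of candidate thresholds. The function names (`scores_at_threshold`, `search_threshold`), the `min_precision` constraint, and the toy data are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple


def scores_at_threshold(
    probs: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float, float]:
    """Return (f1, precision, recall) when predicting positive for prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return f1, precision, recall


def search_threshold(
    probs: Sequence[float], labels: Sequence[int], min_precision: float = 0.0
) -> Optional[Tuple[float, float]]:
    """Hypothetical helper: pick the threshold with the highest F1 score.

    Candidates are the distinct predicted probabilities. Candidates whose
    precision falls below `min_precision` are skipped, standing in for a
    product-level constraint. Returns (threshold, f1), or None if no
    candidate satisfies the constraint.
    """
    best: Optional[Tuple[float, float]] = None
    for t in sorted(set(probs)):
        f1, precision, _ = scores_at_threshold(probs, labels, t)
        if precision < min_precision:
            continue
        if best is None or f1 > best[1]:
            best = (t, f1)
    return best


# Toy, made-up data: 3 positives out of 8 examples (not from the paper).
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.90]
labels = [0, 0, 0, 1, 0, 1, 0, 1]

default_f1, _, _ = scores_at_threshold(probs, labels, 0.5)
unconstrained = search_threshold(probs, labels)
constrained = search_threshold(probs, labels, min_precision=0.65)
print(default_f1, unconstrained, constrained)
```

On this toy data the searched threshold gives a higher F1 score than the default of 0.5, and adding the precision constraint moves the selected threshold upward. Enumeration only works here because there is a single threshold and a handful of candidates; a mathematical-programming formulation is one way to handle richer objectives and constraints.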