TCE: A Test-Based Approach to Measuring Calibration Error (2306.14343v1)
Abstract: This paper proposes a new metric to measure the calibration error of probabilistic binary classifiers, called test-based calibration error (TCE). TCE incorporates a novel loss function based on a statistical test to examine the extent to which model predictions differ from probabilities estimated from data. It offers (i) a clear interpretation, (ii) a consistent scale that is unaffected by class imbalance, and (iii) an enhanced visual representation with respect to the standard reliability diagram. In addition, we introduce an optimality criterion for the binning procedure of calibration error metrics based on a minimal estimation error of the empirical probabilities. We provide a novel computational algorithm for optimal bins under bin-size constraints. We demonstrate properties of TCE through a range of experiments, including multiple real-world imbalanced datasets and ImageNet 1000.