Validation of ML-UQ calibration statistics using simulated reference values: a sensitivity analysis (2403.00423v2)
Abstract: Some popular Machine Learning Uncertainty Quantification (ML-UQ) calibration statistics do not have predefined reference values and are mostly used in comparative studies. As a consequence, calibration is almost never validated and the diagnostic is left to the appreciation of the reader. Simulated reference values, based on synthetic calibrated datasets derived from actual uncertainties, have been proposed to palliate this problem. As the generative probability distribution used to simulate synthetic errors is often not constrained, the sensitivity of simulated reference values to the choice of generative distribution might be problematic, casting doubt on the calibration diagnostic. This study explores various facets of this problem and shows that some statistics are too sensitive to the choice of generative distribution to be used for validation when the generative distribution is unknown. This is the case, for instance, of the correlation coefficient between absolute errors and uncertainties (CC) and of the expected normalized calibration error (ENCE). A robust validation workflow to deal with simulated reference values is proposed.
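To illustrate the idea of simulated reference values, the sketch below shows one possible implementation under the assumption of a normal generative distribution: synthetic calibrated errors are drawn from the actual uncertainties, the statistic (here ENCE or CC) is recomputed on each synthetic dataset, and the resulting reference distribution is compared to the observed value. The function names, bin count, and Monte Carlo settings are illustrative choices, not specifications taken from the paper.

```python
import numpy as np

def ence(errors, uncertainties, n_bins=20):
    """Expected Normalized Calibration Error: bin by uncertainty and
    compare the RMSE to the RMV (root mean variance) in each bin."""
    order = np.argsort(uncertainties)
    bins = np.array_split(order, n_bins)              # equal-count bins
    terms = []
    for idx in bins:
        rmv = np.sqrt(np.mean(uncertainties[idx] ** 2))
        rmse = np.sqrt(np.mean(errors[idx] ** 2))
        terms.append(abs(rmv - rmse) / rmv)
    return np.mean(terms)

def cc(errors, uncertainties):
    """Pearson correlation between absolute errors and uncertainties."""
    return np.corrcoef(np.abs(errors), uncertainties)[0, 1]

def simulated_reference(uncertainties, stat, n_mc=1000, rng=None,
                        generator=None):
    """Monte Carlo reference distribution of a calibration statistic,
    obtained from synthetic calibrated errors generated from the actual
    uncertainties (normal generative distribution assumed by default)."""
    rng = np.random.default_rng() if rng is None else rng
    if generator is None:
        generator = lambda u: rng.normal(0.0, u)      # e_i ~ N(0, u_i^2)
    return np.array([stat(generator(uncertainties), uncertainties)
                     for _ in range(n_mc)])

# Usage: compare the observed statistic to its simulated reference interval.
rng = np.random.default_rng(0)
u = rng.uniform(0.1, 1.0, size=500)    # hypothetical uncertainties
e = rng.normal(0.0, u)                 # hypothetical (calibrated) errors
ref = simulated_reference(u, ence, rng=rng)
lo, hi = np.percentile(ref, [2.5, 97.5])
print(f"ENCE = {ence(e, u):.3f}, 95% reference interval = [{lo:.3f}, {hi:.3f}]")
```

Swapping the `generator` argument for a non-normal distribution is one way to probe the sensitivity to the generative distribution discussed in the abstract.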