Abstract: Applications such as weather forecasting and personalized medicine demand models that output calibrated probability estimates---those representative of the true likelihood of a prediction. Most models are not calibrated out of the box but are recalibrated by post-processing model outputs. We find in this work that popular recalibration methods like Platt scaling and temperature scaling are (i) less calibrated than reported, and (ii) current techniques cannot estimate how miscalibrated they are. An alternative method, histogram binning, has measurable calibration error but is sample inefficient---it requires $O(B/\epsilon^2)$ samples, compared to $O(1/\epsilon^2)$ for scaling methods, where $B$ is the number of distinct probabilities the model can output. To get the best of both worlds, we introduce the scaling-binning calibrator, which first fits a parametric function to reduce variance and then bins the function values to actually ensure calibration. This requires only $O(1/\epsilon^2 + B)$ samples. Next, we show that we can estimate a model's calibration error more accurately using an estimator from the meteorological community---or equivalently measure its calibration error with fewer samples ($O(\sqrt{B})$ instead of $O(B)$). We validate our approach with multiclass calibration experiments on CIFAR-10 and ImageNet, where we obtain a 35% lower calibration error than histogram binning and, unlike scaling methods, guarantees on true calibration. In these experiments, we also estimate the calibration error and ECE more accurately than the commonly used plugin estimators. We implement all these methods in a Python library: https://pypi.org/project/uncertainty-calibration
The paper’s main contribution is the scaling-binning calibrator, which combines parametric scaling with histogram binning to achieve a measurable calibration error while retaining the sample efficiency of scaling methods.
The authors show that popular scaling methods are less calibrated than reported and that their calibration error cannot be reliably estimated; on CIFAR-10 and ImageNet, the proposed calibrator achieves roughly 35% lower calibration error than histogram binning.
The work also adopts an estimator that measures calibration error with O(√B) samples instead of O(B), strengthening both the theoretical guarantees and the practical verification of calibration.
Verified Uncertainty Calibration
In machine learning, models that produce well-calibrated probability estimates are crucial for applications ranging from weather prediction to healthcare: the predicted probabilities should genuinely reflect the true likelihood of the events they describe. Traditional machine learning models, particularly neural networks, often do not produce probabilities that align with true event frequencies, necessitating recalibration. This paper, "Verified Uncertainty Calibration," examines both the challenges of uncertainty calibration and recent advances in addressing them.
Historically, techniques such as Platt scaling, temperature scaling, and isotonic regression have been employed to recalibrate model outputs. While these scaling methods are computationally and sample efficient, the paper's investigation reveals that they often leave calibration errors higher than previously reported. Worse, the binned estimators commonly used to measure calibration error fall short here: when a model outputs continuous probabilities, an estimate computed over a finite number of bins can only lower-bound the true calibration error, so the miscalibration of scaling methods cannot be reliably verified.
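To make the measurement problem concrete, the following is a minimal sketch (not the authors' code) of the standard plugin, equal-width-bin estimator of expected calibration error; the function name and default bin count are illustrative choices. With continuous model outputs, an estimate of this kind can miss miscalibration within each bin, which is why it tends to understate the true error.

```python
import numpy as np

def plugin_ece(confidences, correct, num_bins=15):
    """Equal-width-bin plugin ECE; confidences in [0, 1], correct in {0, 1}."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to one of num_bins equal-width bins over [0, 1].
    bin_ids = np.minimum((confidences * num_bins).astype(int), num_bins - 1)
    ece = 0.0
    for b in range(num_bins):
        mask = bin_ids == b
        if mask.any():
            # Weight each bin's |avg confidence - accuracy| gap by its share of samples.
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece
```

Increasing the number of bins reduces this underestimation but increases variance, which is exactly the tension the paper's measurement approach addresses.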
An alternative strategy, histogram binning, has been recognized for its ability to yield measurable calibration error. However, it is sample inefficient, necessitating O(B/ϵ²) samples for calibration, with B being the number of unique probabilities a model can output. In contrast, the authors present the scaling-binning calibrator that aims to combine the benefits of both approaches; it promises better sample efficiency along with measurable calibration error. This method first adjusts model outputs using parametric scaling to minimize variance, subsequently binning these outputs to guarantee effective calibration. Impressively, the sample requirement is O(1/ϵ² + B), a substantial improvement over histogram binning.
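As a rough illustration of this recipe (fit a parametric scaling function, choose bins, then replace each scaled value with its bin's mean), here is a minimal sketch for a binary task. It is not the paper's reference implementation; the use of scikit-learn's LogisticRegression as the scaling function, the three-way data split, and the empty-bin fallback are simplifying assumptions made for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_scaling_binning(scores, labels, num_bins=10, seed=0):
    """Sketch of a scaling-binning calibrator for binary labels."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    idx = np.random.default_rng(seed).permutation(len(scores))
    s1, s2, s3 = np.array_split(idx, 3)  # three splits of the calibration set

    # Step 1: fit a parametric (Platt-style) scaling function on the first split.
    scaler = LogisticRegression().fit(scores[s1].reshape(-1, 1), labels[s1])
    g = lambda s: scaler.predict_proba(np.asarray(s, dtype=float).reshape(-1, 1))[:, 1]

    # Step 2: pick equal-mass bin edges from the scaled values on the second split.
    edges = np.quantile(g(scores[s2]), np.linspace(0, 1, num_bins + 1)[1:-1])

    # Step 3: on the third split, record the mean scaled value in each bin;
    # at prediction time every scaled output is snapped to its bin's mean.
    vals = g(scores[s3])
    which = np.digitize(vals, edges)
    bin_means = np.array([vals[which == b].mean() if np.any(which == b) else 0.5
                          for b in range(num_bins)])  # 0.5 fallback for empty bins

    return lambda s: bin_means[np.digitize(g(s), edges)]
```

Binning the scaled function values, rather than raw per-bin accuracies, is what keeps the variance low while making the output space discrete, so the calibration error of the result can actually be estimated.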
The paper further adopts an estimator borrowed from the meteorology literature for more accurate calibration error estimation. Traditional plugin estimators require a number of samples that scales linearly with B; the proposed estimator reduces this to O(√B), achieving better accuracy with fewer samples, as confirmed through experiments on CIFAR-10 and ImageNet. The scaling-binning calibrator achieved up to 35% lower calibration error than histogram binning while offering guarantees on the true calibration error, with its advantage most pronounced when calibrating models with many classes or outputs.
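For intuition, here is a minimal sketch (not the paper's library API) of a debiased estimator of the squared ℓ2 calibration error for a binary task whose model outputs finitely many probabilities, alongside the plugin version it corrects: the debiasing term subtracts the within-group sampling variance that inflates the plugin estimate.

```python
import numpy as np

def squared_calibration_error(probs, labels, debias=True):
    """Plugin (debias=False) or debiased (debias=True) squared l2 calibration error."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    total, n = 0.0, len(probs)
    for p in np.unique(probs):           # one group per distinct predicted probability
        mask = probs == p
        n_b = int(mask.sum())
        acc = labels[mask].mean()        # empirical frequency of the positive label
        term = (p - acc) ** 2
        if debias and n_b > 1:
            # Subtract the estimated sampling variance of acc, which otherwise
            # biases the plugin term upward in small groups.
            term -= acc * (1.0 - acc) / (n_b - 1)
        total += (n_b / n) * term
    return max(total, 0.0)               # clip at zero; the true value is nonnegative
```

Because the correction cancels the dominant noise term, the debiased estimate concentrates around the true squared error with far fewer samples per distinct probability than the plugin estimate requires.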
The implications of these findings are substantial for practical model deployment, especially in fields where confidence in decisions is paramount. The theoretical advances, paired with empirical validation, extend the work's relevance across a range of prediction-driven domains. Looking forward, this work lays the groundwork for further exploration of maintaining calibration under domain shift, improving calibration in multiclass settings, and refining binning strategies to limit the increase in mean-squared error that binning introduces.
The paper concludes with a call for further research into efficient recalibration under data shifts and improving estimation methods for various calibration metrics. All proposed methods have been encapsulated into a Python library for ease of adoption within the research community. By providing more efficient methods for achieving and verifying calibration, this paper contributes a meaningful advancement in the utilization of calibrated models for reliable decision-making across myriad applications.