
Measuring Calibration in Deep Learning (1904.01685v2)

Published 2 Apr 2019 in cs.LG and stat.ML

Abstract: Overconfidence and underconfidence in machine learning classifiers are measured by calibration: the degree to which the probabilities predicted for each class match the accuracy of the classifier on that prediction. How one measures calibration remains a challenge: expected calibration error, the most popular metric, has numerous flaws which we outline, and there is no clear empirical understanding of how its choices affect conclusions in practice, and what recommendations there are to counteract its flaws. In this paper, we perform a comprehensive empirical study of choices in calibration measures including measuring all probabilities rather than just the maximum prediction, thresholding probability values, class conditionality, number of bins, bins that are adaptive to the datapoint density, and the norm used to compare accuracies to confidences. To analyze the sensitivity of calibration measures, we study the impact of optimizing directly for each variant with recalibration techniques. Across MNIST, Fashion MNIST, CIFAR-10/100, and ImageNet, we find that conclusions on the rank ordering of recalibration methods are drastically impacted by the choice of calibration measure. We find that conditioning on the class leads to more effective calibration evaluations, and that using the L2 norm rather than the L1 norm improves both optimization for calibration metrics and the rank correlation measuring metric consistency. Adaptive binning schemes lead to more stability of metric rank ordering when the number of bins varies, and are also recommended. We open-source a library for the use of our calibration measures.

Citations (430)

Summary

  • The paper critiques Expected Calibration Error (ECE) by revealing its inability to capture full multiclass uncertainty.
  • It introduces alternative metrics such as Static Calibration Error and Adaptive Calibration Error that evaluate predictions across all classes using adaptive binning.
  • Empirical analyses on MNIST, CIFAR-10/100, and ImageNet show that the choice of calibration metric substantially changes which recalibration methods appear best, a concern for safety-critical applications.

Analysis of Calibration Metrics in Deep Learning

The paper under analysis offers a critical examination of calibration metrics used to assess the reliability of deep learning classifiers. Calibration, in the context of machine learning, reflects the alignment between a model's predicted probabilities and its true accuracy. The paper highlights the deficiencies of the widely used Expected Calibration Error (ECE) and explores alternative metrics and methods for a more effective evaluation of model calibration.
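
One common way to state this formally (the notation here follows the standard calibration literature and is not quoted from the paper): a classifier with prediction \hat{y}(x) and associated confidence \hat{p}(x) is perfectly calibrated when

```latex
% Among all inputs assigned confidence p, the classifier is correct exactly a
% fraction p of the time, for every confidence level p.
\mathbb{P}\big(\hat{y}(X) = Y \,\big|\, \hat{p}(X) = p\big) = p, \qquad \forall\, p \in [0, 1]
```

Calibration metrics such as ECE estimate the deviation from this ideal by binning predictions and comparing per-bin accuracy to per-bin confidence.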

Critique of Expected Calibration Error (ECE)

The authors identify several limitations of ECE in capturing true calibration error:

  1. Focus on the Maximum Prediction: ECE traditionally scores only the maximum predicted probability for each data point, potentially obscuring significant error in the other class probabilities. This can lead to incomplete assessments, especially in critical applications where secondary class predictions are non-trivial.
  2. Fixed Binning Schemes: ECE's evenly spaced bins do not account for the density of predicted probabilities, so skewed confidence distributions can lead to over- or underestimation of calibration error.
  3. Norm Usage: The choice of norm (L1 vs. L2) affects the sensitivity of calibration metrics; the paper finds that using the L2 norm improves both optimization for calibration and the consistency of recalibration-method rankings.
  4. Pathologies in Static Binning: Because of cancellation effects, where overconfident and underconfident predictions fall into the same bin, ECE may report near-zero calibration error even for poorly calibrated models (see the sketch after this list).
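
To make these critiques concrete, here is a minimal NumPy sketch of the standard ECE recipe (illustrative only, not the authors' open-sourced library): it scores only the maximum prediction (item 1), uses fixed equal-width bins (item 2), exposes the L1/L2 choice from item 3, and its within-bin averaging produces the cancellation effect from item 4.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15, norm="l1"):
    """probs: (N, K) predicted class probabilities; labels: (N,) true class indices."""
    confidences = probs.max(axis=1)                # only the top prediction is scored
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)      # fixed, equal-width bins
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # The gap is taken AFTER averaging within the bin, so overconfident and
            # underconfident predictions in the same bin can cancel each other out.
            gap = accuracies[in_bin].mean() - confidences[in_bin].mean()
            weight = in_bin.mean()                 # fraction of datapoints in the bin
            total += weight * (abs(gap) if norm == "l1" else gap ** 2)
    return total if norm == "l1" else np.sqrt(total)
```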

Proposed Metrics and Methodological Insights

Responding to the identified flaws in ECE, this paper presents a diverse range of calibration metrics aimed at providing a more robust measure of calibration error:

  • Static Calibration Error (SCE) and Adaptive Calibration Error (ACE) are introduced to score predictions beyond only the maximum probability, with ACE addressing the bias-variance tradeoff through adaptive binning (a sketch follows this list).
  • Evaluating calibration across all classes and predictions, rather than focusing solely on the maximum probability, offers a more holistic view of a model's uncertainty.
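
As a rough illustration of the distinction (a sketch based on the descriptions above, not the paper's released library), SCE applies a fixed-bin recipe to every class probability, while ACE instead places bin edges at quantiles so that each bin holds roughly the same number of predictions:

```python
import numpy as np

def static_calibration_error(probs, labels, n_bins=15):
    """SCE: equal-width bins over every class probability, averaged over classes."""
    n, k = probs.shape
    correct = np.eye(k)[labels]                    # one-hot targets, shape (N, K)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    per_class = []
    for c in range(k):
        err = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (probs[:, c] > lo) & (probs[:, c] <= hi)
            if in_bin.any():
                gap = abs(correct[in_bin, c].mean() - probs[in_bin, c].mean())
                err += in_bin.mean() * gap
        per_class.append(err)
    return float(np.mean(per_class))

def adaptive_calibration_error(probs, labels, n_bins=15):
    """ACE: quantile-based bins so each bin covers roughly the same number of points."""
    n, k = probs.shape
    correct = np.eye(k)[labels]
    per_class = []
    for c in range(k):
        edges = np.quantile(probs[:, c], np.linspace(0.0, 1.0, n_bins + 1))
        bins = np.clip(np.searchsorted(edges, probs[:, c], side="right") - 1, 0, n_bins - 1)
        err = 0.0
        for b in range(n_bins):
            in_bin = bins == b
            if in_bin.any():                       # ties in probabilities can leave bins empty
                gap = abs(correct[in_bin, c].mean() - probs[in_bin, c].mean())
                err += gap / n_bins                # equal-mass bins get equal weight
        per_class.append(err)
    return float(np.mean(per_class))
```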

The authors perform a comprehensive empirical analysis across datasets such as MNIST, CIFAR-10/100, and ImageNet, demonstrating that conclusions about recalibration methods are considerably influenced by the choice of calibration metric. They find that varying properties such as class conditionality substantially changes measured calibration error, and that adaptive binning schemes make the rank ordering of methods more consistent (as measured by rank correlation) as the number of bins changes.
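
As a hypothetical illustration of the kind of consistency check described above (the method names and error values below are made up to show the mechanics, not results from the paper), one can rank the same recalibration methods under two metrics and compare the orderings with Spearman rank correlation:

```python
from scipy.stats import spearmanr

methods = ["uncalibrated", "temperature scaling", "vector scaling", "isotonic regression"]
errors_metric_a = [0.083, 0.021, 0.025, 0.030]   # placeholder values for an ECE-style metric
errors_metric_b = [0.091, 0.034, 0.028, 0.047]   # placeholder values for an ACE-style metric

rho, _ = spearmanr(errors_metric_a, errors_metric_b)
print(f"Rank correlation between the two orderings: {rho:.2f}")
# A low correlation means the metrics would rank the methods differently,
# which is exactly the sensitivity the paper's empirical study reports.
```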

Implications and Future Directions

The insights presented have profound implications for machine learning practices, particularly in safety-critical domains where calibration of predictive probabilities is paramount. By critiquing the default industry practices around calibration, the paper pushes for an informed selection of calibration measures tailored to specific model and application needs.

The introduction of metrics like ACE opens pathways for exploring more nuanced aspects of calibration error, which proper weighting of probability ranges might further enhance. Moving forward, developing consensus around the adoption of new benchmark metrics could harmonize evaluation standards across the field.

Conclusion

This paper presents a thorough critique of ECE and proposes robust alternatives for evaluating calibration in multiclass classification scenarios. With thoughtful analysis and experimentation, the authors advocate for a nuanced understanding and application of calibration metrics, urging the community to adopt flexible and adaptive measures that better reflect prediction uncertainty in deep learning models. This work sets the stage for ongoing research into optimizing network uncertainties and recalibration techniques to achieve truly reliable deep learning systems.