
Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration (1910.12656v1)

Published 28 Oct 2019 in cs.LG and stat.ML

Abstract: Class probabilities predicted by most multiclass classifiers are uncalibrated, often tending towards over-confidence. With neural networks, calibration can be improved by temperature scaling, a method to learn a single corrective multiplicative factor for inputs to the last softmax layer. On non-neural models the existing methods apply binary calibration in a pairwise or one-vs-rest fashion. We propose a natively multiclass calibration method applicable to classifiers from any model class, derived from Dirichlet distributions and generalising the beta calibration method from binary classification. It is easily implemented with neural nets since it is equivalent to log-transforming the uncalibrated probabilities, followed by one linear layer and softmax. Experiments demonstrate improved probabilistic predictions according to multiple measures (confidence-ECE, classwise-ECE, log-loss, Brier score) across a wide range of datasets and classifiers. Parameters of the learned Dirichlet calibration map provide insights to the biases in the uncalibrated model.

Citations (336)

Summary

  • The paper introduces a novel Dirichlet calibration method that extends beyond temperature scaling to yield reliable multiclass probability estimates.
  • It develops generative, linear, and canonical parametrizations, together with an off-diagonal and intercept (ODIR) regularization scheme, to reduce log-loss and classwise-ECE.
  • Experimental results across various classifiers demonstrate improved calibration and enhanced model interpretability for diagnostic insights.

Well-Calibrated Multiclass Probabilities with Dirichlet Calibration

The paper by Kull et al. addresses an essential aspect of probabilistic classification in machine learning: the calibration of predicted class probabilities. Calibration matters because the probabilities produced by classifiers are often overconfident, leading to unreliable decisions when they feed into cost-sensitive applications or human decision-making workflows. The paper proposes Dirichlet calibration, a method that extends beyond current techniques such as temperature scaling, which is primarily suited to neural networks. Because Dirichlet calibration can be applied to any probabilistic classifier, it is a significant step toward reliable probabilistic predictions across a wide range of models and datasets.
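
To make the connection to temperature scaling concrete, the short check below (a sketch; the NumPy/SciPy usage and the example logits are illustrative, not from the paper) verifies numerically that restricting the Dirichlet calibration map to W = I/T and b = 0, applied to log-probabilities, reproduces temperature scaling of the logits: the per-instance log-sum-exp term cancels inside the softmax.

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 10))   # illustrative logits: 5 instances, 10 classes
T = 2.5                        # an arbitrary temperature for the check

q = softmax(z, axis=1)                    # uncalibrated probabilities
temp_scaled = softmax(z / T, axis=1)      # temperature scaling on logits

# Dirichlet map softmax(W @ log(q) + b) restricted to W = I/T, b = 0:
dirichlet_special_case = softmax(np.log(q) / T, axis=1)

print(np.allclose(temp_scaled, dirichlet_special_case))  # True
```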

Summary of Methods

  1. Dirichlet Calibration: This approach adapts ideas from Dirichlet distributions to natively calibrate multiclass predictions. Log-transforming the uncalibrated probabilities and passing the result through a single linear layer and a softmax improves calibration for classifiers of any model class, not only neural networks (a code sketch of this map follows the list).
  2. Parametrization Schemes: The method operates under three parametrizations:
    • Generative Parametrization: Derives the calibration function based on Dirichlet distribution likelihoods for each class.
    • Linear Parametrization: Facilitates the implementation through neural network frameworks, interpreting the process as a series of log transformations and linear layers.
    • Canonical Parametrization: Provides a unique and interpretable framework allowing analysis of parameter effects on the calibration map.
  3. Regularization Techniques: The authors introduce a novel Off-Diagonal and Intercept Regularisation (ODIR) scheme, particularly effective when the logit or probability vectors are high-dimensional (i.e., there are many classes), as in deep neural networks. By penalising the off-diagonal weights and the intercepts of the calibration map, it curbs the overfitting that is common for heavily parametrized calibrators (the sketch below the list includes this penalty).
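
Putting items 1-3 together, the sketch below shows one plausible implementation of the linear parametrization with an ODIR penalty. It assumes PyTorch; the names (DirichletCalibrator, odir_penalty, fit), the optimizer choice, and the hyperparameter values are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirichletCalibrator(nn.Module):
    """Linear parametrization of Dirichlet calibration: log of the
    uncalibrated probabilities, one linear layer (W, b), then softmax."""

    def __init__(self, num_classes: int):
        super().__init__()
        # W is k x k and b has length k, so the map has k*k + k parameters.
        self.linear = nn.Linear(num_classes, num_classes)

    def forward(self, probs: torch.Tensor) -> torch.Tensor:
        log_p = torch.log(probs.clamp_min(1e-12))  # avoid log(0)
        return F.softmax(self.linear(log_p), dim=1)


def odir_penalty(calib: DirichletCalibrator, lam: float, mu: float) -> torch.Tensor:
    """Off-Diagonal and Intercept Regularisation: penalise the off-diagonal
    entries of W and the intercept b (normalisation constants omitted here)."""
    W = calib.linear.weight
    b = calib.linear.bias
    off_diag = W - torch.diag(torch.diagonal(W))
    return lam * off_diag.pow(2).sum() + mu * b.pow(2).sum()


def fit(calib, probs_val, labels_val, lam=1e-3, mu=1e-3, epochs=500, lr=0.01):
    """Fit the calibration map on a held-out validation set by minimising
    log-loss plus the ODIR penalty. probs_val: float tensor (n, k) of
    uncalibrated probabilities; labels_val: long tensor (n,) of class indices."""
    opt = torch.optim.Adam(calib.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        cal = calib(probs_val)
        loss = F.nll_loss(torch.log(cal.clamp_min(1e-12)), labels_val)
        (loss + odir_penalty(calib, lam, mu)).backward()
        opt.step()
    return calib
```

Once fitted, calling the module on any classifier's probability outputs returns calibrated probabilities; nothing about the underlying model has to change, which is the portability the paper emphasises.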

Experimental Findings

The paper presents extensive experiments across a wide range of datasets and classifiers, comparing Dirichlet calibration with existing methods, including one-vs-rest isotonic calibration, equal-width and equal-frequency binning, beta calibration, and temperature scaling. Key observations include:

  • Effectiveness Across Models: Dirichlet calibration consistently showed strong performance across various classifiers, notably outperforming other methods in terms of log-loss and classwise-ECE (a sketch of this metric follows the list) without significantly affecting accuracy.
  • Applications to Neural Networks: On deep neural networks, Dirichlet calibration combined with ODIR showed competitive performance, often surpassing temperature scaling, highlighting its adaptability to modern, complex architectures.
  • Interpretability: The calibration map generated by Dirichlet calibration provides insights into biases of uncalibrated models, a feature highly beneficial for model diagnostics and improvements.
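
For reference, here is a minimal sketch of the classwise-ECE metric referred to above; the equal-width binning and the default of 15 bins are assumptions for illustration, not necessarily the exact evaluation protocol used in the paper.

```python
import numpy as np

def classwise_ece(probs: np.ndarray, labels: np.ndarray, n_bins: int = 15) -> float:
    """Classwise-ECE: for every class j, bin the predicted probabilities for
    class j into equal-width bins, take the gap between mean predicted
    probability and observed frequency of class j in each bin, weight by bin
    size, then average over classes."""
    n, k = probs.shape
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for j in range(k):
        p_j = probs[:, j]
        y_j = (labels == j).astype(float)
        # Assign each probability to a bin; clip so p == 0.0 and p == 1.0
        # fall into the first and last bins respectively.
        bins = np.clip(np.digitize(p_j, edges, right=True), 1, n_bins) - 1
        for b in range(n_bins):
            mask = bins == b
            if mask.any():
                gap = abs(p_j[mask].mean() - y_j[mask].mean())
                total += (mask.sum() / n) * gap
    return total / k
```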

Implications and Future Directions

Dirichlet calibration's applicability to different classifiers, without requiring structural modification of the underlying model, has significant practical implications. It lets practitioners improve the reliability of probabilistic outputs across diverse machine learning tasks, fostering broader adoption in critical areas such as automated decision-making systems, where confidence in probabilistic estimates is paramount.

The theoretical implications are equally promising. The notion that canonical calibration functions reside within specific parametric families opens avenues to explore other distributions in the exponential family for calibration, providing tailored solutions for distinct types of classification challenges.

Future work could focus on further refining regularization techniques to enhance scalability and effectiveness in even larger and more complex models. Exploration into dynamic calibration mechanisms that adjust in response to dataset shifts could also provide more responsive and adaptive machine learning models, capable of maintaining calibration quality in evolving environments.