Evaluating Predictive Uncertainty Under Dataset Shift
The paper addresses the critical issue of quantifying predictive uncertainty in deep neural networks (DNNs) under conditions of dataset shift. Despite the significant strides made in the accuracy of DNNs across various domains, the utility of uncertainty estimates remains underexplored, particularly in scenarios where the input distribution deviates from the training distribution. This paper delivers a rigorous, large-scale empirical evaluation of multiple state-of-the-art uncertainty estimation methods on classification tasks subjected to dataset shift.
Introduction
The necessity of reliable uncertainty estimates is profoundly evident in high-stakes applications such as medical diagnosis or autonomous driving, where decisions are contingent upon model outputs. Modern machine learning frameworks, including probabilistic neural networks and Bayesian neural networks, offer methods to quantify uncertainty. These methods, however, have typically been evaluated under ideal conditions in which the test data are assumed to be drawn from the same distribution as the training data (the i.i.d. assumption). The paper posits that such evaluations are insufficient and proposes examining the robustness of uncertainty estimates under varying types and intensities of dataset shift.
Methodological Framework
The authors benchmarked several popular uncertainty estimation methods, including:
- Vanilla Maximum Softmax Probability
- Temperature Scaling
- Monte-Carlo Dropout
- Deep Ensembles
- Stochastic Variational Inference (SVI)
- Last-layer approximations (e.g., last-layer SVI and last-layer Dropout)
The benchmarks span image data (MNIST, CIFAR-10, ImageNet), text (20 Newsgroups), and categorical features (Criteo ad-click prediction), with test sets subjected to increasing intensities of shift as well as fully out-of-distribution inputs.
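As a concrete illustration, the sketch below shows how two of the benchmarked methods form a predictive distribution: MC Dropout averages softmax outputs over stochastic forward passes with dropout left active, and a deep ensemble averages the softmax outputs of independently trained networks. This is a minimal PyTorch sketch rather than the paper's own code; `model`, `models`, and `num_samples` are placeholders.

```python
import torch
import torch.nn.functional as F

def mc_dropout_predict(model, x, num_samples=20):
    """MC Dropout: keep dropout active at test time and average
    softmax outputs over stochastic forward passes."""
    model.train()  # keeps dropout stochastic (assumes no BatchNorm, or BatchNorm frozen)
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(num_samples)])
    return probs.mean(dim=0)  # [batch, num_classes] predictive distribution

def ensemble_predict(models, x):
    """Deep ensemble: average softmax outputs of independently trained networks."""
    for m in models:
        m.eval()
    with torch.no_grad():
        probs = torch.stack([F.softmax(m(x), dim=-1) for m in models])
    return probs.mean(dim=0)
```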
Evaluation Metrics
The empirical evaluation considers multiple metrics (a brief implementation sketch follows the list):
- Accuracy and Brier Score to assess prediction quality.
- Negative Log-Likelihood (NLL), a proper scoring rule that evaluates the full predictive distribution.
- Expected Calibration Error (ECE) to measure how well confidence matches accuracy.
- Predictive entropy and confidence distributions on out-of-distribution (OOD) inputs.
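The NumPy sketch below gives one plausible implementation of these metrics; choices such as 10 equal-width confidence bins for ECE follow common practice and are assumptions, not necessarily the paper's exact configuration.

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and one-hot labels."""
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

def negative_log_likelihood(probs, labels):
    """Average negative log probability assigned to the true class."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def expected_calibration_error(probs, labels, num_bins=10):
    """Bin predictions by confidence and average |accuracy - confidence| per bin."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = np.mean(predictions[mask] == labels[mask])
            conf = np.mean(confidences[mask])
            ece += mask.mean() * abs(acc - conf)
    return ece

def predictive_entropy(probs):
    """Entropy of the predictive distribution; ideally higher on OOD inputs."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)
```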
Results and Discussion
Performance Under Dataset Shift: The empirical results show a consistent decline in the quality of uncertainty estimates with increasing dataset shift for all evaluated methods. This underscores the inadequacy of relying solely on i.i.d. test sets for evaluating uncertainty methods.
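In code, the shift benchmark amounts to recomputing these metrics at each shift intensity. A hypothetical evaluation loop, reusing the metric helpers sketched above (`predict_fn` and `corrupted_loaders` are assumed placeholders), might look like:

```python
import numpy as np

def evaluate_under_shift(predict_fn, corrupted_loaders):
    """corrupted_loaders: dict mapping shift intensity (e.g. 1..5) to a
    DataLoader over the correspondingly shifted test set."""
    results = {}
    for intensity, loader in sorted(corrupted_loaders.items()):
        probs, labels = [], []
        for x, y in loader:
            probs.append(predict_fn(x).cpu().numpy())
            labels.append(y.numpy())
        probs, labels = np.concatenate(probs), np.concatenate(labels)
        results[intensity] = {
            "accuracy": float((probs.argmax(axis=1) == labels).mean()),
            "brier": brier_score(probs, labels),
            "nll": negative_log_likelihood(probs, labels),
            "ece": expected_calibration_error(probs, labels),
        }
    return results  # metric quality typically degrades as intensity grows
```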
Comparative Analysis: While temperature scaling provides good calibration on in-distribution validation sets, it performs poorly under dataset shift. In contrast, methods that account for epistemic uncertainty (e.g., deep ensembles and MC Dropout) remain better calibrated and more reliable as the shift intensifies. In particular, deep ensembles outperform the other methods across the majority of datasets and evaluation metrics.
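The failure mode of temperature scaling follows directly from how it is fit: a single temperature T is chosen to minimise NLL on an i.i.d. validation set and then applied unchanged at test time. A minimal PyTorch sketch, assuming placeholder `val_logits`/`val_labels` tensors:

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, max_iter=50):
    """Fit a single temperature T > 0 on held-out validation logits by minimising NLL."""
    log_t = torch.zeros(1, requires_grad=True)  # optimise log T so that T = exp(log_t) > 0
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# T is tuned for the validation distribution only, which is why the resulting
# calibration can degrade once the test distribution shifts:
# calibrated_probs = F.softmax(test_logits / T, dim=-1)
```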
Ensembles and Scalability: Although deep ensembles demonstrated the best overall performance, they incur higher computational and storage costs, a concern for scalability in practical applications. Interestingly, the performance gains from ensembling plateau after a relatively small number of models (5-10), suggesting that modest ensembles capture most of the benefit of uncertainty quantification.
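This plateau can be probed directly by sweeping the ensemble size and tracking a proper scoring rule; a hypothetical sweep using the `ensemble_predict` and `negative_log_likelihood` helpers sketched earlier (`models`, `x_test`, `y_test` are placeholders):

```python
# models: list of independently trained networks; x_test, y_test: held-out tensors
nll_by_size = {}
for m in range(1, len(models) + 1):
    probs = ensemble_predict(models[:m], x_test).numpy()
    nll_by_size[m] = negative_log_likelihood(probs, y_test.numpy())
# Per the paper's findings, gains largely flatten out around 5-10 members.
```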
SVI and Limitations: SVI methods showed promise on datasets like MNIST but were difficult to scale to more complex datasets like ImageNet, indicating implementation and computational hurdles.
Practical Implications and Future Work
The paper provides several key takeaways:
- Deep Ensembles: As a robust method for quantifying uncertainty under dataset shift, deep ensembles should be considered despite their higher resource requirements.
- SVI Challenges: Future work should explore more scalable variants or approximations of SVI to leverage its theoretical advantages in practical scenarios.
- Calibration Beyond Validation Sets: Post-hoc calibration tuned on an i.i.d. validation set does not transfer to shifted data; calibration methods need to be made adaptive and robust to significant distributional changes.
- Real-world Relevance: The inclusion of realistic, domain-specific benchmarks such as the Criteo ad-click data and the 20 Newsgroups text corpus illustrates the broader applicability and challenges of uncertainty quantification across data modalities.
Future Directions:
- Developing hybrid models that combine the efficiency of single-model approaches with the robustness of ensemble methods.
- Investigating adaptive calibration techniques that dynamically adjust to shifting distributions.
- Enhancing computational efficiency and scalability through distributed and parallel processing techniques.
Conclusion
This paper makes a significant contribution to the domain of uncertainty quantification in neural networks by systematically benchmarking various state-of-the-art methods under conditions of dataset shift. The findings emphasize the inadequacy of traditional evaluation methods and highlight the robustness of deep ensembles. Moving forward, the research community is encouraged to develop more scalable and adaptive methods for uncertainty quantification to address the complexities introduced by real-world data shifts.