- The paper critiques standard uncertainty metrics and introduces a calibrated log-likelihood for fair model comparisons.
- It demonstrates that sophisticated ensembling methods yield similar test performance to simpler deep ensembles.
- The paper finds that test-time data augmentation significantly enhances accuracy with minimal extra computation.
Analysis of In-Domain Uncertainty Estimation and Ensembling in Deep Learning
The paper under scrutiny, "Pitfalls of In-Domain Uncertainty Estimation and Ensembling in Deep Learning," provides a rigorous exploration of uncertainty estimation and ensembling methods in deep learning, focusing on in-domain settings for image classification. The authors dissect the conventional metrics and methodologies, addressing the inadequacies and misconceptions commonly associated with these tools.
Key Contributions and Findings
- Uncertainty Estimation Metrics and Pitfalls: The authors examine popular metrics such as log-likelihood, Brier score, and calibration metrics, pointing out their vulnerabilities. In particular, they show that the log-likelihood is highly sensitive to the softmax temperature, so comparisons made without temperature calibration can reflect miscalibration rather than predictive quality. To rectify this, they propose the calibrated log-likelihood, the log-likelihood measured at the temperature that is optimal on held-out data, which enables a fairer comparison across models.
- Limits of Sophisticated Ensembling Techniques: The study introduces the Deep Ensemble Equivalent (DEE) score, which measures how many independently trained networks (a deep ensemble) are needed to match a given ensembling method's performance. The findings suggest that most sophisticated ensembling techniques, despite their complexity, perform no better than a deep ensemble of only a few independent models.
- Surprising Efficacy of Test-Time Data Augmentation: Test-time data augmentation (TTA) emerges as a compelling technique that improves ensemble performance with no additional training and little extra computational overhead. This underscores TTA's potential as an overlooked efficiency booster in uncertainty estimation.
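As a concrete illustration of the TTA idea above, a minimal sketch is to average the model's predictive distribution over the original input and several augmented views of it. Here `predict_fn` and the augmentation functions are hypothetical stand-ins, not the paper's exact setup.

```python
import numpy as np

def tta_predict(predict_fn, x, augment_fns):
    """Average predictive distributions over the input and its augmented copies.

    predict_fn: maps a batch of inputs to a (batch, classes) array of probabilities.
    augment_fns: list of functions producing augmented versions of the batch.
    """
    probs = [predict_fn(x)] + [predict_fn(aug(x)) for aug in augment_fns]
    return np.mean(probs, axis=0)
```

For image batches of shape (N, H, W), a horizontal flip such as `lambda b: b[:, :, ::-1]` is a typical cheap augmentation; the averaged output remains a valid probability distribution per example.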
Implications and Future Directions
The implications of this research are noteworthy in both practical and theoretical domains. Practically, the findings encourage a shift toward simpler ensembling techniques, which are both computationally efficient and effective, combined with post-hoc calibration methods such as temperature scaling to achieve optimal performance. Theoretically, the study pushes for a reconsideration of the metrics used to evaluate uncertainty estimation, advocating the adoption of more reliable, calibrated alternatives.
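Post-hoc temperature scaling and the resulting calibrated log-likelihood can be sketched as follows, assuming validation and test logits and labels are available as NumPy arrays. The grid search over temperatures is an illustrative simplification of the one-parameter optimization, not the authors' implementation.

```python
import numpy as np

def log_likelihood(logits, labels, T=1.0):
    """Mean log-likelihood of the true labels under temperature-scaled softmax."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return log_probs[np.arange(len(labels)), labels].mean()

def calibrated_log_likelihood(val_logits, val_labels, test_logits, test_labels,
                              temps=np.logspace(-1, 1, 101)):
    """Fit the temperature on validation data, then report test log-likelihood at it."""
    best_T = max(temps, key=lambda T: log_likelihood(val_logits, val_labels, T))
    return log_likelihood(test_logits, test_labels, best_T)
```

Because the temperature is fitted on held-out data rather than the test set, the resulting score compares models at their best calibration rather than penalizing an arbitrary softmax temperature.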
The paper's insights hold substantial potential for influencing the future trajectory of research in AI, particularly in deep learning architectures' robustness and reliability. Future work could focus on expanding the DEE framework to various domains beyond image classification and investigating the applicability of TTA in broader contexts.
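To make the DEE framework mentioned above concrete, one way to compute the score is to interpolate a method's (calibrated) log-likelihood against the score curve of deep ensembles of increasing size. The function below is an illustrative reconstruction under that assumption, not the authors' code.

```python
import numpy as np

def deep_ensemble_equivalent(method_ll, ensemble_ll_by_size):
    """Return the (interpolated) number of independent networks whose deep
    ensemble matches the evaluated method's log-likelihood.

    method_ll: calibrated log-likelihood of the method under evaluation.
    ensemble_ll_by_size: log-likelihoods of deep ensembles of size 1..n,
        assumed (roughly) monotonically increasing with size.
    """
    sizes = np.arange(1, len(ensemble_ll_by_size) + 1)
    return float(np.interp(method_ll, ensemble_ll_by_size, sizes))
```

For example, a method scoring halfway between the one-network and two-network ensembles would receive a DEE of 1.5, i.e., it is "worth" one and a half independently trained networks.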
Conclusion
In essence, the paper provides a comprehensive critique of current practices in in-domain uncertainty estimation and ensembling, underscoring the importance of reliable evaluation metrics and efficient techniques. Its empirical observations point to concrete avenues for refining model ensembles in deep learning and make a substantial contribution to the ongoing discourse on model reliability and efficiency in AI.