Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks (2408.16757v2)

Published 29 Aug 2024 in cs.CV and cs.AI

Abstract: Detecting test-time distribution shift has emerged as a key capability for safely deployed machine learning models, with the question being tackled under various guises in recent years. In this paper, we aim to provide a consolidated view of the two largest sub-fields within the community: out-of-distribution (OOD) detection and open-set recognition (OSR). In particular, we aim to provide rigorous empirical analysis of different methods across settings and provide actionable takeaways for practitioners and researchers. Concretely, we make the following contributions: (i) We perform rigorous cross-evaluation between state-of-the-art methods in the OOD detection and OSR settings and identify a strong correlation between the performances of methods for them; (ii) We propose a new, large-scale benchmark setting which we suggest better disentangles the problem tackled by OOD detection and OSR, re-evaluating state-of-the-art OOD detection and OSR methods in this setting; (iii) We surprisingly find that the best performing method on standard benchmarks (Outlier Exposure) struggles when tested at scale, while scoring rules which are sensitive to the deep feature magnitude consistently show promise; and (iv) We conduct empirical analysis to explain these phenomena and highlight directions for future research. Code: https://github.com/Visual-AI/Dissect-OOD-OSR

Summary

  • The paper presents a comprehensive cross-evaluation of OOD detection and OSR methods, revealing robust performance of magnitude-aware scoring mechanisms.
  • It introduces a large-scale benchmark that distinguishes between semantic and covariate shifts, enhancing evaluation of model resilience.
  • A novel metric, Outlier-Aware Accuracy, is proposed to better balance detection capability with robustness, emphasizing the importance of auxiliary data selection.

Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks

The paper, titled "Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks," presents a critical and comprehensive analysis of out-of-distribution (OOD) detection and open-set recognition (OSR). The authors, Hongjun Wang, Sagar Vaze, and Kai Han, systematically examine these closely related subfields, offering empirical insights and proposing new benchmarks to sharpen how progress in the area is assessed.

The authors first emphasize the increasing importance of detecting test-time distribution shifts, a capability crucial for the safe deployment of machine learning models. OOD detection is concerned with identifying samples that come from a different distribution than the training set, while OSR involves recognizing when a model encounters novel semantic categories not present during training. Despite their similar objectives, these fields have evolved somewhat independently, and a comprehensive cross-evaluation of methods from each area had been lacking.

A central goal of the paper is to thoroughly benchmark and analyze the performance of existing state-of-the-art methods across both OOD detection and OSR. The authors' cross-evaluation reveals a strong correlation between how well methods perform on the two tasks: a method that ranks highly on one tends to rank highly on the other. In particular, magnitude-aware scoring mechanisms, such as the Maximum Logit Score (MLS) and the energy score, consistently outperform alternatives across tasks and datasets. This finding is particularly interesting given the conceptual differences often assumed between OOD detection and OSR.
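
To make these scoring rules concrete, the snippet below sketches how MLS and the energy score can be computed from a classifier's raw logits. This is a minimal PyTorch illustration, not the authors' code; the temperature T and the threshold choice are illustrative assumptions.

```python
import torch

def max_logit_score(logits: torch.Tensor) -> torch.Tensor:
    """Maximum Logit Score (MLS): higher values indicate 'more in-distribution'."""
    return logits.max(dim=-1).values

def energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Negative free energy, T * logsumexp(logits / T); like MLS it is
    sensitive to overall logit magnitude (higher = more in-distribution)."""
    return T * torch.logsumexp(logits / T, dim=-1)

# Example: score a batch of logits and flag likely OOD samples below a threshold.
logits = torch.randn(4, 1000)        # e.g. outputs of an ImageNet classifier
scores = energy_score(logits)        # or max_logit_score(logits)
threshold = scores.quantile(0.05)    # illustrative threshold choice
is_ood = scores < threshold
print(is_ood)
```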

Furthermore, the paper critiques conventional benchmarks, which inadequately separate the two main types of distribution shift: semantic shift (novel categories) and covariate shift (familiar categories under new domains or corruptions). To better isolate these phenomena, the authors propose a large-scale benchmark designed to disentangle them, using ImageNet-C and ImageNet-R for covariate shift and the Semantic Shift Benchmark (SSB) for semantic shift. This setting addresses the need for more nuanced benchmarks that capture the practical challenges posed by real-world applications.
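
As a rough illustration of how such a disentangled evaluation might be organized, the snippet below computes AUROC separately against a covariate-shift split and a semantic-shift split, given pre-computed detection scores. The scores here are random placeholders standing in for a real detector's outputs; only the overall structure reflects the benchmark described above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def shift_auroc(id_scores: np.ndarray, shift_scores: np.ndarray) -> float:
    """AUROC for separating in-distribution from shifted samples,
    assuming higher scores mean 'more in-distribution'."""
    labels = np.concatenate([np.ones(len(id_scores)), np.zeros(len(shift_scores))])
    scores = np.concatenate([id_scores, shift_scores])
    return roc_auc_score(labels, scores)

# Random placeholder scores standing in for a detector's outputs on each split.
rng = np.random.default_rng(0)
id_scores = rng.normal(1.0, 1.0, 5000)         # e.g. ImageNet validation (ID)
covariate_scores = rng.normal(0.5, 1.0, 5000)  # e.g. ImageNet-C / ImageNet-R
semantic_scores = rng.normal(0.0, 1.0, 5000)   # e.g. SSB unseen categories

print("covariate-shift AUROC:", shift_auroc(id_scores, covariate_scores))
print("semantic-shift AUROC:", shift_auroc(id_scores, semantic_scores))
```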

A notable finding is that Outlier Exposure (OE), which performs well on standard OOD detection benchmarks, struggles in the proposed large-scale setting, whereas magnitude-aware scoring rules remain robust. The results suggest that the efficacy of OE depends heavily on how closely the auxiliary training outliers resemble the OOD data actually encountered at test time. The implication for practitioners is clear: auxiliary data selection is critical, and methods need to adapt to varying distributional characteristics.
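
For context, Outlier Exposure augments the standard classification loss with a term that pushes the model's predictions on auxiliary outlier data towards the uniform distribution. The sketch below shows one common formulation of that objective; the weight lambda_oe and the function name are illustrative placeholders rather than the authors' exact training configuration.

```python
import torch
import torch.nn.functional as F

def outlier_exposure_loss(logits_id: torch.Tensor,
                          labels_id: torch.Tensor,
                          logits_aux: torch.Tensor,
                          lambda_oe: float = 0.5) -> torch.Tensor:
    """Cross-entropy on in-distribution data plus a uniformity penalty
    on auxiliary outliers (cross-entropy to the uniform distribution)."""
    ce_id = F.cross_entropy(logits_id, labels_id)
    # Penalize confident predictions on outliers: up to a constant, this is
    # the KL divergence from the uniform distribution to the model's softmax.
    uniform_ce = -F.log_softmax(logits_aux, dim=-1).mean(dim=-1).mean()
    return ce_id + lambda_oe * uniform_ce

# Example with dummy tensors for a 10-way classifier.
logits_id, labels_id = torch.randn(8, 10), torch.randint(0, 10, (8,))
logits_aux = torch.randn(8, 10)
loss = outlier_exposure_loss(logits_id, labels_id, logits_aux)
```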

The paper also introduces a new metric, Outlier-Aware Accuracy (OAA), to better evaluate the balance between robustness to distribution shift and the ability to detect it. In scenarios where robustness and detection trade off against each other, OAA offers a single, more informative measure, helping researchers align methodological improvements with practical performance.
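
The summary above does not spell out the formula, so the following is only a hypothetical sketch of an accuracy-style metric that jointly rewards correct classification of accepted in-distribution samples and rejection of outliers; it should not be taken as the paper's exact definition of OAA.

```python
import numpy as np

def combined_accuracy_sketch(pred_labels, true_labels, rejected, is_ood):
    """Hypothetical combined metric (not necessarily the paper's OAA):
    an in-distribution sample counts as correct if it is accepted and
    classified correctly; an outlier counts as correct if it is rejected."""
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    rejected = np.asarray(rejected, dtype=bool)
    is_ood = np.asarray(is_ood, dtype=bool)
    correct_id = ~is_ood & ~rejected & (pred_labels == true_labels)
    correct_ood = is_ood & rejected
    return float((correct_id | correct_ood).mean())

# Example: two ID samples (one misclassified) and two outliers (one rejected).
print(combined_accuracy_sketch([0, 1, 2, 3], [0, 2, -1, -1],
                               [False, False, True, False],
                               [False, False, True, True]))  # -> 0.5
```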

In terms of future directions, the findings in this paper hint at several promising avenues. The community could explore the development of training methodologies that are less sensitive to auxiliary data choices or that adapt dynamically to diverse distributional shifts. Additionally, the potential for unifying frameworks that leverage the strengths of both OOD detection and OSR could catalyze new innovations across the board.

In conclusion, this paper makes significant contributions to the field by rigorously evaluating current techniques, proposing robust benchmarks, and establishing new metrics for performance evaluation. It aligns empirical observations with theoretical evaluations, thus providing a foundation for future research in improving the resilience and reliability of machine learning systems under diverse and unpredictable conditions.
