A Critical Analysis of Out-of-Distribution Generalization
The paper, "The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization," addresses a significant challenge in the field of machine learning: the robustness of models to distribution shifts. Authored by Dan Hendrycks et al., this paper provides a comprehensive evaluation of existing robustness interventions by introducing four new real-world distribution shift datasets and a novel data augmentation method.
Dataset Contribution
The authors have introduced four new datasets that capture real-world distribution shifts:
- ImageNet-Renditions (ImageNet-R): This dataset consists of 30,000 images depicting renditions of 200 ImageNet classes, including artistic representations such as paintings and sculptures.
- DeepFashion Remixed (DFR): This dataset contains fashion images with shifts in attributes like occlusion, scale, viewpoint, and zoom.
- StreetView StoreFronts (SVSF): This dataset includes images of storefronts altered by variables such as geographic location, capture year, and camera type.
- Real Blurry Images: A collection of 1,000 naturally blurry images representing a subset of 100 ImageNet classes.
These datasets allow for a nuanced examination of model performance in the face of various real-world distribution shifts.
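To make the evaluation setup concrete, the sketch below scores a pretrained classifier on ImageNet-R by restricting its predictions to the 200-class subset. This is a minimal illustration rather than the authors' code: the directory layout, the imagenet_wnid_to_index.json mapping file, and the choice of torchvision weights are all assumptions.

```python
# Minimal sketch: evaluating a pretrained classifier on ImageNet-R.
# Assumptions (not from the paper): ImageNet-R is extracted to ./imagenet-r with
# one folder per WordNet ID, and imagenet_wnid_to_index.json maps each WordNet ID
# to its index in the standard 1000-class ImageNet ordering.
import json
import torch
from torchvision import datasets, models, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

dataset = datasets.ImageFolder("./imagenet-r", transform=preprocess)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=4)

wnid_to_index = json.load(open("imagenet_wnid_to_index.json"))  # assumed mapping file
# Indices (in 1000-class space) of the 200 classes present in ImageNet-R,
# ordered to match ImageFolder's alphabetical label assignment.
subset = torch.tensor([wnid_to_index[wnid] for wnid in dataset.classes])

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        logits = model(images)[:, subset]  # restrict predictions to the 200-class subset
        correct += (logits.argmax(dim=1) == labels).sum().item()
        total += labels.numel()

print(f"ImageNet-R top-1 error: {100 * (1 - correct / total):.1f}%")
```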
Evaluation of Robustness Techniques
The paper evaluates four key classes of methods aimed at improving robustness:
- Larger Models: Increasing model size improves robustness to distribution shifts in several cases, particularly on ImageNet-R and ImageNet-C.
- Self-Attention: The addition of self-attention mechanisms improves robustness to certain shifts but does not consistently benefit all types of distribution shifts examined.
- Diverse Data Augmentation: Methods such as AugMix and the newly proposed DeepAugment yield large improvements on ImageNet-C, ImageNet-R, and Real Blurry Images, though the gains do not carry over to every shift examined (a simplified AugMix-style sketch follows this list).
- Pretraining: Utilizing larger pretraining datasets such as ImageNet-21K and Instagram hashtag data showed mixed results: certain pretraining regimes improved robustness, but their efficacy varied with the dataset.
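The idea behind diverse augmentation can be illustrated with a simplified AugMix-style mixing step: several randomly composed augmentation chains are blended together and then mixed with the original image. This is an illustrative sketch, not the official AugMix code; the augmentation operations listed are arbitrary choices, and the Jensen-Shannon consistency loss used during AugMix training is omitted.

```python
# Simplified AugMix-style mixing (illustrative sketch, not the official AugMix code).
import random
import numpy as np
from PIL import Image, ImageOps

# Any label-preserving ops would do; these three are illustrative choices.
OPS = [
    lambda img: img.rotate(random.uniform(-30, 30)),
    lambda img: ImageOps.posterize(img, random.randint(3, 6)),
    lambda img: ImageOps.autocontrast(img),
]

def augmix(image, width=3, depth=2, alpha=1.0):
    """Blend several randomly composed augmentation chains with the original image."""
    ws = np.random.dirichlet([alpha] * width)  # convex weights over the chains
    m = np.random.beta(alpha, alpha)           # weight for mixing with the original
    mix = np.zeros_like(np.asarray(image, dtype=np.float32))
    for w in ws:
        aug = image.copy()
        for _ in range(random.randint(1, depth)):  # chain of 1..depth random ops
            aug = random.choice(OPS)(aug)
        mix += w * np.asarray(aug, dtype=np.float32)
    mixed = m * np.asarray(image, dtype=np.float32) + (1 - m) * mix
    return Image.fromarray(np.uint8(np.clip(mixed, 0, 255)))
```

Blending whole augmentation chains, rather than applying a single heavy transform, is what gives this family of methods its diversity while keeping each augmented image recognizably on the original data manifold.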
Numerical Results and Key Findings
Notably, the paper presents strong numerical results that counter prior claims about the limitations of certain robustness interventions. For instance:
- The combination of DeepAugment and AugMix significantly improved robustness on ImageNet-C, reducing the mean corruption error (mCE; see the computation sketch after this list) of a ResNet-50 from 76.7% to 53.6%.
- On the ImageNet-R dataset, DeepAugment combined with AugMix reduced the top-1 error rate from 63.9% to 53.2%, outperforming even models pretrained on far larger datasets.
- While data augmentation showed potent benefits on datasets like ImageNet-R and Real Blurry Images, it had limited impact on DeepFashion Remixed, suggesting that certain distribution shifts require different robustness strategies.
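For context, mean corruption error (mCE) on ImageNet-C sums each corruption's top-1 error rates over five severities, normalizes by AlexNet's corresponding errors, and averages over the corruptions, following Hendrycks and Dietterich's original definition. The sketch below shows that computation; the error values in the example call are hypothetical, and real ImageNet-C uses 15 corruptions rather than two.

```python
# Sketch of the ImageNet-C mean corruption error (mCE) computation.
# `model_err` and `alexnet_err` map corruption name -> list of top-1 error rates
# at severities 1..5; the numbers in the example call below are hypothetical.

def mce(model_err: dict, alexnet_err: dict) -> float:
    """Average per-corruption errors, each normalized by AlexNet's errors."""
    ces = []
    for corruption, errs in model_err.items():
        ce = sum(errs) / sum(alexnet_err[corruption])  # AlexNet-normalized corruption error
        ces.append(ce)
    return 100.0 * sum(ces) / len(ces)                 # mCE as a percentage

# Hypothetical two-corruption example:
model_err   = {"gaussian_noise": [0.3, 0.4, 0.5, 0.6, 0.7],
               "motion_blur":    [0.2, 0.3, 0.4, 0.5, 0.6]}
alexnet_err = {"gaussian_noise": [0.6, 0.7, 0.8, 0.9, 0.95],
               "motion_blur":    [0.5, 0.6, 0.7, 0.8, 0.9]}
print(f"mCE: {mce(model_err, alexnet_err):.1f}%")
```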
Implications and Future Directions
The paper's multidimensional evaluation highlights the complexity of achieving robustness across varying shifts. It emphasizes the importance of employing multiple benchmarks to capture the diverse nature of distribution shifts encountered in real-world scenarios. Practically, the findings suggest that combining diverse data augmentation methods is a promising approach for improving model robustness. Theoretically, the results advocate for a deeper investigation into the relationship between synthetic benchmarks and real-world robustness.
The introduction of new datasets and the proposal of DeepAugment pave the way for future research to explore robustness more comprehensively. Future work can build on these findings to develop new robustness techniques that generalize across an even broader array of distribution shifts.
Conclusion
Hendrycks et al.'s paper is an essential contribution to the robustness literature, providing extensive empirical evidence that challenges and refines existing theories about model robustness. By introducing new benchmarks and demonstrating the efficacy of innovative data augmentation methods, the paper sets a new standard for evaluating robustness in machine learning. It invites the research community to adopt more rigorous, multifaceted evaluation protocols to advance the robustness and reliability of AI systems in real-world applications.