Assaying Out-Of-Distribution Generalization in Transfer Learning (2207.09239v2)

Published 19 Jul 2022 in cs.LG and stat.ML

Abstract: Since out-of-distribution generalization is a generally ill-posed problem, various proxy targets (e.g., calibration, adversarial robustness, algorithmic corruptions, invariance across shifts) were studied across different research programs resulting in different recommendations. While sharing the same aspirational goal, these approaches have never been tested under the same experimental conditions on real data. In this paper, we take a unified view of previous work, highlighting message discrepancies that we address empirically, and providing recommendations on how to measure the robustness of a model and how to improve it. To this end, we collect 172 publicly available dataset pairs for training and out-of-distribution evaluation of accuracy, calibration error, adversarial attacks, environment invariance, and synthetic corruptions. We fine-tune over 31k networks, from nine different architectures in the many- and few-shot setting. Our findings confirm that in- and out-of-distribution accuracies tend to increase jointly, but show that their relation is largely dataset-dependent, and in general more nuanced and more complex than posited by previous, smaller scale studies.

Citations (67)

Summary

  • The paper demonstrates, via a factor analysis of 31k model evaluations, that in-distribution classification error is a strong predictor of out-of-distribution accuracy.
  • It shows that full-model fine-tuning and diverse data augmentations boost robustness, especially in low-data and few-shot scenarios.
  • The study reveals that common robustness metrics like adversarial and corruption measures often fail to reliably predict real-world OOD performance.

Overview of "Assaying Out-Of-Distribution Generalization in Transfer Learning"

The paper "Assaying Out-Of-Distribution Generalization in Transfer Learning" presents a comprehensive empirical investigation into the out-of-distribution (OOD) generalization performance of deep learning models when applied to transfer learning tasks. The authors aim to unravel the complex relationships among various robustness metrics and how these relate to OOD performance in a transfer learning context. Unlike previous studies which have been limited in scale or scope, this research involves an extensive exploration using 172 dataset pairs to evaluate numerous robustness metrics, encompassing accuracy, adversarial robustness, calibration, invariance, and sensitivity to corruptions.

Primary Contributions and Findings

  1. Empirical Analysis and Factor Loadings: Through a factor analysis, the authors identify the primary latent factors underlying the correlations between metrics, finding that ID classification error is a strong general predictor of OOD accuracy. This contradicts the common preconception that corruption or adversarial metrics would better predict robustness to natural distribution shifts (a toy sketch of this style of analysis appears after this list).
  2. In-Depth Evaluation of Models: With over 31,000 model evaluations, the paper examines nine architectures across various settings, including fine-tuning strategies and data regimes (e.g., few-shot learning). Augmentations generally enhance performance, particularly in low-data regimes, and fine-tuning the full model typically yields better robustness than fine-tuning only the head, unless data is extremely scarce (both regimes are sketched in code after this list).
  3. Diverse Facets of OOD Generalization: The research challenges the notion that ID and OOD accuracy are always linearly related by showing that this relationship is not universal. Across datasets, the authors observe trends that are sometimes cleanly functional, sometimes mixed, and sometimes absent altogether, indicating that robustness cannot be assessed uniformly across different types of distribution shift.
  4. Calibration and Invariance Metrics: Calibration evaluated on ID data is shown to be an unreliable proxy for OOD robustness. However, metrics such as demographic disparity and multi-domain calibration are more predictive of OOD calibration under certain conditions, underscoring the multi-dimensional character of OOD generalization (a minimal calibration-error estimator is sketched after this list).
  5. Implications of Pre-Training: Interestingly, the paper finds that upstream robustness on ImageNet does not significantly transfer to better downstream OOD performance beyond the improvements secured by clean pre-training accuracy.
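
To make item 1 concrete, here is a minimal, self-contained sketch of that style of analysis. It is not the authors' pipeline: the metric table is synthetic, and the column names are illustrative stand-ins for the metrics evaluated on the paper's 31k fine-tuned networks. It also illustrates the per-dataset ID/OOD rank correlation discussed in item 3.

```python
# Toy illustration (not the paper's code): one row per fine-tuned model,
# one column per robustness metric; inspect how the metrics co-vary and
# extract a small number of latent factors.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical metric table; real values would come from the 31k runs.
n_models = 500
id_err = rng.uniform(0.05, 0.5, n_models)
metrics = pd.DataFrame({
    "id_error": id_err,
    "ood_error": id_err + rng.normal(0.1, 0.05, n_models),
    "ece": rng.uniform(0.01, 0.2, n_models),
    "corruption_error": id_err + rng.normal(0.2, 0.1, n_models),
    "adversarial_error": rng.uniform(0.2, 0.9, n_models),
})

# Rank correlation between ID and OOD error (the paper's headline relation).
rho, _ = spearmanr(metrics["id_error"], metrics["ood_error"])
print(f"Spearman(id_error, ood_error) = {rho:.2f}")

# Two-factor model over the standardized metrics; the loadings reveal which
# metrics move together (in the paper, most load on a shared accuracy factor).
X = StandardScaler().fit_transform(metrics)
fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
loadings = pd.DataFrame(fa.components_.T, index=metrics.columns,
                        columns=["factor_1", "factor_2"])
print(loadings.round(2))
```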
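
The two fine-tuning regimes compared in item 2 are easy to express in PyTorch. The following is a generic sketch under assumed choices (a torchvision ResNet-50 backbone, SGD, an arbitrary 37-class downstream task), not the paper's training code:

```python
# Generic sketch of head-only vs. full-model fine-tuning.
import torch
import torch.nn as nn
from torchvision import models

def build_finetune_model(num_classes: int, full_finetune: bool) -> nn.Module:
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    if not full_finetune:
        # Head-only regime: freeze every pretrained parameter.
        for p in model.parameters():
            p.requires_grad = False
    # Replace the classifier head for the downstream task; its fresh
    # parameters are trainable in both regimes.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

model = build_finetune_model(num_classes=37, full_finetune=True)
# Passing only trainable parameters keeps the head-only regime cheap.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3, momentum=0.9,
)
```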
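
Item 4 concerns calibration error, most often estimated via expected calibration error (ECE). Below is a minimal equal-width-binned ECE estimator; this is one standard formulation, not necessarily the exact measure used in the paper:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width binned ECE: weighted mean of |accuracy - confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the bin's sample share
    return ece

# Toy usage: top-class confidences and 0/1 correctness flags.
conf = np.array([0.9, 0.8, 0.6, 0.95, 0.55])
hit = np.array([1, 1, 0, 1, 1])
print(f"ECE = {expected_calibration_error(conf, hit):.3f}")
```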

Implications and Speculation on Future Advances

  1. Evaluation Protocol and Best Practices: The findings imply that improving ID accuracy is often the best available strategy for enhancing OOD robustness, since most alternative metrics add little information beyond what ID accuracy already provides. Effective transfer learning pipelines should therefore prioritize strong ID performance, complemented by evaluation on real-world held-out data.
  2. Augmentations and Data Variety: Given the observed benefits of augmentations, diverse data augmentations should remain a cornerstone of training strategies, especially where data is limited (an illustrative pipeline is sketched after this list).
  3. Architectural Innovations: The variation in robustness between architectures suggests that architectural innovation remains a fertile ground for research, potentially yielding models that can generalize better across different domains. However, any claims about architectural superiority should be substantiated by rigorous testing across a wide array of shifts and tasks.
  4. Broader Evaluation of Robustness: The limited utility of synthetic corruptions in predicting robustness points to a need for benchmarks that emulate real-world conditions more faithfully. Future advances in high-fidelity simulation or generative models may enable more realistic evaluations.
  5. Refinement of Robustness Concepts: The paper lays the groundwork for refining what constitutes robustness in machine learning models, calling for a richer vocabulary and understanding that goes beyond simple accuracy and error metrics.
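
As an illustration for point 2 above, a typical diverse-augmentation pipeline in torchvision might look as follows; the specific composition is an assumption for illustration, not the paper's augmentation recipe:

```python
# Illustrative torchvision augmentation pipeline (assumed, not from the paper).
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```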

Overall, the paper's empirical breadth and the scale of its experiments make it a valuable reference for anyone working on transfer learning under distribution shift, offering insights that can guide future work on measuring and improving robustness.
