Overview of Out-of-Distribution Testing: An Example of Goodhart's Law
The paper under consideration investigates the practice of Out-of-Distribution (OOD) testing in machine learning, specifically through the lens of Visual Question Answering (VQA). The authors focus on the VQA-CP benchmark, a well-known OOD benchmark, and identify several issues with how it is currently used in research. OOD testing is meant to evaluate a model's ability to generalize beyond the biases of its training data; however, the authors argue that the way VQA-CP is used has become an instance of Goodhart's Law: once performance on the benchmark became the target, the numbers it produces stopped reflecting genuine generalization.
Key Issues Identified
- Exploiting Dataset Construction: Many methods evaluated on VQA-CP exploit the known construction of its OOD splits. Because the answer distribution for each group of question prefixes is roughly inverted between the training and test sets, models can score well by implicitly learning to "invert" the training-set correlations between question prefixes and answers, rather than by genuinely understanding the image and question (a toy sketch of this exploit follows the list below).
- Use of Test Set for Model Selection: Because VQA-CP provides no held-out validation set, researchers routinely use the OOD test set for hyperparameter tuning and model selection. This contravenes standard machine learning protocol and risks overfitting to the idiosyncrasies of the test set rather than measuring true generalization.
- In-domain Evaluation Post-retraining: It is common to retrain models on the more balanced VQA v2 dataset before reporting their in-domain performance, rather than evaluating the same model that was trained on the VQA-CP training split. This conceals any in-domain accuracy drop caused by the bias-mitigation technique, so the reported numbers overstate the robustness of a single trained model.
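The exploit described in the first point can be illustrated with a toy example. The sketch below uses made-up per-prefix answer counts rather than the actual VQA-CP annotations, and predicts, for each question prefix, the answer that is rare in training, banking on the known inversion between the splits; no image or question content is used at all.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (question_prefix, answer) pairs standing in for
# VQA-CP annotations. In VQA-CP, the answer distribution within each
# question-prefix group differs sharply between train and test.
train = [("what color", "red")] * 70 + [("what color", "blue")] * 30
test = [("what color", "blue")] * 70 + [("what color", "red")] * 30

# Per-prefix answer frequencies, estimated on the training split only.
prior = defaultdict(Counter)
for prefix, answer in train:
    prior[prefix][answer] += 1

def exploit_prediction(prefix):
    # Ignore the image and the rest of the question entirely: predict the
    # answer that is *least* frequent for this prefix in training, betting
    # on the known inversion between the splits.
    return min(prior[prefix], key=prior[prefix].get)

accuracy = sum(exploit_prediction(p) == a for p, a in test) / len(test)
print(f"Inverted-prior accuracy on the shifted test split: {accuracy:.0%}")  # ~70%
```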
Experimental Findings
The authors conducted experiments demonstrating that simple models that perform no visual reasoning, including randomly generated predictions, can exceed state-of-the-art performance on certain question types in VQA-CP under current practices. For instance, a method that merely samples answers from the inverted training distribution can surpass sophisticated models on yes/no questions. This empirical evidence underscores the flawed nature of existing practices: claimed improvements may not reflect genuine modeling advances.
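The following toy sketch illustrates the kind of random baseline described above, under the simplifying assumption that the yes/no answer distribution for a question group is exactly mirrored between train and test; the numbers are illustrative, not the paper's reported results.

```python
import random

# Illustrative numbers only: assume ~80% of training answers for some
# yes/no question group are "yes", and (as in VQA-CP's construction) the
# test distribution for that group is roughly the mirror image.
p_yes_train = 0.8
rng = random.Random(0)
test_answers = ["yes" if rng.random() < (1 - p_yes_train) else "no"
                for _ in range(10_000)]

def random_inverted_prediction():
    # Sample an answer from the *inverted* training distribution,
    # without looking at the image or the question.
    return "yes" if rng.random() < (1 - p_yes_train) else "no"

hits = sum(random_inverted_prediction() == a for a in test_answers)
print(f"Random inverted baseline accuracy: {hits / len(test_answers):.0%}")
# Expected: 0.2 * 0.2 + 0.8 * 0.8 = 68%, with no reasoning whatsoever.
```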
Recommendations and Implications
The paper concludes with several suggestions both for using VQA-CP and for future dataset designs to better achieve the goal of evaluating out-of-distribution performance:
- Report in-domain and OOD evaluations of the same trained model, rather than of separately retrained configurations, so the two results are directly comparable (see the sketch after this list).
- Focus more analysis on question types that are harder to game through knowledge of the dataset's construction, such as open-ended questions requiring richer reasoning, rather than yes/no questions, which are the easiest to exploit.
- Design new benchmarks that involve multiple sources of distribution shift and whose test distributions are not a simple, known transformation (such as an inversion) of the training biases.
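A minimal sketch of the first recommendation follows, with illustrative names and a trivial stand-in model (run_protocol, MajorityAnswerModel, and the data are assumptions, not the paper's code): the model is trained once, tuned on a held-out in-domain validation split rather than the test set, and the same checkpoint is then evaluated both in-domain and OOD.

```python
import random
from collections import Counter

class MajorityAnswerModel:
    """Stand-in for a real VQA model: always predicts the most common training answer."""
    def fit(self, split):
        self.answer = Counter(a for _, a in split).most_common(1)[0][0]
    def predict(self, question):
        return self.answer

def evaluate(model, split):
    return sum(model.predict(q) == a for q, a in split) / len(split)

def run_protocol(model, train_split, ood_test_split, val_fraction=0.1, seed=0):
    # Hold out an in-domain validation set from the training data; use it
    # (not the OOD test set) for hyperparameter tuning, and also report it
    # as the in-domain score of the *same* trained checkpoint.
    shuffled = random.Random(seed).sample(train_split, len(train_split))
    n_val = max(1, int(len(shuffled) * val_fraction))
    val_split, fit_split = shuffled[:n_val], shuffled[n_val:]

    model.fit(fit_split)  # a single training run
    return {
        "in_domain_acc": evaluate(model, val_split),   # same distribution as training
        "ood_acc": evaluate(model, ood_test_split),    # shifted test split
    }

train = [("is it sunny?", "yes")] * 80 + [("is it sunny?", "no")] * 20
ood_test = [("is it sunny?", "no")] * 80 + [("is it sunny?", "yes")] * 20
print(run_protocol(MajorityAnswerModel(), train, ood_test))
# One checkpoint, two directly comparable numbers: high in-domain, low OOD.
```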
Beyond the specific benchmark, these insights have substantial implications for research in AI and machine learning. They call into question the validity of numerous published VQA models that report improvements on VQA-CP, suggesting that some may not be advancing true generalization capabilities.
Overall, the challenges identified underscore the complexity of designing datasets and evaluations that faithfully assess a model's ability to generalize beyond known data distributions. As AI systems are increasingly deployed in diverse real-world applications, ensuring the robustness of these systems to OOD data becomes critically important. This paper thus represents an important reflection on current practices and offers a roadmap for more rigorous and meaningful OOD evaluations in future research.