Overview of Out-of-Distribution Testing: An Example of Goodhart's Law
The paper under consideration investigates the practice of Out-of-Distribution (OOD) testing in machine learning, specifically through the lens of Visual Question Answering (VQA). The authors focus on the VQA-CP benchmark, a well-known OOD benchmark, and identify several issues with how it is currently used in research. OOD testing is meant to evaluate a model's ability to generalize beyond the biases of its training data; however, the authors argue that the way VQA-CP is used has become an instance of Goodhart's Law: once performance on the benchmark became the target, the numbers it produces stopped reflecting genuine generalization.
Key Issues Identified
- Exploiting Dataset Construction: Many methods evaluated on VQA-CP exploit the known construction of its OOD splits. Because the answer distribution for each group of question prefixes is roughly inverted between the training and test sets, models can score well by implicitly learning to "invert" the training-set correlations between question prefixes and answers, rather than by genuinely understanding the image and question (a toy sketch of this exploit follows the list below).
- Use of Test Set for Model Selection: Because VQA-CP provides no held-out validation set, researchers routinely use the OOD test set for hyperparameter tuning and model selection. This contravenes standard machine learning protocol and risks overfitting to the idiosyncrasies of the test set rather than measuring true generalization.
- In-domain Evaluation Post-retraining: It is common to retrain models on the more balanced VQA v2 dataset before reporting their in-domain performance, rather than evaluating the same model that was trained on the VQA-CP training split. This conceals any in-domain accuracy drop caused by the bias-mitigation technique, so the reported numbers overstate the robustness of a single trained model.
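The exploit described in the first point can be illustrated with a toy example. The sketch below uses made-up per-prefix answer counts rather than the actual VQA-CP annotations, and predicts, for each question prefix, the answer that is rare in training, banking on the known inversion between the splits; no image or question content is used at all.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (question_prefix, answer) pairs standing in for
# VQA-CP annotations. In VQA-CP, the answer distribution within each
# question-prefix group differs sharply between train and test.
train = [("what color", "red")] * 70 + [("what color", "blue")] * 30
test = [("what color", "blue")] * 70 + [("what color", "red")] * 30

# Per-prefix answer frequencies, estimated on the training split only.
prior = defaultdict(Counter)
for prefix, answer in train:
    prior[prefix][answer] += 1

def exploit_prediction(prefix):
    # Ignore the image and the rest of the question entirely: predict the
    # answer that is *least* frequent for this prefix in training, betting
    # on the known inversion between the splits.
    return min(prior[prefix], key=prior[prefix].get)

accuracy = sum(exploit_prediction(p) == a for p, a in test) / len(test)
print(f"Inverted-prior accuracy on the shifted test split: {accuracy:.0%}")  # ~70%
```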
Experimental Findings
The authors conducted experiments demonstrating that simple models that perform no visual reasoning, including randomly generated predictions, can exceed state-of-the-art performance on certain question types in VQA-CP under current practices. For instance, a method that merely samples answers from the inverted training distribution can surpass sophisticated models on yes/no questions. This empirical evidence underscores the flawed nature of existing practices: claimed improvements may not reflect genuine modeling advances.
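The following toy sketch illustrates the kind of random baseline described above, under the simplifying assumption that the yes/no answer distribution for a question group is exactly mirrored between train and test; the numbers are illustrative, not the paper's reported results.

```python
import random

# Illustrative numbers only: assume ~80% of training answers for some
# yes/no question group are "yes", and (as in VQA-CP's construction) the
# test distribution for that group is roughly the mirror image.
p_yes_train = 0.8
rng = random.Random(0)
test_answers = ["yes" if rng.random() < (1 - p_yes_train) else "no"
                for _ in range(10_000)]

def random_inverted_prediction():
    # Sample an answer from the *inverted* training distribution,
    # without looking at the image or the question.
    return "yes" if rng.random() < (1 - p_yes_train) else "no"

hits = sum(random_inverted_prediction() == a for a in test_answers)
print(f"Random inverted baseline accuracy: {hits / len(test_answers):.0%}")
# Expected: 0.2 * 0.2 + 0.8 * 0.8 = 68%, with no reasoning whatsoever.
```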
Recommendations and Implications
The paper concludes with several suggestions both for using VQA-CP and for future dataset designs to better achieve the goal of evaluating out-of-distribution performance:
- Report in-domain and OOD evaluations of the same trained model, rather than of separately retrained configurations, so the two results are directly comparable (see the sketch after this list).
- Focus more analysis on question types that are harder to game through knowledge of the dataset's construction, such as open-ended questions requiring richer reasoning, rather than yes/no questions, which are the easiest to exploit.
- Design new benchmarks that involve multiple sources of distribution shift and whose test distributions are not a simple, known transformation (such as an inversion) of the training biases.
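A minimal sketch of the first recommendation follows, with illustrative names and a trivial stand-in model (run_protocol, MajorityAnswerModel, and the data are assumptions, not the paper's code): the model is trained once, tuned on a held-out in-domain validation split rather than the test set, and the same checkpoint is then evaluated both in-domain and OOD.

```python
import random
from collections import Counter

class MajorityAnswerModel:
    """Stand-in for a real VQA model: always predicts the most common training answer."""
    def fit(self, split):
        self.answer = Counter(a for _, a in split).most_common(1)[0][0]
    def predict(self, question):
        return self.answer

def evaluate(model, split):
    return sum(model.predict(q) == a for q, a in split) / len(split)

def run_protocol(model, train_split, ood_test_split, val_fraction=0.1, seed=0):
    # Hold out an in-domain validation set from the training data; use it
    # (not the OOD test set) for hyperparameter tuning, and also report it
    # as the in-domain score of the *same* trained checkpoint.
    shuffled = random.Random(seed).sample(train_split, len(train_split))
    n_val = max(1, int(len(shuffled) * val_fraction))
    val_split, fit_split = shuffled[:n_val], shuffled[n_val:]

    model.fit(fit_split)  # a single training run
    return {
        "in_domain_acc": evaluate(model, val_split),   # same distribution as training
        "ood_acc": evaluate(model, ood_test_split),    # shifted test split
    }

train = [("is it sunny?", "yes")] * 80 + [("is it sunny?", "no")] * 20
ood_test = [("is it sunny?", "no")] * 80 + [("is it sunny?", "yes")] * 20
print(run_protocol(MajorityAnswerModel(), train, ood_test))
# One checkpoint, two directly comparable numbers: high in-domain, low OOD.
```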
Beyond the specific benchmark, these insights have substantial implications for research in AI and machine learning. They call into question the validity of numerous published VQA models that report improvements on VQA-CP, suggesting that some may not be advancing true generalization capabilities.
Overall, the challenges identified underscore the complexity of designing datasets and evaluations that faithfully assess a model's ability to generalize beyond known data distributions. As AI systems are increasingly deployed in diverse real-world applications, ensuring the robustness of these systems to OOD data becomes critically important. This paper thus represents an important reflection on current practices and offers a roadmap for more rigorous and meaningful OOD evaluations in future research.