Do ImageNet Classifiers Generalize to ImageNet? (1902.10811v2)

Published 13 Feb 2019 in cs.CV, cs.LG, and stat.ML

Abstract: We build new test sets for the CIFAR-10 and ImageNet datasets. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively re-used test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3% - 15% on CIFAR-10 and 11% - 14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to slightly "harder" images than those found in the original test sets.

Citations (1,523)

Summary

  • The paper demonstrates that state-of-the-art ImageNet classifiers experience significant accuracy drops (11%-14% top-1) on new test sets.
  • It replicates the original dataset construction processes, showing that even minor differences in data collection lead to measurable changes in performance.
  • Despite lower accuracies, the relative ranking of models remains consistent, underscoring the robustness of comparative evaluations.

An Analysis of the Generalization of ImageNet Classifiers to New ImageNet Test Sets

The paper "Do ImageNet Classifiers Generalize to ImageNet?" investigates the generalization capabilities of ImageNet classification models by constructing new test sets for CIFAR-10 and ImageNet. Given the pervasive reuse of these datasets over years of research, there is an inherent risk of overfitting to the existing test sets. The research replicates the original dataset creation process to determine the extent to which current models generalize to new, slightly different data.

Methodology and Experimental Setup

The authors constructed new test sets for CIFAR-10 and ImageNet by closely following the original procedures used to create these datasets. For CIFAR-10, new images from the Tiny Images dataset were selected following the class keyword distribution of CIFAR-10. This process involved manual selection by a researcher who replicated the role of the original dataset creators. For ImageNet, a similar approach was taken, with images gathered from Flickr based on historical upload dates and subsequently annotated by Amazon Mechanical Turk (MTurk) workers using interfaces closely mimicking the original ImageNet annotation tasks.
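
One of the sampling strategies described in the paper matches the MTurk selection-frequency distribution of the original validation images: within each class, new candidate images are drawn so that their binned selection frequencies mirror those of the original test images. The sketch below illustrates that idea only; the function names, data structures, bin edges, and per-class quota are hypothetical assumptions, not the authors' released code.

```python
# Illustrative sketch of "matched frequency" sampling (hypothetical data structures,
# not the authors' released code): per class, draw new candidate images so that their
# MTurk selection-frequency histogram matches the original test images' histogram.
import random
from collections import defaultdict

BIN_EDGES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.01]  # assumed selection-frequency bins

def bin_index(freq):
    """Return the index of the frequency bin that `freq` falls into."""
    for i in range(len(BIN_EDGES) - 1):
        if BIN_EDGES[i] <= freq < BIN_EDGES[i + 1]:
            return i
    return len(BIN_EDGES) - 2

def sample_matched_frequency(candidates, original_freqs, per_class=10, seed=0):
    """candidates: {class_id: [(image_id, selection_freq), ...]} for new images.
    original_freqs: {class_id: [selection_freq, ...]} for the original test images.
    Returns {class_id: [image_id, ...]} sampled to match the original bin histogram."""
    rng = random.Random(seed)
    new_test_set = {}
    for cls, orig in original_freqs.items():
        # Target histogram: how many original images fall into each frequency bin.
        target = defaultdict(int)
        for f in orig:
            target[bin_index(f)] += 1
        # Scale the histogram down to the number of new images wanted per class.
        total = sum(target.values())
        quota = {b: round(per_class * c / total) for b, c in target.items()}
        # Group candidate images by bin and draw without replacement.
        by_bin = defaultdict(list)
        for img, f in candidates.get(cls, []):
            by_bin[bin_index(f)].append(img)
        chosen = []
        for b, k in quota.items():
            pool = by_bin[b]
            rng.shuffle(pool)
            chosen.extend(pool[:k])
        new_test_set[cls] = chosen
    return new_test_set
```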

Results Overview

The evaluation of a broad range of classification models, spanning a decade of advances in machine learning, on these new test sets yielded several key findings:

  1. Accuracy Drops:
    • Models experienced significant accuracy drops on the new test sets: 3% to 15% on CIFAR-10 and 11% to 14% (top-1) on ImageNet.
    • The most accurate ImageNet model evaluated, pnasnet_large_tf, dropped from 82.9% to 72.2% top-1 accuracy and from roughly 96% to 90% top-5 accuracy.
  2. Model Rankings:
    • Despite the accuracy drops, the relative ranking of models was largely preserved: models that performed best on the original test sets also performed best on the new ones, so comparative evaluations remain meaningful.
  3. Linear Relationship:
    • The relationship between the original and new test set accuracies followed a roughly linear trend, with improvements on the original test set translating to larger improvements on the new test set in terms of percentage points.
    • Specifically, a slope of approximately 1.7 was observed on CIFAR-10 and about 1.1 on ImageNet, indicating that robustness to the new data improved with increasing model accuracy; a minimal sketch of fitting such a trend appears after this list.
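
To make the linear trend concrete, the sketch below fits a least-squares line to (original accuracy, new accuracy) pairs. The accuracy values are made-up numbers chosen to lie near the paper's reported ImageNet slope of about 1.1; they are not measurements from the paper.

```python
# Minimal sketch (made-up accuracy pairs, not the paper's data) of fitting the
# linear trend between original and new test-set accuracy.
import numpy as np

# (original accuracy, new accuracy) in percent for a handful of hypothetical models.
orig_acc = np.array([70.0, 74.0, 77.0, 80.0, 82.9])
new_acc = np.array([58.0, 62.4, 65.7, 69.0, 72.2])

# Least-squares fit: new = slope * orig + intercept.
slope, intercept = np.polyfit(orig_acc, new_acc, deg=1)
print(f"fitted slope: {slope:.2f}, intercept: {intercept:.1f}")
# A slope above 1 means each percentage point gained on the original test set
# corresponds to more than one point gained on the new test set.
```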

Hypotheses for Accuracy Drops

Several hypotheses were explored to explain the accuracy decline, including:

  1. Statistical Error:
    • Given the sample sizes of the new test sets, statistical fluctuations alone could not account for the observed drops in accuracy; a rough confidence-interval check appears after this list.
  2. Near-Duplicate Removal:
    • The stringent removal of near-duplicates might explain up to 1% of the accuracy drop, which is insufficient to fully account for the observed declines.
  3. Differences in Dataset Construction:
    • Small variations in human annotations and selection frequencies significantly impacted accuracies, as observed from the MTurk experiments.
  4. Adaptivity and Distribution Gaps:
    • The adaptivity gap, wherein models become implicitly tuned to a specific test set through repeated evaluation, was expected to play a role. However, the paper observed no diminishing returns for newer, more heavily tuned models on the new test sets, discounting adaptivity as the main cause.
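
The statistical-error argument can be sanity-checked with a quick back-of-the-envelope calculation. The test-set size and accuracy below are illustrative assumptions rather than the paper's exact figures; the point is that sampling error at this scale is well under one percentage point.

```python
# Rough check (assumed test-set size and accuracy) that sampling error alone cannot
# explain an 11-14 point accuracy drop. For n images and observed accuracy p, a 95%
# normal-approximation confidence interval has half-width 1.96 * sqrt(p * (1 - p) / n).
import math

def accuracy_ci_halfwidth(p, n, z=1.96):
    """Half-width of the normal-approximation confidence interval for accuracy p on n samples."""
    return z * math.sqrt(p * (1.0 - p) / n)

n = 10_000   # assumed size of a new test set
p = 0.77     # example top-1 accuracy
hw = accuracy_ci_halfwidth(p, n)
print(f"95% CI: {p:.3f} +/- {hw:.3f}")  # roughly +/- 0.008, i.e. under 1 percentage point
# An 11-14 point drop is an order of magnitude larger than this sampling error.
```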

Implications and Future Directions

The results of this paper have profound implications for the evaluation and development of machine learning models:

  • Dataset Reliability:
    • Current ImageNet classifiers exhibit a surprising sensitivity to minor discrepancies in dataset construction, implying that common assumptions about model robustness need to be revisited.
  • Need for Diverse Test Sets:
    • More diverse and possibly multiple test sets may be necessary to fully evaluate the capabilities of models and avoid overfitting to a single, specific dataset.
  • Robust Learning Algorithms:
    • There is a pressing need to develop learning algorithms that are more resilient to subtle distribution shifts and hence capable of true generalization.
  • Human Baselines:
    • A better understanding of human annotation accuracy and behavior can aid in improving dataset quality and in ensuring class definitions are comprehensive and unambiguous.

Conclusion

The findings from this research caution against over-relying on a single benchmark test set when evaluating machine learning models. As models reach higher accuracy levels, ensuring that they generalize to unseen, slightly varied data becomes paramount. The paper is a call to the community to adopt more stringent and varied evaluations, and to improve dataset documentation and annotation transparency in future benchmarks so that they keep pace with increasing model sophistication and rising reliability expectations.