
Are we done with ImageNet?

Published 12 Jun 2020 in cs.CV and cs.LG (arXiv:2006.07159v1)

Abstract: Yes, and no. We ask whether recent progress on the ImageNet classification benchmark continues to represent meaningful generalization, or whether the community has started to overfit to the idiosyncrasies of its labeling procedure. We therefore develop a significantly more robust procedure for collecting human annotations of the ImageNet validation set. Using these new labels, we reassess the accuracy of recently proposed ImageNet classifiers, and find their gains to be substantially smaller than those reported on the original labels. Furthermore, we find the original ImageNet labels to no longer be the best predictors of this independently-collected set, indicating that their usefulness in evaluating vision models may be nearing an end. Nevertheless, we find our annotation procedure to have largely remedied the errors in the original labels, reinforcing ImageNet as a powerful benchmark for future research in visual recognition.

Citations (374)

Summary

  • The paper introduces ReaL labels to reexamine ImageNet's reliability, revealing that models increasingly overfit to flawed labels.
  • It demonstrates a diminishing correlation between traditional and ReaL accuracies as newer models become overly tailored to ImageNet's specifics.
  • The study proposes training-side improvements, namely a sigmoid cross-entropy loss and consensus-based label filtering, to handle multi-object scenes and label noise in image classification.

Evaluation of ImageNet Label Reliability and Its Implications for Progress in Image Classification

The paper "Are We Done with ImageNet?" critically examines the validity and utility of the ImageNet dataset as a benchmark for evaluating advances in image classification. The authors take a nuanced position: progress on ImageNet is real, but part of the recent gains may reflect overfitting to the peculiarities of its labeling procedure rather than genuine generalization.

Critique of ImageNet Labels

The core contribution of the paper is a careful re-evaluation of the ImageNet validation set labels. The authors collect Reassessed Labels (ReaL), which aim to reflect human judgment more faithfully by correcting known biases and noise in the original label set. To do so, a pool of trained classifiers generates candidate labels for each validation image, and human annotators then accept or reject each proposal. This process reveals several critical insights:

  1. Discrepancy in Label Associations: Early models show a strong correlation between ImageNet and ReaL accuracies, but this association weakens for more recent models. The trend suggests increasing overfitting to ImageNet-specific labeling nuances that may not generalize beyond the dataset.
  2. Outperformance of Original Labels: The predictions of recent models agree with the independently collected ReaL labels better than the original ImageNet labels do, indicating that the original label set, in its current form, is losing its value as a reliable benchmark.
  3. Complexity in Multi-Object Scenes: Many ImageNet images contain several prominent objects yet carry a single label, which penalizes models that predict a different but equally valid object. The authors therefore adapt the evaluation metric to accept multiple correct labels per image, as sketched below.
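The resulting ReaL accuracy treats the reassessed labels as a set per image and counts a model's top-1 prediction as correct when it falls anywhere in that set; images left without any valid label after reassessment are excluded. The Python sketch below illustrates the idea; the function name and data layout are assumptions made for exposition, not the authors' released evaluation code.

```python
# Illustrative sketch of a ReaL-style multi-label accuracy.
# Each image may have several valid labels; a top-1 prediction counts
# as correct if it matches any of them.

def real_accuracy(top1_preds, real_labels):
    """top1_preds: dict image_id -> predicted class index.
    real_labels: dict image_id -> set of reassessed (valid) class indices."""
    correct, evaluated = 0, 0
    for image_id, pred in top1_preds.items():
        labels = real_labels.get(image_id, set())
        if not labels:       # no label survived reassessment: skip the image
            continue
        evaluated += 1
        if pred in labels:
            correct += 1
    return correct / max(evaluated, 1)
```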

Analysis of Proposed Improvements

The paper proposes two measures to mitigate the limitations of the original labeling process:

  • Incorporation of a sigmoid cross-entropy loss to allow for non-exclusive predictions, addressing the problem of single-label constraints in multi-object scenes.
  • Utilization of clean label sets by filtering noisy labels through consensus reached by top-performing models, thereby refining training data quality.

Both measures improve performance on the ReaL benchmark, especially when combined. Notably, their benefit is most pronounced under long training schedules, suggesting that label noise had been limiting the returns of prolonged training.
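To make these measures concrete, the PyTorch-style sketch below replaces the usual softmax cross-entropy with a per-class sigmoid cross-entropy, so predictions for different classes no longer compete, and adds a small consensus filter for training labels. The choice of ResNet-50, the hyperparameters, and the keep_if_consensus helper are assumptions for illustration, not the paper's training recipe.

```python
# Minimal sketch (PyTorch): per-class sigmoid cross-entropy plus a
# simple consensus filter for noisy labels. Model, optimizer, and
# thresholds are placeholders, not the paper's actual configuration.
import torch
import torch.nn as nn
from torchvision.models import resnet50

NUM_CLASSES = 1000
model = resnet50(num_classes=NUM_CLASSES)      # stand-in classifier
criterion = nn.BCEWithLogitsLoss()             # sigmoid cross-entropy: one binary problem per class
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def training_step(images, target_indices):
    """images: [B, 3, 224, 224]; target_indices: LongTensor [B] of class ids.
    Single labels are expanded to one-hot targets; the same loss also
    accepts multi-hot targets when several labels per image are known."""
    logits = model(images)                                   # [B, NUM_CLASSES]
    targets = torch.zeros_like(logits)
    targets.scatter_(1, target_indices.unsqueeze(1), 1.0)    # one-hot (or multi-hot)
    loss = criterion(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def keep_if_consensus(sample_label, reference_predictions, min_agree=2):
    """Hypothetical label filter: keep a training example only if at least
    `min_agree` strong reference models also predict its original label."""
    votes = sum(1 for pred in reference_predictions if pred == sample_label)
    return votes >= min_agree
```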

Broader Implications and Future Directions

The ReaL labels provide a more fine-grained and representative evaluation than the original ImageNet labels. Recalibrating the benchmark in this way could drive more meaningful advances in image classification by demanding closer alignment with human judgment. The transition does, however, raise challenges: keeping evaluations comparable across diverse model families and measuring progress on fine-grained classification tasks.

The work marks a critical juncture in the evolution of computer vision benchmarks. As the broader AI community weighs the future utility of ImageNet, it may need to adopt more nuanced, context-sensitive evaluation paradigms, akin to ReaL, that better capture the complexities of real-world image classification.

In conclusion, the authors make a compelling case for revisiting and refining the benchmarks by which machine learning models are assessed. This re-evaluation supports the ongoing effort to build models that better match human perceptual judgments and points to significant directions for future work in visual recognition.
