
Do We Train on Test Data? Purging CIFAR of Near-Duplicates (1902.00423v2)

Published 1 Feb 2019 in cs.CV

Abstract: The CIFAR-10 and CIFAR-100 datasets are two of the most heavily benchmarked datasets in computer vision and are often used to evaluate novel methods and model architectures in the field of deep learning. However, we find that 3.3% and 10% of the images from the test sets of these datasets have duplicates in the training set. These duplicates are easily recognizable by memorization and may, hence, bias the comparison of image recognition techniques regarding their generalization capability. To eliminate this bias, we provide the "fair CIFAR" (ciFAIR) dataset, where we replaced all duplicates in the test sets with new images sampled from the same domain. We then re-evaluate the classification performance of various popular state-of-the-art CNN architectures on these new test sets to investigate whether recent research has overfitted to memorizing data instead of learning abstract concepts. We find a significant drop in classification accuracy of between 9% and 14% relative to the original performance on the duplicate-free test set. The ciFAIR dataset and pre-trained models are available at https://cvjena.github.io/cifair/, where we also maintain a leaderboard.

Citations (89)

Summary

  • The paper reveals that 3.3% of CIFAR-10 and 10% of CIFAR-100 test images duplicate training samples.
  • The paper introduces the ciFAIR datasets that replace duplicate images with new ones from the Tiny Images dataset to prevent bias.
  • The paper demonstrates that model accuracy drops by up to 14% on ciFAIR, highlighting the impact of duplicate-induced memorization.

Analysis of Duplicate Influence in CIFAR Test Sets

The paper entitled "Do we train on test data? Purging CIFAR of near-duplicates" offers a critical examination of the CIFAR-10 and CIFAR-100 datasets, which are prominent benchmarks in computer vision. It investigates the prevalence of duplicate images across the training and test sets of these datasets, a factor that can skew the apparent generalization performance of image recognition models.

Investigation of Duplicates

Within the paper, the authors identify a significant proportion of duplicate images: 3.3% of the CIFAR-10 test set and a striking 10% of the CIFAR-100 test set have duplicates in the training data. The authors distinguish between exact duplicates, near-duplicates, and very similar images; all three kinds can bias model evaluations, since they reward memorization rather than genuine abstract learning.
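
The paper defines its own criteria for identifying and grading these duplicates; the sketch below only illustrates the general idea of a cross-split nearest-neighbor search, flagging for each test image its closest training image so candidate pairs can be reviewed manually. The pixel-space distance and the cutoff are illustrative assumptions, not the authors' procedure.

```python
# Minimal sketch of a cross-split near-duplicate search (illustrative only,
# not the authors' exact procedure): for every test image, find the closest
# training image in raw pixel space and keep the pair as a candidate for
# manual review.
import numpy as np

def nearest_train_neighbors(test_x, train_x, batch_size=256):
    """For each test image, return the index of and squared L2 distance to
    its closest training image. Inputs: float arrays of shape (N, H*W*C)."""
    nn_idx = np.empty(len(test_x), dtype=np.int64)
    nn_dist = np.empty(len(test_x), dtype=np.float64)
    train_sq = (train_x ** 2).sum(axis=1)  # precompute ||t||^2 per training image
    for start in range(0, len(test_x), batch_size):
        chunk = test_x[start:start + batch_size]
        # squared distances via ||q||^2 - 2 q.t + ||t||^2
        d = (chunk ** 2).sum(axis=1, keepdims=True) - 2.0 * chunk @ train_x.T + train_sq
        nn_idx[start:start + batch_size] = d.argmin(axis=1)
        nn_dist[start:start + batch_size] = d.min(axis=1)
    return nn_idx, nn_dist

# Example usage (arrays would come from the CIFAR training/test splits):
# idx, dist = nearest_train_neighbors(test_images, train_images)
# candidates = np.argsort(dist)[:500]  # most suspicious pairs, to inspect by hand
```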

To counteract this bias, the researchers introduce the ciFAIR datasets, which replace the duplicate test images with new ones sampled from the Tiny Images dataset, the original source of CIFAR. This adjustment provides a cleaner test bed for evaluating model performance without the confounding effect of memorization.

Experimental Reevaluation

The paper re-evaluates modern CNN architectures on both the original CIFAR test sets and the duplicate-free ciFAIR test sets to assess the impact of the duplicates. All models lose accuracy on ciFAIR, with a drop of between 9% and 14% relative to their original performance. Notably, however, the relative ranking of the models remains essentially unchanged, indicating that comparative evaluations are less distorted by the duplicates than the absolute numbers are.
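
As a rough illustration of this kind of re-evaluation, the following sketch scores an already-trained classifier on two test loaders and compares the results. The names `model`, `original_test_loader`, and `cifair_test_loader` are placeholders, and the relative-drop formula at the end is just one way to express the gap, not necessarily the paper's exact metric.

```python
# Minimal re-evaluation sketch: score the same trained classifier on the
# original test set and on a duplicate-free version, then compare.
import torch

@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    """Top-1 accuracy of `model` over a PyTorch DataLoader yielding (images, labels)."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        logits = model(images.to(device))
        correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total

# acc_orig = accuracy(model, original_test_loader)
# acc_fair = accuracy(model, cifair_test_loader)
# One way (among others) to express the gap relative to the original score:
# rel_drop = (acc_orig - acc_fair) / acc_orig
```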

Implications and Future Directions

The paper draws attention to the ramifications of duplicate images and the consequent risk of inflated performance metrics driven by data memorization rather than genuine learning. This observation is particularly pertinent given the high capacity of modern neural networks, which can memorize large portions of their training data, duplicates included.

Practically, the insights from this paper serve as a cautionary note for researchers who rely on the CIFAR datasets for model validation and comparison. Switching to ciFAIR yields a more accurate gauge of generalization capability and reduces the risk of rewarding models that have merely memorized training-set nuances.
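
If regenerating predictions on ciFAIR is inconvenient, a rough approximation is to re-score existing predictions on the original test set while excluding the flagged duplicates. Note that this only drops duplicates rather than replacing them as ciFAIR does, so it approximates rather than reproduces the ciFAIR protocol; `predictions`, `test_labels`, and `duplicate_indices` below are hypothetical names for data you would supply yourself.

```python
# Rough approximation of a duplicate-free evaluation: drop the flagged test
# images instead of replacing them (ciFAIR replaces them, so this is only
# an approximation of its protocol).
import numpy as np

def accuracy_without_duplicates(predictions, labels, duplicate_indices):
    """Accuracy over test images that have no duplicate in the training set."""
    keep = np.ones(len(labels), dtype=bool)
    keep[np.asarray(duplicate_indices, dtype=int)] = False
    return float((predictions[keep] == labels[keep]).mean())

# acc_clean = accuracy_without_duplicates(predictions, test_labels, duplicate_indices)
```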

Theoretically, the paper opens avenues to explore duplicate detection and mitigation strategies more broadly in dataset preparation. Ensuring future datasets have more rigorous filtering protocols could safeguard against similar pitfalls and further advance deep learning methodologies.

Additionally, the integration of ciFAIR datasets in the community, along with an open leaderboard, underscores a move towards transparency and reproducibility in AI model evaluation, with a call to researchers to contribute their findings accompanied by pre-trained models for verification purposes.

The work highlights how subtle dataset flaws can distort benchmark results, and it promotes best practices in dataset preparation that support a more robust understanding of models and their deployment in real-world scenarios.
