A Decade's Battle on Dataset Bias: Are We There Yet?

Published 13 Mar 2024 in cs.CV and cs.LG (arXiv:2403.08632v2)

Abstract: We revisit the "dataset classification" experiment suggested by Torralba & Efros (2011) a decade ago, in the new era with large-scale, diverse, and hopefully less biased datasets as well as more capable neural network architectures. Surprisingly, we observe that modern neural networks can achieve excellent accuracy in classifying which dataset an image is from: e.g., we report 84.7% accuracy on held-out validation data for the three-way classification problem consisting of the YFCC, CC, and DataComp datasets. Our further experiments show that such a dataset classifier could learn semantic features that are generalizable and transferable, which cannot be explained by memorization. We hope our discovery will inspire the community to rethink issues involving dataset bias.

References (69)
  1. Common Crawl. https://commoncrawl.org.
  2. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  3. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, 2018.
  4. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.
  5. An empirical study of training self-supervised Vision Transformers. In ICCV, 2021.
  6. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  7. Randaugment: Practical automated data augmentation with a reduced search space. In CVPR Workshops, 2020.
  8. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  9. RedCaps: Web-curated image-text data created by the people, for the people. In NeurIPS Datasets and Benchmarks Track, 2021.
  10. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
  11. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  12. Identifying statistical bias in dataset replication. In ICML, 2020.
  13. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010.
  14. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In CVPR workshops, 2004.
  15. Datacomp: In search of the next generation of multimodal datasets. In NeurIPS Datasets and Benchmarks Track, 2023.
  16. Domain-adversarial training of neural networks. JMLR, 2016.
  17. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  18. Google People + AI Research. Know your data. 2021.
  19. Masked autoencoders are scalable vision learners. In CVPR, 2022.
  20. Deep residual learning for image recognition. In CVPR, 2016.
  21. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2018.
  22. Natural adversarial examples. In CVPR, 2021.
  23. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  24. Andrej Karpathy. What I learned from competing against a ConvNet on ImageNet. https://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/, 2014. Accessed: October 21, 2023.
  25. Wilds: A benchmark of in-the-wild distribution shifts. In ICML, 2021.
  26. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012.
  27. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
  28. Mseg: A composite dataset for multi-domain semantic segmentation. In CVPR, 2020.
  29. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  30. Microsoft coco: Common objects in context. In ECCV, 2014.
  31. A convnet for the 2020s. In CVPR, 2022.
  32. David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
  33. Quality not quantity: On the interaction between dataset design and robustness of clip. NeurIPS, 2022.
  34. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
  35. Cats and dogs. In CVPR, 2012.
  36. Large image datasets: A pyrrhic win for computer vision? arXiv preprint arXiv:2006.16923, 2020.
  37. Learning transferable visual models from natural language supervision. In ICML, 2021.
  38. Do imagenet classifiers generalize to imagenet? In ICML, 2019.
  39. Lawrence Gilman Roberts. Machine Perception of Three-Dimensional Solids. PhD thesis, Massachusetts Institute of Technology, 1963.
  40. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
  41. William F Schreiber. Image processing for quality improvement. Proceedings of the IEEE, 1978.
  42. LAION-5b: An open large-scale dataset for training next generation image-text models. In NeurIPS Datasets and Benchmarks Track, 2022.
  43. No classification without representation: Assessing geodiversity issues in open data sets for the developing world. arXiv preprint arXiv:1711.08536, 2017.
  44. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  45. Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. In International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021.
  46. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
  47. Test-time training with self-supervision for generalization under distribution shifts. In ICML, 2020.
  48. Going deeper with convolutions. In CVPR, 2015.
  49. YFCC100M: The new data in multimedia research. Communications of the ACM, 2016.
  50. A deeper look at dataset bias. arXiv preprint arXiv:1505.01257, 2015.
  51. Unbiased look at dataset bias. In CVPR, 2011.
  52. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.
  53. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  54. CWJ van Miltenburg. Stereotyping and bias in the flickr30k dataset. In Workshop on multimodal corpora: computer vision and language processing, 2016.
  55. Attention is all you need. In NeurIPS, 2017.
  56. Efficient additive kernels via explicit feature maps. IEEE transactions on pattern analysis and machine intelligence, 2012.
  57. Hermann Von Helmholtz. Optique physiologique. Masson, 1867.
  58. REVISE: A tool for measuring and mitigating bias in visual datasets. IJCV, 2022.
  59. Tent: Fully test-time adaptation by entropy minimization. In ICLR, 2021.
  60. Towards fairness in visual recognition: Effective strategies for bias mitigation. In CVPR, 2020.
  61. Towards fairer datasets: Filtering and balancing the distribution of the people subtree in the imagenet hierarchy. In Conference on fairness, accountability and transparency, 2020.
  62. How transferable are features in deep neural networks? arXiv preprint arXiv:1411.1792, 2014.
  63. Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.
  64. Wilddash-creating hazard-aware benchmarks. In ECCV, 2018.
  65. Mitigating unwanted biases with adversarial learning. arXiv preprint arXiv:1801.07593, 2018.
  66. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
  67. mixup: Beyond empirical risk minimization. In ICLR, 2018.
  68. Understanding and evaluating racial biases in image captioning. In ICCV, 2021.
  69. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 2017.

Summary

  • The paper demonstrates that modern neural networks excel at dataset classification, revealing persistent biases in large-scale datasets.
  • It employs rigorous experiments including controlled corruptions and pseudo-dataset tasks to distinguish generalization from mere memorization.
  • The study finds that even self-supervised models capture intrinsic biases, emphasizing the ongoing challenge of achieving truly unbiased AI.

Revisiting Dataset Bias: Surprising Findings from Modern Neural Networks

Introduction to Dataset Bias

Over the past decade, the computer vision community has made substantial strides in addressing dataset bias, a critical issue highlighted by Torralba and Efros in their seminal 2011 paper. The concern was that datasets might inadvertently capture and perpetuate the biases present at the time of their creation, undermining the generalizability and fairness of models trained on them. Since then, it was hoped that the advent of deep learning and the construction of larger, more diverse datasets would mitigate these issues. This overview revisits those assumptions through the paper's experiments with modern neural networks, shedding new light on the persistent nature of dataset bias.

Study Design and Findings

The investigation centers on the concept of "dataset classification," a task designed to test whether a model can identify the originating dataset of an unseen image based on learned biases. This setup extends the "Name That Dataset" challenge proposed by Torralba and Efros, using images from contemporary large-scale datasets presumed to be diverse and representative, such as YFCC, CC, and DataComp.
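
To make the setup concrete, here is a minimal sketch of how such a dataset classification task could be constructed in PyTorch. The folder paths, data loading details, and model choice are illustrative assumptions rather than the authors' released code; the essential idea is simply that each image is labeled by the dataset it was drawn from.

```python
import glob
from PIL import Image
from torch.utils.data import Dataset, ConcatDataset, DataLoader
from torchvision import transforms

class DatasetOriginData(Dataset):
    """Returns (image, dataset_id) pairs: the label is which source dataset the image came from."""
    def __init__(self, image_dir, dataset_id, transform):
        self.paths = sorted(glob.glob(f"{image_dir}/*.jpg"))  # hypothetical flat folder of images
        self.dataset_id = dataset_id
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        image = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(image), self.dataset_id

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

# Hypothetical local folders holding images sampled from each source dataset.
sources = ["data/yfcc", "data/cc", "data/datacomp"]
train_set = ConcatDataset([DatasetOriginData(d, i, transform) for i, d in enumerate(sources)])
loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)
# Any standard image classifier with a 3-way output head (e.g., a ConvNeXt or ViT)
# can then be trained with cross-entropy on these (image, dataset_id) batches.
```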

Surprisingly, the results reveal that modern neural networks, across architectures and model sizes, achieve excellent accuracy on the dataset classification task, far surpassing the 33% chance level for the three-way case; the paper reports 84.7% held-out accuracy on the YFCC, CC, and DataComp combination, for example. This phenomenon is robust across different dataset combinations and persists despite attempts to suppress low-level image signatures (e.g., compression artifacts, color statistics) through controlled image corruptions. The implication is clear: despite advances in dataset creation and model design, neural networks continue to capture and exploit dataset-specific biases.
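
The controlled-corruption experiments can be illustrated with standard image transforms. The specific corruptions and parameters below are assumptions chosen for illustration, not the paper's exact settings; the takeaway is that if dataset classification accuracy stays far above chance after such corruptions, the classifier cannot be relying only on low-level artifacts.

```python
from torchvision import transforms

corruptions = {
    # Remove color statistics that may differ across datasets.
    "grayscale": transforms.Grayscale(num_output_channels=3),
    # Blur away high-frequency signatures such as compression artifacts.
    "blur": transforms.GaussianBlur(kernel_size=9, sigma=3.0),
    # Destroy fine detail by passing through a very low resolution.
    "low_res": transforms.Compose([
        transforms.Resize(32),
        transforms.Resize(224),
    ]),
}

def corrupted_transform(name):
    """Training transform with one corruption inserted before tensor conversion."""
    return transforms.Compose([
        transforms.RandomResizedCrop(224),
        corruptions[name],
        transforms.ToTensor(),
    ])

# Re-training the dataset classifier with each corrupted_transform(...) and observing
# accuracy that remains well above chance suggests the bias is not purely low-level.
```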

Analyzing Neural Behaviors

Delving further, the study asks whether these models are learning generalizable patterns or merely memorizing dataset-specific cues. A comparison with "pseudo-dataset" classification, where the classes are subsets randomly sampled from a single dataset and are therefore genuinely unbiased, is instructive: networks fail to solve the pseudo-dataset task beyond chance on held-out data, yet succeed handily on real datasets. This contrast indicates that the networks learn generalizable features from dataset bias rather than memorizing individual images, and it highlights both the presence of exploitable biases in "real" datasets and the networks' ability to capitalize on them.
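
A short sketch makes the pseudo-dataset control concrete. The splitting helper below is an illustrative assumption rather than the paper's exact protocol: because the k pseudo-datasets are random, identically distributed subsets of one dataset, no classifier can exceed roughly 1/k accuracy on held-out data, so any above-chance accuracy on real datasets must come from genuine inter-dataset differences.

```python
import random

def make_pseudo_datasets(image_paths, k=3, seed=0):
    """Randomly assign each image of a single source dataset to one of k pseudo-datasets."""
    rng = random.Random(seed)
    paths = list(image_paths)
    rng.shuffle(paths)
    return [paths[i::k] for i in range(k)]  # k disjoint, i.i.d. subsets of the same dataset

# Training the same k-way classifier on (image, pseudo_dataset_id) pairs can only reach
# ~1/k accuracy on held-out data, since the splits share one distribution; success on
# real datasets therefore reflects dataset-specific bias rather than memorization.
```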

Additionally, the investigation extends to self-supervised learning. Models pre-trained without any dataset identity labels, when later adapted with minimal supervision for dataset classification, still achieve remarkable accuracy. This underscores how deeply dataset biases are ingrained in learned representations, a point further substantiated by the transferability of the dataset classifier's features, which improve performance on unrelated semantic classification tasks.
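
A common recipe for testing whether bias is baked into self-supervised representations is linear probing on frozen features, sketched below. The backbone, checkpoint, and hyperparameters are placeholders, not the paper's exact configuration; the point is that only a linear head is trained on dataset-identity labels, so any accuracy it achieves must already be encoded in the frozen features.

```python
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50(weights=None)  # placeholder; load SSL (e.g., MAE/MoCo-style) weights here
backbone.fc = nn.Identity()                            # expose the 2048-d feature vector
for p in backbone.parameters():
    p.requires_grad = False                            # features stay fixed

probe = nn.Linear(2048, 3)                             # 3-way dataset classifier head
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def probe_step(images, dataset_ids):
    """One linear-probing step: only the linear head is updated."""
    with torch.no_grad():
        feats = backbone(images)
    loss = criterion(probe(feats), dataset_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```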

Human Benchmarking

A user study involving machine learning practitioners and researchers offers an instructive contrast to the network results. When asked to perform the same dataset classification task, human accuracy hovers around 45%, above the 33% chance level for the three-way setting but far below the networks' accuracy, indicating that while some dataset biases are perceptible to humans, neural networks are far better at exploiting them for classification.

Concluding Thoughts

The persistence and exploitability of dataset biases, as this study shows, pose crucial questions for the AI research community. While large-scale, diverse datasets were expected to offer a path toward mitigating bias, this research suggests that biases, in some form, remain unavoidable. The ability of modern neural networks to discern these biases, sometimes beyond human recognition, highlights the nuanced challenges in building truly unbiased AI systems. Future efforts should focus not only on dataset creation but also on developing models and algorithms that are inherently resistant to, or can correct for, dataset biases. The study is a stark reminder of the complexities of striving for fairness and generalizability in AI, and it points to an ongoing journey in understanding and mitigating dataset bias.

GitHub

  1. GitHub - liuzhuang13/bias (113 stars)  
