What Makes ImageNet Look Unlike LAION (2306.15769v2)
Abstract: ImageNet was famously created from Flickr image search results. What if we recreated ImageNet instead by searching the massive LAION dataset based on image captions alone? In this work, we carry out this counterfactual investigation. We find that the resulting ImageNet recreation, which we call LAIONet, looks distinctly unlike the original. Specifically, the intra-class similarity of images in the original ImageNet is dramatically higher than it is for LAIONet. Consequently, models trained on ImageNet perform significantly worse on LAIONet. We propose a rigorous explanation for the discrepancy in terms of a subtle, yet important, difference in two plausible causal data-generating processes for the respective datasets, that we support with systematic experimentation. In a nutshell, searching based on an image caption alone creates an information bottleneck that mitigates the selection bias otherwise present in image-based filtering. Our explanation formalizes a long-held intuition in the community that ImageNet images are stereotypical, unnatural, and overly simple representations of the class category. At the same time, it provides a simple and actionable takeaway for future dataset creation efforts.
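The abstract describes the core recipe: select LAION examples for an ImageNet class by matching the class name against the caption text alone, then compare intra-class image similarity against ImageNet. Below is a minimal sketch of that idea, not the paper's actual pipeline; the record layout, lemma-matching filter, and use of precomputed CLIP image embeddings are illustrative assumptions.

```python
# Sketch (assumed, not the paper's exact code): caption-only selection of
# LAION-style records for one class, plus mean intra-class cosine similarity
# of the selected images' embeddings.
import numpy as np

def select_by_caption(records, class_lemmas):
    """Keep records whose caption mentions any lemma of the target class.
    `records` is assumed to be a list of dicts with 'caption' and 'clip_emb'."""
    lemmas = [lemma.lower() for lemma in class_lemmas]
    return [r for r in records
            if any(lemma in r["caption"].lower() for lemma in lemmas)]

def intra_class_similarity(embeddings):
    """Mean pairwise cosine similarity over distinct pairs of embeddings."""
    X = np.asarray(embeddings, dtype=np.float32)
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize rows
    sims = X @ X.T
    n = len(X)
    return (sims.sum() - n) / (n * (n - 1))  # exclude the diagonal

# Toy usage with random stand-in embeddings:
rng = np.random.default_rng(0)
records = [{"caption": "a tabby cat on a sofa",
            "clip_emb": rng.normal(size=512)} for _ in range(8)]
subset = select_by_caption(records, ["tabby", "tabby cat"])
print(intra_class_similarity([r["clip_emb"] for r in subset]))
```

A higher value of this statistic on ImageNet than on the caption-selected subset is the kind of intra-class similarity gap the abstract refers to.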