Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding (2401.04575v2)
Abstract: Vision and vision-language applications of neural networks, such as image classification and captioning, rely on large-scale annotated datasets that require non-trivial data-collection processes. This time-consuming endeavor hinders the emergence of large-scale datasets, limiting researchers and practitioners to a small number of choices. We therefore seek more efficient ways to collect and annotate images. Previous initiatives have gathered captions from HTML alt-texts and crawled social media posts, but these data sources suffer from noise, sparsity, or subjectivity. For this reason, we turn to commercial shopping websites, whose data meet three criteria: cleanliness, informativeness, and fluency. We introduce the Let's Go Shopping (LGS) dataset, a large-scale public dataset of 15 million image-caption pairs from publicly available e-commerce websites. Compared with existing general-domain datasets, LGS images focus on the foreground object and have less complex backgrounds. Our experiments on LGS show that classifiers trained on existing benchmark datasets do not readily generalize to e-commerce data, while certain self-supervised visual feature extractors generalize better. Furthermore, LGS's high-quality, e-commerce-focused images and bimodal nature make it advantageous for vision-language tasks: LGS enables image-captioning models to generate richer captions and helps text-to-image generation models achieve e-commerce style transfer.