Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding (2401.04575v2)

Published 9 Jan 2024 in cs.CV and cs.AI

Abstract: Vision and vision-language applications of neural networks, such as image classification and captioning, rely on large-scale annotated datasets that require non-trivial data-collecting processes. This time-consuming endeavor hinders the emergence of large-scale datasets, limiting researchers and practitioners to a small number of choices. Therefore, we seek more efficient ways to collect and annotate images. Previous initiatives have gathered captions from HTML alt-texts and crawled social media postings, but these data sources suffer from noise, sparsity, or subjectivity. For this reason, we turn to commercial shopping websites whose data meet three criteria: cleanliness, informativeness, and fluency. We introduce the Let's Go Shopping (LGS) dataset, a large-scale public dataset with 15 million image-caption pairs from publicly available e-commerce websites. When compared with existing general-domain datasets, the LGS images focus on the foreground object and have less complex backgrounds. Our experiments on LGS show that the classifiers trained on existing benchmark datasets do not readily generalize to e-commerce data, while specific self-supervised visual feature extractors can better generalize. Furthermore, LGS's high-quality e-commerce-focused images and bimodal nature make it advantageous for vision-language bi-modal tasks: LGS enables image-captioning models to generate richer captions and helps text-to-image generation models achieve e-commerce style transfer.


Summary

  • The paper introduces Let's Go Shopping (LGS), a web-scale dataset of 15 million curated image-caption pairs mined from diverse e-commerce websites to support visual concept understanding.
  • The paper details a data collection pipeline that targets product pages and applies automated quality checks, yielding images and captions that are cleaner and more informative than those in general-domain datasets.
  • Experiments show the dataset benefits domain-specific tasks: classifiers pre-trained on benchmarks like ImageNet generalize poorly to e-commerce data, while training on LGS improves image classification and caption generation.

Introduction

Understanding visual concepts is crucial for progress in computer vision (CV) and natural language processing (NLP). Both fields depend on large-scale datasets, which remain scarce in the public domain because of the complexity and cost of creating them. The "Let's Go Shopping" (LGS) dataset offers an alternative by collecting 15 million high-quality image-caption pairs from publicly available e-commerce websites.

Dataset Collection and Characteristics

The LGS dataset stands out from its contemporaries in several ways. It pools data from thousands of diverse e-commerce websites, yielding a rich mix of product images and descriptions that are cleaner, more detailed, and set against less complex backgrounds than those found in general-domain datasets. These properties make LGS well suited to tasks that demand precise visual understanding and language grounding. During collection, LGS targets product pages specifically and applies rigorous automated checks to filter out low-quality samples, preserving dataset integrity.
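To make the collection step concrete, here is a minimal sketch of an automated quality filter for scraped image-caption pairs. The specific checks and thresholds (minimum image side length, caption word counts) are illustrative assumptions and not the actual filters used to build LGS.

```python
# Hypothetical quality filter for scraped image-caption pairs.
# The thresholds below are illustrative assumptions, not the
# checks actually used to construct LGS.
from dataclasses import dataclass
from PIL import Image


@dataclass
class Pair:
    image_path: str
    caption: str


def passes_quality_checks(pair: Pair,
                          min_side: int = 224,
                          min_caption_words: int = 5,
                          max_caption_words: int = 200) -> bool:
    """Return True if the pair clears simple automated checks."""
    try:
        with Image.open(pair.image_path) as img:
            width, height = img.size
    except OSError:
        return False  # corrupt or unreadable image

    if min(width, height) < min_side:
        return False  # reject tiny thumbnails

    n_words = len(pair.caption.split())
    if not (min_caption_words <= n_words <= max_caption_words):
        return False  # reject empty or boilerplate-length captions

    return True


def filter_pairs(pairs):
    """Keep only the pairs that pass all automated checks."""
    return [p for p in pairs if passes_quality_checks(p)]
```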

Visual and Linguistic Analysis

Examining LGS images and captions reveals distinct characteristics. Images typically focus on the main product against a clean or single-colored background, while captions vary widely in language use and carry high informative value, detailing product specifics. The dataset thus fills a significant gap by providing captions with rich semantics and diverse structure, unlike datasets whose captions are sparse, noisy, or overly simplistic.
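As a rough illustration of how such caption diversity can be quantified, the sketch below computes simple corpus statistics (vocabulary size, average caption length, type-token ratio). It is an illustrative analysis rather than the linguistic pipeline used in the paper, and the sample captions are invented.

```python
# Illustrative caption statistics; a rough sketch, not the paper's
# actual analysis pipeline.
from collections import Counter
from typing import Iterable


def caption_stats(captions: Iterable[str]) -> dict:
    token_counts = Counter()
    n_tokens = 0
    n_captions = 0
    for caption in captions:
        tokens = caption.lower().split()  # naive whitespace tokenization
        token_counts.update(tokens)
        n_tokens += len(tokens)
        n_captions += 1
    return {
        "captions": n_captions,
        "vocabulary_size": len(token_counts),
        "avg_caption_length": n_tokens / max(n_captions, 1),
        "type_token_ratio": len(token_counts) / max(n_tokens, 1),
    }


# Invented example captions in an e-commerce style.
print(caption_stats([
    "Slim-fit cotton shirt with button-down collar and chest pocket",
    "Stainless steel 1.7 L electric kettle with auto shut-off",
]))
```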

Application Performance and Potential

The LGS dataset opens a new avenue for enhancing a variety of models. It has been shown to improve image classification, image reconstruction, and text-to-image generation, particularly in the e-commerce setting. For instance, classifiers trained on LGS outperform ImageNet-pretrained classifiers when applied to e-commerce data, underscoring the dataset's value for domain-specific applications. LGS also enables the generation of attribute-rich image captions and helps adapt existing text-to-image models to e-commerce styles, with promising qualitative and quantitative results.
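A minimal sketch of the kind of domain adaptation described above: replacing the classification head of an ImageNet-pretrained ResNet-50 and fine-tuning it on e-commerce labels. The number of classes, optimizer, and learning rate are placeholder assumptions, not the paper's exact training configuration.

```python
# Sketch of adapting an ImageNet-pretrained classifier to an
# e-commerce label set; hyperparameters are placeholders.
import torch
import torch.nn as nn
from torchvision import models

NUM_ECOMMERCE_CLASSES = 1000  # placeholder; depends on the label taxonomy

# Load ImageNet-pretrained weights and swap the classification head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_ECOMMERCE_CLASSES)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()


def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One fine-tuning step on a batch of e-commerce images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```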

Conclusion

In sum, LGS is a well-structured bimodal dataset that not only provides an extensive collection of image-caption pairs but also encourages the development of models tailored to commercial applications. Its distribution differs enough from existing datasets to offer novel insights into domain-specific visual features, while remaining general enough to support broader advances. As researchers and practitioners build on LGS, it is poised to enrich the ecosystem of publicly accessible visual datasets and spur innovation across vision-language applications.
