Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding (2401.04575v2)

Published 9 Jan 2024 in cs.CV and cs.AI

Abstract: Vision and vision-language applications of neural networks, such as image classification and captioning, rely on large-scale annotated datasets that require non-trivial data-collecting processes. This time-consuming endeavor hinders the emergence of large-scale datasets, limiting researchers and practitioners to a small number of choices. Therefore, we seek more efficient ways to collect and annotate images. Previous initiatives have gathered captions from HTML alt-texts and crawled social media postings, but these data sources suffer from noise, sparsity, or subjectivity. For this reason, we turn to commercial shopping websites whose data meet three criteria: cleanliness, informativeness, and fluency. We introduce the Let's Go Shopping (LGS) dataset, a large-scale public dataset with 15 million image-caption pairs from publicly available e-commerce websites. When compared with existing general-domain datasets, the LGS images focus on the foreground object and have less complex backgrounds. Our experiments on LGS show that the classifiers trained on existing benchmark datasets do not readily generalize to e-commerce data, while specific self-supervised visual feature extractors can better generalize. Furthermore, LGS's high-quality e-commerce-focused images and bimodal nature make it advantageous for vision-language bi-modal tasks: LGS enables image-captioning models to generate richer captions and helps text-to-image generation models achieve e-commerce style transfer.


Summary

  • The paper introduces Let's Go Shopping (LGS), a web-scale dataset of 15 million curated image-caption pairs mined from diverse e-commerce websites to support visual concept understanding.
  • The paper details a data collection pipeline that targets product pages and applies automated quality checks, yielding images and captions that are cleaner and more informative than those in general-domain datasets.
  • Experiments show the dataset benefits domain-specific tasks: classifiers pre-trained on benchmarks like ImageNet generalize poorly to e-commerce data, while training on LGS improves image classification and caption generation.

Introduction

Understanding visual concepts is crucial for progress in computer vision (CV) and natural language processing (NLP). Both fields depend on large-scale datasets, which remain scarce in the public domain because of the complexity and cost of creating them. The "Let's Go Shopping" (LGS) dataset offers an alternative by collecting 15 million high-quality image-caption pairs from publicly available e-commerce websites.

Dataset Collection and Characteristics

The LGS dataset stands out from its contemporaries in several ways. It pools data from thousands of diverse e-commerce websites, yielding a rich mix of product images and descriptions that are cleaner, more detailed, and set against less complex backgrounds than those found in general-domain datasets. These properties make LGS well suited to tasks that demand precise visual understanding and language grounding. During collection, LGS targets product pages specifically and applies rigorous automated checks to filter out low-quality samples, preserving dataset integrity.
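To make the collection step concrete, here is a minimal sketch of an automated quality filter for scraped image-caption pairs. The specific checks and thresholds (minimum image side length, caption word counts) are illustrative assumptions and not the actual filters used to build LGS.

```python
# Hypothetical quality filter for scraped image-caption pairs.
# The thresholds below are illustrative assumptions, not the
# checks actually used to construct LGS.
from dataclasses import dataclass
from PIL import Image


@dataclass
class Pair:
    image_path: str
    caption: str


def passes_quality_checks(pair: Pair,
                          min_side: int = 224,
                          min_caption_words: int = 5,
                          max_caption_words: int = 200) -> bool:
    """Return True if the pair clears simple automated checks."""
    try:
        with Image.open(pair.image_path) as img:
            width, height = img.size
    except OSError:
        return False  # corrupt or unreadable image

    if min(width, height) < min_side:
        return False  # reject tiny thumbnails

    n_words = len(pair.caption.split())
    if not (min_caption_words <= n_words <= max_caption_words):
        return False  # reject empty or boilerplate-length captions

    return True


def filter_pairs(pairs):
    """Keep only the pairs that pass all automated checks."""
    return [p for p in pairs if passes_quality_checks(p)]
```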

Visual and Linguistic Analysis

Examining LGS images and captions reveals distinct characteristics. Images typically focus on the main product against a clean or single-colored background, while captions vary widely in language use and carry high informative value, detailing product specifics. The dataset thus fills a significant gap by providing captions with rich semantics and diverse structure, unlike datasets whose captions are sparse, noisy, or overly simplistic.
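As a rough illustration of how such caption diversity can be quantified, the sketch below computes simple corpus statistics (vocabulary size, average caption length, type-token ratio). It is an illustrative analysis rather than the linguistic pipeline used in the paper, and the sample captions are invented.

```python
# Illustrative caption statistics; a rough sketch, not the paper's
# actual analysis pipeline.
from collections import Counter
from typing import Iterable


def caption_stats(captions: Iterable[str]) -> dict:
    token_counts = Counter()
    n_tokens = 0
    n_captions = 0
    for caption in captions:
        tokens = caption.lower().split()  # naive whitespace tokenization
        token_counts.update(tokens)
        n_tokens += len(tokens)
        n_captions += 1
    return {
        "captions": n_captions,
        "vocabulary_size": len(token_counts),
        "avg_caption_length": n_tokens / max(n_captions, 1),
        "type_token_ratio": len(token_counts) / max(n_tokens, 1),
    }


# Invented example captions in an e-commerce style.
print(caption_stats([
    "Slim-fit cotton shirt with button-down collar and chest pocket",
    "Stainless steel 1.7 L electric kettle with auto shut-off",
]))
```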

Application Performance and Potential

The LGS dataset opens a new avenue for enhancing a variety of models. It has been shown to improve image classification, image reconstruction, and text-to-image generation, particularly in the e-commerce setting. For instance, classifiers trained on LGS outperform ImageNet-pretrained classifiers when applied to e-commerce data, underscoring the dataset's value for domain-specific applications. LGS also enables the generation of attribute-rich image captions and helps adapt existing text-to-image models to e-commerce styles, with promising qualitative and quantitative results.
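A minimal sketch of the kind of domain adaptation described above: replacing the classification head of an ImageNet-pretrained ResNet-50 and fine-tuning it on e-commerce labels. The number of classes, optimizer, and learning rate are placeholder assumptions, not the paper's exact training configuration.

```python
# Sketch of adapting an ImageNet-pretrained classifier to an
# e-commerce label set; hyperparameters are placeholders.
import torch
import torch.nn as nn
from torchvision import models

NUM_ECOMMERCE_CLASSES = 1000  # placeholder; depends on the label taxonomy

# Load ImageNet-pretrained weights and swap the classification head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_ECOMMERCE_CLASSES)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()


def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One fine-tuning step on a batch of e-commerce images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```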

Conclusion

In sum, LGS is a well-structured bimodal dataset that not only provides an extensive collection of image-caption pairs but also encourages the development of models tailored to commercial applications. Its distribution differs enough from existing datasets to offer novel insights into domain-specific visual features, while remaining general enough to support broader advances. As researchers and practitioners build on LGS, it is poised to enrich the ecosystem of publicly accessible visual datasets and spur innovation across vision-language applications.
