LAION-5B: An open large-scale dataset for training next generation image-text models

Published 16 Oct 2022 in cs.CV, cs.AI, and cs.LG | (2210.08402v1)

Abstract: Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of training on large amounts of noisy image-text data, without relying on expensive accurate labels used in standard vision unimodal supervised learning. The resulting models showed capabilities of strong text-guided image generation and transfer to downstream tasks, while performing remarkably at zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale language-vision models like ALIGN, BASIC, GLIDE, Flamingo and Imagen made further improvements. Studying the training and capabilities of such models requires datasets containing billions of image-text pairs. Until now, no datasets of this size have been made openly available for the broader research community. To address this problem and democratize research on large-scale multi-modal models, we present LAION-5B - a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language. We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset, and discuss further experiments enabled with an openly available dataset of this scale. Additionally we provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation, and detection scores for watermark, NSFW, and toxic content detection. Announcement page https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/


Authors (16)


Citations (2,646)


Summary

The paper introduces LAION-5B, an openly available dataset of 5.85 billion image-text pairs filtered with CLIP.
The dataset is built from Common Crawl data, using cosine similarity thresholds on CLIP embeddings to keep only pairs whose text matches the image.
The authors validate the dataset by training CLIP-like models that perform competitively with the originals and by fine-tuning generative models, and they discuss the ethical risks of using the data.

LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models

The paper introduces LAION-5B, an openly available dataset intended to democratize access to the large-scale data needed to train state-of-the-art multimodal models. It comprises 5.85 billion image-text pairs filtered with a CLIP model, giving researchers a resource for developing and evaluating language-vision systems.

Introduction and Context

Models such as CLIP and DALL-E have shown the value of training on very large collections of noisy image-text data. They perform strongly on zero-shot tasks and are notably robust to distribution shift. However, the datasets used to train them, which range from hundreds of millions to billions of image-text pairs, have generally been proprietary and thus inaccessible to the broader research community. LAION-5B addresses this gap by providing a publicly available, large-scale image-text dataset, enabling transparent and reproducible research.

Dataset Composition

LAION-5B is divided into three subsets: 2.32 billion English image-text pairs, 2.26 billion multilingual pairs, and 1.27 billion pairs where language detection was inconclusive. This split allows targeted research on specific languages and applications. Each pair also carries metadata such as the image URL, the text, the image dimensions, and the cosine similarity score between image and text.
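As a quick check on the figures above, the three subset sizes can be summed and expressed as shares of the whole. This is only arithmetic on the numbers quoted in this summary; the dictionary keys are illustrative labels, not official subset names.

```python
# Subset sizes in billions of image-text pairs, as quoted in the text above.
# The keys are illustrative labels, not official subset identifiers.
subsets = {
    "english": 2.32,
    "multilingual": 2.26,
    "language_undetected": 1.27,
}

# Total size, rounded to avoid floating-point noise in the last digits.
total = round(sum(subsets.values()), 2)

# Share of each subset as a percentage of the total.
shares = {name: round(100 * size / total, 1) for name, size in subsets.items()}

print(total)   # 5.85
print(shares)  # {'english': 39.7, 'multilingual': 38.6, 'language_undetected': 21.7}
```

The total matches the 5.85 billion pairs reported for the full dataset.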

Data Curation and Filtering

The dataset was constructed from Common Crawl by extracting images and their alt-text from HTML IMG tags. Each candidate pair was then scored with OpenAI's ViT-B/32 CLIP model, and only pairs whose image and text embeddings had a cosine similarity above a predefined threshold were kept, which improves the semantic alignment between images and texts. For safety, each pair is additionally tagged with detection scores for NSFW content and watermarks so that users can filter the data responsibly.
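The two steps described above, collecting alt-text from IMG tags and keeping pairs by embedding similarity, can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the embeddings are hand-written toy vectors standing in for CLIP outputs, the names `ImgAltParser` and `filter_pairs` are hypothetical, and the threshold of 0.3 is a placeholder rather than a value taken from this summary.

```python
import math
from html.parser import HTMLParser


class ImgAltParser(HTMLParser):
    """Collect (src, alt) pairs from IMG tags that have non-empty alt text."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        alt = (attr.get("alt") or "").strip()
        if src and alt:
            self.pairs.append((src, alt))


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_pairs(candidates, threshold):
    """Keep (url, caption, similarity) for candidates at or above the threshold.

    candidates: iterable of (url, caption, image_embedding, text_embedding).
    """
    kept = []
    for url, caption, image_emb, text_emb in candidates:
        sim = cosine_similarity(image_emb, text_emb)
        if sim >= threshold:
            kept.append((url, caption, sim))
    return kept


# Step 1: extract candidate pairs from a snippet of HTML.
html = '<p><img src="a.jpg" alt="a cat on a sofa"><img src="b.jpg"><img src="c.jpg" alt=""></p>'
parser = ImgAltParser()
parser.feed(html)
pairs = parser.pairs  # only the first image has usable alt text

# Step 2: filter by similarity. Toy 2-d vectors stand in for CLIP embeddings.
candidates = [
    ("a.jpg", "a cat on a sofa", [1.0, 0.0], [0.8, 0.6]),  # well aligned
    ("b.jpg", "click here", [1.0, 0.0], [0.0, 1.0]),       # unrelated text
]
kept = filter_pairs(candidates, threshold=0.3)  # placeholder threshold
```

In a real pipeline the embeddings would come from a CLIP image encoder and text encoder, and the threshold would be tuned by inspecting samples near the cut-off.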

Validation Through CLIP Reproduction

To validate the utility of LAION-5B, the paper presents extensive experiments that reproduce CLIP-like models. Using OpenCLIP, the authors trained several architectures, including ViT-B/32, ViT-B/16, ViT-B/16+, and ViT-L/14, on the LAION-400M and LAION-2B-en subsets. The resulting models performed competitively with the original CLIP models on zero-shot classification benchmarks such as ImageNet-1k and ImageNet-A. Larger models trained on larger subsets showed better transfer and robustness, indicating that both dataset scale and model size help.
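For readers unfamiliar with how CLIP-like models are evaluated zero-shot, the idea is to embed one text prompt per class and assign each image to the class whose prompt embedding is most similar. The sketch below illustrates that decision rule only; the vectors are toy values rather than real model outputs, and `zero_shot_classify` is a hypothetical helper, not an OpenCLIP function.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the best label and per-class cosine similarities.

    class_text_embeddings maps a label to the embedding of its text prompt,
    for example the embedding of "a photo of a dog".
    """
    img = normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(img, normalize(txt)))
        for label, txt in class_text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy 2-d embeddings standing in for text-encoder outputs of class prompts.
class_prompts = {
    "dog": [1.0, 0.1],
    "cat": [0.1, 1.0],
}

# Toy image embedding that points mostly along the "dog" direction.
label, scores = zero_shot_classify([0.9, 0.2], class_prompts)
print(label)  # dog
```

Benchmark accuracy is then the fraction of test images whose predicted label matches the ground truth, with no task-specific training involved.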

Generative Model Fine-Tuning

The authors also experimented with generative models such as GLIDE and Stable Diffusion. Fine-tuning OpenAI's pre-trained GLIDE on LAION-5B produced models capable of generating high-quality images, showing that the dataset is suitable for training text-to-image models. These results further support LAION-5B's potential across diverse research directions within multimodal learning.

Ethical Considerations

The paper stresses the ethical issues raised by releasing and using large-scale datasets. While LAION-5B offers open access to a very large body of multimodal data, it inherits potential biases from its web sources and from the CLIP-based filtering. Tagging inappropriate content is one step toward responsible use. Beyond that, the authors recommend the dataset for academic research only and caution against deploying models trained on LAION-5B without comprehensive bias and safety evaluations.
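One way such tags could be used in practice is to select a subset whose detection scores fall below chosen limits. The following is a minimal sketch under stated assumptions: the field names `nsfw_score` and `watermark_score`, the `PairMetadata` class, and the limits are all hypothetical and do not reflect the dataset's actual column names or recommended thresholds.

```python
from dataclasses import dataclass


@dataclass
class PairMetadata:
    url: str
    caption: str
    nsfw_score: float       # hypothetical field: higher means more likely NSFW
    watermark_score: float  # hypothetical field: higher means more likely watermarked


def select_subset(rows, max_nsfw, max_watermark):
    """Keep rows whose detection scores are at or below both limits."""
    return [
        r for r in rows
        if r.nsfw_score <= max_nsfw and r.watermark_score <= max_watermark
    ]


rows = [
    PairMetadata("a.jpg", "a cat on a sofa", nsfw_score=0.01, watermark_score=0.05),
    PairMetadata("b.jpg", "stock photo of a city", nsfw_score=0.02, watermark_score=0.95),
    PairMetadata("c.jpg", "explicit content", nsfw_score=0.97, watermark_score=0.10),
]

# Placeholder limits chosen for illustration only.
subset = select_subset(rows, max_nsfw=0.5, max_watermark=0.8)
```

Score-based filtering reduces exposure to flagged content but depends on the accuracy of the detectors, so it complements rather than replaces the bias and safety evaluations the authors call for.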

Implications and Future Directions

The implications of LAION-5B extend beyond immediate academic research. A large-scale, openly accessible dataset lets the community build and test robust multimodal models without proprietary constraints. Future work includes curating more refined subsets using larger CLIP models and exploring the dataset's potential for applications in low-resource languages and ethnically diverse contexts.

In summary, LAION-5B significantly advances the accessibility of large-scale datasets for the broader research community. It facilitates the development and evaluation of robust, state-of-the-art multimodal models while laying a foundation for ethical and transparent research practices in AI.
