
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text (2406.08418v3)

Published 12 Jun 2024 in cs.CV and cs.AI

Abstract: Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of LLMs during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal LLMs. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) has 15 times larger scales while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, easily degradable from an image-text interleaved format to pure text corpus and image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this could provide a solid data foundation for future multimodal model research. Code and data are released at https://github.com/OpenGVLab/OmniCorpus.


The paper "OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text" presents a substantial contribution to the development of multimodal LLMs (MLLMs). The authors introduce OmniCorpus, an expansive dataset comprising 8.6 billion images and 1,696 billion text tokens drawn from diverse sources, including English and non-English websites as well as video-centric platforms. The dataset's scale and diversity significantly surpass those of existing multimodal datasets.

Key Contributions

The paper makes three primary contributions:

  1. Introduction of OmniCorpus Dataset: Featuring 8.6 billion images and 1,696 billion text tokens across 2.2 billion documents, OmniCorpus sets a new benchmark in multimodal data curation and provides a substantially richer resource for training MLLMs.
  2. Development of a Comprehensive Data Engine: The authors built an efficient data-processing pipeline capable of handling large-scale multimodal data, with stages for main-body extraction, preliminary and detailed text filtering, image content filtering, and document deduplication (a toy illustration of heuristic text filtering follows this list). Notably, integrated human-feedback filtering helps ensure the dataset's quality.
  3. Exploratory Analysis and Empirical Validation: The authors provide extensive empirical evidence validating the quality and utility of OmniCorpus. They demonstrate the dataset's effectiveness through various experiments that highlight its impact on enhancing few-shot capabilities and preserving language understanding in MLLMs.
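
The text-filtering stages are described only at a high level in this summary. As a toy illustration (not the paper's actual rules), a preliminary heuristic filter might look like the following, where every threshold is an assumption:

```python
import re

def passes_text_filters(text: str) -> bool:
    """Toy preliminary text filter; all thresholds are illustrative assumptions."""
    words = text.split()
    if len(words) < 50:
        return False  # too short to be a meaningful document body
    mean_word_len = sum(len(w) for w in words) / len(words)
    if mean_word_len > 12:
        return False  # likely gibberish, code dumps, or concatenated tokens
    symbol_ratio = len(re.findall(r"[#{}<>|]", text)) / len(text)
    if symbol_ratio > 0.05:
        return False  # likely residual markup rather than prose
    return True
```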

Dataset Characteristics and Processing Pipeline

OmniCorpus is built using an optimized pipeline designed to handle extensive data volumes efficiently. The pipeline's key features include:

  • Main Body Extraction and Preliminary Filtering: Building on existing methods, this stage identifies the primary content within each document and discards irrelevant sections.
  • Document Deduplication: MinHash signatures ensure that the dataset retains only unique documents, significantly reducing redundancy (see the sketch after this list).
  • Image Downloading and Filtering: Filtering on aesthetic and NSFW scores removes low-quality images, improving the dataset's quality (also sketched below).
  • Human-Feedback Filtering: An iterative process that incorporates human feedback to refine the filtering rules, removing low-quality content and improving overall data quality.
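
As a concrete illustration of two of these stages, here is a minimal Python sketch, assuming simplified inputs. The deduplication part uses the open-source datasketch library; the similarity threshold, permutation count, and aesthetic/NSFW cutoffs below are illustrative assumptions, not values reported by the paper.

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # number of hash permutations (assumed)

def minhash_of(text: str) -> MinHash:
    """Build a MinHash signature from word-level shingles of a document."""
    sig = MinHash(num_perm=NUM_PERM)
    for token in text.lower().split():
        sig.update(token.encode("utf-8"))
    return sig

def deduplicate(documents):
    """Keep only documents with no near-duplicate already in the LSH index."""
    lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)  # threshold assumed
    kept = []
    for doc_id, text in documents:
        sig = minhash_of(text)
        if lsh.query(sig):  # a near-duplicate document was already kept
            continue
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
    return kept

# Hypothetical image gate: scores would come from pretrained aesthetic and
# NSFW predictors, which are not reproduced here.
AESTHETIC_MIN, NSFW_MAX = 4.5, 0.2  # assumed cutoffs

def keep_image(aesthetic_score: float, nsfw_score: float) -> bool:
    """Keep an image only if it clears both quality gates."""
    return aesthetic_score >= AESTHETIC_MIN and nsfw_score <= NSFW_MAX

print(deduplicate([("a", "fishing boats at dawn in the harbor"),
                   ("b", "fishing boats at dawn in the harbor"),
                   ("c", "an entirely different page about astronomy")]))  # ['a', 'c']
print(keep_image(5.1, 0.05), keep_image(3.0, 0.4))  # True False
```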

Comparison with Existing Datasets

OmniCorpus is notable for its extensive size and the diversity of its sources. Compared to counterparts like MMC4 and OBELICS, OmniCorpus not only offers a 15-times-larger scale but also includes bilingual data and content from varied sources, including Common Crawl and YouTube. It is also more flexible: an interleaved document can be degraded to a pure-text corpus or to image-text pairs, as sketched below. This comprehensive approach addresses the limitations of existing datasets, which often lack diversity and focus primarily on English text.
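
A minimal sketch of this degradation, assuming an interleaved document is modeled as an ordered list of text and image nodes (an assumed schema, not the released format):

```python
def to_pure_text(doc):
    """Degrade to a pure-text corpus: concatenate text nodes, drop images."""
    return " ".join(node["text"] for node in doc if node["type"] == "text")

def to_image_text_pairs(doc):
    """Degrade to image-text pairs: pair each image with the nearest
    following text node (one simple heuristic among many)."""
    pairs = []
    for i, node in enumerate(doc):
        if node["type"] != "image":
            continue
        caption = next((n["text"] for n in doc[i + 1:] if n["type"] == "text"), None)
        if caption is not None:
            pairs.append((node["url"], caption))
    return pairs

doc = [
    {"type": "text", "text": "A tour of the harbor."},
    {"type": "image", "url": "https://example.com/boat.jpg"},
    {"type": "text", "text": "Fishing boats at dawn."},
]
print(to_pure_text(doc))         # "A tour of the harbor. Fishing boats at dawn."
print(to_image_text_pairs(doc))  # [('https://example.com/boat.jpg', 'Fishing boats at dawn.')]
```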

Experimental Evaluation and Results

The paper presents a series of experiments that substantiate OmniCorpus's advantages:

  1. Image Position Strategies: The authors experiment with different strategies for positioning images within text sequences, finding that the natural document layout is optimal for fully autoregressive architectures, while a retrieval-based placement benefits cross-attention architectures (a sketch contrasting the two follows this list).
  2. Data Filtering Impact: The paper shows that stringent data filtering enhances model performance but warns against over-filtering, which can homogenize the data and negatively impact results.
  3. Impact on Pre-training: Training models on OmniCorpus—both with and without fine-tuning—consistently enhances performance on a range of benchmarks, including few-shot tasks. This validates the dataset’s utility in improving the training efficacy of MLLMs.
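
A sketch contrasting the two placement strategies, reusing the assumed node schema from above; the similarity function stands in for something like a CLIP image-text score and is not the paper's implementation:

```python
IMG = "<image>"  # placeholder token marking where an image is inserted

def natural_layout(doc):
    """Natural layout: keep each image exactly where it occurs in the document."""
    return " ".join(IMG if n["type"] == "image" else n["text"] for n in doc)

def retrieval_layout(doc, similarity):
    """Retrieval-based layout: place each image before the text node it matches best."""
    texts = [n["text"] for n in doc if n["type"] == "text"]
    images = [n for n in doc if n["type"] == "image"]
    slots = [[] for _ in texts]  # image placeholders assigned to each text node
    for img in images:
        best = max(range(len(texts)), key=lambda i: similarity(img, texts[i]))
        slots[best].append(IMG)
    pieces = []
    for placed, text in zip(slots, texts):
        pieces.extend(placed)
        pieces.append(text)
    return " ".join(pieces)

doc = [
    {"type": "image", "url": "https://example.com/boat.jpg"},
    {"type": "text", "text": "A tour of the harbor."},
    {"type": "text", "text": "Fishing boats at dawn."},
]
# Stand-in similarity: a real system might use CLIP; here we match on a keyword.
sim = lambda img, text: float("boat" in text.lower())
print(natural_layout(doc))         # "<image> A tour of the harbor. Fishing boats at dawn."
print(retrieval_layout(doc, sim))  # "A tour of the harbor. <image> Fishing boats at dawn."
```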

Broader Implications and Future Directions

The introduction of OmniCorpus represents a significant advancement in the field of multimodal machine learning. By providing a robust data foundation, this dataset can foster new developments in MLLMs, particularly in enhancing few-shot learning capabilities and maintaining LLM competencies across multimodal domains.

The authors acknowledge that the current filtering improvements, while beneficial, are not exhaustive; future work should isolate the specific factors that drive the observed performance gains. Moreover, the findings apply directly to practical MLLM training, suggesting a trajectory toward more sophisticated, contextually aware multimodal models.

In sum, OmniCorpus stands as an invaluable resource poised to influence the future landscape of multimodal machine learning. Its comprehensive scale, diversity of sources, and rigorous data processing mechanisms position it as a foundational dataset for advancing MLLM research and applications.

Authors (40): Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, et al.