MINT-1T: Scaling Open-Source Multimodal Data by 10x
The paper "MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens" addresses a significant gap in the domain of open-source large multimodal models (LMMs). It introduces MINT-1T, a multimodal interleaved dataset that scales up the existing dataset size tenfold, comprising one trillion text tokens and three billion images. This dataset stands out for its scale and diversity, addressing the limitations faced by current open-source datasets.
Motivation and Context
With the rapid advancement of large multimodal models (LMMs), there is increasing demand for extensive, diverse, open-source multimodal datasets. Existing datasets such as OBELICS, though effective, are comparatively limited in size and diversity, sourcing samples only from HTML documents. MINT-1T distinguishes itself by integrating data from diverse sources, including HTML, PDFs, and ArXiv papers, to provide a more comprehensive training corpus for LMMs.
Key Contributions
The primary contributions of the paper can be summarized as follows:
- Data Engineering: The construction of MINT-1T involved significant engineering challenges due to the need to handle large document sizes while preserving the original sequence and structure of images and text.
- Diversity and Scale: The dataset includes one trillion text tokens and 3.4 billion images, sourced from HTML, PDF, and ArXiv documents. This scale and diversity are unprecedented among open-source multimodal interleaved datasets.
- Model Performance: Experimental results indicate that LMMs trained on MINT-1T perform on par with, or even surpass, those trained on leading existing datasets like OBELICS, especially in tasks requiring deeper multimodal understanding.
Dataset Construction
The construction of MINT-1T is a multi-faceted process involving several steps to ensure the quality and diversity of the data. Key aspects include:
- Sourcing and Filtering: HTML documents were sourced from CommonCrawl WARC files, PDFs from CommonCrawl WAT files, and ArXiv documents directly from their repository. Documents were then filtered for text quality, non-English content was removed, and inappropriate material was excluded (a sketch of this kind of document- and image-level filtering follows this list).
- Image Processing: Images associated with each document were filtered as well, removing those with insufficient resolution and those flagged by an NSFW image detector.
- Deduplication: The dataset was deduplicated at both the paragraph and document level to remove redundant content. This step was performed separately for each data source and used Bloom filters for memory-efficient approximate membership testing (see the second sketch below).
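The paper's actual filtering pipeline is not reproduced here, but the kind of document- and image-level checks described above can be sketched as follows. This is a minimal illustration, assuming the langdetect and Pillow libraries; the thresholds, the is_nsfw detector, and the helper names are hypothetical placeholders, not the authors' implementation.

```python
from io import BytesIO

from langdetect import detect, LangDetectException  # assumed language-ID tooling
from PIL import Image

MIN_CHARS = 200          # hypothetical text-quality threshold
MIN_IMAGE_SIDE = 150     # hypothetical minimum resolution in pixels


def keep_document(text: str) -> bool:
    """Toy document filter: drop very short or non-English documents."""
    if len(text) < MIN_CHARS:
        return False
    try:
        return detect(text) == "en"
    except LangDetectException:
        return False


def keep_image(image_bytes: bytes, is_nsfw) -> bool:
    """Toy image filter: drop tiny images and anything an NSFW detector flags.

    `is_nsfw` stands in for whatever classifier is used; it is not specified here.
    """
    with Image.open(BytesIO(image_bytes)) as img:
        width, height = img.size
    if min(width, height) < MIN_IMAGE_SIDE:
        return False
    return not is_nsfw(image_bytes)
```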
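To make the Bloom-filter idea concrete, here is a minimal, self-contained sketch of paragraph-level deduplication. The bit-array size, the number of hash probes, and the record format (dicts with a "text" field) are illustrative assumptions, not the configuration used to build MINT-1T.

```python
import hashlib


class SimpleBloomFilter:
    """Minimal Bloom filter: k hash probes into a fixed-size bit array."""

    def __init__(self, num_bits: int = 1 << 24, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive k probe positions from slices of one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.num_bits

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def dedup_paragraphs(documents):
    """Yield documents with previously seen paragraphs removed.

    A Bloom filter never misses a true duplicate, but its false positives
    mean a small fraction of unique paragraphs may also be dropped.
    """
    seen = SimpleBloomFilter()
    for doc in documents:
        kept = []
        for paragraph in doc["text"].split("\n\n"):
            key = paragraph.strip().lower()
            if not key or key in seen:
                continue
            seen.add(key)
            kept.append(paragraph)
        if kept:
            yield {**doc, "text": "\n\n".join(kept)}
```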
Evaluation and Experiments
To validate the effectiveness of MINT-1T, the authors trained multimodal models on this dataset and compared the performance with models trained on OBELICS. Key findings from the experiments include:
- In-Context Learning: Models trained on MINT-1T showed superior in-context learning performance, particularly as the number of demonstrations increased. This suggests better generalization and adaptability when models are provided with richer, interleaved multimodal sequences (a sketch of how such few-shot interleaved prompts can be assembled follows this list).
- Domain Diversity: The inclusion of PDFs and ArXiv documents enhances the dataset's coverage across various domains, contributing to improved performance in specific areas like Science and Technology, which are less represented in HTML-only datasets.
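To illustrate what "more demonstrations" means for an interleaved model, the sketch below assembles a few-shot multimodal prompt as an alternating sequence of images and text. The Example structure, the "<image>" placeholder, and the formatting are hypothetical conventions for illustration only; the paper's evaluation harness and the compared models each have their own prompt formats.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    image_path: str   # path to the demonstration image
    question: str
    answer: str


def build_interleaved_prompt(demos: List[Example], query: Example) -> Tuple[str, List[str]]:
    """Return (prompt_text, image_paths) for an n-shot interleaved query.

    Each "<image>" placeholder marks where the corresponding image is spliced
    into the model's input sequence; the query's answer is left blank for the
    model to complete.
    """
    segments, images = [], []
    for demo in demos:
        segments.append(f"<image>\nQuestion: {demo.question}\nAnswer: {demo.answer}")
        images.append(demo.image_path)
    segments.append(f"<image>\nQuestion: {query.question}\nAnswer:")
    images.append(query.image_path)
    return "\n\n".join(segments), images
```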
Practical and Theoretical Implications
The creation of MINT-1T has several implications:
- Open Science and Research: By providing a significantly larger and more diverse open-source dataset, MINT-1T enables the research community to develop and benchmark new multimodal models more effectively.
- Training Robust LMMs: The improved performance of models trained on MINT-1T suggests that scaling up and diversifying training data is beneficial, and can help narrow the gap between closed-source and open-source LMMs.
- Engineering Efforts: The dataset construction process underscores the engineering challenges involved in scaling multimodal datasets. Future work may explore even more efficient methods for handling and processing such large-scale data.
Future Directions
While MINT-1T represents a significant step forward, there are several areas for future research and development:
- Enhanced Filtering Techniques: Developing more advanced methods for filtering and curating multimodal data to further improve dataset quality.
- Extended Sources: Exploring additional diverse data sources to include even broader domain coverage.
- Advanced Models: Utilizing the dataset to train more sophisticated models that can better leverage the rich, interleaved multimodal content.
In conclusion, MINT-1T addresses a crucial need in the field of large multimodal models by providing an extensive, diverse, and open-source dataset. This work lays a foundation for future research and development, promoting more transparent and accessible progress in multimodal AI.