Conceptual 12M: Expanding Vision-and-Language Pre-Training with Increased Scale and Diversity
The paper "Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training to Recognize Long-Tail Visual Concepts," authored by Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut from Google Research, introduces a new vision-and-language (V+L) pre-training dataset—Conceptual 12M (CC12M). This dataset expands upon previous efforts, particularly the Conceptual Captions 3M (CC3M) dataset, aiming to address the limitations of scale and diversity in existing V+L datasets.
Introduction and Motivation
Transfer learning via pre-training and fine-tuning has gained prominence in V+L research, significantly influencing tasks such as visual question answering (VQA), image-text retrieval, and referring-expression comprehension. However, existing V+L datasets, while useful, are often limited in scale and diversity, which restricts the range of visual concepts a model can learn. Notable examples such as COCO Captions, Visual Genome, and VQA v2 revolve around a relatively narrow visual domain, limiting how comprehensive the learned visual representations can be.
The authors of this paper posit that much larger, noisier datasets can be highly beneficial for V+L pre-training, facilitating the learning of long-tail visual concepts that smaller, curated datasets cannot capture.
Dataset Construction
The CC12M dataset is constructed by significantly relaxing the stringent filtering criteria used in the CC3M dataset. The CC12M pipeline involves:
- Image-based Filtering: Relaxed constraints on image aspect ratio and size, admitting images that CC3M's stricter thresholds would have discarded.
- Text-based Filtering: Broadened acceptance criteria for the alt-text, permitting longer and more varied captions.
- Text Transformation: Removal of the hypernymization and digit-substitution steps so that fine-grained textual entities are preserved; only person names are still substituted, to safeguard privacy.
This methodology increases the dataset size from CC3M's 3.3 million image-text pairs to CC12M's 12.4 million, enriching the variety and depth of the covered visual concepts; a minimal sketch of such a relaxed pipeline is given below.
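For readers who want a concrete picture of what the relaxed filtering looks like, the following is a minimal, hypothetical sketch of how such a pipeline might be composed. The thresholds (`MAX_ASPECT_RATIO`, `MIN_DIMENSION`, `MAX_TOKENS`) and the person-name masking are illustrative placeholders, not the actual values or components used to build CC12M.

```python
import re

# Hypothetical thresholds -- placeholders, not the values used to build CC12M.
MAX_ASPECT_RATIO = 2.5
MIN_DIMENSION = 400      # pixels
MAX_TOKENS = 256

# Stand-in for a real named-entity recognizer used for person-name masking.
PERSON_NAME = re.compile(r"\b(?:John|Jane) Doe\b")

def keep_image(width: int, height: int) -> bool:
    """Relaxed image-based filter: loose aspect-ratio and size constraints."""
    if min(width, height) < MIN_DIMENSION:
        return False
    return max(width, height) / min(width, height) <= MAX_ASPECT_RATIO

def keep_text(alt_text: str) -> bool:
    """Relaxed text-based filter: allow longer, more diverse alt-text."""
    tokens = alt_text.split()
    return 0 < len(tokens) <= MAX_TOKENS

def transform_text(alt_text: str) -> str:
    """No hypernymization or digit substitution; only mask person names."""
    return PERSON_NAME.sub("<PERSON>", alt_text)

def build_pairs(candidates):
    """candidates: iterable of (image_url, width, height, alt_text) tuples."""
    for url, width, height, alt_text in candidates:
        if keep_image(width, height) and keep_text(alt_text):
            yield url, transform_text(alt_text)
```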
Analysis and Results
The authors provide an in-depth analysis of CC12M compared to CC3M and other V+L datasets. Key findings include:
- Scale and Diversity: CC12M exhibits a markedly longer-tail distribution of visual concepts, a greater variety of fine-grained entities, and broader coverage of visual domains (a toy way to quantify this is sketched after this list).
- Quality Assessment: Despite a reduction in precision due to relaxed filtering, CC12M's larger scale compensates by significantly enhancing recall, providing a more comprehensive dataset for pre-training.
- Bias Analysis: Preliminary checks of sensitive terms and source web domains did not reveal severe skews, suggesting reasonably balanced visual concept coverage.
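To make the notion of a longer-tail concept distribution concrete, the toy sketch below counts token frequencies over a caption corpus and measures how much of the total mass falls outside the most frequent concepts. It is an illustrative measurement on made-up captions, not the paper's analysis code.

```python
from collections import Counter

def concept_frequencies(captions):
    """Count token occurrences across a caption corpus (a crude proxy for concepts)."""
    counts = Counter()
    for caption in captions:
        counts.update(caption.lower().split())
    return counts

def tail_mass(counts, head_size=1000):
    """Fraction of all occurrences covered by concepts outside the top `head_size`."""
    ranked = counts.most_common()
    total = sum(counts.values())
    head = sum(count for _, count in ranked[:head_size])
    return (total - head) / total

# A longer-tailed dataset yields a larger tail_mass value; this is the sense
# in which CC12M is "longer-tailed" than CC3M.
captions = ["a dog on a beach", "an okapi in the forest", "a dog playing fetch"]
print(tail_mass(concept_frequencies(captions), head_size=2))
```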
Vision-to-Language Generation
In multiple downstream tasks, particularly the nocaps (novel object captioning) and Conceptual Captions benchmarks, models pre-trained on CC12M demonstrated substantial improvements over those pre-trained on CC3M. The broader visual and textual diversity enabled more effective recognition of long-tail concepts. Quantitatively, models pre-trained on CC12M and then fine-tuned on COCO Captions achieved markedly higher CIDEr scores, particularly on the out-of-domain nocaps evaluation, affirming the dataset's efficacy.
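For context on how such captioning scores are computed, here is a minimal sketch using the CIDEr implementation from the `pycocoevalcap` package on toy captions; the package choice and the example data are assumptions for illustration, not the authors' evaluation code, and standard practice additionally runs captions through the PTB tokenizer, which this sketch skips.

```python
# pip install pycocoevalcap
from pycocoevalcap.cider.cider import Cider

# Toy reference captions and model outputs, keyed by image id.
references = {
    "img1": ["a dog runs on the beach", "a dog running along the shore"],
    "img2": ["an okapi standing in the forest"],
}
candidates = {
    "img1": ["a dog is running on the beach"],
    "img2": ["a zebra in the forest"],
}

# compute_score returns the corpus-level CIDEr and one score per image.
corpus_score, per_image_scores = Cider().compute_score(references, candidates)
print(f"CIDEr: {corpus_score:.3f}")
for image_id, score in zip(references, per_image_scores):
    print(image_id, round(float(score), 3))
```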
Vision-and-Language Matching
CC12M pre-training also significantly enhanced performance on caption-based image retrieval, most notably in the zero-shot setting, where the pre-trained model is evaluated without fine-tuning on the target benchmark. The dataset's extensive coverage of visual and textual concepts proved beneficial for learning generalized representations, outperforming both CC3M pre-training and prior models.
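Concretely, zero-shot caption-based image retrieval is typically evaluated by ranking every candidate image for each caption with the pre-trained model's similarity scores and reporting Recall@K, with no fine-tuning on the retrieval benchmark. The sketch below illustrates that protocol on toy dual-encoder embeddings; the encoder and the random data are assumptions, not the paper's model.

```python
import numpy as np

def recall_at_k(image_emb: np.ndarray, text_emb: np.ndarray, k: int = 1) -> float:
    """For each caption, rank all images by cosine similarity and check whether
    the paired image (same row index) appears among the top-k results."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = text_emb @ image_emb.T                 # (num_captions, num_images)
    ranks = np.argsort(-sims, axis=1)             # best-matching images first
    hits = [i in ranks[i, :k] for i in range(len(text_emb))]
    return float(np.mean(hits))

# Toy embeddings standing in for the outputs of a pre-trained dual encoder;
# in the zero-shot setting these come straight from pre-training.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(100, 64))
text_emb = image_emb + 0.1 * rng.normal(size=(100, 64))   # paired, slightly noisy
print("R@1:", recall_at_k(image_emb, text_emb, k=1))
print("R@5:", recall_at_k(image_emb, text_emb, k=5))
```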
Future Directions and Implications
The success of CC12M highlights the importance of dataset scale and diversity in advancing V+L research. Future developments could explore further scaling and more sophisticated filtering mechanisms to balance quality and quantity effectively. Additionally, combining CC12M with other datasets or auxiliary pre-training tasks could yield even better performance.
The findings also suggest potential shifts in downstream task setups toward more out-of-domain generalization, reflecting real-world application scenarios more accurately. The creation of large-scale, diverse datasets like CC12M sets a new precedent in V+L research, emphasizing the critical role of extensive, naturally occurring data for comprehensive multimodal learning.
CC12M’s public availability increases accessibility for researchers, fostering advances in V+L applications and potentially prompting novel benchmarks and challenges in AI. The ongoing development in this space is likely to yield models that not only perform well in controlled environments but are robust and versatile in the wild, fulfilling the growing demands of real-world AI applications.