An Overview of Scaling Up Vision-Language Pre-training for Image Captioning
The paper "Scaling Up Vision-Language Pre-training for Image Captioning" addresses the trend of increasing scale in vision-language pre-training (VLP) models, specifically tailored for the task of image captioning. The authors introduce LEMON, a large-scale image captioning model that marks a significant investigation into how scaling both dataset size and model complexity influences performance in VLP tasks.
Objectives and Approach
The authors aim to explore the effects of scaling in VLP models for image captioning, a domain where large-scale vision-language models have shown substantial promise. Scaling here refers to both expanding dataset size and increasing model capacity (i.e., the number of parameters). To facilitate this investigation, the authors construct an extensive dataset named ALT200M, consisting of up to 200 million image-text pairs collected from the web together with their associated alt-text. This dataset enables a systematic evaluation of how VLP models behave as the volume of training data grows.
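As a rough illustration of the kind of alt-text filtering such a collection pipeline might apply, the sketch below keeps only pairs whose captions pass simple heuristics. The specific thresholds and rules are assumptions made for illustration, not the authors' actual pipeline.

```python
# Illustrative sketch (not the authors' pipeline): filtering web image/alt-text
# pairs with simple heuristics. Thresholds here are hypothetical.
from typing import Iterable, Iterator, Tuple

def filter_alt_text_pairs(
    records: Iterable[Tuple[str, str]],
    min_words: int = 3,
    max_words: int = 50,
) -> Iterator[Tuple[str, str]]:
    """Yield (image_url, caption) pairs that pass basic quality heuristics."""
    for image_url, alt_text in records:
        caption = " ".join(alt_text.split())  # normalize whitespace
        n_words = len(caption.split())
        if n_words < min_words or n_words > max_words:
            continue  # drop captions that are too short or too long
        if not any(ch.isalpha() for ch in caption):
            continue  # drop captions with no alphabetic content
        yield image_url, caption

# Example: keep only usable pairs from a crawled batch.
raw = [
    ("https://example.com/cat.jpg", "a tabby cat sleeping on a windowsill"),
    ("https://example.com/x.jpg", "img_00123"),
]
print(list(filter_alt_text_pairs(raw)))  # only the first pair survives
```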
The paper builds on VinVL, a high-performing VLP framework, as the base model. LEMON models are trained with transformer sizes ranging from 13 million to 675 million parameters, while the training set is scaled from a few million up to 200 million image-text pairs. This grid of experiments lets the authors separate the effects of data scaling from model scaling on captioning performance across benchmarks such as COCO Caption, nocaps, and Conceptual Captions.
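To give a sense of how the reported 13M to 675M parameter range maps onto standard transformer sizing, the sketch below estimates parameter counts from depth and width. The layer and hidden-size settings are assumptions chosen to land near the reported sizes, not configurations confirmed by the paper.

```python
# Back-of-the-envelope transformer sizing. The configurations are hypothetical,
# used only to show that the 13M-675M range matches standard width/depth scaling.

def approx_transformer_params(layers: int, hidden: int, vocab: int = 30522) -> int:
    """Rough count: ~12*h^2 per layer (attention + FFN) plus token embeddings."""
    per_layer = 12 * hidden * hidden
    embeddings = vocab * hidden
    return layers * per_layer + embeddings

# Hypothetical configurations spanning the reported scale range.
configs = {
    "tiny":  (6, 256),
    "small": (12, 384),
    "base":  (12, 768),
    "large": (24, 1024),
    "huge":  (32, 1280),
}
for name, (layers, hidden) in configs.items():
    millions = approx_transformer_params(layers, hidden) / 1e6
    print(f"{name:6s} ~{millions:.0f}M params")
```

Running this prints roughly 13M, 33M, 108M, 334M, and 668M parameters, which is consistent with the span of model sizes studied in the paper.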
Key Findings
The empirical results yield several key insights:
- Data and Model Scaling Synergy: The benefits of scaling are most pronounced when dataset size and model size are increased together. Smaller models gain relatively little from additional data, whereas larger models improve substantially when trained on larger datasets.
- Performance on Benchmarks: LEMON achieves state-of-the-art results on several image captioning benchmarks, notably surpassing prior methods on COCO Caption and nocaps. This reinforces the effectiveness of large-scale pre-training on noisy, web-sourced data for a complex task like image captioning.
- Generalization and Zero-shot Capabilities: The model shows a strong ability to generate captions for long-tail visual concepts in a zero-shot manner, indicating that scaling yields models with more robust generalization.
- Sample Efficiency: Larger models reach comparable or better performance with fewer training samples than smaller models, as evidenced by faster convergence and more consistent performance across domains.
Implications and Future Prospects
A key implication of these findings is that model capacity, rather than data availability, may become the bottleneck as the amount of available web data continues to grow. Future AI development could therefore benefit from efforts to scale model parameters efficiently in tandem with leveraging vast datasets. The research also underscores an important shift: web-scale datasets can be effectively harnessed to train robust vision-language models, extending capabilities beyond the constraints of traditional curated datasets.
Continued exploration could target even larger models and datasets, since performance appears to improve roughly linearly with the logarithm of data size, suggesting headroom for further gains. Future work may also explore new model architectures or pre-training objectives that better exploit the patterns present in expansive, noisy datasets like ALT200M.
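As a minimal sketch of what such a linear-in-log-data trend looks like, the snippet below fits a score as a linear function of log10(dataset size) and extrapolates it. The data points are synthetic placeholders, not results from the paper.

```python
# Minimal sketch of a "linear in log(data size)" trend: fit score = a*log10(n) + b.
# All numbers below are synthetic placeholders for illustration only.
import numpy as np

data_sizes = np.array([3e6, 10e6, 40e6, 100e6, 200e6])   # image-text pairs
scores = np.array([100.0, 104.0, 108.5, 111.0, 113.0])   # synthetic CIDEr-like values

a, b = np.polyfit(np.log10(data_sizes), scores, deg=1)
print(f"fitted gain per 10x more data: {a:.1f} points")

# Extrapolation under the fitted trend (illustrative only):
print(f"predicted score at 1B pairs: {a * np.log10(1e9) + b:.1f}")
```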
Overall, this paper makes a significant contribution to VLP for image captioning by providing concrete evidence of the gains from scale, charting a clear course for future work on scalable vision-language integration.