VIVO: Advancements in Image Captioning through Visual Vocabulary Pre-Training
The paper "VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning" presents a novel approach to image captioning, specifically targeting the challenge of describing novel objects that are unseen in paired image-caption training datasets. This capacity is critical for the novel object captioning challenge (nocaps), where the task is constrained to not use additional caption annotations beyond what is available in COCO Captions. The authors introduced Visual Vocabulary Pre-training (VIVO) as a mechanism to break the reliance on paired image-caption datasets by leveraging extensive image-tag pair data for model pre-training.
Methodology Overview
VIVO employs a multi-layer Transformer model to learn a rich visual vocabulary by aligning image-level tags with corresponding image region features. The core innovation lies in utilizing large-scale image-tag data, allowing the development of a visual vocabulary without explicit caption annotations. This vocabulary is crafted through a pre-training process that uses a Hungarian matching loss coupled with masked tag prediction to handle the unordered nature of image tags.
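To make the role of the Hungarian matching loss concrete, the sketch below shows one way such a loss can be computed over a set of masked tag positions, assuming for simplicity that each tag is a single token. The tensor names and shapes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a Hungarian-matched masked-tag loss (assumed setup):
# - `logits`: (num_masked, vocab_size) predictions at the masked tag positions
# - `target_ids`: (num_masked,) vocabulary ids of the masked ground-truth tags
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def hungarian_tag_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Match each masked position to a ground-truth tag so the loss is
    invariant to the (arbitrary) order in which tags were masked."""
    log_probs = F.log_softmax(logits, dim=-1)          # (M, V)
    # Cost of assigning target j to position i = negative log-likelihood.
    cost = -log_probs[:, target_ids]                   # (M, M)
    row_idx, col_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    # Cross-entropy under the optimal one-to-one assignment.
    row_idx = torch.as_tensor(row_idx)
    matched_targets = target_ids[torch.as_tensor(col_idx)]
    return F.cross_entropy(logits[row_idx], matched_targets)
```

Because image tags form an unordered set, a per-position cross-entropy would penalize the model for producing correct tags in a different order; solving the bipartite assignment first removes that ordering penalty.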
The training process involves two key stages:
- Pre-training: The model is trained using image-tag data, where the goal is to predict masked tags based on the context provided by tags and image features. This stage establishes a semantic space where vectors for tags and visual features of semantically similar objects are positioned closely together.
- Fine-tuning: The pre-trained model is then fine-tuned on a much smaller paired image-caption dataset. This stage learns to generate captions conditioned on the visual and textual inputs, exploiting the visual vocabulary learned in the first stage (a minimal sketch of the input layout follows this list).
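The sketch below illustrates one plausible way the fine-tuning inputs could be laid out: caption tokens, tags, and region features concatenated into a single sequence, with some caption tokens masked and used as prediction targets. The helper names (`embed_tokens`, `project_regions`, `mask_id`) are hypothetical, introduced only to keep the example self-contained; this is not the authors' code.

```python
# Minimal sketch of a fine-tuning batch for a single example (assumed layout).
import torch


def build_finetune_example(caption_ids, tag_ids, region_feats,
                           embed_tokens, project_regions, mask_id, mask_prob=0.15):
    """Concatenate caption tokens, tags, and region features into one sequence,
    randomly mask some caption tokens, and return (inputs, labels)."""
    caption_ids = caption_ids.clone()
    labels = torch.full_like(caption_ids, -100)          # -100 = ignore index
    mask = torch.rand(caption_ids.shape) < mask_prob
    labels[mask] = caption_ids[mask]
    caption_ids[mask] = mask_id                          # replace with [MASK]

    text_emb = embed_tokens(torch.cat([caption_ids, tag_ids], dim=-1))  # (T+K, d)
    vis_emb = project_regions(region_feats)              # (R, d)
    inputs = torch.cat([text_emb, vis_emb], dim=0)       # (T+K+R, d)
    return inputs, labels
```

At inference time in this style of model, the caption is produced token by token: the model is fed the tags, the region features, and the tokens generated so far, and it predicts the next token, so tag words learned only during pre-training can surface in the generated caption.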
Results and Implications
The paper reports compelling empirical results: the VIVO-enhanced model sets a new state of the art on the nocaps benchmark and even surpasses human CIDEr scores. Substantial improvements over baseline models, including UpDown and OSCAR, underscore the effectiveness of VIVO in recognizing and accurately describing novel objects.
Specific results show consistent gains across various domains:
- Validation Set: Compared with methods like OSCAR with Constrained Beam Search, VIVO alone achieves competitive results and improves further when combined with SCST (self-critical sequence training) and CBS (constrained beam search).
- Test Set: VIVO achieves exceptional CIDEr scores, particularly in out-of-domain cases—a testament to its robust generalization capacity.
VIVO's success highlights crucial implications for advancing the field of image captioning:
- Generalization: The approach enhances zero-shot generalization capability, allowing models to describe novel objects effectively without direct exposure during caption-annotated training.
- Scalability: By leveraging readily available image-tag data instead of labor-intensive caption annotations, VIVO demonstrates a scalable path for improving vision-language models.
- Semantic Alignment: The Hungarian matching loss for unordered tag prediction is pivotal; it places tag embeddings and region features of semantically related concepts close together in a joint space, and this visual-text alignment underpins the performance gains (a small probing sketch follows this list).
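One simple way to probe such a joint semantic space is to compare tag embeddings and region embeddings directly, for example by cosine similarity. The sketch below assumes `tag_emb` is the encoder output for a tag token and `region_embs` are encoder outputs for image regions; these names are assumptions for illustration, not the paper's API.

```python
# Illustrative probe of the joint visual-text space (assumed inputs):
# tag_emb: (d,) embedding of a tag; region_embs: (R, d) region embeddings.
import torch
import torch.nn.functional as F


def most_similar_region(tag_emb: torch.Tensor, region_embs: torch.Tensor) -> int:
    """Return the index of the region whose embedding is closest to the tag,
    by cosine similarity in the shared space."""
    sims = F.cosine_similarity(tag_emb.unsqueeze(0), region_embs, dim=-1)  # (R,)
    return int(sims.argmax().item())
```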
Looking forward, the methodology invites further exploration of large-scale vision datasets for pre-training, easing the current constraints of caption-annotation collection. This approach could benefit vision-language applications beyond captioning, wherever tasks require semantic understanding that spans visual and textual data.