Overview of the Paper "Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks"
"Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks" presents a novel approach to enhance Vision-Language Pre-training (VLP) methods, an area critical to various vision-language (V+L) tasks such as visual question answering, image-text retrieval, and image captioning. In this paper, the authors introduce Oscar, a pre-training method that employs object tags detected in images as anchor points for aligning image and text modalities. This method leverages the observation that salient objects in an image are often mentioned in the accompanying text, using these identified objects to simplify cross-modal semantic alignment.
Introduction and Motivation
VLP models such as UNITER and LXMERT have shown that multi-layer Transformers can learn effective cross-modal representations from large corpora of image-text pairs. However, these models typically concatenate image region features with text features and rely on self-attention to discover the alignment implicitly, a process that is computationally intensive and prone to ambiguity because the detected visual regions are noisy and heavily over-sampled.
Oscar addresses these shortcomings by using object tags to explicitly anchor the alignment, making cross-modal representation learning more efficient and accurate. Pre-trained on a corpus of 6.5 million image-text pairs and then fine-tuned on downstream tasks, Oscar sets new state-of-the-art results on six well-established V+L tasks.
Key Contributions
The paper's main contributions are:
- Introduction of Object Tags in Pre-training: By using object tags detected in images as anchor points, Oscar makes image-text alignment easier to learn, easing the weakly supervised alignment problem that earlier methods had to solve implicitly.
- Novel Pre-training Objectives: Oscar combines a masked token loss with a contrastive loss (see the sketch after this list). The masked token loss asks the model to recover masked words or object tags from the surrounding text and image regions. The contrastive loss trains the model to distinguish an image-text pair carrying its original object tags from one whose tags have been replaced ("polluted") with tags sampled from another image.
- State-of-the-Art Performance: Oscar significantly outperforms prior VLP models on both understanding tasks (VQA, image-text retrieval, NLVR2) and generation tasks (image captioning).
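The two objectives in the second bullet can be sketched as follows. This is a simplified illustration, assuming BERT-style masking labels (ignore index -100 at unmasked positions) and a binary classification head over the [CLS] token; it is not the authors' implementation.

```python
import torch.nn.functional as F

def masked_token_loss(token_logits, mask_labels):
    # token_logits: (B, L, V) predictions over the text vocabulary for the
    #               word + object-tag segment of the input.
    # mask_labels:  (B, L) original token ids at masked positions, -100
    #               elsewhere (ignored by cross-entropy).
    return F.cross_entropy(
        token_logits.view(-1, token_logits.size(-1)),
        mask_labels.view(-1),
        ignore_index=-100,
    )

def contrastive_loss(cls_logits, polluted):
    # cls_logits: (B, 2) binary prediction from the [CLS] representation.
    # polluted:   (B,) 1 if the tag sequence was swapped with tags sampled
    #             from another image ("polluted"), 0 if it is the original.
    return F.cross_entropy(cls_logits, polluted)
```

During pre-training the two terms are summed into a single loss; in the paper, a fraction of word/tag tokens is masked for the first objective and the tag sequence is polluted at random for the second.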
Experimental Results
Oscar's empirical evaluations underscore its strong performance. For instance, in image-text retrieval on the COCO dataset, Oscar reaches a recall-at-1 (R@1) of 57.5%, a marked improvement over prior models. Similarly, on COCO image captioning, Oscar reaches a BLEU@4 score of 41.7 and a CIDEr score of 140.0, outperforming existing methods by substantial margins.
Qualitative Insights
Qualitative analyses in the paper include t-SNE visualizations of the learned feature space, comparing how well Oscar and baseline models align the two modalities. These visualizations indicate that Oscar's use of object tags considerably reduces the distance between visual and textual representations of the same object, yielding more coherent cross-modal embeddings.
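As a rough illustration of how such a visualization can be produced (the file names and pooling are hypothetical, not taken from the paper), 2-D t-SNE coordinates can be computed jointly over image-region and word embeddings of the same object classes:

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical inputs: per-object features extracted from the pre-trained
# model, one array for image regions and one for the corresponding words.
region_feats = np.load("region_features.npy")  # (N, 768), assumed file
word_feats = np.load("word_features.npy")      # (M, 768), assumed file

coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(
    np.concatenate([region_feats, word_feats], axis=0)
)
# If the modalities are well aligned, region points and word points for the
# same object class land close together in the 2-D map.
```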
Theoretical and Practical Implications
Theoretically, Oscar contributes to the understanding of how multimodal representations can be aligned using object-level semantics, suggesting broader applications in multimodal machine learning. Practically, the adoption of object tags can streamline the development of V+L systems by enhancing their performance and reducing the computational overhead involved in training.
Future Directions
Future work might include exploring more refined methods for detecting and utilizing object tags, potentially improving the scalability and robustness of V+L models. Additionally, integrating more sophisticated attention mechanisms could further optimize the pre-training process. Finally, assessing Oscar's performance across diverse and less structured datasets would help generalize its applicability.
Conclusion
Overall, "Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks" makes significant strides in VLP, demonstrating how object tags can effectively bridge the semantic gap between image and text modalities. Through innovative pre-training objectives and empirical validation, this work not only sets new benchmarks in V+L tasks but also charts a course for future research in cross-modal representation learning.