SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels (2103.07829v1)

Published 14 Mar 2021 in cs.CL

Abstract: Vision-language pre-training (VLP) on large-scale image-text pairs has recently made rapid progress in learning cross-modal representations. Existing pre-training methods either directly concatenate image and text representations at the feature level as input to a single-stream Transformer, or use a two-stream cross-modal Transformer to align the image-text representations in a high-level semantic space. In real-world image-text data, we observe that some image-text pairs align easily on simple semantics in both modalities, while others may be related only after higher-level abstraction. Therefore, in this paper, we propose a new pre-training method, SemVLP, which jointly aligns both the low-level and high-level semantics between image and text representations. The model is pre-trained iteratively in two prevalent fashions: single-stream pre-training to align at a fine-grained feature level, and two-stream pre-training to align high-level semantics, by employing a shared Transformer network with a pluggable cross-modal attention module. Extensive experiments on four well-established vision-language understanding tasks demonstrate the effectiveness of the proposed SemVLP in aligning cross-modal representations at different semantic granularities.
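
The idea of one shared network with a pluggable cross-modal attention module can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`shared_layer`, `single_stream`, `two_stream`, `schedule`), the projection-free attention, the choice of which modality attends to which, and the strict step-by-step alternation are all assumptions made for illustration, since the abstract does not specify them.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, context):
    """Scaled dot-product attention without learned projections (toy assumption)."""
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(len(q))])
    return out

def add(xs, ys):
    """Element-wise residual connection over two lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def shared_layer(tokens, context=None):
    """One shared layer: self-attention always runs; cross-modal attention
    is 'plugged in' only when a context sequence is supplied."""
    hidden = add(tokens, attend(tokens, tokens))
    if context is not None:
        hidden = add(hidden, attend(hidden, context))
    return hidden

def single_stream(text, image):
    """Low-level alignment: concatenate both modalities into one sequence,
    cross-modal module unplugged."""
    return shared_layer(text + image)

def two_stream(text, image):
    """High-level alignment: encode the image separately, then let the text
    attend to it through the plugged-in cross-modal module (assumed direction)."""
    image_hidden = shared_layer(image)
    return shared_layer(text, context=image_hidden)

def schedule(num_steps):
    """Hypothetical iterative schedule: alternate the two modes every step."""
    return ["single" if step % 2 == 0 else "two" for step in range(num_steps)]

# Toy 2-dimensional features: two text tokens and three image regions.
text = [[1.0, 0.0], [0.0, 1.0]]
image = [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]

for mode in schedule(4):
    hidden = single_stream(text, image) if mode == "single" else two_stream(text, image)
    print(mode, len(hidden))
```

Both modes call the same `shared_layer`, so any parameters it held would be shared between them; only the optional `context` argument changes the computation. A real model would add learned projections, multiple heads and layers, feed-forward blocks, and pre-training losses, none of which are shown here.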

Authors (7)

Chenliang Li (92 papers)
Ming Yan (190 papers)
Haiyang Xu (67 papers)
Fuli Luo (23 papers)
Wei Wang (1793 papers)
Bin Bi (24 papers)
Songfang Huang (51 papers)

Citations (32)

