VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval (2110.11338v3)

Published 20 Oct 2021 in cs.CV, cs.CL, and cs.IR

Abstract: Cross-modal retrieval has emerged as one of the most important upgrades for text-only search engines (SE). Recently, by representing paired text-image inputs through early interaction, vision-language (VL) transformers have surpassed existing methods in text-image retrieval accuracy. However, when the same early-interaction paradigm is used at inference time, the efficiency of VL transformers remains too low for a practical cross-modal SE. Inspired by the way humans learn and then apply cross-modal knowledge, this paper presents a novel Vision-Language Decomposed Transformer (VLDeformer), which greatly increases the efficiency of VL transformers while maintaining their outstanding accuracy. The proposed method separates cross-modal retrieval into two stages: a VL transformer learning stage and a VL decomposition stage. The latter stage serves as single-modal indexing, somewhat analogous to the term indexing of a text SE. The model learns cross-modal knowledge from early-interaction pre-training and is then decomposed into individual encoders. The decomposition requires only small target datasets for supervision and achieves both $1000+$ times acceleration and less than $0.6\%$ average recall drop. VLDeformer also outperforms state-of-the-art visual-semantic embedding methods on COCO and Flickr30k.
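The abstract's core idea, training with early interaction and then decomposing the model into independent encoders so candidates can be indexed offline, can be illustrated with a minimal sketch. The code below is not the authors' implementation; it assumes generic PyTorch transformer encoders with hypothetical module names and sizes, and only contrasts pairwise joint scoring with decomposed bi-encoder retrieval.

```python
# Illustrative sketch (not the paper's code): contrasts early-interaction
# scoring, where every (text, image) pair passes through a joint VL
# transformer, with decomposed retrieval, where each modality is encoded
# once and matched by cosine similarity. All names and sizes are
# hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 256  # hypothetical embedding size


class JointVLScorer(nn.Module):
    """Early-interaction scorer: one forward pass per (text, image) pair."""

    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score_head = nn.Linear(DIM, 1)

    def forward(self, text_tokens, image_tokens):
        # Concatenate both modalities so attention mixes them (early interaction).
        joint = torch.cat([text_tokens, image_tokens], dim=1)
        hidden = self.encoder(joint)
        return self.score_head(hidden.mean(dim=1)).squeeze(-1)


class SingleModalEncoder(nn.Module):
    """One of the decomposed encoders: embeds a single modality independently."""

    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):
        hidden = self.encoder(tokens)
        return F.normalize(hidden.mean(dim=1), dim=-1)  # unit-length embedding


if __name__ == "__main__":
    n_images, seq_len = 1000, 16
    text = torch.randn(1, seq_len, DIM)            # one query
    images = torch.randn(n_images, seq_len, DIM)   # candidate gallery

    # Early interaction: cost grows with (queries x candidates) joint passes.
    scorer = JointVLScorer()
    with torch.no_grad():
        pair_scores = scorer(text.expand(n_images, -1, -1), images)

    # Decomposed retrieval: candidates are embedded once (offline indexing),
    # so each query needs a single encoder pass plus a similarity lookup.
    text_enc, image_enc = SingleModalEncoder(), SingleModalEncoder()
    with torch.no_grad():
        index = image_enc(images)                  # precomputed single-modal index
        query = text_enc(text)
        sims = query @ index.T                     # cosine similarity (unit vectors)

    print(pair_scores.shape, sims.shape)
```

The acceleration reported in the abstract comes from the second pattern: image embeddings are computed once and reused across queries, so retrieval reduces to a single encoder pass plus a similarity search rather than a joint forward pass per candidate.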

Authors (9)
  1. Lisai Zhang (8 papers)
  2. Hongfa Wu (1 paper)
  3. Qingcai Chen (36 papers)
  4. Yimeng Deng (1 paper)
  5. Zhonghua Li (46 papers)
  6. Dejiang Kong (2 papers)
  7. Zhao Cao (36 papers)
  8. Joanna Siebert (5 papers)
  9. Yunpeng Han (4 papers)
Citations (17)