Scene Graph Based Fusion Network For Image-Text Retrieval (2303.11090v1)

Published 20 Mar 2023 in cs.CV and cs.AI

Abstract: A critical challenge in image-text retrieval is learning accurate correspondences between images and texts. Most existing methods focus on coarse-grained correspondences based on co-occurrences of semantic objects and fail to distinguish fine-grained local correspondences. In this paper, we propose a novel Scene Graph based Fusion Network (dubbed SGFN), which enhances image and text features through intra- and cross-modal fusion for image-text retrieval. Specifically, we design an intra-modal hierarchical attention fusion that incorporates semantic contexts, such as objects, attributes, and relationships, into image and text feature vectors via scene graphs, and a cross-modal attention fusion that combines the contextual semantics and local fusion via contextual vectors. Extensive experiments on the public datasets Flickr30K and MSCOCO show that SGFN outperforms several state-of-the-art image-text retrieval methods.
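The two fusion steps named in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: the abstract gives no equations, so the dot-product attention, the additive fusion, and every function name below (`attend`, `intra_modal_fusion`, `cross_modal_fusion`) are assumptions chosen only to show the general idea of folding scene-graph context into a feature vector and then enriching it with a contextual vector from the other modality.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, contexts):
    """Fuse `contexts` into one vector, weighted by dot-product similarity to `query`."""
    weights = softmax([dot(query, c) for c in contexts])
    return [sum(w * c[i] for w, c in zip(weights, contexts)) for i in range(len(query))]


def intra_modal_fusion(obj, attributes, relations):
    """Assumed hierarchical step: attend over attributes and relationships
    separately, then add both contexts to the object feature."""
    attr_ctx = attend(obj, attributes)
    rel_ctx = attend(obj, relations)
    return [o + a + r for o, a, r in zip(obj, attr_ctx, rel_ctx)]


def cross_modal_fusion(local, other_modality):
    """Assumed cross-modal step: add a contextual vector attended from the other modality."""
    ctx = attend(local, other_modality)
    return [l + c for l, c in zip(local, ctx)]


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Made-up 2-d features for one image object and one text object.
image_obj = intra_modal_fusion([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[1.0, 1.0]])
text_obj = intra_modal_fusion([0.9, 0.1], [[0.4, 0.6]], [[1.0, 0.8]])
fused_image = cross_modal_fusion(image_obj, [text_obj])
score = cosine(fused_image, text_obj)  # similarity used to rank image-text pairs
```

In a real system the features would come from a visual detector and a text encoder, the attention would use learned projections, and the whole network would be trained with a ranking loss; none of those details are given in the abstract, so they are left out here.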


Authors (3)

Guoliang Wang
Yanlei Shang
Yong Chen

