
Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval (1910.05134v1)

Published 11 Oct 2019 in cs.CV

Abstract: Image-text retrieval of natural scenes has been a popular research topic. Since image and text are heterogeneous cross-modal data, one of the key challenges is how to learn comprehensive yet unified representations to express the multi-modal data. A natural scene image mainly involves two kinds of visual concepts, objects and their relationships, which are equally essential to image-text retrieval. Therefore, a good representation should account for both of them. In the light of recent success of scene graph in many CV and NLP tasks for describing complex natural scenes, we propose to represent image and text with two kinds of scene graphs: visual scene graph (VSG) and textual scene graph (TSG), each of which is exploited to jointly characterize objects and relationships in the corresponding modality. The image-text retrieval task is then naturally formulated as cross-modal scene graph matching. Specifically, we design two particular scene graph encoders in our model for VSG and TSG, which can refine the representation of each node on the graph by aggregating neighborhood information. As a result, both object-level and relationship-level cross-modal features can be obtained, which favorably enables us to evaluate the similarity of image and text in the two levels in a more plausible way. We achieve state-of-the-art results on Flickr30k and MSCOCO, which verifies the advantages of our graph matching based approach for image-text retrieval.

Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval

The paper presents a technique for image-text retrieval that leverages scene graphs to represent and match cross-modal data. The central premise is that retrieval can be made more effective by accounting not only for the objects in a scene but also for their relationships. This stands in contrast to previous methods that largely relied on global representations or local object-centric matching without explicitly modeling relational information.

Methodology

To bridge the modality gap between images and text, the paper introduces Visual Scene Graphs (VSG) and Textual Scene Graphs (TSG). These graphs capture both the objects and the relationships inherent to each modality, and the image-text retrieval task is then reformulated as a scene graph matching problem. Specifically, the authors design dedicated encoders for each modality: a Multi-modal Graph Convolution Network (MGCN) for the VSG and a bi-GRU-based encoder for the TSG. The VSG encoder refines each node's representation by aggregating useful information from neighboring nodes, while the TSG encoder learns object and relationship features by encoding along different path types formed by word and relationship edges.
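
To illustrate the neighborhood-aggregation idea behind the VSG encoder, the sketch below shows one round of graph-convolution-style message passing over scene-graph nodes. It is a minimal, hypothetical example in PyTorch, not the authors' MGCN: the class name, mean aggregation, and residual update are assumptions made for clarity.

```python
# Minimal sketch (not the authors' exact MGCN): one round of neighborhood
# aggregation that refines each scene-graph node (object or relationship)
# by mixing in features of its neighbors via an adjacency matrix.
import torch
import torch.nn as nn


class GraphAggregation(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # shared projection of aggregated neighbor features
        self.activation = nn.ReLU()

    def forward(self, nodes: torch.Tensor, adjacency: torch.Tensor) -> torch.Tensor:
        # nodes:     (num_nodes, dim) initial object/relationship embeddings
        # adjacency: (num_nodes, num_nodes) with 1 where an edge exists
        # Row-normalize so each node averages its neighbors' features.
        degree = adjacency.sum(dim=1, keepdim=True).clamp(min=1)
        neighbor_mean = (adjacency @ nodes) / degree
        # Residual update: keep the node's own feature and add the aggregated context.
        return nodes + self.activation(self.transform(neighbor_mean))


# Usage: 5 scene-graph nodes with 256-d features and a toy adjacency matrix.
nodes = torch.randn(5, 256)
adj = torch.zeros(5, 5)
adj[0, 1] = adj[1, 0] = 1.0  # e.g. an object node linked to a relationship node
refined = GraphAggregation(256)(nodes, adj)
print(refined.shape)  # torch.Size([5, 256])
```

In the full model, the refined object-level and relationship-level node features from both graphs would then be compared across modalities to score image-text similarity at the two levels.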

Evaluation

The proposed approach was evaluated on two well-known datasets, Flickr30k and MSCOCO, achieving state-of-the-art results. On Flickr30k, the framework yielded a 16.8% and 16.18% relative increase in recall@1 for caption and image retrieval, respectively, compared to prior methods. On MSCOCO, the model achieved 10.62% and 6.65% relative improvements in recall@1 for caption and image retrieval on the 5k test split, indicating its effectiveness in discerning complex scenes.
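
For context on the metric, recall@K in image-text retrieval is typically computed from a cross-modal similarity matrix by checking whether the ground-truth match appears among the top-K retrieved items. The sketch below is a simplified, hypothetical implementation that assumes one matching caption per image (the actual benchmarks pair each image with five captions); it is not taken from the paper's code.

```python
# Hedged sketch of the standard Recall@K protocol for image-to-text retrieval:
# row i / column i of the similarity matrix are assumed to be a matching pair.
import numpy as np


def recall_at_k(similarity: np.ndarray, k: int) -> float:
    # similarity[i, j] = score between image i and caption j; ground truth is the diagonal.
    ranks = np.argsort(-similarity, axis=1)  # captions ranked per image, best first
    hits = [i in ranks[i, :k] for i in range(similarity.shape[0])]
    return float(np.mean(hits))


# Toy example: random scores with the true pair boosted so it ranks first.
sim = np.random.rand(100, 100)
np.fill_diagonal(sim, 1.5)
print(recall_at_k(sim, k=1))  # -> 1.0 in this toy case
```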

Implications and Future Directions

This paper demonstrates the importance of incorporating relationship-level features alongside object-level features for more nuanced cross-modal retrieval. The results underscore the potential of scene graphs as a representation mechanism in bridging visual and textual modalities. Practically, this work could enhance applications like search engines, recommendation systems, and automated content curation, where understanding and retrieving multi-modal data is crucial.

Theoretically, future research could further refine scene graph generation, for example by integrating more robust or adaptive relation extraction techniques. Exploring the framework under varying degrees of data complexity and diversity could also offer insight into its versatility and limitations. Furthermore, integrating more advanced graph neural networks could strengthen the framework's representational capacity, improving retrieval performance in even more complex scenarios.

In conclusion, this paper advances the field of image-text retrieval by effectively capturing and utilizing the relationships between objects, marking a distinct move away from prevalent object-centric approaches. The introduction of scene graphs into retrieval tasks signifies a step towards more comprehensive and semantically informed cross-modal retrieval systems.

Authors (5)
  1. Sijin Wang (2 papers)
  2. Ruiping Wang (32 papers)
  3. Ziwei Yao (1 paper)
  4. Shiguang Shan (136 papers)
  5. Xilin Chen (119 papers)
Citations (192)