Summary of "Learning Cross-Modal Context Graph for Visual Grounding"
The paper "Learning Cross-Modal Context Graph for Visual Grounding" addresses the complex task of visual grounding, which involves localizing object regions in images corresponding to noun phrases in descriptive sentences. Visual grounding is crucial for vision-language applications such as image retrieval, image captioning, visual question answering, and visual dialogue. Despite advancements in computer vision and natural language processing, a significant challenge remains in bridging visual and language modalities due to large variations in object appearances, linguistic features, and semantic ambiguities.
This paper introduces a language-guided graph representation that captures the global context of grounding entities and their relations, together with a cross-modal graph matching strategy for the multiple-phrase visual grounding task. The framework employs a modular graph neural network that computes context-aware representations of phrases and object proposals through message propagation and is designed to produce globally consistent localizations of the grounding phrases. The system is trained in two stages: the phrase graph network is optimized first, and the entire network is then optimized jointly.
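To make the modular design and the two-stage schedule concrete, here is a minimal sketch of how such a pipeline could be wired together. All module names, loss functions, and hyperparameters below are hypothetical placeholders, not the authors' implementation.

```python
# Hypothetical sketch of the modular pipeline and its two-stage training;
# module names, losses, and hyperparameters are illustrative assumptions.
import torch
from torch import nn, optim

class GroundingPipeline(nn.Module):
    def __init__(self, backbone, phrase_graph_net, visual_graph_net, similarity_net):
        super().__init__()
        self.backbone = backbone                  # visual + linguistic feature extraction
        self.phrase_graph_net = phrase_graph_net  # refines phrase nodes and relation edges
        self.visual_graph_net = visual_graph_net  # refines object-proposal nodes and edges
        self.similarity_net = similarity_net      # cross-modal node/edge matching

    def forward(self, image, sentence):
        object_feats, phrase_feats = self.backbone(image, sentence)
        phrase_nodes, phrase_edges = self.phrase_graph_net(phrase_feats)
        object_nodes, object_edges = self.visual_graph_net(object_feats, phrase_nodes)
        return self.similarity_net(phrase_nodes, phrase_edges, object_nodes, object_edges)

def train_two_stage(model, loader, phrase_loss, grounding_loss, epochs=(5, 20)):
    # Stage 1: warm up the phrase graph network on a language-side objective.
    stage1 = optim.Adam(model.phrase_graph_net.parameters(), lr=1e-4)
    for _ in range(epochs[0]):
        for image, sentence, target in loader:
            _, phrase_feats = model.backbone(image, sentence)
            loss = phrase_loss(model.phrase_graph_net(phrase_feats), target)
            stage1.zero_grad(); loss.backward(); stage1.step()
    # Stage 2: optimize the whole network end-to-end on the grounding objective.
    stage2 = optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(epochs[1]):
        for image, sentence, target in loader:
            loss = grounding_loss(model(image, sentence), target)
            stage2.zero_grad(); loss.backward(); stage2.step()
```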
Key Components:
- Backbone Network: This network extracts the base linguistic and visual features. It integrates a convolutional network for object proposal generation with a recurrent network for phrase encoding, producing the embeddings used by the later modules (a minimal backbone sketch follows this list).
- Phrase Graph Network: By constructing a language scene graph from the sentence description, this network computes refined, context-aware phrase representations, aggregating information from nodes (phrases) and edges (relations) with attention-based message passing (see the message-passing sketch below).
- Visual Object Graph Network: This network incorporates visual context by building a visual scene graph from the object proposals. It uses the structure of the language graph to guide the selection of relevant proposals and refines their representations through the same style of context-aware message propagation (the message-passing sketch below applies here as well).
- Graph Similarity Network: This component addresses the global matching problem by predicting similarities between nodes and between edges across the language and visual graphs; matching is cast as structured prediction, which yields globally consistent groundings (see the matching sketch below).
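The following is a minimal sketch of the kind of backbone described above: a convolutional branch that embeds object proposals and a recurrent branch that embeds phrases. The dimensions, the use of proposal crops rather than a detector's region features, and the bidirectional GRU are all illustrative assumptions.

```python
# Hypothetical backbone: conv branch for proposal features, GRU branch for phrases.
import torch
from torch import nn

class Backbone(nn.Module):
    def __init__(self, vocab_size, word_dim=300, hidden_dim=512):
        super().__init__()
        # Visual branch: a tiny conv net standing in for a detector's feature extractor.
        self.visual = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden_dim),
        )
        # Linguistic branch: word embeddings followed by a bidirectional GRU.
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.rnn = nn.GRU(word_dim, hidden_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, proposal_crops, phrase_tokens):
        # proposal_crops: (num_proposals, 3, H, W); phrase_tokens: (num_phrases, max_len)
        object_feats = self.visual(proposal_crops)          # (num_proposals, hidden_dim)
        _, h = self.rnn(self.embed(phrase_tokens))          # h: (2, num_phrases, hidden_dim // 2)
        phrase_feats = torch.cat([h[0], h[1]], dim=-1)      # (num_phrases, hidden_dim)
        return object_feats, phrase_feats
```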
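Both graph networks rely on context-aware message propagation. The sketch below shows a generic attention-weighted message-passing layer of the sort such networks use; the exact message, attention, and update equations here are assumptions, not the paper's formulation.

```python
# Generic attention-based message passing over a graph of nodes and relation edges.
import torch
from torch import nn

class GraphMessagePassing(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.message = nn.Linear(2 * dim, dim)  # builds a message from (neighbour node, edge)
        self.attend = nn.Linear(2 * dim, 1)     # scores each incoming message for a node
        self.update = nn.GRUCell(dim, dim)      # fuses the aggregated context into the node

    def forward(self, nodes, edges, edge_index):
        # nodes: (N, dim); edges: (E, dim); edge_index: (E, 2) rows of (source, target).
        src, dst = edge_index[:, 0], edge_index[:, 1]
        msgs = torch.tanh(self.message(torch.cat([nodes[src], edges], dim=-1)))   # (E, dim)
        scores = self.attend(torch.cat([nodes[dst], msgs], dim=-1)).squeeze(-1)   # (E,)
        # Softmax over each node's incoming edges (computed without external libraries).
        exp = scores.exp()
        denom = torch.zeros(nodes.size(0), device=nodes.device).index_add_(0, dst, exp)
        weights = (exp / (denom[dst] + 1e-8)).unsqueeze(-1)                       # (E, 1)
        context = torch.zeros_like(nodes).index_add_(0, dst, weights * msgs)      # (N, dim)
        return self.update(context, nodes)      # context-refined node representations
```

In this scheme, the phrase graph network would run such a layer over the language scene graph, while the visual object graph network would run it over the proposal graph, with the language structure deciding which proposals and relations participate.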
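Finally, a sketch of the cross-modal similarity step: every refined phrase node is scored against every refined object node, and phrase edges against visual edges, with each phrase grounded to its best-matching proposal. The projected dot-product scoring and the argmax decoding are simplifications standing in for the structured prediction described in the paper.

```python
# Hypothetical cross-modal graph similarity: node-node and edge-edge scores,
# with simple argmax decoding in place of the paper's structured prediction.
import torch
from torch import nn

class GraphSimilarity(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.proj_phrase_node = nn.Linear(dim, dim)
        self.proj_object_node = nn.Linear(dim, dim)
        self.proj_phrase_edge = nn.Linear(dim, dim)
        self.proj_object_edge = nn.Linear(dim, dim)

    def forward(self, phrase_nodes, phrase_edges, object_nodes, object_edges):
        # phrase_nodes: (P, dim); object_nodes: (O, dim); edge tensors analogous.
        node_sim = self.proj_phrase_node(phrase_nodes) @ self.proj_object_node(object_nodes).t()  # (P, O)
        edge_sim = self.proj_phrase_edge(phrase_edges) @ self.proj_object_edge(object_edges).t()  # (Ep, Eo)
        grounding = node_sim.argmax(dim=1)  # index of the best-matching proposal per phrase
        return node_sim, edge_sim, grounding
```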
Experimental Validation:
The proposed method is evaluated on the Flickr30K Entities benchmark, where it outperforms prior state-of-the-art methods and sets a new high in grounding accuracy. Extensive experiments validate the efficacy of the language-guided graph representation and the cross-modal matching strategy.
Implications and Future Work:
The paper contributes to the visual grounding field by introducing a framework that strengthens global context understanding across language and visual data. This approach could be extended to complex visual reasoning tasks, where accurate cross-modal associations are crucial. Future work could apply the framework to other challenging datasets and extend it to more intricate visual and linguistic contexts. The modular design of the network also suggests potential for adaptation to other multimodal tasks, fostering further advances in artificial intelligence research.