Learning to Generate Scene Graph from Natural Language Supervision
The paper "Learning to Generate Scene Graph from Natural Language Supervision" presents a novel approach for generating scene graphs from image-text pairs by leveraging natural language annotations. The authors propose a shift away from traditional methods, which rely predominantly on visual supervision in the form of annotated scene graph data, toward a language-supervised methodology that uses readily available image descriptions, i.e., image captions, as the source of supervision for scene graph generation.
Methodology Overview
The core idea is to use off-the-shelf object detectors to identify and localize objects within images and then match these detected objects to the nouns and predicates parsed from image captions. This matching process creates "pseudo" labels that bridge the semantic gap between image content and its linguistic annotation.
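To make the matching step concrete, the sketch below shows one way pseudo-labels could be assembled, assuming a detector that returns class-labeled bounding boxes and a caption parser that yields subject-predicate-object triplets. The data structures, names, and exact-string matching are hypothetical simplifications, not the paper's actual pipeline.

```python
# Hedged sketch of pseudo-label creation: detected objects are matched to
# entity nouns from caption-derived (subject, predicate, object) triplets.
# The detector output format, the parsed triplets, and the matching rule
# here are all illustrative assumptions.

from typing import Dict, List, Tuple

# Hypothetical detector output: class label -> bounding boxes (x1, y1, x2, y2)
detections: Dict[str, List[Tuple[int, int, int, int]]] = {
    "man": [(30, 40, 120, 300)],
    "horse": [(100, 80, 380, 320)],
}

# Hypothetical triplets parsed from the caption "a man riding a horse"
caption_triplets = [("man", "riding", "horse")]

def build_pseudo_labels(detections, triplets):
    """Ground each parsed triplet by matching its subject and object nouns
    to detected object classes; triplets with unmatched entities are dropped."""
    pseudo_labels = []
    for subj, pred, obj in triplets:
        subj_boxes = detections.get(subj, [])
        obj_boxes = detections.get(obj, [])
        # Keep the triplet only if both entities are localized in the image.
        for sb in subj_boxes:
            for ob in obj_boxes:
                pseudo_labels.append({
                    "subject": (subj, sb),
                    "predicate": pred,
                    "object": (obj, ob),
                })
    return pseudo_labels

print(build_pseudo_labels(detections, caption_triplets))
```

In practice the matching would need to handle synonyms and plural forms rather than exact strings, but the output is the same in spirit: localized subject-predicate-object triplets that serve as training targets without human-drawn scene graphs.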
The authors then train a Transformer-based model on these pseudo-labels. The model operates on the principle of masked token prediction: parts of the input, in this case components of a subject-predicate-object triplet, are masked during training, and the model must predict the masked elements, thereby learning the structure and relationships present in the data.
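The following is a minimal sketch of masked triplet prediction, assuming a toy vocabulary, a small Transformer encoder, and a single masked slot; the dimensions, vocabulary, and masking scheme are illustrative assumptions rather than the paper's exact architecture, which also conditions on visual features.

```python
# Hedged sketch: a tiny Transformer encoder receives a (subject, predicate,
# object) triplet with one element replaced by [MASK] and is trained to
# predict the masked element. All sizes and tokens are illustrative.

import torch
import torch.nn as nn

VOCAB = ["[MASK]", "man", "riding", "horse", "dog", "holding", "frisbee"]
token_to_id = {t: i for i, t in enumerate(VOCAB)}

class TripletMaskedPredictor(nn.Module):
    def __init__(self, vocab_size, dim=64, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.pos = nn.Parameter(torch.zeros(3, dim))  # slots: subject, predicate, object
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.head = nn.Linear(dim, vocab_size)  # classifier over the vocabulary

    def forward(self, token_ids):
        x = self.embed(token_ids) + self.pos  # (batch, 3, dim)
        x = self.encoder(x)                   # contextualize the triplet
        return self.head(x)                   # (batch, 3, vocab_size)

# One training step: mask the predicate of ("man", "riding", "horse").
model = TripletMaskedPredictor(len(VOCAB))
inputs = torch.tensor([[token_to_id["man"], token_to_id["[MASK]"], token_to_id["horse"]]])
target = torch.tensor([token_to_id["riding"]])

logits = model(inputs)[:, 1, :]              # logits for the masked (predicate) slot
loss = nn.functional.cross_entropy(logits, target)
loss.backward()                              # an optimizer step would follow in practice
```

At inference time, the same mechanism can be run with the predicate slot masked for every grounded subject-object pair, so the predicted token fills in the relationship edge of the scene graph.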
Experimental Results
Empirical evidence supports the efficacy of this approach. The Transformer model trained solely on image-caption pairs demonstrated a 30% relative improvement in scene graph generation over recent methods reliant on manually annotated scene graph data. Importantly, the authors also explore open-set scene graph generation, wherein scene graphs are generated for vocabulary items not seen during training, marking a step toward more generalized and flexible models.
Implications
The implications of learning scene graphs from natural language supervision are manifold. Practically, reducing the reliance on labor-intensive human annotation paves the way for scaling scene graph generation to larger and more varied datasets. Theoretically, this work suggests that much of the semantic structure needed for image interpretation can be learned from language, reinforcing the potential for further cross-modal learning developments in artificial intelligence.
Speculative Outlook
Future developments could explore scaling the model to massive web-scale datasets such as Conceptual Captions, further enhancing its robustness and flexibility in the open-set setting. Moreover, integrating these models into downstream vision-language tasks, such as visual question answering and image captioning, could improve performance by providing a richer, structured representation of scene information.
The introduction of scene graph generation from natural language supervision as a viable alternative to conventional methods solidifies the paper as a significant contribution to both the computer vision and natural language processing fields. The prospects for such cross-disciplinary methodologies highlight a promising trajectory for AI research, emphasizing the symbiotic relationship between visual and linguistic information.