Bridging Knowledge Graphs to Generate Scene Graphs
The paper "Bridging Knowledge Graphs to Generate Scene Graphs" by Alireza Zareian, Svebor Karaman, and Shih-Fu Chang presents a novel approach to scene graph generation (SGG) that integrates commonsense knowledge graphs directly into the process. The task is to extract a semantic representation of an image, its entities and the interactions between them, and the authors leverage structured commonsense knowledge to improve both accuracy and interpretability.
Technical Contributions
The paper introduces a unified perspective, treating a scene graph as an instantiation of a commonsense knowledge graph. The scene graph generation task is reformulated as bridging these two graphs, with the objective of linking each scene entity and predicate to their respective classes in the commonsense graph. This connection forms a bridge that translates image-specific instances into generalized concepts.
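To make the bridging formulation concrete, the sketch below represents the two graphs and their bridge edges as simple Python data structures. The class names and field layout are illustrative assumptions chosen for clarity, not the authors' code or data format.

```python
# Illustrative data structures only; names and fields are hypothetical, not the authors' API.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class CommonsenseGraph:
    # Class-level nodes: entity classes ("person", "bike") and predicate classes ("riding").
    entity_classes: List[str]
    predicate_classes: List[str]
    # Class-to-class edges, e.g. ontological or affordance relations: (head, relation, tail).
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

@dataclass
class SceneGraph:
    # Instance-level nodes: one per detected object and one per candidate subject-object pair.
    entity_nodes: List[int]                      # indices of object proposals
    predicate_nodes: List[Tuple[int, int]]       # (subject_index, object_index) pairs
    # Bridge edges ground each instance node in a class node of the commonsense graph.
    entity_bridges: Dict[int, str] = field(default_factory=dict)
    predicate_bridges: Dict[Tuple[int, int], str] = field(default_factory=dict)
```

Under this view, generating a scene graph amounts to predicting the two bridge mappings: once every instance node is linked to its class, the image-specific graph is grounded in the generalized commonsense graph.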
To achieve this, the authors propose the Graph Bridging Network (GB-Net), a graph-based neural architecture that iteratively propagates information between the scene and commonsense graphs as well as within each graph. Node representations and bridge edges are thus refined jointly rather than predicted in a single pass, with each update drawing on the context established by the previous one. Notably, GB-Net surpasses previous methods on standard benchmarks, setting a new state of the art.
Methodological Details
GB-Net employs a message-passing mechanism inspired by the Gated Graph Neural Networks (GGNN) framework. Starting from initial scene graph proposals derived from Faster R-CNN object detections, the network propagates messages to update node representations. The key innovation lies in dynamically refining the bridge edges over multiple iterations, based on learned similarities between instance nodes and commonsense class nodes. This iterative refinement helps the network resolve visual ambiguities by drawing on commonsense context.
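The following PyTorch sketch conveys the flavor of this propagate-then-bridge loop under simplified assumptions: a single node type per graph, dot-product similarity, and one shared message transformation. It is not the authors' implementation, which distinguishes many edge types and message directions.

```python
# Minimal sketch of iterative bridge refinement; dimensions and module names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BridgeRefinementSketch(nn.Module):
    def __init__(self, dim: int, num_steps: int = 3):
        super().__init__()
        self.num_steps = num_steps
        self.scene_gru = nn.GRUCell(dim, dim)    # gated update for instance-node states
        self.common_gru = nn.GRUCell(dim, dim)   # gated update for class-node states
        self.msg_proj = nn.Linear(dim, dim)      # message transformation

    def forward(self, scene_feats: torch.Tensor, class_feats: torch.Tensor) -> torch.Tensor:
        # scene_feats: (num_instances, dim) from object/predicate proposals
        # class_feats: (num_classes, dim) initial commonsense-class embeddings
        for _ in range(self.num_steps):
            # 1) Bridge weights as softmax-normalized similarities between instances and classes.
            bridge = F.softmax(scene_feats @ class_feats.t(), dim=-1)   # (instances, classes)

            # 2) Pass messages across the bridge in both directions.
            msg_to_scene = self.msg_proj(bridge @ class_feats)          # class -> instance
            msg_to_common = self.msg_proj(bridge.t() @ scene_feats)     # instance -> class

            # 3) GGNN-style gated updates of node states.
            scene_feats = self.scene_gru(msg_to_scene, scene_feats)
            class_feats = self.common_gru(msg_to_common, class_feats)

        # Final bridge weights act as classification scores over entity/predicate classes.
        return F.softmax(scene_feats @ class_feats.t(), dim=-1)
```

As a usage example, `BridgeRefinementSketch(256)(torch.randn(10, 256), torch.randn(150, 256))` would refine ten proposal nodes against 150 class nodes and return a 10x150 matrix of bridge scores.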
The authors incorporate multiple types of information to construct the commonsense graph, including ontological, affordance-based, and co-occurrence data. These extensive semantic connections enable the model to generalize and infer novel relationships, critical for robust scene graph generation.
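As a toy illustration of how such heterogeneous edges might be collected, consider the snippet below; the triples are invented for illustration and are not taken from the paper's released commonsense graph.

```python
# Hypothetical edge lists for the three kinds of commonsense connections mentioned above.
ontological_edges   = [("dog", "is_a", "animal"), ("riding", "is_a", "interaction")]
affordance_edges    = [("person", "capable_of", "riding"), ("bike", "used_for", "riding")]
cooccurrence_edges  = [("person", "co_occurs_with", "bike")]

commonsense_edges = ontological_edges + affordance_edges + cooccurrence_edges
# Each edge type can be given its own message-passing parameters downstream, so the model
# treats ontological structure differently from statistical co-occurrence.
```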
Experimental Results
The researchers evaluate GB-Net on the Visual Genome dataset across three tasks: scene graph generation (SGGen), scene graph classification (SGCls), and predicate classification (PredCls). The results show that GB-Net consistently outperforms existing methods across 24 metrics, most notably in mean recall (mR), which measures how evenly the model recognizes both frequent and rare predicate classes.
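For reference, mean recall averages the per-predicate-class recall so that rare predicates weigh as much as frequent ones. The sketch below shows the core computation under the simplifying assumption that predicted triplets can be matched to ground truth by equality; real SGG evaluation additionally requires bounding-box IoU matching.

```python
# Simplified mean recall (mR@K); variable names and the matching rule are illustrative.
from collections import defaultdict

def mean_recall_at_k(gt_triplets, predicted_triplets, k):
    """gt_triplets / predicted_triplets: per-image lists of (subject, predicate, object),
    with predictions assumed sorted by decreasing confidence."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for gt, preds in zip(gt_triplets, predicted_triplets):
        top_k = set(preds[:k])                  # keep only the K most confident predictions
        for triplet in gt:
            predicate = triplet[1]
            totals[predicate] += 1
            if triplet in top_k:
                hits[predicate] += 1
    per_class_recall = [hits[p] / totals[p] for p in totals]
    return sum(per_class_recall) / len(per_class_recall)
```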
Through ablation studies, the paper underscores the importance of incorporating commonsense knowledge and iterative message passing, showing that each added component yields a measurable performance improvement.
Implications and Future Directions
The integration of commonsense knowledge into visual comprehension models signifies a pivotal direction for enhancing scene graph generation tasks, potentially impacting applications like Visual Question Answering, image captioning, and autonomous driving perception systems. This approach offers promising avenues for building systems that are not only visually adept but also contextually aware and semantically rich.
Further advancements may explore more diverse and expansive sources of commonsense knowledge, better handling of category imbalances, and enhanced scalability to larger datasets and real-time applications. The public release of the authors' commonsense graph and code encourages reproducibility and fosters future research to build upon this work.
In conclusion, the paper presents a significant stride toward a more generalized and semantically grounded approach to scene graph generation, leveraging structured external knowledge to interpret complex visual scenes reliably.