
Image Scene Graph Generation (SGG) Benchmark (2107.12604v1)

Published 27 Jul 2021 in cs.CV

Abstract: There is a surge of interest in image scene graph generation (object, attribute and relationship detection) due to the need of building fine-grained image understanding models that go beyond object detection. Due to the lack of a good benchmark, the reported results of different scene graph generation models are not directly comparable, impeding the research progress. We have developed a much-needed scene graph generation benchmark based on the maskrcnn-benchmark and several popular models. This paper presents main features of our benchmark and a comprehensive ablation study of scene graph generation models using the Visual Genome and OpenImages Visual relationship detection datasets. Our codebase is made publicly available at https://github.com/microsoft/scene_graph_benchmark.

Citations (36)

Summary

  • The paper introduces a novel benchmark that reliably evaluates image scene graph generation by integrating popular models across diverse datasets.
  • Its decoupled architecture separates object detection from relationship analysis, significantly improving performance compared to joint learning methods.
  • Extensive ablation studies show that frequency-based baselines can rival complex models, highlighting the significant impact of data priors on results.

Image Scene Graph Generation (SGG) Benchmark

The paper "Image Scene Graph Generation (SGG) Benchmark" addresses the necessity of a reliable benchmark for evaluating image scene graph generation models. The primary goal is to enhance fine-grained image understanding by detecting objects, attributes, and relationships, which are pivotal for tasks beyond mere object detection. The paper introduces a novel benchmark based on the maskrcnn-benchmark and several prevalent models to facilitate consistent measurement of scene graph generation capabilities.

Overview

Scene Graph Generation (SGG) is integral to understanding visual content more comprehensively. It offers a graphical structure capturing objects, their attributes, and the interrelations in an image. This has significant applications in computer vision and vision-language tasks such as image retrieval, captioning, question answering, and scene understanding.
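To make the graphical structure concrete, here is a minimal illustrative representation of a scene graph: objects carry class labels, boxes, and attributes, while relationships are (subject, predicate, object) triplets over object indices. The class names and fields are hypothetical sketches, not the benchmark's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraphObject:
    """A detected object node: class label, bounding box, attributes."""
    label: str
    box: tuple                      # (x1, y1, x2, y2)
    attributes: list = field(default_factory=list)

@dataclass
class SceneGraph:
    """Object nodes plus (subject_idx, predicate, object_idx) triplets."""
    objects: list
    relationships: list

# A toy graph for "a man riding a brown horse"
man = SceneGraphObject("man", (10, 20, 60, 120))
horse = SceneGraphObject("horse", (40, 50, 160, 140), attributes=["brown"])
graph = SceneGraph(objects=[man, horse], relationships=[(0, "riding", 1)])

# Enumerate triplets as (subject, predicate, object) label strings
triplets = [(graph.objects[s].label, p, graph.objects[o].label)
            for s, p, o in graph.relationships]
print(triplets)  # [('man', 'riding', 'horse')]
```

This triplet view is exactly what downstream tasks such as image retrieval or question answering consume.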

Numerous prior efforts have been made to develop SGG models, but the absence of a standardized benchmark has hindered side-by-side evaluation. The paper presents a benchmark that supports multiple datasets—Visual Genome and Open Images—and integrates several popular scene graph generation methods into a cohesive framework. These include Iterative Message Passing (IMP), the Multi-level Scene Description Network (MSDN), Graph R-CNN (GRCNN), Neural Motifs (NM), and RelDN.

Innovations and Contributions

The benchmark introduces noteworthy improvements and capabilities:

  • Dataset Flexibility: Supports both the Visual Genome (VG) and Open Images (OI) datasets and can be adapted to custom datasets, with evaluation metrics consistent with existing challenges.
  • Support for Popular Methods: Integrates five widely used algorithms, including effective methods such as RelDN.
  • Decoupled Model Architecture: Separates the object detection and relationship modules, allowing any object detector to be plugged in. This design improves performance by mitigating interference from joint task learning.
  • Comprehensive Visualization and Tools: Uses a portable TSV format for datasets and provides visualization tools to aid data inspection and manipulation.
  • Performance Improvements: Benchmarks demonstrate state-of-the-art results on relationship detection across the OI and VG datasets. The benchmark also provides feature extraction pipelines for downstream vision-language tasks.
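The decoupled two-stage design described above can be sketched as follows. The detector and relation classifier here are trivial stubs (the real benchmark uses trained networks built on maskrcnn-benchmark); the point is the structure: object detection runs first and independently, then a relation module scores every ordered object pair.

```python
from itertools import permutations

def detect_objects(image):
    """Stage 1: any off-the-shelf detector can be swapped in here.
    Returns (label, box, score) tuples; stubbed for illustration."""
    return [("man", (10, 20, 60, 120), 0.95),
            ("horse", (40, 50, 160, 140), 0.90)]

def classify_relation(image, subj, obj):
    """Stage 2: score a predicate for an ordered object pair.
    A real model would pool visual features from the pair's union box;
    stubbed here with a trivial lookup."""
    table = {("man", "horse"): ("riding", 0.8)}
    return table.get((subj[0], obj[0]), ("no_relation", 0.1))

def generate_scene_graph(image):
    objects = detect_objects(image)     # trained and frozen separately
    relations = []
    for i, j in permutations(range(len(objects)), 2):
        pred, score = classify_relation(image, objects[i], objects[j])
        if pred != "no_relation":
            relations.append((i, pred, j, score))
    return objects, relations

objs, rels = generate_scene_graph(image=None)
print(rels)  # [(0, 'riding', 1, 0.8)]
```

Because the two stages communicate only through detected boxes and labels, the detector can be upgraded or retrained without touching the relationship module.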

Benchmarking and Results

In extensive tests, the RelDN model emerged as the most capable among the examined methods for both the Open Images and Visual Genome datasets. Notably, simple frequency baselines provided competitive performance, highlighting the challenge of distinguishing meaningful relationships from mere statistical co-occurrence.
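A frequency baseline of the kind mentioned above ignores the image entirely and predicts, for each (subject class, object class) pair, the predicate most often seen for that pair in training. This minimal sketch (with toy data) shows why such a baseline can be surprisingly competitive: many relationships are highly predictable from class co-occurrence alone.

```python
from collections import Counter, defaultdict

# Toy training triplets: (subject_class, predicate, object_class)
train = [("man", "riding", "horse"),
         ("man", "riding", "horse"),
         ("man", "feeding", "horse"),
         ("dog", "on", "couch")]

# Count predicate frequencies per (subject, object) class pair
freq = defaultdict(Counter)
for s, p, o in train:
    freq[(s, o)][p] += 1

def predict_predicate(subj_class, obj_class):
    """Return the most common training predicate for this class pair,
    using no visual evidence at all."""
    counts = freq.get((subj_class, obj_class))
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(predict_predicate("man", "horse"))  # riding
```

Beating this baseline requires a model to extract relational evidence from pixels rather than restate dataset statistics.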

Ablation Studies

An extensive ablation study was conducted to isolate the contribution of various components to overall model performance. Key findings include:

  • Decoupled Architecture Superiority: Models with separate object and relationship detection components outperform integrated models.
  • Impact of Object Detection Quality: Object detection quality significantly affects scene graph generation performance.
  • Frequency Prior Influence: The models' relation classification capabilities closely mirror frequency-based baselines, indicating a reliance on training data priors.
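SGG results of this kind are commonly reported as recall@K over predicted triplets: the fraction of ground-truth (subject, predicate, object) triplets recovered among the top-K scored predictions. The helper below is an illustrative sketch of that metric, not the benchmark's exact evaluation code (which also handles box matching and dataset-specific rules).

```python
def recall_at_k(gt_triplets, scored_preds, k):
    """Fraction of ground-truth triplets found in the top-k predictions.
    scored_preds is a list of ((subj, pred, obj), score) pairs."""
    topk = [t for t, _ in
            sorted(scored_preds, key=lambda x: -x[1])[:k]]
    hits = sum(1 for t in gt_triplets if t in topk)
    return hits / len(gt_triplets)

gt = [("man", "riding", "horse"), ("horse", "eating", "grass")]
preds = [(("man", "riding", "horse"), 0.9),
         (("man", "near", "horse"), 0.7),
         (("horse", "eating", "grass"), 0.4)]
print(recall_at_k(gt, preds, k=2))  # 0.5
```

With k=2 only the "riding" triplet makes the cut, so recall is 1/2; raising k to 3 would recover both ground-truth triplets.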

Implications and Future Directions

The benchmark sets a higher standard for evaluating SGG models and highlights areas needing advancement, such as reducing reliance on frequency priors and improving relationship detection directly from visual cues. Future research might focus on developing methodologies that can harness more nuanced visual information for better semantic understanding.

Overall, this paper provides a robust platform for further exploration and evaluation in the field of scene graph generation, encouraging a move towards more sophisticated and semantically rich visual understanding models.
