- The paper introduces FACTUAL-MR, an intermediate representation that standardizes caption annotation and converts deterministically into faithful, consistent scene graphs.
- The FACTUAL dataset comprises 40,369 carefully annotated examples and yields parsers that are substantially more accurate than existing methods.
- A new embedding-based metric, SoftSPICE, paired with the improved parser, achieves state-of-the-art results on image caption evaluation and zero-shot image retrieval.
An Expert Overview of "FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing"
Textual scene graph parsing plays a crucial role in vision-language applications: it converts descriptive captions into structured scene graphs that capture objects, their attributes, and the relationships between them. The paper under review introduces FACTUAL, a new dataset designed to address two prevailing weaknesses of textual scene graph parsers: unfaithfulness and inconsistency.
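To make the target output concrete, the snippet below shows the kind of structure a textual scene graph parser produces. The caption and triples are illustrative examples, not drawn from the FACTUAL dataset.

```python
# Illustrative only: a caption and the scene graph a parser might emit.
caption = "a young girl riding a brown horse"

# A scene graph is commonly represented as a set of triples:
# (subject, relation, object) for relationships and
# (entity, "is", attribute) for attributes.
scene_graph = [
    ("girl", "is", "young"),      # attribute triple
    ("horse", "is", "brown"),     # attribute triple
    ("girl", "riding", "horse"),  # relationship triple
]
```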
Overview of the Research
The authors identify two significant challenges that limit existing scene graph parsers: the unfaithfulness and the inconsistency of the graphs they generate. They propose a new benchmark dataset, FACTUAL, which employs an intermediate representation named FACTUAL-MR to mitigate these issues. The representation enables captions to be converted deterministically into faithful and consistent scene graphs.
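The paper specifies FACTUAL-MR in detail; the sketch below only illustrates the general idea of a meaning representation that converts deterministically into triples. The field names and structure are my own simplification, not the paper's schema.

```python
from dataclasses import dataclass, field

@dataclass
class Fact:
    """A hypothetical relational fact; FACTUAL-MR's actual units differ."""
    subject: str
    predicate: str  # e.g. a normalized verb such as "ride"
    obj: str

@dataclass
class CaptionMR:
    """Simplified stand-in for an intermediate meaning representation."""
    attributes: dict[str, list[str]] = field(default_factory=dict)
    facts: list[Fact] = field(default_factory=list)

def mr_to_triples(mr: CaptionMR) -> list[tuple[str, str, str]]:
    """Deterministic conversion: the same MR always yields the same graph,
    so no ambiguity is introduced at this step."""
    triples = [(ent, "is", attr)
               for ent, attrs in mr.attributes.items() for attr in attrs]
    triples += [(f.subject, f.predicate, f.obj) for f in mr.facts]
    return triples

mr = CaptionMR(
    attributes={"girl": ["young"], "horse": ["brown"]},
    facts=[Fact("girl", "ride", "horse")],
)
print(mr_to_triples(mr))
# [('girl', 'is', 'young'), ('horse', 'is', 'brown'), ('girl', 'ride', 'horse')]
```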
The dataset extends the Visual Genome (VG) dataset, with captions re-annotated in FACTUAL-MR. The authors provide empirical evidence that a parser trained on FACTUAL surpasses current methods in both faithfulness and consistency. They also introduce a new metric, SoftSPICE, which, used with the improved parser, achieves state-of-the-art results on several benchmarks for image caption evaluation and zero-shot image retrieval.
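Since the parser is a fine-tuned sequence-to-sequence model, querying it follows the standard text-to-text pattern. The checkpoint name below is a placeholder, not an official release identifier; this is a usage sketch, not the authors' exact pipeline.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint name; substitute the actual FACTUAL-T5 weights.
MODEL_NAME = "your-org/factual-t5-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

caption = "a young girl riding a brown horse"
inputs = tokenizer(caption, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)

# The model emits a linearized representation, which is then converted
# deterministically into scene graph triples.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```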
Key Contributions and Methodology
- FACTUAL-MR Representation: The authors introduce FACTUAL-MR, an intermediate representation that standardizes annotation and guarantees faithfulness and consistency in the generated scene graphs. It is designed to capture the semantics of a caption while converting deterministically into a scene graph, avoiding the conversion errors of syntactically driven methods such as dependency parsing.
- High-Quality Annotated Dataset: FACTUAL comprises 40,369 examples produced through a rigorous annotation process that pairs initial annotation with expert-led verification to remove errors and enforce consistency. This scale and quality enable training substantially more accurate parsers.
- Enhanced Scene Graph Parsers: FACTUAL-T5, the parser trained on the new benchmark, outperforms existing models on both intrinsic metrics (SPICE and Set Match) and extrinsic evaluations such as image caption evaluation.
- Introduction of SoftSPICE: This new metric scores scene graph similarity via embedding-based soft matching, outperforming traditional graph-based metrics and further improving state-of-the-art evaluation methods such as CLIPScore when combined with them (see the sketch after this list).
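The snippet below is a minimal sketch of an embedding-based soft graph-matching score in the spirit of SoftSPICE, assuming a generic sentence encoder (sentence-transformers here) and a simple soft-F1 aggregation; the paper's exact phrase construction and aggregation may differ.

```python
from sentence_transformers import SentenceTransformer

# Any sentence encoder works for this sketch; the paper's choice may differ.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def soft_match(candidate: list[str], reference: list[str]) -> float:
    """Soft F1 between two sets of verbalized graph elements: each phrase
    is embedded, and each side is scored by its best cosine match."""
    cand_emb = encoder.encode(candidate, normalize_embeddings=True)
    ref_emb = encoder.encode(reference, normalize_embeddings=True)
    sims = cand_emb @ ref_emb.T              # cosine similarity matrix
    precision = sims.max(axis=1).mean()      # best match per candidate phrase
    recall = sims.max(axis=0).mean()         # best match per reference phrase
    return float(2 * precision * recall / (precision + recall))

# Graph elements verbalized as short phrases (construction is illustrative).
candidate = ["young girl", "brown horse", "girl riding horse"]
reference = ["girl", "horse is brown", "girl rides horse"]
print(soft_match(candidate, reference))
```

Unlike exact graph matching, such a score gives partial credit when a predicted element is a close paraphrase of a reference element, which is why it can complement embedding-only scores like CLIPScore.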
Evaluation and Implications
The evaluations across various datasets and tasks indicate that FACTUAL represents a substantial step forward in ensuring the faithfulness and consistency of scene graph parsers. Particularly striking is the performance boost in downstream applications such as image caption evaluation and zero-shot image retrieval. This underscores the practical importance of addressing dataset and annotation quality, which FACTUAL exemplifies.
The methodology suggests that rigorous semantic annotation, combined with a well-designed intermediate representation, can substantially improve the accuracy of scene graph parsing, paving the way for more complex and semantically demanding vision-language applications.
Future Directions
While FACTUAL addresses many existing limitations, future research could explore integrating multi-modal context to resolve ambiguities that persist in multi-object scenes. Additionally, extending the dataset with bounding-box alignments could support region-of-interest localization, leveraging the full potential of scene graphs in semantic image analysis.
In conclusion, the FACTUAL benchmark and its associated methods are an important contribution to textual scene graph parsing. By improving faithfulness and consistency, this research supports the development of reliable parsers that strengthen the connection between language and vision systems.