DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement (2506.15583v1)

Published 18 Jun 2025 in cs.CL

Abstract: Vision-Language Models (VLMs) now generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers originally designed for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. To address this, we introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), supported by our dataset DiscoSG-DS, which comprises 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs for images. Each caption averages 9 sentences, and each graph contains at least 3 times more triples than those in existing datasets. While fine-tuning large PLMs (i.e., GPT-4) on DiscoSG-DS improves SPICE by approximately 48% over the best sentence-merging baseline, high inference cost and restrictive licensing hinder its open-source use, and smaller fine-tuned PLMs struggle with complex graphs. We propose DiscoSG-Refiner, which drafts a base graph using one small PLM, then employs a second PLM to iteratively propose graph edits, reducing full-graph generation overhead. Using two Flan-T5-Base models, DiscoSG-Refiner still improves SPICE by approximately 30% over the best baseline while achieving 86 times faster inference than GPT-4. It also consistently improves downstream VLM tasks like discourse-level caption evaluation and hallucination detection. Code and data are available at: https://github.com/ShaoqLin/DiscoSG

Summary

  • The paper introduces DiscoSG, a new task and dataset (DiscoSG-DS) for discourse-level text scene graph parsing, addressing the limitations of sentence-centric methods on complex, multi-sentence descriptions.
  • DiscoSG-Refiner is proposed as an iterative method utilizing smaller models for efficient and accurate discourse graph refinement, significantly improving performance over sentence-level merging.
  • DiscoSG-Refiner improves SPICE by approximately 30% over the best sentence-merging baseline and runs 86 times faster than fine-tuned GPT-4, while also enhancing downstream tasks like discourse caption evaluation and hallucination detection.

Overview of DiscoSG: Discourse-Level Text Scene Graph Parsing

The paper "DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement" introduces a new task, Discourse-level text Scene Graph parsing (DiscoSG), and an approach to the challenges it raises for current text scene graph parsers. The authors highlight a shift in vision-language models (VLMs), which now produce detailed discourse-level descriptions that strain conventional parsers designed for single-sentence inputs. In response, this work provides a dataset, DiscoSG-DS, and a method, DiscoSG-Refiner, that better align parsing methodology with the complexity of discourse-level inputs.

Dataset and Challenges

DiscoSG-DS combines 400 manually annotated examples and 8,430 synthetically generated instances tailored for discourse-level parsing. Each annotation captures nuanced linguistic phenomena across multiple sentences, including cross-sentence coreference, long-range relational dependencies, and implicit information inference. The dataset's scale and depth significantly exceed preceding datasets: its graphs contain at least three times more relational triples than those found in earlier collections such as Visual Genome (VG) or FACTUAL.
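
To make the data format concrete, the sketch below shows a hypothetical discourse-level caption paired with a scene graph represented as (subject, predicate, object) triples, in the spirit of DiscoSG-DS; the caption and triples are invented for illustration, not drawn from the dataset.

```python
# Hypothetical multi-sentence caption with cross-sentence coreference:
# "He" in the second sentence refers back to "man" in the first.
caption = (
    "A man in a red jacket stands by a bicycle. "
    "He is holding a helmet. "
    "The bicycle leans against a brick wall."
)

# A discourse-level graph resolves the coreference, so "He" contributes
# a triple about the same "man" node rather than a new "he" node.
graph = {
    ("man", "wearing", "red jacket"),
    ("man", "stand by", "bicycle"),
    ("man", "holding", "helmet"),  # from "He is holding a helmet."
    ("bicycle", "lean against", "brick wall"),
}
```

Representing the graph as a set of triples makes downstream operations (merging, editing, scoring) simple set manipulations.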

The intricacies of discourse parsing — namely, the requirement to resolve cross-sentence references, establish long-range connections, and infer implicit relationships — underscore the inadequacy of existing sentence-centric parsers. The authors articulate that such parsers tend to merge sentence outputs, thereby missing these crucial contextual cues.
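
The fragmentation problem with sentence-merging can be illustrated with a toy example. Here the per-sentence parser outputs are stubbed by hand (a real baseline would run an actual sentence-level parser); note how the unresolved pronoun yields a spurious "he" node after merging.

```python
# Stubbed per-sentence parser outputs (hypothetical): each sentence is parsed
# in isolation, so the pronoun "he" is never linked back to "man".
per_sentence_graphs = [
    {("man", "stand by", "bicycle")},
    {("he", "holding", "helmet")},  # unresolved cross-sentence coreference
    {("bicycle", "lean against", "brick wall")},
]

def merge_graphs(graphs):
    """Naive sentence-merging baseline: union of per-sentence triple sets."""
    merged = set()
    for g in graphs:
        merged |= g
    return merged

merged = merge_graphs(per_sentence_graphs)
# "man" and "he" now appear as separate nodes, fragmenting a single entity.
```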

Methodology: DiscoSG-Refiner

DiscoSG-Refiner is an iterative approach designed to balance parsing accuracy against computational cost. Using two smaller Flan-T5-Base models, the framework operates through three core modules: Generator, Programmer, and Interpreter. The Generator first drafts a base graph from the input caption; the Programmer, an encoder-decoder model, then proposes edits to this graph over successive iterations; and the Interpreter executes those edits. By disentangling deletion and addition operations, the method refines the base graph with minimal overhead rather than regenerating the full graph at each step.
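
The draft-then-refine loop can be sketched as follows. The paper's Generator and Programmer are Flan-T5-based models; here they are replaced by placeholder functions, and the edit format (DELETE/ADD on triples) is an illustrative simplification of the paper's edit operations.

```python
def generate_base_graph(caption):
    """Generator stand-in: drafts an initial (imperfect) graph."""
    return {("he", "holding", "helmet"), ("man", "stand by", "bicycle")}

def propose_edits(caption, graph):
    """Programmer stand-in: proposes DELETE/ADD edits on the current graph.
    Here it hard-codes one coreference fix; a real model conditions on the
    caption and graph to generate edits."""
    edits = []
    if ("he", "holding", "helmet") in graph:
        edits.append(("DELETE", ("he", "holding", "helmet")))
        edits.append(("ADD", ("man", "holding", "helmet")))
    return edits

def apply_edits(graph, edits):
    """Interpreter: executes the proposed edits on the graph."""
    graph = set(graph)
    for op, triple in edits:
        if op == "DELETE":
            graph.discard(triple)
        elif op == "ADD":
            graph.add(triple)
    return graph

def refine(caption, max_iters=3):
    """Draft a base graph, then iteratively propose and apply edits."""
    graph = generate_base_graph(caption)
    for _ in range(max_iters):
        edits = propose_edits(caption, graph)
        if not edits:  # converged: no further edits proposed
            break
        graph = apply_edits(graph, edits)
    return graph
```

Because the Programmer emits only a small edit script rather than the whole graph, each iteration's decoding cost stays low even for large graphs, which is the source of the method's efficiency.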

DiscoSG-Refiner's performance is noteworthy, improving SPICE, a scene graph evaluation metric, by approximately 30% over the best sentence-merging baseline. It also accelerates generation substantially, achieving inference 86 times faster than fine-tuned GPT-4, while maintaining high accuracy across tasks.
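
SPICE-style scoring ultimately reduces to an F1 over matched triples between the predicted and reference graphs. The minimal version below uses exact string matching; the real metric also matches semantically via WordNet synonyms, so this is an approximation.

```python
def triple_f1(predicted, reference):
    """F1 over exactly-matched (subject, predicate, object) triples."""
    if not predicted or not reference:
        return 0.0
    matched = len(predicted & reference)
    precision = matched / len(predicted)
    recall = matched / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ref = {("man", "holding", "helmet"), ("bicycle", "lean against", "wall")}
pred = {("man", "holding", "helmet"), ("man", "riding", "bicycle")}
# one of two predicted triples matches, one of two references is recovered:
# precision = 0.5, recall = 0.5, F1 = 0.5
```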

Results and Implications

The paper's results show that DiscoSG-Refiner offers a substantial leap over existing methodologies for discourse-level scene graph parsing. Models trained on DiscoSG-DS enhance downstream tasks, including discourse-level caption evaluation and hallucination detection, further validating the utility of scene graphs in multimodal applications. Moreover, the introduced benchmark, D-FOIL, expands the landscape for evaluating hallucination detection within discourse-generated content.
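
One way scene graphs support hallucination detection, sketched under the assumption of exact triple matching (the function name and matching criterion are illustrative, not the paper's implementation): any caption triple with no support in the reference graph is flagged as a potential hallucination.

```python
def hallucinated_triples(caption_graph, reference_graph):
    """Flag caption triples unsupported by the reference graph."""
    return caption_graph - reference_graph

reference = {("man", "holding", "helmet"), ("man", "stand by", "bicycle")}
caption = {("man", "holding", "helmet"), ("man", "riding", "horse")}
flagged = hallucinated_triples(caption, reference)
# the unsupported "riding horse" triple is flagged
```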

Future Directions

The developments presented in this paper carve a path for further research in multimodal AI. While the current framework does not incorporate direct visual information from images, the dataset is constructed with images in mind, indicating a potential expansion area as models evolve. Achieving alignment between textual and visual data remains a complex task, but the groundwork laid by DiscoSG and DiscoSG-Refiner holds promise for exploration. Furthermore, the insights gleaned from optimally utilizing discourse-level data in LLMs may illuminate strategies for improving comprehension and generation tasks across diverse NLP fields.

In conclusion, this paper makes substantial contributions by addressing the shortcomings of existing scene graph parsers and providing tools and methodologies to facilitate the transition to discourse-level parsing. These innovations not only aid in improving current technologies but also open new avenues for research in AI's ability to understand and generate multi-faceted language descriptions.