Fine-grained Control of Image Caption Generation with Abstract Scene Graphs
The paper "Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs" introduces a novel approach for enhancing image captioning systems by integrating user intentions into the caption generation process. Traditional image captioning models primarily focus on generating descriptions that encapsulate the entirety of an image's content without accommodating specific user interests or desired levels of detail. This intention-agnostic nature of captioning models often results in generic and repetitive descriptions that lack diversity and personalization.
To address this challenge, the authors propose a framework that utilizes Abstract Scene Graphs (ASGs) to allow for fine-grained control over the image captioning process. An ASG is a directed graph composed of object, attribute, and relationship nodes, representing the elements a user is interested in describing and the level of detail desired. Unlike fully semantic scene graphs, ASGs are abstract and do not require concrete semantic labels, making them simpler to construct manually or automatically.
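To make the structure concrete, the following is a minimal sketch of how such an ASG might be represented in Python. The `ASG` class, the role constants, and the example graph are illustrative assumptions, not the paper's actual data format; the key point is that nodes carry only a role, with no concrete semantic label.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative node roles; an ASG node carries a role but no semantic label.
OBJECT, ATTRIBUTE, RELATIONSHIP = "object", "attribute", "relationship"

@dataclass
class ASG:
    roles: List[str] = field(default_factory=list)                # role of each node
    edges: List[Tuple[int, int]] = field(default_factory=list)    # directed (src, dst) pairs

    def add_node(self, role: str) -> int:
        """Add a node identified only by its role."""
        self.roles.append(role)
        return len(self.roles) - 1

    def add_edge(self, src: int, dst: int) -> None:
        self.edges.append((src, dst))

# Example graph for an intended caption shaped like "a <attribute> <object> <relationship> a <object>",
# e.g. "a red car parked beside a house".
g = ASG()
car = g.add_node(OBJECT)
red = g.add_node(ATTRIBUTE)
house = g.add_node(OBJECT)
beside = g.add_node(RELATIONSHIP)
g.add_edge(red, car)       # attribute -> the object it modifies
g.add_edge(car, beside)    # subject object -> relationship
g.add_edge(beside, house)  # relationship -> object of the relation
print(g.roles, g.edges)
```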
The core contribution of the paper is the ASG2Caption model, which is specifically designed to leverage ASGs in the caption generation process. This model features a role-aware graph encoder and a language decoder tailored for graphs, enabling it to incorporate user-defined intentions from ASGs into the generated captions. The role-aware encoder captures both the intention roles and semantic content of graph nodes, enhancing the model's ability to distinguish between object, attribute, and relationship nodes.
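As a rough illustration of how role information could be injected before graph contextualization, the sketch below adds a learned role embedding to each node's visual feature and runs one message-passing step over the ASG edges. The layer choices here (a single linear message function and a GRU-style state update) are simplifying assumptions for illustration, not the paper's exact encoder.

```python
import torch
import torch.nn as nn

class RoleAwareGraphEncoder(nn.Module):
    """Minimal sketch: role-aware node embedding plus one message-passing step."""

    def __init__(self, feat_dim: int = 512, num_roles: int = 3):
        super().__init__()
        self.role_embed = nn.Embedding(num_roles, feat_dim)  # object / attribute / relationship
        self.msg = nn.Linear(feat_dim, feat_dim)             # transform neighbor messages
        self.update = nn.GRUCell(feat_dim, feat_dim)         # fuse messages into node states

    def forward(self, node_feats, role_ids, edges):
        # node_feats: (N, feat_dim) visual features; role_ids: (N,); edges: list of (src, dst)
        x = node_feats + self.role_embed(role_ids)           # inject the intention role
        msgs = [torch.zeros_like(x[0]) for _ in range(x.size(0))]
        for src, dst in edges:                               # aggregate along ASG edges
            msgs[dst] = msgs[dst] + self.msg(x[src])
        agg = torch.stack(msgs)
        return self.update(agg, x)                           # contextualized node embeddings

# Toy usage with the 4-node example ASG (roles: 0=object, 1=attribute, 2=relationship)
enc = RoleAwareGraphEncoder()
feats = torch.randn(4, 512)
roles = torch.tensor([0, 1, 0, 2])
edges = [(1, 0), (0, 3), (3, 2)]
out = enc(feats, roles, edges)   # shape (4, 512)
```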
The language decoder employs a graph-based attention mechanism that considers both the semantic content and the structural information of the ASG. This mechanism includes a graph flow attention component, which ensures that the generated sequence respects the user-intended order indicated by the ASG structure. Additionally, a graph updating mechanism is incorporated to track and manage the access status of graph nodes during the caption generation process, preventing redundancy and omission of intended details.
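The decoding step below sketches how these pieces could fit together: a content attention over node embeddings, a flow step that shifts the previous step's attention along ASG edges, and a soft "visited" update that erases part of an attended node's state so it is not described twice. The fixed fusion weights and the sigmoid erase gate are assumptions made for illustration, not the paper's precise formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionStep(nn.Module):
    """Minimal sketch of one decoding step: content + flow attention, then node update."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # projects the decoder state into a query
        self.erase = nn.Linear(dim, dim)   # how much of an attended node to "use up"

    def forward(self, h_t, node_states, prev_attn, adj):
        # h_t: (dim,) decoder hidden state; node_states: (N, dim);
        # prev_attn: (N,) attention from the previous step; adj: (N, N) adjacency matrix
        content = F.softmax(node_states @ self.query(h_t), dim=0)  # what to describe next
        flow = adj.t() @ prev_attn                                 # follow the graph structure
        attn = 0.5 * content + 0.5 * flow                          # fixed-weight fusion (assumed)
        context = attn @ node_states                               # (dim,) context vector
        # soft graph update: partially erase attended nodes to avoid redundant descriptions
        erase = torch.sigmoid(self.erase(h_t))
        node_states = node_states - attn.unsqueeze(1) * erase * node_states
        return context, attn, node_states

# Toy usage: 4 nodes, attention starting on the subject object node
step = GraphAttentionStep()
states = torch.randn(4, 512)
adj = torch.zeros(4, 4)
for s, d in [(1, 0), (0, 3), (3, 2)]:
    adj[s, d] = 1.0
ctx, attn, states = step(torch.randn(512), states, torch.tensor([1.0, 0.0, 0.0, 0.0]), adj)
```

In this toy step, the flow term moves attention from the subject object to its relationship node, which is the intuition behind respecting the user-intended order encoded in the ASG.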
Empirical evaluations are conducted on the Visual Genome and MSCOCO datasets. The results demonstrate that, when conditioned on ASGs, the ASG2Caption model substantially improves controllability, outperforming baseline models across the evaluation metrics. It describes the objects, attributes, and relationships specified by an ASG more faithfully, achieving lower graph-structure alignment error. Furthermore, the approach generates diverse captions by sampling different ASGs for the same image, resulting in richer and more varied descriptions.
The implications of this research are significant for advancing interactive and context-sensitive image captioning systems. By enabling users to specify their interests and desired detail levels through ASGs, the framework offers a pathway toward more personalized and diverse image descriptions. These capabilities could be critical in applications ranging from assistive technologies for visually impaired users to intelligent digital content creation tools that cater to specific user preferences.
Future directions for this research could include the development of more sophisticated methods for automatic ASG generation, improving the scalability and usability of this approach in real-world applications. Additionally, further exploration into integrating more complex user intentions and contextual information could further refine the granularity and applicability of controllable image captioning systems.