
Unpaired Image Captioning via Scene Graph Alignments (1903.10658v4)

Published 26 Mar 2019 in cs.CV

Abstract: Most current image captioning models rely heavily on paired image-caption datasets. However, collecting large-scale image-caption paired data is labor-intensive and time-consuming. In this paper, we present a scene graph-based approach for unpaired image captioning. Our framework comprises an image scene graph generator, a sentence scene graph generator, a scene graph encoder, and a sentence decoder. Specifically, we first train the scene graph encoder and the sentence decoder on the text modality. To align the scene graphs between images and sentences, we propose an unsupervised feature alignment method that maps the scene graph features from the image to the sentence modality. Experimental results show that our proposed model can generate promising results without using any image-caption training pairs, outperforming existing methods by a wide margin.

Unpaired Image Captioning via Scene Graph Alignments

Recent advances in image captioning have relied primarily on paired image-caption datasets, typically using convolutional neural networks (CNNs) for image encoding and recurrent neural networks (RNNs) for sentence generation. However, acquiring large-scale paired datasets, especially across multiple languages, is a significant challenge. This paper introduces a method for generating image captions without any image-caption pairs. Leveraging scene graph representations, the framework integrates an image scene graph generator, a sentence scene graph generator, a scene graph encoder, a sentence decoder, and a feature alignment module.

Framework and Methodology

The proposed methodology addresses the absence of paired data by using scene graphs as an intermediate representation that bridges images and sentences. Scene graphs, whose nodes and edges represent objects, attributes, and relationships, are constructed for both modalities. The principal components of the framework are as follows:

  • Scene Graph Generation: The initial step involves generating scene graphs for images and sentences. For images, the authors use a scene graph detector to identify objects, attributes, and relationships. For sentences, a dependency parser converts text data into structured scene graphs.
  • Scene Graph Encoder: The encoder embeds the scene graph into feature vectors, employing distinct spatial graph convolutional networks for objects, attributes, and relations that aggregate contextual information from neighboring nodes (a minimal sketch follows this list).
  • Sentence Decoder: The attention-equipped RNN decoder translates the scene graph encodings into sentence predictions. This step involves computing relevance scores to prioritize elements in the scene graph that are critical for generating captions.
  • Cross-Modal Feature Alignment: A core innovation is the use of CycleGAN to align scene graph features between visual and textual modalities. The adversarial network facilitates unsupervised mapping, maintaining consistency across image and sentence feature transformations.
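
To make the encoder concrete, the following is a minimal, illustrative PyTorch sketch, not the authors' implementation: each (subject, predicate, object) triplet is contextualized by an MLP, and every node pools the messages of the triplets it participates in. The class name, dimensions, and mean-pooling aggregation are assumptions for exposition.

```python
import torch
import torch.nn as nn

class SceneGraphEncoder(nn.Module):
    """Illustrative GCN-style scene graph encoder (assumed design)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.rel_mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, node_feats, pred_feats, triplets):
        # node_feats: (num_nodes, dim) embeddings of object/attribute nodes
        # pred_feats: (num_triplets, dim) embeddings of predicates
        # triplets:   list of (subject_idx, predicate_idx, object_idx)
        messages = torch.zeros_like(node_feats)
        counts = node_feats.new_zeros(node_feats.size(0), 1)
        for subj, pred, obj in triplets:
            # Contextualize the whole triplet with one MLP pass.
            ctx = self.rel_mlp(torch.cat(
                [node_feats[subj], pred_feats[pred], node_feats[obj]]))
            for idx in (subj, obj):
                messages[idx] = messages[idx] + ctx
                counts[idx] += 1
        # Mean-pool incoming messages, then fuse with the original features.
        pooled = messages / counts.clamp(min=1)
        return self.node_mlp(torch.cat([node_feats, pooled], dim=-1))

# Example: "girl rides horse" with node 0 = girl, node 1 = horse.
enc = SceneGraphEncoder()
nodes, preds = torch.randn(2, 512), torch.randn(1, 512)
out = enc(nodes, preds, [(0, 0, 1)])  # -> (2, 512)
```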

In effect, unpaired captioning reduces to encoding image scene graph features and mapping them into the sentence feature space, where the sentence decoder pre-trained on text alone can consume them; the alignment objective is sketched below.
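
The following CycleGAN-style sketch assumes two mapping networks (G_i2s, G_s2i) and two discriminators (D_s, D_i); the least-squares adversarial form and the cycle weight lam are illustrative choices, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def alignment_losses(G_i2s, G_s2i, D_s, D_i, feat_img, feat_sent, lam=10.0):
    """CycleGAN-style objective for mapping image scene graph features
    into the sentence feature space and back (illustrative sketch)."""
    fake_sent = G_i2s(feat_img)    # image features mapped to sentence space
    fake_img = G_s2i(feat_sent)    # sentence features mapped to image space

    # Adversarial terms: mapped features should fool each discriminator.
    loss_gan = (F.mse_loss(D_s(fake_sent), torch.ones_like(D_s(fake_sent)))
                + F.mse_loss(D_i(fake_img), torch.ones_like(D_i(fake_img))))

    # Cycle consistency: mapping forth and back reconstructs the input.
    loss_cyc = (F.l1_loss(G_s2i(fake_sent), feat_img)
                + F.l1_loss(G_i2s(fake_img), feat_sent))

    return loss_gan + lam * loss_cyc
```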

Results and Evaluation

Evaluated with standard captioning metrics (BLEU, METEOR, ROUGE, and CIDEr), the unpaired model outperforms existing unpaired captioning techniques by a wide margin, an improvement the authors attribute to the adversarial feature alignment. Despite operating without paired datasets, the framework achieves strong results by exploiting the semantic structure captured in scene graph representations.
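
As a concrete illustration of one of these metrics, the snippet below computes a smoothed sentence-level BLEU-4 score with NLTK; the caption pair is invented, and published results are typically computed at corpus level with toolkits such as pycocoevalcap.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["a", "man", "riding", "a", "horse", "on", "a", "beach"]]
candidate = ["a", "man", "rides", "a", "horse", "on", "the", "beach"]

# Uniform 4-gram weights give standard BLEU-4; smoothing avoids zero
# scores when some higher-order n-gram has no match.
score = sentence_bleu(reference, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```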

Implications and Future Directions

The paper's contribution is significant in addressing the dataset bottleneck in machine learning. By avoiding dependence on extensive paired datasets, it opens up possibilities for multi-lingual image captioning across diverse domains. This method's adaptability can significantly reduce the resources needed for data collection in various languages, thus accelerating model deployment.

Looking forward, more advanced mapping techniques such as optimal transport, along with evaluation on additional datasets, could further enhance the model's capabilities. With ongoing improvements in scene graph generation and interpretation, unpaired image captioning frameworks are poised to tackle more complex problems in computer vision and natural language processing. The continuing evolution of unsupervised and semi-supervised methods presents opportunities to broaden this research beyond standard caption generation, potentially influencing fields such as visual storytelling, content creation, and AI-driven media synthesis.

Authors (6)
  1. Jiuxiang Gu
  2. Shafiq Joty
  3. Jianfei Cai
  4. Handong Zhao
  5. Xu Yang
  6. Gang Wang
Citations (162)