Insights into Vision-to-Language Tasks: Incorporating High-Level Concepts and External Knowledge
The paper under review addresses key challenges in Vision-to-Language (V2L) tasks, specifically image captioning and visual question answering (VQA). Building on the standard Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) architectures, the authors present a methodology that enhances existing models by incorporating high-level semantic attributes and external knowledge, achieving state-of-the-art performance on major benchmarks.
Image Captioning Using Attributes
The authors critically evaluate the conventional CNN-RNN approach, noting its direct transition from low-level image features to text generation. They propose an alternative that inserts a layer of semantic attributes, extracted with a CNN-based classification model trained on words commonly found in image captions. This builds a high-level abstraction similar to the one humans naturally employ when describing scenes. Their experiments across datasets, including Flickr8k, Flickr30k, and MS COCO, show substantial gains in BLEU, METEOR, and CIDEr scores, emphasizing the importance of a semantic layer in improving caption quality. The paper also notes that fine-tuning the CNN on the target image-description task improves results, but the attribute layer provides a greater performance boost.
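To make this concrete, here is a minimal PyTorch sketch of an attribute-conditioned caption decoder. All names (AttributeCaptioner, attr_head, the 2048-dimensional pooled feature size) and the choice to condition the LSTM through its initial hidden state are illustrative assumptions, not the authors' implementation; the point is only that predicted attribute probabilities, rather than raw CNN features, drive the text generator.

```python
# Hypothetical sketch of an attribute-conditioned captioner (not the paper's code).
import torch
import torch.nn as nn

class AttributeCaptioner(nn.Module):
    def __init__(self, num_attributes=256, vocab_size=10000,
                 embed_dim=256, hidden_dim=512):
        super().__init__()
        # Attribute predictor: maps pooled CNN features to per-word
        # attribute probabilities (multi-label scores in [0, 1]).
        self.attr_head = nn.Sequential(
            nn.Linear(2048, num_attributes),  # 2048 = assumed pooled CNN feature size
            nn.Sigmoid(),
        )
        # The attribute vector, not the raw CNN feature, conditions the decoder.
        self.attr_to_hidden = nn.Linear(num_attributes, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, cnn_features, captions):
        # cnn_features: (B, 2048) pooled image features
        # captions:     (B, T) token ids of the ground-truth caption
        attrs = self.attr_head(cnn_features)            # (B, num_attributes)
        h0 = torch.tanh(self.attr_to_hidden(attrs))     # (B, hidden_dim)
        h0 = h0.unsqueeze(0)                            # (1, B, hidden_dim)
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)                      # (B, T, embed_dim)
        hidden, _ = self.lstm(emb, (h0, c0))            # (B, T, hidden_dim)
        return self.out(hidden), attrs                  # word logits + attribute scores

# Toy usage with random tensors.
model = AttributeCaptioner()
feats = torch.randn(4, 2048)
caps = torch.randint(0, 10000, (4, 12))
logits, attrs = model(feats, caps)
print(logits.shape, attrs.shape)  # (4, 12, 10000) and (4, 256)
```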
Visual Question Answering with External Knowledge
For VQA, the authors extend their framework to incorporate information from large-scale knowledge bases (KBs) such as DBpedia, addressing the need for external or common-sense knowledge, particularly for high-level questions. Their attribute-based model predicts image content, which is combined with generated image captions; relevant information is then mined from the KB to answer visual questions. This matters most for questions that require more context than is visually apparent. The authors illustrate this with a question about "umbrellas" whose answer requires knowing that umbrellas are used for shade, knowledge that cannot be obtained from visual detection alone.
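The knowledge-mining step can be sketched roughly as below. The function name, query shape, and use of the SPARQLWrapper library are assumptions for illustration, not the paper's exact queries; the idea is simply to look up a short textual description of a predicted attribute on DBpedia's public SPARQL endpoint and pass that text, alongside attributes and captions, to the answer-generating model.

```python
# Hedged sketch: fetch an English description of a predicted attribute
# (e.g. "umbrella") from DBpedia. Query shape is illustrative only.
from SPARQLWrapper import SPARQLWrapper, JSON

def fetch_dbpedia_comment(term: str) -> str:
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    # Retrieve the English rdfs:comment of the resource matching the term.
    sparql.setQuery(f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX dbr:  <http://dbpedia.org/resource/>
        SELECT ?comment WHERE {{
            dbr:{term.capitalize()} rdfs:comment ?comment .
            FILTER (lang(?comment) = "en")
        }}
    """)
    results = sparql.query().convert()
    bindings = results["results"]["bindings"]
    return bindings[0]["comment"]["value"] if bindings else ""

# Example: external knowledge about umbrellas that is not visually detectable.
print(fetch_dbpedia_comment("umbrella"))
```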
Strong Numerical Outcomes and Model Performance
The paper reports a 70.98% accuracy on the Toronto COCO-QA dataset and state-of-the-art results of 59.50% on the VQA Challenge, underscoring the competitive edge of integrating high-level attributes and external knowledge. This boost is most noteworthy in high-level question categories such as "why," where the model leverages external knowledge extensively.
Implications and Future Directions
The incorporation of high-level semantic attributes and external knowledge into CNN-RNN frameworks represents a pivotal advance for V2L tasks. Such integration supports deeper contextual understanding and improves text generation by placing an explicit semantic representation between visual input and textual output.
Looking forward, the work sets a precedent for VQA models that can effectively mine and apply external knowledge bases. Further research could explore richer, more inferentially capable knowledge bases and reasoning mechanisms, extending the ability of such systems to address complex queries.
The paper delivers a compelling argument that equipping neural networks with structured intermediary representations and leveraging external knowledge can significantly enhance the performance and versatility of V2L models, paving the way for future innovations in this domain.