- The paper introduces a comprehensive dataset with over 100K images featuring detailed annotations of objects, attributes, relationships, regions, and QA pairs.
- The methodology leverages crowdsourced annotations and semantic linkages to WordNet to effectively connect language and vision for improved scene understanding.
- Baseline experiments quantify the difficulty of attribute and relationship prediction and show that region description models trained on Visual Genome outperform those trained on traditional captioning datasets.
Detailed Exploration of Visual Genome: Shaping Cognitive Understanding in Computer Vision
The paper "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations," authored by R. Krishna et al., introduces a substantial dataset designed to bridge the gap between image recognition and image understanding. While significant advancements have been made in image classification and object detection, cognitive tasks such as image description and question answering remain challenging. The authors argue that comprehending interactions and relationships between objects is crucial for cognitive tasks, a requirement that existing datasets tailored for perceptual tasks do not fulfill adequately. This exposition offers an in-depth examination of the Visual Genome dataset, detailing its composition, significance, and potential research avenues it enables.
Composition and Structure
Visual Genome is composed of over 100,000 images, each annotated with a dense set of descriptive data:
- Objects: Each image contains an average of 21 objects, contributing to a total of 4.1 million objects annotated in the dataset.
- Attributes: Attributes describe properties of objects, with an average of 18 attributes per image, populating the dataset with 1.6 million attributes.
- Relationships: Beyond object detection, understanding interactions is critical. The dataset includes an average of 18 relationships per image, totaling 1.8 million relationships.
- Region Descriptions: Each image is divided into multiple regions, each described with phrases, leading to 4.2 million region descriptions.
- Question-Answer Pairs: To model comprehensive scene understanding, the dataset incorporates 1.7 million QA pairs.
The dataset connects these annotations to WordNet synsets, providing a consistent semantic vocabulary across annotation types. This canonicalization to WordNet IDs disambiguates object, attribute, and relationship names, allowing the same concept to be linked and aggregated across images.
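To make the annotation structure concrete, here is a minimal sketch of how an object record and a relationship triple might be represented and canonicalized to WordNet synsets. It assumes Python with NLTK's WordNet interface; the record layout, field names, and sense-selection heuristic are illustrative assumptions, not the dataset's exact schema or the authors' crowdsourced pipeline.

```python
# Illustrative only: toy annotation records and a naive WordNet canonicalization step.
# The real dataset relies on crowd workers to verify the correct synset for each term.
from nltk.corpus import wordnet as wn  # requires a one-time nltk.download("wordnet")

def canonicalize(label: str):
    """Map a raw object/attribute/relationship label to a WordNet synset name, if any."""
    synsets = wn.synsets(label.replace(" ", "_"))
    # Naive heuristic: take the most common sense; Visual Genome instead verifies
    # the intended sense with human annotators.
    return synsets[0].name() if synsets else None

# Hypothetical records in the spirit of the dataset's objects and relationships.
obj = {"name": "man", "x": 48, "y": 12, "w": 210, "h": 390}
obj["synset"] = canonicalize(obj["name"])  # e.g. 'man.n.01'

rel = {"subject": "man", "predicate": "riding", "object": "horse"}
rel["predicate_synset"] = canonicalize(rel["predicate"])  # naive sense choice, see note above
```

In the published annotations, every object, attribute, and relationship carries such a synset ID, which is what allows the same concept to be aggregated across images.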
Implications and Research Potential
The paper thoroughly outlines the potential applications of the Visual Genome dataset, which are pivotal for both theoretical research and practical implementations:
- Dense Image Captioning: The dataset enables the development of models that move beyond generating a single caption per image to generating descriptions for multiple regions, enriching the narrative capacity of automated systems.
- Visual Question Answering (VQA): By providing detailed annotations, Visual Genome offers a dataset conducive to training models on complex question-answering tasks, pushing the boundaries of simple object and scene classification to higher-order reasoning.
- Relationship Extraction: The vast number of annotated relationships allows for the construction of models that can infer interactions, thereby improving action and spatial relationship recognition.
- Semantic Image Retrieval: The connected scene graphs and region descriptions enhance the precision of image retrieval systems, facilitating advanced search functions that consider object interactions.
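As a rough illustration of the retrieval point above, the sketch below ranks images by how many of a query's (subject, predicate, object) triples appear in their scene graphs. The triple representation, index, and scoring are assumptions for illustration, not the paper's retrieval system.

```python
# A minimal sketch of scene-graph-based retrieval: rank images by how many of the
# query's (subject, predicate, object) triples their scene graphs contain.
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), e.g. ("man", "riding", "horse")

def rank_images(query: Set[Triple], graphs: Dict[str, Set[Triple]]) -> List[str]:
    """Return image IDs sorted by descending overlap with the query triples."""
    scores = {img_id: len(query & triples) for img_id, triples in graphs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical index of per-image scene graphs.
index = {
    "img_1": {("man", "riding", "horse"), ("horse", "on", "beach")},
    "img_2": {("man", "holding", "umbrella")},
}
print(rank_images({("man", "riding", "horse")}, index))  # ['img_1', 'img_2']
```

A real system would need approximate matching (for example, synonym handling via the WordNet synsets, partial graph matches, or attribute constraints), but even exact triple overlap captures object interactions that keyword search over captions misses.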
Experimental Results
Several baseline experiments on Visual Genome illustrate the dataset's complexity and the challenges inherent in the tasks:
- Attribute Prediction: A feature representation trained on this dataset achieved a top-1 accuracy of 18.97% and a top-5 accuracy of 43.11% for attribute prediction, reflecting the wide visual variability of attributes.
- Relationship Prediction: Predicting object relationships is inherently difficult due to high intra-class variability. The baseline models obtained a top-1 accuracy of 8.74% and a top-5 accuracy of 26.69%.
- Region Description Generation: Utilizing NeuralTalk, models trained on Visual Genome outperform those trained on datasets like Flickr30K, achieving significantly higher BLEU, CIDEr, and METEOR scores, and 43.03% human-evaluated accuracy.
- Question Answering: A simple frequency-based baseline, which always predicts the most common answers, achieves accuracies ranging from 0.034 to 0.78 for the top-100 predicted answers, highlighting the dataset's long-tail answer distribution and the inherent difficulty of the task.
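For intuition about the frequency-based baseline and the top-k figures quoted above, here is a hedged sketch using invented toy data; it is not the authors' evaluation code, only a minimal illustration of why top-1 accuracy is low and top-k accuracy rises on a long-tailed answer distribution.

```python
# A toy frequency-based QA baseline: always predict the k most common training answers
# and count a hit if the ground-truth answer appears among them (top-k accuracy).
from collections import Counter
from typing import List

def topk_accuracy(train_answers: List[str], test_answers: List[str], k: int) -> float:
    """Accuracy of predicting the k most frequent training answers for every question."""
    top_k = {ans for ans, _ in Counter(train_answers).most_common(k)}
    hits = sum(1 for ans in test_answers if ans in top_k)
    return hits / len(test_answers)

# Invented long-tail answer distribution: top-1 is weak, top-k much stronger.
train = ["yes"] * 50 + ["no"] * 30 + ["two"] * 10 + ["red"] * 5 + ["skateboard"] * 5
test = ["yes", "no", "red", "giraffe", "two"]
print(topk_accuracy(train, test, k=1))    # 0.2  (only "yes" is covered)
print(topk_accuracy(train, test, k=100))  # 0.8  ("giraffe" is still missed)
```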
Future Directions
Looking ahead, the Visual Genome dataset opens numerous avenues for future exploration:
- Advanced Visual Search Engines: Leveraging the dataset to build image search tools that can understand nuanced user queries about image contents.
- Augmented Reality: Enhancing AR systems with comprehensive scene understanding, enabling interactions with virtual objects in a coherent and contextually accurate manner.
- Autonomous Systems: Equipping self-driving cars and robots with the ability to infer and understand complex scenes, enhancing their operational safety and efficiency.
Conclusion
The Visual Genome dataset introduces a paradigm shift in visual understanding by focusing on the dense annotation of images with rich semantic roles. This paper by Krishna et al. meticulously presents a dataset designed to foster advancements in cognitive computer vision tasks, setting the stage for a deeper and more nuanced understanding of visual data. With its extensive annotations, Visual Genome not only complements but extends beyond traditional datasets, laying a strong foundation for future developments in AI and machine learning.