
Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures (1601.03896v2)

Published 15 Jan 2016 in cs.CL and cs.CV

Abstract: Automatic description generation from natural images is a challenging problem that has recently received a large amount of interest from the computer vision and natural language processing communities. In this survey, we classify the existing approaches based on how they conceptualize this problem, viz., models that cast description as either generation problem or as a retrieval problem over a visual or multimodal representational space. We provide a detailed review of existing models, highlighting their advantages and disadvantages. Moreover, we give an overview of the benchmark image datasets and the evaluation measures that have been developed to assess the quality of machine-generated image descriptions. Finally we extrapolate future directions in the area of automatic image description generation.

Citations (355)

Summary

  • The paper presents a comprehensive survey categorizing image description models into generation- and retrieval-based approaches.
  • It details key datasets, including Flickr and MS COCO, that facilitate training and evaluation of captioning systems.
  • It underscores the need for improved evaluation metrics and diverse datasets to enhance multimodal understanding.

Overview of Automatic Description Generation from Images: A Survey

The paper "Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures" by Bernardi et al. presents a comprehensive analysis of the field of automatic image description generation, a domain intersecting computer vision (CV) and NLP. This intersection is crucial due to the increasing need for systems that interpret and articulate visual content using natural language. The authors categorize existing methodologies, discuss the available datasets, and evaluate the metrics used for assessing model performance.

Model Classifications

The survey divides existing models into two primary paradigms:

  1. Generation-Based Models: These approaches convert visual features into textual descriptions directly. Sub-categories include:
    • Direct Generation Models: These use pipeline architectures in which visual content such as objects, scenes, and actions is first detected and then rendered into text using language generation methods ranging from templates to neural networks.
    • Recurrent Neural Networks (RNNs): Leveraging LSTMs, these models capture sequential dependencies between the visual input and the textual output, generating descriptions word by word conditioned on the image (a minimal encoder-decoder sketch follows this list).
  2. Retrieval-Based Models: These models view image description as a retrieval problem, leveraging large datasets to find and customize existing human-generated descriptions for new images:
    • Visual Space Retrieval: Here, the system matches a new image with the most visually similar images in a database, reusing or synthesizing descriptions from these matches.
    • Multimodal Space Retrieval: By embedding images and descriptions into a common space, these models retrieve the most semantically similar text for a given image, and the same space supports the reverse, text-to-image direction (a toy retrieval sketch also follows this list).
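The RNN-based generation models above are typically realized as an encoder-decoder: a convolutional network encodes the image into a feature vector, which initializes an LSTM language model that emits the description word by word. The following is a minimal, illustrative PyTorch sketch of this idea; the dimensions, vocabulary size, and module layout are assumptions for exposition, not the architecture of any particular system in the survey.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Minimal image-feature -> LSTM caption decoder (illustrative sketch)."""

    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)  # image feature -> initial hidden state
        self.init_c = nn.Linear(feat_dim, hidden_dim)  # image feature -> initial cell state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, captions):
        # image_feats: (batch, feat_dim) from a pretrained CNN encoder
        # captions:    (batch, seq_len) word indices of the target description
        h0 = self.init_h(image_feats).unsqueeze(0)     # (1, batch, hidden_dim)
        c0 = self.init_c(image_feats).unsqueeze(0)
        emb = self.embed(captions)                     # (batch, seq_len, embed_dim)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                        # (batch, seq_len, vocab_size) logits

# Dummy forward pass; training would minimize cross-entropy over these logits:
# feats = torch.randn(2, 2048); caps = torch.randint(0, 10000, (2, 12))
# logits = CaptionDecoder()(feats, caps)               # -> (2, 12, 10000)
```

Multimodal-space retrieval can be illustrated just as compactly: images and candidate descriptions are projected into a shared space and ranked by cosine similarity. In the toy sketch below the projection matrices are random placeholders; in practice they are learned, for example with a ranking loss over matching and mismatching image-caption pairs.

```python
import numpy as np

def cosine_sim(query, candidates):
    # Cosine similarity between one query vector and each row of a candidate matrix.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query

rng = np.random.default_rng(0)
W_img = rng.normal(size=(128, 2048))        # assumed projection: image features -> shared space
W_txt = rng.normal(size=(128, 300))         # assumed projection: sentence features -> shared space

image_feat = rng.normal(size=2048)          # CNN feature of the query image
caption_feats = rng.normal(size=(5, 300))   # e.g. averaged word embeddings of 5 candidate captions

scores = cosine_sim(W_img @ image_feat, caption_feats @ W_txt.T)
best = int(np.argmax(scores))               # index of the highest-ranked description
```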

Datasets and Evaluation Measures

The paper provides an overview of several key datasets utilized in the field:

  • Flickr8K and Flickr30K: Collections with multiple human-generated captions per image, emphasizing diversity and richness of descriptions.
  • MS COCO: A large-scale dataset that has become the de facto standard, providing comprehensive annotations for training complex models (see the loading sketch after this list).
  • IAPR TC-12: One of the earlier datasets, notable for providing captions in multiple languages.
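As an example of how the MS COCO captions mentioned above are typically accessed, the snippet below uses the pycocotools COCO API. The annotation file path is an assumption and depends on where the dataset was downloaded.

```python
from pycocotools.coco import COCO

# Path is an assumption; point it at your local copy of the MS COCO caption annotations.
coco = COCO("annotations/captions_train2014.json")

img_ids = coco.getImgIds()
first_img = img_ids[0]

# Each image is paired with several human-written captions.
ann_ids = coco.getAnnIds(imgIds=first_img)
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])
```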

For evaluation, the authors discuss human judgments alongside automatic metrics such as BLEU, ROUGE, and the more recent CIDEr for quantifying the quality of machine-generated captions. They note the limitations of these metrics, particularly in capturing the nuanced, contextual nature of human language when applied to image descriptions, and advocate for improved evaluation paradigms; a small BLEU example is given below.
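For intuition, the snippet below computes a sentence-level BLEU score for a made-up candidate caption against two made-up references using NLTK; published evaluations typically report corpus-level BLEU, and smoothing is used here only because captions are short.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man is riding a bicycle down the street".split(),
    "a cyclist rides his bike along the road".split(),
]
candidate = "a man rides a bicycle on the road".split()

# BLEU-4 (uniform n-gram weights) with smoothing to avoid zero scores on short captions.
score = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```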

Implications and Future Directions

The research has practical implications in enhancing accessibility technologies for vision-impaired users and enriching content-based image retrieval systems. Theoretically, it pushes the development of models capable of robust multimodal understanding and generation.

The authors stress the need for:

  • Larger and more diverse datasets to train models capable of generalizing across varied contexts.
  • Improved automatic evaluation metrics that align more closely with human judgments, especially considering the creative and subjective aspects of image descriptions.
  • Exploration of related tasks, such as visual question answering, to push models toward richer interaction with visual content.

Conclusion

The survey by Bernardi et al. effectively synthesizes the state-of-the-art in automatic image description, outlining both the strides made and the challenges remaining. As the fields of CV and NLP continue to integrate, the development of more sophisticated and context-aware models and datasets will be critical in advancing this interdisciplinary area. Future research will undoubtedly benefit from addressing the highlighted limitations, ushering in systems that better emulate human-like image understanding and description.