
From Show to Tell: A Survey on Deep Learning-based Image Captioning (2107.06912v3)

Published 14 Jul 2021 in cs.CV and cs.CL

Abstract: Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful sentences. Starting from 2015 the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. During these years, both components have evolved considerably through the exploitation of object regions, attributes, the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, regardless of the impressive results, research in image captioning has not reached a conclusive answer yet. This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and highlighting the future directions for a research area where Computer Vision and Natural Language Processing can find an optimal synergy.

Authors (6)
  1. Matteo Stefanini (7 papers)
  2. Marcella Cornia (61 papers)
  3. Lorenzo Baraldi (68 papers)
  4. Silvia Cascianelli (23 papers)
  5. Giuseppe Fiameni (18 papers)
  6. Rita Cucchiara (142 papers)
Citations (217)

Summary

An Insightful Overview of Deep Learning-Based Image Captioning

The paper "From Show to Tell: A Survey on Deep Learning-based Image Captioning" by Stefanini et al. explores the intricate task of image captioning, a pivotal component in connecting vision and language within the field of AI. The focus of this survey is on deep learning strategies, exploring the evolution of image captioning methodologies from their inception to contemporary approaches, including visual encoding and LLMing, drawing connections between computer vision and NLP.

Evolution of Image Captioning Techniques

The authors trace the progression of image captioning techniques, starting from early models that utilized global CNN features and moving toward more sophisticated models incorporating attention mechanisms. Initially, image captioning relied heavily on template-based approaches and retrieval models, which later gave way to deep learning-based generative models characterized by a two-stage process: visual encoding and language modeling.
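As a concrete illustration of this two-stage design, the sketch below pairs a global CNN feature with an LSTM word generator, in the spirit of early "Show and Tell"-style captioners. All module names, dimensions, and toy inputs are illustrative assumptions, not details taken from the paper.

```python
# Minimal two-stage captioner: a projected global CNN feature seeds an LSTM
# language model that predicts the caption word by word (illustrative only).
import torch
import torch.nn as nn

class GlobalFeatureCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.visual_proj = nn.Linear(feat_dim, embed_dim)  # map CNN feature to embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)   # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)       # per-step word scores

    def forward(self, global_feat, captions):
        # Treat the projected image feature as the first token of the sequence.
        img_token = self.visual_proj(global_feat).unsqueeze(1)   # (B, 1, E)
        words = self.embed(captions)                             # (B, T, E)
        states, _ = self.lstm(torch.cat([img_token, words], dim=1))
        return self.out(states)                                  # (B, T+1, vocab)

model = GlobalFeatureCaptioner()
feats = torch.randn(2, 2048)                # stand-in for global CNN features
caps = torch.randint(0, 10000, (2, 12))     # stand-in for tokenized captions
print(model(feats, caps).shape)             # torch.Size([2, 13, 10000])
```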

Visual Encoding

The survey categorizes visual encoding advancements into four main strategies:

  1. Global CNN Features: This initial approach used high-level image descriptors, introducing the basic yet foundational concept of image feature representation.
  2. Attention over Grid of CNN Features: This paradigm shift allowed models to dynamically focus on specific image parts, providing flexibility and granularity that were not present in global representations.
  3. Attention over Visual Regions: Leveraging object detectors like Faster R-CNN allowed the scene to be segmented into meaningful object proposals (see the attention sketch after this list). This innovation marked a significant leap in encoding visual context relevant to language generation.
  4. Self-attention and Transformer Architectures: Inspired by successes in NLP, these have altered the landscape of both encoding images and synthesizing language, providing robust performance with their capability to capture intricate relationships among visual elements.
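As a concrete example of attention over detected regions, the snippet below implements a small additive-attention module that weighs a set of region features against the current decoder state. The dimensions, the 36-region input, and the additive formulation are illustrative assumptions rather than a specific model from the survey.

```python
# Additive attention over region features: the decoder state queries the set of
# detector outputs and receives a weighted visual context vector (illustrative only).
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)     # project region features
        self.state_proj = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)                # scalar relevance per region

    def forward(self, regions, decoder_state):
        # regions: (B, R, feat_dim) features from a detector such as Faster R-CNN
        # decoder_state: (B, hidden_dim) current language-model hidden state
        energy = torch.tanh(self.feat_proj(regions)
                            + self.state_proj(decoder_state).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)  # (B, R)
        context = (weights.unsqueeze(-1) * regions).sum(dim=1)          # (B, feat_dim)
        return context, weights

attn = RegionAttention()
regions = torch.randn(2, 36, 2048)    # e.g. 36 region proposals per image
state = torch.randn(2, 512)
context, weights = attn(regions, state)
print(context.shape, weights.shape)   # torch.Size([2, 2048]) torch.Size([2, 36])
```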

Language Modeling

The architecture of language models has similarly evolved:

  • LSTM-Based Models: These recurrent architectures initially dominated, providing sequential word generation capabilities crucial for creating coherent sentences.
  • Transformer Networks: These architectures capitalize on self-attention mechanisms to handle long-range dependencies and context modeling more efficiently, reshaping language generation in the process (a minimal decoder sketch follows this list).
  • BERT-like Architectures and Large-Scale Pre-Training: Incorporating early fusion layers and employing large-scale vision-and-language pre-training have further refined model performance, aligning semantic content more closely between image and text.
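A minimal sketch of how such a fully-attentive language model can be wired up is given below: a stack of Transformer decoder layers self-attends over the partially generated caption and cross-attends to projected visual tokens. The layer counts, dimensions, and use of PyTorch's built-in TransformerDecoder are illustrative choices, not the specific architectures surveyed.

```python
# Transformer captioning decoder: masked self-attention over the caption prefix
# plus cross-attention to visual tokens (grid or region features). Illustrative only.
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512
embed = nn.Embedding(vocab_size, d_model)
visual_proj = nn.Linear(2048, d_model)           # map detector features to model width
layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=3)
lm_head = nn.Linear(d_model, vocab_size)

regions = torch.randn(2, 36, 2048)               # visual tokens from a detector
caps = torch.randint(0, vocab_size, (2, 12))     # caption prefix (token ids)

memory = visual_proj(regions)                    # (B, 36, d_model)
tgt = embed(caps)                                # (B, 12, d_model)
T = caps.size(1)
causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)  # block future words
hidden = decoder(tgt, memory, tgt_mask=causal_mask)
print(lm_head(hidden).shape)                     # torch.Size([2, 12, 10000])
```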

Quantitative Evaluation and Future Directions

The survey emphasizes not only architectural advances, but also the importance of datasets and evaluation metrics in shaping research progress. Conventional metrics like BLEU, METEOR, and CIDEr remain central but are supplemented by new approaches that address diversity and model generalization, reflecting a need for systems that can effectively describe images beyond the specific datasets on which they were trained.
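To make the metric discussion concrete, the snippet below computes a clipped n-gram precision, the core ingredient of BLEU-style scores. It omits the brevity penalty, multiple references, and corpus-level aggregation, so it is a simplified sketch rather than the official implementation.

```python
# Clipped n-gram precision: how many candidate n-grams also appear in the
# reference, with counts clipped to the reference frequency (simplified sketch).
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

candidate = "a dog runs on the beach".split()
reference = "a dog is running on the beach".split()
print(ngram_precision(candidate, reference, n=1))  # ~0.83 unigram precision
print(ngram_precision(candidate, reference, n=2))  # 0.6   bigram precision
```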

Stefanini et al. also highlight several challenges and potential future directions for image captioning. The task remains unsolved, with issues surrounding generalization, diversity, and bias mitigation still looming large. The paper suggests that future research could benefit from focusing on robust, interpretable models that can reason across diverse, real-world scenarios while ensuring broader applicability and fairness.

Conclusion

In synthesizing the development of image captioning from rudimentary beginnings to the cutting-edge use of Transformers, this survey provides a valuable guide for researchers in the field. It offers a clear roadmap for consolidating advances while encouraging exploration of novel architectures and training protocols. As AI systems continue to evolve, the synergy between vision and language remains paramount, promising to further close the gap between human and machine understanding of the world.