Towards Visual Grounding: A Survey
The paper "Towards Visual Grounding: A Survey" offers a comprehensive analysis of the past, present, and evolving future of visual grounding (VG). The term "visual grounding" refers to the task of localizing specific objects in an image as referenced by a given textual description. This survey thoroughly investigates the technological evolution and categorization of methodologies, datasets, and settings in VG.
The paper identifies several significant trends and transitions. The authors divide the technical evolution into three primary eras: a preliminary stage (before 2014), an early stage (2014-2020), and a surge stage (2021 onwards). Early approaches relied predominantly on CNNs for visual processing and LSTMs for text processing, matching region proposals against textual descriptions. The shift to attention mechanisms, most notably the Transformer architecture, marked a major methodological change, enabling richer multimodal interaction.
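To make the early two-stage recipe concrete, the sketch below scores pre-extracted region features against a sentence embedding in a shared space and picks the best-matching proposal. It is a minimal illustration, not any specific method from the survey: the dimensions, projection layers, and random inputs are placeholders for what would really be CNN/RoI-pooled region features and an LSTM sentence encoding.

```python
import torch
import torch.nn as nn

class ProposalTextMatcher(nn.Module):
    """Minimal two-stage grounding sketch: project region and text features
    into a joint space, score them by cosine similarity, and pick the best
    proposal. All dimensions and encoders are illustrative placeholders."""

    def __init__(self, region_dim=2048, text_dim=512, joint_dim=256):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, joint_dim)  # stands in for CNN/RoI features -> joint space
        self.text_proj = nn.Linear(text_dim, joint_dim)      # stands in for an LSTM sentence encoding -> joint space

    def forward(self, region_feats, text_feat):
        # region_feats: (num_proposals, region_dim), text_feat: (text_dim,)
        r = nn.functional.normalize(self.region_proj(region_feats), dim=-1)
        t = nn.functional.normalize(self.text_proj(text_feat), dim=-1)
        scores = r @ t                    # one cosine similarity per proposal
        return scores.argmax(), scores    # index of the best-matching region

if __name__ == "__main__":
    matcher = ProposalTextMatcher()
    fake_regions = torch.randn(10, 2048)  # e.g. features of 10 region proposals
    fake_text = torch.randn(512)          # e.g. encoding of "the dog on the left"
    best, scores = matcher(fake_regions, fake_text)
    print("best proposal index:", best.item())
```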
Recent years have seen a surge in the adoption of vision-language pre-training (VLP) models, such as CLIP, which provide robust cross-modal feature alignment. This paradigm shift reflects the use of large-scale multimodal pre-training to achieve fine-grained cross-modal alignment, overcoming the limitations of traditional single-modality pre-training.
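As a rough illustration of how a pre-trained VLP model can be repurposed for grounding, the sketch below crops a set of candidate boxes, encodes the crops and the referring expression with CLIP (via the Hugging Face transformers API), and returns the highest-scoring box. This is a simplified crop-and-score baseline under assumed candidate boxes; actual VLP-based grounding methods in the survey are considerably more sophisticated.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def ground_with_clip(image: Image.Image, expression: str, boxes):
    """Score each candidate box by CLIP similarity between its crop and the
    expression; boxes are (left, top, right, bottom) tuples from some
    off-the-shelf proposal generator (assumed, not specified by the survey)."""
    crops = [image.crop(box) for box in boxes]
    inputs = processor(text=[expression], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = (img_emb @ txt_emb.T).squeeze(-1)   # one score per crop
    return boxes[int(scores.argmax())], scores

# Usage (hypothetical image and boxes):
# best_box, _ = ground_with_clip(Image.open("scene.jpg"), "the red mug",
#                                [(0, 0, 100, 100), (50, 30, 200, 180)])
```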
A major component of the paper is its categorization of experimental settings within VG, each of which fundamentally changes how models are trained and evaluated. These include fully supervised approaches; weakly supervised strategies that rely on image-text pairs without bounding-box annotations; semi-supervised methods that combine labeled and unlabeled data; and zero-shot settings in which models are tested on object categories never seen during training. The paper also introduces "Generalized Visual Grounding (GVG)", which covers scenarios where the description may refer to multiple objects or to none at all, reflecting a more realistic application context.
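The practical difference GVG makes at inference time can be sketched in a few lines: instead of always returning the single best box, a generalized grounder keeps every box whose score clears a confidence threshold, so the output may be empty, a single box, or several. The threshold and scores below are illustrative, not values from the survey.

```python
import torch

def select_grounded_boxes(boxes, scores, threshold=0.5):
    """Generalized grounding sketch: keep every candidate box whose score
    clears a confidence threshold, so the result may be empty, one box,
    or many. (Classic referring-expression grounding would instead return
    the single argmax box.)"""
    keep = scores >= threshold
    return [box for box, k in zip(boxes, keep.tolist()) if k]

boxes = [(10, 20, 110, 200), (300, 40, 420, 260), (50, 50, 90, 90)]
scores = torch.tensor([0.82, 0.61, 0.07])
print(select_grounded_boxes(boxes, scores))  # two boxes here; [] if nothing matches
```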
The survey also examines the implications of these methodologies for evaluation. The constraints of existing datasets, such as the classical RefCOCO/+/g benchmarks, highlight the need for more challenging and comprehensive benchmarks that can truly probe the capabilities of modern VG models. As the authors argue, datasets are a key driver of technological progress, and the field needs test beds that move beyond the limitations of the current ones.
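Benchmarking on these datasets is commonly reported as the fraction of expressions whose predicted box overlaps the annotated box with IoU of at least 0.5 (often written Acc@0.5). The short sketch below computes that metric; the example boxes are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def accuracy_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of expressions whose predicted box matches the annotation
    with IoU >= threshold (the usual Acc@0.5 protocol)."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Toy example: two predictions, only the first overlaps its ground truth enough.
preds = [(10, 10, 100, 100), (0, 0, 30, 30)]
gts   = [(12, 8, 105, 98), (200, 200, 260, 260)]
print(accuracy_at_iou(preds, gts))  # -> 0.5
```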
The authors also explore the broader implications and applications of VG in real-world scenarios. They underscore the potential for VG to contribute to various domains such as robotics, human-computer interaction, remote sensing, and medical imaging, indicating a trajectory toward more specialized applications where precise grounding and semantic comprehension are crucial.
The survey concludes with an insight into the challenges and future directions for VG research. This includes addressing current dataset limitations, refining task definitions to better represent real-world complexities, and improving methodologies for more effective integration into practical applications. The authors advocate for advancing research into large-scale pre-training and more complex video stream scenarios, which will likely play a critical role in developing general-purpose AI systems with enhanced multimodal understanding capabilities.
Overall, the paper serves as a valuable resource for researchers in the field, providing a structured and detailed overview of key developments in visual grounding and laying the groundwork for future exploration and innovation in this area.