GREC: Generalized Referring Expression Comprehension

Published 30 Aug 2023 in cs.CV | (2308.16182v2)

Abstract: The objective of Classic Referring Expression Comprehension (REC) is to produce a bounding box corresponding to the object mentioned in a given textual description. Commonly, existing datasets and techniques in classic REC are tailored for expressions that pertain to a single target, meaning a sole expression is linked to one specific object. Expressions that refer to multiple targets or involve no specific target have not been taken into account. This constraint hinders the practical applicability of REC. This study introduces a new benchmark termed Generalized Referring Expression Comprehension (GREC). This benchmark extends the classic REC by permitting expressions to describe any number of target objects. To achieve this goal, we have built the first large-scale GREC dataset named gRefCOCO. This dataset encompasses a range of expressions: those referring to multiple targets, expressions with no specific target, and single-target expressions. The design of GREC and gRefCOCO ensures smooth compatibility with classic REC. The proposed gRefCOCO dataset, a GREC method implementation code, and GREC evaluation code are available at https://github.com/henghuiding/gRefCOCO.

Summary

The paper introduces GREC, a novel benchmark that extends REC tasks to handle multi-target and no-target expressions.
It presents a new gRefCOCO dataset and evaluation metrics like Precision@(F₁=1, IoU≥0.5) and No-target accuracy to measure performance.
Experiments reveal that dynamic thresholding strategies outperform traditional fixed top-K selection in complex visual scenes.

Generalized Referring Expression Comprehension (GREC): Expanding the Scope of Visual Grounding

The paper "GREC: Generalized Referring Expression Comprehension" extends Referring Expression Comprehension (REC), the task of identifying and localizing objects in an image from a natural language expression. Classic REC approaches assume that each expression refers to exactly one target object. This assumption limits the applicability of REC systems, particularly in complex visual scenes where an expression may refer to several objects or to none at all. The paper proposes a new benchmark, Generalized Referring Expression Comprehension (GREC), which extends the classic REC setting to expressions referring to any number of target objects.

Benchmark and Dataset Introduction

GREC is introduced as an extension to the classic REC task, providing a framework that allows for multi-target and no-target referring expressions. The conventional REC paradigm fails to accommodate expressions that do not correspond to any object (no-target) or refer to multiple objects within the image (multi-target). These scenarios are critical for practical applications in video production, human-machine interaction, and autonomous systems, where language descriptions often involve multiple or zero targets in complex scenes.

To operationalize GREC, the researchers developed a new large-scale dataset named gRefCOCO. This dataset extends the popular RefCOCO dataset by including both multi-target and no-target expressions. It serves as a comprehensive resource for testing and developing models capable of handling the additional complexities introduced by the GREC task. The dataset is designed for seamless compatibility with classic REC tasks, thus facilitating comprehensive evaluations and comparisons.

Methodology and Evaluation

The GREC framework requires modifying existing REC models so that they can process expressions with varying numbers of targets. A model must output an arbitrary number of bounding boxes, including zero boxes for no-target expressions. The authors introduce new evaluation metrics, including Precision@(F₁=1, IoU≥0.5) and No-target accuracy (N-acc). These metrics give a finer-grained assessment of model performance by accounting for the number of correctly identified targets in multi-target expressions and for how accurately no-target expressions are recognized.
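The two metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not the official GREC evaluation code (which is in the gRefCOCO repository). It assumes one common reading of the metrics: a predicted box counts as a true positive when it matches a not-yet-matched ground-truth box with IoU of at least 0.5, a sample counts toward Precision@(F₁=1, IoU≥0.5) only when its F₁ score is exactly 1, a no-target sample is correct only when no box is predicted, and N-acc is the fraction of no-target samples handled correctly. The function names and the greedy matching order are hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed convention


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sample_f1(pred: Sequence[Box], gt: Sequence[Box], iou_thr: float = 0.5) -> float:
    """F1 score of one sample under greedy one-to-one box matching (assumed)."""
    if not gt:
        # No-target sample: correct only if nothing is predicted.
        return 1.0 if not pred else 0.0
    matched = set()
    tp = 0
    for p in pred:
        best_j, best_iou = -1, iou_thr
        for j, g in enumerate(gt):
            if j in matched:
                continue  # each ground-truth box can be matched only once
            v = iou(p, g)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    fp = len(pred) - tp  # unmatched predictions
    fn = len(gt) - tp    # missed ground-truth boxes
    return 2 * tp / (2 * tp + fp + fn)


def precision_at_f1(samples: List[Tuple[Sequence[Box], Sequence[Box]]]) -> float:
    """Fraction of samples whose F1 score is exactly 1."""
    hits = sum(1 for pred, gt in samples if sample_f1(pred, gt) == 1.0)
    return hits / len(samples)


def n_acc(samples: List[Tuple[Sequence[Box], Sequence[Box]]]) -> float:
    """Fraction of no-target samples for which no box is predicted."""
    no_target = [(pred, gt) for pred, gt in samples if not gt]
    if not no_target:
        return 0.0  # arbitrary convention for this sketch
    return sum(1 for pred, _ in no_target if not pred) / len(no_target)


if __name__ == "__main__":
    samples = [
        ([(0, 0, 10, 10)], [(0, 0, 10, 10)]),                    # exact hit
        ([(0, 0, 10, 10), (20, 20, 30, 30)], [(0, 0, 10, 10)]),  # one extra box
        ([], []),                                                # no-target, correct
        ([(0, 0, 5, 5)], []),                                    # no-target, wrong
    ]
    print(precision_at_f1(samples), n_acc(samples))
```

Under this reading, a single spurious or missing box drops a sample's F₁ below 1, so the metric rewards predicting exactly the right set of boxes rather than merely including the correct ones.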

Extensive experiments revealed that conventional REC methods struggle to extend their capabilities to the GREC scenario. The paper details an ablation study showing that strategies relying on selecting top-K bounding boxes do not perform well for GREC datasets. Dynamic threshold-based bounding box selection strategies, where outputs are dictated by confidence scores rather than a fixed number, offer superior results in managing the challenges posed by GREC.
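The difference between the two selection strategies can be illustrated with a short sketch. It is not taken from the paper's implementation: the function names, the example scores, and the threshold value 0.7 are all assumptions made for illustration. The point it shows is structural: a fixed top-K rule always returns K boxes, so it cannot return zero boxes for a no-target expression, whereas a confidence threshold lets the number of outputs vary per expression.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed convention


def select_top_k(boxes: Sequence[Box], scores: Sequence[float], k: int) -> List[Box]:
    """Keep the k highest-scoring boxes, regardless of how low the scores are."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [boxes[i] for i in order[:k]]


def select_by_threshold(boxes: Sequence[Box], scores: Sequence[float], thr: float) -> List[Box]:
    """Keep every box whose confidence is at least thr; may return no boxes."""
    return [b for b, s in zip(boxes, scores) if s >= thr]


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]

    # Hypothetical scores for a two-target expression.
    multi_scores = [0.9, 0.8, 0.1]
    print(len(select_top_k(boxes, multi_scores, k=1)))           # misses one target
    print(len(select_by_threshold(boxes, multi_scores, 0.7)))    # keeps both

    # Hypothetical scores for a no-target expression.
    none_scores = [0.2, 0.1, 0.05]
    print(len(select_top_k(boxes, none_scores, k=1)))            # still outputs a box
    print(len(select_by_threshold(boxes, none_scores, 0.7)))     # outputs nothing
```

In practice the threshold would need to be tuned on validation data; the value used here is arbitrary.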

Implications and Future Work

The introduction of GREC marks a significant shift in the REC paradigm, providing a more realistic and flexible framework for object localization in natural scenes. By accommodating multi-target and no-target expressions, GREC not only makes models more robust in the face of diverse inputs but also enables new applications such as multi-object retrieval and image set filtering based on descriptive expressions.

From a theoretical perspective, GREC suggests the need for future research to develop more sophisticated neural architectures capable of capturing context and semantic nuances that single-target models inherently overlook. Incorporating advanced natural language understanding methods and multi-task learning frameworks could offer potential pathways for improvement. Furthermore, exploring the relationships between referring expressions in natural dialogue and physical scene composition may yield insights into building more intuitive interfaces for human-computer interaction.

Overall, the GREC framework and its accompanying dataset gRefCOCO represent substantial contributions to the field of visual grounding, offering a platform for future exploration and technological advancements. As models and computational limits evolve, the ability to handle complex and dynamic language inputs with precision and adaptability will be crucial for the next generation of AI systems.
