GREC: Generalized Referring Expression Comprehension (2308.16182v2)

Published 30 Aug 2023 in cs.CV

Abstract: The objective of Classic Referring Expression Comprehension (REC) is to produce a bounding box corresponding to the object mentioned in a given textual description. Commonly, existing datasets and techniques in classic REC are tailored for expressions that pertain to a single target, meaning a sole expression is linked to one specific object. Expressions that refer to multiple targets or involve no specific target have not been taken into account. This constraint hinders the practical applicability of REC. This study introduces a new benchmark termed as Generalized Referring Expression Comprehension (GREC). This benchmark extends the classic REC by permitting expressions to describe any number of target objects. To achieve this goal, we have built the first large-scale GREC dataset named gRefCOCO. This dataset encompasses a range of expressions: those referring to multiple targets, expressions with no specific target, and the single-target expressions. The design of GREC and gRefCOCO ensures smooth compatibility with classic REC. The proposed gRefCOCO dataset, a GREC method implementation code, and GREC evaluation code are available at https://github.com/henghuiding/gRefCOCO.

Generalized Referring Expression Comprehension (GREC): Expanding the Scope of Visual Grounding

The research paper titled "GREC: Generalized Referring Expression Comprehension" presents a significant advancement in the field of Referring Expression Comprehension (REC), a task that involves identifying and localizing objects within images based on natural language expressions. Traditional REC approaches are limited to handling expressions that refer to a single target object. This limitation restricts the applicability of REC systems, particularly in complex visual scenes that involve multiple objects or when no objects are referenced at all. This paper proposes a novel benchmark, Generalized Referring Expression Comprehension (GREC), which expands the conventional REC framework to support expressions referring to any number of target objects.

Benchmark and Dataset Introduction

GREC is introduced as an extension to the classic REC task, providing a framework that allows for multi-target and no-target referring expressions. The conventional REC paradigm fails to accommodate expressions that do not correspond to any object (no-target) or refer to multiple objects within the image (multi-target). These scenarios are critical for practical applications in video production, human-machine interaction, and autonomous systems, where language descriptions often involve multiple or zero targets in complex scenes.

To operationalize GREC, the researchers developed a new large-scale dataset named gRefCOCO. This dataset extends the popular RefCOCO dataset by including both multi-target and no-target expressions. It serves as a comprehensive resource for testing and developing models capable of handling the additional complexities introduced by the GREC task. The dataset is designed for seamless compatibility with classic REC tasks, thus facilitating comprehensive evaluations and comparisons.
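
To make the distinction between the expression types concrete, the sketch below shows what multi-target, no-target, and single-target samples could look like. The field names and values are purely illustrative assumptions, not the actual gRefCOCO annotation schema, which is documented in the repository linked above.

```python
# Illustrative only: hypothetical annotation entries, NOT the real gRefCOCO
# schema. Each expression maps to zero, one, or several bounding boxes.
samples = [
    {   # multi-target: one expression grounded to several objects
        "expression": "the two people on the left",
        "boxes": [[12, 40, 150, 310], [160, 45, 290, 320]],  # (x1, y1, x2, y2)
    },
    {   # no-target: the expression matches nothing in the image
        "expression": "the dog wearing a hat",
        "boxes": [],
    },
    {   # single-target: the classic REC case, kept for compatibility
        "expression": "the red umbrella",
        "boxes": [[33, 20, 210, 240]],
    },
]
```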

Methodology and Evaluation

The GREC framework necessitates modifications to existing REC models so that they can process expressions with varying numbers of referents. This requires outputting an arbitrary number of bounding boxes, including zero boxes for no-target expressions. The authors introduced new evaluation metrics, including Precision@(F₁=1, IoU≥0.5) and No-target accuracy (N-acc). These metrics provide a more nuanced assessment of model performance by considering whether all targets of a multi-target expression are correctly identified and how reliably no-target expressions are recognized.
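
A minimal sketch of how such sample-level metrics can be computed is shown below. It uses a simple greedy IoU matching and is only an approximation of the paper's protocol; the official evaluation code in the gRefCOCO repository remains the authoritative reference.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0

def f1_is_one(pred_boxes, gt_boxes, thr=0.5):
    """True if predictions and ground truth match one-to-one at IoU >= thr,
    i.e. the per-sample F1 score equals 1 (greedy matching approximation)."""
    if not gt_boxes:                        # no-target expression:
        return len(pred_boxes) == 0         # correct only if nothing is predicted
    matched = set()
    for p in pred_boxes:
        best_j, best_iou = -1, thr
        for j, g in enumerate(gt_boxes):
            if j in matched:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j < 0:                      # an unmatched prediction breaks F1 = 1
            return False
        matched.add(best_j)
    return len(matched) == len(gt_boxes)    # every GT box must be covered

def grec_metrics(samples, thr=0.5):
    """samples: list of (pred_boxes, gt_boxes) pairs; gt_boxes == [] marks a
    no-target expression. Returns (Precision@(F1=1, IoU>=thr), N-acc)."""
    precision = sum(f1_is_one(p, g, thr) for p, g in samples) / len(samples)
    no_target = [p for p, g in samples if not g]
    n_acc = sum(len(p) == 0 for p in no_target) / len(no_target) if no_target else 0.0
    return precision, n_acc
```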

Extensive experiments revealed that conventional REC methods struggle to extend their capabilities to the GREC scenario. The paper details an ablation study showing that strategies relying on selecting a fixed top-K set of bounding boxes do not perform well on the GREC benchmark. Dynamic, threshold-based bounding box selection, where the number of outputs is dictated by confidence scores rather than fixed in advance, handles the challenges posed by GREC considerably better.
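
The contrast between the two selection strategies can be sketched in a few lines; the confidence threshold used below (0.7) is an arbitrary illustrative value, not one taken from the paper.

```python
import torch

def select_topk(boxes, scores, k=2):
    """Fixed top-K selection: always returns K boxes, so it cannot produce an
    empty output for no-target expressions or scale to many targets."""
    idx = scores.topk(min(k, scores.numel())).indices
    return boxes[idx]

def select_by_threshold(boxes, scores, conf_thr=0.7):
    """Dynamic, threshold-based selection: the number of returned boxes is
    dictated by the confidence scores, from zero boxes upward."""
    return boxes[scores >= conf_thr]
```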

Implications and Future Work

The introduction of GREC marks a significant shift in the REC paradigm, providing a more realistic and flexible framework for object localization in natural scenes. By accommodating multi-target and no-target expressions, GREC not only enhances the robustness of models in the face of diverse inputs but also enables new applications such as multi-object retrieval and image-set filtering based on descriptive expressions.

From a theoretical perspective, GREC suggests the need for future research to develop more sophisticated neural architectures capable of capturing context and semantic nuances that single-target models inherently overlook. Incorporating advanced natural language understanding methods and multi-task learning frameworks could offer potential pathways for improvement. Furthermore, exploring the relationships between referring expressions in natural dialogue and physical scene composition may yield insights into building more intuitive interfaces for human-computer interaction.

Overall, the GREC framework and its accompanying dataset gRefCOCO represent substantial contributions to the field of visual grounding, offering a platform for future exploration and technological advancements. As models and computational resources evolve, the ability to handle complex and dynamic language inputs with precision and adaptability will be crucial for the next generation of AI systems.

Authors (4)
  1. Shuting He
  2. Henghui Ding
  3. Chang Liu
  4. Xudong Jiang