GREC: Generalized Referring Expression Comprehension (2308.16182v2)
Abstract: The objective of classic Referring Expression Comprehension (REC) is to produce a bounding box for the object described by a given textual expression. Existing datasets and techniques in classic REC are generally tailored to single-target expressions, in which each expression is linked to exactly one object. Expressions that refer to multiple targets, or to no target at all, have not been taken into account. This constraint limits the practical applicability of REC. This study introduces a new benchmark called Generalized Referring Expression Comprehension (GREC). The benchmark extends classic REC by permitting expressions to describe any number of target objects. To this end, we have built the first large-scale GREC dataset, named gRefCOCO. The dataset contains three kinds of expressions: those referring to multiple targets, those with no target, and single-target expressions. The design of GREC and gRefCOCO ensures smooth compatibility with classic REC. The proposed gRefCOCO dataset, a GREC method implementation, and GREC evaluation code are available at https://github.com/henghuiding/gRefCOCO.
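To make the task definition concrete, the following minimal sketch models a GREC sample as an expression paired with a list of target boxes that may be empty, hold one box, or hold several. The names `GrecSample`, `target_boxes`, and `target_kind`, the (x, y, width, height) box convention, and the example expressions are all illustrative assumptions; they are not the actual gRefCOCO annotation format or the authors' code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical box convention for this sketch: (x, y, width, height).
Box = Tuple[float, float, float, float]


@dataclass
class GrecSample:
    """Illustrative container: one expression and zero or more target boxes."""
    expression: str
    target_boxes: List[Box] = field(default_factory=list)


def target_kind(sample: GrecSample) -> str:
    """Classify a sample by how many targets its expression refers to."""
    n = len(sample.target_boxes)
    if n == 0:
        return "no-target"
    if n == 1:
        return "single-target"
    return "multi-target"


# Made-up examples covering the three expression kinds named in the abstract.
samples = [
    GrecSample("the two people on the left",
               [(10.0, 20.0, 50.0, 120.0), (70.0, 25.0, 48.0, 115.0)]),
    GrecSample("the elephant"),  # nothing in the image matches
    GrecSample("the man in the red shirt", [(200.0, 30.0, 60.0, 140.0)]),
]

for s in samples:
    print(f"{s.expression!r}: {target_kind(s)}")
```

Under this representation, classic REC is the special case in which every sample has exactly one box, which is one way to read the compatibility claim above.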