Generalized Referring Expression Segmentation: A New Benchmark Approach
The paper "GRES: Generalized Referring Expression Segmentation" introduces a novel benchmark in the field of Referring Expression Segmentation (RES) by expanding its traditional constraints. The authors propose Generalized Referring Expression Segmentation (GRES), a framework that allows expressions to refer to an arbitrary number of target objects. This advancement is designed to address the limitations of classic RES, which primarily focuses on single-target expressions and thereby limits practical applications. The paper is a comprehensive exploration of the GRES paradigm and its capabilities, supported by a newly constructed dataset, gRefCOCO, and a robust baseline method, ReLA.
Motivation and Contribution
Classic RES methodologies are primarily constrained by their inability to process multi-target and no-target expressions effectively. Traditional datasets, including ReferIt and RefCOCO, are designed to support single-target scenarios, which do not reflect the complexity of real-world applications. GRES targets these limitations by allowing expressions to handle multiple or zero targets, thereby increasing the flexibility and applicability of RES in diverse practical scenarios such as human-machine interaction and video production.
The paper's contributions can be summarized as follows:
- GRES Framework: By defining GRES, the authors extend RES to handle any number of target objects in a single expression, substantially enhancing its semantic flexibility and practical utility.
- gRefCOCO Dataset: The creation of gRefCOCO, a large-scale dataset that includes multi-target, no-target, and single-target expressions, marks a significant step forward. It stands as the first dataset that fully supports GRES, facilitating comprehensive experimental analysis and development.
- ReLA Baseline Method: The ReLA approach employs region-based modeling to adaptively divide images into regions with sub-instance clues and interweave region-region and region-language dependencies. This method achieves state-of-the-art performance in both GRES and classic RES tasks, illustrating the effectiveness and adaptability of GRES-centered techniques.
Methodological Insights
The ReLA method addresses the complex relationship modeling needed to tackle the challenges posed by multi-target expressions. It employs an adaptive region-based model that integrates both image and language data to predict the segmentation mask. The model leverages two attention mechanisms within its architecture: Region-Image Cross Attention (RIA) and Region-Language Cross Attention (RLA). These components dynamically aggregate image features relevant to each region and capture interactions within and across the visual and linguistic modalities.
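The overall pattern, region queries attending first over image features and then over language tokens, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, shapes, and the single-head scaled dot-product formulation are illustrative assumptions standing in for the actual RIA and RLA modules.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product cross attention: queries attend over keys/values."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (num_q, num_kv)
    weights = softmax(scores, axis=-1)       # rows sum to 1
    return weights @ values                  # (num_q, d)

# Illustrative sizes: P region queries, N image patch features, T word features.
rng = np.random.default_rng(0)
P, N, T, D = 4, 16, 6, 8
region_queries = rng.standard_normal((P, D))
image_feats    = rng.standard_normal((N, D))
word_feats     = rng.standard_normal((T, D))

# RIA-style step: each region query aggregates relevant image features.
region_img = cross_attention(region_queries, image_feats, image_feats)
# RLA-style step: region features then attend over the language tokens.
region_lang = cross_attention(region_img, word_feats, word_feats)
print(region_lang.shape)  # (4, 8)
```

In the actual model these steps are learned multi-head attention layers interleaved with region-region interaction; the sketch only conveys the direction of information flow.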
Empirical Evaluation
The paper reports consistent improvements of the ReLA method across various datasets, showcasing its capability to address the complexities of GRES. Notably, ReLA demonstrates superior performance on the gRefCOCO dataset, with significant gains in both the conventional cumulative IoU (cIoU) metric and the newly proposed generalized IoU (gIoU). The method's efficacy is further illustrated on classic RES datasets, where it surpasses existing state-of-the-art methods, proving its adaptability and robustness.
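The two metrics can be sketched concisely. As a working assumption based on their usual definitions: cIoU accumulates intersection and union over the whole dataset before dividing, while gIoU averages per-image IoU and scores a no-target sample as 1 when the prediction is also empty and 0 otherwise. The helper names below are illustrative, not from the paper's code.

```python
import numpy as np

def ciou(preds, gts):
    """Cumulative IoU: total intersection over total union across samples."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return inter / union if union > 0 else 0.0

def giou(preds, gts):
    """Generalized IoU: mean per-image IoU; a no-target sample counts as
    1.0 when the prediction is also empty, 0.0 otherwise (assumed convention)."""
    scores = []
    for p, g in zip(preds, gts):
        if g.sum() == 0:                      # no-target expression
            scores.append(1.0 if p.sum() == 0 else 0.0)
        else:
            scores.append(np.logical_and(p, g).sum()
                          / np.logical_or(p, g).sum())
    return float(np.mean(scores))

# Toy example: one partial match plus one correctly rejected no-target sample.
gt1   = np.array([[1, 1], [0, 0]], dtype=bool)
pred1 = np.array([[1, 0], [0, 0]], dtype=bool)
gt2   = np.zeros((2, 2), dtype=bool)          # empty ground truth
pred2 = np.zeros((2, 2), dtype=bool)          # model predicts "no target"

print(ciou([pred1, pred2], [gt1, gt2]))  # 1/2 = 0.5
print(giou([pred1, pred2], [gt1, gt2]))  # (0.5 + 1.0) / 2 = 0.75
```

The toy example shows why gIoU suits GRES: cIoU ignores the no-target sample entirely (it adds nothing to either sum), while gIoU explicitly rewards the correct empty prediction.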
Implications and Future Directions
The implications of GRES and the gRefCOCO dataset are manifold:
- Practical Flexibility: GRES enhances the robustness of RES applications in real-world scenarios where language expressions may naturally target multiple or no objects, providing greater user intent adaptability.
- Expanded Use Cases: The integration of GRES is likely to broaden RES applicability in fields such as automated image captioning, content retrieval, and interactive robotics, where understanding nuanced expressions is crucial.
- Research Advancements: By addressing key limitations, the GRES framework opens avenues for further exploration in multi-modal interaction and complex semantic parsing, setting a new baseline for future AI advancements.
In conclusion, the paper contributes meaningfully to the ongoing evolution of RES by presenting a structured and scalable approach to handling more intricate and natural human expressions. The introduction of GRES and gRefCOCO provides a solid foundation for subsequent innovations and practical deployment in related AI sectors.