Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities
Referring Expression Segmentation (RES) sits at the intersection of computer vision and natural language processing: given a textual description, a model must segment the entities in an image that the description refers to. Historically, RES methods have focused on object-level grounding, where a referring expression corresponds to a single, whole object. Many practical scenarios, however, require more nuanced interpretation spanning several levels of granularity, including multi-object and part-level references. A minimal interface for the task is sketched below.
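To make the task formulation concrete, here is a minimal sketch of an RES-style interface. The function name, the callable `model`, and the 0.5 threshold are illustrative assumptions, not the paper's API:

```python
import numpy as np

def refer_segment(model, image: np.ndarray, expression: str) -> np.ndarray:
    """Segment the target named by `expression` in `image`.

    `model` is any callable implementing RES that returns per-pixel
    relevance scores of shape (H, W) in [0, 1]. The output is a
    boolean mask over the same pixel grid.
    """
    scores = model(image, expression)  # (H, W) relevance scores
    return scores > 0.5                # threshold into a binary mask
```

In the multi-granularity setting the paper targets, the same interface must handle expressions that resolve to several objects or to a part of one object, not just a single whole object.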
The paper "Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities" introduces a comprehensive framework to address this challenge by establishing a multi-granularity RES (MRES) task. The authors critically analyze the limitations of current approaches, which depend largely on datasets and models specialized for object-level grounding and thus neglect visual scenes in which expressions refer to multiple objects or to specific object parts.
To expand the scope of RES, the authors introduce a multi-granularity dataset, MRES-32M, comprising over 32.2 million masks and captions across 1 million images. This dataset substantially surpasses existing datasets in scale and diversity, providing comprehensive part-level annotations that let models build a more detailed understanding of visual scenes. Alongside the dataset, the paper presents a benchmark named RefCOCOm, which augments the existing RefCOCO dataset with manually annotated part-level references.
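As a rough illustration of what a multi-granularity grounding record might look like, the following sketch pairs each caption with a mask and a granularity tag. The field names and schema are hypothetical; the paper's actual release format is not reproduced here:

```python
from dataclasses import dataclass

@dataclass
class GroundingAnnotation:
    """Hypothetical shape of one MRES-32M-style annotation record."""
    image_id: str
    caption: str      # referring expression, e.g. "the cup's handle"
    mask_rle: str     # run-length-encoded segmentation mask
    granularity: str  # "object" or "part"
```

The essential point such a record captures is that every caption is grounded to a mask at an explicit level of granularity, which is what allows part-level supervision at scale.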
A novel modeling framework, UniRES++, is proposed to leverage the extensive data offered by MRES-32M. UniRES++ is designed as a unified multimodal LLM that accommodates multi-granularity targets: a Multi-Granularity Vision Flow captures visual features at different levels of detail, and a Multi-Granularity Feature Exploitation component enables dynamic interaction among those levels, making full use of the captured features. This design unifies object-level and part-level grounding under a shared architecture, so the model performs well across all granularities.
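The following is a speculative PyTorch sketch of how two granularity streams could interact via cross-attention before a shared decoder consumes both. It illustrates the general idea of multi-granularity feature exchange, not the released UniRES++ implementation; module and tensor names are assumptions:

```python
import torch
import torch.nn as nn

class GranularityInteraction(nn.Module):
    """Let object-level and part-level feature streams attend to each
    other, then fuse each result back residually into its own stream."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.obj_to_part = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.part_to_obj = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, obj_feats: torch.Tensor, part_feats: torch.Tensor):
        # obj_feats: (B, N_obj, dim); part_feats: (B, N_part, dim)
        obj_out, _ = self.part_to_obj(obj_feats, part_feats, part_feats)
        part_out, _ = self.obj_to_part(part_feats, obj_feats, obj_feats)
        # Residual fusion keeps each stream's original information.
        return obj_feats + obj_out, part_feats + part_out
```

Cross-attention is one natural way to realize "dynamic interaction among granularities": part queries can borrow object context to disambiguate which instance a part belongs to, and object queries can sharpen their boundaries from part detail.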
Empirical experiments demonstrate UniRES++'s state-of-the-art performance across benchmarks covering three task settings: RefCOCOm for MRES, gRefCOCO for generalized RES, and the classic RefCOCO, RefCOCO+, and RefCOCOg datasets. This evaluation underlines the model's effectiveness both at precise segmentation of single-object targets and at the more demanding generalized and part-aware segmentation tasks.
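For reference, RES predictions are commonly scored with mean intersection-over-union (mIoU) between predicted and ground-truth masks. A minimal version of that computation might look as follows; the paper's exact metrics and evaluation protocol are not reproduced here:

```python
import numpy as np

def mean_iou(preds: list[np.ndarray], gts: list[np.ndarray]) -> float:
    """Average per-sample IoU over boolean prediction/ground-truth masks."""
    ious = []
    for pred, gt in zip(preds, gts):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        # Convention: score an empty prediction of an empty target as 1.0.
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))
```

Part-level targets tend to be small, so mask-averaged metrics like this penalize coarse, object-sized predictions heavily, which is exactly what a part-aware benchmark needs to measure.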
The implications of this research span practical and theoretical domains. Practically, multi-granularity RES systems enable more nuanced human-machine interaction, allowing devices to understand and respond to inputs as specific as "the handle of the third cup" or "the leaves of the second tree." Theoretically, moving toward multi-granularity comprehension pushes models to handle the challenges posed by dense textual annotations and fine-grained segmentation requirements across nested visual hierarchies.
Future directions indicated by the paper include exploring further dynamic component integration and scaling model architectures toward more refined segmentation capabilities. Public access to the extensive dataset and to the UniRES++ model may also catalyze innovation within the research community. This direction could enrich the field of embodied artificial intelligence, enabling systems to ground human language in complex visual stimuli more effectively.