Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities (2504.01954v1)

Published 2 Apr 2025 in cs.CV

Abstract: Referring expression segmentation (RES) aims at segmenting the entities' masks that match the descriptive language expression. While traditional RES methods primarily address object-level grounding, real-world scenarios demand a more versatile framework that can handle multiple levels of target granularity, such as multi-object, single object or part-level references. This introduces great challenges due to the diverse and nuanced ways users describe targets. However, existing datasets and models mainly focus on designing grounding specialists for object-level target localization, lacking the necessary data resources and unified frameworks for the more practical multi-grained RES. In this paper, we take a step further towards visual granularity unified RES task. To overcome the limitation of data scarcity, we introduce a new multi-granularity referring expression segmentation (MRES) task, alongside the RefCOCOm benchmark, which includes part-level annotations for advancing finer-grained visual understanding. In addition, we create MRES-32M, the largest visual grounding dataset, comprising over 32.2M masks and captions across 1M images, specifically designed for part-level vision-language grounding. To tackle the challenges of multi-granularity RES, we propose UniRES++, a unified multimodal LLM that integrates object-level and part-level RES tasks. UniRES++ incorporates targeted designs for fine-grained visual feature exploration. With the joint model architecture and parameters, UniRES++ achieves state-of-the-art performance across multiple benchmarks, including RefCOCOm for MRES, gRefCOCO for generalized RES, and RefCOCO, RefCOCO+, RefCOCOg for classic RES. To foster future research into multi-grained visual grounding, our RefCOCOm benchmark, MRES-32M dataset and model UniRES++ will be publicly available at https://github.com/Rubics-Xuan/MRES.

Authors (8)
  1. Jing Liu (526 papers)
  2. Wenxuan Wang (128 papers)
  3. Yisi Zhang (12 papers)
  4. Yepeng Tang (7 papers)
  5. Xingjian He (25 papers)
  6. Longteng Guo (31 papers)
  7. Tongtian Yue (13 papers)
  8. Xinlong Wang (56 papers)

Summary

Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities

Referring Expression Segmentation (RES) sits at the intersection of computer vision and natural language processing: given a textual description, the goal is to segment the image entities that match it. Historically, RES methods have primarily focused on object-level grounding, where a referring expression corresponds directly to an entire object. However, many practical scenarios require nuanced interpretations spanning various levels of granularity, including multi-object and part-level references.

The paper, “Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities,” introduces a comprehensive framework to address this challenge by establishing a multi-granularity RES (MRES) task. The research critically analyzes the limitations of current approaches, which mainly depend on datasets and models that specialize in object-level grounding, neglecting the complexities of visual scenes where expressions can refer to multiple objects or specific object parts.

To expand the scope of RES, the authors introduce a multi-granularity dataset, MRES-32M, encompassing over 32.2 million masks and captions annotated across 1 million images. This dataset significantly surpasses existing datasets in scale and diversity, providing comprehensive part-level annotations and enabling models to achieve a more detailed understanding of visual scenes. Alongside the dataset, the paper presents a benchmark named RefCOCOm, derived from the existing RefCOCO dataset, which augments it with manually annotated part-level references.
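
To make the multi-granularity setting concrete, the sketch below shows what a single training sample might look like. The field names and structure are illustrative assumptions for this summary, not the released MRES-32M or RefCOCOm schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MRESSample:
    """Illustrative structure for one multi-granularity RES sample (assumed, not the official schema)."""
    image: np.ndarray    # H x W x 3 RGB image
    expression: str      # referring expression, e.g. "the handle of the red mug"
    mask: np.ndarray     # H x W binary mask of the referred target
    granularity: str     # "multi-object", "object", or "part"

# Toy example: a part-level reference with dummy image and mask arrays
sample = MRESSample(
    image=np.zeros((480, 640, 3), dtype=np.uint8),
    expression="the handle of the red mug",
    mask=np.zeros((480, 640), dtype=np.uint8),
    granularity="part",
)
print(sample.expression, sample.granularity)
```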

A novel modeling framework, UniRES++, is proposed to leverage the extensive data offered by the MRES-32M dataset. UniRES++ is designed as a unified multimodal LLM that handles targets at multiple granularities, employing a Multi-Granularity Vision Flow to capture visual features at both object and part level. A Multi-Granularity Feature Exploitation component then enables dynamic interaction among the different granularity levels, making fuller use of the captured features. This design unifies object-level and part-level segmentation under a shared architecture and set of parameters, so the model performs strongly across all granularities. A rough sketch of the underlying idea follows.
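
The paper does not spell out implementation details here, but the core idea of letting features at different granularities interact before segmentation can be sketched as below. The module name, tensor shapes, and the simple gated fusion are assumptions for illustration only, not the actual UniRES++ design.

```python
import torch
import torch.nn as nn

class MultiGranularityFusion(nn.Module):
    """Toy sketch: fuse object-level and part-level visual features with a learned gate.
    An illustrative assumption, not the UniRES++ implementation."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, obj_feats: torch.Tensor, part_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats, part_feats: (batch, tokens, dim) features at two granularities
        g = self.gate(torch.cat([obj_feats, part_feats], dim=-1))  # per-token mixing weight
        fused = g * obj_feats + (1.0 - g) * part_feats             # convex combination of the two levels
        return self.proj(fused)

# Toy usage with random features
fusion = MultiGranularityFusion(dim=256)
obj = torch.randn(2, 196, 256)
part = torch.randn(2, 196, 256)
print(fusion(obj, part).shape)  # torch.Size([2, 196, 256])
```

The gating here simply lets the model weight object-level against part-level evidence per token; the real system is a full multimodal LLM with dedicated vision-flow and feature-exploitation modules.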

Empirical experiments demonstrate UniRES++'s state-of-the-art performance across multiple benchmarks: RefCOCOm for MRES, gRefCOCO for generalized RES, and the classic RefCOCO, RefCOCO+, and RefCOCOg datasets. This comprehensive evaluation underlines the model's effectiveness both at precise segmentation of single-object targets and at more complex generalized and part-aware segmentation tasks.
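
RES benchmarks are conventionally scored with intersection-over-union between predicted and ground-truth masks; the minimal mean-IoU sketch below shows the standard metric definition and is not code from the paper.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0

def mean_iou(preds, gts) -> float:
    """Mean IoU over paired lists of predicted and ground-truth masks."""
    return float(np.mean([mask_iou(p, g) for p, g in zip(preds, gts)]))

# Toy check: one perfect prediction and one completely wrong one -> 0.5
a = np.ones((4, 4), dtype=np.uint8)
b = np.zeros((4, 4), dtype=np.uint8)
print(mean_iou([a, a], [a, b]))  # 0.5
```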

The implications of this research span both practical and theoretical domains. Practically, multi-granularity RES systems offer the potential for more nuanced human-machine interaction, allowing devices to understand and respond to inputs that involve specifics such as "the handle of the third cup" or "the leaves of the second tree." Theoretically, advancing toward multi-granularity comprehension pushes models to handle the challenges posed by dense textual annotations and fine-grained segmentation requirements across intertwined visual hierarchies.

Future directions indicated by this paper include further exploration of dynamic component integration and scaling of model architectures toward more refined segmentation capabilities. The public release of the RefCOCOm benchmark, the MRES-32M dataset, and the UniRES++ model may also catalyze further innovation within the research community. This direction could enrich the field of embodied artificial intelligence, enabling systems to process and understand human language in conjunction with complex visual stimuli more effectively.