
Generalized Referring Expression Segmentation

Updated 15 January 2026
  • GRES is a framework that extends classical referring expression segmentation by supporting zero, single, and multi-target instance segmentation using natural language prompts.
  • It employs region-based decomposition, instance-aware queries, and adaptive counting heads to robustly predict segmentation masks and handle no-target cases.
  • Evaluation on benchmarks like gRefCOCO shows that metrics such as cIoU, gIoU, and no-target accuracy significantly improve with advanced GRES architectures.

Generalized Referring Expression Segmentation (GRES) extends classical referring expression segmentation to handle expressions that may refer to zero, one, or multiple disjoint object instances within an image, and requires robust, instance-aware, and relationship-sensitive segmentation under free-form natural language prompts. Formally, given an image $I \in \mathbb{R}^{H \times W \times 3}$ and a referring expression $T$, the goal is to output a segmentation mask $M \in \{0,1\}^{H \times W}$ such that all pixels corresponding to every object described by $T$ are labeled as foreground, with explicit handling for multi-target and no-target (empty) cases. Unlike classical RES, which assumes a one-to-one mapping between expression and object, GRES admits variable and unknown referent cardinality, demanding greater semantic flexibility and instance-level separation capabilities (Liu et al., 2023, Ding et al., 1 Aug 2025, Ding et al., 8 Jan 2026, Dai et al., 17 Sep 2025).

1. Problem Formulation and Core Differences from RES

Classical Referring Expression Segmentation (RES) is defined as mapping $(I, T) \rightarrow M$, where $T$ refers to exactly one entity present in $I$. GRES generalizes this by allowing:

  • Zero-target: $T$ does not correspond to any object in $I$ (e.g., "the elephant in a living room" when no elephant is present). The correct output is an all-zero mask.
  • Single-target: $T$ refers to exactly one object (the classical RES case).
  • Multi-target: $T$ refers to a set of objects (e.g., "the two men on the left", "all red cars").

The model must decide not only which pixels to segment but also whether any pixel should be foreground. This requires a no-target indicator $E \in \{0,1\}$ in addition to the mask $M$, with $E = 1$ indicating "no target" (Liu et al., 2023, Ding et al., 8 Jan 2026).

Evaluation metrics extend standard RES measures:

  • Cumulative IoU (cIoU) and Generalized IoU (gIoU): cIoU accumulates pixel-wise intersections and unions over the entire dataset, while gIoU averages per-sample IoU; both account for correct empty predictions and multi-instance unions.
  • No-target accuracy (N-acc): Fraction of no-target queries correctly outputting an empty mask.
  • Precision@X (Pr@X): IoU-based precision for each sample, treating empty targets rigorously (Ding et al., 8 Jan 2026, Liu et al., 2023).
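The metric definitions above can be sketched in a few lines of plain Python. This is a minimal illustration over toy flat binary masks, assuming the common gRefCOCO convention that a correctly predicted empty mask counts as IoU = 1; it is not the official evaluation code.

```python
# Toy GRES metrics (cIoU, gIoU, N-acc) over binary masks given as flat 0/1 lists.
# Conventions (empty-vs-empty IoU = 1) follow the usual gRefCOCO protocol but
# are assumptions here, not the benchmark's reference implementation.

def sample_iou(pred, gt):
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    if union == 0:            # both masks empty: correct no-target prediction
        return 1.0
    return inter / union

def gres_metrics(preds, gts):
    # cIoU: accumulate intersections and unions over the whole dataset.
    total_inter = sum(sum(p & g for p, g in zip(pr, gt)) for pr, gt in zip(preds, gts))
    total_union = sum(sum(p | g for p, g in zip(pr, gt)) for pr, gt in zip(preds, gts))
    ciou = total_inter / total_union if total_union else 1.0
    # gIoU: average per-sample IoU.
    giou = sum(sample_iou(pr, gt) for pr, gt in zip(preds, gts)) / len(preds)
    # N-acc: fraction of no-target samples predicted as empty masks.
    nt = [(pr, gt) for pr, gt in zip(preds, gts) if sum(gt) == 0]
    nacc = (sum(1 for pr, _ in nt if sum(pr) == 0) / len(nt)) if nt else None
    return ciou, giou, nacc
```

Note how cIoU and gIoU diverge: a single large object dominates cIoU, whereas gIoU weights every expression equally, including no-target ones.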

2. Dataset Development and Benchmarking

GRES benchmarking relies on datasets with multi-target and no-target expression annotation. The most influential is gRefCOCO (Liu et al., 2023, Ding et al., 8 Jan 2026):

  • Images: 19,994 (MS-COCO refs, UNC splits)
  • Expressions: ~278k total; split into single-target, multi-target (80k), and no-target (32k)
  • Annotations: Per-instance segmentation masks, bounding boxes, no-target labels

Annotation proceeds via crowd-sourcing: annotators select an object or group, write a free-form expression, and ensure that no-target cases remain semantically grounded in the scene while matching no annotated instance (Liu et al., 2023). Purpose-built datasets also exist for aerial imagery (Aerial-D: 1.5M expressions over 37k images) (Marnoto et al., 8 Dec 2025), group-wise settings (GRD, groups of images with cross-image negatives) (Wu et al., 2023), and for 3D vision (Multi3DRes: 61k expressions over 800 scenes) (Wu et al., 2024).
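To make the zero/single/multi-target split concrete, the sketch below shows a hypothetical gRefCOCO-style annotation record and how a union training target is derived from it. The field names and toy masks are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical GRES annotation records (field names are illustrative only).
record_multi = {
    "image_id": 42,
    "expression": "the two men on the left",
    "instance_masks": [[1, 0, 0, 0], [0, 1, 0, 0]],  # one toy binary mask per instance
    "no_target": False,
}
record_empty = {
    "image_id": 7,
    "expression": "the elephant in a living room",
    "instance_masks": [],                            # zero-target: no instances
    "no_target": True,
}

def union_mask(record, size=4):
    """Training target: union of all per-instance masks; all-zero for no-target."""
    out = [0] * size
    for m in record["instance_masks"]:
        out = [a | b for a, b in zip(out, m)]
    return out
```

Keeping per-instance masks (rather than only the union) is what allows instance-aware methods to supervise each query separately.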

3. Architectural Principles and Representative Methodologies

GRES architectures diverge from RES in several key respects, necessitated by the need for multi-cardinality, negative-case, and instance-aware prediction:

  • Region-based Decomposition: ReLA adaptively partitions the image using learnable region queries, then models region–region and region–language dependencies through stacked cross-attention and self-attention mechanisms (Liu et al., 2023, Ding et al., 8 Jan 2026). Region filters are dynamically weighted and fused, supporting both union masks and auxiliary no-target prediction.
  • Instance-aware Query Architectures: Methods such as InstanceVG (Dai et al., 17 Sep 2025) and InstAlign (Nguyen et al., 2024) employ multi-query, DETR-style instance heads, with each query intended to represent one candidate instance. They leverage one-to-one bipartite (Hungarian) matching with strong instance-level alignment losses (e.g., Dice+CE), and explicitly align queries not only to masks but to text phrases using cross-modal transformer blocks.
  • Counting and Cardinality Heads: Frameworks like CoHD (Luo et al., 2024) and HieA2G (Wang et al., 2 Jan 2025) decouple scene understanding via hierarchical semantic decoders, then append adaptive counting heads or category-level existence classifiers for robust referent number determination.
  • LLM-driven Multi-Target Segmentation: GSVA (Xia et al., 2023) and RAS (Cao et al., 5 Jun 2025) leverage large multimodal LLMs to handle multi-mask and null outputs, introducing special tokens (e.g., [SEG] for each mask, [REJ] for no-target in GSVA) and mask-based binary selection heads, thus bypassing the need for autoregressive set prediction.
  • Latent Description Generation: Latent-VG (Yu et al., 7 Aug 2025) augments the input text with multiple, stochastically generated latent expressions, each encoding complementary cues, and averages predictions to improve disambiguation and robustness.

The following table provides an organized comparison of representative approaches:

| Approach | Core Mechanism | Instance Separation | No-target Handling | Noteworthy Metrics |
|---|---|---|---|---|
| ReLA | Region queries + attention | Soft region fusion | No-target head | cIoU, gIoU, N-acc |
| InstanceVG | Instance-aware point queries | 1-to-1 query-instance | Existence score branch | gIoU, cIoU, per-instance |
| InstAlign | Phrase-object transformers | Query-text matching | Global + sentence text | gIoU, cIoU, N-acc |
| CoHD / HieA2G | Hierarchy + counting head | Query-cardinality | Existence classifier | cIoU, gIoU, N-acc |
| GSVA | LLM + [SEG]/[REJ] tokens | LLM mask response | [REJ] token | gIoU, cIoU, N-acc |
| Latent-VG | Latent generated expressions | Multi-prediction fusion | Dedicated empty token | mIoU, N-acc |

4. Loss Functions, Matching, and Training Protocols

GRES training regimes must:

  • Perform robust instance-to-query matching (e.g., Hungarian bipartite matching).
  • Apply instance-level segmentation losses (per-query Dice + BCE), no-target binary losses, and for LLM-based methods, cross-entropy or Dice on each predicted mask (possibly with language modeling losses for output tokens) (Dai et al., 17 Sep 2025, Xia et al., 2023).
  • Handle negative examples with tailored negative-mask supervision or explicit "empty" queries.
  • Employ auxiliary objectives (e.g., region minimaps, alignment losses, count/contrastive terms for Adaptive Grounding Counter in HieA2G).
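The matching step in the first bullet can be illustrated with a tiny brute-force version of Hungarian assignment: each ground-truth instance is paired with the predicted query that minimizes a (1 − Dice) cost. This is a toy sketch; real implementations use an efficient solver such as `scipy.optimize.linear_sum_assignment`, and the soft masks here are made up for illustration.

```python
# Brute-force one-to-one query-to-instance matching with a Dice-based cost.
# Exact but exponential; fine for a handful of queries, illustration only.
from itertools import permutations

def dice(pred, gt, eps=1e-6):
    # pred: soft mask in [0, 1]; gt: binary mask; both flat lists.
    inter = sum(p * g for p, g in zip(pred, gt))
    return (2 * inter + eps) / (sum(pred) + sum(gt) + eps)

def match(pred_masks, gt_masks):
    n = len(gt_masks)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(pred_masks)), n):
        cost = sum(1 - dice(pred_masks[q], gt_masks[i]) for i, q in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost   # best_perm[i] = query index assigned to gt i
```

Once the assignment is fixed, per-query Dice + BCE losses are applied to matched pairs, and unmatched queries are supervised toward "no object".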

For example, InstanceVG (Dai et al., 17 Sep 2025) uses a multi-term objective:

$$L_{\text{total}} = \lambda_{\text{detr}} L_{\text{detr}} + \lambda_{\text{seg}} L_{\text{seg}} + \lambda_{\text{instance}} L_{\text{ins-seg}} + \lambda_{\text{exist}} L_{\text{exist}}$$

where $L_{\text{detr}}$ is the DETR-style detection loss, $L_{\text{ins-seg}}$ the per-instance segmentation loss, $L_{\text{seg}}$ the global mask loss, and $L_{\text{exist}}$ the existence BCE.
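As a worked example of such a multi-term objective, the snippet below computes a weighted sum of named loss terms. The loss values and weights are placeholders chosen for illustration, not InstanceVG's actual hyperparameters.

```python
# Weighted multi-term training objective: L_total = sum_k (lambda_k * L_k).
# Values below are illustrative placeholders, not published hyperparameters.

def total_loss(losses, weights):
    assert losses.keys() == weights.keys()
    return sum(weights[k] * losses[k] for k in losses)

L = total_loss(
    {"detr": 0.8, "seg": 0.5, "ins_seg": 0.6, "exist": 0.1},
    {"detr": 1.0, "seg": 2.0, "ins_seg": 2.0, "exist": 0.5},
)
# 0.8 + 1.0 + 1.2 + 0.05 = 3.05
```

In practice the weights balance terms with very different scales (e.g., a BCE existence loss versus a summed per-instance Dice loss).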

Ablation studies highlight that joint GREC+GRES optimization, instance-level alignment, adaptive queries, and explicit negative handling each contribute 1–3 point boosts in cIoU/gIoU (Dai et al., 17 Sep 2025, Nguyen et al., 2024, Wang et al., 2 Jan 2025, Luo et al., 2024).

5. Quantitative Results and Comparative Evaluation

State-of-the-art models achieve substantial gains on gRefCOCO and related benchmarks. Typical performance on gRefCOCO-val, cIoU/gIoU (as reported in (Dai et al., 17 Sep 2025, Ding et al., 8 Jan 2026, Xia et al., 2023)):

| Method | cIoU | gIoU | N-acc (%) |
|---|---|---|---|
| ReLA | 62.4 | 63.6 | 56 |
| CoHD | 65.17 | 68.42 | 63.68 |
| MABP | 65.72 | 68.86 | 62.18 |
| InstAlign | 68.94 | 74.34 | 79.72 |
| RAS (LMM) | 70.48 | 74.64 | 69.05 |
| Latent-VG | 68.23 | 72.45 | 70.42 |
| GSVA-13B (LLM) | 68.01 | 70.04 | 65.36 |
| InstanceVG | 69.31 | 73.36 | — |

Top methods, such as InstanceVG, InstAlign, RAS, and Latent-VG, report improvements of 4–12 points in gIoU or N-acc over predecessors, with further gains on Ref-ZOM, R-RefCOCO/+, and 3D-GRES datasets (Dai et al., 17 Sep 2025, Nguyen et al., 2024, Cao et al., 5 Jun 2025, Yu et al., 7 Aug 2025). No-target detection accuracy remains a challenge, with the best models achieving 65–80% (Nguyen et al., 2024). Set-based ablations, hierarchy modeling, adaptive mask grouping, and multi-task training regimes are consistently shown to yield additive improvements.

6. Extensions: 3D-GRES, Domain and Groupwise Generalization

GRES concepts have been extended to:

  • 3D-GRES: Predicting instance sets for zero/one/multiple targets over 3D point clouds, supported by architectures such as MDIN (Text-driven Sparse Queries, Multi-object Decoupling) and IPDN (fusing 2D-CLIP and 3D features with prompt-aware decoding) (Wu et al., 2024, Chen et al., 9 Jan 2025).
  • Aerial and Remote Sensing: The Aerial-D dataset and RSRefSeg model provide GRES benchmarks and unified architectures for large-scale, multi-condition aerial imagery, handling dense scenes and variable historical imaging conditions through synthetic filtering and LLM-augmented expression generation (Marnoto et al., 8 Dec 2025).
  • Group-wise and Video: Grouped GRES extends RES/GRES to multiple images, associating queries to subsets of an image collection, with models like GRSer exploiting negative-aware mirror training and within-group visual prototypes (Wu et al., 2023).

7. Open Challenges and Future Directions

Despite substantial progress, GRES remains an open problem for:

  • Robust no-target and fine-grained multi-target discrimination, especially with ambiguous or compositional expressions (Ding et al., 8 Jan 2026).
  • Hierarchical and relational expression grounding (e.g., possession, exclusion, spatial logic).
  • Compositional generalization: Meta-learning frameworks (MCRES) encourage models to generalize to novel attribute-object pairings never seen in training (Xu et al., 2023).
  • Scalable open-modal interaction: LLM-based mask grouping, omnimodal prompts, and 3D/audio/temporal extensions are rapidly evolving; efficient unification across modalities and reasoning types remains unsolved (Xia et al., 2023, Cao et al., 5 Jun 2025, Ding et al., 1 Aug 2025).
  • Unified benchmarks and protocols that reflect real-world tasks with open world, conversational, and interactive referent grounding (Ding et al., 1 Aug 2025, Ding et al., 8 Jan 2026).

GRES thus provides both a practical extension of RES and an open-ended, high-complexity challenge for multimodal understanding, dataset construction, and large-model–based visual-linguistic reasoning. Continued advances are anticipated in instance segmentation, negative case modeling, relational language grounding, and unified evaluation and application across imaging modalities.
