Referring-Expression Segmentation
- Referring-Expression Segmentation is the task of creating accurate pixel-level masks for image regions based on detailed natural language descriptions.
- Recent advancements include multi-target, 3D, and temporal variants that leverage transformer architectures and cross-modal attention for improved precision.
- Diverse supervision regimes—from fully to omni-supervised methods—drive the development of robust, scalable models across various real-world applications.
Referring-Expression Segmentation (RES) is the task of generating a precise segmentation mask for the region(s) in an image or scene singled out by a natural-language expression. In contrast to closed-vocabulary segmentation or instance detection, RES demands pixel-level localization conditioned on free-form linguistic descriptions that may reference objects, regions, parts, attributes, spatial relationships, and more. The field bridges vision–language understanding, structured prediction, and multimodal representation learning, and has rapidly evolved in both dataset ambition and methodological sophistication.
1. Problem Formulation and Task Variants
The classical formulation considers, for an image I and expression E, the prediction of a binary mask M ∈ {0,1}^{H×W} such that M_ij = 1 if pixel (i, j) belongs to the object described by E (Liu et al., 2023). Early benchmarks (e.g., RefCOCO, RefCOCO+, RefCOCOg) operated under the assumption that the expression refers to a single object instance present in the image.
Recent extensions generalize the task in several dimensions:
- Multi-target and no-target GRES: Generalized RES (GRES) allows the expression to refer to zero, one, or multiple objects, producing both a foreground mask and a "no-target" flag (Liu et al., 2023, Ding et al., 8 Jan 2026, Nguyen et al., 2024).
- Group-wise GRES: The expression is evaluated over a collection of related images, requiring both referent existence prediction and mask generation per image (Wu et al., 2023).
- Part-level and MRES: Multi-granularity RES (MRES) tasks consider expressions that may refer to object parts or subregions, necessitating mask prediction at a finer semantic granularity (Wang et al., 2023).
- 3D and temporal variants: 3D-RES and 3D-GRES extend RES to point cloud data, while video RES (e.g., RefVOS) addresses temporally consistent masks for video sequences (Chen et al., 3 Mar 2026, Wu et al., 2024, Bellver et al., 2020).
- Aerial and domain-adapted RES: Datasets such as Aerial-D extend RES to aerial and satellite images with diverse resolutions and challenging object densities (Marnoto et al., 8 Dec 2025).
The expansion of task definition reflects underlying practical demands and has guided the design of new architectures and benchmarks.
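The generalized output contract described above (a foreground mask plus a no-target flag, with multi-target expressions covered by a union of instance masks) can be sketched in a few lines. This is an illustrative sketch only; `gres_output` and the toy shapes are assumptions, not any paper's API:

```python
import numpy as np

# Hedged sketch of a GRES prediction: a foreground mask plus a "no-target"
# flag, so one expression may match zero, one, or several instances.
def gres_output(instance_masks, shape):
    """Union the masks of all matched instances; an empty list means no target."""
    if not instance_masks:
        return np.zeros(shape, dtype=bool), True
    return np.logical_or.reduce(instance_masks), False

m1 = np.zeros((8, 8), dtype=bool); m1[0:2, 0:2] = True   # first instance
m2 = np.zeros((8, 8), dtype=bool); m2[4:6, 4:6] = True   # second instance
mask, no_target = gres_output([m1, m2], (8, 8))           # multi-target union
empty_mask, is_empty = gres_output([], (8, 8))            # no-target case
```

The union-plus-flag view is what separates GRES evaluation (gIoU, N-acc) from the classical single-instance protocol.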
2. Model Architectures and Methodological Advances
A wide array of network designs has been proposed for referring-expression segmentation:
- Cross-modal fusion: Contemporary models integrate visual and textual information via bilinear attention, cross-modal self-attention (CMSA), or segment-level feature alignment. CMSA captures fine-grained dependencies between all visual spatial positions and word tokens, while gated multi-level fusion (GMLF) enables adaptive mixing of features at multiple network depths (Ye et al., 2021).
- Transformers and region-based modules: Transformer-based RES decoders dominate current benchmarks, often employing a vision transformer (ViT/CLIP) for image encoding, a contextual language encoder (BERT/MPNet), and stacking interleaved self- and cross-attention blocks for fusion (Wang et al., 2023, Yu et al., 7 Aug 2025, Nguyen et al., 2024). Region-based methods (e.g., ReLA) adaptively partition the image into soft spatial regions and explicitly model both region–language and region–region relationships, crucial for multi-target and complex relational queries (Liu et al., 2023, Ding et al., 8 Jan 2026).
- Instance-aware multi-query design: Models such as InstAlign and 3D-GRES's MDIN maintain a bank of object-level queries/tokens, each predicting a candidate mask and aligned directly to text phrases or spatial words, supervised via instance-level Hungarian matching and phrase–object alignment losses (Nguyen et al., 2024, Wu et al., 2024).
- Latent expression generation: Latent-VG enriches vision–language alignment by generating multiple "latent expressions," each capturing diverse visual attributes or subregion cues inherent in the target, then aggregates predictions for robust masking (Yu et al., 7 Aug 2025).
- Temporal extension for video/VOS: Temporal self-attention (CFSA) and ConvLSTM-based architectures propagate referring cues across frames for consistency in video RES (Ye et al., 2021, Bellver et al., 2020).
- Part-level and multi-granularity: Unified models (e.g., UniRES) apply group tokens at multiple vision transformer depths, enabling both object- and part-level grouping, with language-guided region filters for granularity-adaptive segmentation (Wang et al., 2023).
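As an illustration of the cross-modal fusion idea, the following minimal sketch lets every flattened visual position attend over all word tokens. It uses random projections and numpy only; it is not the exact CMSA layer, and all names and dimensions are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy cross-modal attention: visual positions (queries) attend over word
# tokens (keys/values), yielding language-conditioned visual features.
def cross_attention(visual, words, d=16, rng=np.random.default_rng(0)):
    # visual: (HW, Dv) flattened feature map; words: (T, Dw) token embeddings
    Wq = rng.normal(size=(visual.shape[1], d))
    Wk = rng.normal(size=(words.shape[1], d))
    Wv = rng.normal(size=(words.shape[1], d))
    q, k, v = visual @ Wq, words @ Wk, words @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))   # (HW, T) position-word weights
    return attn @ v                        # (HW, d) fused features

vis = np.random.default_rng(1).normal(size=(64, 32))   # 8x8 map, 32-dim
txt = np.random.default_rng(2).normal(size=(5, 24))    # 5 tokens, 24-dim
fused = cross_attention(vis, txt)
```

Real models stack such blocks (interleaved with self-attention) at multiple depths and learn the projections end-to-end.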
These architectural advances enable the field to handle open-vocabulary, compositional, and multi-object referring tasks in both 2D and 3D domains.
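The instance-level Hungarian matching used by the multi-query designs above assigns each ground-truth instance to exactly one object query so that the total cost (e.g., 1 minus mask IoU) is minimal. A brute-force sketch for tiny sizes follows; `match_queries` is a hypothetical name, and production code would use a proper Hungarian solver such as `scipy.optimize.linear_sum_assignment`:

```python
import itertools
import numpy as np

# Brute-force bipartite matching: pick one distinct query per ground-truth
# instance, minimizing the summed assignment cost. Exponential, demo only.
def match_queries(cost):
    """cost: (num_queries, num_gt) array; returns a list of (query, gt) pairs."""
    nq, ng = cost.shape
    best, best_perm = float("inf"), None
    for perm in itertools.permutations(range(nq), ng):  # perm[g] = query for gt g
        total = sum(cost[q, g] for g, q in enumerate(perm))
        if total < best:
            best, best_perm = total, perm
    return [(q, g) for g, q in enumerate(best_perm)]

cost = np.array([[0.9, 0.1],
                 [0.2, 0.8],
                 [0.5, 0.5]])   # 3 queries, 2 ground-truth instances
pairs = match_queries(cost)     # -> [(1, 0), (0, 1)], total cost 0.2 + 0.1
```

Unmatched queries are typically supervised toward a "no object" prediction, which is also how no-target expressions fall out of this design.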
3. Supervision Regimes: Fully, Weakly, Semi, and Omni-Supervised Learning
The supervision spectrum in RES has expanded to address annotation cost and practical scalability:
- Fully supervised RES: Classical models require pixel-level mask annotations for each (image, expression) pair, which is expensive at scale (Chen et al., 2019, Hu et al., 2016).
- Weakly supervised RES: TSEG achieves segmentation with only image-level expression associations, leveraging a multi-label patch assignment (MPA) to infer masks from patch–text similarities, closing a substantial fraction of the performance gap relative to full supervision (Strudel et al., 2022).
- Semi-supervised RES: RESMatch pioneers teacher–student consistency-based SSL for RES, introducing revised augmentation strategies and text augmentation to counter the domain shift and annotation sparsity in referring-expression settings (Zang et al., 2024).
- Omni-supervised learning: Omni-RES fuses fully labeled, point-labeled, box-labeled, and unlabeled data using a teacher–student paradigm with active pseudo-label refinement (APLR); weak labels act as yardsticks to select/refine high-fidelity pseudo-masks, achieving nearly fully supervised performance with as little as 10% mask supervision (Huang et al., 2023).
- Bootstrapping and pseudo-labeling: SafaRi employs cross-modal attention regularization and zero-shot proposal scoring (SpARC) to validate pseudo-masks, enabling high accuracy in both full and weakly supervised regimes with small annotation budgets (Nag et al., 2024).
This supervision diversity reflects practical constraints and application needs, and defines the trajectory toward scalable, robust RES systems.
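As a sketch of the yardstick idea behind pseudo-label refinement, a teacher's pseudo-mask can be accepted only when its bounding box agrees with a weak box label. The function names and the 0.5 threshold here are illustrative assumptions, not the APLR algorithm itself:

```python
import numpy as np

def mask_to_box(mask):
    """Tight bounding box (y0, x0, y1, x1) of a binary mask."""
    ys, xs = np.nonzero(mask)
    return ys.min(), xs.min(), ys.max(), xs.max()

def box_iou(a, b):
    """IoU of two inclusive integer boxes (y0, x0, y1, x1)."""
    y0, x0 = max(a[0], b[0]), max(a[1], b[1])
    y1, x1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, y1 - y0 + 1) * max(0, x1 - x0 + 1)
    area = lambda r: (r[2] - r[0] + 1) * (r[3] - r[1] + 1)
    return inter / (area(a) + area(b) - inter)

# Keep a teacher's pseudo-mask only if its box agrees with the weak box label.
def accept_pseudo_mask(mask, weak_box, thr=0.5):
    return mask.any() and box_iou(mask_to_box(mask), weak_box) >= thr
```

Point labels can be used analogously (the point must fall inside the pseudo-mask), giving a cheap filter before the student is trained on the surviving masks.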
4. Datasets and Evaluation Protocols
RES research relies on a spectrum of datasets and metrics, with evolving annotation richness:
- 2D image datasets: RefCOCO, RefCOCO+, RefCOCOg (single-object focus) (Liu et al., 2023, Chen et al., 2019); gRefCOCO and GRD for generalized/multi-target/no-target settings (Liu et al., 2023, Wu et al., 2023, Ding et al., 8 Jan 2026).
- Multi-granularity and part-level: RefCOCOm extends RefCOCO with dense part-level masks and expressions (Wang et al., 2023); MRES-32M is a large-scale, automatically mined multi-granularity set for pretraining (Wang et al., 2023).
- 3D scene and point cloud datasets: ScanRefer and DetailRefer for instance and phrase-level 3D-RES/DRES (Chen et al., 3 Mar 2026); Multi3DRes for GRES in 3D with zero/one/multi-target queries (Wu et al., 2024).
- Aerial and domain-specific datasets: Aerial-D curates over 1.5M expressions on 37K images (across object, group, semantic levels), using algorithmic and LLM-driven expression synthesis (Marnoto et al., 8 Dec 2025).
- Group-wise and video datasets: GRD for group-wise GRES (Wu et al., 2023); DAVIS-2017 and A2D with sentence-level queries for video RES (Bellver et al., 2020).
- Metrics: Mean IoU (mIoU), overall IoU (oIoU), cIoU and gIoU (for generalized settings), accuracy@τ, and, for no-target expressions, N-acc (correct empty prediction) and T-acc (correct non-empty prediction) are used (Liu et al., 2023, Nguyen et al., 2024, Wu et al., 2023, Wang et al., 2023, Marnoto et al., 8 Dec 2025).
Dataset design now embraces compositional queries, phrase–mask associations, no-target cases, and part-level grounding, enabling comprehensive evaluation of model robustness and generalization.
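The core metrics can be sketched as follows. Dataset-specific conventions for cIoU/gIoU vary slightly; this sketch follows the common reading that mIoU averages per-sample IoU while oIoU accumulates intersections and unions over the whole split:

```python
import numpy as np

def iou(pred, gt):
    """Per-sample IoU; an empty-vs-empty pair counts as a correct no-target."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def miou(preds, gts):
    """Mean IoU: average of per-sample IoUs."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))

def oiou(preds, gts):
    """Overall IoU: cumulative intersection over cumulative union."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return inter / union

def n_acc(preds, gts):
    """No-target accuracy: fraction of empty-GT samples predicted empty."""
    empties = [(p, g) for p, g in zip(preds, gts) if not g.any()]
    return sum(not p.any() for p, _ in empties) / len(empties)
```

Note that oIoU weights large objects more heavily than mIoU, which is one reason generalized benchmarks report both alongside N-acc/T-acc.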
5. Key Challenges and Model Limitations
Despite significant advances, several challenges remain open:
- Complex relationship modeling: Multi-object and exclusionary expressions ("all people except the man in blue") stress models' ability to handle logical constructs and relational reasoning (Liu et al., 2023, Nguyen et al., 2024, Ding et al., 8 Jan 2026).
- Instance-level differentiation: Global mask approaches often merge distinct instances or fail to disentangle overlapping attributes; instance-aware token approaches (e.g., InstAlign, MDIN) are necessary but still face challenges with attribute compositionality (Nguyen et al., 2024, Wu et al., 2024).
- Fine-grained part segmentation: Grounding at part level (e.g., "the right ear of the cat") tests the model's spatial resolution, attention mechanism, and dataset quality (Wang et al., 2023).
- No-target detection and open-vocabulary generalization: Accurate rejection of absent referents (high N-acc) is nontrivial and sensitive to subtle expression-image mismatches (Liu et al., 2023, Nguyen et al., 2024, Wu et al., 2023).
- Video and temporal reasoning: Existing models extract limited benefit from motion and static verbs; temporally consistent and action-aware architectures are underdeveloped (Bellver et al., 2020).
- Weak and semi-supervised robustness: Pseudo-label quality, annotation sparsity, and noise in point/box labels remain bottlenecks despite innovations in APLR and consistency regularization (Huang et al., 2023, Zang et al., 2024, Nag et al., 2024).
A plausible implication is that further advances will require deeper integration of compositional language understanding, explicit reasoning over scene graphs or instance graphs, and improved uncertainty modeling.
6. Future Directions and Open Research Problems
Referring-Expression Segmentation continues to expand along several axes:
- Compositional and hierarchical reasoning: Meta-learning frameworks (e.g., MCRES) that target generalization to novel word/phrase compositions help close the gap between seen and unseen expression combinations (Xu et al., 2023).
- Instance- and phrase-level supervision: 3D-DRES and detailed 3D-GRES establish new paradigms for explicit phrase→object mapping, opening research into hierarchical and nuanced 3D-vision–language grounding (Chen et al., 3 Mar 2026, Wu et al., 2024).
- Scaling and pretraining: Large-scale datasets (MRES-32M, Aerial-D) and adaptation of foundation models (CLIP, SAM, SigLIP2, LLMs) catalyze zero-shot transfer, robust segmentation under degraded imagery, and cross-domain generalization (Wang et al., 2023, Marnoto et al., 8 Dec 2025).
- Unified multitask architectures: Joint training for segmentation, comprehension (REC), and generation (REG) (as in GREx) promotes backward compatibility with classic tasks and generalizes to generation and descriptive feedback loops (Ding et al., 8 Jan 2026).
- Weakly and omni-supervised frameworks: Increasing focus on cost-effective annotation protocols and principled bootstrapping (Omni-RES, SafaRi, TSEG, RESMatch) is expected to broaden deployability (Strudel et al., 2022, Huang et al., 2023, Nag et al., 2024, Zang et al., 2024).
- Relational, group-wise, and interactive scenarios: Emerging work on groups of images, across modalities and user interactions, signals the convergence of RES with open-world and human-centric AI tasks (Wu et al., 2023, Marnoto et al., 8 Dec 2025).
7. Representative Results and Method Comparisons
The following table compares recent models on gRefCOCO, the canonical large-scale GRES dataset, using cumulative IoU (cIoU) and generalized IoU (gIoU) metrics (Liu et al., 2023, Nguyen et al., 2024, Ding et al., 8 Jan 2026):
| Method | cIoU (val) | gIoU (val) | No-target Acc. |
|---|---|---|---|
| MattNet | 47.51 | 48.24 | 41.15 |
| LAVT | 57.64 | 58.40 | 49.32 |
| ReLA | 62.91 | 63.98 | 56.29 |
| HDC | 65.42 | 68.23 | 63.38 |
| InstAlign | 68.94 | 74.34 | 79.72 |
On RefCOCO testA in the weakly- and semi-supervised regime (using 10% fully labeled data), Omni-RES achieves 80.66% oIoU with box and point refinement (ReLA backbone), surpassing fully supervised and prior semi-supervised methods by 5–15 percentage points (Huang et al., 2023, Zang et al., 2024).
Leading approaches for part-level RefCOCOm (mIoU, part-only) include:
- UniRES: 19.6 (val)
- LAVT: 15.3 (val)
- Prior SOTA: ≤16.2 (val), confirming the gap for fine-grained queries (Wang et al., 2023).
These results demonstrate steady increases in accuracy with advances in architectural design, data scale, and supervision diversity.
References:
- (Ye et al., 2021) Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network
- (Chen et al., 2019) Referring Expression Object Segmentation with Caption-Aware Consistency
- (Wang et al., 2023) Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation
- (Huang et al., 2023) Towards Omni-supervised Referring Expression Segmentation
- (Strudel et al., 2022) Weakly-supervised segmentation of referring expressions
- (Zang et al., 2024) RESMatch: Referring Expression Segmentation in a Semi-Supervised Manner
- (Yu et al., 7 Aug 2025) Latent Expression Generation for Referring Image Segmentation and Grounding
- (Nguyen et al., 2024) Instance-Aware Generalized Referring Expression Segmentation
- (Liu et al., 2023) GRES: Generalized Referring Expression Segmentation
- (Wu et al., 2023) Advancing Referring Expression Segmentation Beyond Single Image
- (Nag et al., 2024) SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation
- (Xu et al., 2023) Meta Compositional Referring Expression Segmentation
- (Chen et al., 3 Mar 2026) 3D-DRES: Detailed 3D Referring Expression Segmentation
- (Wu et al., 2024) 3D-GRES: Generalized 3D Referring Expression Segmentation
- (Marnoto et al., 8 Dec 2025) Generalized Referring Expression Segmentation on Aerial Photos
- (Ding et al., 8 Jan 2026) GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation
- (Bellver et al., 2020) RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation
This corpus establishes Referring-Expression Segmentation as a versatile, rapidly advancing domain within multimodal structured prediction, with ongoing research targeting compositionality, open-world robustness, and real-world scalability.