Referring Expression Recognition

Updated 17 April 2026

Referring Expression Recognition is a task that grounds natural language expressions by localizing objects in images through multi-step reasoning over attributes, spatial relations, and context.
Various methodologies, including joint embeddings, attention mechanisms, graph neural networks, and adaptive networks, are employed to fuse textual and visual information.
Evaluation uses metrics like IoU and F1 on benchmarks (e.g., RefCOCO, CLEVR-Ref+) while addressing challenges such as dataset bias, shortcut learning, and the need for interpretability.

Referring expression recognition (RER), often termed referring expression comprehension (REC), is the task of grounding a natural language expression that uniquely identifies an object (or set of objects) in an image. The core challenge is to align the open-ended, compositional semantics of free-form language with the structured visual content of complex scenes, often requiring multi-step reasoning over attributes, spatial relations, and context. RER systems must parse and fuse language and vision to localize the referent(s), and rigorous evaluation demands coverage of compositionality, dataset bias, efficiency, and interpretability.

1. Formal Task Definition and Foundations

Given an image $I$ and a referring expression $S$ (a sequence of words), the objective of RER is to localize the object(s) referred to by $S$. The most common instantiations are bounding-box localization (REC) and pixelwise segmentation (referring expression segmentation, RES). For standard single-target REC, the output is a region $o^* \in \{o_1,\ldots,o_N\}$ selected from a candidate set so as to maximize a score $S(o \mid I,S)$:

$o^* = \arg\max_{o_i} S(o_i | I, S)$

The scoring function typically measures compatibility via a learned cross-modal embedding, attention, or reasoning architecture. Recent generalizations have extended the task to multi-target and no-target expressions, with output being a set of regions or an empty mask (Ding et al., 8 Jan 2026).
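As a minimal sketch of the selection rule above, the snippet below scores each candidate region against the expression and takes the argmax. Cosine similarity stands in for a learned scoring function, and the names `select_referent`, `region_feats`, and `text_emb` are hypothetical, chosen only for illustration.

```python
import numpy as np

def select_referent(region_feats, text_emb):
    """Pick the best-scoring candidate region for one expression.

    region_feats: (N, D) array, one feature row per candidate region o_i.
    text_emb:     (D,) array encoding the expression.
    Cosine similarity is an assumed placeholder for a learned score.
    """
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    scores = r @ t                      # one compatibility score per region
    return int(np.argmax(scores)), scores

# Toy example: three 2-D region features and one expression embedding.
regions = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
expr = np.array([0.0, 2.0])
best, scores = select_referent(regions, expr)
print(best, scores)
```

In a real system the two embeddings would come from trained visual and language encoders; only the final argmax step is shared with the formulation above.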

Training objectives include maximum likelihood, cross-entropy over region distributions, and ranking losses. Metrics are predominantly accuracy at an intersection over union (IoU) threshold for REC, and mean IoU for RES. Generalized settings require $F_1$ matches (exact match of all and only target objects at IoU thresholds), and explicit no-target recognition (Ding et al., 8 Jan 2026).
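The two classic metrics can be sketched in a few lines. This is an illustrative implementation under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, the threshold defaults to 0.5, and an empty mask union is scored as 0. The function names are hypothetical, and individual benchmarks may differ in details such as strict versus non-strict thresholds.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(preds, gts, thresh=0.5):
    """Fraction of expressions whose predicted box reaches the IoU threshold."""
    hits = sum(box_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

def mask_iou(pred, gt):
    """IoU of two binary masks; an empty union is scored 0 (an assumption)."""
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 0.0

preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
gts = [(1, 1, 3, 3), (0, 0, 2, 1.5)]
acc = accuracy_at_iou(preds, gts)              # first pair misses, second hits
miou = mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])
print(acc, miou)
```

Mean IoU for RES would then be the average of `mask_iou` over all test expressions.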

2. Model Architectures and Methodological Taxonomy

The evolution of RER methodology proceeds through several increasingly expressive model families:

Joint Embedding Models: Early approaches concatenate or otherwise fuse region visual features (e.g., CNN fc7 activations) with RNN/LSTM-encoded text embeddings, and score pairs via inner product or MLP (Yu et al., 2016). These can incorporate additional features for relative positioning and category-level comparisons.
Multiple Instance Learning (MIL): To model spatial and contextual relationships (e.g., "the man to the left of the car"), MIL-based architectures consider region–context pairs and employ pooling (max, noisy-or) to marginalize latent context objects (Nagaraja et al., 2016). These improve disambiguation in scenes with relational or ambiguous cues.
Attention and Modular Networks: Attention mechanisms enable fine-grained weighting over words and visual regions, improving alignment in compositional or ambiguous cases. Modular networks parse expressions into sub-phrases or graph structures, dynamically assembling neural modules (e.g., “Locate”, “Relate”, “Intersect”) corresponding to linguistic structure (Cirik et al., 2018, Yang et al., 2020). Explicit syntactic parsing can map noun phrases to region modules and prepositional phrases to relation modules; these models are interpretable and compositional (Cirik et al., 2018).
Scene Graph and Graph Neural Models: Graph-enhanced models represent the image and the referring expression as matched graphs (nodes: objects/noun phrases, edges: relations). Neural module networks operate over these semantic graphs, aligning the linguistic scene graph to the visual scene graph and propagating attention through corresponding structures (Yang et al., 2020).
Proposal-Free (One-stage) Models: Inspired by modern object detection (e.g., YOLO), these architectures directly regress bounding boxes or segmentation masks from global image–text features, without explicit region proposals (Chen et al., 2018). Typically, a multimodal interactor computes attention over image regions conditioned on text, followed by direct regression.
Dynamic and Adaptive Networks: Recognizing that the reasoning and visual pathways required for different expressions vary greatly, recent architectures extract language-adaptive subnets from a large supernet using gating mechanisms conditioned on the input expression. These subnets enable efficient, expression-specific inference by pruning layers and channels dynamically (Su et al., 2023).
Contrastive Latent Expression Models: To mitigate the mismatch between sparse language and dense visual cues, methods generate multiple “latent expressions” representing complementary visual attributes and enforce alignment between all such variants and the original text using margin-based contrastive loss, leading to robust segmentation and reasoning (Yu et al., 7 Aug 2025).
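To make the MIL pooling step in the list above concrete, the sketch below marginalizes over latent context objects with either max pooling or noisy-or pooling. It assumes the region–context pair scores have already been computed and, for noisy-or, squashed to probabilities; the array layout and function names are hypothetical.

```python
import numpy as np

def mil_max(pair_scores):
    """Max pooling: keep the best context object for each candidate region.

    pair_scores: (N, K) array, entry [i, j] scores region i with context j.
    """
    return pair_scores.max(axis=1)

def mil_noisy_or(pair_probs):
    """Noisy-or pooling: 1 - prod_j (1 - p_ij), so any context can 'fire'."""
    return 1.0 - np.prod(1.0 - pair_probs, axis=1)

# Toy example: 3 candidate regions, 2 possible context objects each.
pair_probs = np.array([[0.9, 0.1],
                       [0.5, 0.5],
                       [0.2, 0.3]])
max_scores = mil_max(pair_probs)
nor_scores = mil_noisy_or(pair_probs)
referent = int(np.argmax(nor_scores))
print(max_scores, nor_scores, referent)
```

Max pooling commits to a single context object, whereas noisy-or lets several moderately supportive context objects accumulate evidence for the same region.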

3. Dataset Landscape, Biases, and Generalization

The field relies on several core datasets:

RefCOCO, RefCOCO+, RefCOCOg: Anchored in MS COCO, these benchmarks focus on single-object recognition, with RefCOCO+ restricting positional terms and RefCOCOg promoting longer, more complex expressions. Only objects appearing at least twice in an image are annotated, introducing bias toward frequent categories and under-sampling rare or singleton objects (Cirik et al., 2018).
Ref-Reasoning: Automatically generated from GQA/Visual Genome scene graphs, this dataset introduces deeply compositional referring expressions spanning 1–5 graph nodes and systematically balances reasoning depth. Its expressions require grounding attributes, spatial, and relational clauses (Yang et al., 2020).
Cops-Ref and CLEVR-Ref+: Cops-Ref employs a synthetic expression engine to produce recursively-nested, programmatic logical expressions operating over real-world scene graphs; evaluation mandates discrimination among “hard” distractor images, demanding full-chain reasoning (Chen et al., 2020). CLEVR-Ref+ uses synthetic 3D scenes and automatically generated programs, providing ground truth for all intermediate reasoning steps (Liu et al., 2019).
Generalized Datasets: GREx/gRefCOCO benchmarks support multi-target and no-target expressions, introducing new modes (GRES/GREC/GREG) for segmentation, comprehension, and generation. These extend classical datasets, enabling comprehensive generalization analysis (Ding et al., 8 Jan 2026).
Domain-Specific and Robustness Datasets: SOREC targets extremely small objects in driving contexts (Goto et al., 4 Oct 2025), KB-Ref introduces the need for visual commonsense knowledge (Wang et al., 2020), Aerial-D addresses robustness to varying resolution and historic image styles in aerial imagery (Marnoto et al., 8 Dec 2025), and Ref-Adv constructs high-reasoning, low-shortcut benchmarks that expose shallow model behavior even in state-of-the-art MLLMs (Dong et al., 27 Feb 2026).

Systematic ablations demonstrate that on classic datasets, models can achieve high accuracy without fully exploiting linguistic structure—image-only or category-only models often rival or surpass linguistically structured architectures, exposing annotation shortcuts and bias (Cirik et al., 2018). By contrast, Cops-Ref and Ref-Adv expose substantial accuracy drops when models are forced to reason compositionally or deprived of shortcut cues (Chen et al., 2020, Dong et al., 27 Feb 2026).

4. Training Objectives, Losses, and Optimization Techniques

Most REC systems employ cross-entropy or ranking losses over image region distributions, often with hard negative mining, triplet or margin-based penalties, and auxiliary tasks:

Cross-Entropy (Softmax) Loss: Given predicted scores $\lambda_{ref,i}$ for candidate regions $o_i$, the model applies a softmax over the candidates:

$p_i = \frac{\exp(\lambda_{ref, i})}{\sum_{n=1}^N \exp(\lambda_{ref, n})}$

The loss is $\mathcal{L} = -\log p_{gt}$, where $gt$ indexes the true referent (Yang et al., 2020).

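Ranking/Triplet Loss: Ensures that correct region-expression pairs score higher than negative pairs by a margin $m$ (Yu et al., 2016). In a generic hinge form, with each negative pair coupling the expression with a non-referent region $o_j$:

$\mathcal{L}_{rank} = \sum_{o_j \neq o^*} \max\big(0,\; m + S(o_j \mid I, S) - S(o^* \mid I, S)\big)$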

Contrastive and Margin Losses: Used in latent expression generation so that multiple generated variants stay within a margin of the original expression embedding without collapsing onto a single representation (Yu et al., 7 Aug 2025).
Auxiliary Supervision: Guided attention (center-focused losses), attribute prediction (multi-label BCE), and supporting-object supervision (auxiliary modules or annotations) can be integrated to improve model localization and interpretability (Chen et al., 2018, Cirik et al., 2018).
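The softmax cross-entropy and margin ranking losses listed above can be illustrated on a single expression with raw region scores. This is a sketch under assumptions, not any cited paper's exact formulation: negatives are taken to be all non-referent regions, and `margin` is a hypothetical hyperparameter name.

```python
import numpy as np

def softmax_cross_entropy(scores, gt):
    """Negative log-probability of the true region under a softmax over regions."""
    z = scores - scores.max()                 # shift for numerical stability
    log_p = z - np.log(np.exp(z).sum())
    return float(-log_p[gt])

def margin_ranking_loss(scores, gt, margin=0.5):
    """Hinge penalty for every negative region within `margin` of the referent."""
    negatives = np.delete(scores, gt)
    return float(np.maximum(0.0, margin + negatives - scores[gt]).sum())

scores = np.array([2.0, 1.0, 0.0])            # one score per candidate region
ce = softmax_cross_entropy(scores, gt=0)
rank_ok = margin_ranking_loss(scores, gt=0)   # referent already wins by the margin
rank_bad = margin_ranking_loss(scores, gt=1)  # region 0 outscores the referent
print(ce, rank_ok, rank_bad)
```

The cross-entropy term always pushes the referent's probability up, while the hinge term goes to zero once every negative is separated by the margin, which is why the two are sometimes combined.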

Optimization strategies frequently combine SGD or Adam(W) with scheduled learning rate decay, weight decay regularization, and early stopping (Yu et al., 2016, Su et al., 2023).

5. Advances in Reasoning, Interpretability, and Efficiency

Addressing the need for transparent and robust reasoning, modern methods implement compositional and interpretable grounding pipelines:

Module Networks and Graph-Structured Reasoning: By parsing expressions into trees or graphs, models execute stepwise neural computations aligned with semantic structure, producing explicit intermediate attention maps for each phrase or relational clause (Cirik et al., 2018, Yang et al., 2020). This enables per-phrase tracing of visual evidence, supporting detailed error analysis and module-level supervision.
Scene Graph Alignment: Joint visual-semantic graphs capture multi-object relations explicitly, yielding higher accuracy than flat or monolithic approaches on expressions that require relational and compositional inference (Yang et al., 2020).
Domain Adaptivity and Dynamic Pruning: Architectures such as LADS dynamically prune the backbone and transformer submodules at inference time, conditioned on the specific referring expression, yielding significant improvements in speed and reducing computational cost while preserving or improving accuracy (Su et al., 2023).
Contrastive Latent Expression Generation: Methods generating multiple latent variants of the referring expression from visual cues capture under-specified attributes and improve robustness, notably for GRES (Yu et al., 7 Aug 2025).
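The idea of expression-conditioned pruning mentioned in the list above can be sketched as a gate per layer, computed from the expression embedding. This toy is not the LADS mechanism itself; it only illustrates the general pattern of selecting a subnet per expression. The gate parameterization, the `threshold`, and all names are assumptions.

```python
import numpy as np

def layer_gates(text_emb, gate_weights, threshold=0.5):
    """One keep/skip decision per layer from a sigmoid over a linear projection."""
    logits = gate_weights @ text_emb
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs >= threshold

def run_subnet(x, layers, gates):
    """Apply only the layers whose gate is open; skipped layers cost nothing."""
    for layer, keep in zip(layers, gates):
        if keep:
            x = layer(x)
    return x

# Toy "supernet" of three scalar layers and a hypothetical gate matrix.
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
gate_weights = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])

gates_a = layer_gates(np.array([2.0, -1.0]), gate_weights)   # keeps layer 0 only
gates_b = layer_gates(np.array([-1.0, 2.0]), gate_weights)   # keeps layer 1 only
out_a = run_subnet(3, layers, gates_a)
out_b = run_subnet(3, layers, gates_b)
print(gates_a, out_a, gates_b, out_b)
```

Hard thresholding as written is not differentiable, so a trainable version would need a relaxation or a straight-through estimator; that choice is left open here.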

The integration of auxiliary tasks (e.g., language generation with consistency loss, attribute discrimination, count prediction) further improves feature alignment and model generalization (Chen et al., 2019, Ding et al., 8 Jan 2026).

6. Remaining Challenges and Future Directions

Despite progress in expressive model design and dataset construction, key challenges persist:

Dataset Bias and Shortcut Exploitation: Empirical analyses show that many RER systems rely on shallow cues or dataset-specific shortcuts rather than robust, compositional semantics. On classic benchmarks, strong performance can arise from category or salience heuristics alone (Cirik et al., 2018, Dong et al., 27 Feb 2026). Benchmarks such as Cops-Ref, CLEVR-Ref+, and Ref-Adv were built to probe systematically for genuine visual–linguistic reasoning.
Compositional, Multi-Object, and No-Target Reasoning: Extending RER to multi-target and no-target settings exposes substantial performance drops in standard models, demonstrating the need for explicit relationship modeling, counting, rejection mechanisms, and novel generation strategies (Ding et al., 8 Jan 2026).
Robustness to Small, Dense, or Outlier Domains: Precision on small objects (e.g., for autonomous driving) or in aerial imagery falls well below that for large, salient objects. Targeted datasets (SOREC, Aerial-D) and architectural adaptations (iterative zooming, domain-aware segmentation) directly address these limitations but often require elaborate search or adaptation (Goto et al., 4 Oct 2025, Marnoto et al., 8 Dec 2025).
Commonsense and Knowledge Integration: Non-visual referents (e.g., affordances, functions) demand integration of external knowledge bases, which sharply raises difficulty and exposes the limits of vision-only models (Wang et al., 2020).
Interpretability and Intermediate Supervision: Ongoing efforts include the provision and use of intermediate groundings (e.g., supporting-object localization accuracy, functional program traces, module visualizations) to make systems debuggable and trainable at an interpretable level (Cirik et al., 2018, Liu et al., 2019).
Scaling and Pre-training: Vision–language pre-training (transformers, contrastive models) has increased overall performance but at the expense of interpretability, with full end-to-end reasoning chains either implicit or opaque (Qiao et al., 2020).
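The multi-target and no-target setting in the list above changes what counts as a correct prediction: the model must return all and only the target boxes, or nothing at all. The sketch below scores one sample with a set-level $F_1$ using greedy IoU matching. The greedy matching strategy, the 0.5 threshold, and the function names are assumptions and may not match any benchmark's official evaluation script.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def sample_f1(pred_boxes, gt_boxes, iou_thresh=0.5):
    """Set-level F1 for one expression; empty vs. empty is a correct rejection."""
    if not pred_boxes and not gt_boxes:
        return 1.0
    if not pred_boxes or not gt_boxes:
        return 0.0
    unmatched = list(gt_boxes)
    tp = 0
    for p in pred_boxes:
        # Greedily match each prediction to its best remaining ground truth.
        best = max(unmatched, key=lambda g: box_iou(p, g), default=None)
        if best is not None and box_iou(p, best) >= iou_thresh:
            unmatched.remove(best)
            tp += 1
    fp = len(pred_boxes) - tp
    fn = len(gt_boxes) - tp
    return 2 * tp / (2 * tp + fp + fn)

targets = [(0, 0, 2, 2), (5, 5, 7, 7)]
f1_full = sample_f1(targets, targets)            # all and only the targets
f1_partial = sample_f1([(0, 0, 2, 2)], targets)  # one target missed
f1_reject = sample_f1([], [])                    # no-target, correctly rejected
f1_halluc = sample_f1([(0, 0, 2, 2)], [])        # no-target, spurious box
print(f1_full, f1_partial, f1_reject, f1_halluc)
```

Under an exact-match criterion, only samples with a score of 1.0 would count as correct, so both the partial and the spurious prediction above would be failures.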

Key future directions include (i) combining pre-training with interpretable modular or graph-based pipelines, (ii) constructing datasets and evaluation protocols that force genuine multi-step reasoning, and (iii) exploring cross-domain, multilingual, and embodied extensions of the RER paradigm (Qiao et al., 2020, Ding et al., 8 Jan 2026, Dong et al., 27 Feb 2026).

Key References:

"Graph-Structured Referring Expression Reasoning in The Wild" (Yang et al., 2020)
"Visual Referring Expression Recognition: What Do Systems Actually Learn?" (Cirik et al., 2018)
"Using Syntax to Ground Referring Expressions in Natural Images" (Cirik et al., 2018)
"Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension" (Chen et al., 2020)
"CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions" (Liu et al., 2019)
"GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation" (Ding et al., 8 Jan 2026)
"Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks" (Dong et al., 27 Feb 2026)
"Referring Expression Comprehension: A Survey of Methods and Datasets" (Qiao et al., 2020)