
Referring Expression Comprehension

Updated 26 December 2025
  • Referring Expression Comprehension is the task of precisely localizing objects using natural language expressions by mapping detailed linguistic cues to visual content.
  • REC research leverages diverse modeling paradigms, including joint embedding, modular networks, and graph-based reasoning, to effectively fuse visual and textual data.
  • Advanced techniques address challenges such as compositionality, negative query handling, and small-object grounding, driving innovation in dataset design and evaluation protocols.

Referring expression comprehension (REC) denotes the task of localizing, in an image or video, the unique object (or set of objects) identified by a natural language referring expression. REC has evolved into a central multimodal reasoning problem, driving the development of algorithms adept at fusing visual content with arbitrary linguistic descriptions under open-world and compositional settings. Distinct from classical object detection, REC systems must parse free-form language—including spatial references, attributes, relationships, and sometimes external knowledge—and precisely ground these semantics to a visual region without prior constraints on the queried categories. Multiple research streams have pushed REC forward, encompassing dataset construction, neural architectures, interpretability, challenge-specific benchmarks, and task variants that require deeper forms of reasoning, compositionality, and negative grounding.

1. Formal Task Definition and Foundations

REC is typically framed as follows: given an image $I$ and a referring expression $r = (w_1, \dots, w_T)$, the goal is to predict a bounding box $b^*$ corresponding to the object referred to by $r$. Formally, a scoring function $s_\theta(I, r, b_i)$ is either optimized over a discrete set of region proposals or used to regress box parameters directly:

$$ b^* = \arg\max_{b_i} s_\theta(I, r, b_i) $$

Several output variants exist: REC may also produce segmentation masks or distributed heatmaps, or operate over temporal domains (video REC) (Qiao et al., 2020, Jiang et al., 2023, Kurita et al., 2023). For negative expressions, the model must correctly abstain from grounding when no region matches the query (Liu et al., 23 Sep 2024, Yang et al., 27 Feb 2025). The primary evaluation metric is accuracy: the proportion of queries for which the predicted box achieves Intersection-over-Union (IoU) above 0.5 with the ground-truth box.
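
For concreteness, this standard protocol fits in a few lines of Python; the following is a minimal sketch (the box format (x1, y1, x2, y2) and helper names are illustrative, not drawn from any cited paper):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def rec_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose predicted box exceeds the IoU threshold."""
    hits = sum(iou(p, g) > threshold for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)
```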

Traditionally, REC was posed as a two-stage problem: first, an object detector proposes regions; second, a cross-modal scoring model matches the expression to candidate boxes (Qiao et al., 2020, Yu et al., 2016). Contemporary research favors one-stage approaches, which integrate proposal and grounding into a single, end-to-end pipeline for improved speed and reduced error propagation (Liao et al., 2019, Zhang et al., 2022).
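
The second stage of the classical pipeline reduces, in its simplest form, to scoring proposals and taking the argmax from the equation above. A minimal sketch, assuming off-the-shelf region features and a pooled expression embedding, with dot-product similarity standing in for the learned scorer:

```python
import torch

def ground_two_stage(region_feats, expr_feat, boxes):
    """
    Two-stage REC inference sketch: score each detector proposal against the
    expression embedding and return the argmax box.

    region_feats: (N, d) features for N proposals from an off-the-shelf detector
    expr_feat:    (d,)   pooled embedding of the referring expression
    boxes:        (N, 4) proposal coordinates
    """
    scores = region_feats @ expr_feat   # s_theta as a dot-product similarity
    best = torch.argmax(scores)         # discrete argmax over proposals
    return boxes[best], scores
```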

2. Modeling Paradigms and Algorithmic Innovations

The evolution of REC architectures spans several interconnected paradigms, each targeting specific complexities of vision-language reasoning.

Joint Embedding Models: Early and canonical REC systems utilize parallel visual and language encoders (e.g., CNN for regions, RNN/BERT for text) followed by fusion through dot-product, bilinear pooling, or MCB (Qiao et al., 2020). The grounded region is the one maximizing cross-modal similarity scores. Models such as GroundeR and SLR exemplify this approach (Yu et al., 2016).
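
A minimal sketch of such a joint-embedding head is below; the dimensions and module names are illustrative, and cosine similarity stands in for the various fusion operators used in practice:

```python
import torch
import torch.nn as nn

class JointEmbeddingScorer(nn.Module):
    """Hypothetical minimal joint-embedding REC head: project region and
    expression features into a shared space and score by cosine similarity."""
    def __init__(self, vis_dim=2048, txt_dim=768, embed_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, embed_dim)   # e.g., CNN region features
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # e.g., RNN/BERT sentence vector

    def forward(self, region_feats, expr_feat):
        v = nn.functional.normalize(self.vis_proj(region_feats), dim=-1)  # (N, e)
        t = nn.functional.normalize(self.txt_proj(expr_feat), dim=-1)     # (e,)
        return v @ t                                                      # (N,) scores
```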

Modular and Compositional Networks: To address complex expressions containing multiple semantic units (subjects, relationships, locations), modular networks parse the referring expression into fragments and route them to specialized network modules (e.g., subject/attribute, spatial location, inter-object relation modules) (Qiao et al., 2020, Chen et al., 2020). Notably, MAttNet integrates a soft parser with attention-based modules, aggregating weighted scores for compositional grounding.
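
The aggregation logic common to these modular designs can be sketched as a weighted sum of per-module scores; the weights and scores below are illustrative, not MAttNet's actual values:

```python
import torch

def modular_score(subj_scores, loc_scores, rel_scores, module_weights):
    """
    MAttNet-style aggregation sketch: a soft language parser assigns weights
    (w_subj, w_loc, w_rel) summing to 1, and per-module scores over N candidate
    regions are combined into a single grounding score per region.
    """
    w_subj, w_loc, w_rel = module_weights
    return w_subj * subj_scores + w_loc * loc_scores + w_rel * rel_scores

# e.g., "the red mug left of the laptop": the parser emphasizes subject + relation
scores = modular_score(
    subj_scores=torch.tensor([0.9, 0.2, 0.4]),
    loc_scores=torch.tensor([0.5, 0.5, 0.5]),
    rel_scores=torch.tensor([0.8, 0.1, 0.3]),
    module_weights=(0.5, 0.1, 0.4),
)
best_region = torch.argmax(scores)
```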

Graph-Structured and Multi-hop Reasoning: Recognizing the graph-like structure of scenes and language, dynamic graph attention models construct explicit object-relation graphs for each image, then execute multi-step, language-conditioned message passing. The dynamic graph attention network (DGA) decomposes expressions into soft constituent phrases and performs interpretable reasoning steps over the visual graph, enabling it to handle nested or relational queries (Yang et al., 2019).
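
One language-conditioned message-passing step might look like the following sketch, which is in the spirit of DGA rather than a reproduction of its published architecture:

```python
import torch

def language_guided_message_pass(node_feats, adjacency, phrase_feat):
    """
    One step of language-conditioned message passing over an object graph.

    node_feats:  (N, d) object node features
    adjacency:   (N, N) edge weights between objects (e.g., spatial proximity)
    phrase_feat: (d,)   embedding of the current constituent phrase
    """
    # Each node's relevance to the current phrase steers which messages matter.
    relevance = torch.softmax(node_feats @ phrase_feat, dim=0)       # (N,)
    messages = adjacency @ (node_feats * relevance.unsqueeze(-1))    # (N, d)
    return node_feats + messages   # residual update; repeat once per reasoning step
```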

Cross-Modality Correlation and One-Stage Methods: To maximize efficiency, methods like RCCF model the referring expression as a 1×1 visual filter kernel that slides over image features, outputting a center heatmap and directly regressing box coordinates in a single stage. This allows real-time performance (≥40 FPS) and competitive accuracy, especially where proposal-based systems are detector-limited (Liao et al., 2019).
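
The core correlation operation is compact enough to sketch directly; the function below treats the expression embedding as a 1×1 convolution kernel (the center-offset and size regression heads are omitted):

```python
import torch
import torch.nn.functional as F

def correlation_heatmap(image_feats, expr_kernel):
    """
    RCCF-style cross-modal correlation sketch: slide the expression embedding,
    reshaped as a 1x1 kernel, over the visual feature map to get a center heatmap.

    image_feats: (1, C, H, W) backbone feature map
    expr_kernel: (C,) expression embedding mapped into the visual feature space
    """
    kernel = expr_kernel.view(1, -1, 1, 1)    # one output channel, 1x1 kernel
    heatmap = F.conv2d(image_feats, kernel)   # (1, 1, H, W) correlation scores
    return torch.sigmoid(heatmap)             # the peak marks the referent's center
```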

Language-Adaptive Inference: Since any single expression exercises only a fraction of the cues a full network can extract, dynamic subnet approaches (e.g., LADS) use a language-aware gating network to prune network layers and channels per query. This yields substantial computational savings and often increased accuracy by reducing irrelevant feature interference (Su et al., 2023).
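
A hypothetical minimal form of such language-conditioned gating (LADS itself makes discrete pruning decisions during training; a soft sigmoid mask keeps this sketch simple and differentiable):

```python
import torch
import torch.nn as nn

class LanguageGate(nn.Module):
    """Sketch of language-aware channel gating: the expression embedding
    predicts a per-channel mask that suppresses visual channels irrelevant
    to the current query. Names and dimensions are illustrative."""
    def __init__(self, txt_dim=768, n_channels=256):
        super().__init__()
        self.gate = nn.Linear(txt_dim, n_channels)

    def forward(self, vis_feats, expr_feat):
        mask = torch.sigmoid(self.gate(expr_feat))    # (n_channels,) soft gate
        return vis_feats * mask.view(1, -1, 1, 1)     # (B, C, H, W) gated features
```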

Contrastive and Synonym-Aware Training: To encourage robustness and transferability, models incorporate contrastive objectives over synonymous expressions, collapsing paraphrase codes and separating mismatches. This yields improved performance, particularly in cross-dataset and few-shot transfer scenarios (Chen et al., 2021).
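
Such an objective is typically an InfoNCE-style contrastive loss; a sketch under the assumption of L2-normalized expression embeddings:

```python
import torch
import torch.nn.functional as F

def synonym_contrastive_loss(anchor, paraphrase, negatives, temperature=0.07):
    """
    InfoNCE-style sketch for synonym-aware training: pull embeddings of
    paraphrases of the same referent together, push mismatches away.

    anchor:     (d,)    one expression for the referent
    paraphrase: (d,)    a synonymous expression (the positive)
    negatives:  (K, d)  expressions referring to other objects
    """
    pos = (anchor @ paraphrase) / temperature    # scalar similarity
    neg = (negatives @ anchor) / temperature     # (K,) similarities
    logits = torch.cat([pos.view(1), neg])
    # The positive sits at index 0; cross-entropy recovers the InfoNCE loss.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```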

Multimodal Transformers and Large Models: The advent of vision-language pretraining (e.g., UNITER, ViLBERT, VL-BERT) and MLLMs has driven state-of-the-art performance, especially in image- and video-grounding (Chen et al., 24 Jun 2024, Jiang et al., 2023, Liu et al., 23 Sep 2024). Video REC extensions integrate temporal modeling, content-conditioned region queries, and entity-level alignment losses (Jiang et al., 2023). Instruction tuning and targeted prompt design are increasingly necessary for deploying MLLMs effectively on fine-grained and compositional REC benchmarks (Yang et al., 27 Feb 2025).

Commonsense-Enhanced and Knowledge-Based Models: KB-Ref and its successors demonstrate that equipping models with structured external knowledge is critical for resolving non-visual or affordance-centric expressions. Architectures fuse attended facts from large knowledge bases with visual and textual embeddings, showing clear gains over purely visual approaches (Wang et al., 2020, Zhang et al., 2023).
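
The fusion step common to these architectures can be sketched as attention over retrieved fact embeddings (a simplified form, not the exact KB-Ref or CK-Transformer design):

```python
import torch

def attend_facts(query_feat, fact_feats):
    """
    Knowledge-fusion sketch: attend over retrieved fact embeddings with the
    fused visual-textual query and add the pooled fact vector back in.

    query_feat: (d,)   fused visual+expression feature for one candidate region
    fact_feats: (K, d) embeddings of K facts retrieved for that region's category
    """
    attn = torch.softmax(fact_feats @ query_feat, dim=0)   # (K,) fact relevance
    knowledge = attn @ fact_feats                          # (d,) weighted fact summary
    return query_feat + knowledge                          # knowledge-enriched feature
```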

Small Object and Hard Case Specialization: The SOREC benchmark and progressive zooming adapters represent targeted efforts to drive REC into the challenging regime of extremely small targets. Parameter-efficient modules (PIZA) are integrated into large models, learning iterative search policies over crops conditioned on language (Goto et al., 4 Oct 2025).
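
The iterative search idea reduces to a crop-score-zoom loop; the sketch below is heavily simplified, and `scorer.best_crop` / `scorer.locate` are hypothetical callables, not the PIZA interface:

```python
def progressive_zoom(image, expression, scorer, steps=3):
    """
    Progressive-zooming sketch: repeatedly select the most expression-relevant
    sub-crop, so a tiny referent occupies an increasingly large fraction of the
    model's input before the final localization step.
    """
    region = image
    for _ in range(steps):
        region = scorer.best_crop(region, expression)   # zoom in one level
    # Final box is predicted in crop coordinates; a real system maps it back
    # to the original image frame.
    return scorer.locate(region, expression)
```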

Zero-Shot and Combinatorial Generalization: Contemporary research leverages LLMs for rule-based or prompt-based triplet extraction (subject, predicate, object) and aligns scene and expression structure via large vision-language alignment models. Contrastive fine-tuning on relational datasets injects explicit compositionality (Han et al., 2023).
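
The structural-matching core of such pipelines can be sketched as scoring scene triplets against the expression triplet; all names and the similarity callable below are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    subject: str
    predicate: str
    obj: str

def triplet_match(expr_triplet, scene_triplets, sim):
    """
    Zero-shot structural-matching sketch: score each scene triplet (from a
    scene parser) against the expression triplet (from an LLM) with a
    text-similarity function `sim`, returning the best-aligned one.
    """
    def score(s):
        return (sim(expr_triplet.subject, s.subject)
                + sim(expr_triplet.predicate, s.predicate)
                + sim(expr_triplet.obj, s.obj)) / 3
    return max(scene_triplets, key=score)
```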

3. Datasets and Evaluation Protocols

The REC research community has invested heavily in constructing increasingly challenging and diagnostic datasets:

| Dataset | Images / Instances | Expressions | Expression Type | Notable Features |
|---|---|---|---|---|
| ReferItGame | 19,894 / 96,654 | 130,525 | Natural (short, simple) | 1 per class/image |
| RefCOCO | 3,000 / ~7,600 | 21,586+ | Natural (short, interactive) | TestA: people, TestB: other objects |
| RefCOCO+ | 3,000 / ~7,600 | 21,373+ | Appearance only | No location words |
| RefCOCOg | 3,900 / ~7,600 | 14,498+ | Long, complex | No interaction |
| CLEVR-Ref+ | 70,000 (synthetic) | 700,000 | Programmatically generated | Compositional programs, bias-controlled |
| Cops-Ref | 75,299 | 148,712 | Deep compositional | Logical forms, cross-image negatives |
| FineCops-Ref | ~35,000 | 191,852+ | 3-level compositionality | Text/image negatives, LLM-edited, graded difficulty |
| SOREC | 24,828 (small objects) | 100,000 | Long, small-object-focused | <1% area boxes |
| KB-Ref | 16,917 / 41,930 | 43,284 | Commonsense/affordance-based | Requires external facts |
| Ref-L4 | 9,735 / 18,653 | 45,341 | Variable length, diverse | Cleaned, 365 categories |
| RefEgo | 12,038 videos | 24,000 | Egocentric, spatio-temporal | Tracklet-level, absent cases |
| GeoRef | 392 diagrams | 3,776 | Geometric, formal language | Points, regions, relations |

Evaluation protocols have shifted from single-image IoU-based accuracy (IoU>0.5) to more granular metrics: multi-threshold accuracy (e.g., Acc₀.₅, Acc₀.₇₅, Acc₀.₉, mAcc), negative rejection rates (Recall@1), segmentation IoU for mask outputs, absent-object frame handling in video, and compositional reasoning breakdowns by distractor or logical form (Chen et al., 24 Jun 2024, Liu et al., 23 Sep 2024, Liu et al., 25 Sep 2025, Kurita et al., 2023).
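
These multi-threshold protocols are straightforward to compute from per-query IoUs; a sketch follows (the threshold set varies by paper, and mAcc definitions differ across benchmarks):

```python
import numpy as np

def multi_threshold_accuracy(ious, thresholds=(0.5, 0.75, 0.9)):
    """
    Granular REC evaluation sketch: per-threshold accuracy plus their mean
    (mAcc), given per-query IoUs between predicted and ground-truth boxes.
    """
    ious = np.asarray(ious)
    per_t = {f"Acc@{t}": float((ious > t).mean()) for t in thresholds}
    per_t["mAcc"] = float(np.mean(list(per_t.values())))
    return per_t

# e.g. multi_threshold_accuracy([0.91, 0.62, 0.40]) ->
# {'Acc@0.5': 0.667, 'Acc@0.75': 0.333, 'Acc@0.9': 0.333, 'mAcc': 0.444}
```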

Dataset construction has also innovated along lines of bias control (CLEVR-Ref+), compositional/logical diversity (Cops-Ref, FineCops-Ref), hard negative mining, and inclusion of external knowledge (KB-Ref, GeoRef) (Liu et al., 2019, Chen et al., 2020, Yang et al., 27 Feb 2025, Wang et al., 2020).

4. Challenges: Compositionality, Negative Rejection, and Benchmark Integrity

Despite significant progress, several persistent challenges remain central to REC research:

Compositionality and Multi-Hop Reasoning: Many benchmarks reveal that state-of-the-art models (including MLLMs) still experience sharp accuracy drops as linguistic complexity increases (e.g., from L1 to L3 in FineCops-Ref: up to 30–40 pp) (Liu et al., 23 Sep 2024, Yang et al., 27 Feb 2025). Relational cues, attribute conjunctions, and logical forms such as AND, OR, SAME, ORDER require multi-hop or program-like reasoning; most generic architectures are brittle under these conditions (Chen et al., 2020, Liu et al., 2019).

Negative Query Handling: Traditional datasets lack negatives (expressions with no correct referent). FineCops-Ref, Cops-Ref, and video-based datasets with absent frames drive models to reject inapplicable queries. Even the best models have Recall@1 for negatives near chance (~50%) (Liu et al., 23 Sep 2024, Yang et al., 27 Feb 2025).
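
Mechanically, rejection is often implemented as calibrated score thresholding; a sketch with a hypothetical threshold value, not one taken from any cited paper:

```python
import torch

def ground_or_abstain(scores, boxes, reject_threshold=0.3):
    """
    Abstention sketch for negative queries: if no candidate matches the
    expression confidently, return None instead of forcing a box.
    """
    probs = torch.sigmoid(scores)   # per-region match probabilities
    best = torch.argmax(probs)
    if probs[best] < reject_threshold:
        return None                 # "no such object" -- treat query as negative
    return boxes[best]
```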

Dataset Bias and Label Quality: Manual audits of standard REC sets (RefCOCO, etc.) indicate high annotation noise (14–24% error rates), inflating or hiding true model capabilities. Cleaning these splits can raise measured accuracy by 1.5–3 pp (Chen et al., 24 Jun 2024). Synthetic datasets like CLEVR-Ref+ and Cops-Ref reduce such bias, but domain transfer remains limited (Liu et al., 2019, Chen et al., 2020).

Small Object Grounding: The SOREC dataset exposes failures for small object localization, for which even fine-tuned MLLMs often default to zero/irrelevant regions unless equipped with zooming policies or efficient search heads (Goto et al., 4 Oct 2025).

Commonsense and Knowledge Integration: External KB grounding remains a low-scoring regime: ECIFA attained 59% vs. human 90% on KB-Ref (Wang et al., 2020); CK-Transformer improves to 66%, but a 24–30 pp gap persists (Zhang et al., 2023).

Video and Temporal Reasoning: While models like ConFormer advance region-phrase alignment and temporal consistency, egocentric or unconstrained video scenarios expose weaknesses (tracking failures, absent object drift, 40+ pp gap to human) (Jiang et al., 2023, Kurita et al., 2023).

5. Specialized Solutions, Interpretability, and MLLM-Driven Advances

Recent years have produced targeted solutions for these challenges:

Reasoning-Aligned Architectures: Dynamic graph attention networks and neural module networks (e.g., IEP-Ref), as well as Transformers equipped with episodic memory or compositional modules, enable interpretable, stepwise grounding aligned with the linguistic decomposition of expressions. These models expose intermediate reasoning traces, allowing explicit visualization and debugging (Yang et al., 2019, Liu et al., 2019).

Hybrid Architectures and Specialist-MLLM Collaboration: The slow-fast adaptation (SFA) and candidate region selection (CRS) paradigms pair specialists (e.g., open-vocabulary detectors) with MLLMs for efficient and accurate compositional grounding. Simple cases are routed to lightweight models and complex ones to MLLMs, while multiple-choice prompts reduce generation time and improve precision (Yang et al., 27 Feb 2025).
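
The routing logic can be sketched as follows; the callables, the length heuristic, and `top_k` are illustrative placeholders rather than the published SFA/CRS procedure:

```python
def ground_with_collaboration(expression, image, specialist, mllm, max_simple_len=6):
    """
    Specialist-MLLM routing sketch (all callables hypothetical): simple
    expressions go to a fast open-vocabulary detector; complex ones are handed
    to an MLLM as a multiple-choice prompt over specialist candidate regions.
    """
    if len(expression.split()) <= max_simple_len:
        return specialist.detect(image, expression)       # fast path
    candidates = specialist.propose(image, expression, top_k=5)
    choice = mllm.select(image, expression, candidates)   # multi-choice prompt
    return candidates[choice]
```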

Parameter-Efficient Tuning for Hard Cases: Small object zoom adapters (PIZA) enable standard models or LLMs to handle ultra-fine localization through autoregressive cropping and process coding, achieving superior results with modest additional parameters (Goto et al., 4 Oct 2025).

Zero-Shot and Relationally-Informed REC: Pipeline approaches parse both expression and scene into relational triplets, matching them structurally and fine-tuning VLA models with triplet-aware contrastive objectives, yielding up to 19 pp gains in zero-shot accuracy over prior models (Han et al., 2023).

Geo-Visual and Mathematical Reasoning: Synthetic supervision pipelines (e.g., Penrose) for diagrams, combined with RL-based group-policy optimization, facilitate grounding in geometry; upstream REC accuracy directly predicts downstream problem-solving performance (Liu et al., 25 Sep 2025).

Benchmark Construction and Interpretation: Ongoing work refines annotation, generates negative and hard/hard-negative testbeds, designs controlled splits (CLEVR-Ref+, Cops-Ref, FineCops-Ref, GeoRef), and advocates multi-threshold, per-category, per-difficulty evaluation as standard (Chen et al., 24 Jun 2024, Liu et al., 23 Sep 2024, Yang et al., 27 Feb 2025, Liu et al., 2019).

6. Outlook and Future Directions

The current arc of REC research suggests several converging directions:

  1. Hybrid Structured & Pretrained Reasoning: Fusing large-scale vision-language representations with graph/module-level symbolic reasoning remains the most promising strategy for robust, interpretable, and generalizable REC (Qiao et al., 2020, Liu et al., 2019, Yang et al., 2019).
  2. Challenging Benchmark Evolution: Benchmarks must scale in compositional depth, negative reasoning, open-domain coverage, and diversity (object scale, class rarity, expression complexity) (Liu et al., 23 Sep 2024, Yang et al., 27 Feb 2025, Goto et al., 4 Oct 2025). Real-world and domain-transfer scenarios are critical.
  3. Explicit Negative and Abstention Modeling: Methods must progress beyond unconditional bounding box outputs, robustly handling “no object,” absence, or ambiguity, under both image and video settings (Liu et al., 23 Sep 2024, Yang et al., 27 Feb 2025, Kurita et al., 2023).
  4. Scene Graph and Reasoning Trace Integration: Injection of scene-graph encoders, chain-of-thought prompting, and intermediate reasoning trace supervision are prescribed for improved multi-hop compositional REC (Liu et al., 23 Sep 2024, Liu et al., 2019).
  5. External Knowledge and Commonsense Grounding: REC tasks demanding affordance, functionality, or background knowledge will necessitate richer knowledge bases, retrieval-augmentation, and cross-modal KB-grounding architectures (Wang et al., 2020, Zhang et al., 2023).
  6. Multimodal LLM Instruction Tuning and Efficiency: The next phase will focus on efficient, instruction-tuned MLLMs capable of both few-shot transfer and collaborative specialist coordination, as well as low-shot adaptation for emerging domains such as geometry and science (Yang et al., 27 Feb 2025, Liu et al., 25 Sep 2025).

In summary, REC serves as both a canonical challenge and a proving ground for multimodal AI, synthesizing advances in language understanding, visual reasoning, and their integration under increasingly rigorous and nuanced criteria. The field is transitioning from shallow matching of simple cues toward structured, interpretable, and scalable cross-modal grounding with robust compositional and commonsense capabilities (Qiao et al., 2020, Liu et al., 23 Sep 2024, Goto et al., 4 Oct 2025, Zhang et al., 2023, Han et al., 2023, Yang et al., 27 Feb 2025, Chen et al., 24 Jun 2024, Liu et al., 2019, Chen et al., 2020).
