
Referring Expression Comprehension (REC)

Updated 15 December 2025
  • REC is a vision–language task that precisely grounds natural language descriptions to visual objects, serving as a basis for multimodal alignment and compositional reasoning.
  • Models employ two-stage and one-stage pipelines—using CNNs, Transformers, and dynamic reasoning—to accurately map text to image regions.
  • Evaluation relies on IoU metrics and benchmark datasets like RefCOCO to assess accuracy, robustness, and the handling of negative queries.

Referring Expression Comprehension (REC) is a canonical vision–language task requiring precise grounding of natural-language descriptions to visual objects or regions in images and videos. Formally, given an image $I$ and a referring expression $T$ (natural language), REC asks a model to output a bounding box $B_{pred}$ that spatially localizes the object described by $T$. REC serves as a diagnostic for both multimodal alignment and compositional reasoning, with applications across search, robotics, dialog, and embodied AI.

1. Task Formulation and Objectives

REC formalizes the mapping

$B_{pred} = f(I, T)$

where $B_{pred}$ is evaluated against a ground-truth box $B_{gt}$ via Intersection over Union:

$\text{IoU}(B_{pred}, B_{gt}) = \dfrac{\text{Area}(B_{pred} \cap B_{gt})}{\text{Area}(B_{pred} \cup B_{gt})}$

Success is defined at standard thresholds, most commonly $\text{IoU} \geq 0.5$.
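The IoU criterion above is straightforward to compute for axis-aligned boxes. A minimal sketch (assuming `(x1, y1, x2, y2)` corner format, which is one common convention):

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero if boxes are disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_hit(pred, gt, threshold=0.5):
    """Success criterion at the standard IoU >= 0.5 operating point."""
    return iou(pred, gt) >= threshold
```

For example, two boxes of area 100 overlapping on half their width give IoU = 50 / 150 = 1/3, which counts as a miss at the 0.5 threshold.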

The objective is to maximize the conditional probability of localizing the referent:

$R^* = \arg\max_{R \in \mathcal{R}} \; p_\theta(R \mid I, T)$

where $\mathcal{R}$ is the set of candidate regions or boxes. Notably, modern REC models may regress $B_{pred}$ directly or rank among proposals; recent work has extended REC to multi-entity, relation-aware, and negative-query regimes (Liu et al., 23 Sep 2024, Hu et al., 22 Jul 2025).
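In the ranking formulation, the argmax over candidate regions reduces to scoring each region embedding against the expression embedding and normalizing. A minimal sketch (the cosine-similarity-plus-softmax scoring is an illustrative choice, not a specific model's):

```python
import numpy as np

def select_region(region_feats, text_feat):
    """Rank candidate regions against the expression embedding and return the
    index maximizing p_theta(R | I, T), modeled here as a softmax over
    cosine similarities (an illustrative scoring function)."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    scores = r @ t                                   # one score per candidate
    probs = np.exp(scores) / np.exp(scores).sum()    # softmax over R
    return int(np.argmax(probs)), probs
```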

2. Model Architectures and Computational Principles

Two-Stage and One-Stage Pipelines

Traditional REC pipelines are two-stage: (1) region proposal generation (using class-agnostic or detector-based methods); and (2) proposal ranking or matching via multimodal fusion. Recent trends favor one-stage architectures, merging proposal and selection modules for efficiency (Zhou et al., 2019, Yang et al., 2020).
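The two-stage control flow can be sketched as follows, with `propose` and `fuse_and_score` standing in for a region-proposal detector and a multimodal fusion module (both hypothetical placeholders, not a specific system's API):

```python
def two_stage_rec(image, expression, propose, fuse_and_score):
    """Two-stage REC: (1) generate class-agnostic candidate boxes, then
    (2) score each candidate against the expression via multimodal fusion
    and return the top-ranked box."""
    proposals = propose(image)                                    # stage 1
    scores = [fuse_and_score(image, box, expression) for box in proposals]
    best = max(range(len(proposals)), key=lambda i: scores[i])    # stage 2
    return proposals[best]
```

One-stage architectures collapse the two calls into a single network that regresses the box directly, avoiding the cost of scoring every proposal.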

Multimodal Fusion Strategies

REC models typically encode vision and language independently, then fuse the two modalities via joint embedding, modular networks, or graph-based structures.

Adaptive and Dynamic Reasoning

Recent approaches adapt network structure or reasoning depth to the complexity of the referring expression:

  • Language Adaptive Dynamic Subnets: LADS generates expression-specific sub-networks for minimal, efficient reasoning (Su et al., 2023).
  • Dynamic Multi-step Reasoning: One-stage models dynamically select the number of reasoning "hops" using RL and state-tracking (Zhang et al., 2022).
  • Content-conditioned Queries: Video REC (ConFormer) synthesizes query vectors directly from region features to overcome taxonomy bottlenecks (Jiang et al., 2023, Cao et al., 2022).

3. Training Paradigms, Pretraining, and Evaluation Protocols

Supervised and Contrastive Training

REC models conventionally minimize a combination of box regression loss (e.g. 1\ell_1, GIoU) and cross-entropy or ranking objectives for region selection. Advanced variants use contrastive losses for mining cross-frame or cross-modal correspondences (Cao et al., 2022, Jiang et al., 2023), regularization/gating via mutual information (Su et al., 2023), or integrate knowledge via multi-modal fact retrieval and cross-attention (Zhang et al., 2023).
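The box-regression component of these objectives can be sketched as a weighted sum of $\ell_1$ and GIoU losses; the weights below are illustrative, not taken from any particular paper:

```python
def giou_loss(pred, gt):
    """1 - GIoU for axis-aligned boxes in (x1, y1, x2, y2) format.
    GIoU = IoU - (area(C) - union) / area(C), where C is the smallest
    enclosing box; it penalizes non-overlapping predictions by distance."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_g - inter
    cx1, cy1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    cx2, cy2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = inter / union - (area_c - union) / area_c
    return 1.0 - giou

def rec_box_loss(pred, gt, lam_l1=1.0, lam_giou=1.0):
    """Combined regression objective: l1 on coordinates plus GIoU loss
    (weights lam_l1, lam_giou are illustrative)."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    return lam_l1 * l1 + lam_giou * giou_loss(pred, gt)
```

A perfect prediction gives zero loss; a shifted box is penalized both by coordinate error and by the shrinking overlap term.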

Large-scale vision–language pretraining is also widely used to initialize REC models before task-specific fine-tuning.

Zero-Shot and Verification-based Inference

Recent work frames REC as hypothesis verification, where a general-purpose VLM is prompted with an atomic yes/no question for each proposal rather than asked to select among them—substantially surpassing selection-based approaches and even outperforming REC-trained models in strict zero-shot regimes (Liu et al., 12 Sep 2025).
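The verification scheme can be sketched as follows; `vlm_yes_prob` is a hypothetical stand-in for a prompted VLM returning the probability that a cropped proposal matches the expression:

```python
def verify_rec(image, expression, proposals, vlm_yes_prob):
    """Hypothesis-verification REC: instead of asking the model to pick one
    box, pose an atomic binary question per proposal ("does this region
    match the expression?") and keep the proposal with the highest
    yes-probability."""
    probs = [vlm_yes_prob(image, box, expression) for box in proposals]
    best = max(range(len(proposals)), key=lambda i: probs[i])
    return proposals[best], probs[best]
```

Because each question is independent, the same scheme extends naturally to abstention: if no proposal clears a confidence threshold, the expression can be declared absent.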

Evaluation Metrics

Standard metrics:

  • Accuracy@IoU$\geq t$: fraction of predictions with spatial overlap above threshold $t$.
  • Mean accuracy ($\mathrm{mAcc}$): accuracy averaged over multiple IoU thresholds.
  • Recall@$k$ and AUROC: for negative/positive retrieval scenarios, measuring open-set grounding and anti-hallucination (Liu et al., 23 Sep 2024, Jin et al., 12 Aug 2025).
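The first two metrics reduce to simple aggregation over per-example IoU values. A minimal sketch (the threshold grid for $\mathrm{mAcc}$ is one common choice, not universal):

```python
def accuracy_at_iou(ious, t=0.5):
    """Accuracy@IoU>=t: fraction of predictions clearing the threshold."""
    return sum(x >= t for x in ious) / len(ious)

def mean_accuracy(ious, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """mAcc: Accuracy@IoU averaged over several thresholds (grid is illustrative)."""
    return sum(accuracy_at_iou(ious, t) for t in thresholds) / len(thresholds)
```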

4. Benchmark Datasets and Data Collection Methodologies

Core Benchmarks

Dataset Properties

| Dataset | Domain | Objects | Expr. Len. | Notable Features |
|---|---|---|---|---|
| RefCOCO(+/g) | MSCOCO | 80+ | 3–8 | Standard, known label noise |
| FineCops-Ref | COCO/GQA | 1,200+ | 17–19 | Difficulty, negatives |
| Ref-L4 | COCO+O365 | 365 | 24.2 | Large, long, cleaned splits |
| RefDrone | VisDrone | 10 | 9.0 | Multi-target, small-scale |
| SOREC | SODA-D | 20+ | 25.5 | Small objects, high-res |

5. Advanced Reasoning, Negative Samples, and Knowledge Integration

Compositional and Relational Reasoning

REC progress has moved from shallow cue matching to multi-hop, multi-entity, and relational models. For example, multi-hop scene graph traversal, entity span detection, and relation alignment are addressed in ReMeREC and FineCops-Ref (Liu et al., 23 Sep 2024, Hu et al., 22 Jul 2025). The Entity Inter-relationship Reasoner augments multi-entity localization by explicit affinity modeling and relation count prediction (Hu et al., 22 Jul 2025).

Negative-Aware Robustness

FineCops-Ref and KnowDR-REC systematically construct negative queries and images via controlled perturbation (REPLACE, SWAP). Negative-aware metrics such as Recall@1 and anti-hallucination accuracy reveal persistent failure to "refuse" absence queries (ErrorRate often >0.7) (Liu et al., 23 Sep 2024, Jin et al., 12 Aug 2025). MLLMs frequently hallucinate boxes despite subtle negative edits.
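The abstention behavior these benchmarks probe can be expressed as a thresholded decision on top of any scoring model; the threshold `tau` below is an illustrative operating point (e.g. one tuned against AUROC on a validation split):

```python
def rec_with_abstention(scores, tau=0.5):
    """Negative-aware REC decision: return the top-scoring candidate only if
    its confidence clears tau; otherwise abstain, i.e. declare that the
    expression has no referent in the image."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best] < tau:
        return None  # refuse: likely a negative query
    return best
```

Models without such an explicit rejection path are forced to emit a box for every query, which is exactly the hallucination mode the negative-aware metrics expose.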

Knowledge Integration

Commonsense and external knowledge bases enable models to disambiguate referents beyond visual or spatial cues. CK-Transformer fuses region features with top-K retrieved facts per candidate via bi-modal similarity and cross-attention, yielding substantial improvements on KB-Ref (Zhang et al., 2023). KnowDR-REC demonstrates that knowledge-driven REC remains challenging, with interpretability and robustness tightly coupled to multimodal reasoning fidelity (Jin et al., 12 Aug 2025).

6. Efficiency, Adaptation, and Real-Time Systems

REC research has emphasized:

  • Parameter-Efficient Fine-Tuning: Adapters (CoOp, LoRA, Adapter+) integrated with task-specific modules (e.g., PIZA for small objects), achieving full fine-tuning quality at 2–3 orders of magnitude lower parameter cost (Goto et al., 4 Oct 2025).
  • Language Adaptive Subnetworks: LADS deploys dynamic, text-conditioned routing, activating minimal subnets and gating layers/filters per expression (Su et al., 2023).
  • Collaborative Specialist-MLLM Pipelines: SFA (slow-fast adaptation) and CRS (candidate-region selection) balance between fast, low-level detectors and reasoning-rich MLLMs, yielding accuracy gains and FLOP efficiency (Yang et al., 27 Feb 2025).
  • Real-Time Inference: One-stage models like RealGIN deliver near SoTA accuracy at ∼10× speedup over proposal-ranking frameworks (Zhou et al., 2019).

7. Open Challenges and Frontiers

Current frontiers include:

  • Scaling to Long, Complex Expressions and Rare Categories: Large benchmarks such as Ref-L4 and FineCops-Ref expose saturation and robustness failures on small or complex objects and categories (Chen et al., 24 Jun 2024, Liu et al., 23 Sep 2024).
  • Robust Negative Rejection and Abstention: Most systems overfit to positive detection; explicit abstention logic and uncertainty quantification are open demands (Liu et al., 23 Sep 2024, Jin et al., 12 Aug 2025).
  • Multi-entity and Relational Grounding: Joint entity/disjoint relation extraction under ambiguous and free-form real-world prompts (Hu et al., 22 Jul 2025).
  • Geometric Grounding and Reasoning: Extending REC to synthetic or mathematical diagrams requires hybrid symbolic–visual pipelines and RL-inspired policy optimization (Liu et al., 25 Sep 2025).
  • Interpretability and Reasoning Traces: Aligning box prediction with explicit reasoning traces, knowledge graph walks, or human-readable multi-hop justifications (Jin et al., 12 Aug 2025).

A plausible implication is that REC evaluation will increasingly focus on hardness-controllable benchmarks, compositional and knowledge-rich reasoning, open-set robustness, and collaborative or adaptive model design. REC remains a critical lens on the capabilities and limitations of contemporary multimodal systems, serving both as a diagnostic challenge and as a practical bridge to downstream embodied and interactive tasks.

