Referring Expression Comprehension (REC)
- REC is a vision–language task that precisely grounds natural language descriptions to visual objects, serving as a basis for multimodal alignment and compositional reasoning.
- Models employ two-stage and one-stage pipelines—using CNNs, Transformers, and dynamic reasoning—to accurately map text to image regions.
- Evaluation relies on IoU metrics and benchmark datasets like RefCOCO to assess accuracy, robustness, and the handling of negative queries.
Referring Expression Comprehension (REC) is a canonical vision–language task requiring precise grounding of natural-language descriptions to visual objects or regions in images and videos. Formally, given an image $I$ and a referring expression $r$ (natural language), REC asks a model to output a bounding box $b$ that spatially localizes the object described by $r$. REC serves as a diagnostic for both multimodal alignment and compositional reasoning, with applications across search, robotics, dialog, and embodied AI.
1. Task Formulation and Objectives
REC formalizes the mapping
$$f: (I, r) \mapsto b,$$
where the predicted box $b$ is evaluated against a ground-truth box $b_{gt}$ via Intersection over Union:
$$\mathrm{IoU}(b, b_{gt}) = \frac{|b \cap b_{gt}|}{|b \cup b_{gt}|}.$$
Success is defined at standard thresholds, most commonly $\mathrm{IoU} \geq 0.5$.
The objective is to maximize the conditional probability of localizing the referent:
$$b^{*} = \arg\max_{b \in \mathcal{B}} P(b \mid I, r),$$
where $\mathcal{B}$ is the set of candidate regions or boxes. Notably, modern REC models may regress $b$ directly or rank among proposals; recent work has extended REC to multi-entity, relation-aware, and negative-query regimes (Liu et al., 23 Sep 2024, Hu et al., 22 Jul 2025).
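The IoU criterion and thresholded accuracy above can be computed directly; a minimal sketch for corner-format boxes (the `acc_at_iou` helper name is illustrative, not from any specific codebase):

```python
def iou(box_a, box_b):
    """Intersection over Union for (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def acc_at_iou(preds, gts, threshold=0.5):
    """Fraction of predicted boxes whose IoU with ground truth meets the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

At the standard $\mathrm{IoU} \geq 0.5$ threshold, a prediction counts as correct only if it overlaps the ground-truth box by at least half of their union.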
2. Model Architectures and Computational Principles
Two-Stage and One-Stage Pipelines
Traditional REC pipelines are two-stage: (1) region proposal generation (using class-agnostic or detector-based methods); and (2) proposal ranking or matching via multimodal fusion. Recent trends favor one-stage architectures, merging proposal and selection modules for efficiency (Zhou et al., 2019, Yang et al., 2020).
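The second stage of a two-stage pipeline reduces to ranking proposals against the expression. A minimal sketch, assuming proposal and expression embeddings have already been produced by arbitrary (unspecified) encoders:

```python
import numpy as np

def rank_proposals(region_feats, text_feat):
    """Rank precomputed region proposals by cosine similarity to the
    expression embedding; return (best index, per-proposal scores).
    region_feats: (N, D) array; text_feat: (D,) array."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    scores = r @ t                     # cosine similarity per proposal
    return int(np.argmax(scores)), scores
```

Real systems replace the cosine score with a learned multimodal fusion module, but the argmax-over-proposals structure is the same.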
Multimodal Fusion Strategies
REC models encode vision and language independently, then fuse via joint embedding, modular networks, or graph-based structures:
- Joint Embedding: CNN encodes regions; RNN or Transformer encodes text; embeddings are fused and scored (Qiao et al., 2020).
- Modular Networks: Sentences are decomposed into subject, location, relation modules, each matched to visual cues (Qiao et al., 2020, Wang et al., 2020, Hu et al., 22 Jul 2025).
- Graph/Scene Reasoning: Scene graphs or cross-modal transformers support multi-hop and relation reasoning, with explicit handling of context and inter-object associations (Liu et al., 23 Sep 2024, Hu et al., 22 Jul 2025).
Adaptive and Dynamic Reasoning
Recent approaches adapt network structure or reasoning depth to the complexity of the referring expression:
- Language Adaptive Dynamic Subnets: LADS generates expression-specific sub-networks for minimal, efficient reasoning (Su et al., 2023).
- Dynamic Multi-step Reasoning: One-stage models dynamically select the number of reasoning "hops" using RL and state-tracking (Zhang et al., 2022).
- Content-conditioned Queries: Video REC (ConFormer) synthesizes query vectors directly from region features to overcome taxonomy bottlenecks (Jiang et al., 2023, Cao et al., 2022).
3. Training Paradigms, Pretraining, and Evaluation Protocols
Supervised and Contrastive Training
REC models conventionally minimize a combination of box regression losses (e.g., $\ell_1$, GIoU) and cross-entropy or ranking objectives for region selection. Advanced variants use contrastive losses for mining cross-frame or cross-modal correspondences (Cao et al., 2022, Jiang et al., 2023), regularization/gating via mutual information (Su et al., 2023), or integrate knowledge via multi-modal fact retrieval and cross-attention (Zhang et al., 2023).
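As one concrete instance of a box regression objective, a GIoU loss for axis-aligned boxes can be sketched in plain Python (illustration only; production code would use a vectorized, differentiable implementation):

```python
def giou_loss(pred, gt):
    """GIoU loss, 1 - GIoU, for (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_g - inter
    iou = inter / union
    # smallest enclosing box C penalizes non-overlapping predictions
    cx1, cy1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    cx2, cy2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (area_c - union) / area_c
    return 1.0 - giou
```

Unlike plain IoU, GIoU provides a non-zero gradient signal even when the predicted and ground-truth boxes do not overlap.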
Pretraining strategies include:
- Text-conditioned region prediction (TRP), which directly aligns box prediction and mask distribution with referring texts (Zheng et al., 2022).
- Vision-conditioned masked language modeling, useful for joint REG/REC models (Zheng et al., 2022).
- Scene graph augmentations and negative-aware training for compositional and robust grounding (Liu et al., 23 Sep 2024, Yang et al., 27 Feb 2025).
Zero-Shot and Verification-based Inference
Recent work recasts REC as hypothesis verification, where a general-purpose VLM is prompted to answer an atomic (e.g., yes/no) question for each proposal individually. This substantially surpasses selection-based approaches, even outperforming REC-trained models in strict zero-shot regimes (Liu et al., 12 Sep 2025).
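The verification-based protocol can be sketched as follows; `ask_vlm` is a hypothetical callable standing in for a prompted VLM that returns a "yes" probability for one cropped proposal at a time:

```python
def verify_and_select(proposals, expression, ask_vlm, abstain_below=0.5):
    """Verification-based zero-shot REC: pose an atomic yes/no question
    per proposal, pick the most-verified one, and abstain (return None)
    when no proposal is verified. The 0.5 abstention threshold is an
    illustrative assumption, not a value from the cited work."""
    question = f"Does this region show: {expression}?"
    probs = [ask_vlm(crop, question) for crop in proposals]
    if max(probs) < abstain_below:
        return None
    return probs.index(max(probs))
```

Because each proposal is judged independently, this protocol also yields a natural abstention mechanism for negative queries, unlike argmax-over-proposals selection.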
Evaluation Metrics
Standard metrics:
- Accuracy@IoU: fraction of predictions with spatial overlap above a threshold $\tau$, most commonly $\tau = 0.5$.
- Mean accuracy (mAcc): accuracy averaged over multiple IoU thresholds.
- Recall@$k$ and AUROC: for negative/positive retrieval scenarios, measuring open-set grounding and anti-hallucination (Liu et al., 23 Sep 2024, Jin et al., 12 Aug 2025).
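Given per-sample IoUs, the mean-accuracy metric might be computed as follows (the threshold set is an assumption; benchmarks vary in which thresholds they average over):

```python
def mean_accuracy(ious, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Average Acc@IoU over several thresholds, given per-sample IoUs."""
    accs = [sum(i >= t for i in ious) / len(ious) for t in thresholds]
    return sum(accs) / len(accs)
```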
4. Benchmark Datasets and Data Collection Methodologies
Core Benchmarks
- RefCOCO, RefCOCO+, RefCOCOg: MSCOCO-based, covering short and long expressions and splits by referent type (Qiao et al., 2020, Chen et al., 24 Jun 2024). Notable recent work exposes substantial label noise in these datasets (up to 24% in RefCOCO+) and provides cleaned splits (Chen et al., 24 Jun 2024).
- ReferItGame, Flickr30K Entities: early phrase grounding datasets (Qiao et al., 2020).
- CLEVR-Ref+: synthetic, diagnostic for multi-hop reasoning (Qiao et al., 2020).
- Cops-Ref: expression logic and compositional variety (Qiao et al., 2020).
- FineCops-Ref: controllable difficulty (levels 1–3), compositional reasoning, negative text/image samples to test rejection (Liu et al., 23 Sep 2024, Yang et al., 27 Feb 2025).
- Ref-L4: large-scale, long expressions (avg 24.2 words), 365 categories, calibrated splits and cleaned protocol (Chen et al., 24 Jun 2024).
- RefDrone: aerial REC, multi-scale, multi-target, and no-target cases (Sun et al., 1 Feb 2025).
- SOREC: small objects in driving scenes, long expressions, with PIZA adapters (Goto et al., 4 Oct 2025).
- ReMeX: multi-entity, relation-aware REC (Hu et al., 22 Jul 2025).
- KnowDR-REC: knowledge-intensive, negative-aware, reasoning-coupled visual grounding (Jin et al., 12 Aug 2025).
- GeoRef: geometric diagrams with structured and synthetic supervision (Liu et al., 25 Sep 2025).
Dataset Properties
| Dataset | Domain | Object Categories | Avg. Expr. Len. (words) | Notable Features |
|---|---|---|---|---|
| RefCOCO(+/g) | MSCOCO | 80+ | 3–8 | Standard, known label noise |
| FineCops-Ref | COCO/GQA | 1,200+ | 17–19 | Difficulty, negatives |
| Ref-L4 | COCO+O365 | 365 | 24.2 | Large, long, cleaned splits |
| RefDrone | VisDrone | 10 | 9.0 | Multi-target, small-scale |
| SOREC | SODA-D | 20+ | 25.5 | Small objects, high-res |
5. Advanced Reasoning, Negative Samples, and Knowledge Integration
Compositional and Relational Reasoning
REC progress has moved from shallow cue matching to multi-hop, multi-entity, and relational models. For example, multi-hop scene graph traversal, entity span detection, and relation alignment are addressed in ReMeREC and FineCops-Ref (Liu et al., 23 Sep 2024, Hu et al., 22 Jul 2025). The Entity Inter-relationship Reasoner augments multi-entity localization by explicit affinity modeling and relation count prediction (Hu et al., 22 Jul 2025).
Negative-Aware Robustness
FineCops-Ref and KnowDR-REC systematically construct negative queries and images via controlled perturbation (REPLACE, SWAP). Negative-aware metrics such as Recall@1 and anti-hallucination accuracy reveal persistent failure to "refuse" absence queries (ErrorRate often >0.7) (Liu et al., 23 Sep 2024, Jin et al., 12 Aug 2025). MLLMs frequently hallucinate boxes despite subtle negative edits.
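An error rate of the kind reported above could be computed as follows, treating `None` as abstention; the output convention is a simplifying assumption, since real MLLMs emit boxes as text that must first be parsed:

```python
def negative_error_rate(predictions):
    """Fraction of negative (absent-referent) queries for which the model
    still emitted a box instead of abstaining. Each prediction is either
    a box tuple or None (abstain); lower is better."""
    return sum(p is not None for p in predictions) / len(predictions)
```

An ErrorRate above 0.7 means the model hallucinated a box for more than 70% of queries whose referent was absent.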
Knowledge Integration
Commonsense and external knowledge bases enable models to disambiguate referents beyond visual or spatial cues. CK-Transformer fuses region features with top-K retrieved facts per candidate via bi-modal similarity and cross-attention, yielding substantial improvements on KB-Ref (Zhang et al., 2023). KnowDR-REC demonstrates that knowledge-driven REC remains challenging, with interpretability and robustness tightly coupled to multimodal reasoning fidelity (Jin et al., 12 Aug 2025).
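The fact-retrieval step can be sketched as a top-K similarity lookup; the shared embedding space below is a hypothetical stand-in for the learned bi-modal similarity used in the cited work:

```python
import numpy as np

def topk_facts(region_feat, fact_feats, k=3):
    """Return indices of the k facts most cosine-similar to a candidate
    region embedding. region_feat: (D,); fact_feats: (M, D)."""
    r = region_feat / np.linalg.norm(region_feat)
    f = fact_feats / np.linalg.norm(fact_feats, axis=1, keepdims=True)
    sims = f @ r
    return np.argsort(-sims)[:k]
```

The retrieved facts would then be fused with the region features (e.g., via cross-attention) before scoring candidates against the expression.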
6. Efficiency, Adaptation, and Real-Time Systems
REC research has emphasized:
- Parameter-Efficient Fine-Tuning: Adapters (CoOp, LoRA, Adapter+) integrated with task-specific modules (e.g., PIZA for small objects), achieving full fine-tuning quality at 2–3 orders of magnitude lower parameter cost (Goto et al., 4 Oct 2025).
- Language Adaptive Subnetworks: LADS deploys dynamic, text-conditioned routing, activating minimal subnets and gating layers/filters per expression (Su et al., 2023).
- Collaborative Specialist-MLLM Pipelines: SFA (slow-fast adaptation) and CRS (candidate-region selection) balance between fast, low-level detectors and reasoning-rich MLLMs, yielding accuracy gains and FLOP efficiency (Yang et al., 27 Feb 2025).
- Real-Time Inference: One-stage models like RealGIN deliver near SoTA accuracy at ∼10× speedup over proposal-ranking frameworks (Zhou et al., 2019).
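A LoRA-style adapter of the sort listed above can be sketched as a frozen weight plus a trainable low-rank update; the NumPy layer below follows the common LoRA conventions (zero-initialized up-projection, $\alpha/r$ scaling), which are assumptions here rather than details from the cited papers:

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer augmented with a trainable low-rank update:
    effective weight = W + (alpha / r) * B @ A."""
    def __init__(self, weight, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                        # frozen base weight
        self.A = rng.normal(0, 0.01, (rank, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, rank))            # trainable up-projection, init 0
        self.scale = alpha / rank

    def __call__(self, x):
        # B starts at zero, so training begins from the base model's behavior
        return x @ (self.weight + self.scale * self.B @ self.A).T
```

Only `A` and `B` (rank × (d_in + d_out) parameters) are updated during fine-tuning, which is the source of the parameter savings.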
7. Open Problems, Trends, and Future Directions
Current frontiers include:
- Scaling to Long, Complex Expressions and Rare Categories: Large benchmarks such as Ref-L4 and FineCops-Ref expose saturation and robustness failures on small or complex objects and categories (Chen et al., 24 Jun 2024, Liu et al., 23 Sep 2024).
- Robust Negative Rejection and Abstention: Most systems overfit to positive detection; explicit abstention logic and uncertainty quantification are open demands (Liu et al., 23 Sep 2024, Jin et al., 12 Aug 2025).
- Multi-entity and Relational Grounding: Joint entity/disjoint relation extraction under ambiguous and free-form real-world prompts (Hu et al., 22 Jul 2025).
- Geometric Grounding and Reasoning: Extending REC to synthetic or mathematical diagrams requires hybrid symbolic–visual pipelines and RL-inspired policy optimization (Liu et al., 25 Sep 2025).
- Interpretability and Reasoning Traces: Aligning box prediction with explicit reasoning traces, knowledge graph walks, or human-readable multi-hop justifications (Jin et al., 12 Aug 2025).
A plausible implication is that REC evaluation will increasingly focus on hardness-controllable benchmarks, compositional and knowledge-rich reasoning, open-set robustness, and collaborative or adaptive model design. REC remains a critical lens on the capabilities and limitations of contemporary multimodal systems, serving both as a diagnostic challenge and as a practical bridge to downstream embodied and interactive tasks.