Referring Reasoning Task: Cross-Modal Visual Logic

Updated 1 March 2026
  • Referring Reasoning Task is a cross-modal visual reasoning challenge that grounds natural language expressions in images with multi-hop logic and relational analysis.
  • It is evaluated using precise region overlap metrics like IoU and rejection accuracy to address challenges such as negation and absence.
  • Recent methods leverage structured graph models, neuro-symbolic pipelines, and transformer-based encoders to enhance interpretability and compositional reasoning.

A referring reasoning task is a form of cross-modal visual reasoning where a system must ground a natural-language referring expression—often containing compositional attributes and relational structure—in a visual context, typically yielding a resolved region (e.g., bounding box or segmentation mask) in the image or video. Unlike shallow retrieval-style tasks, referring reasoning tasks explicitly probe a model's ability to analyze multi-step logic, compositionality, spatial relationships, and, increasingly, negation or absence, enabling precise assessment of both perceptual and deductive aspects of grounding.

1. Task Definition and Evaluation Protocols

The prototypical referring reasoning task generalizes classical referring expression comprehension (REC), requiring not only identification of an object (b*) matching a natural-language query (r), but—in modern formulations—explicit multi-hop reasoning over attributes, relations, or temporal anchors (Chen et al., 2020, Tumu et al., 8 Nov 2025, Gao et al., 6 Dec 2025). The mapping is typically formalized as a function

f: (I, r) ↦ b*

where I is the image (or video), r the referring expression, and b* (or a set of boxes/masks) localizes the referent(s).

Performance is typically evaluated via region overlap criteria: a prediction counts as correct when its IoU with the ground-truth region exceeds a fixed threshold (commonly 0.5), with rejection accuracy reported separately for queries whose referent is absent from the scene.
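The evaluation protocol above can be sketched as follows. This is a minimal illustration, assuming (x1, y1, x2, y2) box coordinates and a 0.5 IoU threshold; the exact box format, threshold, and abstention convention vary by benchmark.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def evaluate(predictions, ground_truths, threshold=0.5):
    """Accuracy@threshold over positive queries plus rejection accuracy.

    Each element of `predictions` / `ground_truths` is a box or None,
    where None means "model abstains" / "no referent exists"."""
    hits, positives, rejects, negatives = 0, 0, 0, 0
    for pred, gt in zip(predictions, ground_truths):
        if gt is None:                  # no-target query
            negatives += 1
            rejects += pred is None     # correct abstention
        else:
            positives += 1
            hits += pred is not None and iou(pred, gt) >= threshold
    return {"acc": hits / max(positives, 1),
            "rejection_acc": rejects / max(negatives, 1)}
```

Scoring positive and no-target queries with separate counters, as above, is what allows benchmarks to diagnose over-grounding (hallucinating a box when no referent exists) independently of localization quality.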

2. Core Reasoning Dimensions and Problem Structure

Expressions in referring reasoning tasks exhibit compositionality, often mapped to a formal or semantic graph:

Table: Example logic forms and their mapping to reasoning complexity.

Logic Type     | Expression Example            | Reasoning Complexity
Attribute      | "the red apple"               | atomic/zero-hop
1-hop Relation | "the mug left of the teapot"  | single relation
Chain/Order    | "third from left"             | ordered/ordinal
Conjunction    | "left of X and near Y"        | multi-hop/chained
Negation       | "not on the table"            | Boolean/negated filter
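The logic types in the table can be made concrete by executing nested filters over a toy scene. The object schema, coordinates, and helper names below are illustrative assumptions, not drawn from any particular benchmark's annotation format.

```python
# Toy scene: each object has a category, an attribute, and an x position.
scene = [
    {"id": 0, "name": "apple",  "color": "red",   "x": 40},
    {"id": 1, "name": "apple",  "color": "green", "x": 80},
    {"id": 2, "name": "mug",    "color": "white", "x": 10},
    {"id": 3, "name": "teapot", "color": "white", "x": 60},
]

def find(name):                    # category filter
    return [o for o in scene if o["name"] == name]

def with_attr(objs, key, value):   # attribute filter
    return [o for o in objs if o[key] == value]

def left_of(objs, anchors):        # 1-hop spatial relation
    return [o for o in objs if any(o["x"] < a["x"] for a in anchors)]

def negate(objs, excluded):        # negation as Boolean complement
    ids = {o["id"] for o in excluded}
    return [o for o in objs if o["id"] not in ids]

# "the red apple" — atomic/zero-hop
red_apple = with_attr(find("apple"), "color", "red")

# "the mug left of the teapot" — single relation
mug = left_of(find("mug"), find("teapot"))

# "the apple not left of the teapot" — negated filter
not_left = negate(find("apple"), left_of(find("apple"), find("teapot")))
```

Note how negation composes as set complement over an intermediate result, which is precisely what makes it harder for models that score candidates independently against the full expression.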

3. Model Architectures and Reasoning Strategies

Referring reasoning models fall along a spectrum of architectural priors:

  • Structured Graph Models: Scene graph-based networks (e.g., SGMN) explicitly parse both expression and image into graph structures and implement module-based reasoning along the parsed graph, yielding interpretable intermediate steps (Yang et al., 2020). Dynamic graph attention networks and GGNNs enable multi-hop relation binding (Yang et al., 2019, Tumu et al., 8 Nov 2025).
  • Neuro-Symbolic Pipelines: Recent neuro-symbolic approaches leverage LLMs to generate structured programs (composed of FIND, PROPERTY, LOCATE, RELATION, etc.) that are then executed step-by-step in a symbolic or hybrid pipeline, with lightweight verifier modules for early rejection of inconsistent chains (Park et al., 19 Jan 2026). Verification at each operator prevents propagation of false positive detections when no referent exists.
  • Transformer-based Joint Encoders: Fully transformer-based approaches jointly encode visual and linguistic information, using cross-attention between multi-modal tokens to realize contextualized, fine-grained grounding and segmentation, sometimes with explicit multi-task heads for both REC and referring segmentation (RES) (Li et al., 2021, Liu et al., 22 Sep 2025).
  • Chain-of-Thought Generative Models: Recent LLM-based approaches (e.g., Rex-Thinker) recast grounding as a chain-of-thought (CoT) reasoning task: the model emits explicit decompositions and justified decisions for or against each candidate region, culminating in either an answer set or abstention when warranted (Jiang et al., 4 Jun 2025, Gao et al., 6 Dec 2025, Jiang et al., 6 Jan 2026).
  • Reinforcement Fine-Tuning (RFT) and Group Relative PPO: To robustify compositional generalization, RL objectives tailored for multi-step reasoning—using dynamic, group-wise, or IoU-sensitive rewards—are increasingly applied. This yields improved handling of long reasoning chains, negative cases, and small, hard objects (Gao et al., 6 Dec 2025, Liu et al., 25 Sep 2025, Jiang et al., 4 Jun 2025, Zhou et al., 4 Jun 2025).
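The neuro-symbolic pattern described above can be sketched as a step-by-step program executor with early rejection. The operator names FIND, PROPERTY, and RELATION follow the text; the detector output format, program encoding, and the specific rejection rule are assumptions for illustration, not the cited papers' implementations.

```python
def execute(program, detections):
    """Run a list of (op, arg) steps over detector outputs.

    Returns surviving candidates, or None to signal abstention
    (no referent) as soon as any step empties the candidate set."""
    candidates = detections
    for op, arg in program:
        if op == "FIND":
            candidates = [d for d in candidates if d["label"] == arg]
        elif op == "PROPERTY":
            candidates = [d for d in candidates if arg in d["attrs"]]
        elif op == "RELATION":
            rel, anchor_label = arg
            anchors = [d for d in detections if d["label"] == anchor_label]
            candidates = [d for d in candidates
                          if any(rel(d, a) for a in anchors)]
        if not candidates:          # verifier step: early rejection
            return None
    return candidates

detections = [
    {"label": "cup",  "attrs": {"blue"}, "x": 30},
    {"label": "cup",  "attrs": {"red"},  "x": 70},
    {"label": "book", "attrs": {"red"},  "x": 50},
]

# "the red cup right of the book"
program = [
    ("FIND", "cup"),
    ("PROPERTY", "red"),
    ("RELATION", (lambda d, a: d["x"] > a["x"], "book")),
]
result = execute(program, detections)
```

Checking the candidate set after every operator, rather than only at the end, is what lets such pipelines abstain cleanly on no-referent queries instead of propagating a false-positive detection through the remaining steps.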

4. Benchmark Datasets and Evaluation Paradigms

The last five years have seen the introduction of multiple large-scale datasets specifically constructed to measure multi-step, fine-grained, and negative/rejection-aware reasoning:

  • Cops-Ref: Engineered for controlled compositionality via scene graphs and six logic types, with strong distractor protocols to enforce reasoning (Chen et al., 2020).
  • Ref-Reasoning: Relies on real-scene graphs and compositional logic templates (up to 5 steps); supporting interpretable, intermediate, and final supervision (Yang et al., 2020).
  • RefBench-PRO: Decomposes REC into perception and multi-axis reasoning, including explicit reject/no-target queries (Gao et al., 6 Dec 2025).
  • FineCops-Ref: Fine-grained multi-level difficulty, hard negative images/expressions, and multi-hop expression paths for rigorous generalization and rejection measurement (Liu et al., 2024).
  • CLEVR-Ref+: Synthetic dataset allowing bias elimination, programmatic functional annotation, and direct evaluation of intermediate module outputs (Liu et al., 2019).
  • GeoRef, RefSpatial-Bench, and others: Extend referring reasoning to geometric diagrams, 3D/robotics settings, and video with both spatial and temporal reference resolution (Liu et al., 25 Sep 2025, Zhou et al., 4 Jun 2025, Zhou et al., 3 Sep 2025).
  • R2SM: Introduces intent-driven modal/amodal mask reasoning (modal vs. amodal segmentation selection) to test linguistic comprehension of occlusion and intent (Shih et al., 2 Jun 2025).
  • PixelQA, VideoRefer, R2-AVSBench: Extend grounded reasoning to pixel-level, multi-frame, or audio-visual contexts, often requiring joint segmentation and question answering (Liu et al., 22 Sep 2025, Zhou et al., 6 Aug 2025).

5. Empirical Findings and Model Limitations

Comprehensive studies consistently reveal uneven performance across relation categories, with directional relations remaining the hardest for both detector-based and graph-based models.

Table: Quantitative illustration - Category-wise accuracy on Cops-Ref (from (Tumu et al., 8 Nov 2025)).

Relation Category | GDINO (%) | MGA-Net (%) | Note
Topological       | 83.0      | 67.3        | highest
Absolute          | 82.1      | 80.7        |
Proximity         | 80.7      | 62.8        |
Directional       | 65.5      | 52.9        | lowest

6. Research Gaps and Future Directions

Despite this progress, major open challenges persist, particularly in compositional generalization and reliable rejection of absent referents.

Explicit chain-of-thought, neuro-symbolic, and modular attention architectures, coupled with task- and reward-specific RL fine-tuning and synthetic hard-negative supervision, are converging toward robust, generalizable, and interpretable referring reasoning systems across diverse modalities and environments.
