Referring Reasoning Task: Cross-Modal Visual Logic

Updated 1 March 2026
  • Referring Reasoning Task is a cross-modal visual reasoning challenge that grounds natural language expressions in images with multi-hop logic and relational analysis.
  • It is evaluated using precise region overlap metrics like IoU and rejection accuracy to address challenges such as negation and absence.
  • Recent methods leverage structured graph models, neuro-symbolic pipelines, and transformer-based encoders to enhance interpretability and compositional reasoning.

A referring reasoning task is a form of cross-modal visual reasoning where a system must ground a natural-language referring expression—often containing compositional attributes and relational structure—in a visual context, typically yielding a resolved region (e.g., bounding box or segmentation mask) in the image or video. Unlike shallow retrieval-style tasks, referring reasoning tasks explicitly probe a model's ability to analyze multi-step logic, compositionality, spatial relationships, and, increasingly, negation or absence, enabling precise assessment of both perceptual and deductive aspects of grounding.

1. Task Definition and Evaluation Protocols

The prototypical referring reasoning task generalizes classical referring expression comprehension (REC), requiring not only identification of an object (b*) matching a natural-language query (r), but—in modern formulations—explicit multi-hop reasoning over attributes, relations, or temporal anchors (Chen et al., 2020, Tumu et al., 8 Nov 2025, Gao et al., 6 Dec 2025). The mapping is typically formalized as a function

f: (I, r) ↦ b*

where I is the image (or video), r the referring expression, and b* (or a set of boxes/masks) localizes the referent(s).

Performance is typically evaluated via region overlap criteria: a prediction counts as correct when its IoU with the ground-truth region exceeds a fixed threshold (commonly 0.5), with rejection accuracy reported separately for queries whose referent is absent from the scene.
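The evaluation protocol above can be sketched as follows. This is a minimal illustration, assuming (x1, y1, x2, y2) box coordinates and a 0.5 IoU threshold; the exact box format, threshold, and abstention convention vary by benchmark.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def evaluate(predictions, ground_truths, threshold=0.5):
    """Accuracy@threshold over positive queries plus rejection accuracy.

    Each element of `predictions` / `ground_truths` is a box or None,
    where None means "model abstains" / "no referent exists"."""
    hits, positives, rejects, negatives = 0, 0, 0, 0
    for pred, gt in zip(predictions, ground_truths):
        if gt is None:                  # no-target query
            negatives += 1
            rejects += pred is None     # correct abstention
        else:
            positives += 1
            hits += pred is not None and iou(pred, gt) >= threshold
    return {"acc": hits / max(positives, 1),
            "rejection_acc": rejects / max(negatives, 1)}
```

Scoring positive and no-target queries with separate counters, as above, is what allows benchmarks to diagnose over-grounding (hallucinating a box when no referent exists) independently of localization quality.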

2. Core Reasoning Dimensions and Problem Structure

Expressions in referring reasoning tasks exhibit compositionality, often mapped to a formal or semantic graph:

Table: Example logic forms and their mapping to reasoning complexity.

Logic Type     | Expression Example            | Reasoning Complexity
Attribute      | "the red apple"               | atomic/zero-hop
1-hop Relation | "the mug left of the teapot"  | single relation
Chain/Order    | "third from left"             | ordered/ordinal
Conjunction    | "left of X and near Y"        | multi-hop/chained
Negation       | "not on the table"            | Boolean/negated filter
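The logic types in the table can be made concrete by executing nested filters over a toy scene. The object schema, coordinates, and helper names below are illustrative assumptions, not drawn from any particular benchmark's annotation format.

```python
# Toy scene: each object has a category, an attribute, and an x position.
scene = [
    {"id": 0, "name": "apple",  "color": "red",   "x": 40},
    {"id": 1, "name": "apple",  "color": "green", "x": 80},
    {"id": 2, "name": "mug",    "color": "white", "x": 10},
    {"id": 3, "name": "teapot", "color": "white", "x": 60},
]

def find(name):                    # category filter
    return [o for o in scene if o["name"] == name]

def with_attr(objs, key, value):   # attribute filter
    return [o for o in objs if o[key] == value]

def left_of(objs, anchors):        # 1-hop spatial relation
    return [o for o in objs if any(o["x"] < a["x"] for a in anchors)]

def negate(objs, excluded):        # negation as Boolean complement
    ids = {o["id"] for o in excluded}
    return [o for o in objs if o["id"] not in ids]

# "the red apple" — atomic/zero-hop
red_apple = with_attr(find("apple"), "color", "red")

# "the mug left of the teapot" — single relation
mug = left_of(find("mug"), find("teapot"))

# "the apple not left of the teapot" — negated filter
not_left = negate(find("apple"), left_of(find("apple"), find("teapot")))
```

Note how negation composes as set complement over an intermediate result, which is precisely what makes it harder for models that score candidates independently against the full expression.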

3. Model Architectures and Reasoning Strategies

Referring reasoning models fall along a spectrum of architectural priors:

  • Structured Graph Models: Scene graph-based networks (e.g., SGMN) explicitly parse both expression and image into graph structures and implement module-based reasoning along the parsed graph, yielding interpretable intermediate steps (Yang et al., 2020). Dynamic graph attention networks and GGNNs enable multi-hop relation binding (Yang et al., 2019, Tumu et al., 8 Nov 2025).
  • Neuro-Symbolic Pipelines: Recent neuro-symbolic approaches leverage LLMs to generate structured programs (composed of FIND, PROPERTY, LOCATE, RELATION, etc.) that are then executed step-by-step in a symbolic or hybrid pipeline, with lightweight verifier modules for early rejection of inconsistent chains (Park et al., 19 Jan 2026). Verification at each operator prevents propagation of false positive detections when no referent exists.
  • Transformer-based Joint Encoders: Fully transformer-based approaches jointly encode visual and linguistic information, using cross-attention between multi-modal tokens to realize contextualized, fine-grained grounding and segmentation, sometimes with explicit multi-task heads for both REC and referring segmentation (RES) (Li et al., 2021, Liu et al., 22 Sep 2025).
  • Chain-of-Thought Generative Models: Recent LLM-based approaches (e.g., Rex-Thinker) recast grounding as a chain-of-thought (CoT) reasoning task: the model emits explicit decompositions and justified decisions for or against each candidate region, culminating in either an answer set or abstention when warranted (Jiang et al., 4 Jun 2025, Gao et al., 6 Dec 2025, Jiang et al., 6 Jan 2026).
  • Reinforcement Fine-Tuning (RFT) and Group Relative PPO: To robustify compositional generalization, RL objectives tailored for multi-step reasoning—using dynamic, group-wise, or IoU-sensitive rewards—are increasingly applied. This yields improved handling of long reasoning chains, negative cases, and small, hard objects (Gao et al., 6 Dec 2025, Liu et al., 25 Sep 2025, Jiang et al., 4 Jun 2025, Zhou et al., 4 Jun 2025).
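The neuro-symbolic pattern described above can be sketched as a step-by-step program executor with early rejection. The operator names FIND, PROPERTY, and RELATION follow the text; the detector output format, program encoding, and the specific rejection rule are assumptions for illustration, not the cited papers' implementations.

```python
def execute(program, detections):
    """Run a list of (op, arg) steps over detector outputs.

    Returns surviving candidates, or None to signal abstention
    (no referent) as soon as any step empties the candidate set."""
    candidates = detections
    for op, arg in program:
        if op == "FIND":
            candidates = [d for d in candidates if d["label"] == arg]
        elif op == "PROPERTY":
            candidates = [d for d in candidates if arg in d["attrs"]]
        elif op == "RELATION":
            rel, anchor_label = arg
            anchors = [d for d in detections if d["label"] == anchor_label]
            candidates = [d for d in candidates
                          if any(rel(d, a) for a in anchors)]
        if not candidates:          # verifier step: early rejection
            return None
    return candidates

detections = [
    {"label": "cup",  "attrs": {"blue"}, "x": 30},
    {"label": "cup",  "attrs": {"red"},  "x": 70},
    {"label": "book", "attrs": {"red"},  "x": 50},
]

# "the red cup right of the book"
program = [
    ("FIND", "cup"),
    ("PROPERTY", "red"),
    ("RELATION", (lambda d, a: d["x"] > a["x"], "book")),
]
result = execute(program, detections)
```

Checking the candidate set after every operator, rather than only at the end, is what lets such pipelines abstain cleanly on no-referent queries instead of propagating a false-positive detection through the remaining steps.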

4. Benchmark Datasets and Evaluation Paradigms

The last five years have seen the introduction of multiple large-scale datasets specifically constructed to measure multi-step, fine-grained, and negative/rejection-aware reasoning:

  • Cops-Ref: Engineered for controlled compositionality via scene graphs and six logic types, with strong distractor protocols to enforce reasoning (Chen et al., 2020).
  • Ref-Reasoning: Relies on real-scene graphs and compositional logic templates (up to 5 steps); supporting interpretable, intermediate, and final supervision (Yang et al., 2020).
  • RefBench-PRO: Decomposes REC into perception and multi-axis reasoning, including explicit reject/no-target queries (Gao et al., 6 Dec 2025).
  • FineCops-Ref: Fine-grained multi-level difficulty, hard negative images/expressions, and multi-hop expression paths for rigorous generalization and rejection measurement (Liu et al., 2024).
  • CLEVR-Ref+: Synthetic dataset allowing bias elimination, programmatic functional annotation, and direct evaluation of intermediate module outputs (Liu et al., 2019).
  • GeoRef, RefSpatial-Bench, and others: Extend referring reasoning to geometric diagrams, 3D/robotics settings, and video with both spatial and temporal reference resolution (Liu et al., 25 Sep 2025, Zhou et al., 4 Jun 2025, Zhou et al., 3 Sep 2025).
  • R2SM: Introduces intent-driven modal/amodal mask reasoning (modal vs. amodal segmentation selection) to test linguistic comprehension of occlusion and intent (Shih et al., 2 Jun 2025).
  • PixelQA, VideoRefer, R2-AVSBench: Extend grounded reasoning to pixel-level, multi-frame, or audio-visual contexts, often requiring joint segmentation and question answering (Liu et al., 22 Sep 2025, Zhou et al., 6 Aug 2025).

5. Empirical Findings and Model Limitations

Comprehensive studies consistently reveal uneven performance across relation categories, with directional relations remaining the hardest for both detector-based and graph-based models.

Table: Quantitative illustration - Category-wise accuracy on Cops-Ref (from (Tumu et al., 8 Nov 2025)).

Relation Category | GDINO (%) | MGA-Net (%) | Note
Topological       | 83.0      | 67.3        | highest
Absolute          | 82.1      | 80.7        |
Proximity         | 80.7      | 62.8        |
Directional       | 65.5      | 52.9        | lowest

6. Research Gaps and Future Directions

Despite this progress, major open challenges persist, particularly in compositional generalization and reliable rejection of absent referents.

Explicit chain-of-thought, neuro-symbolic, and modular attention architectures, coupled with task- and reward-specific RL fine-tuning and synthetic hard-negative supervision, are converging toward robust, generalizable, and interpretable referring reasoning systems across diverse modalities and environments.
