Dialogue-Based GREC in Multi-Turn Scenes
- Dialogue-Based GREC is a framework that extends classic referring expression comprehension to multi-turn dialogues, enabling disambiguation and coreference resolution.
- It leverages multi-tier synthetic data and advanced vision-language models to handle context-sensitive grounding of ambiguous or co-referential expressions.
- Empirical results show that integrating dialogue context and clarificational exchanges significantly improves precision and referent detection performance.
Dialogue-Based Generalized Referring Expressions Comprehension (GREC) encompasses the set of methods and benchmarks where a system receives a dialogue containing one or more referring expressions and must precisely ground each expression—possibly ambiguous, co-referential, or relating to multiple targets—onto a set of visual entities in a scene. The objective extends classic referring expression comprehension beyond one-shot, single-turn utterances to handle multi-turn, potentially open-ended conversations where expressions may refer to any number of objects, require disambiguation, or depend on context or clarifications. Contemporary research focuses on both the foundational definitions underpinning dialogue-based GREC and the practical challenges of dataset construction, modeling, evaluation metrics, and cross-domain generalization.
1. Formal Task Definition in Dialogue-Based GREC
Dialogue-Based GREC generalizes classic REC by introducing multi-turn conversational context, potentially resolving unlimited targets, ambiguity, and coreference. Formally, given:
- An input image $I$, typically containing candidate objects $O = \{o_1, \dots, o_N\}$, each associated with visual features $v_i = \phi(o_i)$ via a convolutional or transformer encoder $\phi$.
- A multi-turn dialogue context $D = (u_1, \dots, u_{t-1})$, where each $u_j$ is a free-form natural language utterance.
- At turn $t$, an utterance $u_t$ containing one or more referring expressions.
The model seeks a grounding function $g(I, D, u_t) = T \subseteq O$ which outputs the (possibly empty) set of target objects whose bounding boxes or segmentations correspond to the referents in $u_t$ as resolved in the context $D$ (Shao et al., 2 Dec 2025). The model may either score each candidate via $s_i = f(v_i, D, u_t)$ and select those exceeding a threshold $\tau$, or output $\hat{o} = \arg\max_i s_i$ for single-target cases. This formalism subsumes cases with zero, one, or multiple targets, and necessitates scalable set prediction.
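The thresholded set-prediction reading above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `score` stands in for a learned scorer $f(v_i, D, u_t)$, and the toy keyword scorer below is purely for demonstration.

```python
# Sketch of threshold-based set prediction: keep every candidate whose
# score against the dialogue context and current utterance exceeds tau.
from typing import Callable, List, Set

def ground(candidates: List[str],
           dialogue: List[str],
           utterance: str,
           score: Callable[[str, List[str], str], float],
           tau: float = 0.5) -> Set[str]:
    """Return the (possibly empty) set of candidates scoring above tau."""
    return {c for c in candidates if score(c, dialogue, utterance) >= tau}

# Toy usage with a trivial keyword scorer (illustration only):
toy_score = lambda obj, dlg, utt: 1.0 if obj in utt else 0.0
print(ground(["red block", "blue block"], [], "pick the red block", toy_score))
# → {'red block'}
```

Because the output is a set rather than a single argmax, the same interface covers the zero-, one-, and multi-target cases without modification.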
A distinctive property is the integration of contextual dialogue—including clarificational exchanges, corrections, and anaphoric terms ("that one", "the others")—which demands a memory-augmented or context-sensitive model capable of accumulating, updating, and refining entity sets over turns (He et al., 2023).
2. Dataset Construction Strategies
Annotated dialogue-grounding data is sparse; thus, synthetic data frameworks play a critical role in research progress. The three-tier data synthesis methodology (Shao et al., 2 Dec 2025) trades off controllability and realism as follows:
Tier 1: Template-Based Synthesis
- Exhaustively produces single-turn expressions covering color, position, and ordinal patterns via a finite grammar, paired with bounding-box metadata. For example, "the red column of blocks" or "the first blue block of the tall tower."
- Yields 19k samples under strict control.
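Tier-1 synthesis can be sketched as exhaustive enumeration over a finite slot grammar. The slot values and template below are illustrative placeholders, not the paper's exact grammar:

```python
# Minimal sketch of template-based (Tier 1) synthesis: enumerate a finite
# grammar over color/ordinal/shape slots, emitting slot-labeled expressions.
from itertools import product

COLORS = ["red", "blue", "green"]
ORDINALS = ["first", "second", "third"]
SHAPES = ["block", "column of blocks", "tower"]

def tier1_expressions():
    for color, ordinal, shape in product(COLORS, ORDINALS, SHAPES):
        yield f"the {ordinal} {color} {shape}", {
            "color": color, "ordinal": ordinal, "shape": shape,
        }

samples = list(tier1_expressions())
print(len(samples))    # 3 * 3 * 3 = 27 slot-labeled expressions
print(samples[0][0])   # → the first red block
```

In the real pipeline each generated expression would additionally be paired with bounding-box metadata for the matching scene object.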
Tier 2: Prompted LLM-Based Synthesis
- Introduces linguistic variability through controlled prompts to GPT-4, generating utterances and semantic slots in structured JSON. Each retained sample must satisfy parser checks for downstream interpretability.
Tier 3: Full Dialogue with Coreference
- Synthesizes multi-turn dialogues by chaining Tier 2 expressions, simulating natural clarification, correction, and coreference. Qwen2-VL is fine-tuned on external corpora (VisPro) using a LoRA adapter, generating contextually bound chains of intent and referent (Shao et al., 2 Dec 2025).
- Synthetic data is style-aligned with human-authored dialogues via a discriminator loss to minimize KL-divergence in textual style.
All tiers combine annotated scenes (e.g., Minecraft), visual crops, object IDs, and bounding boxes, supporting slot-labeled supervision for both language and grounding objectives.
3. Model Architectures and Algorithms
No fundamentally new architecture is proposed exclusively for GREC; rather, existing state-of-the-art vision-language grounding frameworks are adapted to dialogue-based set prediction (He et al., 2023, Shao et al., 2 Dec 2025, Chiyah-Garcia et al., 2023):
- Multi-Crop Network (MCN), Vision-Language Transformer (VLT), MDETR/Longformer, and Unified Next (UNINEXT): baseline REC models extended to output variable-sized candidate sets via a confidence threshold on detection scores.
- Vision encoder: ResNet-101, DarkNet-53, or Detectron2 region features.
- Text encoder: GRU, BERT/RoBERTa, or transformer-based; additionally, cross-attention modules or multimodal feature concatenation.
- Coreference and slot prediction: Multi-label cross-entropy is used for object identification, augmented by attribute-slot prediction (e.g., color, size, location) and margin-based latent disentanglement (Chiyah-Garcia et al., 2023).
Relational reasoning is implemented by embedding object-pair relationships (e.g., spatial adjacency), and multi-task objectives ensure attribute-specific disentanglement in latent space—particularly important for clarification and repair in dialogue.
No explicit set-prediction or multi-label Hungarian matching losses are introduced for dialogue in (He et al., 2023), but future extensions are suggested.
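The object-pair relational features mentioned above can be illustrated concretely. Box format and the specific features (center offsets, pairwise IoU) are assumptions for this sketch; a relational encoder would embed such features rather than consume them raw:

```python
# Illustrative spatial relation features for each ordered object pair,
# as input a relational encoder might embed. Boxes are (x1, y1, x2, y2).
from itertools import permutations

def iou(a, b):
    """Intersection-over-union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def pair_features(boxes):
    feats = {}
    for i, j in permutations(range(len(boxes)), 2):
        (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) = boxes[i], boxes[j]
        dx = (bx1 + bx2) / 2 - (ax1 + ax2) / 2   # horizontal center offset
        dy = (by1 + by2) / 2 - (ay1 + ay2) / 2   # vertical center offset
        feats[(i, j)] = (dx, dy, iou(boxes[i], boxes[j]))
    return feats
```

For example, `pair_features([(0, 0, 2, 2), (1, 1, 3, 3)])` yields an offset of (1, 1) and an overlap of 1/7 for the pair (0, 1), capturing the "above/left-of" relations that spatial-adjacency reasoning relies on.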
4. Evaluation Metrics and Protocols
Dialogue-based GREC inherits and extends precision metrics from classic REC:
- Precision@($F_1$=1, IoU≥0.5): For each sample, predicted boxes are matched to ground-truth via IoU ≥ 0.5, and $F_1$ is computed as $2\,\mathrm{TP}/(2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$. A sample is only correct if $F_1 = 1$, i.e., all referents found and no extras.
- No-Target Accuracy (N-acc): For cases where the correct referent set is empty, N-acc is the proportion of samples where no boxes are predicted (He et al., 2023).
- Object-level F1: Used for multi-label referent identification in each turn, especially critical for evaluating context-sensitive clarification, with $\Delta F_1$ measuring improvement before and after clarificational replies (Chiyah-Garcia et al., 2023).
- Style alignment loss: In synthetic data generation, a discriminator-based domain-adaptation loss quantifies textual realism conditional on human-vs.-synthetic dialogue style (Shao et al., 2 Dec 2025).
Average precision (COCO-style) is tracked for completeness but is noted to be less suitable for dialogue GREC due to its tolerance of low-confidence extras.
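The sample-level metric can be made concrete with a short sketch. Greedy IoU matching is a simplification of optimal assignment, and the convention that an empty prediction against an empty ground truth scores 1.0 mirrors the N-acc criterion:

```python
# Sketch of sample-level F1 under IoU matching: greedily pair each predicted
# box with its best unmatched ground-truth box; count a match as TP only if
# the pair's IoU clears the threshold. Precision@(F1=1) then counts the
# fraction of samples where this F1 equals exactly 1.
def f1_at_iou(preds, gts, thresh=0.5):
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    unmatched = list(gts)
    tp = 0
    for p in preds:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thresh:
            unmatched.remove(best)
            tp += 1
    fp, fn = len(preds) - tp, len(unmatched)
    # No-target convention: empty predictions vs. empty ground truth is correct.
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
```

Any spurious extra box immediately drops $F_1$ below 1, which is why this metric penalizes low-confidence extras that COCO-style average precision tolerates.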
5. Empirical Results and Analysis
Comparative performance across synthesis tiers, model architectures, and evaluation splits reveals several key findings:
| Training Data | Size | Mean F1 | Prec@($F_1$=1) |
|---|---|---|---|
| gRefCOCO (out-of-domain) | 209k | 19.1 | 13.5 |
| Tier 1 (Template) | 19k | 42.8 | 24.5 |
| Tier 2 (AI-Short) | 1k | 22.8 | 15.7 |
| Tier 3 (AI-Dialogue) | 1k | 24.8 | 11.2 |
Fine-tuning on in-domain template data yields a 23.7-point absolute F1 gain over the large, out-of-domain gRefCOCO set (42.8 vs. 19.1). At constant data size (1k), LLM-generated short expressions and full dialogues yield greater precision and F1 than pure templates. Notably, models perform better on "mention-only" scenarios than on full multi-turn coreferential dialogues, highlighting the difficulty of context-sensitive grounding (Shao et al., 2 Dec 2025).
Multi-modal, relational BART-based models excel at processing clarificational exchanges, achieving the largest improvements after ambiguous referents are explicitly repaired. Disentangled object-centric representations and explicit attribute-slot supervision are shown to be essential for dialogue GREC (Chiyah-Garcia et al., 2023). Simpler language-only models fail to exploit dialogue context and may worsen after clarification exchanges.
Ablation analysis combining synthetic tiers demonstrates that Tier 1 + Tier 2 maximizes F1, while Tier 3 can hurt performance due to style or domain mismatch unless alignment losses are deployed.
6. Disambiguation, Clarificational Exchanges, and Coreference Resolution
Dialogue-Based GREC critically depends on a system's ability to repair referential ambiguity through clarificational exchanges (CEs), formalized as triples of (user turn introducing ambiguity, system clarification request, user's disambiguating reply) (Chiyah-Garcia et al., 2023). Core steps include:
- Detection of referential ambiguity in the user's turn.
- Generation of a clarification request by the system.
- Contextual update after the disambiguating reply (quantified by $\Delta F_1$ and improvement percentages).
Relational models, especially those with attribute-slot and spatial relation modules, outperform pure language systems in exploiting clarificational repairs. Architectural constraints for high CE-processing capacity include multi-task slot losses and relational encoders. Disentanglement objectives—margin-based or multi-task slot constraints—produce embeddings most amenable to coreferential grounding across turns.
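The contextual-update step can be sketched as slot-based narrowing of the candidate set carried over from the ambiguous turn. The slot names and candidate representation below are illustrative assumptions, not the paper's data schema:

```python
# Minimal sketch of a CE contextual update: the disambiguating reply
# contributes attribute slots that filter the ambiguous candidate set.
def narrow_candidates(candidates, reply_slots):
    """Keep candidates consistent with every slot in the clarifying reply."""
    return [c for c in candidates
            if all(c.get(k) == v for k, v in reply_slots.items())]

ambiguous = [
    {"id": 1, "color": "red", "size": "small"},
    {"id": 2, "color": "red", "size": "large"},
]
# System asks "which red one?"; user replies "the large one".
resolved = narrow_candidates(ambiguous, {"size": "large"})
print([c["id"] for c in resolved])   # → [2]
```

A learned model performs this narrowing in embedding space rather than by symbolic filtering, which is precisely why attribute-disentangled representations help: each slot in the reply should move the prediction along an interpretable axis.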
7. Future Directions and Open Challenges
Current limitations include insufficient set-prediction objectives, lack of explicit mechanisms for no-object prediction, and difficulties with dialogue-style domain adaptation (He et al., 2023, Shao et al., 2 Dec 2025). Prospective research themes include:
- Reinforcement-triggered, targeted data synthesis to address model failure cases (active synthesis).
- Advanced adversarial domain alignment to close the gap between synthetic and human dialogue styles.
- Extension to open-vocabulary, dynamic visual environments, and meta-learning for cross-scene generalization.
- Benchmark construction for turn-based, dialogue-intensive GREC, including richer anaphora, coreference, and interactive grounding protocols.
- Integration of dialog managers and context modules for incremental entity set updates and interactive clarification handling.
Dialogue-based GREC is thus an active research area that unifies visually-grounded language understanding, robust referential grounding, and sequential, context-sensitive entity selection, offering substantial complexity distinct from classic REC.