
Dialogue-Based GREC in Multi-Turn Scenes

Updated 9 December 2025
  • Dialogue-Based GREC is a framework that extends classic referring expression comprehension to multi-turn dialogues, enabling disambiguation and coreference resolution.
  • It leverages multi-tier synthetic data and advanced vision-language models to handle context-sensitive grounding of ambiguous or co-referential expressions.
  • Empirical results show that integrating dialogue context and clarificational exchanges significantly improves precision and referent detection performance.

Dialogue-Based Generalized Referring Expressions Comprehension (GREC) encompasses the set of methods and benchmarks where a system receives a dialogue containing one or more referring expressions and must precisely ground each expression—possibly ambiguous, co-referential, or relating to multiple targets—onto a set of visual entities in a scene. The objective extends classic referring expression comprehension beyond one-shot, single-turn utterances to handle multi-turn, potentially open-ended conversations where expressions may refer to any number of objects, require disambiguation, or depend on context or clarifications. Contemporary research focuses on both the foundational definitions underpinning dialogue-based GREC and the practical challenges of dataset construction, modeling, evaluation metrics, and cross-domain generalization.

1. Formal Task Definition in Dialogue-Based GREC

Dialogue-Based GREC generalizes classic REC by introducing multi-turn conversational context and by requiring resolution of an arbitrary number of targets, ambiguity, and coreference. Formally, given:

  • An input image $I$, typically containing $N$ candidate objects $O = \{o_1, \ldots, o_N\}$, each associated with visual features $x_o = \varphi(I, o)$ via a convolutional or transformer encoder $\varphi$.
  • A multi-turn dialogue context $D = \{u_1, \ldots, u_T\}$, where each $u_t$ is a free-form natural language utterance.
  • At turn $T_{\text{ref}} \leq T$, an utterance $r$ containing one or more referring expressions.

The model seeks a grounding function $f: (I, D, r) \rightarrow \hat{O} \subseteq O$ which outputs the (possibly empty) set of target objects $\hat{O}$ whose bounding boxes or segmentations correspond to the referents in $r$ as resolved in the context $D$ (Shao et al., 2 Dec 2025). The model may either score each candidate $o$ via $s(o \mid I, D, r)$ and select those exceeding a threshold $\tau$, or output $\arg\max_{o \in O} P(o \mid I, D, r)$ for single-target cases. This formalism subsumes cases with zero, one, or multiple targets, and necessitates scalable set prediction.
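
A minimal sketch of the thresholded set-prediction readout this formalism implies, assuming per-candidate confidences $s(o \mid I, D, r)$ have already been computed by some dialogue-conditioned scoring head (the function names and the default $\tau$ are illustrative, not from the cited papers):

```python
def ground_referents(scores: list[float], tau: float = 0.5) -> list[int]:
    """Return the (possibly empty) referent set O_hat as candidate indices:
    every object whose score s(o | I, D, r) meets the threshold tau."""
    return [i for i, s in enumerate(scores) if s >= tau]

def ground_single(scores: list[float]) -> int:
    """Single-target variant: argmax over P(o | I, D, r)."""
    return max(range(len(scores)), key=lambda i: scores[i])
```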

A distinctive property is the integration of contextual dialogue—including clarificational exchanges, corrections, and anaphoric terms ("that one", "the others")—which demands a memory-augmented or context-sensitive model capable of accumulating, updating, and refining entity sets over turns (He et al., 2023).
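
For intuition, a toy sketch of the turn-by-turn entity-set bookkeeping such a context-sensitive model must perform internally; the class and its anaphor handling are purely illustrative, not an architecture from the cited work:

```python
class DialogueGroundingState:
    """Toy running state: track the previous turn's referents so anaphora
    such as "that one" or "the others" can be resolved against them."""

    def __init__(self, all_object_ids: set[int]):
        self.all_objects = all_object_ids
        self.last_referents: set[int] = set()

    def update(self, predicted: set[int]) -> None:
        """Store the referent set grounded at the current turn."""
        self.last_referents = predicted

    def resolve_anaphor(self, anaphor: str) -> set[int]:
        """Resolve two common anaphors against the accumulated state."""
        if anaphor == "that one" and len(self.last_referents) == 1:
            return self.last_referents
        if anaphor == "the others":
            return self.all_objects - self.last_referents
        return set()  # unresolved: defer to the full grounding model
```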

2. Dataset Construction Strategies

Annotated dialogue-grounding data is sparse; thus, synthetic data frameworks play a critical role in research progress. The three-tier data synthesis methodology (Shao et al., 2 Dec 2025) trades off controllability and realism as follows:

Tier 1: Template-Based Synthesis

  • Exhaustively produces single-turn expressions covering color, position, and ordinal patterns via a finite grammar, paired with bounding-box metadata. For example, "the red column of blocks" or "the first blue block of the tall tower."
  • Yields $N_{\text{template}} \approx 19\text{k}$ samples under strict control (a minimal grammar sketch follows this list).
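
A minimal sketch of such a finite template grammar, assuming hypothetical slot vocabularies and template strings in the spirit of the examples above (the real Tier 1 grammar and its bounding-box pairing are not reproduced here):

```python
import itertools

# Hypothetical slot vocabularies and templates echoing the examples above.
COLORS = ["red", "blue", "green", "yellow"]
POSITIONS = ["left", "right", "front", "back"]
ORDINALS = ["first", "second", "third"]

TEMPLATES = [
    "the {color} column of blocks",
    "the {ordinal} {color} block of the tall tower",
    "the {color} block on the {position}",
]

def generate_tier1_expressions():
    """Exhaustively instantiate every template with every slot filling;
    in the real pipeline each expression is paired with bounding-box
    metadata for the scene it describes."""
    for tpl in TEMPLATES:
        colors = COLORS if "{color}" in tpl else [None]
        ordinals = ORDINALS if "{ordinal}" in tpl else [None]
        positions = POSITIONS if "{position}" in tpl else [None]
        for c, o, p in itertools.product(colors, ordinals, positions):
            yield tpl.format(color=c, ordinal=o, position=p)
```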

Tier 2: Prompted LLM-Based Synthesis

  • Introduces linguistic variability through controlled prompts to GPT-4, generating utterances and semantic slots in structured JSON. Each retained sample must satisfy parser checks for downstream interpretability (see the validation sketch below).
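
A sketch of what such a parser check might look like, assuming a hypothetical JSON schema with an `utterance` field and a `slots` object (the actual Tier 2 slot inventory is not specified here):

```python
import json

REQUIRED_SLOTS = {"color", "position", "ordinal"}  # hypothetical schema

def parse_and_validate(llm_output: str) -> dict | None:
    """Keep a Tier 2 sample only if the LLM's output parses as JSON and
    exposes the semantic slots needed downstream; return None otherwise."""
    try:
        sample = json.loads(llm_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(sample, dict) or "utterance" not in sample:
        return None
    slots = sample.get("slots", {})
    if not REQUIRED_SLOTS.issubset(slots):
        return None
    return sample
```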

Tier 3: Full Dialogue with Coreference

  • Synthesizes multi-turn dialogues by chaining Tier 2 expressions, simulating natural clarification, correction, and coreference. Qwen2-VL is fine-tuned on external corpora (VisPro) using a LoRA adapter, generating contextually bound chains of intent and referent (Shao et al., 2 Dec 2025).
  • Synthetic data is style-aligned with human-authored dialogues via a discriminator loss $L_{DA}$ to minimize KL-divergence in textual style (a sketch of one adversarial instantiation follows this list).
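
As one possible instantiation, a sketch of a conventional adversarial style-alignment objective; this binary cross-entropy formulation stands in for the paper's $L_{DA}$ (whose exact KL-based form is not reproduced here), and the discriminator interface and tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def style_alignment_losses(disc, synthetic_emb, human_emb):
    """Adversarial stand-in for a style-alignment loss: the discriminator
    learns to separate human (label 1) from synthetic (label 0) dialogue
    embeddings, while the generator is rewarded for fooling it.
    `disc` maps a (B, d) embedding batch to (B, 1) logits."""
    ones_h = torch.ones(human_emb.size(0), 1)
    zeros_s = torch.zeros(synthetic_emb.size(0), 1)
    ones_s = torch.ones(synthetic_emb.size(0), 1)

    # Discriminator objective: tell human-authored from synthetic style.
    d_loss = (F.binary_cross_entropy_with_logits(disc(human_emb), ones_h)
              + F.binary_cross_entropy_with_logits(disc(synthetic_emb), zeros_s))

    # Generator objective: push synthetic style toward the human region.
    # In a real loop, d_loss and g_loss update disjoint parameter sets.
    g_loss = F.binary_cross_entropy_with_logits(disc(synthetic_emb), ones_s)
    return d_loss, g_loss
```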

All tiers combine annotated scenes (e.g., Minecraft), visual crops, object IDs, and bounding boxes, supporting slot-labeled supervision for both language and grounding objectives.

3. Model Architectures and Algorithms

No fundamentally new architecture is proposed exclusively for GREC; rather, existing state-of-the-art vision-language grounding frameworks are adapted to dialogue-based set prediction (He et al., 2023, Shao et al., 2 Dec 2025, Chiyah-Garcia et al., 2023):

  • Multi-task Collaborative Network (MCN), Vision-Language Transformer (VLT), MDETR/Longformer, and UNINEXT: baseline REC models extended to output variable-sized candidate sets via a confidence threshold on detection scores.
  • Vision encoder: ResNet-101, DarkNet-53, or Detectron2 region features.
  • Text encoder: GRU, BERT/RoBERTa, or transformer-based; additionally, cross-attention modules or multimodal feature concatenation.
  • Coreference and slot prediction: Multi-label cross-entropy is used for object identification, augmented by attribute-slot prediction (e.g., color, size, location) and margin-based latent disentanglement (Chiyah-Garcia et al., 2023).

Relational reasoning is implemented by embedding object-pair relationships (e.g., spatial adjacency), and multi-task objectives ensure attribute-specific disentanglement in latent space—particularly important for clarification and repair in dialogue.
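
A sketch of the multi-label and attribute-slot objectives described above; the tensor shapes and the loss weighting are illustrative assumptions rather than any paper's exact configuration:

```python
import torch.nn.functional as F

def grec_multitask_loss(obj_logits, obj_targets, slot_logits, slot_targets,
                        slot_weight=0.5):
    """Multi-label referent identification plus attribute-slot prediction.

    obj_logits:   (B, N) per-candidate scores; any subset may be referents.
    obj_targets:  (B, N) binary referent labels.
    slot_logits:  (B, S, C) logits over C classes for S slots (color, ...).
    slot_targets: (B, S) integer class labels per slot.
    """
    obj_loss = F.binary_cross_entropy_with_logits(obj_logits, obj_targets)
    slot_loss = F.cross_entropy(slot_logits.flatten(0, 1),
                                slot_targets.flatten())
    return obj_loss + slot_weight * slot_loss
```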

No explicit set-prediction or multi-label Hungarian matching losses are introduced for dialogue in (He et al., 2023), but future extensions are suggested.

4. Evaluation Metrics and Protocols

Dialogue-based GREC inherits and extends precision metrics from classic REC:

  • Precision@($F_1 = 1$, IoU $\geq 0.5$): For each sample, predicted boxes are matched to ground truth at IoU $\geq 0.5$, and $F_1$ is computed as $2 \cdot TP/(2 \cdot TP + FP + FN)$. A sample is correct only if $F_1 = 1.0$: all referents found, no extras.
  • No-Target Accuracy (N-acc): For cases where the correct referent set is empty, N-acc is the proportion of samples where no boxes are predicted (He et al., 2023).
  • Object-level F1: Used for multi-label referent identification in each turn, especially critical for evaluating context-sensitive clarification, with $\Delta F_1$ measuring improvement before and after clarificational replies (Chiyah-Garcia et al., 2023).
  • Style alignment loss $L_{DA}$: In synthetic data generation, discriminator-based domain adaptation quantifies textual realism conditional on human-vs.-synthetic dialogue style (Shao et al., 2 Dec 2025).

Average precision (COCO-style) is tracked for completeness but is noted to be less suitable for dialogue GREC due to its tolerance of low-confidence extras.
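
A per-sample sketch of the strict matching criterion and the no-target case, assuming a generic `iou_fn` over box pairs (the greedy matching here is one reasonable reading of the protocol, not the benchmarks' reference implementation):

```python
def sample_correct(pred_boxes, gt_boxes, iou_fn, iou_thr=0.5):
    """Precision@(F1=1, IoU>=0.5) decision for one sample. Empty gt_boxes
    is the no-target case counted toward N-acc: correct iff no predictions."""
    if not gt_boxes:
        return len(pred_boxes) == 0
    unmatched = list(range(len(gt_boxes)))
    tp = 0
    for p in pred_boxes:
        # Greedily claim the first unmatched ground-truth box at IoU >= thr.
        hit = next((i for i in unmatched
                    if iou_fn(p, gt_boxes[i]) >= iou_thr), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    fp = len(pred_boxes) - tp
    fn = len(gt_boxes) - tp
    f1 = 2 * tp / (2 * tp + fp + fn)
    return f1 == 1.0
```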

5. Empirical Results and Analysis

Comparative performance across synthesis tiers, model architectures, and evaluation splits reveals several key findings:

| Training Data | Size | Mean F1 | Prec@($F_1=1$) |
| --- | --- | --- | --- |
| gRefCOCO (out-of-domain) | 209k | 19.1 | 13.5 |
| Tier 1 (Template) | 19k | 42.8 | 24.5 |
| Tier 2 (AI-Short) | 1k | 22.8 | 15.7 |
| Tier 3 (AI-Dialogue) | 1k | 24.8 | 11.2 |

Fine-tuning on in-domain template data yields a $+23.7$ absolute F1 gain over large, out-of-domain sets (gRefCOCO). At constant data size (1k), LLM-generated short expressions and full dialogues yield greater precision and F1 than pure templates. Notably, models perform better on "mention-only" scenarios than on full multi-turn coreferential dialogues, highlighting the difficulty of context-sensitive grounding (Shao et al., 2 Dec 2025).

Multi-modal, relational BART-based models excel at processing clarificational exchanges, achieving the largest $\Delta F_1$ improvements after ambiguous referents are explicitly repaired. Disentangled object-centric representations and explicit attribute-slot supervision are shown to be essential for dialogue GREC (Chiyah-Garcia et al., 2023). Simpler language-only models fail to exploit dialogue context and may worsen after clarification exchanges.

Ablation analysis combining synthetic tiers demonstrates that Tier 1 + Tier 2 maximizes F1, while Tier 3 can hurt performance due to style or domain mismatch unless alignment losses are deployed.

6. Disambiguation, Clarificational Exchanges, and Coreference Resolution

Dialogue-Based GREC critically depends on a system's ability to repair referential ambiguity through clarificational exchanges (CEs), formalized as triples $\langle U_b, CR_t, U_a \rangle$: a user turn introducing ambiguity, a system clarification request, and the user's disambiguating reply (Chiyah-Garcia et al., 2023). Core steps include:

  • Detection of referential ambiguity in $U_b$.
  • Generation of $CR_t$ by the system.
  • Contextual update after $U_a$ (quantified by $\Delta F_1$ and improvement percentages; see the sketch after this list).
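
A minimal sketch of how the CE benefit can be quantified at the object level, assuming per-turn referent sets given as integer object IDs (the function names are illustrative):

```python
def object_f1(pred: set[int], gold: set[int]) -> float:
    """Object-level F1 for one turn's multi-label referent prediction."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def delta_f1(pred_before: set[int], pred_after: set[int], gold: set[int]) -> float:
    """CE benefit: F1 after the disambiguating reply U_a minus F1 on the
    ambiguous turn U_b; positive when the model exploits the repair."""
    return object_f1(pred_after, gold) - object_f1(pred_before, gold)
```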

Relational models, especially those with attribute-slot and spatial relation modules, outperform pure language systems in exploiting clarificational repairs. Architectural constraints for high CE-processing capacity include multi-task slot losses and relational encoders. Disentanglement objectives—margin-based or multi-task slot constraints—produce embeddings most amenable to coreferential grounding across turns.

7. Future Directions and Open Challenges

Current limitations include insufficient set-prediction objectives, lack of explicit mechanisms for no-object prediction, and difficulties with dialogue-style domain adaptation (He et al., 2023, Shao et al., 2 Dec 2025). Prospective research themes include:

  • Reinforcement-triggered, targeted data synthesis to address model failure cases (active synthesis).
  • Advanced adversarial domain alignment to close the gap between synthetic and human dialogue styles.
  • Extension to open-vocabulary, dynamic visual environments, and meta-learning for cross-scene generalization.
  • Benchmark construction for turn-based, dialogue-intensive GREC, including richer anaphora, coreference, and interactive grounding protocols.
  • Integration of dialogue managers and context modules for incremental entity-set updates and interactive clarification handling.

Taken together, these directions position dialogue-based GREC as an active research area that unifies visually-grounded language understanding, robust referential grounding, and sequential, context-sensitive entity selection, with substantial complexity distinct from classic REC.
