Dialogue-Based GREC in Multi-Turn Scenes
- Dialogue-Based GREC is a framework that extends classic referring expression comprehension to multi-turn dialogues, enabling disambiguation and coreference resolution.
- It leverages multi-tier synthetic data and advanced vision-language models to handle context-sensitive grounding of ambiguous or co-referential expressions.
- Empirical results show that integrating dialogue context and clarificational exchanges significantly improves precision and referent detection performance.
Dialogue-Based Generalized Referring Expressions Comprehension (GREC) encompasses the set of methods and benchmarks where a system receives a dialogue containing one or more referring expressions and must precisely ground each expression—possibly ambiguous, co-referential, or relating to multiple targets—onto a set of visual entities in a scene. The objective extends classic referring expression comprehension beyond one-shot, single-turn utterances to handle multi-turn, potentially open-ended conversations where expressions may refer to any number of objects, require disambiguation, or depend on context or clarifications. Contemporary research focuses on both the foundational definitions underpinning dialogue-based GREC and the practical challenges of dataset construction, modeling, evaluation metrics, and cross-domain generalization.
1. Formal Task Definition in Dialogue-Based GREC
Dialogue-Based GREC generalizes classic REC by introducing multi-turn conversational context, potentially resolving unlimited targets, ambiguity, and coreference. Formally, given:
- An input image $I$, typically containing candidate objects $O = \{o_1, \dots, o_N\}$, each associated with visual features $v_i = \phi(o_i)$ via a convolutional or transformer encoder $\phi$.
- A multi-turn dialogue context $D = (u_1, \dots, u_{t-1})$, where each $u_j$ is a free-form natural language utterance.
- At turn $t$, an utterance $u_t$ containing one or more referring expressions.
The model seeks a grounding function $g(I, D, u_t) = T \subseteq O$ which outputs the (possibly empty) set of target objects whose bounding boxes or segmentations correspond to the referents in $u_t$ as resolved in the context $D$ (Shao et al., 2 Dec 2025). The model may either score each candidate via $s_i = f(v_i, D, u_t)$ and select those exceeding a threshold $\tau$, or output $\hat{o} = \arg\max_i s_i$ for single-target cases. This formalism subsumes cases with zero, one, or multiple targets, and necessitates scalable set prediction.
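The thresholded set-prediction reading above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `score` stands in for a learned scorer $f(v_i, D, u_t)$, and the toy keyword scorer below is purely for demonstration.

```python
# Sketch of threshold-based set prediction: keep every candidate whose
# score against the dialogue context and current utterance exceeds tau.
from typing import Callable, List, Set

def ground(candidates: List[str],
           dialogue: List[str],
           utterance: str,
           score: Callable[[str, List[str], str], float],
           tau: float = 0.5) -> Set[str]:
    """Return the (possibly empty) set of candidates scoring above tau."""
    return {c for c in candidates if score(c, dialogue, utterance) >= tau}

# Toy usage with a trivial keyword scorer (illustration only):
toy_score = lambda obj, dlg, utt: 1.0 if obj in utt else 0.0
print(ground(["red block", "blue block"], [], "pick the red block", toy_score))
# → {'red block'}
```

Because the output is a set rather than a single argmax, the same interface covers the zero-, one-, and multi-target cases without modification.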
A distinctive property is the integration of contextual dialogue—including clarificational exchanges, corrections, and anaphoric terms ("that one", "the others")—which demands a memory-augmented or context-sensitive model capable of accumulating, updating, and refining entity sets over turns (He et al., 2023).
2. Dataset Construction Strategies
Annotated dialogue-grounding data is sparse; thus, synthetic data frameworks play a critical role in research progress. The three-tier data synthesis methodology (Shao et al., 2 Dec 2025) trades off controllability and realism as follows:
Tier 1: Template-Based Synthesis
- Exhaustively produces single-turn expressions covering color, position, and ordinal patterns via a finite grammar, paired with bounding-box metadata. For example, "the red column of blocks" or "the first blue block of the tall tower."
- Yields 19k samples under strict control.
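Tier-1 synthesis can be sketched as exhaustive enumeration over a finite slot grammar. The slot values and template below are illustrative placeholders, not the paper's exact grammar:

```python
# Minimal sketch of template-based (Tier 1) synthesis: enumerate a finite
# grammar over color/ordinal/shape slots, emitting slot-labeled expressions.
from itertools import product

COLORS = ["red", "blue", "green"]
ORDINALS = ["first", "second", "third"]
SHAPES = ["block", "column of blocks", "tower"]

def tier1_expressions():
    for color, ordinal, shape in product(COLORS, ORDINALS, SHAPES):
        yield f"the {ordinal} {color} {shape}", {
            "color": color, "ordinal": ordinal, "shape": shape,
        }

samples = list(tier1_expressions())
print(len(samples))    # 3 * 3 * 3 = 27 slot-labeled expressions
print(samples[0][0])   # → the first red block
```

In the real pipeline each generated expression would additionally be paired with bounding-box metadata for the matching scene object.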
Tier 2: Prompted LLM-Based Synthesis
- Introduces linguistic variability through controlled prompts to GPT-4, generating utterances and semantic slots in structured JSON. Each retained sample must satisfy parser checks for downstream interpretability.
Tier 3: Full Dialogue with Coreference
- Synthesizes multi-turn dialogues by chaining Tier 2 expressions, simulating natural clarification, correction, and coreference. Qwen2-VL is fine-tuned on external corpora (VisPro) using a LoRA adapter, generating contextually bound chains of intent and referent (Shao et al., 2 Dec 2025).
- Synthetic data is style-aligned with human-authored dialogues via a discriminator loss to minimize KL-divergence in textual style.
All tiers combine annotated scenes (e.g., Minecraft), visual crops, object IDs, and bounding boxes, supporting slot-labeled supervision for both language and grounding objectives.
3. Model Architectures and Algorithms
No fundamentally new architecture is proposed exclusively for GREC; rather, existing state-of-the-art vision-language grounding frameworks are adapted to dialogue-based set prediction (He et al., 2023, Shao et al., 2 Dec 2025, Chiyah-Garcia et al., 2023):
- Multi-Crop Network (MCN), Vision-Language Transformer (VLT), MDETR/Longformer, and Unified Next (UNINEXT): baseline REC models extended to output variable-sized candidate sets via a confidence threshold on detection scores.
- Vision encoder: ResNet-101, DarkNet-53, or Detectron2 region features.
- Text encoder: GRU, BERT/RoBERTa, or transformer-based; additionally, cross-attention modules or multimodal feature concatenation.
- Coreference and slot prediction: Multi-label cross-entropy is used for object identification, augmented by attribute-slot prediction (e.g., color, size, location) and margin-based latent disentanglement (Chiyah-Garcia et al., 2023).
Relational reasoning is implemented by embedding object-pair relationships (e.g., spatial adjacency), and multi-task objectives ensure attribute-specific disentanglement in latent space—particularly important for clarification and repair in dialogue.
No explicit set-prediction or multi-label Hungarian matching losses are introduced for dialogue in (He et al., 2023), but future extensions are suggested.
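The object-pair relational features mentioned above can be illustrated concretely. Box format and the specific features (center offsets, pairwise IoU) are assumptions for this sketch; a relational encoder would embed such features rather than consume them raw:

```python
# Illustrative spatial relation features for each ordered object pair,
# as input a relational encoder might embed. Boxes are (x1, y1, x2, y2).
from itertools import permutations

def iou(a, b):
    """Intersection-over-union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def pair_features(boxes):
    feats = {}
    for i, j in permutations(range(len(boxes)), 2):
        (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) = boxes[i], boxes[j]
        dx = (bx1 + bx2) / 2 - (ax1 + ax2) / 2   # horizontal center offset
        dy = (by1 + by2) / 2 - (ay1 + ay2) / 2   # vertical center offset
        feats[(i, j)] = (dx, dy, iou(boxes[i], boxes[j]))
    return feats
```

For example, `pair_features([(0, 0, 2, 2), (1, 1, 3, 3)])` yields an offset of (1, 1) and an overlap of 1/7 for the pair (0, 1), capturing the "above/left-of" relations that spatial-adjacency reasoning relies on.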
4. Evaluation Metrics and Protocols
Dialogue-based GREC inherits and extends precision metrics from classic REC:
- Precision@($F_1$=1, IoU≥0.5): For each sample, predicted boxes are matched to ground-truth via IoU ≥ 0.5, and $F_1$ is computed as $2\,\mathrm{TP}/(2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$. A sample is only correct if $F_1 = 1$, i.e., all referents found and no extras.
- No-Target Accuracy (N-acc): For cases where the correct referent set is empty, N-acc is the proportion of samples where no boxes are predicted (He et al., 2023).
- Object-level F1: Used for multi-label referent identification in each turn, especially critical for evaluating context-sensitive clarification, with $\Delta F_1$ measuring improvement before and after clarificational replies (Chiyah-Garcia et al., 2023).
- Style alignment loss: In synthetic data generation, a discriminator-based domain-adaptation loss quantifies textual realism conditional on human-vs.-synthetic dialogue style (Shao et al., 2 Dec 2025).
Average precision (COCO-style) is tracked for completeness but is noted to be less suitable for dialogue GREC due to its tolerance of low-confidence extras.
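The sample-level metric can be made concrete with a short sketch. Greedy IoU matching is a simplification of optimal assignment, and the convention that an empty prediction against an empty ground truth scores 1.0 mirrors the N-acc criterion:

```python
# Sketch of sample-level F1 under IoU matching: greedily pair each predicted
# box with its best unmatched ground-truth box; count a match as TP only if
# the pair's IoU clears the threshold. Precision@(F1=1) then counts the
# fraction of samples where this F1 equals exactly 1.
def f1_at_iou(preds, gts, thresh=0.5):
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    unmatched = list(gts)
    tp = 0
    for p in preds:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thresh:
            unmatched.remove(best)
            tp += 1
    fp, fn = len(preds) - tp, len(unmatched)
    # No-target convention: empty predictions vs. empty ground truth is correct.
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
```

Any spurious extra box immediately drops $F_1$ below 1, which is why this metric penalizes low-confidence extras that COCO-style average precision tolerates.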
5. Empirical Results and Analysis
Comparative performance across synthesis tiers, model architectures, and evaluation splits reveals several key findings:
| Training Data | Size | Mean F1 | Prec@($F_1$=1) |
|---|---|---|---|
| gRefCOCO (out-of-domain) | 209k | 19.1 | 13.5 |
| Tier 1 (Template) | 19k | 42.8 | 24.5 |
| Tier 2 (AI-Short) | 1k | 22.8 | 15.7 |
| Tier 3 (AI-Dialogue) | 1k | 24.8 | 11.2 |
Fine-tuning on in-domain template data yields a 23.7-point absolute F1 gain over the large, out-of-domain gRefCOCO set (42.8 vs. 19.1). At constant data size (1k), LLM-generated short expressions and full dialogues yield greater precision and F1 than pure templates. Notably, models perform better on "mention-only" scenarios than on full multi-turn coreferential dialogues, highlighting the difficulty of context-sensitive grounding (Shao et al., 2 Dec 2025).
Multi-modal, relational BART-based models excel at processing clarificational exchanges, achieving the largest improvements after ambiguous referents are explicitly repaired. Disentangled object-centric representations and explicit attribute-slot supervision are shown to be essential for dialogue GREC (Chiyah-Garcia et al., 2023). Simpler language-only models fail to exploit dialogue context and may worsen after clarification exchanges.
Ablation analysis combining synthetic tiers demonstrates that Tier 1 + Tier 2 maximizes F1, while Tier 3 can hurt performance due to style or domain mismatch unless alignment losses are deployed.
6. Disambiguation, Clarificational Exchanges, and Coreference Resolution
Dialogue-Based GREC critically depends on a system's ability to repair referential ambiguity through clarificational exchanges (CEs), formalized as triples of (user turn introducing ambiguity, system clarification request, user's disambiguating reply) (Chiyah-Garcia et al., 2023). Core steps include:
- Detection of referential ambiguity in the user's turn.
- Generation of a clarification request by the system.
- Contextual update after the disambiguating reply (quantified by $\Delta F_1$ and improvement percentages).
Relational models, especially those with attribute-slot and spatial relation modules, outperform pure language systems in exploiting clarificational repairs. Architectural constraints for high CE-processing capacity include multi-task slot losses and relational encoders. Disentanglement objectives—margin-based or multi-task slot constraints—produce embeddings most amenable to coreferential grounding across turns.
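The contextual-update step can be sketched as slot-based narrowing of the candidate set carried over from the ambiguous turn. The slot names and candidate representation below are illustrative assumptions, not the paper's data schema:

```python
# Minimal sketch of a CE contextual update: the disambiguating reply
# contributes attribute slots that filter the ambiguous candidate set.
def narrow_candidates(candidates, reply_slots):
    """Keep candidates consistent with every slot in the clarifying reply."""
    return [c for c in candidates
            if all(c.get(k) == v for k, v in reply_slots.items())]

ambiguous = [
    {"id": 1, "color": "red", "size": "small"},
    {"id": 2, "color": "red", "size": "large"},
]
# System asks "which red one?"; user replies "the large one".
resolved = narrow_candidates(ambiguous, {"size": "large"})
print([c["id"] for c in resolved])   # → [2]
```

A learned model performs this narrowing in embedding space rather than by symbolic filtering, which is precisely why attribute-disentangled representations help: each slot in the reply should move the prediction along an interpretable axis.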
7. Future Directions and Open Challenges
Current limitations include insufficient set-prediction objectives, lack of explicit mechanisms for no-object prediction, and difficulties with dialogue-style domain adaptation (He et al., 2023, Shao et al., 2 Dec 2025). Prospective research themes include:
- Reinforcement-triggered, targeted data synthesis to address model failure cases (active synthesis).
- Advanced adversarial domain alignment to close the gap between synthetic and human dialogue styles.
- Extension to open-vocabulary, dynamic visual environments, and meta-learning for cross-scene generalization.
- Benchmark construction for turn-based, dialogue-intensive GREC, including richer anaphora, coreference, and interactive grounding protocols.
- Integration of dialog managers and context modules for incremental entity set updates and interactive clarification handling.
Dialogue-based GREC is thus an active research area that unifies visually-grounded language understanding, robust referential grounding, and sequential, context-sensitive entity selection, offering substantial complexity distinct from classic REC.