CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog (1903.03166v2)

Published 7 Mar 2019 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Visual Dialog is a multimodal task of answering a sequence of questions grounded in an image, using the conversation history as context. It entails challenges in vision, language, reasoning, and grounding. However, studying these subtasks in isolation on large, real datasets is infeasible as it requires prohibitively-expensive complete annotation of the 'state' of all images and dialogs. We develop CLEVR-Dialog, a large diagnostic dataset for studying multi-round reasoning in visual dialog. Specifically, we construct a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset. This combination results in a dataset where all aspects of the visual dialog are fully annotated. In total, CLEVR-Dialog contains 5 instances of 10-round dialogs for about 85k CLEVR images, totaling to 4.25M question-answer pairs. We use CLEVR-Dialog to benchmark performance of standard visual dialog models; in particular, on visual coreference resolution (as a function of the coreference distance). This is the first analysis of its kind for visual dialog models that was not possible without this dataset. We hope the findings from CLEVR-Dialog will help inform the development of future models for visual dialog. Our dataset and code are publicly available.

CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog

This paper introduces CLEVR-Dialog, a synthetic dataset designed to support the evaluation and improvement of visual dialog models by providing a structured platform for multi-round reasoning and visual coreference resolution. Visual dialog is the task of answering a sequence of questions grounded in an image, drawing on the conversation history as context. Because it is fully annotated and synthetically generated, CLEVR-Dialog can serve as a diagnostic tool for dissecting and analyzing specific aspects of visual dialog.

The CLEVR-Dialog dataset is built on the CLEVR image dataset, which provides exhaustive annotations in the form of scene graphs. Each image's scene graph records the attributes of every object, including color, shape, size, and material, along with the spatial relationships between objects. The authors use this foundation to generate dialog instances with a structured grammar, producing five 10-round dialogs for each of roughly 85k images, for a total of approximately 4.25 million question-answer pairs.
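
To make the generation process concrete, here is a minimal Python sketch of grammar-driven question generation grounded in a CLEVR-style scene graph. The scene-graph fields and the two templates ("count" and "seek attribute") are simplified illustrations, not the actual CLEVR-Dialog grammar, which covers many more question types and enforces dialog-level constraints.

```python
import random

# Minimal sketch of grammar-driven question generation from a CLEVR-style
# scene graph (hypothetical fields; the real grammar is much richer).
scene_graph = [
    {"id": 0, "color": "red",  "shape": "cube",     "size": "large", "material": "metal"},
    {"id": 1, "color": "blue", "shape": "sphere",   "size": "small", "material": "rubber"},
    {"id": 2, "color": "red",  "shape": "cylinder", "size": "small", "material": "rubber"},
]

def count_question(attr, value):
    """Ground a 'count' template in the scene graph and return (question, answer)."""
    answer = sum(1 for obj in scene_graph if obj[attr] == value)
    return f"How many {value} things are there?", str(answer)

def seek_question(obj, attr):
    """Ground a 'seek attribute' template for one object."""
    return f"What is the {attr} of the {obj['size']} {obj['color']} {obj['shape']}?", obj[attr]

random.seed(0)
dialog = [
    count_question("color", "red"),
    seek_question(random.choice(scene_graph), "material"),
]
for q, a in dialog:
    print(f"Q: {q}  A: {a}")
```

Because every question is instantiated directly from the scene graph, the ground-truth answer and the full "state" of the dialog come for free, which is what makes the dataset fully annotated by construction.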

A distinctive feature of the dataset is its emphasis on visual coreference resolution, a crucial subproblem of visual dialog in which the model must resolve references to objects introduced in earlier dialog rounds. CLEVR-Dialog is the first dataset to enable a comprehensive analysis of such resolution. Performance is benchmarked for several standard visual dialog architectures as well as more specialized models such as CorefNMN, which targets coreference in visual dialog through a modular network approach.
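
The paper's coreference-distance analysis can be illustrated with a short sketch: each coreferring question is binned by how many rounds separate it from the round that introduced its referent, and accuracy is reported per bin. The record format below is a hypothetical assumption for illustration, not the released dataset schema.

```python
from collections import defaultdict

# Hypothetical records: (round the question was asked, round its referent
# was introduced, whether the model answered correctly).
predictions = [
    (3, 2, True),   # coreference distance 1
    (5, 2, False),  # coreference distance 3
    (7, 6, True),   # coreference distance 1
    (9, 4, False),  # coreference distance 5
]

correct = defaultdict(int)
total = defaultdict(int)
for asked, introduced, is_correct in predictions:
    dist = asked - introduced          # rounds since the referent appeared
    total[dist] += 1
    correct[dist] += int(is_correct)

for dist in sorted(total):
    print(f"coreference distance {dist}: accuracy {correct[dist] / total[dist]:.2f}")
```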

The numerical results indicate that CorefNMN performs best on questions requiring coreference resolution, outperforming the other benchmarked models. Across models, however, average accuracy on coreference questions is substantially lower than on non-coreference questions, with a gap of approximately 30 percentage points even for state-of-the-art models.

The broader implications of CLEVR-Dialog for AI research are twofold. Practically, it provides a testbed for training and refining models on structured visual reasoning tasks, yielding insight into model behavior and limitations in visual dialog scenarios. Theoretically, it challenges and fosters the evolution of visual dialog systems by emphasizing multi-step reasoning and grounding, pushing the complexity of synthetically constructed datasets in directions that may eventually inform real-world applications.

Future research might expand the scope of the dialog grammar or integrate more sophisticated natural language processing techniques to address current limitations in resolving visual coreferences. The framework could also be extended beyond CLEVR to more varied visual environments, further stress-testing the reasoning capabilities of visual dialog models.

Authors (5)
  1. Satwik Kottur (19 papers)
  2. José M. F. Moura (118 papers)
  3. Devi Parikh (129 papers)
  4. Dhruv Batra (160 papers)
  5. Marcus Rohrbach (75 papers)
Citations (85)