CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog
This paper introduces CLEVR-Dialog, a synthetic dataset for evaluating and improving visual dialog models through multi-round reasoning and visual coreference resolution. Visual dialog is the task of answering a sequence of questions grounded in an image, using the conversation history as context. Because CLEVR-Dialog is fully annotated and synthetically generated, it serves as a diagnostic tool for isolating and analyzing specific aspects of visual dialog.
The CLEVR-Dialog dataset is built on the CLEVR image dataset, which provides exhaustive annotations in the form of scene graphs. Each image's scene graph records the attributes of every object (color, shape, size, and material) along with the spatial relationships between objects. The authors use this foundation to generate dialog instances with a structured grammar, yielding approximately 4.25 million question-answer pairs across 85k images.
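To make this pipeline concrete, the sketch below shows a scene-graph-style annotation and a single question template instantiated against it. The field names and template are illustrative assumptions, not the dataset's actual schema or grammar.

```python
# Hypothetical sketch of a CLEVR-style scene graph and template-based
# question generation. Field names ("objects", "relations", etc.) are
# illustrative, not the dataset's real schema.

scene_graph = {
    "objects": [
        {"id": 0, "color": "red", "shape": "cube", "size": "large", "material": "metal"},
        {"id": 1, "color": "blue", "shape": "sphere", "size": "small", "material": "rubber"},
    ],
    # Spatial relations: for each direction, which object ids lie that way
    # from a given object id (toy encoding for illustration only).
    "relations": {"left": {1: [0]}, "right": {0: [1]}},
}

def generate_count_question(scene, attribute, value):
    """Instantiate a simple count template and compute its answer
    directly from the scene graph."""
    count = sum(1 for obj in scene["objects"] if obj[attribute] == value)
    question = f"How many {value} objects are there?"
    return question, count

q, a = generate_count_question(scene_graph, "color", "red")
# q == "How many red objects are there?", a == 1
```

Because every answer is computed from the annotated scene graph rather than collected from humans, the generation process guarantees exact ground truth for every question, which is what makes the dataset diagnostic.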
A distinguishing feature of the dataset is its emphasis on visual coreference resolution, a central challenge in visual dialog in which a model must resolve references to previously mentioned objects across dialog rounds. The authors position CLEVR-Dialog as the first dataset to enable comprehensive analysis of such references. Performance is benchmarked with several standard visual dialog architectures as well as more specialized models such as CorefNMN, which specifically targets coreference in visual dialog through a modular network approach.
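The following toy resolver illustrates why coreference matters in multi-round dialog: a pronoun such as "it" must be linked back through the dialog history before the question can be grounded in the image. This naive last-mention heuristic is purely illustrative; it is not the paper's method, and it stands in for the learned modular attention of models like CorefNMN.

```python
# Toy illustration (not the paper's method) of coreference resolution
# against dialog history: "it" is resolved to the most recently
# mentioned object.

def resolve_reference(question, history):
    """Replace the word 'it' with the last object mentioned in history.

    `history` is a list of (question, mentioned_object) pairs; a real
    model would learn this mapping rather than apply a heuristic.
    """
    words = question.split()
    if "it" in words:
        for _, obj in reversed(history):
            if obj is not None:
                words = [f"the {obj}" if w == "it" else w for w in words]
                break
    return " ".join(words)

history = [("What color is the large cube?", "large cube")]
resolved = resolve_reference("Is it made of metal?", history)
# resolved == "Is the large cube made of metal?"
```

Even this trivial heuristic fails as soon as several candidate referents appear in the history, which is exactly the ambiguity the dataset is designed to probe.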
Numerical results indicate that CorefNMN excels at visual coreference, outperforming the other models on questions that require resolving coreferences. Even so, every model's accuracy on coreference questions is substantially lower than its accuracy on non-coreference questions, with a gap of approximately 30 percentage points even for state-of-the-art models.
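The gap reported above can be made precise by splitting accuracy by question type. The sketch below shows one way such a breakdown might be computed; the record format and the toy numbers are assumptions for illustration, not the paper's actual results.

```python
# Hedged sketch of per-question-type accuracy and the resulting
# coreference gap. Record fields and figures are placeholders,
# not the paper's reported numbers.

from collections import defaultdict

def accuracy_by_type(records):
    """records: dicts with keys 'needs_coref', 'pred', 'gold'.
    Returns accuracy per question category."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        key = "coref" if r["needs_coref"] else "non_coref"
        total[key] += 1
        correct[key] += int(r["pred"] == r["gold"])
    return {k: correct[k] / total[k] for k in total}

records = [
    {"needs_coref": True, "pred": "red", "gold": "blue"},
    {"needs_coref": True, "pred": "cube", "gold": "cube"},
    {"needs_coref": False, "pred": "2", "gold": "2"},
    {"needs_coref": False, "pred": "metal", "gold": "metal"},
]
accs = accuracy_by_type(records)
gap_pp = (accs["non_coref"] - accs["coref"]) * 100
# On this toy data: coref accuracy 0.5, non-coref accuracy 1.0,
# a gap of 50 percentage points.
```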
The broader implications of CLEVR-Dialog for AI research are twofold. Practically, it provides a testbed for training and probing models on structured visual reasoning tasks, offering insight into model behavior and limitations in visual dialog. Theoretically, it pushes visual dialog systems toward multi-step reasoning and grounding, extending the complexity of synthetically constructed datasets in ways that may eventually inform real-world applications.
Future research might expand the dialog grammar or integrate more sophisticated natural language processing to address current limitations in modeling visual coreference. The framework could also be extended beyond CLEVR to more varied visual environments, further stress-testing the reasoning capabilities of visual dialog models.