GuessWhat?! Visual object discovery through multi-modal dialogue

Published 23 Nov 2016 in cs.AI and cs.CV (arXiv:1611.08481v2)

Abstract: We introduce GuessWhat?!, a two-player guessing game as a testbed for research on the interplay of computer vision and dialogue systems. The goal of the game is to locate an unknown object in a rich image scene by asking a sequence of questions. Higher-level image understanding, like spatial reasoning and language grounding, is required to solve the proposed task. Our key contribution is the collection of a large-scale dataset consisting of 150K human-played games with a total of 800K visual question-answer pairs on 66K images. We explain our design decisions in collecting the dataset and introduce the oracle and questioner tasks that are associated with the two players of the game. We prototyped deep learning models to establish initial baselines of the introduced tasks.

Citations (418)

Summary

  • The paper introduces a novel multi-modal dialogue system that uses a guessing game to enhance visual object discovery.
  • The dataset features over 150,000 games and 800,000 question-answer pairs across 66,000 images, providing rich resources for AI research.
  • Experiments show that incorporating spatial and category embeddings boosts model performance in object identification tasks.

GuessWhat?!: An Exploration in Multi-Modal Dialogue for Visual Object Discovery

The paper "GuessWhat?! Visual object discovery through multi-modal dialogue" introduces a novel task that combines computer vision with dialogue systems via a guessing game named GuessWhat?!. The game serves as a testbed for research on the interplay between visual scene analysis and conversational intelligence, emphasizing spatial reasoning and language-to-visual grounding. The objective is to pinpoint an unknown object within a detailed image scene by posing a series of questions, which requires both advanced image understanding and natural language capabilities.

Key Contributions and Dataset

At the core of this research is a large-scale dataset amassed from over 150,000 human-played games, featuring 800,000 visual question-answer pairs across 66,000 images. This dataset marks a significant achievement by providing a robust foundation for AI systems to develop and refine their abilities in multi-modal understanding and dialogue-based exploration.

The dataset is built from the MS COCO images, ensuring rich contextual scenes filled with diverse objects. This complexity demands a higher order of reasoning and comprehension for success in the game, distinguishing it from simpler, single-shot image captioning or VQA tasks.

Game Structure and Challenges

GuessWhat?! involves two players: an oracle, who knows the target object, and a questioner, tasked with identifying this object using yes-no questions. This setup mimics real-world dialogue scenarios where clarifications and refinements are necessary to establish a common understanding. The game's rules implicitly provide an automatic evaluation metric: the success or failure of the questioner in identifying the correct object.

Several layers of complexity are intrinsic to the game. For the questioner, an optimal strategy involves asking questions that efficiently split the search space—akin to binary search—though temporal constraints and scene dynamics often lead to suboptimal human strategies.
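The binary-search intuition above can be made concrete with a small simulation. The sketch below (illustrative, not from the paper) models an idealized questioner whose every yes/no question exactly halves the remaining candidate set, so that a scene with N objects is resolved in about log2(N) questions; human players typically fall short of this bound.

```python
import math

def play_optimal(objects, target):
    """Simulate an idealized questioner whose every yes/no question
    splits the remaining candidate set in half (binary search).
    `objects` is a list of candidate identifiers; `target` is the object
    the oracle has in mind. Returns the number of questions asked."""
    candidates = sorted(objects)
    questions = 0
    while len(candidates) > 1:
        mid = len(candidates) // 2
        left = candidates[:mid]
        # Question: "Is the target in this half?" The oracle answers yes/no.
        answer = target in left
        candidates = left if answer else candidates[mid:]
        questions += 1
    assert candidates[0] == target
    return questions

# With 20 candidate objects, an ideal dialogue needs ceil(log2 20) = 5 questions.
n_questions = play_optimal(list(range(20)), target=13)
```

Real dialogues are harder: questions like "Is it red?" split the set by attributes rather than by an index, and the resulting partitions are rarely balanced.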

Methodology and Experiments

The authors propose and implement various deep learning baselines to address the game's tasks:

  • Oracle Task: The oracle's role, akin to a specialized VQA task, requires answering whether the current object matches the questioner's query. The deep learning models employed utilize various embeddings (image, crop, spatial, category) fed into multi-layer perceptrons to predict answers. The analysis indicates significant performance gains when including comprehensive object features like category and spatial data.
  • Questioner Task: This involves generating questions and ultimately identifying the correct object. The task is divided into question generation and guessing sub-tasks, with separate mechanisms for dialogue comprehension and visual scene understanding. Hierarchical recurrent encoder-decoder models (HRED), enhanced with visual features, are employed to generate contextually relevant questions.

The experimental results highlight the difficulty of bridging image analysis with dialogue generation. While the best models trained on human-played games achieve reasonably high accuracy in object identification, the automatically generated questions leave considerable room for improvement before such dialogues resemble realistic AI interactions.

Implications and Future Directions

This research underscores the growing importance of multi-modal AI systems capable of combining dialogue with visual perception. As AI moves toward more immersive interactions, the GuessWhat?! framework provides a valuable setting in which to train such systems. Although current models fall short of human performance, they demonstrate foundational capabilities for understanding and interacting with rich visual environments.

Further research may explore more sophisticated models that improve question intelligibility and contextual understanding. Additionally, extending GuessWhat?! to dynamic scenes or integrating richer dialogue structures would enhance its applicability to real-world tasks, in line with future developments in AI-assisted technology.

Overall, the GuessWhat?! dataset and its associated tasks represent a substantial contribution to the field, establishing a clearly quantifiable benchmark for future work on visual dialogue systems.
