An Expert Review of "WebQA: Multihop and Multimodal QA"
The paper, "WebQA: Multihop and Multimodal QA," presents a novel benchmark designed to challenge and advance the capabilities of current Visual Question Answering (VQA) systems. Conceptualized by researchers from Carnegie Mellon University and Microsoft, WebQA is positioned as a comprehensive dataset that embraces the complexity and open-domain nature of web searches, encouraging the development of models that effectively process and reason over both text and visual information. This paper is pivotal in the progression towards more sophisticated Internet of Things (IoT) devices and digital assistants capable of mimicking human approaches to web-based information retrieval and question answering.
WebQA pushes beyond traditional QA systems by adopting a fully multimodal perspective, treating images and text as equally valid knowledge sources. The dataset also emphasizes multihop reasoning, requiring models to combine evidence from multiple sources into a coherent answer, much as humans naturally do, which poses a genuine cognitive challenge to current models.
The paper carefully delineates the composition and structure of WebQA: roughly 34.2K training samples, with an additional 5K for development and 7.5K for testing. Importantly, answers must be full natural-language sentences, diverging from the span- or classification-based responses seen in earlier datasets. This requirement demands advanced natural language generation capabilities from any system tackling WebQA.
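To make this setup concrete, the sketch below shows how a single WebQA-style example might be represented. The question, answer, and field names here are illustrative assumptions for exposition, not the dataset's exact schema or contents.

```python
# Illustrative sketch of a WebQA-style multimodal QA example.
# Field names and contents are assumptions for exposition, not the dataset's exact schema.
example = {
    "question": "Are the Eiffel Tower and the Tokyo Tower painted the same color?",
    "answer": "No, the Eiffel Tower is painted brown, while the Tokyo Tower is "
              "painted white and international orange.",
    # Knowledge sources the model must reason over; each can be text or an image.
    "positive_sources": [
        {"type": "image", "title": "Eiffel Tower", "caption": "The Eiffel Tower at dusk."},
        {"type": "image", "title": "Tokyo Tower", "caption": "Tokyo Tower seen from below."},
    ],
    # Distractors that look relevant but do not support the answer.
    "negative_sources": [
        {"type": "text", "title": "Blackpool Tower",
         "snippet": "Blackpool Tower is a tourist attraction in Blackpool, England."},
    ],
}

# A full natural-language sentence is expected as the answer, not a span or a class label.
print(example["answer"])
```

The key point the sketch captures is that answering requires aggregating two or more sources, some of which are images, and then generating a complete sentence rather than selecting a span.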
The experimental results demonstrate the difficulty of the benchmark. State-of-the-art models, including those built on VLP (Vision-Language Pre-training), achieve only moderate accuracy on WebQA. This underperformance underlines the need for continued research into models capable of genuinely integrated multimodal reasoning.
The research also examines the effect of retrieval limitations when models face large pools of candidate knowledge sources. The authors test both sparse retrieval (BM25) and dense retrieval (using CLIP) to filter sources before answering, and find that substantial retrieval errors persist in the full open-domain setting.
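As a rough illustration of the two retrieval strategies contrasted here, the sketch below scores a query against candidate text sources with BM25 (sparse, lexical matching) and with CLIP text embeddings (dense, semantic similarity). It uses the rank_bm25 and Hugging Face transformers packages, and the query and snippets are made up for the example; this is a simplified stand-in under those assumptions, not the authors' pipeline.

```python
# Sketch: sparse (BM25) vs. dense (CLIP) retrieval over candidate text sources.
# Simplified stand-in for the paper's retrieval setup, not a reproduction of it.
import torch
from rank_bm25 import BM25Okapi
from transformers import CLIPModel, CLIPProcessor

query = "Which tower is painted white and orange?"           # illustrative query
text_sources = [                                              # illustrative snippets
    "Tokyo Tower is painted white and international orange.",
    "The Eiffel Tower is repainted every seven years.",
    "Blackpool Tower is a tourist attraction in England.",
]

# --- Sparse retrieval: BM25 over whitespace-tokenized snippets ---
bm25 = BM25Okapi([s.lower().split() for s in text_sources])
sparse_scores = bm25.get_scores(query.lower().split())

# --- Dense retrieval: cosine similarity between CLIP text embeddings ---
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    q_emb = model.get_text_features(
        **processor(text=[query], return_tensors="pt", padding=True, truncation=True)
    )
    s_emb = model.get_text_features(
        **processor(text=text_sources, return_tensors="pt", padding=True, truncation=True)
    )

# Normalize so the dot product is cosine similarity.
q_emb = q_emb / q_emb.norm(dim=-1, keepdim=True)
s_emb = s_emb / s_emb.norm(dim=-1, keepdim=True)
dense_scores = (q_emb @ s_emb.T).squeeze(0).tolist()

for snippet, sp, de in zip(text_sources, sparse_scores, dense_scores):
    print(f"BM25={sp:5.2f}  CLIP={de:5.2f}  {snippet}")
```

For image sources, the dense side works the same way, with image embeddings from model.get_image_features standing in for the snippet embeddings; BM25 can only score whatever text accompanies the image, which is one reason lexical retrieval struggles in this multimodal setting.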
The implications for future AI development are manifold. Success on WebQA could lead to more robust digital assistants capable of sophisticated interaction with the vast reservoir of multimodal information on the web. The fact that current models fall well short of solving the benchmark points to the need for refined techniques in multimodal reasoning and knowledge aggregation. The research also opens an essential dialogue about the complexities of language generation and the seamless integration of text and image processing in AI models.
WebQA does not merely pose a technical challenge; it encapsulates a broader vision of unified, more human-like AI systems. Future work should examine the intersection of retrieval quality and reasoning capability, potentially exploring symbolic or compositional representations to support better information synthesis.
Overall, "WebQA: Multihop and Multimodal QA" significantly contributes to the field of AI by establishing a benchmark that tangibly aligns with real-world applications and addressing the multifaceted nature of human informational queries. As AI continues to expand into open-domain environments, WebQA stands as an instrumental resource pushing toward the frontier of multimodal AI research.