Essay on 'InfographicVQA'
The paper "InfographicVQA" by Minesh Mathew, Viraj Bagal, Ruben Tito, Dimosthenis Karatzas, Ernest Valveny, and C.V. Jawahar introduces a new dataset for Visual Question Answering (VQA) on infographics. InfographicVQA is intended as a benchmark for evaluating how well machines comprehend infographics, which are inherently complex because they combine textual, graphical, and visual elements. The dataset comprises 30,035 questions over 5,485 infographic images and is curated so that answering requires elementary reasoning and basic arithmetic over information presented both visually and textually.
Infographics are compact vehicles for disseminating information, and they pose substantial challenges for understanding because their design interleaves text, numbers, symbols, color schemes, and layout. The research therefore extends the frontier of VQA by combining image-text reasoning with an understanding of the layouts and data visualizations unique to infographics, bridging a gap in conventional VQA setups, which typically lack this level of complexity.
The authors set the performance bar using two Transformer-based baselines, M4C and LayoutLM, which are state-of-the-art models for scene-text VQA and document understanding, respectively. On evaluation, both baselines fall well short of human performance, underscoring a significant gap in machine comprehension of infographics. M4C uses a multimodal Transformer with an iterative answer decoder, but it struggles with the intricacies of text embedded in and fused with the visual elements typical of infographics. LayoutLM, adapted to this task from document understanding, similarly exposes the inadequacy of current models in handling the multimodality inherent in infographics.
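To make the layout-aware, extractive setup concrete, the sketch below shows in broad strokes how a LayoutLM-style baseline can be framed: OCR tokens and their normalized 2D bounding boxes are embedded, passed through a Transformer encoder, and scored for answer-span start and end positions. This is a minimal illustration with invented dimensions and randomly initialized weights, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToySpanExtractor(nn.Module):
    """Minimal LayoutLM-style extractive QA sketch: token ids plus
    normalized 2D box coordinates -> start/end logits over tokens."""
    def __init__(self, vocab_size=30522, hidden=256, layers=2, heads=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        # Layout embedding from (x0, y0, x1, y1) box coordinates in [0, 1]
        self.box_proj = nn.Linear(4, hidden)
        enc_layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.span_head = nn.Linear(hidden, 2)  # start and end logits

    def forward(self, token_ids, boxes):
        x = self.tok_emb(token_ids) + self.box_proj(boxes)
        h = self.encoder(x)
        start_logits, end_logits = self.span_head(h).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

# Toy usage: a sequence of 12 question + OCR tokens with random boxes.
model = ToySpanExtractor()
token_ids = torch.randint(0, 30522, (1, 12))
boxes = torch.rand(1, 12, 4)                 # normalized bounding boxes
start, end = model(token_ids, boxes)
print(start.argmax(-1).item(), end.argmax(-1).item())  # predicted span indices
```

The point of the sketch is only to show where layout enters the model: the box projection is added to the token embedding before encoding, so the span prediction can depend on where text sits on the infographic, not just on what it says.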
Analysis shows that simple heuristic baselines perform poorly on the dataset, demonstrating the distinctive challenge infographics present. Much of the dataset's distinctiveness lies in its requirement that models perform discrete operations such as counting or arithmetic, contingent on contextual understanding and relational reasoning over the data. This becomes evident when model predictions are broken down by evidence type (Text, Table/List, Visual/Layout, Figure, and Map) and by discrete operation, with models generally underperforming across these categories.
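To illustrate what a "discrete operation" question demands, the hypothetical snippet below composes an answer by counting and summing values that an earlier OCR stage has already attributed to the relevant chart region. The labels, values, and question are invented for illustration; the hard part, which the baselines struggle with, is deciding which extracted values are actually relevant before any arithmetic happens.

```python
# Hypothetical post-OCR step for a question such as
# "How many of the listed countries report a share above 20%?"
# Values and labels are invented for illustration only.
extracted = {"USA": 27.0, "India": 18.5, "Brazil": 22.3, "Japan": 12.1}

threshold = 20.0
matching = [c for c, share in extracted.items() if share > threshold]

count_answer = len(matching)             # discrete operation: counting
sum_answer = sum(extracted.values())     # discrete operation: arithmetic

print(count_answer)           # -> 2
print(round(sum_answer, 1))   # -> 79.9
```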
This research has broad implications for AI and VQA. The InfographicVQA dataset is a critical step toward more advanced document and image understanding, highlighting the need for models that can synthesize and contextualize multi-format information, a skill essential to real-world applications such as automated reporting, data analysis, and digital assistants. It also motivates further work on feature extraction and fusion techniques tailored to infographic-specific tasks.
Future research might explore better visual feature extraction that fuses narrative-driven graphics and text efficiently, moving beyond current approaches in which features learned from natural scene images or traditional document layouts have proved inadequate. Advances in understanding the infographic medium would broaden applicability across domains where quick, accurate comprehension of complex visuals is required.
In conclusion, the authors make a significant contribution by building and releasing InfographicVQA, a challenging VQA benchmark that facilitates progress toward robust multimodal machine understanding. Their findings lay the groundwork for future work on AI's ability to interpret rich visual content that intertwines textual and graphical data.