ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning (2203.10244v1)

Published 19 Mar 2022 in cs.CL

Abstract: Charts are very popular for analyzing data. When exploring charts, people often ask a variety of complex reasoning questions that involve several logical and arithmetic operations. They also commonly refer to visual features of a chart in their questions. However, most existing datasets do not focus on such complex reasoning questions as their questions are template-based and answers come from a fixed vocabulary. In this work, we present a large-scale benchmark covering 9.6K human-written questions as well as 23.1K questions generated from human-written chart summaries. To address the unique challenges in our benchmark involving visual and logical reasoning over charts, we present two transformer-based models that combine visual features and the data table of the chart in a unified way to answer questions. While our models achieve the state-of-the-art results on the previous datasets as well as on our benchmark, the evaluation also reveals several challenges in answering complex reasoning questions.

ChartQA: Benchmarking Visual and Logical Reasoning in Chart-Based Question Answering

Question answering (QA) over data visualizations is an intriguing niche within the broader spectrum of natural language processing tasks. The paper "ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning" addresses the challenging problem of interpreting and reasoning over chart data by developing a specialized benchmark for QA systems. This benchmark, aptly named ChartQA, encompasses a diverse set of human-authored and automatically generated questions targeting a wide range of chart types. Its significance lies in providing a robust foundation for future exploration and advancement of chart-based QA systems.

Dataset Construction and Analysis

The authors design a large-scale benchmark comprising over 9,600 human-authored questions and an additional 23,100 questions machine-generated from human-written chart summaries. Unlike prior datasets that rely heavily on template-based question generation, ChartQA captures a more realistic distribution of question types, predominantly compositional and visually referenced queries that demand substantial reasoning. The inclusion of human-written questions is a critical advance: it addresses the limitations of previous datasets and pushes models toward handling real-world question variability.

Each chart in the dataset is drawn from credible real-world sources, ensuring graphical and thematic diversity. Significant effort went into maintaining data quality through crowdsourced annotation and rigorous validation. The dataset is structured to elicit both logical operations, such as summation and comparison, and references to visual attributes of chart marks, e.g., color and size, which are common in real-world questions.

Proposed Methods

To tackle the complexities inherent in chart-based QA, the paper introduces two transformer-based models designed to integrate visual and data-table features: VisionTaPas and VL-T5. These models extend standard NLP architectures to consume visual elements extracted from charts alongside traditional textual inputs.

VisionTaPas combines visual embeddings, obtained from a Vision Transformer (ViT) encoding of the chart image, with the question and data table processed through an adapted TaPas encoder. Fusing the modalities supplies the model with a comprehensive context, boosting its ability to answer questions that require visual interpretation.
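
To make this kind of fusion concrete, here is a minimal sketch assuming PyTorch and Hugging Face transformers. The checkpoint names, the single cross-attention block, and the dimensions are illustrative assumptions for the sketch, not the authors' exact architecture.

```python
# Illustrative sketch of VisionTaPas-style fusion (not the paper's exact model):
# table/question tokens from a TaPas encoder attend to ViT chart-patch embeddings.
import torch
import torch.nn as nn
from transformers import ViTModel, TapasModel

class ChartQAFusion(nn.Module):
    def __init__(self):
        super().__init__()
        # Both backbones below use a 768-dim hidden size, so they fuse directly.
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
        self.tapas = TapasModel.from_pretrained("google/tapas-base")
        # Cross-modal block: textual/tabular tokens query the chart patches.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=768, num_heads=12, batch_first=True
        )
        self.norm = nn.LayerNorm(768)

    def forward(self, pixel_values, input_ids, attention_mask, token_type_ids):
        patches = self.vit(pixel_values=pixel_values).last_hidden_state
        tokens = self.tapas(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        ).last_hidden_state
        fused, _ = self.cross_attn(query=tokens, key=patches, value=patches)
        # Residual + norm; a QA head would sit on top of these fused states.
        return self.norm(tokens + fused)
```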

VL-T5, on the other hand, extends the T5 encoder-decoder architecture to accommodate multimodal inputs, integrating visual features alongside the question and the (gold or extracted) data table. This approach leverages a powerful pre-trained sequence-to-sequence LLM, augmented with visual recognition capabilities, to generate answers as free-form text.
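
As a rough illustration of how such a model can be wired together, the snippet below projects placeholder visual features into the T5 embedding space and prepends them to the flattened question-plus-table input. The feature dimensions, prompt format, and projection layer are assumptions for the sketch, not the paper's exact setup.

```python
# Hedged sketch of the VL-T5 idea: visual features become extra input
# embeddings for a seq2seq model that generates the answer as text.
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
t5 = T5ForConditionalGeneration.from_pretrained("t5-base")
# Hypothetical projection from 2048-dim visual region features to d_model.
visual_proj = nn.Linear(2048, t5.config.d_model)

# Illustrative prompt: question plus a linearized data table.
question_and_table = "question: Which year had the highest sales? table: year | sales"
text_ids = tokenizer(question_and_table, return_tensors="pt").input_ids
text_embeds = t5.get_input_embeddings()(text_ids)

visual_feats = torch.randn(1, 36, 2048)  # placeholder chart features
inputs_embeds = torch.cat([visual_proj(visual_feats), text_embeds], dim=1)

# The decoder generates the answer token-by-token (free-form text).
answer_ids = t5.generate(inputs_embeds=inputs_embeds, max_new_tokens=16)
print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))
```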

Evaluation and Results

The performance of these models was evaluated on several existing datasets (e.g., DVQA, FigureQA, PlotQA) and on the newly introduced ChartQA benchmark. Results show that VisionTaPas and VL-T5 outperform previous state-of-the-art models in both scenarios: when the gold data table is provided and when it must be extracted from the chart image. Notably, VisionTaPas achieved superior results on DVQA and ChartQA, highlighting its robustness in handling the dual challenge of logical and visual reasoning.
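
For reference, the paper scores answers with a relaxed accuracy measure: non-numeric answers require an exact match, while numeric answers may deviate by up to 5% from the gold value to absorb minor data-extraction noise. A minimal sketch of that metric:

```python
def relaxed_accuracy(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """ChartQA-style relaxed match: numeric answers may deviate by up to
    `tolerance` (5%) from the gold value; other answers need an exact
    (case-insensitive) string match."""
    try:
        pred, gold = float(prediction), float(target)
        if gold == 0.0:
            return pred == 0.0
        return abs(pred - gold) / abs(gold) <= tolerance
    except ValueError:
        return prediction.strip().lower() == target.strip().lower()

# A numeric prediction within 5% of the gold answer counts as correct.
assert relaxed_accuracy("41.2", "42") is True
assert relaxed_accuracy("Germany", "germany") is True
```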

Moreover, the paper identifies a core limitation: extracting structured data from real-world chart images remains error-prone across diverse visual formats. This is a significant area for improvement, as it strongly affects model performance in settings where gold-standard data tables are unavailable.

Implications and Future Directions

ChartQA presents several implications for the field, both practically and theoretically. On a practical level, the dataset opens new fronts for evaluating and enhancing AI systems designed for interpreting numeric data representations, a task increasingly relevant in data-driven decision environments. Theoretically, the multifaceted nature of problems presented in the benchmark indicates a need for models that can proficiently handle both high-level logical operations and a detailed understanding of visual semantics.

The paper also suggests intriguing pathways for future research, advocating for models that can seamlessly blend different data representations and reasoning strategies. Developing end-to-end systems that extract and interpret data from real-world charts remains an open challenge. Furthermore, broadening dataset diversity and reducing data-extraction errors would significantly benefit the domain and extend the applicability of such models.

Overall, "ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning" makes a significant contribution to the AI and NLP communities. The benchmark and models proposed serve not only as a platform for current experiments but also chart a course for future advancements in the space of multimodal data interpretation.

Authors
  1. Ahmed Masry
  2. Do Xuan Long
  3. Jia Qing Tan
  4. Shafiq Joty
  5. Enamul Hoque
Citations (378)