Figure-Based Question Answering

Updated 11 August 2025
  • Figure-based QA is a multimodal AI approach that extracts structured, quantitative information from charts, diagrams, and tables by fusing visual and textual inputs.
  • Methodologies include modular pipelines, end-to-end fusion, and graph-based reasoning to capture fine-grained features and relational data from visualizations.
  • Key challenges such as OCR limitations, open-vocabulary issues, and dataset generalization drive innovations in automated document parsing and assistive technologies.

Figure-Based Question Answering (QA) is a subfield of multimodal artificial intelligence focused on answering natural language questions that require interpreting figures, diagrams, plots, charts, or tables, often in conjunction with associated textual and contextual information. Unlike traditional visual question answering that targets photographs, figure-based QA typically requires extracting structured, quantitative, and relational information from stylized, schematic, or data-driven visualizations, frequently under conditions where fine-grained recognition, reasoning, and cross-modal fusion are essential.

1. Foundations and Problem Formulation

Figure-based QA centers on systems that, given a question and a figure (e.g., chart, scientific plot, schematic), must provide an accurate answer—sometimes as free-form text, a specific value, or a designated region or span within another modality. The problem is distinguished by several characteristics:

  • High Density of Structured Information: Figures encode data, annotations, legends, and often multiple semantic layers such as categorical relationships, trends, and axes mappings.
  • Cross-Modality Reasoning: Questions may depend on both visual (figure) and textual (captions, labels, context) information, necessitating joint processing.
  • Fine-Grained Measurements and Semantics: Accurate answering often hinges on precise detection of bar heights, line intersections, legend correspondences, or cell values.
  • Open Vocabulary: OOV (out-of-vocabulary) issues frequently arise, as answers or referents are unique to each visualization.

Prominent early datasets such as FigureQA (Kahou et al., 2017) and more recently SPIQA (Pramanick et al., 12 Jul 2024) have formalized the task, developing benchmarks that include a wide range of figure types and question modalities.
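Concretely, a single figure-based QA instance bundles visual, textual, and structural fields. A minimal illustrative record is sketched below; the field names are hypothetical and do not follow any particular dataset's schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative only: field names are hypothetical, not any dataset's actual schema.
@dataclass
class FigureQAInstance:
    image_path: str                      # rendered chart, plot, or diagram
    question: str                        # natural-language question
    answer: str                          # free-form text, value, or category
    caption: Optional[str] = None        # accompanying text, if available
    element_boxes: dict = field(default_factory=dict)  # e.g. {"legend": [x, y, w, h]}

example = FigureQAInstance(
    image_path="figures/bar_chart_0001.png",
    question="Which category has the largest value?",
    answer="Category B",
    caption="Quarterly results by category.",
    element_boxes={"legend": [410, 20, 120, 60]},
)
```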

2. Dataset Design and Scope

Figure-based QA research is predicated on large-scale, annotated datasets that model the challenges of real-world data visualizations:

  • Synthetic Datasets: FigureQA (Kahou et al., 2017) introduced over 100,000 synthetic scientific figures (line plots, bar graphs, pie charts) paired with >1 million binary Q/A pairs generated from 15 question templates (min/max, area, intersection, smoothness).
  • Chart Diversity: LEAF-QA (Chaudhry et al., 2019) expands coverage to 250,000 real-world chart images (bar, line, scatter, box, pie) from public sources, with ~2M Q/A pairs emphasizing both structural and relational reasoning, fine-grained labels, and chart parsing annotations.
  • Text-Integrated and Scientific Datasets: SPIQA (Pramanick et al., 12 Jul 2024) incorporates 25,859 research articles, combining context-rich figures/tables, full papers, and 270,000 questions, addressing holistic scientific paper understanding.
  • Pre-training Datasets: SBS Figures (Shinoda et al., 23 Dec 2024) synthesizes up to 1 million figures via topic-driven, stagewise generation, enabling scalable pre-training of chart-interpretation models with automatically generated supervision and no manual annotation.

Table 1 summarizes key dataset characteristics.

Dataset     | Source Type       | Chart Diversity     | Annotation Style
FigureQA    | Synthetic         | 5 chart/plot types  | Binary Q/A, bounding boxes
LEAF-QA     | Real-world        | 7+ chart types      | Dense Q/A, regions, masks
SPIQA       | Scientific papers | Figures, tables     | Free-form Q/A, rationales
SBS Figures | Synthetic         | 10+ chart types     | Structured JSON, dense Q/A
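The template-driven, stagewise generation behind synthetic datasets such as FigureQA and SBS Figures can be sketched in a few lines: sample data, render the figure from that data, then emit questions from templates whose answers are known by construction. The stages and templates below are simplified illustrations under assumed conventions, not the datasets' actual generation code.

```python
import json
import random
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Stage 1: sample synthetic data for one figure.
def sample_bar_data(n_bars=5, seed=0):
    rng = random.Random(seed)
    labels = [f"series_{i}" for i in range(n_bars)]
    values = [round(rng.uniform(1.0, 10.0), 2) for _ in range(n_bars)]
    return labels, values

# Stage 2: render the figure from the sampled data.
def render_bar_chart(labels, values, path):
    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_ylabel("value")
    fig.savefig(path)
    plt.close(fig)

# Stage 3: emit templated Q/A pairs grounded in the known data.
def generate_qa(labels, values):
    max_label = labels[values.index(max(values))]
    min_label = labels[values.index(min(values))]
    return [
        {"question": f"Is {max_label} the maximum?", "answer": "yes"},
        {"question": f"Is {min_label} the maximum?",
         "answer": "yes" if min_label == max_label else "no"},
    ]

labels, values = sample_bar_data(seed=42)
render_bar_chart(labels, values, "figure_000.png")
print(json.dumps({"figure": "figure_000.png", "qa": generate_qa(labels, values)}, indent=2))
```

Separating data, rendering, and QA generation in this way is what lets pipelines such as SBS Figures control quality at each stage and guarantee that answers remain consistent with the rendered data.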

3. Methodologies and Model Architectures

A principal focus of figure-based QA is on model architectures that integrate visual, textual, and structural signals:

  • Modular Paradigms: Approaches such as FigureNet (Reddy et al., 2018) employ a staged pipeline: first segmenting plot elements (a Spectral Segregator module), then extracting quantitative values and their ordering, followed by question-conditioned fusion.
  • End-to-End Multimodal Fusion: Models like PReFIL (Kafle et al., 2019) perform early fusion by concatenating question embeddings with spatial image features, processing via convolutional layers and aggregating with bidirectional GRUs to capture global-local feature dependencies.
  • Graph-based Reasoning: Scene graph methods (Zhang et al., 2019), entity graph approaches for knowledge-based VQA (Narasimhan et al., 2018), and reasoning graphs for multi-hop QA (Pahilajani et al., 1 Nov 2024) introduce explicit graph structures to encode objects, relations, or evidence, and leverage message passing (e.g., graph convolutional networks) for answer inference.
  • Answer Embedding Models: Embedding-based QA (Hu et al., 2018) projects both (figure, question) and answer candidates into a joint space, scoring via inner product and using softmax over candidate answers, enabling open-universe answer prediction and transfer learning.
  • Multimodal LLMs and CoT: SPIQA (Pramanick et al., 12 Jul 2024) demonstrates the deployment of multimodal LLMs to perform stepwise chain-of-thought (CoT) reasoning, improving performance by jointly retrieving and reasoning over figures, tables, and corresponding texts.
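As a concrete illustration of the end-to-end early-fusion paradigm above, the following is a compressed PyTorch sketch in the spirit of PReFIL: the question embedding is tiled over the image feature grid, fused with 1x1 convolutions, and aggregated with a bidirectional GRU. Layer sizes and the single-stage fusion are illustrative simplifications, not the published configuration.

```python
import torch
import torch.nn as nn

class EarlyFusionQA(nn.Module):
    """Simplified early-fusion figure QA model (illustrative sizes, not PReFIL's)."""
    def __init__(self, vocab_size, n_answers, q_dim=256, img_dim=512, fused_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, q_dim)
        self.q_rnn = nn.LSTM(q_dim, q_dim, batch_first=True)
        # 1x1 convolutions fuse question and image features at every spatial location.
        self.fuse = nn.Sequential(
            nn.Conv2d(img_dim + q_dim, fused_dim, kernel_size=1), nn.ReLU(),
            nn.Conv2d(fused_dim, fused_dim, kernel_size=1), nn.ReLU(),
        )
        # A bidirectional GRU aggregates the grid of fused features into one vector.
        self.aggregate = nn.GRU(fused_dim, fused_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * fused_dim, n_answers)

    def forward(self, img_feats, question_tokens):
        # img_feats: (B, img_dim, H, W) from a CNN backbone; question_tokens: (B, T)
        _, (q, _) = self.q_rnn(self.embed(question_tokens))
        q = q[-1]                                                # (B, q_dim)
        B, _, H, W = img_feats.shape
        q_map = q[:, :, None, None].expand(B, q.size(1), H, W)  # tile question over grid
        fused = self.fuse(torch.cat([img_feats, q_map], dim=1)) # (B, fused_dim, H, W)
        seq = fused.flatten(2).transpose(1, 2)                  # (B, H*W, fused_dim)
        _, h = self.aggregate(seq)
        h = torch.cat([h[0], h[1]], dim=-1)                     # (B, 2*fused_dim)
        return self.classifier(h)                               # answer logits
```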

A general formulation for answer selection in embedding-based models:

$$a^* = \arg\max_{a \in A} f(i, q)^\top g(a)$$

where $f(i, q)$ encodes the image (or figure) and question, and $g(a)$ encodes the answer candidate.
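A minimal sketch of this scoring rule, assuming f and g are arbitrary encoders producing vectors of the same dimension (replaced here by random tensors for illustration):

```python
import torch
import torch.nn.functional as F

def score_answers(fused_iq, answer_embeddings):
    """Inner-product scores between f(i, q) and each candidate g(a).

    fused_iq:          (B, D) outputs of the joint figure/question encoder f
    answer_embeddings: (N, D) outputs of the answer encoder g for N candidates
    Returns logits (B, N); argmax over N implements a* = argmax_a f(i, q)^T g(a).
    """
    logits = fused_iq @ answer_embeddings.T
    probs = F.softmax(logits, dim=-1)        # softmax over candidate answers
    return logits, probs

# Hypothetical usage with random tensors standing in for the encoders f and g.
fused_iq = torch.randn(2, 128)               # f(i, q) for a batch of 2 examples
answer_embeddings = torch.randn(50, 128)     # g(a) for 50 candidate answers
logits, probs = score_answers(fused_iq, answer_embeddings)
predicted = logits.argmax(dim=-1)            # index of a* for each example
```

Because candidate answers are encoded independently by g, the candidate set can be extended at test time, which is what enables open-universe prediction and transfer (Hu et al., 2018).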

4. Evaluation Protocols and Performance Metrics

Evaluation in figure-based QA is sensitive to both answer correctness and the pathway of reasoning:

  • Accuracy and Exact Match: Most datasets report accuracy for binary, short, or categorical answers (e.g., ~60.34% open-ended accuracy for VQABQ (Huang et al., 2017); 84.29% for FigureNet on FigureQA (Reddy et al., 2018); >90% for PReFIL on FigureQA (Kafle et al., 2019)).
  • Answer Quality Metrics: Free-form answer validation leverages METEOR, ROUGE-L, and CIDEr for n-gram overlap, BERTScore F1 for embedding-based similarity, and L3Score (Pramanick et al., 12 Jul 2024), a log-likelihood-based score over an LLM judge's "yes"/"no" tokens, for semantic alignment.
  • Retrieval and Reasoning Structure Analysis: For multi-hop/graph-based QA, retrieval precision/recall/F1 over supporting evidence is assessed, alongside answer exact match and LLM-as-judge metrics (Pahilajani et al., 1 Nov 2024).
  • Chain-of-Thought (CoT) Evaluation: CoT-style prompts are evaluated by their intermediate retrieval accuracy and final answer correctness, as in the SPIQA framework.
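The L3Score idea can be sketched as below, assuming the judge LLM exposes log-probabilities for its first output token; the prompt, token choices, and normalization are illustrative, not the exact SPIQA protocol.

```python
import math

def l3score_style(token_logprobs, yes_token="Yes", no_token="No"):
    """Normalized yes/no confidence from a judge LLM's first-token log-probs.

    token_logprobs: dict mapping candidate first tokens to log-probabilities,
    e.g. {"Yes": -0.11, "No": -2.30}, obtained by asking the judge whether the
    candidate answer matches the gold answer for the given question.
    Illustrative normalization: p_yes / (p_yes + p_no); 0.0 if neither token appears.
    """
    if yes_token not in token_logprobs and no_token not in token_logprobs:
        return 0.0
    p_yes = math.exp(token_logprobs.get(yes_token, float("-inf")))
    p_no = math.exp(token_logprobs.get(no_token, float("-inf")))
    return p_yes / (p_yes + p_no)

print(l3score_style({"Yes": -0.11, "No": -2.30}))  # ~0.90
```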

5. Design Challenges and Solutions

Figure-based QA development faces several technical and practical obstacles:

  • OCR Limitations and OOV Handling: Approaches such as dynamic dictionaries and chart element detection in PReFIL (Kafle et al., 2019) and LEAF-Net (Chaudhry et al., 2019) address the challenge of scene text extraction and novel labels.
  • Error-Prone Figure Generation: SBS Figures (Shinoda et al., 23 Dec 2024) mitigates code generation errors and figure monotony in synthetic datasets by separating figure topic, data, rendering, and QA generation into discrete, quality-controlled pipeline stages.
  • Relational Reasoning: FigureQA analysis (Kahou et al., 2017) and subsequent methods highlight the need for architectures—such as Relation Networks and graph-based models—to model spatially and semantically distributed information.
  • Generalization and Transfer: Embedding-based models (Hu et al., 2018) and pre-training (e.g., on SBS Figures) support transfer to new domains, datasets, or answer spaces not seen during training.
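The dynamic-dictionary strategy noted in the first bullet can be sketched as a per-figure extension of a fixed answer vocabulary with OCR-detected chart text, so chart-specific labels become valid answers at inference time. The OCR interface and vocabularies shown are hypothetical stand-ins.

```python
def build_dynamic_answer_space(static_vocab, ocr_tokens):
    """Extend a fixed answer vocabulary with chart-specific OCR text.

    static_vocab: list of generic answers, e.g. ["yes", "no", "0", "1", ...]
    ocr_tokens:   strings detected on the current chart (axis labels, legend
                  entries, tick values) by any OCR / text-detection component.
    Returns the per-figure answer list and a token -> index mapping, so a model
    can classify over the generic slots plus the chart-specific slots.
    """
    answers = list(static_vocab)
    for tok in ocr_tokens:
        if tok not in answers:          # de-duplicate against the static part
            answers.append(tok)
    index = {a: i for i, a in enumerate(answers)}
    return answers, index

# Hypothetical per-figure usage; OCR output would come from a text detector.
static_vocab = ["yes", "no"]
ocr_tokens = ["Revenue", "Q3 2021", "Europe", "Revenue"]
answers, index = build_dynamic_answer_space(static_vocab, ocr_tokens)
print(answers)   # ['yes', 'no', 'Revenue', 'Q3 2021', 'Europe']
```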

6. Impact, Applications, and Future Directions

Figure-based QA underpins a range of scientific, educational, and business intelligence applications:

  • Automated Document Parsing: Models capable of chart and diagram comprehension facilitate structured data extraction from scientific papers (Pramanick et al., 12 Jul 2024), business reports, and government datasets, supporting downstream analytics and search.
  • Assistive Technology: Robust figure QA can enhance accessibility, enabling visually impaired users or non-experts to query charts and diagrams via natural language.
  • Analytic Tools and Knowledge Discovery: Figure QA models support scenario-specific applications such as interactive educational tutors, report analysis tools, and knowledge base construction from visual data (Chaudhry et al., 2019, Shinoda et al., 23 Dec 2024).
  • Explainability and Reasoning Transparency: Reasoning-structured datasets like GRS-QA (Pahilajani et al., 1 Nov 2024) and graph-based VQA approaches advance the field toward interpretable, auditable multimodal AI.

Continued progress depends on enriching datasets (realistic figures, open-ended questions), advancing multi-stage and CoT reasoning, improving integrated OCR/text parsing, and developing metrics attuned to semantic and reasoning quality. A plausible implication is that scaling pre-training on synthetic, richly annotated chart datasets (as with SBS Figures) and leveraging MLLM-based reasoning pipelines will further narrow the gap between machine and human performance across diverse figure-based QA scenarios.
