Infographic Visual Question Answering
- InfographicVQA is a specialized task that fuses OCR, layout analysis, and multi-step numeric reasoning to interpret data-rich infographics.
- It employs multi-modal evidence integration to decode textual, numerical, and graphical elements from charts, tables, and maps.
- Key challenges include OCR errors, complex layout interpretation, and arithmetic reasoning, with current models scoring around 0.61 ANLS versus a human upper-bound of 0.98.
Infographic Visual Question Answering (InfographicVQA) is a specialized subfield of Visual Question Answering (VQA) focused on the automated reading, reasoning, and information extraction from visually complex, data- and text-rich infographic documents. Unlike natural-scene VQA, InfographicVQA requires the integration of Optical Character Recognition (OCR), visual-object parsing, layout understanding, and multi-step numeric reasoning, operating over a broad range of visual forms including charts, tables, maps, and decorative graphical elements. The task emphasizes questions that jointly interrogate both textual and visual content, frequently demanding interpretation of the underlying data visualizations and multi-modal relational reasoning (Mathew et al., 2021) (Tito et al., 2021) (Van-Dinh et al., 13 Dec 2025).
1. Task Definition and Distinct Challenges
InfographicVQA formalizes the problem as learning a function that maps pairs, consisting of an infographic image $I$ (or an image set $\{I_1, \dots, I_k\}$ in the multi-image case) and a natural-language question $Q$, to a textual answer $A$, written as

$$f_\theta(I, Q) = A,$$

with prediction driven by $\hat{A} = \arg\max_{A} P_\theta(A \mid I, Q)$ or, in the multi-image setting, $\hat{A} = \arg\max_{A} P_\theta(A \mid \{I_1, \dots, I_k\}, Q)$ (Van-Dinh et al., 13 Dec 2025).
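A minimal sketch of this prediction rule, assuming a hypothetical scoring interface `log_prob(answer, image_path, question)` exposed by some vision-language model and a fixed candidate answer set (both are illustrative assumptions, not APIs from the cited papers):

```python
# Minimal sketch of the prediction rule above; `log_prob` and `VQAExample` are
# hypothetical names, not from the cited papers.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class VQAExample:
    image_path: str   # path to the infographic image (one of several in the multi-image case)
    question: str     # natural-language question Q


def predict_answer(
    example: VQAExample,
    candidates: Sequence[str],
    log_prob: Callable[[str, str, str], float],
) -> str:
    """Return argmax_A P(A | I, Q) over a candidate answer set."""
    return max(candidates, key=lambda a: log_prob(a, example.image_path, example.question))
```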
Key differentiators relative to general VQA and Document VQA include:
- Multimodal Evidence Integration: Requires fusing evidence from text blocks, numerical data (embedded in charts and tables), icons, and spatial layout.
- Multi-step Reasoning: Many queries (~45% across datasets) require counting, sorting, or arithmetic, often over non-contiguous regions.
- Cross-image Reasoning: In multi-image settings, as introduced in ViInfographicVQA, questions reference multiple related infographics and demand synthesis of distributed evidence.
Distinct failure points compared to natural-image VQA include:
- OCR error cascades in dense or stylized text.
- Inability of generic VLMs to parse non-standard layouts and perform arithmetic or compositional operations beyond shallow span lookup.
- Semantic ambiguity in answers requiring inference from multi-element visuals, especially for “non-span” (generated or computed) answers (Tito et al., 2021) (Mathew et al., 2021).
2. Datasets and Annotation Protocols
The development of InfographicVQA has been driven by several cornerstone datasets:
InfographicVQA (English)
- Scale: 5,485 real-world infographic images, >30,000 question-answer pairs, curated from ~2,600 distinct internet domains.
- Data Split: Train (4,406 images / 23,946 questions), Validation (500 images / 2,801 questions), Test (579 images / 3,288 questions), with the train/val/test split performed by image to prevent leakage (Mathew et al., 2021) (Tito et al., 2021).
- Question and Answer Types:
- Image-span: Answers verbatim in the image.
- Multi-span: Concatenation of several discrete text spans.
- Non-span: Computed/generated numeric or textual answers absent from the image.
- Question-span: Answers directly within the question text.
- Operation Types: ~30% require counting, ~10% sorting, ~5% arithmetic (e.g., difference between chart values).
- Evidence Tags: Annotators mark whether the answer is evidenced by text, table/list, figure, map, or pure visual/layout features.
Annotation Schema: Each QA pair records answer type, evidence type, operation, and all valid answer variants to accommodate linguistic or spelling variability. OCR (Amazon Textract) outputs and bounding boxes are provided.
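For illustration, the annotation schema above can be mirrored in a record like the following; the field names and example values are assumptions for exposition, not the dataset's actual JSON layout:

```python
# Illustrative record mirroring the annotation fields described above (answer
# type, evidence type, operation, answer variants, OCR tokens with bounding
# boxes). Field names and example values are assumptions, not the dataset schema.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class OCRToken:
    text: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class InfographicQA:
    question: str
    answers: List[str]        # all valid answer variants (spelling/phrasing variability)
    answer_type: str          # "image-span" | "multi-span" | "non-span" | "question-span"
    evidence: List[str]       # e.g. ["text", "table/list", "figure", "map", "visual/layout"]
    operation: List[str]      # e.g. ["counting", "sorting", "arithmetic"]
    ocr_tokens: List[OCRToken] = field(default_factory=list)


example = InfographicQA(
    question="What is the difference between the two largest values in the bar chart?",
    answers=["12", "twelve"],
    answer_type="non-span",
    evidence=["figure"],
    operation=["arithmetic"],
)
```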
ViInfographicVQA (Vietnamese)
- Scale: 2,400 unique infographics, 9,984 QAs, from domains including economics, healthcare, culture/society, disasters, and sports.
- Settings:
- Single-image and multi-image (cross-image) subtasks.
- Multi-image questions require aggregation or comparison across 2–5 related infographics.
- Domain Annotation: Each infographic is labeled with its sector and grouped with related infographics for cross-image queries (Van-Dinh et al., 13 Dec 2025).
Comparative Table of Datasets
| Dataset | Images / QA Pairs | Languages | Unique Features |
|---|---|---|---|
| InfographicVQA | 5,485 / 30,035 | English | Rich chart/table, layout ops |
| ViInfographicVQA | 2,400 / 9,984 | Vietnamese | Multi-image, domain diversity |
Editor’s term: “non-span” for generated/computed answers.
3. Evaluation Metrics
Evaluation in InfographicVQA uses metrics sensitive to textual variability and minor error tolerance:
- Average Normalized Levenshtein Similarity (ANLS): For $N$ question-answer pairs with ground-truth answer variants $a_{i1}, \dots, a_{iM_i}$ and model prediction $o_i$,

  $$\mathrm{ANLS} = \frac{1}{N} \sum_{i=1}^{N} \max_{j}\, s(a_{ij}, o_i),$$

  with

  $$s(a_{ij}, o_i) = \begin{cases} 1 - \mathrm{NL}(a_{ij}, o_i) & \text{if } \mathrm{NL}(a_{ij}, o_i) < \tau = 0.5, \\ 0 & \text{otherwise,} \end{cases} \qquad \mathrm{NL}(a, o) = \frac{\mathrm{Lev}(a, o)}{\max(|a|, |o|)},$$

  where $\mathrm{Lev}$ is the Levenshtein edit distance (Tito et al., 2021) (Mathew et al., 2021) (Van-Dinh et al., 13 Dec 2025). A minimal implementation sketch appears after this list.
- Multi-span and List Matching: All permutations of predicted answer items are compared to ground truth, with maximum similarity defining the score.
- Extensions: ANLSL for list-type answers in document collections, leveraging the Hungarian algorithm for optimal matching.
- Human and Upper Bounds: On InfographicVQA, human performance is ≈0.98 ANLS; Vocab+OCR upper bound approaches 0.77, demonstrating the challenge for current models (best reported ANLS is ~0.61 for the leading system) (Tito et al., 2021).
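The following sketch implements the ANLS definition above, plus a Hungarian-matched list score in the spirit of ANLSL; it is an illustrative implementation under those definitions, not the official evaluation code, and the numpy/scipy dependencies are my own choices for the assignment step.

```python
# Illustrative implementation of ANLS (tau = 0.5) and an ANLSL-style list score.
import numpy as np
from scipy.optimize import linear_sum_assignment


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]


def nls(pred: str, gold: str, tau: float = 0.5) -> float:
    """Normalized Levenshtein similarity s(a, o) with threshold tau."""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if not pred and not gold:
        return 1.0
    nl = levenshtein(pred, gold) / max(len(pred), len(gold))
    return 1.0 - nl if nl < tau else 0.0


def anls(predictions: list[str], gold_variants: list[list[str]]) -> float:
    """Average over questions of the best similarity against any gold answer variant."""
    return sum(max(nls(p, g) for g in golds)
               for p, golds in zip(predictions, gold_variants)) / len(predictions)


def anlsl(pred_items: list[str], gold_items: list[str]) -> float:
    """List-type score: optimal one-to-one matching via the Hungarian algorithm."""
    cost = np.array([[-nls(p, g) for g in gold_items] for p in pred_items])
    rows, cols = linear_sum_assignment(cost)
    return float(-cost[rows, cols].sum()) / max(len(pred_items), len(gold_items))
```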
4. Model Architectures and Performance
Vision–Language–Layout Transformers
Recent approaches uniformly adopt vision–language–layout transformer backbones:
- TILT (Applica.ai): Text-Image-Layout Transformer with token, bounding-box, and patch embeddings; achieves 0.6120 ANLS and supports answer generation for non-explicit answers (Tito et al., 2021).
- IG-BERT: A BERT-Large variant that integrates Faster R-CNN visual features and uses a different OCR engine (Google Vision instead of Amazon Textract), scoring 0.3854 ANLS.
- NAVER CLOVA: Based on HyperDQA and BROS-style LM, pre-trained on diversified document and QA corpora.
Baseline Systems
- M4C (Multi-modal Pointer Network): Combines question, OCR tokens, and object ROIs through a transformer fusion layer, producing answers with pointer-augmented decoding.
- LayoutLM: BERT-style model extended with 2D positional embeddings and a masked language-modeling pretraining objective; performs SQuAD-style span prediction over OCR tokens.
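As a schematic of the LayoutLM-style recipe just described (token embeddings plus 2D box embeddings feeding a Transformer encoder, with SQuAD-style start/end span prediction over OCR tokens), the following PyTorch sketch shows the core components; hidden sizes, the 1000-bin coordinate quantization, and all module names are illustrative assumptions rather than the released LayoutLM implementation.

```python
# Schematic LayoutLM-style extractive QA sketch; sizes and names are assumptions.
import torch
import torch.nn as nn


class LayoutSpanQA(nn.Module):
    def __init__(self, vocab_size: int = 30522, hidden: int = 256, layers: int = 4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        # One embedding table per quantized box coordinate (x0, y0, x1, y1), each in 0..999.
        self.box_emb = nn.ModuleList([nn.Embedding(1000, hidden) for _ in range(4)])
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.span_head = nn.Linear(hidden, 2)  # start / end logits per token

    def forward(self, token_ids: torch.Tensor, boxes: torch.Tensor):
        # token_ids: (B, T) question + OCR tokens; boxes: (B, T, 4) quantized coordinates.
        x = self.tok_emb(token_ids)
        for i, emb in enumerate(self.box_emb):
            x = x + emb(boxes[..., i])
        x = self.encoder(x)
        start_logits, end_logits = self.span_head(x).unbind(dim=-1)
        return start_logits, end_logits


# Usage: greedy SQuAD-style span decoding over a dummy batch.
model = LayoutSpanQA()
ids = torch.randint(0, 30522, (1, 64))
boxes = torch.randint(0, 1000, (1, 64, 4))
start, end = model(ids, boxes)
answer_span = (int(start.argmax(-1)), int(end.argmax(-1)))
```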
Vietnamese Benchmarks
- ViInfographicVQA explores a range of recent VLMs, including:
- Ovis2.5-9B (leading single-image ANLS 0.71)
- Qwen2.5-VL-7B (single-image finetuned ANLS 0.678; multi-image 0.555)
- InternVL3.5-8B, MiniCPM-o2.6-8B, etc.
- Multi-image reasoning and non-extractive queries exhibit a 12–32 point ANLS drop relative to single-image extractive benchmarks (Van-Dinh et al., 13 Dec 2025).
| Model | InfographicVQA ANLS | ViInfographicVQA ANLS (Single) |
|---|---|---|
| Applica TILT | 0.6120 | — |
| Ovis2.5-9B | — | 0.71 |
| Qwen2.5-VL-7B | — | 0.678 (finetuned) |
| IG-BERT | 0.3854 | — |
| NAVER CLOVA | 0.3219 | — |
| M4C | 0.1470 | — |
5. Empirical Findings and Error Taxonomy
Current systems exhibit several systematic weaknesses, evidenced by quantitative error analysis:
- Answer-type Fragility: Strong performance on image-span/plain-text retrieval; substantial degradation for multi-span and non-span answers (drop of ≥20 ANLS points).
- Reasoning Complexity: Questions demanding arithmetic, counting, or sorting are 20–30% less accurate than those requiring simple lookup (Tito et al., 2021, Mathew et al., 2021).
- Evidence Source: Lowest scores observed on figures, tables, and layout-only evidence queries.
- Cross-image Synthesis: In ViInfographicVQA, cross-image questions (e.g., aligning semantically homologous fields across infographics) are the hardest category, with some models falling below 0.11 ANLS (Van-Dinh et al., 13 Dec 2025).
Common failure sources include:
- OCR misrecognition or omissions in diagrams and complex typographic regions.
- Inability to chain multiple reasoning steps (e.g., extracting numeric fields, then applying arithmetic).
- Layout confusion: models often fail to disambiguate overlapping or spatially adjacent labels within cluttered visuals.
- Hallucination or inconsistent unit normalization in generated answers.
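As one illustration of how the unit- and number-format failures above can be mitigated, here is a hedged sketch of an answer-normalization pass applied before exact-match or ANLS scoring; the regex patterns and scale words are assumptions, not a published normalizer.

```python
# Hedged sketch: canonicalize numeric answer strings before comparison/scoring.
import re

_SCALE = {"thousand": 1e3, "k": 1e3, "million": 1e6, "m": 1e6, "billion": 1e9, "b": 1e9}


def normalize_numeric(answer: str) -> str:
    """Map strings like '2.5 million', '2,500,000', or '2.5M' to one canonical form."""
    text = answer.strip().lower().replace(",", "")
    match = re.fullmatch(r"\$?\s*([0-9]*\.?[0-9]+)\s*(thousand|million|billion|[kmb])?\s*%?", text)
    if not match:
        return answer.strip().lower()   # non-numeric answers pass through unchanged
    value = float(match.group(1)) * _SCALE.get(match.group(2) or "", 1)
    return f"{value:g}"


assert normalize_numeric("2.5 million") == normalize_numeric("2,500,000") == "2.5e+06"
```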
6. Methodological Limitations and Cross-domain Comparisons
InfographicVQA tasks are notably more challenging than single-document (forms, letters) or document-collection VQA:
- Single-Document VQA: Largely extractive; leading systems reach ANLS ≈0.87, with heavy dependence on OCR+LLMs.
- Document Collection VQA: Evaluated primarily by list-type matching, reaching ANLSL ≈0.77.
- Infographics VQA: Lower best-in-class ANLS (≈0.61), reflecting the difficulty of interpreting graphical, spatial, and arithmetic information. Human upper-bound (0.98 ANLS) remains far from leading model performance (Tito et al., 2021).
Direct transfer of advances from Chart Question Answering (CQA) and Table VQA tasks (e.g., PReFIL, which achieves ≈92–96% on synthetic chart QA (Kafle et al., 2019)) to InfographicVQA is constrained by the greater diversity of layouts, question styles, and real-world noise in infographics.
7. Research Directions and Open Problems
Recommendations converge on several axes of future work:
- Numeric and Modular Reasoning: Incorporating neural arithmetic circuits or module networks able to perform multi-step operations natively within transformer architectures.
- Graph-based Layout Encoding: Explicit construction of region-interaction or element-graph representations to support layout-aware and relational reasoning (a minimal construction sketch appears after this list).
- Better Pretraining: Use of synthetically or richly annotated infographics for model pretraining, capturing object-level and chart semantics.
- Cross-image Retrieval and Planning: For multi-image tasks, exploration of retrieval-augmented transformers, memory-augmented architectures, and chain-of-thought approaches for compositional reasoning (Van-Dinh et al., 13 Dec 2025).
- Robust OCR Integration: Joint optimization of OCR, visual encoding, and downstream QA to reduce error propagation.
- Evaluation Enhancements: Complementing ANLS with numeric exactness, unit normalization, and human-in-the-loop faithfulness metrics.
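A minimal construction of the kind of region-interaction graph suggested under Graph-based Layout Encoding above, connecting each OCR region to its k nearest neighbors by box-center distance; the construction and parameter choices are illustrative assumptions, not taken from any cited model.

```python
# Minimal region-interaction graph over OCR boxes: k-nearest-neighbor edges
# by box-center distance. Illustrative only.
import numpy as np


def build_layout_graph(boxes: np.ndarray, k: int = 3) -> list[tuple[int, int]]:
    """boxes: (N, 4) array of (x0, y0, x1, y1); returns a directed edge list."""
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # exclude self-loops
    edges = []
    for i in range(len(boxes)):
        for j in np.argsort(dists[i])[:k]:          # k nearest regions
            edges.append((i, int(j)))
    return edges


# Example: four text regions laid out roughly in a 2x2 grid.
boxes = np.array([[0, 0, 10, 5], [20, 0, 30, 5],
                  [0, 10, 10, 15], [20, 10, 30, 15]], dtype=float)
print(build_layout_graph(boxes, k=2))
```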
A plausible implication is that further progress in InfographicVQA may depend on moving beyond shallow vision-language fusion toward explicitly computable intermediate representations, tightly coupled OCR-visual encoding, and modular compositionality, especially for low-resource languages and cross-image reasoning (Tito et al., 2021, Van-Dinh et al., 13 Dec 2025, Mathew et al., 2021).