ViInfographicVQA Benchmark for Vietnamese VQA
- ViInfographicVQA is a benchmark suite for Vietnamese infographic VQA that combines OCR, layout parsing, and numeric reasoning in both single-image and multi-image setups.
- It leverages around 5,000 real-world infographics and 30,000 QA pairs across diverse societal domains to challenge visual and textual understanding.
- Evaluation with ANLS and accuracy highlights current limitations in multi-image synthesis and precise numeric computation, motivating more advanced methods.
ViInfographicVQA defines a research domain and benchmark suite for Visual Question Answering (VQA) on Vietnamese-language infographics, encompassing both single-image document understanding and multi-image (cross-document) reasoning. This task combines the challenges of integrating OCR with layout analysis, visual feature extraction, discrete numeric/computational reasoning, and, uniquely, cross-infographic aggregation. Models are evaluated on their ability to parse, align, and reason over heterogeneous, highly structured visuals that embed text, charts, icons, and graphical layout motifs, with a focus on low-resource language settings (Van-Dinh et al., 13 Dec 2025).
1. Dataset Composition, Domains, and Annotation Protocol
ViInfographicVQA is constructed from approximately 5,000 real-world infographics obtained from infographics.vn, curated to maximize domain heterogeneity and layout complexity. The benchmark includes around 30,000 human-verified question–answer (QA) pairs distributed across major Vietnamese societal domains: Economics & Integration (18%), Healthcare & Community (16%), Culture & Society (14%), Disaster & Accident (12%), Sports & Arts (10%), with the remainder spanning education, environment, and technology.
Data preparation incorporates geometry filtering (aspect-ratio ∈ [0.33, 3.0], min short side 512 px) to guarantee OCR legibility. For each infographic, OCR tokenization and bounding box extraction are performed; panel, legend, and region-level embeddings are generated using pre-trained VLMs. Within-topic sets are generated by clustering embeddings (k=3 per topic), then assigned to train/validation/test splits on a (topic × answer-source) stratified basis, ensuring no cross-set leakage.
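A minimal sketch of the geometry filter and within-topic clustering, assuming a local image directory and pre-computed VLM embeddings; the function names and the PIL/scikit-learn choices are illustrative assumptions, and only the thresholds (aspect ratio in [0.33, 3.0], short side ≥ 512 px, k = 3) come from the description above.

```python
from pathlib import Path

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans


def passes_geometry_filter(path: Path,
                           min_ratio: float = 0.33,
                           max_ratio: float = 3.0,
                           min_short_side: int = 512) -> bool:
    """Keep only infographics whose geometry guarantees OCR legibility."""
    with Image.open(path) as img:
        w, h = img.size
    aspect = w / h
    return min_ratio <= aspect <= max_ratio and min(w, h) >= min_short_side


def cluster_topic(embeddings: np.ndarray, k: int = 3) -> np.ndarray:
    """Group one topic's infographic embeddings into k within-topic sets."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)


if __name__ == "__main__":
    kept = [p for p in Path("infographics").glob("*.png") if passes_geometry_filter(p)]
    print(f"{len(kept)} images pass the geometry filter")
    if len(kept) >= 3:
        # Embeddings would come from a pre-trained VLM; random vectors stand in here.
        labels = cluster_topic(np.random.rand(len(kept), 768))
        print("within-topic cluster sizes:", np.bincount(labels))
```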
QA annotation proceeds via semi-automatic generation: Gemini 2.0 Flash suggests candidate entities (text/charts/tables/icons/maps) and generates rule-based, templated QA pairs, which are then automatically validated for duplicates and consistency and manually reviewed by domain experts. Final curation ensures high-quality, faithful, and non-overlapping rationales. Layout complexity is pronounced: over 70% of infographics display multi-column or nested-panel structure, with an average of ~120 OCR tokens and ~12 graphical regions per image. The referenced visual element types are diagrams, graphs, maps, timelines/sequences, tables, free text, and pure visual/layout icons.
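The automatic duplicate/consistency validation step can be illustrated with a minimal sketch; the normalization rules and field names below are assumptions for illustration, not the benchmark's released validator.

```python
import unicodedata


def normalize(text: str) -> str:
    """Case-fold and NFC-normalize Vietnamese diacritics before comparison."""
    return unicodedata.normalize("NFC", text).strip().lower()


def validate_qa_pairs(candidates: list[dict]) -> list[dict]:
    """Drop duplicate questions and pairs with empty fields for one infographic."""
    seen, kept = set(), []
    for qa in candidates:
        q, a = normalize(qa["question"]), normalize(qa["answer"])
        if not q or not a:   # consistency: both fields must be non-empty
            continue
        if q in seen:        # duplicate question for the same infographic
            continue
        seen.add(q)
        kept.append(qa)
    return kept              # surviving pairs go on to manual expert review
```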
2. Task Definitions: Single-image and Multi-image VQA
ViInfographicVQA targets two modes: single-image VQA and multi-image (cross-document) VQA.
- Single-image VQA seeks to model a mapping $f: (I, q) \mapsto a$, where each instance is a tuple of an image $I$ and a text question $q$, and the output $a$ is an answer string or number. A typical pipeline proceeds through OCR tokenization, visual region detection, construction of a multimodal (text/vision/layout) graph, followed by feature fusion for answer prediction. Required reasoning skills include exact OCR matching, spatial localization within panels, attribute retrieval, and discrete calculation.
- Multi-image VQA expands the input space to groups of infographics $\{I_1, \dots, I_k\}$, with $k \ge 2$, and requires a mapping $f: (\{I_1, \dots, I_k\}, q) \mapsto a$. Here, models must align corresponding visual elements across documents, aggregate values, and perform synthesis. Challenges include cross-infographic chart/table alignment, summation and comparison over values distributed among panels, and non-extractive inferential reasoning.
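As a data-structure view of the two task modes, the sketch below encodes a single-image instance as one (image, question, answer) triple and a multi-image instance as a group of at least two images with a shared question; the dataclass layout is an illustrative assumption, not the dataset's on-disk schema.

```python
from dataclasses import dataclass, field


@dataclass
class SingleImageInstance:
    """One single-image VQA example: f(I, q) -> a."""
    image_path: str
    question: str
    answer: str            # answer string or number (stored as a string)


@dataclass
class MultiImageInstance:
    """One multi-image VQA example: f({I_1, ..., I_k}, q) -> a, with k >= 2."""
    image_paths: list[str] = field(default_factory=list)
    question: str = ""
    answer: str = ""

    def __post_init__(self) -> None:
        if len(self.image_paths) < 2:
            raise ValueError("multi-image instances require at least two images")
```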
3. Question and Answer Typology
ViInfographicVQA systematically annotates each question–answer pair according to answer source, reasoning skill, and required operations. Categories include:
- Image-span: Verbatim extraction of a contiguous span from the image (e.g., chart titles, key labels).
- Question-span: Multiple-choice selection, typically answerable from explicit options.
- Multi-span: Extraction of multiple, discontiguous spans, often as unordered lists from tables or legends.
- Non-extractive/numeric: Computational questions requiring arithmetic, counting, or ranking.
- Multi-image span, Cross-image synthesis, Non-span numeric: Extending to cross-infographic evidence alignment and numerical aggregation in the multi-image setting.
Question coverage references six major visual elements: diagrams, graphs, maps, timelines, tables, and icons. Layout complexity and the requirement for Vietnamese diacritic OCR further increase the diversity of challenges.
4. Evaluation Metrics and Protocol
Two principal metrics are employed: accuracy (exact string match) and Average Normalized Levenshtein Similarity (ANLS). The latter is defined as

$$\mathrm{ANLS} = \frac{1}{N}\sum_{i=1}^{N}\max_{j} s\!\left(a_{ij}, o_{i}\right), \qquad s(a, o) = \begin{cases} 1 - \mathrm{NL}(a, o) & \text{if } \mathrm{NL}(a, o) < \tau \\ 0 & \text{otherwise} \end{cases}$$

(where NL is the normalized Levenshtein distance between a ground-truth answer $a_{ij}$ and the prediction $o_i$, and $\tau$ is the acceptance threshold, conventionally 0.5), allowing partial credit for near-orthographic answers and limited tolerance for numeric-string or diacritic errors. Numeric answers are string-matched; layout normalization is not applied automatically. Models are evaluated on both single-image and multi-image splits, with splits kept non-overlapping by both sample and cluster.
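A self-contained reference implementation of ANLS under the thresholded formulation above is sketched below; the lower-casing and the τ = 0.5 default follow the conventional ANLS definition and are assumptions insofar as the paper does not restate them.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def anls(predictions: list[str], gold: list[list[str]], tau: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over a set of questions.

    Each question may have several acceptable gold answers; the best match counts.
    """
    scores = []
    for pred, answers in zip(predictions, gold):
        best = 0.0
        for ans in answers:
            nl = levenshtein(pred.lower(), ans.lower()) / max(len(pred), len(ans), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

Under this scoring, a prediction that differs from the gold answer only by a missing diacritic receives partial credit rather than zero.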
5. Baseline Models and Empirical Results
Seven state-of-the-art vision–language models (VLMs) are evaluated in both fine-tuned and zero-shot regimes: Phi-4-multimodal-5B, VideoLLaMA3 Image-7B, InternVL3.5-8B, MiniCPM-o2.6-8B, Molmo D-7B, Ovis2.5-9B, and Qwen2.5-VL-7B (QLoRA fine-tuned and base).
Single-image ANLS (selected):
| Model | Img-span | Multi-span | Non-extractive | Overall |
|---|---|---|---|---|
| Ovis2.5-9B | 78.2 | 61.4 | 61.2 | 71.0 |
| InternVL3.5-8B | 73.3 | 49.3 | 65.7 | 67.0 |
| Qwen2.5-VL-7B (finetune) | 72.7 | 53.9 | 63.9 | 67.8 |
Multi-image ANLS (selected):
| Model | Cross-synth | Multi-span | Non-span | Overall |
|---|---|---|---|---|
| Qwen2.5-VL-7B (finetune) | 56.6 | 58.9 | 47.7 | 55.5 |
| Qwen2.5-VL-7B | 56.3 | 58.3 | 48.0 | 54.9 |
| MiniCPM-o2.6-8B | 35.2 | 48.7 | 30.1 | 40.6 |
Performance is highest for short image-span and question-span extraction (ANLS > 75%), dips for non-extractive/numeric and multi-span reasoning (ANLS ~ 60%), and is lowest for cross-image synthesis (ANLS ~ 56%). These trends underscore persistent limitations of current VLMs in context aggregation, layout parsing, and symbolic computation.
6. Analysis of Failure Cases and System Limitations
Empirical analysis reveals that span-extraction tasks (single-word/phrase lookups) are reliably addressed by current systems. However, the following remain bottlenecks:
- Non-extractive numeric and multi-step chart reasoning: Models frequently err in arithmetic, counting, or aggregate comparison, often due to inadequate symbolic reasoning or inability to retrieve and synthesize values across visual regions.
- Layout-driven and multi-panel reasoning: Multimodal models that lack explicit layout graph construction struggle with images exhibiting highly non-linear, multi-column, or nested-panel layouts.
- Cross-image evidence synthesis: Multi-image VQA tasks expose model deficiencies in aligning semantic elements, combining information, and avoiding redundancy.
- Vietnamese OCR/diacritics: Text extraction with language-specific diacritics introduces frequent errors, especially on small or stylized fonts.
This suggests that the standard VLM pipeline, which fuses OCR tokens and global visual features, may be insufficient for the detailed document structure and high-order aggregation required in infographic VQA.
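To make the layout-structure limitation concrete, the sketch below builds a simple spatial graph over OCR blocks and detected regions, connecting nodes that are roughly row- or column-aligned; this is an illustrative construction (hypothetical thresholds and node schema), not a component of any evaluated system.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Node:
    """An OCR text block or detected graphical region with a pixel bounding box."""
    node_id: int
    kind: str                                    # e.g. "text", "chart", "icon"
    box: tuple[float, float, float, float]       # (x0, y0, x1, y1)


def center(box: tuple[float, float, float, float]) -> tuple[float, float]:
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def build_layout_graph(nodes: list[Node], max_gap: float = 40.0) -> list[tuple[int, int]]:
    """Connect nodes whose centers are close on one axis and nearly aligned on the other."""
    edges = []
    for a, b in combinations(nodes, 2):
        (ax, ay), (bx, by) = center(a.box), center(b.box)
        same_row = abs(ay - by) < max_gap and abs(ax - bx) < 4 * max_gap
        same_col = abs(ax - bx) < max_gap and abs(ay - by) < 4 * max_gap
        if same_row or same_col:
            edges.append((a.node_id, b.node_id))
    return edges   # edges would feed a GNN over text/vision/layout node features
```

Such a graph could serve as input to the layout-aware GNN reasoning discussed in the next section.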
7. Future Directions and Research Challenges
Three primary research directions are identified:
- Layout-Aware Reasoning: Integrating explicit panel/table/element graphs, potentially via GNNs operating over detected nodes (text blocks, figures, axes), may improve spatial and relational inference, as advocated in both dataset and challenge papers (Van-Dinh et al., 13 Dec 2025, Mathew et al., 2021).
- Tool-Augmented Numeric Computation: Deployment of symbolic calculators or lightweight arithmetic solvers, together with retrieval of cell-level numbers, can address persistent non-extractive numeric errors. Symbolic module integration is suggested as a concrete enhancement (a minimal sketch of this idea appears after this list).
- Cross-Image Aggregation and Graph Construction: Modular architectures capable of retrieval and cross-document alignment—potentially leveraging set-level memory representations and graph-processing—are a promising approach for multi-image VQA.
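The tool-augmented numeric computation direction can be illustrated with a minimal sketch in which extracted cell values are routed through a small symbolic solver instead of being computed by the VLM itself; the operation names and the Vietnamese-style number parsing (dot as thousands separator, comma as decimal mark) are assumptions for illustration.

```python
def parse_number(token: str) -> float:
    """Parse a Vietnamese-formatted numeric string, e.g. '1.234,5' or '12%'."""
    cleaned = token.replace("%", "").strip()
    cleaned = cleaned.replace(".", "").replace(",", ".")   # '1.234,5' -> '1234.5'
    return float(cleaned)


def solve(operation: str, values: list[str]) -> float:
    """Apply a symbolic operation to numbers retrieved from chart/table cells."""
    nums = [parse_number(v) for v in values]
    if operation == "sum":
        return sum(nums)
    if operation == "difference":
        return nums[0] - nums[1]
    if operation == "ratio":
        return nums[0] / nums[1]
    if operation == "count_greater_than":
        threshold, rest = nums[0], nums[1:]
        return float(sum(v > threshold for v in rest))
    raise ValueError(f"unsupported operation: {operation}")


# Example: aggregate two values read from different panels of an infographic.
print(solve("sum", ["1.234,5", "765,5"]))   # -> 2000.0
```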
Additional recommendations include expanding the metric suite beyond ANLS to include exact/relative numeric accuracy (e.g., “±1%” matching), unit normalization, and human-in-the-loop faithfulness checks. The systematic release and extension of Vietnamese-specific benchmarks are expected to accelerate cross-linguistic, document-aware, and multi-modal VQA research in low-resource settings, extending general-purpose VQA beyond the English-centric paradigm (Van-Dinh et al., 13 Dec 2025).
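The proposed exact/relative numeric accuracy ("±1%" matching) can be made precise with a short check; the fallback to exact string comparison for non-numeric answers is an illustrative assumption.

```python
def numeric_match(pred: str, gold: str, rel_tol: float = 0.01) -> bool:
    """Relative numeric accuracy: accept a prediction within ±1% of the gold value."""
    def to_float(s: str) -> float:
        # Strip percent signs and parse Vietnamese-style separators ('1.234,5' -> 1234.5).
        return float(s.replace("%", "").replace(".", "").replace(",", ".").strip())
    try:
        p, g = to_float(pred), to_float(gold)
    except ValueError:
        # Fall back to exact string match for non-numeric answers.
        return pred.strip().lower() == gold.strip().lower()
    return abs(p - g) <= rel_tol * max(abs(g), 1e-9)


print(numeric_match("99,2", "100"))   # True: within 1% of 100
print(numeric_match("97", "100"))     # False: outside the tolerance
```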