
ViInfographicVQA Benchmark for Vietnamese VQA

Updated 20 December 2025
  • ViInfographicVQA is a benchmark suite for Vietnamese infographic VQA that combines OCR, layout parsing, and numeric reasoning in both single-image and multi-image setups.
  • It leverages around 5,000 real-world infographics and 30,000 QA pairs across diverse societal domains to challenge visual and textual understanding.
  • Evaluation with ANLS and exact-match accuracy highlights current limitations in multi-image synthesis and precise numeric computation, motivating more advanced methods.

ViInfographicVQA defines a research domain and benchmark suite for Visual Question Answering (VQA) on Vietnamese-language infographics, encompassing both single-image document understanding and multi-image (cross-document) reasoning. This task combines the challenges of integrating OCR with layout analysis, visual feature extraction, discrete numeric/computational reasoning, and, uniquely, cross-infographic aggregation. Models are evaluated on their ability to parse, align, and reason over heterogeneous, highly structured visuals that embed text, charts, icons, and graphical layout motifs, with a focus on low-resource language settings (Van-Dinh et al., 13 Dec 2025).

1. Dataset Composition, Domains, and Annotation Protocol

ViInfographicVQA is constructed from approximately 5,000 real-world infographics obtained from infographics.vn, curated to maximize domain heterogeneity and layout complexity. The benchmark includes around 30,000 human-verified question–answer (QA) pairs distributed across major Vietnamese societal domains: Economics & Integration (18%), Healthcare & Community (16%), Culture & Society (14%), Disaster & Accident (12%), Sports & Arts (10%), with the remainder spanning education, environment, and technology.

Data preparation incorporates geometry filtering (aspect-ratio ∈ [0.33, 3.0], min short side 512 px) to guarantee OCR legibility. For each infographic, OCR tokenization and bounding box extraction are performed; panel, legend, and region-level embeddings are generated using pre-trained VLMs. Within-topic sets are generated by clustering embeddings (k=3 per topic), then assigned to train/validation/test splits on a (topic × answer-source) stratified basis, ensuring no cross-set leakage.
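
As an illustration of the preparation pipeline above, the sketch below applies the stated geometry thresholds and a (topic × answer-source) stratified split. The data layout and helper names are assumptions for illustration, not the authors' code, and the cluster-level assignment that prevents cross-set leakage is omitted for brevity.

```python
# Hypothetical sketch of geometry filtering and stratified splitting.
# Thresholds follow the text: aspect ratio in [0.33, 3.0], short side >= 512 px.
from collections import defaultdict
import random


def passes_geometry_filter(width: int, height: int) -> bool:
    """Keep images whose aspect ratio and resolution allow legible OCR."""
    aspect = width / height
    return 0.33 <= aspect <= 3.0 and min(width, height) >= 512


def stratified_split(samples, ratios=(0.8, 0.1, 0.1), seed=13):
    """Assign samples to train/val/test within each (topic, answer_source) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in samples:
        strata[(s["topic"], s["answer_source"])].append(s)

    splits = {"train": [], "val": [], "test": []}
    for items in strata.values():
        rng.shuffle(items)
        n_train = int(ratios[0] * len(items))
        n_val = int(ratios[1] * len(items))
        splits["train"] += items[:n_train]
        splits["val"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]
    return splits
```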

QA annotation proceeds via semi-automatic generation: Gemini 2.0 Flash suggests candidate entities (text/charts/tables/icons/maps) and generates rule-based, templated QA pairs, which are then automatically validated for duplicates and consistency and manually reviewed by domain experts. Final curation ensures high-quality, faithful, and non-overlapping rationales. Layout complexity is pronounced: over 70% of infographics display multi-column or nested-panel structure, with an average of ~120 OCR tokens and ~12 graphical regions per image. The primary visual element types referenced are diagrams, graphs, maps, timelines/sequences, tables, free text, and pure visual/layout icons.
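
A minimal sketch of the automatic duplicate/consistency screening applied to generated QA pairs before expert review; the normalization rule and field names are illustrative assumptions, not the benchmark's actual validation code.

```python
# Hypothetical QA-pair screening: drop exact duplicates per image and empty answers.
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip whitespace, and NFC-normalize Vietnamese diacritics."""
    return unicodedata.normalize("NFC", text.strip().lower())


def filter_qa_pairs(qa_pairs):
    """Keep only non-duplicate questions (per image) with non-empty answers."""
    seen, kept = set(), []
    for qa in qa_pairs:
        key = (qa["image_id"], normalize(qa["question"]))
        if key in seen or not qa["answer"].strip():
            continue
        seen.add(key)
        kept.append(qa)
    return kept
```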

2. Task Definitions: Single-image and Multi-image VQA

ViInfographicVQA targets two modes: single-image VQA and multi-image (cross-document) VQA.

  • Single-image VQA seeks to model $f_s : I \times Q \rightarrow A$, where each instance is a tuple (image, text question) and the output is an answer string or number. A typical pipeline proceeds through OCR tokenization, visual region detection, construction of a multimodal (text/vision/layout) graph, followed by feature fusion for answer prediction. Required reasoning skills include exact OCR matching, spatial localization within panels, attribute retrieval, and discrete calculation.
  • Multi-image VQA expands the input space to groups $S = \{I_1, \ldots, I_k\}$ with $k \geq 2$, and requires $f_m : 2^I \times Q \rightarrow A$. Here, models must align corresponding visual elements across documents, aggregate values, and perform synthesis. Challenges include cross-infographic chart/table alignment, summation and comparison over values distributed among panels, and non-extractive inferential reasoning (both task modes are sketched below).
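
A hypothetical interface sketch of the two mappings, with purely illustrative names and types: $f_s$ consumes one image and a question, $f_m$ consumes an image set and a question.

```python
# Illustrative task interfaces; not part of the benchmark's code.
from typing import Protocol, Sequence


class SingleImageVQA(Protocol):
    def __call__(self, image: bytes, question: str) -> str:
        """f_s : I x Q -> A, answer for one infographic."""
        ...


class MultiImageVQA(Protocol):
    def __call__(self, images: Sequence[bytes], question: str) -> str:
        """f_m : 2^I x Q -> A, answer synthesized over a set of infographics."""
        ...
```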

3. Question and Answer Typology

ViInfographicVQA systematically annotates each question–answer pair according to answer source, reasoning skill, and required operations. Categories include:

  • Image-span: Verbatim extraction of a contiguous span from the image (e.g., chart titles, key labels).
  • Question-span: Multiple-choice selection, typically answerable from explicit options.
  • Multi-span: Extraction of multiple, discontiguous spans, often as unordered lists from tables or legends.
  • Non-extractive/numeric: Computational questions requiring arithmetic, counting, or ranking.
  • Multi-image span, Cross-image synthesis, Non-span numeric: Extending to cross-infographic evidence alignment and numerical aggregation in the multi-image setting.

Question coverage references six major visual elements: diagrams, graphs, maps, timelines, tables, and icons. Layout complexity and the requirement for Vietnamese diacritic OCR further increase the diversity of challenges.

4. Evaluation Metrics and Protocol

Two principal metrics are employed: accuracy (exact string match) and Average Normalized Levenshtein Similarity (ANLS). The latter, defined as

$$\mathrm{ANLS} = \frac{1}{N} \sum_{i=1}^{N} \max_j s(a_{ij}, a_i^*), \qquad s(x, y) = \begin{cases} 1 - \mathrm{NL}(x, y) & \text{if } \mathrm{NL}(x, y) < 0.5 \\ 0 & \text{otherwise} \end{cases}$$

(where NL is the normalized Levenshtein distance), allows partial credit for near-orthographic answers and minimal tolerance for numeric string or diacritic errors. Numerics are string matched; layout normalization is not automatically applied. Models are evaluated on both single-image and multi-image splits, always with non-overlapping splits by both sample and cluster.
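
A minimal reference implementation of the ANLS formula above, assuming case-insensitive comparison and normalization of the edit distance by the longer string's length (common conventions that the benchmark's official scorer may refine).

```python
# Minimal ANLS sketch: per question, take the best similarity over gold answers,
# zero out scores whose normalized Levenshtein distance is >= 0.5, then average.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def anls(predictions, gold_answers, threshold=0.5):
    """predictions: list of strings; gold_answers: list of lists of gold strings."""
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for gold in golds:
            nl = levenshtein(pred.lower(), gold.lower()) / max(len(pred), len(gold), 1)
            if nl < threshold:
                best = max(best, 1.0 - nl)
        total += best
    return total / max(len(predictions), 1)
```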

5. Baseline Models and Empirical Results

Seven state-of-the-art vision–language models (VLMs) are evaluated in both fine-tuned and zero-shot regimes: Phi-4-multimodal-5B, VideoLLaMA3 Image-7B, InternVL3.5-8B, MiniCPM-o2.6-8B, Molmo D-7B, Ovis2.5-9B, and Qwen2.5-VL-7B (QLoRA fine-tuned and base).

Single-image ANLS (selected):

| Model | Img-span | Multi-span | Non-extractive | Overall |
|---|---|---|---|---|
| Ovis2.5-9B | 78.2 | 61.4 | 61.2 | 71.0 |
| InternVL3.5-8B | 73.3 | 49.3 | 65.7 | 67.0 |
| Qwen2.5-VL-7B (finetune) | 72.7 | 53.9 | 63.9 | 67.8 |

Multi-image ANLS (selected):

| Model | Cross-synth | Multi-span | Non-span | Overall |
|---|---|---|---|---|
| Qwen2.5-VL-7B (finetune) | 56.6 | 58.9 | 47.7 | 55.5 |
| Qwen2.5-VL-7B | 56.3 | 58.3 | 48.0 | 54.9 |
| MiniCPM-o2.6-8B | 35.2 | 48.7 | 30.1 | 40.6 |

Performance is highest for short image-span and question-span extraction (ANLS > 75%), dips for non-extractive/numeric and multi-span reasoning (ANLS ~ 60%), and is lowest for cross-image synthesis (ANLS ~ 56%). These trends underscore persistent limitations in current VLMs around context aggregation, layout parsing, and symbolic computation.

6. Analysis of Failure Cases and System Limitations

Empirical analysis reveals that span-extraction tasks (single-word/phrase lookups) are reliably addressed by current systems. However, the following remain bottlenecks:

  • Non-extractive numeric and multi-step chart reasoning: Models frequently err in arithmetic, counting, or aggregate comparison, often due to inadequate symbolic reasoning or inability to retrieve and synthesize values across visual regions.
  • Layout-driven and multi-panel reasoning: Multimodal models that lack explicit layout graph construction struggle with images exhibiting highly non-linear, multi-column, or nested-panel layouts.
  • Cross-image evidence synthesis: Multi-image VQA tasks expose model deficiencies in aligning semantic elements, combining information, and avoiding redundancy.
  • Vietnamese OCR/diacritics: Text extraction with language-specific diacritics introduces frequent errors, especially on small or stylized fonts.

These findings suggest that the standard VLM pipeline, which fuses OCR tokens and global visual features, may be insufficient for the detailed document structure and high-order aggregation required in infographic VQA.

7. Future Directions and Research Challenges

Three primary research directions are identified:

  1. Layout-Aware Reasoning: Integrating explicit panel/table/element graphs, potentially via GNNs operating over detected nodes (text blocks, figures, axes), may improve spatial and relational inference, as advocated in both dataset and challenge papers (Van-Dinh et al., 13 Dec 2025, Mathew et al., 2021).
  2. Tool-Augmented Numeric Computation: Deployment of symbolic calculators or lightweight arithmetic solvers, and retrieval of cell-level numbers, can address persistent non-extractive numeric errors. Symbolic module integration is suggested as a concrete enhancement; a minimal sketch follows this list.
  3. Cross-Image Aggregation and Graph Construction: Modular architectures capable of retrieval and cross-document alignment—potentially leveraging set-level memory representations and graph-processing—are a promising approach for multi-image VQA.
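
As an illustration of direction 2, the sketch below routes VLM-extracted cell values through a small symbolic solver instead of letting the model compute in free text. The operation names and calling convention are assumptions for illustration, not part of the benchmark.

```python
# Hypothetical tool-augmented numeric step: the VLM proposes an operation and
# the retrieved cell-level values; a deterministic solver executes it.
OPS = {
    "sum": lambda xs: sum(xs),
    "mean": lambda xs: sum(xs) / len(xs),
    "max": lambda xs: max(xs),
    "min": lambda xs: min(xs),
    "diff": lambda xs: xs[0] - xs[1],
    "rank_desc": lambda xs: sorted(range(len(xs)), key=lambda i: -xs[i]),
}


def solve_numeric(op: str, values: list):
    """Execute a VLM-proposed operation over retrieved cell-level numbers."""
    if op not in OPS:
        raise ValueError(f"unsupported operation: {op}")
    return OPS[op](values)


# Example: "What is the total across the three regional panels?"
# The VLM extracts the three panel values and proposes ("sum", [12.4, 9.1, 15.0]).
print(solve_numeric("sum", [12.4, 9.1, 15.0]))  # 36.5
```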

Additional recommendations include expanding the metric suite beyond ANLS to include exact/relative numeric accuracy (e.g., “±1%” matching), unit normalization, and human-in-the-loop faithfulness checks. The systematic release and extension of Vietnamese-specific benchmarks are expected to accelerate cross-linguistic, document-aware, and multi-modal VQA research in low-resource settings, extending general-purpose VQA beyond the English-centric paradigm (Van-Dinh et al., 13 Dec 2025).
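
For the suggested "±1%" relative numeric matching, a scoring helper might look as follows; the parsing rules (thousands separators, percent signs, fallback to exact string match) are illustrative assumptions rather than a specification from the paper.

```python
# Hypothetical relative-tolerance numeric matcher for benchmark scoring.
def numeric_match(pred: str, gold: str, rel_tol: float = 0.01) -> bool:
    """True if both strings parse as numbers within rel_tol of each other."""
    def parse(s: str):
        try:
            return float(s.replace(",", "").replace("%", "").strip())
        except ValueError:
            return None

    p, g = parse(pred), parse(gold)
    if p is None or g is None:
        return pred.strip() == gold.strip()  # fall back to exact string match
    return abs(p - g) <= rel_tol * max(abs(g), 1e-9)


print(numeric_match("1,250", "1262"))  # True: within 1% after removing separators
```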
