ViInfographicVQA Benchmark for Vietnamese VQA
- ViInfographicVQA is a benchmark suite for Vietnamese infographic VQA that combines OCR, layout parsing, and numeric reasoning in both single-image and multi-image setups.
- It leverages around 5,000 real-world infographics and 30,000 QA pairs across diverse societal domains to challenge visual and textual understanding.
- Evaluation with ANLS and accuracy highlights current limitations in multi-image synthesis and precise numeric computation, motivating more advanced methods.
ViInfographicVQA defines a research domain and benchmark suite for Visual Question Answering (VQA) on Vietnamese-language infographics, encompassing both single-image document understanding and multi-image (cross-document) reasoning. This task combines the challenges of integrating OCR with layout analysis, visual feature extraction, discrete numeric/computational reasoning, and, uniquely, cross-infographic aggregation. Models are evaluated on their ability to parse, align, and reason over heterogeneous, highly structured visuals that embed text, charts, icons, and graphical layout motifs, with a focus on low-resource language settings (Van-Dinh et al., 13 Dec 2025).
1. Dataset Composition, Domains, and Annotation Protocol
ViInfographicVQA is constructed from approximately 5,000 real-world infographics obtained from infographics.vn, curated to maximize domain heterogeneity and layout complexity. The benchmark includes around 30,000 human-verified question–answer (QA) pairs distributed across major Vietnamese societal domains: Economics & Integration (18%), Healthcare & Community (16%), Culture & Society (14%), Disaster & Accident (12%), Sports & Arts (10%), with the remainder spanning education, environment, and technology.
Data preparation incorporates geometry filtering (aspect-ratio ∈ [0.33, 3.0], min short side 512 px) to guarantee OCR legibility. For each infographic, OCR tokenization and bounding box extraction are performed; panel, legend, and region-level embeddings are generated using pre-trained VLMs. Within-topic sets are generated by clustering embeddings (k=3 per topic), then assigned to train/validation/test splits on a (topic × answer-source) stratified basis, ensuring no cross-set leakage.
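A minimal sketch of the geometry filter and within-topic clustering, assuming a local image directory and pre-computed VLM embeddings; the function names and the PIL/scikit-learn choices are illustrative assumptions, and only the thresholds (aspect ratio in [0.33, 3.0], short side ≥ 512 px, k = 3) come from the description above.

```python
from pathlib import Path

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans


def passes_geometry_filter(path: Path,
                           min_ratio: float = 0.33,
                           max_ratio: float = 3.0,
                           min_short_side: int = 512) -> bool:
    """Keep only infographics whose geometry guarantees OCR legibility."""
    with Image.open(path) as img:
        w, h = img.size
    aspect = w / h
    return min_ratio <= aspect <= max_ratio and min(w, h) >= min_short_side


def cluster_topic(embeddings: np.ndarray, k: int = 3) -> np.ndarray:
    """Group one topic's infographic embeddings into k within-topic sets."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)


if __name__ == "__main__":
    kept = [p for p in Path("infographics").glob("*.png") if passes_geometry_filter(p)]
    print(f"{len(kept)} images pass the geometry filter")
    if len(kept) >= 3:
        # Embeddings would come from a pre-trained VLM; random vectors stand in here.
        labels = cluster_topic(np.random.rand(len(kept), 768))
        print("within-topic cluster sizes:", np.bincount(labels))
```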
QA annotation proceeds via semi-automatic generation: Gemini 2.0 Flash suggests candidate entities (text/charts/tables/icons/maps) and generates rule-based, templated QA pairs, which are then automatically validated for duplicates and consistency and manually reviewed by domain experts. Final curation ensures high-quality, faithful, and non-overlapping rationales. Layout complexity is pronounced: over 70% of infographics display multi-column or nested-panel structure, with an average of ~120 OCR tokens and ~12 graphical regions per image. The referenced visual element types are diagrams, graphs, maps, timelines/sequences, tables, free text, and pure visual/layout icons.
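The automatic duplicate/consistency validation step can be illustrated with a minimal sketch; the normalization rules and field names below are assumptions for illustration, not the benchmark's released validator.

```python
import unicodedata


def normalize(text: str) -> str:
    """Case-fold and NFC-normalize Vietnamese diacritics before comparison."""
    return unicodedata.normalize("NFC", text).strip().lower()


def validate_qa_pairs(candidates: list[dict]) -> list[dict]:
    """Drop duplicate questions and pairs with empty fields for one infographic."""
    seen, kept = set(), []
    for qa in candidates:
        q, a = normalize(qa["question"]), normalize(qa["answer"])
        if not q or not a:   # consistency: both fields must be non-empty
            continue
        if q in seen:        # duplicate question for the same infographic
            continue
        seen.add(q)
        kept.append(qa)
    return kept              # surviving pairs go on to manual expert review
```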
2. Task Definitions: Single-image and Multi-image VQA
ViInfographicVQA targets two modes: single-image VQA and multi-image (cross-document) VQA.
- Single-image VQA seeks to model a mapping $f: (I, q) \mapsto a$, where each instance is a tuple of an image $I$ and a text question $q$, and the output $a$ is an answer string or number. A typical pipeline proceeds through OCR tokenization, visual region detection, construction of a multimodal (text/vision/layout) graph, followed by feature fusion for answer prediction. Required reasoning skills include exact OCR matching, spatial localization within panels, attribute retrieval, and discrete calculation.
- Multi-image VQA expands the input space to groups of infographics $\{I_1, \dots, I_k\}$, with $k \ge 2$, and requires a mapping $f: (\{I_1, \dots, I_k\}, q) \mapsto a$. Here, models must align corresponding visual elements across documents, aggregate values, and perform synthesis. Challenges include cross-infographic chart/table alignment, summation and comparison over values distributed among panels, and non-extractive inferential reasoning.
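As a data-structure view of the two task modes, the sketch below encodes a single-image instance as one (image, question, answer) triple and a multi-image instance as a group of at least two images with a shared question; the dataclass layout is an illustrative assumption, not the dataset's on-disk schema.

```python
from dataclasses import dataclass, field


@dataclass
class SingleImageInstance:
    """One single-image VQA example: f(I, q) -> a."""
    image_path: str
    question: str
    answer: str            # answer string or number (stored as a string)


@dataclass
class MultiImageInstance:
    """One multi-image VQA example: f({I_1, ..., I_k}, q) -> a, with k >= 2."""
    image_paths: list[str] = field(default_factory=list)
    question: str = ""
    answer: str = ""

    def __post_init__(self) -> None:
        if len(self.image_paths) < 2:
            raise ValueError("multi-image instances require at least two images")
```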
3. Question and Answer Typology
ViInfographicVQA systematically annotates each question–answer pair according to answer source, reasoning skill, and required operations. Categories include:
- Image-span: Verbatim extraction of a contiguous span from the image (e.g., chart titles, key labels).
- Question-span: Multiple-choice selection, typically answerable from explicit options.
- Multi-span: Extraction of multiple, discontiguous spans, often as unordered lists from tables or legends.
- Non-extractive/numeric: Computational questions requiring arithmetic, counting, or ranking.
- Multi-image span, Cross-image synthesis, Non-span numeric: Extending to cross-infographic evidence alignment and numerical aggregation in the multi-image setting.
Question coverage references six major visual elements: diagrams, graphs, maps, timelines, tables, and icons. Layout complexity and the requirement for Vietnamese diacritic OCR further increase the diversity of challenges.
4. Evaluation Metrics and Protocol
Two principal metrics are employed: accuracy (exact string match) and Average Normalized Levenshtein Similarity (ANLS). The latter is defined as

$$\mathrm{ANLS} = \frac{1}{N}\sum_{i=1}^{N}\max_{j} s\!\left(a_{ij}, o_{i}\right), \qquad s(a, o) = \begin{cases} 1 - \mathrm{NL}(a, o) & \text{if } \mathrm{NL}(a, o) < \tau \\ 0 & \text{otherwise} \end{cases}$$

(where NL is the normalized Levenshtein distance between a ground-truth answer $a_{ij}$ and the prediction $o_i$, and $\tau$ is the acceptance threshold, conventionally 0.5), allowing partial credit for near-orthographic answers and limited tolerance for numeric-string or diacritic errors. Numeric answers are string-matched; layout normalization is not applied automatically. Models are evaluated on both single-image and multi-image splits, with splits kept non-overlapping by both sample and cluster.
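A self-contained reference implementation of ANLS under the thresholded formulation above is sketched below; the lower-casing and the τ = 0.5 default follow the conventional ANLS definition and are assumptions insofar as the paper does not restate them.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def anls(predictions: list[str], gold: list[list[str]], tau: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over a set of questions.

    Each question may have several acceptable gold answers; the best match counts.
    """
    scores = []
    for pred, answers in zip(predictions, gold):
        best = 0.0
        for ans in answers:
            nl = levenshtein(pred.lower(), ans.lower()) / max(len(pred), len(ans), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

Under this scoring, a prediction that differs from the gold answer only by a missing diacritic receives partial credit rather than zero.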
5. Baseline Models and Empirical Results
Seven state-of-the-art vision–language models (VLMs) are evaluated in both fine-tuned and zero-shot regimes: Phi-4-multimodal-5B, VideoLLaMA3 Image-7B, InternVL3.5-8B, MiniCPM-o2.6-8B, Molmo D-7B, Ovis2.5-9B, and Qwen2.5-VL-7B (QLoRA fine-tuned and base).
Single-image ANLS (selected):
| Model | Img-span | Multi-span | Non-extractive | Overall |
|---|---|---|---|---|
| Ovis2.5-9B | 78.2 | 61.4 | 61.2 | 71.0 |
| InternVL3.5-8B | 73.3 | 49.3 | 65.7 | 67.0 |
| Qwen2.5-VL-7B (finetune) | 72.7 | 53.9 | 63.9 | 67.8 |
Multi-image ANLS (selected):
| Model | Cross-synth | Multi-span | Non-span | Overall |
|---|---|---|---|---|
| Qwen2.5-VL-7B (finetune) | 56.6 | 58.9 | 47.7 | 55.5 |
| Qwen2.5-VL-7B | 56.3 | 58.3 | 48.0 | 54.9 |
| MiniCPM-o2.6-8B | 35.2 | 48.7 | 30.1 | 40.6 |
Performance is highest for short image-span and question-span extraction (ANLS > 75%), dips for non-extractive/numeric and multi-span reasoning (ANLS ~ 60%), and is lowest for cross-image synthesis (ANLS ~ 56%). These trends underscore persistent limitations of current VLMs in context aggregation, layout parsing, and symbolic computation.
6. Analysis of Failure Cases and System Limitations
Empirical analysis reveals that span-extraction tasks (single-word/phrase lookups) are reliably addressed by current systems. However, the following remain bottlenecks:
- Non-extractive numeric and multi-step chart reasoning: Models frequently err in arithmetic, counting, or aggregate comparison, often due to inadequate symbolic reasoning or inability to retrieve and synthesize values across visual regions.
- Layout-driven and multi-panel reasoning: Multimodal models that lack explicit layout graph construction struggle with images exhibiting highly non-linear, multi-column, or nested-panel layouts.
- Cross-image evidence synthesis: Multi-image VQA tasks expose model deficiencies in aligning semantic elements, combining information, and avoiding redundancy.
- Vietnamese OCR/diacritics: Text extraction with language-specific diacritics introduces frequent errors, especially on small or stylized fonts.
This suggests that the standard VLM pipeline, which fuses OCR tokens and global visual features, may be insufficient for the detailed document structure and high-order aggregation required in infographic VQA.
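To make the layout-structure limitation concrete, the sketch below builds a simple spatial graph over OCR blocks and detected regions, connecting nodes that are roughly row- or column-aligned; this is an illustrative construction (hypothetical thresholds and node schema), not a component of any evaluated system.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Node:
    """An OCR text block or detected graphical region with a pixel bounding box."""
    node_id: int
    kind: str                                    # e.g. "text", "chart", "icon"
    box: tuple[float, float, float, float]       # (x0, y0, x1, y1)


def center(box: tuple[float, float, float, float]) -> tuple[float, float]:
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def build_layout_graph(nodes: list[Node], max_gap: float = 40.0) -> list[tuple[int, int]]:
    """Connect nodes whose centers are close on one axis and nearly aligned on the other."""
    edges = []
    for a, b in combinations(nodes, 2):
        (ax, ay), (bx, by) = center(a.box), center(b.box)
        same_row = abs(ay - by) < max_gap and abs(ax - bx) < 4 * max_gap
        same_col = abs(ax - bx) < max_gap and abs(ay - by) < 4 * max_gap
        if same_row or same_col:
            edges.append((a.node_id, b.node_id))
    return edges   # edges would feed a GNN over text/vision/layout node features
```

Such a graph could serve as input to the layout-aware GNN reasoning discussed in the next section.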
7. Future Directions and Research Challenges
Three primary research directions are identified:
- Layout-Aware Reasoning: Integrating explicit panel/table/element graphs, potentially via GNNs operating over detected nodes (text blocks, figures, axes), may improve spatial and relational inference, as advocated in both dataset and challenge papers (Van-Dinh et al., 13 Dec 2025, Mathew et al., 2021).
- Tool-Augmented Numeric Computation: Deployment of symbolic calculators or lightweight arithmetic solvers, together with retrieval of cell-level numbers, can address persistent non-extractive numeric errors. Symbolic module integration is suggested as a concrete enhancement (a minimal sketch of this idea appears after this list).
- Cross-Image Aggregation and Graph Construction: Modular architectures capable of retrieval and cross-document alignment—potentially leveraging set-level memory representations and graph-processing—are a promising approach for multi-image VQA.
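The tool-augmented numeric computation direction can be illustrated with a minimal sketch in which extracted cell values are routed through a small symbolic solver instead of being computed by the VLM itself; the operation names and the Vietnamese-style number parsing (dot as thousands separator, comma as decimal mark) are assumptions for illustration.

```python
def parse_number(token: str) -> float:
    """Parse a Vietnamese-formatted numeric string, e.g. '1.234,5' or '12%'."""
    cleaned = token.replace("%", "").strip()
    cleaned = cleaned.replace(".", "").replace(",", ".")   # '1.234,5' -> '1234.5'
    return float(cleaned)


def solve(operation: str, values: list[str]) -> float:
    """Apply a symbolic operation to numbers retrieved from chart/table cells."""
    nums = [parse_number(v) for v in values]
    if operation == "sum":
        return sum(nums)
    if operation == "difference":
        return nums[0] - nums[1]
    if operation == "ratio":
        return nums[0] / nums[1]
    if operation == "count_greater_than":
        threshold, rest = nums[0], nums[1:]
        return float(sum(v > threshold for v in rest))
    raise ValueError(f"unsupported operation: {operation}")


# Example: aggregate two values read from different panels of an infographic.
print(solve("sum", ["1.234,5", "765,5"]))   # -> 2000.0
```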
Additional recommendations include expanding the metric suite beyond ANLS to include exact/relative numeric accuracy (e.g., “±1%” matching), unit normalization, and human-in-the-loop faithfulness checks. The systematic release and extension of Vietnamese-specific benchmarks are expected to accelerate cross-linguistic, document-aware, and multi-modal VQA research in low-resource settings, extending general-purpose VQA beyond the English-centric paradigm (Van-Dinh et al., 13 Dec 2025).
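The proposed exact/relative numeric accuracy ("±1%" matching) can be made precise with a short check; the fallback to exact string comparison for non-numeric answers is an illustrative assumption.

```python
def numeric_match(pred: str, gold: str, rel_tol: float = 0.01) -> bool:
    """Relative numeric accuracy: accept a prediction within ±1% of the gold value."""
    def to_float(s: str) -> float:
        # Strip percent signs and parse Vietnamese-style separators ('1.234,5' -> 1234.5).
        return float(s.replace("%", "").replace(".", "").replace(",", ".").strip())
    try:
        p, g = to_float(pred), to_float(gold)
    except ValueError:
        # Fall back to exact string match for non-numeric answers.
        return pred.strip().lower() == gold.strip().lower()
    return abs(p - g) <= rel_tol * max(abs(g), 1e-9)


print(numeric_match("99,2", "100"))   # True: within 1% of 100
print(numeric_match("97", "100"))     # False: outside the tolerance
```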