
TableVQA-Bench: Visual Table Reasoning Benchmark

Updated 19 February 2026
  • TableVQA-Bench is a benchmark that evaluates the table understanding capabilities of vision-language models using both real and synthetic table images across diverse domains.
  • It employs a multi-stage LLM-driven pipeline to generate balanced QA pairs focused on tasks like cell lookup, aggregation, and comparison.
  • The benchmark uses strict and relaxed evaluation metrics to systematically analyze model performance, highlighting challenges in visual versus text-based inputs.

TableVQA-Bench is a benchmark designed to evaluate the table understanding capabilities of vision-language models, with particular emphasis on multi-modal LLMs (MLLMs) (Kim et al., 2024). The benchmark combines real and synthetic table images spanning several domains with carefully curated question–answer (QA) pairs, enabling systematic analysis of model performance under diverse table layouts and reasoning tasks.

1. Dataset Construction and Domains

The TableVQA-Bench corpus aggregates four distinct sub-domains, constructed to encompass a range of table types and visual phenomena:

  • VWTQ (Visual WikiTableQuestions): Extracted from Wikipedia’s WTQ tables, images are generated by rendering raw HTML/CSS through Wikipedia’s stylesheet, preserving authentic font, border, and color information.
  • VWTQ-Syn (Synthesized Visual WTQ) and VTabFact: Tables rendered from either structure-only WTQ HTML or pseudo-HTML generated from TabFact CSVs, then randomly styled using Bootstrap-inspired tags (background color, borders, typography). These are rendered in a headless browser (Puppeteer/Chrome) with variable viewports and compressed as JPEGs. Human feedback filters out abnormal styles.
  • FinTabNetQA: Derived from FinTabNet’s annual report corpus, featuring complex financial tables. Table images are captured from PDF→PNG renders. No original QA exists in FinTabNet; these are generated separately.
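The random restyling step used for VWTQ-Syn and VTabFact can be sketched as follows. The style palette and the helper name `randomly_style_table` are illustrative assumptions, not code from the release; in the actual pipeline the styled HTML is then rendered in a headless browser (Puppeteer/Chrome).

```python
import random

# Illustrative Bootstrap-like style values (assumed, not taken from
# the TableVQA-Bench release).
BACKGROUNDS = ["#ffffff", "#f8f9fa", "#e9ecef", "#fff3cd"]
BORDERS = ["1px solid #dee2e6", "2px solid #adb5bd", "none"]
FONTS = ["Arial, sans-serif", "Georgia, serif", "Courier, monospace"]

def randomly_style_table(table_html, seed=None):
    """Wrap a bare <table> fragment in a randomly chosen inline style,
    mimicking the Bootstrap-inspired restyling described above."""
    rng = random.Random(seed)
    style = (
        f"background-color: {rng.choice(BACKGROUNDS)}; "
        f"border: {rng.choice(BORDERS)}; "
        f"font-family: {rng.choice(FONTS)};"
    )
    # Inject the style attribute into the first <table> tag only.
    return table_html.replace("<table", f'<table style="{style}"', 1)
```

Human feedback would then filter out renders with abnormal styles, as the pipeline description notes.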

QA pairs are produced via a multi-stage LLM-driven pipeline, primarily leveraging GPT-4:

  1. Table in HTML (or pseudo-HTML) is provided as LLM input.
  2. The LLM is prompted to draft 4–6 QA pairs per table, intentionally balanced across counting, lookup, aggregation, comparison, and date/percentage extraction.
  3. Answer templates are tailored: for TabFact, statements are verified as “True/False.” For FinTabNetQA, queries focus on numeric scales and unit extraction.
  4. Automated filtering removes trivial or repeated QA pairs.
  5. Human validation enforces scale consistency, discards tables exceeding 50 rows, and manually reviews for hallucinations or formatting discrepancies.
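Step 4 (automated filtering) can be sketched as a simple dedup-and-triviality pass; the word-count threshold and helper name below are illustrative assumptions, not the benchmark's actual filter:

```python
def filter_qa_pairs(qa_pairs, min_question_words=4):
    """Drop repeated or trivially short QA pairs.

    qa_pairs: iterable of (question, answer) tuples.
    Returns the surviving (question, answer) tuples in order.
    """
    seen = set()
    kept = []
    for question, answer in qa_pairs:
        key = question.strip().lower()
        if key in seen:
            continue  # repeated question for this table
        if len(key.split()) < min_question_words:
            continue  # too short to probe table reasoning
        if not str(answer).strip():
            continue  # empty answer
        seen.add(key)
        kept.append((question, answer))
    return kept
```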

Dataset Statistics:

| Domain      | Real Image | Human QA | #Images | #QA  |
|-------------|------------|----------|---------|------|
| VWTQ        | ✓          | ✓        | 315     | 750  |
| VWTQ-Syn    | ×          | ✓        | 150     | 250  |
| VTabFact    | ×          | ✓        | 224     | 250  |
| FinTabNetQA | ✓          | ×        | 205     | 250  |
| Total       |            |          | 894     | 1500 |

Each domain maintains a consistent format: a single table image, a natural-language question, and a short answer (text value or “True/False”) (Kim et al., 2024, Lagos et al., 15 Jul 2025).

2. Evaluation Protocol and Metrics

The primary metric is strict accuracy: proportion of exact matches between model predictions and ground truth answers.

\text{Accuracy} = \frac{\#\,\text{identical predictions}}{\#\,\text{total questions}}

  • For VWTQ, VWTQ-Syn, and VTabFact, only matches with ground-truth string or “True/False” label are accepted.
  • For FinTabNetQA, a relaxed “relieved-accuracy” is adopted, disregarding differences in numerical scale representation (e.g., “128 million” ≡ “128,000,000” ≡ “128”).
  • For CogAgent-VQA*, a prediction is counted as correct if the ground-truth answer appears as a substring anywhere in the response.
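The three matching rules can be sketched in Python. The scale normalization (repeatedly dividing by 1000 while the value stays an integral multiple, so "128 million", "128,000,000" and "128" coincide) is an assumption about how the relaxed metric behaves, not the benchmark's exact code:

```python
import re

def strict_match(pred, gold):
    """Strict accuracy: exact match after trimming and lowercasing."""
    return pred.strip().lower() == gold.strip().lower()

def _scale_free_number(text):
    """Extract the leading number and discard factor-of-1000 scale,
    mapping '128 million', '128,000,000' and '128' all to 128.0.
    (This normalization rule is an illustrative assumption.)"""
    m = re.search(r"-?\d[\d,]*\.?\d*", text)
    if m is None:
        return None
    value = float(m.group().replace(",", ""))
    while value >= 1000 and value % 1000 == 0:
        value /= 1000
    return value

def relieved_match(pred, gold):
    """Relaxed ('relieved') accuracy used for FinTabNetQA."""
    p, g = _scale_free_number(pred), _scale_free_number(gold)
    if p is not None and g is not None:
        return p == g
    return strict_match(pred, gold)

def substring_match(pred, gold):
    """CogAgent-VQA*-style scoring: gold appears anywhere in response."""
    return gold.strip().lower() in pred.lower()
```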

Secondary Metrics and Analyses:

  • Vision-query ablation: Models are assessed as the number of “vision queries” (cross-attention tokens sent to the vision encoder) varies, probing the dependency of table comprehension on visual token breadth.
  • Two-stage pipeline accuracy: Vision model reconstructs HTML from table image, which is then passed to text-only LLM for QA.
  • Tree Edit Distance Similarity (TEDS):

\mathrm{TEDS} = 1 - \frac{\mathrm{EditDistance}(\text{pred}, \text{gt})}{\max(\lvert \text{pred} \rvert, \lvert \text{gt} \rvert)}

This measures alignment between predicted/reconstructed and ground-truth table structures.
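A minimal sketch of this similarity, applied here to serialized table strings (a simplification: the full metric operates on the HTML tree rather than the raw string):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (O(len(a)*len(b)))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def teds(pred, gt):
    """Normalized similarity: 1 - EditDistance / max length."""
    if not pred and not gt:
        return 1.0
    return 1 - edit_distance(pred, gt) / max(len(pred), len(gt))
```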

The inclusion of these secondary metrics enables quantitative decomposition of errors attributable to visual parsing, model architecture design, and cross-modal information flow (Kim et al., 2024, Lagos et al., 15 Jul 2025).

3. Experimental Results

Overall accuracy results by domain:

| Input  | Model       | VWTQ | VWTQ-Syn | VTabFact | FinTabNetQA | Avg  |
|--------|-------------|------|----------|----------|-------------|------|
| Vision | GPT-4V      | 42.5 | 52.0     | 68.0     | 79.6        | 54.5 |
| Vision | Gemini-ProV | 26.7 | 33.2     | 55.6     | 60.8        | 38.3 |
| Text   | GPT-4       | 68.1 | 69.6     | 80.0     | 98.8        | 75.5 |
| Text   | Gemini-Pro  | 56.4 | 61.2     | 69.6     | 96.4        | 66.1 |
| Text   | GPT-3.5     | 50.5 | 54.4     | 68.0     | 93.2        | 61.2 |

Key findings:

  • GPT-4V (visual input) achieves the top vision-only performance at 54.5% average, significantly outperforming alternative MLLMs (Gemini-ProV, etc.).
  • GPT-4 (text input) achieves 75.5% average, consistently outperforming its vision-enabled counterparts by over 20 points.
  • Vision-query analysis: Increasing vision queries (e.g., CogVLM from 256 to 1225 tokens) raises accuracy from 7.5% to 16.3%; similar gains noted in SPHINX-v1.
  • Two-stage inference (table extraction followed by LLM QA):
    • GPT-4V→GPT-4: 60.7% (+6.2 relative to GPT-4V).
    • Gemini-ProV→Gemini-Pro: 48.6% (+10.3 over Gemini-ProV).

These trends hold for both strict matching and domains that relax answer normalization (scale units, date formats) (Kim et al., 2024, Lagos et al., 15 Jul 2025).
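The two-stage inference evaluated above can be wired as a simple composition. Here `vision_model` and `text_llm` are placeholder callables standing in for GPT-4V and GPT-4 (no real API clients are invoked), and the prompt template is an illustrative assumption:

```python
def two_stage_vqa(image_bytes, question, vision_model, text_llm):
    """Stage 1: reconstruct the table as HTML from the image.
    Stage 2: answer the question from the reconstructed HTML text.

    vision_model: callable(bytes) -> HTML string
    text_llm:     callable(prompt string) -> answer string
    """
    table_html = vision_model(image_bytes)
    prompt = (
        "Given the following table in HTML, answer concisely.\n"
        f"Table: {table_html}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return text_llm(prompt)
```

Because the two stages are decoupled, the table-recognition model and the QA model can be swapped independently, which is how the GPT-4V→GPT-4 and Gemini-ProV→Gemini-Pro pairings above are formed.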

4. Taxonomy and Reasoning Types

The benchmark covers a spectrum of question archetypes designed to probe different reasoning pathways:

  • Cell lookup (retrieval of a single cell given row/column anchors).
  • Aggregation (sum, count, mean, extremum identification).
  • Comparison (direct numerical or symbolic comparison across rows/columns).
  • Date/percentage parsing (extraction and format normalization).
  • Fact verification (VTabFact subset: question is a Wikipedia-style statement, to be assessed as “True/False”).
  • Multi-header/complex layout comprehension (FinTabNetQA).

These taxonomies are not explicitly annotated in the TableVQA-Bench release but are evident from the pipeline design and analysis across splits (Kim et al., 2024, Lagos et al., 15 Jul 2025).
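Absent official annotations, the taxonomy could be approximated with a keyword heuristic; the rules below are illustrative assumptions, not part of the benchmark release:

```python
import re

# Ordered keyword rules (first match wins); patterns are assumptions.
RULES = [
    ("aggregation", r"\b(total|sum|average|mean|how many|count|highest|lowest)\b"),
    ("comparison", r"\b(more|less|greater|fewer|compared|than)\b"),
    ("fact_verification", r"^(is|was|does|did|are|were)\b"),
    ("date_percentage", r"\b(year|date|percent|%)\b"),
]

def tag_question(question):
    """Assign a coarse reasoning-type label to a question string."""
    q = question.strip().lower()
    for label, pattern in RULES:
        if re.search(pattern, q):
            return label
    return "cell_lookup"  # default: single-cell retrieval
```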

5. Comparative Analysis: TableVQA-Bench in Context

TableVQA-Bench is positioned as a medium-scale, multi-source evaluation suite. Comparative benchmarks (for context):

| Benchmark           | #Tables | #QA Pairs | Domains     | Noteworthy Aspect |
|---------------------|---------|-----------|-------------|-------------------|
| TableVQA-Bench      | 894     | 1500      | 4           | Real + synthetic, multi-domain |
| Visual-TableQA      | 2500    | 6000      | open-domain | LaTeX-rendered, “infinite” layouts, high-depth reasoning (Lompo et al., 9 Sep 2025) |
| ComTQA (TabPedia)   | 1591    | 9070      | 2           | PDF-based, scientific/financial, real-world noise (Zhao et al., 2024) |
| Table-VQA (AgDeTQA) | 16400   | 82300     | technical   | Technical focus, less layout diversity (Lompo et al., 9 Sep 2025) |
| TabFact             | N/A     | 12722     | Wikipedia   | Text-based fact verification |

Key distinctions:

  • TableVQA-Bench combines real rendered, PDF, and synthetic table images with LLM-generated QA covering cell-level, arithmetic, and logical reasoning.
  • Visual-TableQA provides greater layout and reasoning diversity but relies on synthetic (LaTeX) generation; TableVQA-Bench templates more closely mirror web and financial documents.
  • ComTQA emphasizes noisy, real-world document images, complex OCR, and logical reasoning but is not public for training.

A plausible implication is that TableVQA-Bench provides a unique balance between real-world authenticity, diversity, and tight QA curation, whereas larger or more synthetic datasets stress layout generalization or reasoning depth (Kim et al., 2024, Lompo et al., 9 Sep 2025, Zhao et al., 2024).

6. Limitations and Future Directions

Noted limitations of TableVQA-Bench (Kim et al., 2024):

  • Scale: With 1,500 QA pairs, coverage is moderate, limiting statistical power for rare phenomena.
  • Style realism: Synthetic renderings may fail to capture edge-case or degraded table styles (hand-drawn, scanned, complex multi-page).
  • Prompt bias: The reliance on GPT-4 prompts and completions may bias QA distributions—e.g., overusing certain question templates.
  • Domain coverage: Limited inclusion of scientific, medical, or non-Wikipedia tables. Expansion to additional domains would increase generalizability.

Proposed future work includes: scaling to multi-page/nested tables; integration of advanced OCR and table structure recognition modules; adversarial and compositional QA involving joint reasoning over tables and charts.

Enhancements such as multi-turn QA, denser annotation of reasoning types, and expanded support for non-English tables remain open research challenges (Kim et al., 2024).

7. Significance for Vision-LLM Development

TableVQA-Bench provides critical insights:

  • Visual table reasoning remains substantially more challenging than text-based reasoning for current MLLMs, as indicated by accuracies that are consistently more than 20 points lower despite higher inference cost.
  • Increasing the visual input granularity (more vision queries/tokens, higher resolution) yields tangible accuracy improvements, validating the need for capacity scaling.
  • Two-stage pipelines that decompose table image parsing and downstream textual QA outperform end-to-end baselines, suggesting a near-term optimality for modular system design (Kim et al., 2024, Lagos et al., 15 Jul 2025).

The benchmark has thus become a reference point for both the development of table-specialized MLLMs and the design of explainable, auditable VQA pipelines, as demonstrated in works adopting detailed chain-of-thought reasoning and symbolic code generation (Lagos et al., 15 Jul 2025).

In sum, TableVQA-Bench constitutes a foundational medium-scale dataset for systematically quantifying and advancing visual table reasoning in multi-modal AI systems, with established baselines, diverse domains, and transparent evaluation methodology. It complements larger-scale or open-domain datasets by foregrounding authentic renderings, carefully curated QA, and explicit partitioning by reasoning type and table structure.
