StructQA Benchmark for Table Reasoning

Updated 7 December 2025
  • StructQA Benchmark is a diagnostic suite that rigorously assesses LLMs' understanding of tabular structure with minimal assumptions about table heterogeneity.
  • It comprises 7,500 QA pairs from real Wikipedia tables, covering tasks like cell location, column lookup, row lookup, and comprehension.
  • The benchmark employs precise metrics and the TAMO framework to expose the limitations of text-only serialization approaches in LLMs.

StructQA Benchmark is a diagnostic suite introduced to evaluate LLMs on structural reasoning over tabular data, focusing on semantic understanding of tabular structure and permutation invariance. The benchmark emphasizes minimal assumptions about real-world table heterogeneity and aims to quantify model performance beyond plain row/column serialization by explicitly measuring whether models recognize underlying table structure. StructQA underpins the TAMO (Table As a Modality) framework and is specifically constructed to reveal systematic failings of text-only table serialization approaches, including state-of-the-art LLMs such as GPT-4 and open-source competitors (Li et al., 30 Nov 2025).

1. Dataset Definition and Construction

The StructQA dataset is composed of 500 real-world flat Wikipedia tables sampled from the WikiTableQuestions corpus. Each table is paired with five distinct structural reasoning tasks, and for each task, three natural-language question templates are defined, yielding 5 × 3 × 500 = 7,500 question–answer pairs. Flatness here denotes the absence of nested hierarchies beyond column header groupings; tables contain only leaf cells holding text, numerical values, or dates, with no missing or corrupted cells. The schema broadly covers categorical, quantitative, and temporal data as found in Wikipedia’s open-domain tables.

The benchmark is split into 60% train, 20% development, and 20% test partitions, preserving table diversity and type variety.

Task categories are:

  1. Cell location: “What is the value in column C of row r?”
  2. Column lookup: “In row r, which column(s) contain value v?”
  3. Row lookup: “In column C, which row(s) contain value v?”
  4. Column comprehension: “List all distinct values in column C.”
  5. Row comprehension: “List all values in row r across all columns.”

These tasks are specifically designed to require non-trivial access to table structure, demanding that models recognize column–row relationships and set-based operations rather than relying on simple text pattern matching (Li et al., 30 Nov 2025).
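
For illustration, a minimal sketch of template-based QA-pair generation for the cell-location task is shown below; the template wording, the `table` dictionary layout, and the helper name are assumptions for exposition, not the authors' released construction pipeline.

```python
import random

def generate_structqa_pairs(table, n_templates=3, seed=0):
    """Generate cell-location QA pairs from a flat table.

    `table` is assumed to be a dict with keys "header" (list of column
    names) and "rows" (list of lists of cell strings). The paper defines
    three templates per task; the wording here is illustrative.
    """
    rng = random.Random(seed)
    templates = [
        "What is the value in column {col} of row {row}?",
        "In row {row}, what does column {col} contain?",
        "Report the cell located at row {row}, column {col}.",
    ][:n_templates]

    pairs = []
    for template in templates:
        r = rng.randrange(len(table["rows"]))
        c = rng.randrange(len(table["header"]))
        question = template.format(row=r + 1, col=table["header"][c])
        answer = table["rows"][r][c]
        pairs.append({"question": question, "answer": answer})
    return pairs

# Example usage on a toy table
toy = {"header": ["Country", "Gold"], "rows": [["Norway", "16"], ["Germany", "12"]]}
print(generate_structqa_pairs(toy))
```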

2. Evaluation Protocol and Metrics

StructQA uses three main metrics to quantify model performance:

  • Exact-match accuracy:

$$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl[\hat{A}_i = A_i\bigr]$$

where $N$ is the number of test questions, $\hat{A}_i$ is the predicted answer, and $A_i$ is the ground-truth answer for the $i$-th instance.

  • Permutation accuracy:

$$\mathrm{Acc}_{\pi} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl[\hat{A}_i^{\pi} = A_i\bigr]$$

where $\hat{A}_i^{\pi}$ is the model prediction on tables with rows and columns independently shuffled.

  • Robustness:

$$\mathrm{Robust} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl[\hat{A}_i = \hat{A}_i^{\pi}\bigr]$$

capturing how often model outputs remain identical before and after table permutation (permutation invariance).

These metrics are deliberately designed to penalize models that overfit to serialization order and to assess genuine structural understanding (Li et al., 30 Nov 2025).
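
Given model predictions on both the original and independently shuffled tables, all three metrics can be computed together. The sketch below assumes exact string matching of answers and a hypothetical `predict(table, question)` model interface; the table layout follows the toy dictionary format used earlier.

```python
import random

def shuffle_table(table, seed=0):
    """Independently permute the rows and columns of a flat table."""
    rng = random.Random(seed)
    row_perm = list(range(len(table["rows"])))
    col_perm = list(range(len(table["header"])))
    rng.shuffle(row_perm)
    rng.shuffle(col_perm)
    header = [table["header"][c] for c in col_perm]
    rows = [[table["rows"][r][c] for c in col_perm] for r in row_perm]
    return {"header": header, "rows": rows}

def evaluate(predict, examples):
    """Compute exact-match accuracy, permutation accuracy, and robustness.

    `predict(table, question) -> str` is a hypothetical model call; each
    example holds a table, a question, and a gold answer string.
    """
    acc = perm_acc = robust = 0
    for ex in examples:
        pred = predict(ex["table"], ex["question"])
        pred_pi = predict(shuffle_table(ex["table"]), ex["question"])
        acc += pred == ex["answer"]          # exact match on the original table
        perm_acc += pred_pi == ex["answer"]  # exact match after shuffling
        robust += pred == pred_pi            # prediction unchanged by shuffling
    n = len(examples)
    return {"Acc": acc / n, "Acc_pi": perm_acc / n, "Robust": robust / n}
```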

3. Baseline Model Performance

A diverse set of LLMs and fine-tuning strategies were evaluated on StructQA, including zero-shot text-serialization, prompt-tuning, LoRA, full supervised fine-tuning (SFT), and the multimodal TAMO architecture. The table below summarizes core reported results as measured on the test set (Li et al., 30 Nov 2025):

| Setting | Model | Acc (%) | Perm Acc (%) | Robust (%) |
|---|---|---|---|---|
| Zero-shot | Pure text | 8.60 | 7.19 | 16.3 |
| Prompt-tuning | Pure text | 37.80 | 29.93 | 31.1 |
| Prompt-tuning | TAMO | 59.07 | 43.47 | 43.8 |
| LoRA | Pure text | 45.67 | 35.87 | 39.7 |
| LoRA | TAMO | 70.80 | 42.77 | 53.7 |
| SFT | Pure text | 62.73 | 54.80 | 51.6 |
| SFT | TAMO | 71.60 | 63.89 | 64.1 |

Commercial LLMs (GPT-3.5, GPT-4, GPT-4.1) and strong open models (DeepSeek-R1) exhibit moderate performance (e.g., GPT-4: 51.40%). Notably, specialist baselines such as TableLlama obtain much lower numbers, underscoring that vanilla table-focused models may not generalize to StructQA’s emphasis on structural diagnosis (Li et al., 30 Nov 2025).

4. TAMO Methodology Applied to StructQA

The TAMO (Table As a Modality) architecture demonstrates the potential of multimodal table understanding by combining a hypergraph neural network table encoder with a standard LLM. Each table $\mathcal{T}$ is modeled as a hypergraph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, with nodes for leaf cells and hyperedges for rows, columns, and headers.
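
As one concrete way to realize this modeling step, the sketch below builds node and hyperedge sets for a flat table. The incidence-list representation and the choice to tie each header hyperedge to its column's cells are assumptions for illustration, not the authors' exact construction.

```python
def table_to_hypergraph(table):
    """Build a hypergraph from a flat table.

    Nodes are leaf cells (flattened row-major); hyperedges group the cells
    of each row and of each column, plus one hyperedge per header that spans
    the cells of its column.
    """
    n_rows, n_cols = len(table["rows"]), len(table["header"])
    nodes = [table["rows"][r][c] for r in range(n_rows) for c in range(n_cols)]

    hyperedges = {}
    for r in range(n_rows):                      # one hyperedge per row
        hyperedges[f"row_{r}"] = [r * n_cols + c for c in range(n_cols)]
    for c in range(n_cols):                      # one hyperedge per column
        hyperedges[f"col_{c}"] = [r * n_cols + c for r in range(n_rows)]
    for c, name in enumerate(table["header"]):   # header hyperedge over its column
        hyperedges[f"header_{name}"] = [r * n_cols + c for r in range(n_rows)]
    return nodes, hyperedges
```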

Hypergraph Transformer layers propagate and aggregate structural information through two updates (a minimal code sketch follows the list):

  • Hyperedge update:

$$h_e^{(l+1)} = \sigma \left( W_1 h_e^{(l)} + \sum_{v \in e} W_2 h_v^{(l)} \right)$$

  • Node update:

$$h_v^{(l+1)} = \sigma \left( U_1 h_v^{(l)} + \sum_{e \ni v} U_2 h_e^{(l+1)} \right)$$
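
The two updates above can be written down directly. The NumPy sketch below treats $\sigma$ as ReLU and applies weight matrices on the right; these, along with the dense incidence lists, are presentation choices rather than details fixed by the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hypergraph_layer(h_v, h_e, incidence, params):
    """One hypergraph message-passing layer matching the two updates above.

    h_v: (num_nodes, d) node states; h_e: (num_edges, d) hyperedge states.
    incidence: list mapping each hyperedge index to the node indices it contains.
    params: dict of weight matrices W1, W2, U1, U2, each of shape (d, d).
    """
    W1, W2, U1, U2 = (params[k] for k in ("W1", "W2", "U1", "U2"))

    # Hyperedge update: aggregate member-node states into each hyperedge.
    h_e_new = np.stack([
        relu(h_e[e] @ W1 + sum(h_v[v] @ W2 for v in incidence[e]))
        for e in range(len(incidence))
    ])

    # Node update: aggregate the refreshed states of all incident hyperedges.
    h_v_new = np.zeros_like(h_v)
    for v in range(len(h_v)):
        edges = [e for e in range(len(incidence)) if v in incidence[e]]
        h_v_new[v] = relu(h_v[v] @ U1 + sum(h_e_new[e] @ U2 for e in edges))
    return h_v_new, h_e_new
```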

Structural embeddings are mean-pooled and mapped to a soft-prompt prefix for the LLM, and used jointly with serialized text and the question for answer generation:

$$p(\mathcal{A}\mid\mathcal{T}, \mathcal{Q}) = \prod_{i=1}^{n} p\bigl(a_i \mid \mathbf{X}_{st},\,\mathbf{X}_{tt},\,\mathbf{X}_{qt},\,a_{<i}\bigr)$$
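
The pooling and soft-prompt mapping can be sketched as follows; the single linear projection, the prefix length, and the variable names are assumptions about one plausible realization, not the paper's exact interface.

```python
import numpy as np

def structural_soft_prompt(h_v, h_e, proj, prefix_len=8):
    """Mean-pool node and hyperedge states into a soft-prompt prefix X_st.

    proj: (d, prefix_len * d_llm) projection matrix (assumed single linear map).
    Returns a (prefix_len, d_llm) matrix of prefix embeddings to be prepended
    to the serialized-table (X_tt) and question (X_qt) token embeddings before
    autoregressive decoding of the answer.
    """
    pooled = np.concatenate([h_v, h_e]).mean(axis=0)       # (d,)
    d_llm = proj.shape[1] // prefix_len
    return (pooled @ proj).reshape(prefix_len, d_llm)       # (prefix_len, d_llm)
```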

Empirical results show TAMO yields consistent, statistically significant improvements over text-only inputs (e.g., a 56.3% relative accuracy gain under prompt-tuning and roughly a 14% relative gain under SFT) (Li et al., 30 Nov 2025).

5. Analysis, Diagnostics, and Implications

StructQA isolates the phenomenon that text-serialized LLMs are highly sensitive to row/column ordering, with observed performance drops of ~8 percentage points in permutation accuracy and robustness. The inclusion of explicit structure tokens via TAMO mitigates this artefact, with only minor degradation under table shuffling. Attention heatmap visualizations in related experiments (e.g., WikiSQL) corroborate that explicit structure encoding enhances cross-attention to correct answer cells and headers, improving localization fidelity.

Ablation studies further demonstrate that neither text nor pure structure alone enables high performance; their combination is essential, as text-only and graph-only models both perform close to random chance on several tasks. A plausible implication is that structure-aware representations constitute an indispensable modality for reliable table QA.

6. Limitations and Future Directions

StructQA is limited to flat tables and single-turn static queries; it omits nested schemas, multi-table joins, and interactive scenarios. The table encoder in use is trained only on the downstream diagnostic tasks; large-scale, table-agnostic pretraining is suggested as a means to further boost cross-dataset generalization. Expanding the benchmark toward deeper hierarchies, multi-table settings, embedded semi-structured content, and dialogue-based QA is proposed as a fertile direction (Li et al., 30 Nov 2025).

7. Position Within the StructQA Benchmark Ecosystem

StructQA sharply contrasts with benchmarks such as TQA-Bench, which emphasizes multi-table relational reasoning with symbolic extensions and large-scale LLM benchmarking (up to 64K context tokens) (Qiu et al., 29 Nov 2024); HCT-QA, which targets human-centric table layout diversity and cell-based evaluation (Ahmad et al., 9 Mar 2025); and CBench, which focuses on question answering over knowledge graphs and fine-grained structural diagnostics of natural-language and SPARQL queries (Orogat et al., 2021). StructQA fills a diagnostic gap, isolating core limitations of current LLM table reasoning that are masked in larger, more heterogeneous or functionally-overloaded datasets.

In summary, StructQA provides a rigorous and fine-grained assessment of structure-sensitive table understanding, catalyzing the development of next-generation, modality-aware QA systems for tabular data (Li et al., 30 Nov 2025).
