TagOp: Hybrid Table Aggregation for Financial QA
- TagOp (Table Aggregation Operator) is a hybrid QA model that processes both tabular and textual data for financial question answering.
- It employs sequence tagging to extract evidential spans and operator-based symbolic computation for precise numerical reasoning.
- State-of-the-art TAT-QA performance is achieved despite challenges in accurate IO tagging of heterogeneous inputs.
TagOp (Table Aggregation Operator) is a QA model designed for numerical and semantic reasoning over hybrid input consisting of tabular and textual content, introduced as part of the TAT-QA benchmark for financial report question answering. Its architecture supports the extraction and reasoning over heterogeneous evidential spans using sequence tagging and operator-based symbolic computation, achieving state-of-the-art results on the TAT-QA dataset, while revealing persisting challenges in precise evidence localization and hybrid data reasoning (Zhu et al., 2021).
1. Input Representation and Shared Encoder
TagOp processes concatenated hybrid data, where the input is tokenized in the form $[\mathrm{CLS}]\;Q\;[\mathrm{SEP}]\;T\;[\mathrm{SEP}]\;P\;[\mathrm{SEP}]$, with $Q$ as question tokens, $T$ as flattened table tokens (cells serialized row by row), and $P$ as paragraph tokens. The concatenated token sequence is fed to a 12-layer Transformer encoder (RoBERTa or TaPas), producing contextualized vectors $h_t$ shared by all subsequent submodules. This unified encoding enables joint reasoning across tables and text, capturing dependencies inherent in financial QA tasks.
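The concatenation step above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: `flatten_table` and `build_input` are hypothetical helpers, and real usage would rely on the encoder's sub-word tokenizer rather than whitespace splitting.

```python
def flatten_table(table):
    """Serialize a table row by row into a flat token list (hypothetical helper)."""
    tokens = []
    for row in table:
        for cell in row:
            tokens.extend(str(cell).split())
    return tokens

def build_input(question, table, paragraphs, cls="[CLS]", sep="[SEP]"):
    """Concatenate question, flattened table, and paragraph tokens
    in the [CLS] Q [SEP] T [SEP] P [SEP] layout."""
    q = question.split()
    t = flatten_table(table)
    p = " ".join(paragraphs).split()
    return [cls] + q + [sep] + t + [sep] + p + [sep]

seq = build_input(
    "What was the change in revenue?",
    [["Year", "Revenue"], ["2019", "1,000"], ["2020", "1,200"]],
    ["Revenue grew due to product sales."],
)
```

In practice each of these whitespace tokens would be further split into sub-tokens by the encoder's tokenizer before the IO tagging described next.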
2. Evidence Extraction via Sequence Tagging
TagOp frames the identification of relevant evidence as token-level IO (inside/outside) tagging over all sub-tokens, spanning the entire question, table, and text. Each sub-token receives a label $\ell \in \{\mathrm{I}, \mathrm{O}\}$. Any sub-token labeled $\mathrm{I}$ within a cell renders the entire cell “selected”, and consecutive $\mathrm{I}$-tagged sub-tokens in paragraphs are merged into a single evidential span.
For each sub-token $t$ with contextual embedding $h_t$, the tagging probability is computed as
$$p_t^{\mathrm{tag}} = \mathrm{softmax}\big(\mathrm{FFN}(h_t)\big),$$
where FFN is a two-layer feed-forward network with GELU non-linearity. Ground-truth tags are assigned by matching annotated answer content in table-first, then text, order.
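The tagging head can be sketched with NumPy. This is an illustrative sketch under assumed dimensions and initialization, not the released model; the tanh GELU approximation and the hidden size are choices made here for compactness.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU non-linearity
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TaggingHead:
    """Two-layer FFN mapping per-token encoder vectors to I/O probabilities."""
    def __init__(self, d_model, d_hidden, rng):
        self.W1 = rng.normal(scale=0.02, size=(d_model, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.normal(scale=0.02, size=(d_hidden, 2))  # 2 classes: I, O
        self.b2 = np.zeros(2)

    def __call__(self, h):
        # h: (num_subtokens, d_model) contextual embeddings from the encoder
        return softmax(gelu(h @ self.W1 + self.b1) @ self.W2 + self.b2)

rng = np.random.default_rng(0)
head = TaggingHead(d_model=16, d_hidden=32, rng=rng)
probs = head(rng.normal(size=(5, 16)))  # tag distribution for 5 sub-tokens
```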
3. Operator-Based Symbolic Reasoning
TagOp incorporates a set of ten aggregation operators (plus an “Other” catch-all) that apply symbolic reasoning over the extracted evidences $E$, whose numeric values are denoted $n_1, \dots, n_k$. Formal operator definitions are as follows:
| Operator | Formal Semantics | Selection Logic |
|---|---|---|
| Span-in-text | $\arg\max_{s} \bar{p}^{\mathrm{I}}(s)$ | Max-scoring text span |
| Cell-in-table | $\arg\max_{c} \bar{p}^{\mathrm{I}}(c)$ | Max-scoring table cell |
| Spans | Concatenate all selected spans/cells | All evidences |
| Sum | $\sum_{i=1}^{k} n_i$ for numeric $n_i \in E$ | All numerics |
| Count | $\lvert E \rvert$ | All evidences |
| Average | $\frac{1}{k}\sum_{i=1}^{k} n_i$ | All numerics |
| Multiplication | $\prod_{i=1}^{k} n_i$ | All numerics |
| Difference | $n_1 - n_2$ (top-2 by $\bar{p}^{\mathrm{I}}$) | Ordered pair |
| Division | $n_1 / n_2$ | Ordered pair |
| Change-ratio | $n_1 / n_2 - 1$ | Ordered pair |
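The numeric operators in the table reduce to a small dispatch over the extracted values. The sketch below assumes the ordered operators already receive their operands in computational order (the job of the number-order classifier described next); `execute` is a hypothetical name, not from the paper.

```python
from math import prod

def execute(op, numerics):
    """Apply a TagOp-style aggregation operator to extracted numeric evidences.
    For ordered operators, numerics is assumed pre-ordered as (n1, n2)."""
    n = numerics
    if op == "Sum":
        return sum(n)
    if op == "Count":
        return len(n)
    if op == "Average":
        return sum(n) / len(n)
    if op == "Multiplication":
        return prod(n)
    if op == "Difference":
        return n[0] - n[1]
    if op == "Division":
        return n[0] / n[1]
    if op == "Change-ratio":
        return n[0] / n[1] - 1  # equivalently (n1 - n2) / n2
    raise ValueError(f"non-numeric or unknown operator: {op}")
```

For example, a change-ratio question over revenues of 1,200 and 1,000 yields `execute("Change-ratio", [1200, 1000])`, a 20% relative increase.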
Operator selection is performed using a multi-class softmax over the [CLS] vector:
$$p^{\mathrm{op}} = \mathrm{softmax}\big(\mathrm{FFN}(h_{[\mathrm{CLS}]})\big),$$
with chosen operator $\hat{o} = \arg\max_o p^{\mathrm{op}}_o$. For operators requiring operand order (Difference, Division, Change-ratio), a binary number-order classifier is computed as
$$p^{\mathrm{order}} = \mathrm{softmax}\big(\mathrm{FFN}([h_{t_1}; h_{t_2}])\big),$$
where $h_{t_1}, h_{t_2}$ are encoder vectors for the top-2 evidences. The binary label indicates whether the input order matches the required computational order $(n_1, n_2)$.
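The two classification decisions reduce to an argmax over operator classes and a possible operand swap. A minimal sketch, assuming an eleven-way head (the ten operators plus the “Other” catch-all) and probabilities already produced by the softmax layers above:

```python
import numpy as np

# Ten aggregation operators plus the "Other" catch-all class.
OPERATORS = ["Span-in-text", "Cell-in-table", "Spans", "Sum", "Count",
             "Average", "Multiplication", "Difference", "Division",
             "Change-ratio", "Other"]

def select_operator(p_op):
    """Pick the operator with the highest probability from the [CLS] head."""
    return OPERATORS[int(np.argmax(p_op))]

def order_operands(n1, n2, p_reversed):
    """Swap the two operands when the order classifier says the
    input order does not match the required computational order."""
    return (n2, n1) if p_reversed > 0.5 else (n1, n2)
```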
4. Scale Prediction and Numeric Calculation
To accommodate the financial context, TagOp predicts a question-specific scale $s \in \{\text{None}, \text{Thousand}, \text{Million}, \text{Billion}, \text{Percent}\}$. Scale prediction utilizes representations aggregated from table and paragraph tokens:
$$p^{\mathrm{scale}} = \mathrm{softmax}\big(\mathrm{FFN}([h_{[\mathrm{CLS}]}; h_T; h_P])\big),$$
where $h_T$ and $h_P$ are pooled table and paragraph token vectors.
After symbolic execution, the model computes the final answer as
$$\text{answer} = a \times \text{scale},$$
where $a$ is the operator’s output and scale is the multiplier corresponding to the predicted class (e.g., $10^3$ for Thousand).
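The rescaling step amounts to a lookup table. The factor assignments below are an assumption for illustration (in particular, treating Percent as $\times 10^{-2}$ to convert to an absolute value); the benchmark itself scores the number and scale jointly.

```python
# Assumed multipliers for each predicted scale class.
SCALE_FACTOR = {
    "None": 1,
    "Thousand": 1e3,
    "Million": 1e6,
    "Billion": 1e9,
    "Percent": 1e-2,  # assumption: percent converted to an absolute fraction
}

def apply_scale(value, scale):
    """Rescale the operator's numeric output by the predicted scale."""
    return value * SCALE_FACTOR[scale]
```

For example, an operator output of 1.2 with predicted scale Thousand denotes the value 1,200.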
5. Joint Training Objective
All submodules are trained jointly with cross-entropy losses. The total loss is
$$\mathcal{L} = \mathcal{L}_{\mathrm{tag}} + \mathcal{L}_{\mathrm{op}} + \mathcal{L}_{\mathrm{scale}} + \mathcal{L}_{\mathrm{order}},$$
where $\mathcal{L}_{\mathrm{tag}}$ supervises the IO tags marking answer-related sub-tokens, $\mathcal{L}_{\mathrm{op}}$ targets the oracle operator, $\mathcal{L}_{\mathrm{scale}}$ targets the ground-truth scale, and $\mathcal{L}_{\mathrm{order}}$ is defined for ordered operators only.
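The composition of the objective can be sketched as a plain sum of cross-entropy terms, with the order term contributing only when an ordered operator is the gold label. This is an illustrative sketch over probability vectors, not a training loop; an actual implementation would work on logits with an autodiff framework.

```python
import numpy as np

def cross_entropy(probs, gold):
    """Negative log-likelihood of the gold label under a predicted distribution."""
    return -np.log(probs[gold] + 1e-12)

def total_loss(tag_probs, tag_gold, op_probs, op_gold,
               scale_probs, scale_gold, order_probs=None, order_gold=None):
    """Joint objective: sum of tagging, operator, scale, and (when the
    gold operator is ordered) number-order cross-entropy losses."""
    loss = sum(cross_entropy(p, y) for p, y in zip(tag_probs, tag_gold))
    loss += cross_entropy(op_probs, op_gold)
    loss += cross_entropy(scale_probs, scale_gold)
    if order_probs is not None:  # only for Difference / Division / Change-ratio
        loss += cross_entropy(order_probs, order_gold)
    return loss
```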
6. Empirical Results and Failure Analysis
On the TAT-QA test split, TagOp achieves 50.1% EM and 58.0% F1 (11.1% absolute gain over prior best), with human performance at EM 84.1 / F1 90.8. Operator classifier and scale predictor accuracies are as follows:
| Operator | % of Q’s | Accuracy (%) |
|---|---|---|
| Span-in-text | 21.3 | 91.6 |
| Cell-in-table | 21.6 | 86.7 |
| Spans | 12.6 | 93.8 |
| Sum | 2.5 | 76.2 |
| Count | 2.4 | 100.0 |
| Average | 5.9 | 100.0 |
| Multiplication | 0.1 | 0.0 |
| Division | 1.0 | 87.5 |
| Difference | 15.9 | 96.6 |
| Change ratio | 10.2 | 95.3 |
| Other | 6.6 | 0.0 |

| Scale | % of Q’s | Accuracy (%) |
|---|---|---|
| None | 50.3 | 90.1 |
| Thousand | 19.2 | 95.3 |
| Million | 12.9 | 90.2 |
| Billion | – | – |
| Percent | 17.7 | 95.9 |
Error analysis attributes approximately 55% of failures to incorrect evidence tagging and 29% to missing evidence, underscoring that precise IO tagging remains the chief challenge for hybrid table-text QA (Zhu et al., 2021). This suggests that improvements in token-level evidence extraction would yield the largest performance gains.
7. Significance, Challenges, and Benchmark Status
TagOp’s architecture demonstrates that modeling both tabular and unstructured textual information with unified embeddings, explicit evidence extraction, symbolic reasoning, and contextual scale normalization enables tractable numerical QA over hybrid financial data. Nevertheless, performance remains well below the human benchmark. The chief identified bottleneck is precise IO tagging on heterogeneous inputs, which accounts for over half of all error cases. TagOp and TAT-QA together serve as a demanding benchmark for hybrid-data QA, requiring advances across evidence localization, operator prediction, and symbolic computation (Zhu et al., 2021).