TagOp: Hybrid Table Aggregation for Financial QA
- TagOp (Table Aggregation Operator) is a hybrid QA model that processes both tabular and textual data for financial question answering.
- It employs sequence tagging to extract evidential spans and operator-based symbolic computation for precise numerical reasoning.
- State-of-the-art TAT-QA performance is achieved despite challenges in accurate IO tagging of heterogeneous inputs.
TagOp (Table Aggregation Operator) is a QA model designed for numerical and semantic reasoning over hybrid input consisting of tabular and textual content, introduced as part of the TAT-QA benchmark for financial report question answering. Its architecture supports the extraction and reasoning over heterogeneous evidential spans using sequence tagging and operator-based symbolic computation, achieving state-of-the-art results on the TAT-QA dataset, while revealing persisting challenges in precise evidence localization and hybrid data reasoning (Zhu et al., 2021).
1. Input Representation and Shared Encoder
TagOp processes concatenated hybrid data, where the input is tokenized in the form $[\mathrm{CLS}]\;Q\;[\mathrm{SEP}]\;T\;[\mathrm{SEP}]\;P\;[\mathrm{SEP}]$, with $Q$ as question tokens, $T$ as flattened table tokens (cells serialized row by row), and $P$ as paragraph tokens. The concatenated token sequence is fed to a 12-layer Transformer encoder (RoBERTa or TaPas), producing contextualized vectors $h_t$ shared by all subsequent submodules. This unified encoding enables joint reasoning across tables and text, capturing dependencies inherent in financial QA tasks.
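The concatenation step above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: `flatten_table` and `build_input` are hypothetical helpers, and real usage would rely on the encoder's sub-word tokenizer rather than whitespace splitting.

```python
def flatten_table(table):
    """Serialize a table row by row into a flat token list (hypothetical helper)."""
    tokens = []
    for row in table:
        for cell in row:
            tokens.extend(str(cell).split())
    return tokens

def build_input(question, table, paragraphs, cls="[CLS]", sep="[SEP]"):
    """Concatenate question, flattened table, and paragraph tokens
    in the [CLS] Q [SEP] T [SEP] P [SEP] layout."""
    q = question.split()
    t = flatten_table(table)
    p = " ".join(paragraphs).split()
    return [cls] + q + [sep] + t + [sep] + p + [sep]

seq = build_input(
    "What was the change in revenue?",
    [["Year", "Revenue"], ["2019", "1,000"], ["2020", "1,200"]],
    ["Revenue grew due to product sales."],
)
```

In practice each of these whitespace tokens would be further split into sub-tokens by the encoder's tokenizer before the IO tagging described next.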
2. Evidence Extraction via Sequence Tagging
TagOp frames the identification of relevant evidence as token-level IO (inside/outside) tagging over all sub-tokens, spanning the entire question, table, and text. Each sub-token receives a label $\ell \in \{\mathrm{I}, \mathrm{O}\}$. Any sub-token labeled $\mathrm{I}$ within a cell renders the entire cell “selected”, and consecutive $\mathrm{I}$-tagged sub-tokens in paragraphs are merged into a single evidential span.
For each sub-token $t$ with contextual embedding $h_t$, the tagging probability is computed as
$$p_t^{\mathrm{tag}} = \mathrm{softmax}\big(\mathrm{FFN}(h_t)\big),$$
where FFN is a two-layer feed-forward network with GELU non-linearity. Ground-truth tags are assigned by matching annotated answer content in table-first, then text, order.
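The tagging head can be sketched with NumPy. This is an illustrative sketch under assumed dimensions and initialization, not the released model; the tanh GELU approximation and the hidden size are choices made here for compactness.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU non-linearity
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TaggingHead:
    """Two-layer FFN mapping per-token encoder vectors to I/O probabilities."""
    def __init__(self, d_model, d_hidden, rng):
        self.W1 = rng.normal(scale=0.02, size=(d_model, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.normal(scale=0.02, size=(d_hidden, 2))  # 2 classes: I, O
        self.b2 = np.zeros(2)

    def __call__(self, h):
        # h: (num_subtokens, d_model) contextual embeddings from the encoder
        return softmax(gelu(h @ self.W1 + self.b1) @ self.W2 + self.b2)

rng = np.random.default_rng(0)
head = TaggingHead(d_model=16, d_hidden=32, rng=rng)
probs = head(rng.normal(size=(5, 16)))  # tag distribution for 5 sub-tokens
```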
3. Operator-Based Symbolic Reasoning
TagOp incorporates a set of ten aggregation operators (plus an “Other” catch-all) that apply symbolic reasoning over the extracted evidences $E$, whose numeric values are denoted $n_1, \dots, n_k$. Formal operator definitions are as follows:
| Operator | Formal Semantics | Selection Logic |
|---|---|---|
| Span-in-text | $\arg\max_{s} \bar{p}^{\mathrm{I}}(s)$ | Max-scoring text span |
| Cell-in-table | $\arg\max_{c} \bar{p}^{\mathrm{I}}(c)$ | Max-scoring table cell |
| Spans | Concatenate all selected spans/cells | All evidences |
| Sum | $\sum_{i=1}^{k} n_i$ for numeric $n_i \in E$ | All numerics |
| Count | $\lvert E \rvert$ | All evidences |
| Average | $\frac{1}{k}\sum_{i=1}^{k} n_i$ | All numerics |
| Multiplication | $\prod_{i=1}^{k} n_i$ | All numerics |
| Difference | $n_1 - n_2$ (top-2 by $\bar{p}^{\mathrm{I}}$) | Ordered pair |
| Division | $n_1 / n_2$ | Ordered pair |
| Change-ratio | $n_1 / n_2 - 1$ | Ordered pair |
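The numeric operators in the table reduce to a small dispatch over the extracted values. The sketch below assumes the ordered operators already receive their operands in computational order (the job of the number-order classifier described next); `execute` is a hypothetical name, not from the paper.

```python
from math import prod

def execute(op, numerics):
    """Apply a TagOp-style aggregation operator to extracted numeric evidences.
    For ordered operators, numerics is assumed pre-ordered as (n1, n2)."""
    n = numerics
    if op == "Sum":
        return sum(n)
    if op == "Count":
        return len(n)
    if op == "Average":
        return sum(n) / len(n)
    if op == "Multiplication":
        return prod(n)
    if op == "Difference":
        return n[0] - n[1]
    if op == "Division":
        return n[0] / n[1]
    if op == "Change-ratio":
        return n[0] / n[1] - 1  # equivalently (n1 - n2) / n2
    raise ValueError(f"non-numeric or unknown operator: {op}")
```

For example, a change-ratio question over revenues of 1,200 and 1,000 yields `execute("Change-ratio", [1200, 1000])`, a 20% relative increase.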
Operator selection is performed using a multi-class softmax over the [CLS] vector:
$$p^{\mathrm{op}} = \mathrm{softmax}\big(\mathrm{FFN}(h_{[\mathrm{CLS}]})\big),$$
with chosen operator $\hat{o} = \arg\max_o p^{\mathrm{op}}_o$. For operators requiring operand order (Difference, Division, Change-ratio), a binary number-order classifier is computed as
$$p^{\mathrm{order}} = \mathrm{softmax}\big(\mathrm{FFN}([h_{t_1}; h_{t_2}])\big),$$
where $h_{t_1}, h_{t_2}$ are encoder vectors for the top-2 evidences. The binary label indicates whether the input order matches the required computational order $(n_1, n_2)$.
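The two classification decisions reduce to an argmax over operator classes and a possible operand swap. A minimal sketch, assuming an eleven-way head (the ten operators plus the “Other” catch-all) and probabilities already produced by the softmax layers above:

```python
import numpy as np

# Ten aggregation operators plus the "Other" catch-all class.
OPERATORS = ["Span-in-text", "Cell-in-table", "Spans", "Sum", "Count",
             "Average", "Multiplication", "Difference", "Division",
             "Change-ratio", "Other"]

def select_operator(p_op):
    """Pick the operator with the highest probability from the [CLS] head."""
    return OPERATORS[int(np.argmax(p_op))]

def order_operands(n1, n2, p_reversed):
    """Swap the two operands when the order classifier says the
    input order does not match the required computational order."""
    return (n2, n1) if p_reversed > 0.5 else (n1, n2)
```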
4. Scale Prediction and Numeric Calculation
To accommodate the financial context, TagOp predicts a question-specific scale $s \in \{\text{None}, \text{Thousand}, \text{Million}, \text{Billion}, \text{Percent}\}$. Scale prediction utilizes representations aggregated from table and paragraph tokens:
$$p^{\mathrm{scale}} = \mathrm{softmax}\big(\mathrm{FFN}([h_{[\mathrm{CLS}]}; h_T; h_P])\big),$$
where $h_T$ and $h_P$ are pooled table and paragraph token vectors.
After symbolic execution, the model computes the final answer as
$$\text{answer} = a \times \text{scale},$$
where $a$ is the operator’s output and scale is the multiplier corresponding to the predicted class (e.g., $10^3$ for Thousand).
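The rescaling step amounts to a lookup table. The factor assignments below are an assumption for illustration (in particular, treating Percent as $\times 10^{-2}$ to convert to an absolute value); the benchmark itself scores the number and scale jointly.

```python
# Assumed multipliers for each predicted scale class.
SCALE_FACTOR = {
    "None": 1,
    "Thousand": 1e3,
    "Million": 1e6,
    "Billion": 1e9,
    "Percent": 1e-2,  # assumption: percent converted to an absolute fraction
}

def apply_scale(value, scale):
    """Rescale the operator's numeric output by the predicted scale."""
    return value * SCALE_FACTOR[scale]
```

For example, an operator output of 1.2 with predicted scale Thousand denotes the value 1,200.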
5. Joint Training Objective
All submodules are trained jointly with cross-entropy losses. The total loss is
$$\mathcal{L} = \mathcal{L}_{\mathrm{tag}} + \mathcal{L}_{\mathrm{op}} + \mathcal{L}_{\mathrm{scale}} + \mathcal{L}_{\mathrm{order}},$$
where $\mathcal{L}_{\mathrm{tag}}$ supervises the IO tags marking answer-related sub-tokens, $\mathcal{L}_{\mathrm{op}}$ targets the oracle operator, $\mathcal{L}_{\mathrm{scale}}$ targets the ground-truth scale, and $\mathcal{L}_{\mathrm{order}}$ is defined for ordered operators only.
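The composition of the objective can be sketched as a plain sum of cross-entropy terms, with the order term contributing only when an ordered operator is the gold label. This is an illustrative sketch over probability vectors, not a training loop; an actual implementation would work on logits with an autodiff framework.

```python
import numpy as np

def cross_entropy(probs, gold):
    """Negative log-likelihood of the gold label under a predicted distribution."""
    return -np.log(probs[gold] + 1e-12)

def total_loss(tag_probs, tag_gold, op_probs, op_gold,
               scale_probs, scale_gold, order_probs=None, order_gold=None):
    """Joint objective: sum of tagging, operator, scale, and (when the
    gold operator is ordered) number-order cross-entropy losses."""
    loss = sum(cross_entropy(p, y) for p, y in zip(tag_probs, tag_gold))
    loss += cross_entropy(op_probs, op_gold)
    loss += cross_entropy(scale_probs, scale_gold)
    if order_probs is not None:  # only for Difference / Division / Change-ratio
        loss += cross_entropy(order_probs, order_gold)
    return loss
```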
6. Empirical Results and Failure Analysis
On the TAT-QA test split, TagOp achieves 50.1% EM and 58.0% F1 (11.1% absolute gain over prior best), with human performance at EM 84.1 / F1 90.8. Operator classifier and scale predictor accuracies are as follows:
| Operator | % of Q’s | Accuracy (%) |
|---|---|---|
| Span-in-text | 21.3 | 91.6 |
| Cell-in-table | 21.6 | 86.7 |
| Spans | 12.6 | 93.8 |
| Sum | 2.5 | 76.2 |
| Count | 2.4 | 100.0 |
| Average | 5.9 | 100.0 |
| Multiplication | 0.1 | 0.0 |
| Division | 1.0 | 87.5 |
| Difference | 15.9 | 96.6 |
| Change ratio | 10.2 | 95.3 |
| Other | 6.6 | 0.0 |

| Scale | % of Q’s | Accuracy (%) |
|---|---|---|
| None | 50.3 | 90.1 |
| Thousand | 19.2 | 95.3 |
| Million | 12.9 | 90.2 |
| Billion | – | – |
| Percent | 17.7 | 95.9 |
Error analysis attributes approximately 55% of failures to incorrect evidence tagging and 29% to missing evidence, underscoring that precise IO tagging remains the chief challenge for hybrid table-text QA (Zhu et al., 2021). This suggests that improvements in token-level evidence extraction would yield the largest performance gains.
7. Significance, Challenges, and Benchmark Status
TagOp’s architecture demonstrates that modeling both tabular and unstructured textual information with unified embeddings, explicit evidence extraction, symbolic reasoning, and contextual scale normalization enables tractable numerical QA over hybrid financial data. Nevertheless, performance remains well below the human benchmark. The chief identified bottleneck is precise IO tagging on heterogeneous inputs, which accounts for over half of all error cases. TagOp and TAT-QA together serve as a demanding benchmark for hybrid-data QA, requiring advances across evidence localization, operator prediction, and symbolic computation (Zhu et al., 2021).