Page-Object Table Transformer (POTATR)

Updated 18 December 2025
  • The paper's main contribution is the introduction of a unified page-object approach that extracts tables and subcomponents from full-page document images.
  • It employs a ResNet-50-based visual encoder with a DETR-style transformer whose object queries are doubled to 250 to detect table-related elements across the full page.
  • Robust relation modeling clarifies hierarchical table structures, achieving state-of-the-art results such as GriTS_top (0.902) and Edge-F1 (0.746) on the PubTables-v2 dataset.

The Page-Object Table Transformer (POTATR) is an image-to-graph neural architecture designed for comprehensive extraction of tables and their substructure from full-page document images. Building on and extending the Table Transformer (TATR), POTATR simultaneously detects tables and their constituent rows, columns, header cells, spanning cells, projected headers, captions, and footers, along with their hierarchical and adjacency relationships. The model is proposed and introduced in conjunction with the PubTables-v2 dataset, serving as a unified approach for page-level table extraction and structured understanding in complex visual documents (Smock et al., 11 Dec 2025).

1. Motivation and Conceptual Basis

Traditional table extraction (TE) systems typically proceed in two stages: (1) detecting tables on each page, and (2) parsing each cropped table region in isolation to discern rows, columns, headers, and other elements. This prevailing pipeline is prone to several limitations: loss of broader page context (especially for elements like captions and footers), increased maintenance complexity as a multi-stage system, and difficulty handling tables that span multiple pages.

POTATR introduces the "page-object" TE paradigm: given an entire page image, it aims to detect all tables and their subcomponents, modeling the set of objects and their relations natively in context, without requiring per-table cropping or separate region proposal steps. This approach explicitly models the contextual and hierarchical interplay among table elements, supporting robust extraction even in the presence of rotated elements or ambiguous boundaries.

2. Input Encodings and Object Proposal Mechanisms

POTATR adopts a visual encoder based on a ResNet-50 backbone pretrained on ImageNet, followed by a $1 \times 1$ projection that produces a spatial feature map of dimension $d = 256$. Sine-cosine positional encodings are added to this representation. Unlike systems that rely on external region proposals, POTATR employs a DETR-style mechanism with learned "object queries." The number of queries is doubled relative to TATR (from 125 to 250) to accommodate the increased object count present in full-page images.
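As a concrete illustration, a minimal PyTorch-style sketch of this encoder path is given below, assuming a standard torchvision ResNet-50 and a simplified sine-cosine positional encoding; the module names and encoding details are illustrative, not taken from the released implementation.

```python
import math
import torch
import torch.nn as nn
import torchvision


def sine_pos_encoding(h: int, w: int, d_model: int) -> torch.Tensor:
    """Simplified 2-D sine-cosine positional encoding of shape (d_model, h, w)."""
    half = d_model // 2
    freqs = torch.exp(torch.arange(0, half, 2).float() * (-math.log(10000.0) / half))
    ys = torch.arange(h).float()[:, None] * freqs            # (h, half/2)
    xs = torch.arange(w).float()[:, None] * freqs            # (w, half/2)
    pe_y = torch.cat([ys.sin(), ys.cos()], dim=1)            # (h, half)
    pe_x = torch.cat([xs.sin(), xs.cos()], dim=1)            # (w, half)
    pe = torch.cat([pe_y[:, None, :].expand(h, w, half),
                    pe_x[None, :, :].expand(h, w, half)], dim=-1)
    return pe.permute(2, 0, 1)                               # (d_model, h, w)


class PageEncoder(nn.Module):
    """Sketch of the POTATR visual encoder: ResNet-50 features projected to
    d = 256 by a 1x1 convolution, with additive positional encodings."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop pool/fc
        self.project = nn.Conv2d(2048, d_model, kernel_size=1)          # 1x1 projection

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.project(self.backbone(images))           # (B, 256, H/32, W/32)
        pos = sine_pos_encoding(feats.shape[-2], feats.shape[-1], feats.shape[1])
        return feats + pos.to(feats)                          # flattened later for the transformer
```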

While prior methods such as TATR have shown that incorporating PDF-extracted text and bounding boxes as auxiliary object proposals can be beneficial, the core implementation of POTATR in reported experiments focuses solely on visual features. However, the architecture is compatible with late fusion of text features at the query level.

3. Transformer Architecture and Relation Modeling

The core of POTATR is a transformer with a 6-layer encoder and a 6-layer decoder, each layer comprising multi-head attention and a feed-forward network (hidden dimension $d = 256$, 8 attention heads, FFN size 2048). The decoder attends to the encoder output and is parameterized with 250 learned object queries.

Each decoder output embedding $h_i \in \mathbb{R}^{256}$ feeds two heads:

  • Classification head: predicts one of 17 classes (16 object types + "no object") via softmax.
  • Bounding-box head: predicts a normalized coordinate box through a 3-layer MLP.

The model further includes a relation head. For every query pair $(i, j)$, a directed adjacency probability is computed as

$$p_{ij} = \sigma\left(\mathrm{MLP}_\mathrm{rel}([h_i \Vert h_j])\right)$$

where $\sigma$ is the logistic sigmoid and $[h_i \Vert h_j]$ denotes the concatenation of the two embeddings. The two-layer $\mathrm{MLP}_\mathrm{rel}$ has hidden size 256.
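The sketch below (illustrative module names; a plain PyTorch rendering of the description above, not the released code) shows how the three heads could be applied to the 250 decoder query embeddings, including the pairwise relation scores:

```python
import torch
import torch.nn as nn


class POTATRHeads(nn.Module):
    """Sketch of the per-query prediction heads: 17-way classification, a 3-layer
    box MLP, and a pairwise relation head over concatenated query embeddings."""

    def __init__(self, d_model: int = 256, num_classes: int = 17):
        super().__init__()
        self.cls_head = nn.Linear(d_model, num_classes)      # 16 object types + "no object"
        self.box_head = nn.Sequential(                        # 3-layer MLP -> normalized box
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),
        )
        self.rel_head = nn.Sequential(                        # 2-layer MLP_rel, hidden size 256
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 1),
        )

    def forward(self, h: torch.Tensor):
        # h: decoder output embeddings, shape (B, Q, 256) with Q = 250 queries.
        logits = self.cls_head(h)                             # (B, Q, 17)
        boxes = self.box_head(h)                              # (B, Q, 4)
        B, Q, D = h.shape
        h_i = h[:, :, None, :].expand(B, Q, Q, D)             # query i repeated along dim 2
        h_j = h[:, None, :, :].expand(B, Q, Q, D)             # query j repeated along dim 1
        rel_logits = self.rel_head(torch.cat([h_i, h_j], dim=-1)).squeeze(-1)
        return logits, boxes, torch.sigmoid(rel_logits)       # p_ij, shape (B, Q, Q)
```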

4. Graph Construction and Output Representations

POTATR outputs a page-object graph where nodes correspond to detected objects (tables, structural components, captions, etc.), and edges correspond to predicted high-confidence relations (defined as $p_{ij} > \tau$ for some threshold $\tau$). In its base configuration, POTATR models a single binary relation type: parent–child, or more generally, "belongs-to-the-same-table." This graph structure supports direct recovery of:

  • Row and column association with parent table nodes
  • Header cell assignments to columns
  • Cell grid construction by intersecting row and column spans
  • Caption and footer linkage to tables

By inferring these relations, the global structure of tables present on a page—potentially including rotated elements—is clarified in a unified output graph.
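A rough post-processing sketch of this graph construction is shown below; the threshold value, class-name strings, and grouping logic are assumptions for illustration, not specifics from the paper.

```python
from collections import defaultdict


def build_page_graph(labels, boxes, rel_probs, tau=0.5):
    """Threshold pairwise relation probabilities (p_ij > tau) and group each
    table's detected subcomponents under its node (hedged sketch)."""
    n = len(labels)
    edges = [(i, j) for i in range(n) for j in range(n)
             if i != j and rel_probs[i][j] > tau]

    # Treat relation edges that originate at a table node as parent -> child.
    children = defaultdict(list)
    for i, j in edges:
        if labels[i] == "table":
            children[i].append(j)

    graph = {}
    for t, members in children.items():
        graph[t] = {
            "box": boxes[t],
            "rows": [j for j in members if labels[j] == "table row"],
            "columns": [j for j in members if labels[j] == "table column"],
            "captions": [j for j in members if labels[j] == "caption"],
            "footers": [j for j in members if labels[j] == "footer"],
        }
    return graph
```

The cell grid of each table can then be recovered by intersecting its row and column boxes, as noted in the list above.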

5. Loss Functions and Optimization

The POTATR loss is a weighted combination of four terms:

  • Object classification loss:

$$\mathcal{L}_\mathrm{cls} = \frac{1}{N} \sum_{i} \mathrm{CE}\left(c_i, c^*_i\right)$$

where $c_i$ is the 17-class softmax prediction and $c^*_i$ is the ground-truth one-hot.

  • Bounding-box regression ($\ell_1$ norm):

$$\mathcal{L}_{\ell_1} = \frac{1}{N} \sum_{i} \left\| b_i - b^*_i \right\|_1$$

  • Generalized IoU loss:

$$\mathcal{L}_\mathrm{giou} = \frac{1}{N} \sum_{i} \left( 1 - \mathrm{GIoU}(b_i, b^*_i) \right)$$

  • Relation (edge) prediction loss:

$$\mathcal{L}_\mathrm{rel} = \frac{1}{250 \cdot 249} \sum_{i \neq j} \left[ -r^*_{ij} \log p_{ij} - (1 - r^*_{ij}) \log(1 - p_{ij}) \right]$$

where $r^*_{ij} \in \{0, 1\}$ is the ground-truth adjacency label.

Total loss:

$$\mathcal{L} = \lambda_\mathrm{cls} \mathcal{L}_\mathrm{cls} + \lambda_{\ell_1} \mathcal{L}_{\ell_1} + \lambda_\mathrm{giou} \mathcal{L}_\mathrm{giou} + \lambda_\mathrm{rel} \mathcal{L}_\mathrm{rel}$$

with $\lambda_\mathrm{cls} = 1$, $\lambda_{\ell_1} = 5$, $\lambda_\mathrm{giou} = 2$, and $\lambda_\mathrm{rel} = 0.05$. A reduced no-object weight ($\mathrm{eos\_coef} = 0.1$) is used for classification.
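Assuming predictions have already been aligned to ground truth by the usual DETR-style bipartite matching (omitted here), the weighted objective could be sketched as follows; tensor shapes and the corner box format are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou


def potatr_loss(cls_logits, boxes, rel_probs, tgt_classes, tgt_boxes, tgt_rel,
                obj_mask, no_object_index=16, eos_coef=0.1):
    """Weighted POTATR-style loss (sketch). cls_logits: (B, Q, 17); boxes and
    tgt_boxes: (B, Q, 4) in corner (x1, y1, x2, y2) format; rel_probs, tgt_rel:
    (B, Q, Q); obj_mask: (B, Q) bool marking queries matched to real objects."""
    # Classification: cross-entropy over 17 classes with a reduced "no object" weight.
    weights = torch.ones(cls_logits.shape[-1], device=cls_logits.device)
    weights[no_object_index] = eos_coef
    loss_cls = F.cross_entropy(cls_logits.flatten(0, 1), tgt_classes.flatten(),
                               weight=weights)

    # Box regression terms are computed only over matched objects.
    pred_b, tgt_b = boxes[obj_mask], tgt_boxes[obj_mask]
    loss_l1 = F.l1_loss(pred_b, tgt_b)
    loss_giou = (1.0 - torch.diagonal(generalized_box_iou(pred_b, tgt_b))).mean()

    # Relation loss: binary cross-entropy over all ordered query pairs i != j.
    off_diag = ~torch.eye(rel_probs.shape[-1], dtype=torch.bool, device=rel_probs.device)
    loss_rel = F.binary_cross_entropy(rel_probs[..., off_diag],
                                      tgt_rel[..., off_diag].float())

    # Loss weights from the paper: lambda_cls=1, lambda_l1=5, lambda_giou=2, lambda_rel=0.05.
    return loss_cls + 5.0 * loss_l1 + 2.0 * loss_giou + 0.05 * loss_rel
```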

6. Dataset, Training Regimen, and Implementation Notes

POTATR is trained on the PubTables-v2 Single Pages dataset, consisting of approximately 467,000 pages and 548,000 tables, with public train/validation/test splits and a held-out hidden test set (4.3%). Optimization uses the AdamW optimizer with an initial learning rate of $5 \times 10^{-5}$, weight decay $10^{-4}$, and exponential learning-rate decay by 0.9 every two epochs. Training runs for 100 epochs (each epoch being one pass over all pages) on 8× NVIDIA T4 GPUs with a batch size of 2 per GPU (total batch size 16).
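A hedged sketch of this training setup in PyTorch is shown below; the model, data loader, and loss function are placeholders standing in for the components described elsewhere in this article:

```python
import torch


def train(model, train_loader, loss_fn, epochs: int = 100):
    """Training-loop sketch matching the reported schedule: AdamW with lr 5e-5,
    weight decay 1e-4, and the learning rate decayed by 0.9 every two epochs."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.9)

    for _ in range(epochs):
        for images, targets in train_loader:   # batch size 2 per GPU, 8 GPUs -> 16 total
            optimizer.zero_grad()
            loss = loss_fn(model(images), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()                       # stepped per epoch; decay applies every 2 epochs
```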

Initialization uses TATR-v1.1 weights trained on PubTables-1M cropped tables; the new relation head and additional queries are randomly initialized.

7. Evaluation, Comparative Results, and Ablations

Metrics:

  • GriTS_top and GriTS_con (grid table similarity for topology and content)
  • Exact match accuracy (perfect row×column structure match)
  • Edge F1 (for parent–child graph edges with IoU ≥ 0.8; see the sketch after this list)
  • Standard AP50 for detection-level comparisons
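One plausible reading of the Edge F1 computation is sketched below: predicted objects are first matched one-to-one to ground-truth objects, only matches with box IoU ≥ 0.8 are kept, and predicted edges are then scored against ground-truth edges through that matching. The matching step and exact counting conventions are assumptions, not details taken from the paper.

```python
def edge_f1(pred_edges, gt_edges, match):
    """Edge F1 sketch. `match` maps a predicted node id to its ground-truth node id
    for object pairs whose boxes overlap with IoU >= 0.8 (assumed precomputed)."""
    mapped = {(match[i], match[j]) for (i, j) in pred_edges
              if i in match and j in match}
    gt = set(gt_edges)
    tp = len(mapped & gt)
    precision = tp / len(pred_edges) if pred_edges else 0.0
    recall = tp / len(gt) if gt else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```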

Page-level structure recognition (POTATR-v1.0-Pub):

  • GriTS_top: 0.9604
  • GriTS_con: 0.9573
  • Acc_top: 0.7377
  • Acc_con: 0.6643
  • Edge-F1: ≈ 0.746

Comparative benchmarks:

  • Best domain-specialized VLM (granite-vision-3.2-2b): GriTS_top 0.8015, Acc_top 0.4245
  • Cropped table recognition (fine-tuned TATR-v1.2-Pub): GriTS_top 0.9803, Acc_top 0.6872
  • Image-to-graph models (small scale, 10 epochs):

Model            AP50    GriTS_top   Edge-F1
Relationformer   0.808   0.852       0.339
EGTR             0.791   0.850       0.707
POTATR           0.904   0.902       0.746

Ablations:

  • Explicit relation head markedly improves table–object association over overlap-based heuristics.
  • Doubling queries to 250 substantially improves AP and GriTS relative to 125.
  • Cross-page table continuation detection using side-by-side ResNet-50 or ViT-B-16 achieves F1 > 0.97 on ~4,000 pairs, indicating that continuity is visually tractable in PubTables-v2.

Strengths: End-to-end, context-aware, single-model table extraction; explicit object graph representation; leverages TATR pretraining for rapid convergence; establishes state-of-the-art results for page-level table structure recognition.

Limitations: Currently restricted to single-page context; text features are limited to PDF extraction rather than learned OCR features; only one binary relation type is modeled, with opportunities to handle richer relations; integration with vision-language models is underexplored.

[POTATR is detailed in: (Smock et al., 11 Dec 2025)]
