Page-Object Table Transformer (POTATR)
- The paper's main contribution is the introduction of a unified page-object approach that extracts tables and subcomponents from full-page document images.
- It employs a ResNet-50-based visual encoder with a DETR-style transformer whose set of learned object queries is doubled to 250, accommodating the larger number of table-related elements on a full page.
- Robust relation modeling clarifies hierarchical table structures, achieving state-of-the-art results such as GriTS_Top of 0.902 and Edge-F1 of 0.746 in controlled image-to-graph comparisons on the PubTables-v2 dataset.
The Page-Object Table Transformer (POTATR) is an image-to-graph neural architecture designed for comprehensive extraction of tables and their substructure from full-page document images. Building on and extending the Table Transformer (TATR), POTATR simultaneously detects tables and their constituent rows, columns, header cells, spanning cells, projected headers, captions, and footers, along with their hierarchical and adjacency relationships. The model is introduced in conjunction with the PubTables-v2 dataset and serves as a unified approach to page-level table extraction and structured understanding in complex visual documents (Smock et al., 11 Dec 2025).
1. Motivation and Conceptual Basis
Traditional table extraction (TE) systems typically proceed in two stages: (1) detecting tables on each page, and (2) parsing each cropped table region in isolation to discern rows, columns, headers, and other elements. This prevailing pipeline is prone to several limitations: loss of broader page context (especially for elements like captions and footers), increased maintenance complexity as a multi-stage system, and difficulty handling tables that span multiple pages.
POTATR introduces the "page-object" TE paradigm: given an entire page image, it aims to detect all tables and their subcomponents, modeling the set of objects and their relations natively in context, without requiring per-table cropping or separate region proposal steps. This approach explicitly models the contextual and hierarchical interplay among table elements, supporting robust extraction even in the presence of rotated elements or ambiguous boundaries.
2. Input Encodings and Object Proposal Mechanisms
POTATR adopts a visual encoder based on a ResNet-50 backbone pretrained on ImageNet, followed by a projection that reduces the backbone output to a lower-dimensional spatial feature map. Sine-cosine positional encodings are added to this representation. Unlike systems that rely on external region proposals, POTATR employs a DETR-style mechanism with learned "object queries." The number of queries is doubled relative to TATR (from 125 to 250) to accommodate the larger object count of full-page images.
While prior methods such as TATR have shown that incorporating PDF-extracted text and bounding boxes as auxiliary object proposals can be beneficial, the core implementation of POTATR in reported experiments focuses solely on visual features. However, the architecture is compatible with late fusion of text features at the query level.
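The encoder and query setup can be summarized in a short PyTorch sketch. This is a minimal illustration, not the authors' implementation: the module names, the model width `d_model = 256`, and the simplified positional encoding are assumptions.

```python
# Minimal sketch of a POTATR-style visual encoder with learned object queries.
# Names, the model width d_model, and the simplified positional encoding are
# illustrative assumptions, not the authors' exact code.
import torch
import torch.nn as nn
import torchvision


def sine_positional_encoding(x: torch.Tensor) -> torch.Tensor:
    """Simplified 2D sine-cosine positional encoding over a feature map."""
    b, d, h, w = x.shape
    y_embed = torch.arange(h, device=x.device).float().view(1, 1, h, 1).expand(b, 1, h, w)
    x_embed = torch.arange(w, device=x.device).float().view(1, 1, 1, w).expand(b, 1, h, w)
    dim_t = 10000 ** (2 * torch.arange(d // 4, device=x.device).float() / (d // 2))
    pos_x = x_embed / dim_t.view(1, -1, 1, 1)
    pos_y = y_embed / dim_t.view(1, -1, 1, 1)
    return torch.cat([pos_x.sin(), pos_x.cos(), pos_y.sin(), pos_y.cos()], dim=1)


class PageEncoder(nn.Module):
    def __init__(self, d_model: int = 256, num_queries: int = 250):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        # Keep everything up to the last conv stage (drop avgpool and fc).
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # Project the 2048-channel ResNet features down to the transformer width.
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        # 250 learned object queries (doubled from TATR's 125).
        self.query_embed = nn.Embedding(num_queries, d_model)

    def forward(self, page_image: torch.Tensor):
        # page_image: (B, 3, H, W) render of a full document page.
        feats = self.backbone(page_image)        # (B, 2048, H/32, W/32)
        memory = self.input_proj(feats)          # (B, d_model, H/32, W/32)
        pos = sine_positional_encoding(memory)   # same spatial shape as memory
        return memory, pos, self.query_embed.weight
```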
3. Transformer Architecture and Relation Modeling
The core of POTATR is a transformer with a 6-layer encoder and a 6-layer decoder, each layer comprising multi-head attention and a feed-forward network (8 attention heads, FFN size 2048). The decoder attends to the encoder output and is parameterized with the 250 learned object queries.
Each decoder output embedding feeds two heads:
- Classification head: predicts one of 17 classes (16 object types + "no object") via softmax.
- Bounding-box head: predicts a box as four normalized coordinates through a 3-layer MLP.
The model further includes a relation head. For every ordered query pair $(i, j)$, a directed adjacency probability is computed as

$$\hat{r}_{ij} = \sigma\big(\mathrm{MLP}_{\mathrm{rel}}([\mathbf{z}_i ; \mathbf{z}_j])\big),$$

where $\sigma$ is the logistic sigmoid and $[\mathbf{z}_i ; \mathbf{z}_j]$ denotes the concatenation of the two decoder embeddings. The two-layer $\mathrm{MLP}_{\mathrm{rel}}$ has hidden size 256.
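The three heads can be expressed in a compact PyTorch sketch. It follows the shapes stated above (17 classes, a 3-layer box MLP, a 2-layer relation MLP with hidden size 256); the module names and the model width are assumptions.

```python
# Sketch of the per-query classification/box heads and the pairwise relation head.
# Module names and d_model are assumptions; the head shapes follow the text above.
import torch
import torch.nn as nn


class PredictionHeads(nn.Module):
    def __init__(self, d_model: int = 256, num_classes: int = 17):
        super().__init__()
        self.class_head = nn.Linear(d_model, num_classes)   # 16 object types + "no object"
        self.box_head = nn.Sequential(                       # 3-layer MLP -> (cx, cy, w, h) in [0, 1]
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),
        )
        self.rel_head = nn.Sequential(                       # 2-layer MLP on concatenated embedding pairs
            nn.Linear(2 * d_model, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, queries: torch.Tensor):
        # queries: (B, N, d_model) decoder output embeddings, N = 250.
        logits = self.class_head(queries)                         # (B, N, 17)
        boxes = self.box_head(queries)                            # (B, N, 4)
        B, N, D = queries.shape
        # Score a directed edge for every ordered pair (i, j);
        # dense N x N scoring is affordable for N = 250.
        zi = queries.unsqueeze(2).expand(B, N, N, D)              # embedding of node i
        zj = queries.unsqueeze(1).expand(B, N, N, D)              # embedding of node j
        rel_logits = self.rel_head(torch.cat([zi, zj], dim=-1))   # (B, N, N, 1)
        rel_probs = torch.sigmoid(rel_logits).squeeze(-1)         # (B, N, N) adjacency probabilities
        return logits, boxes, rel_probs
```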
4. Graph Construction and Output Representations
POTATR outputs a page-object graph where nodes correspond to detected objects (tables, structural components, captions, etc.), and edges correspond to predicted high-confidence relations (defined as $\hat{r}_{ij} > \tau$ for some threshold $\tau$). In its base configuration, POTATR models a single binary relation type: parent–child, or more generally, "belongs-to-the-same-table." This graph structure supports direct recovery of:
- Row and column association with parent table nodes
- Header cell assignments to columns
- Cell grid construction by intersecting row and column spans
- Caption and footer linkage to tables
By inferring these relations, the global structure of tables present on a page—potentially including rotated elements—is clarified in a unified output graph.
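A minimal sketch of how such a graph could be assembled from per-image predictions is given below. The confidence and edge thresholds, the class list, and the helper names (`build_page_graph`, `cell_grid`) are illustrative assumptions, not the paper's post-processing code.

```python
# Sketch of assembling a page-object graph from thresholded predictions.
# Thresholds, the class list, and the helper names are illustrative assumptions.
import torch

CLASS_NAMES = ["table", "table row", "table column", "column header",
               "spanning cell", "projected row header", "caption", "footer"]  # illustrative subset


def build_page_graph(logits, boxes, rel_probs, obj_thresh=0.5, edge_thresh=0.5):
    """Return nodes (kept objects) and directed edges (high-confidence relations).

    logits: (N, 17), boxes: (N, 4), rel_probs: (N, N) for a single page.
    """
    probs = logits.softmax(-1)
    scores, labels = probs[:, :-1].max(-1)              # drop the "no object" class
    keep = (scores > obj_thresh).nonzero(as_tuple=True)[0]

    nodes = [{"query": int(i), "label": int(labels[i]),
              "score": float(scores[i]), "box": boxes[i].tolist()} for i in keep]

    edges = [(int(i), int(j), float(rel_probs[i, j]))
             for i in keep for j in keep
             if i != j and rel_probs[i, j] > edge_thresh]   # i -> j: "belongs to"
    return nodes, edges


def cell_grid(row_boxes, col_boxes):
    """Recover the cell grid by intersecting row and column boxes given as (x1, y1, x2, y2)."""
    return [[(c[0], r[1], c[2], r[3]) for c in col_boxes] for r in row_boxes]
```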
5. Loss Functions and Optimization
The POTATR loss is a weighted combination of four terms:
- Object classification loss: $\mathcal{L}_{\mathrm{cls}} = -\sum_{k=1}^{17} y_k \log \hat{p}_k$, where $\hat{p}$ is the 17-class softmax prediction and $y$ is the ground-truth one-hot vector.
- Bounding-box regression ($\ell_1$ norm): $\mathcal{L}_{\mathrm{box}} = \lVert b - \hat{b} \rVert_1$, with $b$ the ground-truth and $\hat{b}$ the predicted box.
- Generalized IoU loss: $\mathcal{L}_{\mathrm{giou}} = 1 - \mathrm{GIoU}(b, \hat{b})$.
- Relation (edge) prediction loss (binary cross-entropy): $\mathcal{L}_{\mathrm{rel}} = -\big(r_{ij} \log \hat{r}_{ij} + (1 - r_{ij}) \log(1 - \hat{r}_{ij})\big)$, where $r_{ij}$ is the ground-truth adjacency label.
Total loss:

$$\mathcal{L} = \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{box}} \mathcal{L}_{\mathrm{box}} + \lambda_{\mathrm{giou}} \mathcal{L}_{\mathrm{giou}} + \lambda_{\mathrm{rel}} \mathcal{L}_{\mathrm{rel}},$$

with scalar weights $\lambda_{\mathrm{cls}}$, $\lambda_{\mathrm{box}}$, $\lambda_{\mathrm{giou}}$, and $\lambda_{\mathrm{rel}}$. A reduced weight on the "no object" class is used for classification.
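The objective can be sketched as follows for a set of matched prediction–target pairs, assuming DETR-style bipartite matching has already aligned queries with ground-truth objects. The $\lambda$ values shown are placeholders, not the paper's settings.

```python
# Sketch of the combined POTATR objective over matched predictions and targets.
# Assumes a DETR-style bipartite matching step has aligned queries with ground truth;
# the lambda weights are placeholders, not the paper's values.
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss


def potatr_loss(logits, boxes_xyxy, rel_probs,
                tgt_labels, tgt_boxes_xyxy, tgt_adj,
                lambdas=(1.0, 5.0, 2.0, 1.0)):          # placeholder weights
    w_cls, w_box, w_giou, w_rel = lambdas

    # 17-way classification; a reduced weight on the "no object" class would be
    # passed via the `weight` argument of cross_entropy (omitted here for brevity).
    l_cls = F.cross_entropy(logits, tgt_labels)

    # L1 box regression and generalized IoU over matched boxes in (x1, y1, x2, y2) form.
    l_box = F.l1_loss(boxes_xyxy, tgt_boxes_xyxy)
    l_giou = generalized_box_iou_loss(boxes_xyxy, tgt_boxes_xyxy, reduction="mean")

    # Binary cross-entropy over the directed adjacency matrix (tgt_adj is a float 0/1 matrix).
    l_rel = F.binary_cross_entropy(rel_probs, tgt_adj)

    return w_cls * l_cls + w_box * l_box + w_giou * l_giou + w_rel * l_rel
```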
6. Dataset, Training Regimen, and Implementation Notes
POTATR is trained on the PubTables-v2 Single Pages dataset, consisting of approximately 467,000 pages and 548,000 tables, with public train/validation/test splits and a held-out hidden test set (4.3%). Optimization uses the AdamW optimizer with weight decay and an exponential learning-rate decay by a factor of 0.9 every two epochs. Training is conducted for 100 epochs (one epoch being a full pass over all pages) on 8× NVIDIA T4 GPUs with a batch size of 2 per GPU (effective batch size 16).
Initialization uses TATR-v1.1 weights trained on PubTables-1M cropped tables; the new relation head and additional queries are randomly initialized.
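A minimal sketch of this optimization schedule in PyTorch is shown below; the initial learning rate and weight decay are placeholders, while the 0.9 decay every two epochs follows the text.

```python
# Sketch of the optimization schedule: AdamW with the learning rate multiplied by 0.9
# every two epochs. The lr and weight_decay values are placeholders.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for POTATR (initialized from TATR-v1.1 weights in practice)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)  # placeholder values
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.9)

for epoch in range(100):
    # ... one pass over all training pages (batch size 2 per GPU on 8 GPUs, effective 16) ...
    scheduler.step()  # learning rate *= 0.9 every second epoch
```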
7. Evaluation, Comparative Results, and Ablations
Metrics:
- GriTS_Top and GriTS_Con (grid table similarity for topology and content, respectively)
- Exact-match accuracy (Acc): perfect match of the row×column structure
- Edge F1 (for parent–child graph edges, with predicted objects matched to ground truth at IoU ≥ 0.8; see the sketch after this list)
- Standard AP for detection-level comparisons
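One way to realize an Edge-F1 style computation is sketched below: predicted objects are greedily matched to ground-truth objects at IoU ≥ 0.8, and predicted edges are then compared against ground-truth edges over the matched nodes. This is an illustrative protocol under those assumptions, not the paper's evaluation code.

```python
# Illustrative Edge-F1 computation: greedy IoU >= 0.8 matching of objects, then set
# comparison of directed parent-child edges. Not the paper's exact evaluation code.
import torch
from torchvision.ops import box_iou


def edge_f1(pred_boxes, pred_edges, gt_boxes, gt_edges, iou_thresh=0.8):
    # pred_boxes, gt_boxes: (P, 4) and (G, 4) in (x1, y1, x2, y2) form.
    # pred_edges, gt_edges: lists of directed (i, j) index pairs.
    iou = box_iou(pred_boxes, gt_boxes)                       # (P, G)
    match, used = {}, set()
    for p in iou.max(dim=1).values.argsort(descending=True).tolist():
        g = int(iou[p].argmax())
        if iou[p, g] >= iou_thresh and g not in used:
            match[p] = g
            used.add(g)

    # Map predicted edges into ground-truth index space; edges with unmatched
    # endpoints count against precision.
    pred_mapped = {(match[i], match[j]) for (i, j) in pred_edges if i in match and j in match}
    tp = len(pred_mapped & set(gt_edges))
    precision = tp / max(len(pred_edges), 1)
    recall = tp / max(len(gt_edges), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```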
Page-level structure recognition (POTATR-v1.0-Pub):
- GriTS_Top: 0.9604
- GriTS_Con: 0.9573
- Acc_Top: 0.7377
- Acc_Con: 0.6643
- Edge-F1: ≈ 0.746
Comparative benchmarks:
- Best domain-specialized VLM: granite-vision-3.2-2b, GriTS: 0.8015, Acc: 0.4245
- Cropped table recognition (fine-tuned TATR-v1.2-Pub): GriTS: 0.9803, Acc: 0.6872
- Image-to-graph models (small scale, 10 epochs):
| Model | AP | GriTS_Top | Edge-F1 |
|---|---|---|---|
| Relationformer | 0.808 | 0.852 | 0.339 |
| EGTR | 0.791 | 0.850 | 0.707 |
| POTATR | 0.904 | 0.902 | 0.746 |
Ablations:
- Explicit relation head markedly improves table–object association over overlap-based heuristics.
- Doubling queries to 250 substantially improves AP and GriTS relative to 125.
- Cross-page table continuation detection using side-by-side ResNet-50 or ViT-B-16 achieves F1 > 0.97 on ~4,000 pairs, indicating that continuity is visually tractable in PubTables-v2.
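A continuation classifier of the kind used in the last ablation could be sketched as below; the fusion scheme (a shared ResNet-50 encoding each page, with pooled features concatenated) and layer sizes are assumptions.

```python
# Sketch of a cross-page table-continuation classifier: two consecutive page images are
# encoded with a shared ResNet-50 and a binary head predicts whether a table continues.
# The fusion scheme and layer sizes are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torchvision


class ContinuationClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        backbone.fc = nn.Identity()                  # yields 2048-d pooled features per page
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Linear(2 * 2048, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, page_a: torch.Tensor, page_b: torch.Tensor) -> torch.Tensor:
        # page_a, page_b: (B, 3, H, W) renders of consecutive pages.
        feats = torch.cat([self.backbone(page_a), self.backbone(page_b)], dim=-1)
        return torch.sigmoid(self.head(feats)).squeeze(-1)    # P(table continues onto page_b)
```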
Strengths: End-to-end, context-aware, single-model table extraction; explicit object graph representation; leverages TATR pretraining for rapid convergence; establishes state-of-the-art results for page-level table structure recognition.
Limitations: Currently restricted to single-page context; text features are limited to PDF extraction rather than learned OCR features; only one binary relation type is modeled, leaving richer relation types as an open opportunity; integration with vision-language models is underexplored.
[POTATR is detailed in: (Smock et al., 11 Dec 2025)]