Vision-Transformer Models for Table Structure Recovery

Updated 6 May 2026

The paper introduces novel vision-transformer models that efficiently recover table structures by integrating separator regression, image-to-sequence decoding, and graph-based reasoning.
The methodology combines CNN backbones with transformer decoders to predict curvilinear separators and autoregressive token sequences, enhancing precision for borderless and multi-page tables.
Reported results are strong: state-of-the-art models reach F1 scores above 95% in complex document scenarios, demonstrating robust table structure recovery.

Table structure recovery from images using vision-transformer models is a core problem in visual document understanding: the task is to reconstruct a machine-readable representation (e.g., HTML, grid matrix, or adjacency serialization) from a rasterized or scanned tabular input. State-of-the-art approaches use transformer-based architectures (DETR-style separator regression, image-to-sequence frameworks, or image-to-graph pipelines) to capture global and local dependencies in the tabular layout, achieving high precision even in challenging cases such as borderless, curved, or multi-page tables.

1. Model Architectures for Table Structure Recovery

Modern vision-transformer frameworks for table structure recovery fall into several design paradigms:

Separator Regression Networks: Approaches such as TSRFormer and SepFormer employ two-stage DETR-style architectures, where parallel “row” and “column” branches predict curvilinear separators directly via transformer decoders. The backbone is typically a CNN (e.g., ResNet-18/34) with FPN, feeding enhanced features to transformer modules that regress separator coordinates at fixed x/y positions. This design eliminates the need for mask-to-line postprocessing and is highly robust to deformations (Lin et al., 2022, Nguyen et al., 27 Jun 2025).
Encoder–Decoder Im2Seq with Transformers: TableFormer, UniTabNet, and transformer models that emit the Optimized Table Structure Language (OTSL) use either hybrid (ConvNet+Transformer) or pure ViT backbones, followed by transformer text decoders that autoregressively output table structure as a token sequence (e.g., HTML, OTSL, or custom cell tokens). Additions such as cell bounding box (BBox) prediction heads and auxiliary decoders (e.g., physical/logical branches in UniTabNet) increasingly unify structure and geometry recovery (Nassar et al., 2022, Lysak et al., 2023, Zhang et al., 2024).
Image-to-Graph Transformers: Page-level and multi-page datasets such as PubTables-v2 support graph-based approaches (e.g., POTATR), combining vision-transformers with object and relation heads to simultaneously group detected table objects (rows, columns, captions) and reconstruct the hierarchical table graph. This pipeline enables reasoning across full-document contexts and rotated or nested structures (Smock et al., 11 Dec 2025).
Vision-Language Models (VLMs): Recent advances extend ViT backbones with modality-aligned text transformers, producing a compact encoding of visual and linguistic signals. Self-supervised pipelines (e.g., TRivia) employ reinforcement learning with a question-answering reward to improve structure recognition without human labels (Zhang et al., 1 Dec 2025).
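To make the separator-regression output concrete, the following minimal sketch represents a curvilinear row separator as one regressed y-value per fixed x position and turns it into a polyline. It is an illustration under stated assumptions, not code from TSRFormer or SepFormer: the function names (`sample_positions`, `row_separator_polyline`, `interpolate_y`), the even spacing of sample positions, and the linear interpolation between samples are all hypothetical choices.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def sample_positions(extent: float, n: int) -> List[float]:
    """Return n evenly spaced positions covering [0, extent] (assumed layout)."""
    if n < 2:
        raise ValueError("need at least two sample positions")
    step = extent / (n - 1)
    return [i * step for i in range(n)]


def row_separator_polyline(ys: List[float], width: float) -> List[Point]:
    """Pair each regressed y-value with its fixed x position."""
    xs = sample_positions(width, len(ys))
    return list(zip(xs, ys))


def interpolate_y(polyline: List[Point], x: float) -> float:
    """Linearly interpolate the separator's y at an arbitrary x."""
    for (x0, y0), (x1, y1) in zip(polyline, polyline[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x lies outside the separator's extent")


# A slightly bowed row separator on a 400-pixel-wide table image.
sep = row_separator_polyline([50.0, 52.0, 56.0, 52.0, 50.0], width=400.0)
print(sep[2])                     # (200.0, 56.0)
print(interpolate_y(sep, 150.0))  # 54.0
```

Because the x positions are fixed, the model only has to predict the y-values, which is why no mask-to-line postprocessing step is required. A column separator would swap the roles of x and y.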

2. Core Methodologies: Separator Regression, Im2Seq, and Graph Recovery

Key operational strategies are distinguished by their approach to representing and decoding table structure:

Direct Line Regression: TSRFormer, SepFormer, and DQ-DETR recast separator detection as regression of curvilinear lines, each discretized at fixed x/y positions and predicted by transformer decoders. Coarse-to-fine refinement is implemented with multi-stage or progressive decoder stacks; prior-enhanced (geometry-aware) matching accelerates convergence and stabilizes set prediction (Lin et al., 2022, Wang et al., 2023, Nguyen et al., 27 Jun 2025). Cell merging is handled by a relation network over grid cell features.
Autoregressive Markup Decoding (Im2Seq): Encoder–decoder models output a sequence of structured tokens. Variants such as OTSL use an optimized tokenization that reduces the vocabulary to 5 tokens and bounds sequence length; this halves inference time relative to HTML-based decoding, and the token grammar constrains outputs to be syntactically valid (Lysak et al., 2023). Transformer decoders attend globally to visual tokens, capturing complex spans and nesting.
Graph Construction and Reasoning: In full-page or document-level settings, models such as POTATR frame table structure as image-to-graph mapping via DETR-style node detection and MLP relation heads. Edges encode parent-child relations (e.g., row-to-table), and inference reconstructs the grid by grouping connected components. This approach is well suited for multi-table pages and multi-page associations (Smock et al., 11 Dec 2025).
Multi-branch Decoding: UniTabNet introduces a divide-and-conquer text decoder emitting <C>/<NL> tokens, each forking into parallel physical (polygon regression) and logical (span attribute) prediction heads. Auxiliary “guiders” align cross-attention to relevant visual regions (Vision Guider) and semantic cell representations (Language Guider) for enhanced structure comprehension (Zhang et al., 2024).
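As an illustration of how a compact token sequence maps back to a grid, the sketch below decodes an OTSL-style sequence into cells with row and column spans. The five token names used here (`C` for a new cell, `L` for "merge with the cell to the left", `U` for "merge with the cell above", `X` for the interior of a two-dimensional span, `NL` for end of row) follow a reading of the OTSL paper and should be treated as an assumption, as should the function names. The sketch only checks that rows are rectangular; it does not enforce the full OTSL grammar.

```python
from typing import List, Tuple

Cell = Tuple[int, int, int, int]  # (row, col, rowspan, colspan)
STRUCTURE_TOKENS = {"C", "L", "U", "X"}


def otsl_to_rows(tokens: List[str]) -> List[List[str]]:
    """Split a flat token sequence into rows at each NL token."""
    rows: List[List[str]] = []
    current: List[str] = []
    for tok in tokens:
        if tok == "NL":
            rows.append(current)
            current = []
        elif tok in STRUCTURE_TOKENS:
            current.append(tok)
        else:
            raise ValueError(f"unknown token: {tok!r}")
    if current:
        raise ValueError("sequence must end with NL")
    if not rows or not rows[0] or any(len(r) != len(rows[0]) for r in rows):
        raise ValueError("rows must be non-empty and of equal length")
    return rows


def otsl_to_cells(tokens: List[str]) -> List[Cell]:
    """Recover (row, col, rowspan, colspan) for every cell origin 'C'."""
    grid = otsl_to_rows(tokens)
    n_rows, n_cols = len(grid), len(grid[0])
    cells: List[Cell] = []
    for r in range(n_rows):
        for c in range(n_cols):
            if grid[r][c] != "C":
                continue
            colspan = 1
            while c + colspan < n_cols and grid[r][c + colspan] == "L":
                colspan += 1
            rowspan = 1
            while r + rowspan < n_rows and grid[r + rowspan][c] == "U":
                rowspan += 1
            cells.append((r, c, rowspan, colspan))
    return cells


# A header cell spanning two columns above two ordinary cells.
print(otsl_to_cells("C L NL C C NL".split()))
# [(0, 0, 1, 2), (1, 0, 1, 1), (1, 1, 1, 1)]
```

Every row has the same number of tokens, so the sequence length is fixed by the grid size rather than by the verbosity of HTML tags, and a non-rectangular output can be rejected immediately.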

3. Training Protocols, Pre-training, and Loss Functions

Transformer-based TSR models incorporate a range of specialized objectives and schedules:

Masked Image Modeling (MIM): Self-supervised pre-training using masked prediction of VQ-VAE image tokens is critical for closing the performance gap between pure linear-projection ViTs and hybrid CNN–ViT models. MIM on in-domain table images increases TEDS scores by up to 12.5pp on complex tables and matches or outperforms CNN–ViT hybrids when fine-tuned (Peng et al., 2024).
Set-based Losses and Matching: All DETR-style architectures rely on bipartite (Hungarian) matching for set prediction. Incorporating geometric priors (e.g., proximity along fixed slices) into the matching cost halves convergence time and makes label assignment more stable (Lin et al., 2022, Wang et al., 2023).
Polygon and Span Losses: Divide-and-conquer models employ cross-entropy for token sequences, MSE for polygon regression (cell corner coordinates), and focal loss for span attribute classification. Homoscedastic uncertainty weighting is adopted to balance the contributions of heterogeneous objectives (Zhang et al., 2024).
Self-Supervised Reinforcement Learning: TRivia introduces group relative policy optimization (GRPO) with QA-driven rewards. Each image is paired with automatically generated, answerable QA pairs, and policy gradients are used to maximize expected QA F1-score over sample groups (Zhang et al., 1 Dec 2025).
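The bipartite matching step used in set prediction can be sketched as follows. This is a dependency-free illustration, not the cost function of any cited paper: the cost is simply the mean absolute distance between predicted and ground-truth separator coordinates at shared fixed sample positions (one assumed form of a geometric proximity term), classification terms are omitted, and the optimal assignment is found by brute force rather than the Hungarian algorithm. The names `separator_cost` and `match_separators` are hypothetical.

```python
from itertools import permutations
from typing import Dict, Sequence


def separator_cost(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute coordinate distance at shared fixed sample positions."""
    if len(pred) != len(gt):
        raise ValueError("separators must share the same sample positions")
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def match_separators(
    preds: Sequence[Sequence[float]], gts: Sequence[Sequence[float]]
) -> Dict[int, int]:
    """Return {gt_index: pred_index} minimizing the total matching cost.

    Brute force over assignments; only practical for a handful of separators.
    """
    if len(gts) > len(preds):
        raise ValueError("need at least as many predictions as ground truths")
    best_cost, best = float("inf"), ()
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(separator_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return dict(enumerate(best))


# Three query outputs, two ground-truth row separators (y-values at 3 fixed x's).
preds = [[100.0, 102.0, 101.0], [10.0, 11.0, 12.0], [55.0, 54.0, 56.0]]
gts = [[10.0, 10.0, 10.0], [55.0, 55.0, 55.0]]
print(match_separators(preds, gts))  # {0: 1, 1: 2}
```

Predictions left unmatched (index 0 above) would be trained toward a "no object" class. For realistic query counts, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would replace the brute-force loop.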

4. Performance Evaluation, Datasets, and Benchmarks

Performance is typically benchmarked with standardized metrics on large-scale datasets:

Dataset/Metric	Definition	Example SOTA Results
PubTabNet (TEDS-Struct)	Tree-edit distance on HTML structure	UniTabNet: 97.5, TRivia: 91.8
WTW (Adjacency F1)	Cell adjacency graph matching (IoU >= 0.6)	UniTabNet: 95.1, SepFormer: 93.9
PubTables-v2 (GriTS)	Grid cell-wise F1 score on multicell alignment	POTATR: 0.9604 (page), 0.980 (cropped)
iFLYTAB (TEDS-Struct)	Hierarchical structure similarity on real/doc scene images	UniTabNet: 94.0
In-house distorted	Custom real-world set for curved, borderless tables	TSRFormer: 95.2 (F1)

On large-scale and real-world document scans, vision-transformer models exhibit robust structure recovery, with per-table structure F1 exceeding 95% and holistic (TEDS, GriTS) metrics peaking above 0.97 (Zhang et al., 2024, Smock et al., 11 Dec 2025, Lin et al., 2022). Optimized tokenization (OTSL) further raises cell mAP and decodes in half the time of HTML-centric models (Lysak et al., 2023).
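To show what an adjacency-based F1 measures, the sketch below extracts horizontal and vertical neighbor relations from a grid of cell identifiers and scores a prediction against the ground truth. It assumes rectangular grids and that predicted cells have already been matched to ground-truth cells so that they share identifiers; real benchmarks perform that matching first (for example by content, or by an IoU threshold as in the WTW row above). The function names are hypothetical.

```python
from typing import List, Set, Tuple

Relation = Tuple[str, str, str]  # (cell_a, cell_b, "h" or "v")


def adjacency_relations(grid: List[List[str]]) -> Set[Relation]:
    """Collect neighbor relations; a spanning cell repeats its id in the grid."""
    rels: Set[Relation] = set()
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if c + 1 < len(row) and row[c + 1] != cell:
                rels.add((cell, row[c + 1], "h"))
            if r + 1 < len(grid) and grid[r + 1][c] != cell:
                rels.add((cell, grid[r + 1][c], "v"))
    return rels


def adjacency_f1(pred: List[List[str]], gt: List[List[str]]) -> float:
    """F1 over the two relation sets (0.0 if nothing matches)."""
    p, g = adjacency_relations(pred), adjacency_relations(gt)
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


# Ground truth: cell A spans both columns of the first row.
gt = [["A", "A"], ["B", "C"]]
# Prediction: the span was missed, producing a spurious cell D.
pred = [["A", "D"], ["B", "C"]]
print(round(adjacency_f1(pred, gt), 3))  # 0.571
```

In this example the missed span costs two of the three ground-truth relations' worth of agreement: only `(A, B, v)` and `(B, C, h)` match, giving precision 2/4 and recall 2/3.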

5. Ablation Analyses and Architectural Insights

Empirical studies elucidate crucial architectural choices:

Cross-Attention Resolution: Sparse sampling of high-resolution features (e.g., only at fixed columns) in transformer cross-attention achieves precise cell/line localization at low compute, matching or exceeding dense pixel-wise attention over the full-resolution map (Lin et al., 2022).
Dynamic Queries and Progressive Regression: DQ-DETR’s progressive pointwise refinement of separator lines outperforms direct regression on distorted or curved tables by up to 2% F1 (Wang et al., 2023).
Early Convolutional Stems: Replacing heavy CNN backbones with optimized shallow convolutional “stems” gives a more favorable balance of receptive field and token sequence length, nearly halving MACs/FLOPs while matching or surpassing ResNet-based models in TEDS (Peng et al., 2023).
Auxiliary Guiders and Loss Weighting: Ablations in UniTabNet confirm that vision and semantic guiders can improve descriptive cell structure accuracy by 2pp, and uncertainty-weighted multi-task losses aid optimization (Zhang et al., 2024).
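The uncertainty-weighted multi-task loss mentioned above can be sketched in its commonly used simplified form, where each task i has a learnable log-variance s_i and contributes exp(-s_i) * L_i + s_i to the total. This is an assumption about the general technique (homoscedastic uncertainty weighting); the exact constants and parameterization used by UniTabNet are not given here and may differ. The function name is hypothetical, and in real training the `log_vars` would be trainable parameters rather than fixed numbers.

```python
import math
from typing import Sequence


def uncertainty_weighted_loss(
    losses: Sequence[float], log_vars: Sequence[float]
) -> float:
    """Sum of exp(-s_i) * L_i + s_i over tasks (simplified form)."""
    if len(losses) != len(log_vars):
        raise ValueError("one log-variance is required per task loss")
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


# Token cross-entropy, polygon regression, and span classification losses.
task_losses = [0.8, 2.0, 0.1]

# With all log-variances at zero, the result is the plain sum of the losses.
print(round(uncertainty_weighted_loss(task_losses, [0.0, 0.0, 0.0]), 3))  # 2.9

# Raising the log-variance of the noisy regression task down-weights it,
# at the price of the additive penalty s_i.
print(uncertainty_weighted_loss(task_losses, [0.0, 1.0, 0.0]))
```

The additive s_i term keeps the optimizer from driving every weight to zero: a task's loss can only be discounted by accepting a larger penalty.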

6. Limitations, Open Problems, and Future Directions

Despite strong empirical results, current vision-transformer models for table structure recovery still face several technical challenges:

Highly curved or blurred separators: Regression methods still approximate such features with discretized strips or straight lines, opening opportunities for spline/Bézier-curve based representations (Nguyen et al., 27 Jun 2025).
Sequence Length/Scale: As table size and structural complexity increase (e.g., multi-page tables, full-page fusion), model sequence length and graph size push hardware and memory limits, motivating research in scalable attention and efficient query design (Smock et al., 11 Dec 2025).
Domain Adaptability: ViT-based approaches benefit from large-scale synthetic or in-domain pre-training, but they are less robust under distribution shift unless adapted with targeted self-supervision (cf. MIM and QA-supervised RL) (Peng et al., 2024, Zhang et al., 1 Dec 2025).
Absence of explicit OCR/Text reasoning: Models focusing solely on geometric structure may misalign cell boundaries when content cues are strong (e.g., borderless descriptive tables), suggesting tighter integration with VLM or multi-modal pipelines (Zhang et al., 2024).

A plausible implication is that future transformer-based table structure recognizers will increasingly integrate explicit graph reasoning, self-supervised multimodal representations, and efficient architectural motifs (e.g., early conv, optimized tokenization) to support robust, scalable recognition across increasingly diverse and challenging document domains.