CascadeTabNet: End-to-End Table Recognition

Updated 3 June 2026

CascadeTabNet is an end-to-end deep learning model that integrates table detection and structure recognition using a Cascade Mask R-CNN architecture with an HRNetV2p_W32 backbone.
It employs a multi-stage refinement process and specialized data augmentations such as dilation and smudge transforms to enhance segmentation accuracy in diverse document images.
Benchmark results on multiple datasets demonstrate that CascadeTabNet achieves state-of-the-art performance, setting new baselines in document image analysis.

CascadeTabNet is an end-to-end deep learning model designed for automatic table recognition from document images. It addresses both table detection and table structure recognition within a single convolutional neural network, leveraging a Cascade Mask R-CNN architecture with a high-resolution backbone (HRNetV2p_W32). The model achieves state-of-the-art performance across multiple public benchmarks, including ICDAR 2013, ICDAR 2019, and TableBank, and demonstrates effective transfer learning coupled with targeted document-centric image augmentations (Prasad et al., 2020).

1. Model Architecture

CascadeTabNet builds on the Cascade Mask R-CNN framework, integrating the HRNetV2p_W32 backbone for preservation of high-resolution spatial features throughout the network. The architecture consists of the following components:

Backbone (HRNetV2p_W32): Maintains parallel multi-resolution branches (C₂, C₃, C₄, C₅) at each stage, with information exchange and a lightweight feature pyramid merging upsampled deepest features (C₅) with shallower ones, enhancing both localization and segmentation performance.
Region Proposal Network (RPN): Slides over aggregated feature maps, generating approximately 1,000 candidate RoIs per image, subsequently filtered by non-maximum suppression.
Cascade Bbox Heads: Three-stage refinement with increasing intersection-over-union thresholds $T_1=0.5$, $T_2=0.6$, $T_3=0.7$ for assigning positive proposals, each stage with independent fully connected layers but shared RoI feature extraction.
Mask Head: Connected only at the final stage, performing per-instance binary mask segmentation.
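
The effect of the increasing cascade thresholds can be illustrated with a minimal sketch. It assumes axis-aligned boxes in (x1, y1, x2, y2) form; the helper names `iou` and `positives_per_stage` are hypothetical and not taken from the CascadeTabNet code. The sketch shows only the thresholding step, whereas a real cascade also re-regresses the boxes between stages.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

CASCADE_IOU_THRESHOLDS = (0.5, 0.6, 0.7)  # T_1, T_2, T_3 from the text


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def positives_per_stage(proposal: Box, gt_boxes: List[Box]) -> List[bool]:
    """For each cascade stage, tell whether the proposal counts as a positive."""
    # Best overlap with any ground-truth box (0.0 when there is none).
    best = max((iou(proposal, gt) for gt in gt_boxes), default=0.0)
    return [best >= t for t in CASCADE_IOU_THRESHOLDS]
```

A proposal that overlaps a ground-truth table with IoU 0.65 would be a positive for the first two stages but not the third, so later stages train only on better-localized boxes.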

Loss Function:

The training objective sums classification and bounding box regression losses for each cascade stage, plus a mask loss at the last stage: $L = \sum_{i=1}^3 w_i (L^{\mathrm{cls}}_i + L^{\mathrm{bbox}}_i) + L^{\mathrm{mask}},\quad w_i = 1$ where

$L^{\mathrm{cls}}_i = -\sum_k y_k \log p_k, \qquad L^{\mathrm{bbox}}_i = \sum_k \mathrm{smoothL1}(t_k - \hat{t}_k)$

$L^{\mathrm{mask}} = -\frac{1}{|M|} \sum_{u,v} [m_{u,v} \log p_{u,v} + (1-m_{u,v}) \log(1-p_{u,v})]$

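Here $y_k$ is the ground-truth class indicator, $p_k$ the predicted class probability, $t_k,\hat{t}_k$ the ground-truth and predicted box offsets, $m_{u,v}$ the ground-truth mask value at pixel $(u,v)$, $p_{u,v}$ the predicted mask probability at that pixel, and $|M|$ the number of mask pixels.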
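
To make the objective concrete, the following sketch evaluates each term on plain Python numbers. It is an illustration of the formulas above, not the training code: the function names are hypothetical, and the smooth L1 transition point `beta = 1.0` is an assumed default that the text does not specify.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """L_cls for one RoI: with one-hot y, only the true-class term survives."""
    return -math.log(probs[target])


def smooth_l1(diff: float, beta: float = 1.0) -> float:
    """Smooth L1 on one offset difference (beta = 1.0 is an assumption)."""
    d = abs(diff)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta


def bbox_loss(t: Sequence[float], t_hat: Sequence[float]) -> float:
    """L_bbox: smooth L1 summed over the box offsets."""
    return sum(smooth_l1(a - b) for a, b in zip(t, t_hat))


def mask_loss(m: Sequence[Sequence[int]], p: Sequence[Sequence[float]]) -> float:
    """L_mask: binary cross-entropy averaged over all mask pixels."""
    total, count = 0.0, 0
    for row_m, row_p in zip(m, p):
        for mv, pv in zip(row_m, row_p):
            total += mv * math.log(pv) + (1 - mv) * math.log(1 - pv)
            count += 1
    return -total / count


def total_loss(stage_cls: Sequence[float], stage_bbox: Sequence[float],
               l_mask: float, weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """L = sum_i w_i (L_cls_i + L_bbox_i) + L_mask, with w_i = 1 by default."""
    return sum(w * (c + b) for w, c, b in zip(weights, stage_cls, stage_bbox)) + l_mask
```

For example, `total_loss([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], 0.25)` sums three stage terms of 1.5 each plus the mask term, giving 4.75.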

2. Table Structure Representation

CascadeTabNet outputs, for each detected table, an instance segmentation mask and a classification label indicating "bordered" or "borderless." For borderless tables, instance masks for each detected cell are also produced.

Table mask: one binary mask per detected table, marking the pixels that belong to it
Cell masks: for each cell of a borderless table, one binary mask marking that cell's pixels

Post-processing for Borderless Tables:

Compute centroid $(x_j, y_j)$ for each cell mask $M_j$
Cluster centroids along the $y$-axis (1D k-means) to determine $R$ rows
Sort cells within each row by $x_j$, forming $C$ columns
Deduce grid lines by averaging adjacent cell box boundaries
Fill missing grid cells via text contour and line intersection detection
Assign row-span or column-span if a mask overlaps more than IoU threshold with multiple grid cells
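
The first three post-processing steps (centroids, row clustering, left-to-right ordering) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' script: cells are represented by bounding boxes rather than masks, the number of rows `n_rows` is supplied by the caller because the text does not say how it is chosen, and the k-means initialization is an arbitrary deterministic one. Grid-line deduction, missing-cell filling, and span assignment are omitted.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def centroid(box: Box) -> Tuple[float, float]:
    """Center point of a cell box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def kmeans_1d(values: List[float], k: int, iters: int = 20) -> List[int]:
    """Plain 1D k-means; returns one cluster rank per value (0 = smallest center)."""
    lo, hi = min(values), max(values)
    # Deterministic initialization: k centers evenly spaced over the value range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: nearest center for every value.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    order = sorted(range(k), key=lambda c: centers[c])
    rank = {c: r for r, c in enumerate(order)}
    return [rank[lab] for lab in labels]


def cells_to_rows(cell_boxes: List[Box], n_rows: int) -> List[List[Box]]:
    """Group cells into rows by centroid y, then sort each row by centroid x."""
    ys = [centroid(b)[1] for b in cell_boxes]
    labels = kmeans_1d(ys, n_rows)
    rows: List[List[Box]] = [[] for _ in range(n_rows)]
    for box, r in zip(cell_boxes, labels):
        rows[r].append(box)
    return [sorted(row, key=lambda b: centroid(b)[0]) for row in rows]
```

Calling `cells_to_rows` on four shuffled boxes that form a 2×2 grid with `n_rows=2` returns two rows of two boxes each, ordered top to bottom and left to right. A cluster that ends up empty yields an empty row, which a fuller implementation would need to handle.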


3. Training Strategies and Data Augmentation

Training is performed in two iterative stages, leveraging transfer learning from large-scale datasets and specialized document augmentations:

Stage 0: Initialize HRNetV2p backbone using ImageNet and COCO weights.
Stage 1: Fine-tune for general table detection (single class) on approximately 1,900 images (ICDAR19, Marmot, Github borderless).
Stage 2: Further fine-tune for table and cell detection (three classes: bordered, borderless, borderless-cell) on 342 annotated ICDAR19 images.

Augmentation Pipeline:

Dilation transform: Binary conversion followed by dilation with a square kernel, thickening strokes to improve robustness.
Smudge transform: Binarization and application of distance transforms (Euclidean, Linear, Max), plus normalization to simulate degraded document conditions.

Both transformations are performed offline, tripling the effective dataset size.
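
The dilation transform can be illustrated with a dependency-free sketch on a small grayscale image stored as nested lists. The binarization threshold of 128 and the default kernel size `k = 2` are assumptions chosen for illustration, and the function names are hypothetical; a practical pipeline would more likely use an image library. The smudge transform is not shown.

```python
from typing import List

Image = List[List[int]]  # binary image: 1 = ink (dark) pixel, 0 = background


def binarize(gray: List[List[int]], threshold: int = 128) -> Image:
    """Mark pixels darker than `threshold` as ink (the threshold is an assumption)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def dilate(img: Image, k: int = 2) -> Image:
    """Thicken ink with a k x k square kernel anchored at its top-left corner."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                # Spread every ink pixel over the k x k window, clipped to the image.
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y + dy, x + dx
                        if yy < h and xx < w:
                            out[yy][xx] = 1
    return out
```

Applied to a white 4×4 image with a single dark pixel, `dilate(binarize(gray))` grows that pixel into a 2×2 block of ink. Saving the original, dilated, and smudged versions of each page is what triples the dataset.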

Key Hyper-parameters:

Parameter	Value(s)	Notes
Learning rate (Stage 1)	0.02; step at 16/19 epochs	General detection
Learning rate (Stage 2)	0.002; step at 8/11 epochs	Structure fine-tuning
Batch size	2 images / GPU	On 1 to 4 GPUs
RPN anchor scales	[32,64,128,256,512]
RPN anchor ratios	[0.5, 1, 2]
RoIs per image	512	Positive fraction: 0.25
Cascade IoU thresholds	[0.5, 0.6, 0.7]	Stages 1–3
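
The step schedules in the table can be read as in the sketch below. The decay factor `gamma = 0.1` and the convention that the drop takes effect at the listed epoch are assumptions, since the table gives only the base rates and step epochs; `lr_at_epoch` is a hypothetical helper, not a framework API.

```python
from typing import Sequence


def lr_at_epoch(base_lr: float, steps: Sequence[int], epoch: int,
                gamma: float = 0.1) -> float:
    """Step schedule: multiply the rate by `gamma` once per step epoch reached."""
    drops = sum(1 for s in steps if epoch >= s)
    return base_lr * gamma ** drops


# Values taken from the hyper-parameter table.
STAGE1 = dict(base_lr=0.02, steps=(16, 19))   # general detection
STAGE2 = dict(base_lr=0.002, steps=(8, 11))   # structure fine-tuning
```

Under these assumptions, Stage 1 would train at 0.02 until epoch 16, at 0.002 until epoch 19, and at 0.0002 afterwards.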

4. Benchmarking and Experimental Results

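CascadeTabNet has been evaluated on standard public datasets. Following the ICDAR 2019 protocol, the tables below report detection scores at IoU thresholds of 0.6, 0.7, 0.8, and 0.9, together with their weighted average (WAvg), in which each score is weighted by its IoU threshold.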

Ablation Studies: Effect of Augmentation

Augmentation setting	IoU=0.6	IoU=0.7	IoU=0.8	IoU=0.9	WAvg
Original only	0.836	0.816	0.787	0.634	0.758
Dilation only	0.869	0.855	0.835	0.705	0.807
Smudge only	0.863	0.853	0.839	0.684	0.801
Both (ours)	0.888	0.884	0.863	0.736	0.835
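
The WAvg column can be checked with a few lines of arithmetic. The sketch assumes that each per-threshold score is weighted by its IoU threshold and that the sum is normalized by the total weight; this assumption reproduces the reported values for the rows above.

```python
from typing import Sequence

IOU_THRESHOLDS = (0.6, 0.7, 0.8, 0.9)


def weighted_average(scores: Sequence[float],
                     thresholds: Sequence[float] = IOU_THRESHOLDS) -> float:
    """Average of per-threshold scores, each weighted by its IoU threshold."""
    return sum(t * s for t, s in zip(thresholds, scores)) / sum(thresholds)


# Per-threshold scores copied from the ablation table.
ablation = {
    "Original only": (0.836, 0.816, 0.787, 0.634),
    "Dilation only": (0.869, 0.855, 0.835, 0.705),
    "Smudge only": (0.863, 0.853, 0.839, 0.684),
    "Both (ours)": (0.888, 0.884, 0.863, 0.736),
}

for name, scores in ablation.items():
    print(f"{name}: {weighted_average(scores):.3f}")
# Original only: 0.758
# Dilation only: 0.807
# Smudge only: 0.801
# Both (ours): 0.835
```

Because higher thresholds carry more weight, the drop at IoU=0.9 dominates the average, which is why the augmented models gain most from their improved tight-localization scores.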

Comparative Performance

Model	IoU=0.6	IoU=0.7	IoU=0.8	IoU=0.9	WAvg
RetinaNet-R101	0.818	0.785	0.762	0.664	0.749
Faster-RCNN-HR	0.889	0.877	0.862	0.781	0.847
Cascade-RCNN-R101	0.929	0.913	0.903	0.852	0.895
Cascade-Mask-RCNN-R50-D	0.912	0.897	0.880	0.834	0.877
Cascade-Mask-RCNN-X101	0.931	0.925	0.909	0.868	0.905
Cascade-Mask-RCNN-HR (ours)	0.941	0.932	0.923	0.886	0.918

Detection and Structure Recognition Benchmarks

ICDAR 2019 Track A Modern (post-competition leaderboard):
- TableRadar: WAvg 0.940
- NLPR-PAL: WAvg 0.927
- CascadeTabNet: WAvg 0.901
TableBank (F1 score):
- Word + Latex: 94.33 (CascadeTabNet) vs. 92.67 (baseline)
- Latex-only: 96.60 (CascadeTabNet) vs. 95.92
- Word-only: 94.92 (CascadeTabNet) vs. 87.67
ICDAR 2013:
- CascadeTabNet: Precision=1.00, Recall=1.00, F1=1.00
- DeepDeSRT: F1=0.962
- TableNet: F1=0.966
ICDAR 2019 Track B2 (structure recognition, WAvg IoU):
- CascadeTabNet: 0.232
- NLPR-PAL: 0.206

5. Implementation and Practical Considerations

CascadeTabNet is implemented using the MMDetection v2.3 framework (PyTorch ≥1.3, Python 3.6+), and code/configurations are available at https://github.com/DevashishPrasad/CascadeTabNet. Training and inference require substantial GPU memory (~2.5 GB for inference, ~8 GB per GPU for training at batch size 2). Single-image inference throughput is approximately 12–15 FPS for 1-megapixel images on an NVIDIA P100.

Key configuration files include cascade_mask_rcnn_hrnetv2p_w32_20e.py for general detection, with modified files for structure recognition. Post-processing for bordered and borderless tables is available in the provided scripts.

Reproducibility:

Employ the identical augmentation pipeline (dilation+smudge) as described.
Use the two-stage fine-tuning protocol.
Fix random seeds for PyTorch, NumPy, and Python to 42.
Employ at least one GPU with ≥12 GB RAM; multi-GPU setups are recommended for Stage 1 fine-tuning.

6. Significance in Document Image Analysis

CascadeTabNet unifies table detection and structure recognition into a single end-to-end learning framework grounded in state-of-the-art deep object detection and segmentation methodologies. By integrating high-resolution feature modeling, multi-stage proposal refinement, and robust data augmentation tailored for documents, it establishes new baselines for automatic table interpretation in diverse real-world scenarios. The approach demonstrates particular gains in challenging settings (e.g., borderless tables, handwritten or degraded documents) and provides a reproducible, extensible platform advancing research in document understanding (Prasad et al., 2020).


References (1)

CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents (2020)
