TableBank Dataset Overview
- TableBank is a large-scale dataset for detecting and recognizing table structures, created using an automated weak-supervision pipeline from Microsoft Word and LaTeX documents.
- Table detection labels show an observed error rate of 0.5% on a hand-checked sample of 1,000 pages, and table structure is encoded as HTML-style tag sequences.
- Faster R-CNN baselines and later transformer-based detectors reach high precision and F1 scores on the benchmark, which is widely used in document image analysis.
TableBank is a large-scale, image-based benchmark dataset developed for table detection and table structure recognition in document images. It is constructed via an automatic weak-supervision pipeline from Microsoft Word and LaTeX documents sourced from the internet, yielding labeled table regions and accompanying structure markup for 417,234 detection instances and 145,463 structure recognition instances. TableBank has become a major resource for training and evaluating deep neural models for document image analysis, with coverage and a label schema designed to support robust learning and cross-domain generalization across varied document sources (Li et al., 2019).
1. Dataset Construction and Annotation Pipeline
TableBank is built via an automatic extraction and weakly supervised annotation process targeting two types of source documents:
- Microsoft Word files: .docx archives are downloaded from the web. Each document is parsed to identify table regions demarcated by <w:tbl> Office XML tags. Colored borders are programmatically inserted around each table in the document's XML, and an annotated PDF is produced.
- LaTeX documents: Source .tex files (primarily harvested from arXiv, 2014–2018) are processed by wrapping every \begin{table}…\end{table} or \begin{tabular}…\end{tabular} environment with a colored frame via the \fcolorbox command, then compiled to PDF.
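The LaTeX branch of this step can be pictured as a source-to-source rewrite. The following is a minimal sketch, not the authors' script: the function name `frame_tabulars`, the frame and fill colors, and the regular expression are assumptions made for illustration. It handles only non-nested tabular environments and leaves table floats untouched.

```python
import re

# Matches one non-nested tabular environment, including its body.
TABULAR = re.compile(r"\\begin\{tabular\}.*?\\end\{tabular\}", re.DOTALL)


def frame_tabulars(tex, frame="red", fill="white"):
    """Wrap every tabular environment in \\fcolorbox{frame}{fill}{...}.

    Hypothetical helper: the colors and the choice to wrap only tabular
    environments are illustrative assumptions.
    """
    def wrap(match):
        # A function replacement avoids backslash-escape processing in re.sub.
        return "\\fcolorbox{%s}{%s}{%s}" % (frame, fill, match.group(0))

    return TABULAR.sub(wrap, tex)


source = r"""Intro text.
\begin{tabular}{ll}
a & b \\
\end{tabular}
Closing text."""

print(frame_tabulars(source))
```

The rewritten source would still need the color or xcolor package loaded in the preamble before it compiles.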
For both sources, table bounding boxes are extracted by aligning the colorized annotated PDF with the original PDF, followed by pixel-wise subtraction to identify table contours. Each table is represented by an axis-aligned bounding box in image coordinates. The annotation process is validated by hand-checking a sample of 1,000 random pages, yielding an observed labeling error rate of just 0.5%, reflecting high annotation fidelity (Li et al., 2019).
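The subtraction step can be illustrated on toy data. The sketch below assumes both page renders are already aligned and given as equally sized grids of pixel values (lists of rows), and it returns a single box around all changed pixels. The function name `diff_bbox` is hypothetical. A page with several tables would additionally need the changed pixels grouped into connected regions, which is omitted here.

```python
def diff_bbox(original, annotated):
    """Return (x, y, width, height) of the area where two aligned pages differ.

    Both inputs are lists of rows of pixel values with identical shape.
    Returns None when the pages are identical.
    """
    xs, ys = [], []
    for y, (row_a, row_b) in enumerate(zip(original, annotated)):
        for x, (a, b) in enumerate(zip(row_a, row_b)):
            if a != b:  # pixel changed by the colored frame
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)


# Toy page: 8 pixels wide, 6 pixels high, with a frame drawn on the copy.
original = [[0] * 8 for _ in range(6)]
annotated = [row[:] for row in original]
for x in range(2, 6):
    annotated[1][x] = annotated[3][x] = 1   # top and bottom edges
for y in range(1, 4):
    annotated[y][2] = annotated[y][5] = 1   # left and right edges

print(diff_bbox(original, annotated))  # (2, 1, 4, 3)
```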
For the structure recognition task, table markup is further canonicalized: for Word, Office XML is converted to an HTML-style tag sequence, while for LaTeX the LaTeXML toolkit generates internal XML, which is then post-processed into compatible HTML-style tags.
2. Dataset Scale, Splits, and Content
TableBank comprises two main tasks: table detection (locating tables in document images) and table structure recognition (serializing structure from cropped table images):
| Task | Total Instances | Word Source | LaTeX Source |
|---|---|---|---|
| Table detection | 417,234 | 163,417 | 253,817 |
| Structure recognition | 145,463 | 56,866 | 88,597 |
For table detection, 2,000 instances are held out for validation and 2,000 for test (1,000 from each source in each split), leaving 413,234 of the 417,234 instances for training. Structure recognition follows a similar scheme, with 1,000 instances each for validation and test.
Every labeled instance for detection specifies a single “table” class and provides an (x, y, width, height) bounding box for every table region in the PDF or image page. For structure recognition, the target output is an HTML-style tag sequence encoding the structure and cell occupancy of the cropped table.
3. Supported Tasks and Label Schema
TableBank supports:
- Table detection: Given a document page, predict all table regions as bounding boxes. Evaluation focuses on coverage and localization.
- Table structure recognition: Given a table image crop, output a structured, HTML-style tag sequence encoding headers, rows, and cell content flags (cell_y for non-empty, cell_n for empty cells).
Bounding box annotations are axis-aligned and page-referenced. Structure labels employ nested tags (tabular, thead, tbody, tr, td, etc.) with normalization to support consistent parsing.
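To make the structure label concrete, here is a minimal sketch that serializes a small table into a tag sequence with cell-occupancy flags. The exact token vocabulary of the released data is not reproduced here: the function name `table_to_tags`, the `header_rows` parameter, and the choice to emit the cell flags directly inside each row are assumptions for illustration.

```python
def table_to_tags(rows, header_rows=1):
    """Serialize a table (list of rows of cell strings) into structure tokens.

    Non-empty cells become <cell_y>, empty cells become <cell_n>.
    The token layout is an illustrative assumption, not the official schema.
    """
    tokens = ["<tabular>"]
    sections = (("thead", rows[:header_rows]), ("tbody", rows[header_rows:]))
    for name, section in sections:
        if not section:
            continue  # skip an empty header or body
        tokens.append("<%s>" % name)
        for row in section:
            tokens.append("<tr>")
            for cell in row:
                tokens.append("<cell_y>" if str(cell).strip() else "<cell_n>")
            tokens.append("</tr>")
        tokens.append("</%s>" % name)
    tokens.append("</tabular>")
    return tokens


print(" ".join(table_to_tags([["Name", "Score"], ["A", ""]])))
```

Only the layout and the empty/non-empty flag are encoded, so cell text never appears in the target sequence.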
4. Evaluation Protocols and Metrics
Detection
- Performance is assessed by area-based metrics:
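- Precision: $P = \frac{\text{area of ground-truth regions inside detected regions}}{\text{area of all detected table regions}}$
- Recall: $R = \frac{\text{area of ground-truth regions inside detected regions}}{\text{area of all ground-truth table regions}}$
- F1: Harmonic mean of precision and recall, $F_1 = \frac{2PR}{P + R}$.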
- Metrics are aggregated globally per test split. No instance-level IoU threshold is mandated; overlaps are accounted for at pixel/area granularity.
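The area-based protocol can be sketched with a pixel-set computation. This is an illustrative implementation under stated assumptions, not the reference evaluation script: boxes are integer (x, y, width, height) tuples, areas are summed over all pages before the ratios are taken, and the names `pixels` and `area_prf` are hypothetical. Rasterizing into sets is slow for full-resolution pages but handles overlapping boxes without double counting.

```python
def pixels(boxes):
    """Return the set of integer pixel coordinates covered by the boxes."""
    covered = set()
    for x, y, w, h in boxes:
        for py in range(y, y + h):
            for px in range(x, x + w):
                covered.add((px, py))
    return covered


def area_prf(pages):
    """Compute area-based precision, recall, and F1 over a list of pages.

    Each page is a (ground_truth_boxes, detected_boxes) pair.
    """
    overlap = gt_area = det_area = 0
    for gt_boxes, det_boxes in pages:
        gt, det = pixels(gt_boxes), pixels(det_boxes)
        overlap += len(gt & det)
        gt_area += len(gt)
        det_area += len(det)
    precision = overlap / det_area if det_area else 0.0
    recall = overlap / gt_area if gt_area else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One page: the detection covers half of a 10x10 ground-truth table.
print(area_prf([([(0, 0, 10, 10)], [(5, 0, 10, 10)])]))  # (0.5, 0.5, 0.5)
```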
Recognition
- Sequence generation accuracy is measured using 4-gram BLEU scores against the ground-truth tag sequence.
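For intuition, 4-gram BLEU on tag sequences can be sketched as follows. This is a simplified single-reference version with uniform weights, a brevity penalty, and no smoothing. The tooling and settings used to produce the reported scores may differ, and the name `bleu4` is hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(reference, candidate):
    """Single-reference BLEU with uniform weights over 1- to 4-grams."""
    if not candidate:
        return 0.0
    log_precision = 0.0
    for n in range(1, 5):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty n-gram level gives zero
        log_precision += 0.25 * math.log(clipped / total)
    # Brevity penalty for candidates shorter than the reference.
    if len(candidate) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * math.exp(log_precision)


truth = "<tabular> <tr> <cell_y> <cell_y> </tr> <tr> <cell_y> <cell_n> </tr> </tabular>".split()
guess = "<tabular> <tr> <cell_y> <cell_y> </tr> <tr> <cell_y> <cell_y> </tr> </tabular>".split()

print(bleu4(truth, truth))  # 1.0
print(bleu4(truth, guess))  # below 1.0: one cell flag is wrong
```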
Alternative evaluation protocols using intersection-over-union (IoU) thresholds, mean Average Precision (mAP), and Average Recall (AR) are also employed in later studies (Shehzadi et al., 2023), adapting COCO-style object detection metrics.
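The IoU criterion behind these COCO-style metrics can be sketched for the (x, y, width, height) box format. The function names and the 0.5 default threshold are illustrative assumptions. COCO-style mAP additionally averages over several thresholds and ranks detections by confidence, which is not shown here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def is_match(detected, truth, threshold=0.5):
    """Treat a detection as correct when its IoU reaches the threshold."""
    return iou(detected, truth) >= threshold


print(iou((0, 0, 10, 10), (5, 0, 10, 10)))       # 50 / 150, about 0.333
print(is_match((0, 0, 10, 10), (1, 0, 10, 10)))  # True: IoU is 90 / 110
```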
5. Baseline Models and Experimental Results
The original TableBank evaluation (Li et al., 2019) uses:
- Table detection: Faster R-CNN with ResNeXt-101/152 backbones, implemented in Detectron.
- Table structure recognition: CNN encoder with attention-based RNN decoder, implemented in OpenNMT.
Baseline detection performance with the ResNeXt-152 backbone (precision, recall, and F1) is as follows:
| Domain | Precision | Recall | F1 |
|---|---|---|---|
| Word-only | 0.9530 | 0.8829 | 0.9166 |
| LaTeX-only | 0.9867 | 0.9754 | 0.9810 |
| Mixed | 0.9657 | 0.8989 | 0.9311 |
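The F1 column can be cross-checked against the precision and recall columns, since F1 is their harmonic mean. A short sketch using the values from the table (the helper name `f1` is illustrative):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


rows = {
    "Word-only": (0.9530, 0.8829, 0.9166),
    "LaTeX-only": (0.9867, 0.9754, 0.9810),
    "Mixed": (0.9657, 0.8989, 0.9311),
}

for domain, (p, r, reported) in rows.items():
    # The recomputed value should agree with the reported F1 to four decimals.
    print(domain, round(f1(p, r), 4), reported)
```

All three recomputed values agree with the reported F1 scores to four decimal places.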
Structure recognition reaches BLEU scores of 0.7507 on the Word test set and 0.7653 on the LaTeX test set when models are trained and tested in-domain; scores drop when a model trained on one source is tested on the other. Failure analysis of the detector identifies missed (undetected) tables and partial detections as the most common errors.
Recent work evaluates transformer-based object detectors (e.g., DETR) on TableBank (Shehzadi et al., 2023) and reports further improvements:
- Best mAP on TableBank: 96.9% (DETR + anchor/points + positive/negative noise, Dilation+Smudge pre-processing).
- Comparison to R-CNNs: HybridTabNet and CasTabDetectoRS perform strongly; new DETR variants can surpass them, especially after augmenting object queries.
- Ablation: N = 10 object queries performs best, and Dilation+Smudge pre-processing yields up to a 1% relative mAP gain.
6. Pre-processing, Variants, and Downstream Usage
Multiple strategies have been studied for improving detector robustness:
- Pre-processing: Image dilation (2×2 kernel), "smudge" (distance-based black-pixel spread), or both; the combination leads to the highest mAP.
- Query variations in DETR: Object queries are encoded as points, anchor boxes, or anchor boxes with positive/negative noise. The anchor box + noise scheme improves robustness to table size and position variation.
- Generalization challenges: Table layouts vary widely (with/without ruling lines, multi-column/row spans), posing difficulties for vanilla CNN or transformer detectors. Small or densely packed tables remain failure modes.
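The dilation pre-processing listed above can be illustrated on a binary image. The sketch below is a pure-Python stand-in for a library morphology call and rests on several assumptions: ink pixels are stored as 1 and background as 0, the 2×2 kernel is anchored at its top-left corner so ink spreads one pixel right and down, and the name `dilate_ink` is hypothetical. The smudge transform is not shown.

```python
def dilate_ink(image, kernel=2):
    """Dilate ink pixels (value 1) of a binary image given as a list of rows.

    Every ink pixel is stamped onto a kernel x kernel block whose top-left
    corner is the pixel itself. The anchor choice is an assumption.
    """
    if not image:
        return []
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x]:
                for dy in range(kernel):
                    for dx in range(kernel):
                        if y + dy < height and x + dx < width:
                            out[y + dy][x + dx] = 1
    return out


page = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
for row in dilate_ink(page):
    print(row)
# [0, 0, 0]
# [0, 1, 1]
# [0, 1, 1]
```

Thickening strokes in this way makes thin ruling lines and small text more prominent before the page is passed to the detector.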
TableBank is often used alongside other graphical object detection datasets such as PubLayNet and PubTables, enabling systematic comparison across academic and commercial document domains (Shehzadi et al., 2023).
7. Access, Licensing, and Community Adoption
TableBank is publicly available with full datasets, task-specific splits, and baseline code at https://github.com/doc-analysis/TableBank (Li et al., 2019). The data is organized by source (Word, LaTeX), split (train/val/test), and task (detection, structure recognition). Users are directed to the repository for licensing details. The benchmark has powered the development and evaluation of new deep document understanding architectures, including both R-CNN and transformer-based detectors, and is integral to performance reporting for table detection and recognition tasks in the literature (Li et al., 2019, Shehzadi et al., 2023).