PubTables-v2: Multi-Page Scientific Table Extraction
- PubTables-v2 is a large-scale annotated dataset designed for comprehensive extraction of tables from scientific PDFs, emphasizing multi-page and full-page cases.
- It comprises millions of object-level and cell-level annotations across cropped tables, single pages, and full documents sourced from biomedical literature.
- The dataset supports advanced extraction tasks—including table detection, structure recognition, and multi-page reconstruction—benchmarking state-of-the-art models with detailed metrics.
PubTables-v2 is a large-scale annotated dataset designed for the comprehensive extraction of tables from scientific PDFs, with a particular emphasis on challenging full-page and multi-page table scenarios. It is the first large-scale benchmark that explicitly supports multi-page table structure recognition, and it addresses the scarcity of annotated data that previously limited progress in vision-language models (VLMs) and end-to-end table extraction research. The dataset comprises millions of structured annotations drawn from biomedical and clinical literature, collected from PubMed articles published between 2023 and 2025, and provides both object-level and cell-level labels for a suite of table extraction and understanding tasks (Smock et al., 11 Dec 2025).
1. Dataset Composition and Statistics
PubTables-v2 consists of three primary “collections” capturing different levels of context:
- Cropped Tables: 135,578 images, each containing a single large table (≥30 rows or ≥12 columns).
- Single Pages: 467,541 PDF page images with a total of 548,414 annotated tables (~1.17 tables/page).
- Full Documents: 9,172 multi-page PDF documents housing 24,862 tables.
A distinctive feature is the large-scale inclusion of multi-page tables: within the Full Documents collection, 9,492 tables span at least two pages, comprising 38.2% of those tables (and approximately 1.7% of the total dataset). The span distribution is heavily skewed—7,817 tables over 2 pages, 1,134 over 3 pages, down to a handful spanning as many as 13 pages.
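The ratios quoted above can be checked directly from the stated counts. The short sketch below only re-derives the tables-per-page figure and the multi-page share of the Full Documents collection; the variable names are illustrative and not part of the dataset.

```python
# Counts quoted in the text; variable names are illustrative only.
single_pages = 467_541
single_page_tables = 548_414
full_doc_tables = 24_862
multi_page_tables = 9_492

# Average number of annotated tables per page in the Single Pages collection.
tables_per_page = single_page_tables / single_pages

# Share of Full Documents tables that span at least two pages.
multi_page_share = multi_page_tables / full_doc_tables

print(f"tables per page: {tables_per_page:.2f}")
print(f"multi-page share of Full Documents tables: {multi_page_share:.1%}")
```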
Every cell in each annotated table is labeled with its textual content, though the precise aggregate cell count is not enumerated. The dataset’s entire domain is biomedical/clinical, reflecting the PubMed corpus, with no subdomain breakdown provided.
2. Annotation Schema and Data Format
PubTables-v2 is annotated for both object detection and structure recognition, supporting a range of downstream models:
- Object-level annotations are provided in PASCAL-VOC style XML with one object per line (class and bounding box). There are 8 primary classes: table, column, row, column_header, spanning_cell, projected_row_header, caption, and footer, as well as 8 rotated versions for objects rotated by 90°.
- Structure recognition (TSR) labels use a grid or matrix stored in JSON, compatible with the GriTS metric; each cell encodes coverage, cell spans, and text.
- Relational graph annotations: page-level objects are connected by explicit “has_child” relations.
- Multi-page tables: JSON encodes logical tables spanning pages with a list mapping table ID to the page-bbox pairs that constitute corresponding fragments.
Table structure forms a graph with nodes as annotated objects and edges as explicit parent-child relationships. For multi-page tables, all fragments with the same table_id are treated as a single logical table for TSR.
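As a rough illustration of how such annotations might be consumed, the sketch below parses a PASCAL-VOC style XML snippet and groups multi-page fragments by table_id. The XML layout follows the generic VOC convention, and the fragment keys (`table_id`, `page`, `bbox`) and all values are assumptions made for this example, not the confirmed PubTables-v2 file format.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical VOC-style snippet (generic VOC tags, made-up coordinates).
VOC_XML = """
<annotation>
  <object><name>table</name>
    <bndbox><xmin>50</xmin><ymin>100</ymin><xmax>550</xmax><ymax>700</ymax></bndbox>
  </object>
  <object><name>row</name>
    <bndbox><xmin>50</xmin><ymin>100</ymin><xmax>550</xmax><ymax>130</ymax></bndbox>
  </object>
</annotation>
"""

def parse_voc_objects(xml_text):
    """Return a list of (class_name, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text.strip())
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = tuple(float(box.find(k).text)
                       for k in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((obj.find("name").text, coords))
    return objects

# Hypothetical multi-page records: one entry per table fragment.
FRAGMENTS = [
    {"table_id": "t1", "page": 4, "bbox": [40, 60, 560, 400]},
    {"table_id": "t1", "page": 3, "bbox": [40, 300, 560, 740]},
    {"table_id": "t2", "page": 7, "bbox": [40, 80, 560, 300]},
]

def group_fragments(fragments):
    """Map each table_id to its (page, bbox) fragments in page order."""
    tables = defaultdict(list)
    for frag in fragments:
        tables[frag["table_id"]].append((frag["page"], frag["bbox"]))
    return {tid: sorted(parts) for tid, parts in tables.items()}

objects = parse_voc_objects(VOC_XML)
logical_tables = group_fragments(FRAGMENTS)
print(objects)
print(logical_tables)
```

Grouping by table_id mirrors the rule stated above: fragments that share an ID are treated as one logical table.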
3. Dataset Splits and Usage Protocols
The dataset provides public and hidden test splits to facilitate robust benchmarking and prevent contamination:
- Cropped Tables: Public train/val/test sets plus a hidden test set containing 5,804 samples (4.3%). Public subsets sum to 129,774 samples.
- Single Pages: Public train/val/test and hidden test (∼4–5% of pages).
- Full Documents: Held-out test set of 878 documents; user-defined train/val splits for the remainder.
Hidden test sets serve to detect overfitting and pre-training leakage. There are no official cross-validation folds, but standard k-fold is recommended on the public splits. For domain adaptation scenarios, fine-tuning on single-page collections before document-level evaluation is suggested.
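Since no official folds are provided, a k-fold split has to be built by the user. The following is a minimal sketch using only the standard library; splitting by document ID (so that pages from one article never appear in both train and validation) is an assumption of this example rather than a documented protocol, and the IDs are invented.

```python
import random

def k_fold_splits(item_ids, k=5, seed=0):
    """Yield (train_ids, val_ids) pairs for k-fold cross-validation."""
    ids = sorted(item_ids)               # fixed order so the seed is reproducible
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i in range(k):
        val = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, val

# Hypothetical document IDs from a public split.
doc_ids = [f"doc_{i}" for i in range(10)]
splits = list(k_fold_splits(doc_ids, k=5))
for train, val in splits:
    print(len(train), len(val))
```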
4. Supported Tasks, Evaluation, and Metrics
PubTables-v2 enables a broad array of table extraction and understanding tasks:
- Table Detection (TD): Locating and bounding tables within a page.
- Table Structure Recognition (TSR): Recovering segmentation into rows, columns, and spanning cells.
- Cell Content Extraction (CE): Recognizing and extracting cell-level text content.
- End-to-End Page-Level Table Extraction: Simultaneous detection and structure/content recovery for every table on a page.
- Multi-Page Table Reconstruction: Merging table fragments across pages into a unified structure.
- Cross-Page Table Continuation: Binary classification of whether a table on one page continues onto the next page.
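The last two tasks are related: pairwise continuation decisions can be chained to recover which pages belong to the same logical table. The sketch below shows that idea under a strong simplifying assumption, namely one candidate table fragment per page; the function name and inputs are hypothetical.

```python
def chain_pages(pages, continues):
    """Group consecutive pages into logical tables.

    continues[i] is True when the table on pages[i] is predicted to
    continue onto pages[i + 1].
    """
    if not pages:
        return []
    if len(continues) != len(pages) - 1:
        raise ValueError("need exactly one decision per consecutive page pair")
    groups = [[pages[0]]]
    for page, cont in zip(pages[1:], continues):
        if cont:
            groups[-1].append(page)   # extend the current logical table
        else:
            groups.append([page])     # start a new logical table
    return groups

groups = chain_pages([1, 2, 3, 4, 5], [True, True, False, False])
print(groups)
```

Here pages 1 to 3 form one three-page table, while pages 4 and 5 each hold a separate table.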
The dataset provides ground truth for the following metrics:
| Metric | Description | Formula / Notes |
|---|---|---|
| Precision | Ratio of true positives among predictions | $P = \frac{TP}{TP + FP}$ |
| Recall | Ratio of true positives among all ground-truth objects | $R = \frac{TP}{TP + FN}$ |
| F1 | Harmonic mean of precision and recall | $F_1 = \frac{2PR}{P + R}$ |
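| $\mathrm{GriTS}_{\mathrm{Top}}$, $\mathrm{GriTS}_{\mathrm{Con}}$ | Grid-based pseudo-F1 scores; topological vs. content-aware | $\mathrm{GriTS}_f(A, B) = \frac{2 \sum_{i,j} f(\tilde{A}_{i,j}, \tilde{B}_{i,j})}{\lvert A \rvert + \lvert B \rvert}$, with $\tilde{A}$, $\tilde{B}$ the most similar 2D substructures of grids $A$ and $B$ |
| $\mathrm{Acc}_{\mathrm{Top}}$, $\mathrm{Acc}_{\mathrm{Con}}$ | Fraction of tables whose predicted structure exactly matches the ground truth | – |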
| Edge F1 (graph) | F1 over predicted graph relations (IoU≥0.8) | – |
Editor's term: “TE” denotes Table Extraction.
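To make the detection metrics in the table concrete, here is a minimal sketch that computes precision, recall, and F1 for predicted table boxes using greedy IoU matching. The greedy one-to-one matching and the default threshold of 0.5 are assumptions of this example, not the benchmark's official evaluation code.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def precision_recall_f1(preds, gts, thr=0.5):
    """Greedily match each prediction to its best unmatched ground truth."""
    unmatched = list(gts)
    tp = 0
    for p in preds:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)    # each ground truth matches at most once
            tp += 1
    fp = len(preds) - tp
    fn = len(gts) - tp
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

preds = [(0, 0, 10, 10), (100, 100, 110, 110)]
gts = [(0, 0, 10, 10), (50, 50, 60, 60), (20, 20, 30, 30)]
precision, recall, f1 = precision_recall_f1(preds, gts)
print(precision, recall, f1)
```

In this toy case one of two predictions is correct and one of three ground-truth tables is found, so precision is 1/2 and recall is 1/3.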
5. Baseline Methods and Benchmark Results
Multiple model paradigms are evaluated using PubTables-v2:
- Vision-language models (VLMs): Qwen2.5-VL-3B, granite-vision-3.2-2B, SmolDocling-256M-preview, GraniteDocling.
- Non-VLM baselines: TATR-v1.1-Pub (Table Transformer trained on PubTables-1M) and TATR-v1.2-Pub (fine-tuned on PubTables-v2 cropped tables).
- Proposed model: POTATR-v1.0-Pub (Page-Object Table Transformer), which extends the Table Transformer to image-to-graph page-level extraction. POTATR uses a DETR backbone (initialized from TATR-v1.1-Pub), 250 object queries, and a dedicated relation head for "has_child" edge prediction. The output consists of a class and bounding box for each object, plus an adjacency matrix over the objects.
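An image-to-graph output of this kind can be turned into explicit parent-child edges by thresholding the adjacency scores. The sketch below assumes that entry `adjacency[i][j]` scores the relation "object i has_child object j" and uses an arbitrary threshold of 0.5; both are illustrative assumptions, not details confirmed for POTATR.

```python
def decode_relations(adjacency, threshold=0.5):
    """Return (parent_index, child_index) pairs whose score passes the threshold."""
    edges = []
    for parent, row in enumerate(adjacency):
        for child, score in enumerate(row):
            if parent != child and score >= threshold:
                edges.append((parent, child))
    return edges

# Hypothetical model output: four detected objects and their pairwise scores.
classes = ["table", "row", "column", "caption"]
adjacency = [
    [0.0, 0.9, 0.8, 0.1],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

edges = decode_relations(adjacency)
named_edges = [(classes[p], classes[c]) for p, c in edges]
print(named_edges)
```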
Key results on the test sets:
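- Cropped TSR (long/wide only):
  - TATR-v1.2-Pub achieves $\mathrm{GriTS}_{\mathrm{Top}}$ = 0.9803, $\mathrm{GriTS}_{\mathrm{Con}}$ = 0.9801, $\mathrm{Acc}_{\mathrm{Top}}$ = 0.6872, $\mathrm{Acc}_{\mathrm{Con}}$ = 0.6831.
  - granite-vision-2B: GriTS = 0.8714, Acc = 0.2155.
- Page-Level TSR:
  - POTATR-v1.0-Pub: $\mathrm{GriTS}_{\mathrm{Top}}$ = 0.9604, $\mathrm{GriTS}_{\mathrm{Con}}$ = 0.9573, $\mathrm{Acc}_{\mathrm{Top}}$ = 0.7377, $\mathrm{Acc}_{\mathrm{Con}}$ = 0.6643.
  - granite-vision-2B: GriTS = 0.8015, Acc = 0.4245.
- Document-Level Multi-Page TE: Qwen2.5-VL-3B achieves $\mathrm{GriTS}_{\mathrm{Top}}$ = 0.0775, $\mathrm{GriTS}_{\mathrm{Con}}$ = 0.0472.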
- Cross-Page Table Continuation: ViT-B-16 achieves recall 0.995, precision 0.987, F1 0.991.
The results show that VLMs still lag behind DETR-based architectures on both page-level and multi-page table extraction.
6. Notable Examples and Annotation Cases
Selected annotation instances highlight core capabilities:
- Single-page extract: Bounding boxes for each table component (e.g., table, column, row, spanning cell, etc.) alongside explicit parent–child relations.
- Multi-page table: Tables spanning three pages are annotated by linking corresponding bounding boxes across pages via table_id.
- Cropped TSR annotation: Grid-based JSON structure for large tables, supporting detailed cell-wise evaluation.
- Cross-page classification: Page pairs annotated as positive (continuing table) or negative, serving cross-page linking models.
Sample JSON structures clarify how objects, relations, and grids are encoded, supporting both image-to-graph and grid-based learning.
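As one possible reading of the grid-based encoding, the sketch below expands a list of cells with row and column spans into a dense matrix in which every grid position covered by a spanning cell carries that cell's text. The cell keys (`row`, `col`, `rowspan`, `colspan`, `text`) and the sample values are hypothetical and chosen only for illustration.

```python
def cells_to_grid(cells, n_rows, n_cols):
    """Expand spanning cells into an n_rows x n_cols matrix of cell text."""
    grid = [[None] * n_cols for _ in range(n_rows)]
    for cell in cells:
        row_end = cell["row"] + cell.get("rowspan", 1)
        col_end = cell["col"] + cell.get("colspan", 1)
        for r in range(cell["row"], row_end):
            for c in range(cell["col"], col_end):
                grid[r][c] = cell["text"]   # each covered position repeats the text
    return grid

# Hypothetical cells: a header spanning two columns above two data cells.
cells = [
    {"row": 0, "col": 0, "colspan": 2, "text": "Dose"},
    {"row": 1, "col": 0, "text": "5 mg"},
    {"row": 1, "col": 1, "text": "10 mg"},
]

grid = cells_to_grid(cells, n_rows=2, n_cols=2)
print(grid)
```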
7. Limitations and Future Directions
PubTables-v2 has several documented limitations:
- Domain Bias: All source PDFs are from PubMed, which introduces style and structure biases representative of biomedical literature.
- Annotation Caveats: The Full Documents collection annotates table-level bounding boxes only (not individual rows/columns) to reduce cost. Very rare rotated tables pose challenges for relation-head approaches.
- Model Gaps: VLMs perform below specialized DETR-based models on many TE benchmarks. Some limited annotation noise exists, particularly in DocTags to HTML parsing for SmolDocling.
Planned or recommended dataset extensions include:
- Expanding the domain to include legal, financial, and patent documents,
- Incorporating more complex hierarchical relations (such as cross-references to figures),
- Enhancing multi-page sequence modeling beyond current pairwise approaches,
- Releasing larger hidden splits to safeguard against future training contamination,
- Exploring pre-training of VLMs directly on PubTables-v2 to mitigate performance disparities with DETR-based models (Smock et al., 11 Dec 2025).
PubTables-v2 thus establishes a comprehensive, multi-faceted benchmark for table extraction research, presenting a foundation for advancing document understanding methods in both focused and open-domain scientific settings.