Hierarchical PDF Segmentation
- Hierarchical PDF segmentation is the process of extracting nested document elements from PDFs and organizing them into a structured tree hierarchy.
- It integrates semantic segmentation, instance detection, and LLM-based parsing to tackle varied layouts in scanned forms, legal texts, and textbooks.
- Practical implementations use advanced CNN backbones, heuristic rules, and TOC extraction to accurately determine multi-level containment and reading order.
Hierarchical PDF segmentation is the automated extraction of the nested, multi-level logical structure of documents encoded in the PDF format, producing a representation where document elements (e.g., sections, paragraphs, tables, fields) are organized into a hierarchy that reflects their containment and reading order relations. This task encompasses segmentation of both visually complex scanned documents and digitally typeset files, requiring systems to operate with minimal external metadata to infer rich, arbitrarily deep structural trees. Solutions combine semantic segmentation, instance detection, layout analysis, and relation modeling, leveraging both classic computer vision and state-of-the-art machine learning techniques, including CNNs, instance-segmentation backbones, weak supervision pipelines, and, most recently, LLM-enabled structure inference.
1. Hierarchical Structure Representations
All hierarchical PDF segmentation approaches output representations that encode both entities and their relations in a tree or tree-like structure. DocParser (Rausch et al., 2019) formalizes each document page as a pair (E, R), where E is the set of detected entities, each with a semantic type, bounding box, and confidence score, and R is a set of typed binary relations. The core structural relations are parent_of, signifying containment/nesting, and followed_by, signifying reading order among siblings. The hierarchy is thus an ordered rooted tree, constrained so every non-top-level entity has one parent and siblings are totally ordered.
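A minimal sketch of such an entity–relation encoding, using DocParser's parent_of/followed_by predicates; the field names and helper below are illustrative, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    eid: int
    etype: str        # semantic type, e.g. "section", "table"
    bbox: tuple       # (x0, y0, x1, y1)
    confidence: float = 1.0

@dataclass
class Relation:
    subject: int      # entity id
    predicate: str    # "parent_of" or "followed_by"
    obj: int          # entity id

def children_in_order(relations, parent_id):
    """Return the ordered children of `parent_id`: membership comes from
    parent_of links, the total sibling order from the followed_by chain."""
    kids = {r.obj for r in relations
            if r.predicate == "parent_of" and r.subject == parent_id}
    nxt = {r.subject: r.obj for r in relations
           if r.predicate == "followed_by" and r.subject in kids and r.obj in kids}
    first = kids - set(nxt.values())      # chain head has no predecessor
    order, cur = [], (first.pop() if first else None)
    while cur is not None:
        order.append(cur)
        cur = nxt.get(cur)
    return order
```

Because siblings are totally ordered by followed_by, walking the chain from its head recovers the reading order without any geometric reasoning.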
In form-centric pipelines such as that of Sarkar et al. (Sarkar et al., 2019), document structure is decoded into four explicit levels—TextRuns/Widgets (leaves), TextBlocks/ChoiceGroup Titles (blocks), TextFields/ChoiceFields (fields), and ChoiceGroups (containers)—each corresponding to a CNN segmentation head. The hierarchy is made explicit during annotation via per-instance instance masks and containment links, and is enforced in the model via multi-head supervision.
In textbook and legal document contexts (Wehnert et al., 31 Aug 2025), the output is a nested segmentation over text spans, with hierarchy levels tied to sectioning structure (e.g., chapters, sections, subsections), and boundaries inferred from headings detected via TOC, layout, or LLMs.
2. Algorithmic Approaches
The three dominant families of algorithms for hierarchical PDF segmentation are:
| Approach | Key Characteristics | Exemplars |
|---|---|---|
| Hierarchical Semantic Segmentation | Strip-wise, high-res, multi-level semantic masks, CNN-based | Sarkar et al. (Sarkar et al., 2019) |
| Instance Segmentation + Heuristics | Per-entity detection (boxes/masks) + rule-based tree assembly | DocParser (Rausch et al., 2019) |
| Headline Detection & Parsing | Heading detectors via TOC, layout, or LLM, then tree induction | HiPS (Wehnert et al., 31 Aug 2025) |
Hierarchical Semantic Segmentation: Sarkar et al. employ a deep CNN operating on high-resolution overlapping image strips, where the segmentation mask of one strip acts as a prior for the next. The model jointly outputs semantic segmentation masks for each hierarchy level, enforcing level consistency and enabling fine-grained compositional parsing (Sarkar et al., 2019). This strategy addresses challenges of continuity for structures spanning multiple image tiles and resolves both visual segmentation and containment simultaneously.
Instance Segmentation & Heuristics: DocParser detects all entities (tables, figures, headings, etc.) as instances using a Mask R-CNN backbone, then applies a grammar-aware cascade of deterministic heuristic rules to infer nesting (parent–child) and sibling order relations based on geometric containment, domain grammars, and layout (Rausch et al., 2019). Weak supervision is leveraged during training using noisy annotations derived automatically from the papers' LaTeX sources, ensuring scalability even with limited ground-truth data.
Section Headline Detection & Parsing: The HiPS system (Wehnert et al., 31 Aug 2025) approaches hierarchical segmentation for complex books by first detecting section titles via TOC extraction, document layout parsing, or LLM-based refinement (sometimes with OCR features). It assigns hierarchy levels to each detected heading and infers section boundaries by matching headings to text locations via normalized, substring, or fuzzy criteria. Tree structure is then induced by the heading sequence and assigned levels.
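The last step, inducing a tree from a flat sequence of leveled headings, can be sketched with a standard stack-based pass (a generic illustration of the idea, not HiPS's exact code):

```python
def induce_tree(headings):
    """Build a section tree from a flat list of (title, level) pairs:
    each heading becomes a child of the most recent heading with a
    strictly smaller level."""
    root = {"title": "ROOT", "level": 0, "children": []}
    stack = [root]
    for title, level in headings:
        node = {"title": title, "level": level, "children": []}
        # Pop until the stack top can legally contain this heading.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root
```

The stack invariant (strictly increasing levels from root to top) guarantees every heading is attached to its nearest enclosing section, whatever the hierarchy depth.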
3. Model Architectures and System Design
Strip-based Hierarchical Segmentation (Sarkar et al., 2019):
- Input: Overlapping horizontal strips of the grayscale page image, concatenated channel-wise with a binary prior mask from the preceding strip.
- Image Encoder: Stacked 2D convolutions, ReLU activations, and downsampling. Multiple skip connections allow fine detail recovery.
- Context Encoder: Four blocks of bi-directional 1D dilated convolutions applied vertically and horizontally, capturing long-range dependencies critical for hierarchical parsing of forms.
- Decoder: U-Net–style upsampling with skip connections. Each upsampled representation passes through independent 1x1 conv "heads", each yielding a per-class softmax mask for its hierarchy level.
- Prior Propagation: At inference, the predicted mask’s bottom overlap region is used as the prior for the next strip, maintaining structure continuity across segments.
- Loss: Multi-head cross-entropy, optionally augmented by a prior-consistency MSE term over the overlap region; empirically, cross-entropy supervision alone suffices.
- Post-processing: Convex hull smoothing for instance crispness; DOM-tree assembly via containment of polygons.
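The prior-propagation loop can be illustrated as follows; `predict` stands in for the CNN, and the strip height and overlap values are illustrative, not the paper's settings:

```python
import numpy as np

def run_in_strips(page, predict, strip_h=512, overlap=64):
    """Strip-wise inference with prior propagation: `predict(strip, prior)`
    returns a mask the same height as `strip`; the bottom `overlap` rows of
    each prediction seed the prior fed to the next strip."""
    H, W = page.shape
    prior = np.zeros((strip_h, W), dtype=page.dtype)
    masks, y = [], 0
    step = strip_h - overlap
    while y < H:
        strip = page[y:y + strip_h]
        h = strip.shape[0]                 # last strip may be shorter
        mask = predict(strip, prior[:h])
        masks.append((y, mask))
        # Carry the overlapping bottom region forward as the next prior.
        prior = np.zeros((strip_h, W), dtype=page.dtype)
        top = mask[-overlap:]
        prior[:top.shape[0]] = top
        y += step
    return masks
```

Feeding the previous strip's prediction back in is what keeps structures that span several strips (e.g., a tall choice group) consistent across tile boundaries.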
DocParser Instance Approach (Rausch et al., 2019):
- Backbone: ResNet-110 with Feature Pyramid Network (FPN).
- Entity Detection: Mask R-CNN paradigm, producing bounding boxes, classes, and masks.
- Hierarchy Rules: No learned relation module; instead, four levels of containment/orientation heuristics (geometric overlap, domain grammar, directness filtering, max-IoU parent selection) yield the parent_of (nesting) and followed_by (reading-order) links.
- Supervision: Loss sums classification, box regression, and mask segmentation; used for both strong and weak supervision.
- Weak Supervision: LaTeX SyncTeX–derived boxes as noisy labels.
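A toy version of the max-IoU parent-selection step — not DocParser's full rule cascade — might look like this, where `min_overlap` (the fraction of the child's area a candidate must cover) is an assumed illustrative threshold:

```python
def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def iou(a, b):
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def pick_parent(child, candidates, min_overlap=0.9):
    """Among candidate boxes that (mostly) contain `child`, choose the
    index of the one with highest IoU — i.e., the tightest container."""
    def coverage(parent):
        ix0, iy0 = max(child[0], parent[0]), max(child[1], parent[1])
        ix1, iy1 = min(child[2], parent[2]), min(child[3], parent[3])
        inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
        return inter / area(child) if area(child) else 0.0
    viable = [(i, b) for i, b in enumerate(candidates) if coverage(b) >= min_overlap]
    if not viable:
        return None
    return max(viable, key=lambda ib: iou(child, ib[1]))[0]
```

Maximizing IoU among containing candidates implements "directness": of all ancestors that geometrically contain an entity, the smallest one becomes its parent.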
Headline-Driven Segmentation (HiPS, (Wehnert et al., 31 Aug 2025)):
- TOC-Based: Extracts hierarchy directly from PDF outline metadata; matching to text by normalized/fuzzy string search.
- Layout-Based: Font-size, spacing, and margin heuristics (e.g., PDFstructure, Marker).
- LLM-Based: XML and OCR-based candidate lines are filtered and ranked by a prompting strategy using instruction-tuned LLMs (GPT-5, Llama 3), assigning level and purging non-headings.
- Section Boundary Detection: Algorithmic scan, creating segments upon matching heading; level assignment either direct from TOC or via LLM.
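The normalized/substring/fuzzy matching cascade can be sketched as below; the cutoff value and function names are illustrative assumptions, not HiPS's implementation:

```python
import difflib
import re

def normalize(s):
    """Collapse whitespace and case for robust heading comparison."""
    return re.sub(r"\s+", " ", s).strip().lower()

def match_heading(heading, lines, fuzzy_cutoff=0.85):
    """Locate a detected heading in the extracted text, trying
    normalized-equality, substring, and fuzzy criteria in that order."""
    h = normalize(heading)
    norm = [normalize(l) for l in lines]
    for i, l in enumerate(norm):
        if l == h:
            return i, "exact"
    for i, l in enumerate(norm):
        if h and h in l:
            return i, "substring"
    best = difflib.get_close_matches(h, norm, n=1, cutoff=fuzzy_cutoff)
    if best:
        return norm.index(best[0]), "fuzzy"
    return None, None
```

Ordering the criteria from strict to lenient keeps precision high: the fuzzy fallback only fires for headings (e.g., with OCR typos) that the cheaper checks miss.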
4. Datasets and Evaluation Protocols
Document Forms Dataset (Sarkar et al., 2019):
- Scale: 52,490 human-annotated forms, ∼1800×1000px.
- Annotations: Hierarchical mask and bounding box labels for TextRuns, Widgets, TextBlocks, ChoiceGroups, etc.; explicit containment relations.
DocParser Datasets (Rausch et al., 2019):
- Manually annotated set: 362 arXiv papers (up to 30 pages each), 23 entity categories, complete nesting/order annotation.
- Automatically labeled set: 127,472 arXiv papers, weak labels over 14 categories derived from SyncTeX.
HiPS Evaluation (Wehnert et al., 31 Aug 2025):
- Application to legal textbooks and complex academic books; public data and code.
Metrics Used:
| Metric | Formalization/Reference | Context |
|---|---|---|
| Pixel mean Intersection-over-Union (MIoU) | Per-class IoU averaged over classes | Semantic segmentation (Sarkar et al., 2019) |
| Object-level precision/recall | Instance mask match at a fixed IoU threshold | Form structure detection (Sarkar et al., 2019) |
| mAP (IoU = 0.5, 0.65) | Mean Average Precision | Entity detection (Rausch et al., 2019) |
| Relation F1 | Matching of (subject, predicate, object) triples | Relation structure (Rausch et al., 2019) |
| Tree edit distance | Node insertion/deletion/relabel cost between trees | Hierarchy match (Wehnert et al., 31 Aug 2025) |
| P_k, WindowDiff | Beeferman et al. / Pevzner–Hearst segmentation metrics | Section boundary detection (Wehnert et al., 31 Aug 2025) |
| Relaxed edit-distance match | Tolerant title match allowing small edit distance | Title detection (Wehnert et al., 31 Aug 2025) |
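As a concrete illustration of a boundary metric, WindowDiff (Pevzner & Hearst, 2002) slides a fixed-size window over binary boundary sequences and counts windows where reference and hypothesis disagree on the boundary count; this sketch follows the standard definition:

```python
def window_diff(ref, hyp, k=None):
    """WindowDiff over binary boundary sequences: ref[i] == 1 iff a
    segment boundary follows position i. Returns the fraction of
    windows whose boundary counts disagree."""
    assert len(ref) == len(hyp)
    if k is None:
        # Conventional choice: half the average reference segment length.
        k = max(1, round(len(ref) / (2 * max(1, sum(ref)))))
    n = len(ref)
    errors = sum(
        sum(ref[i:i + k]) != sum(hyp[i:i + k])
        for i in range(n - k)
    )
    return errors / (n - k)
```

Lower is better; unlike plain boundary F1, WindowDiff gives partial credit to near-miss boundaries, which matters when headings are matched to slightly shifted text positions.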
5. Quantitative Performance and Comparative Results
State-of-the-art results are reported by each system in its target domain.
- Sarkar et al. (Sarkar et al., 2019): On forms, Highresnet yields MIoU of 92.7% (TextRun), 83.0% (ChoiceGroup), substantially outperforming DeepLabV3+ and MFCN baselines. Object-level F1 for TextRuns improves with increased model complexity (Lowres→Highres: 62.6→73.2). On generalization sets, table MIoU and F1 match or surpass literature benchmarks.
- DocParser (Rausch et al., 2019): Weak-supervision improves mAP from 49.9% (no WS) to 69.4% (full WS+fine-tune), and relation F1 from 0.453 to 0.615 (+35.8%). On ICDAR-2013, DocParser reaches F1=0.9292 (prior art: 0.9144).
- HiPS (Wehnert et al., 31 Aug 2025): TOC-based parsing attains near-perfect precision, with recall between 0.6 and 0.95 depending on metadata completeness; the XML+OCR+GPT-5 configuration achieves the strongest precision and recall among the evaluated variants, significantly outperforming structure-based tools on deep hierarchies and boundary accuracy as measured by P_k, WindowDiff, and tree edit distance.
6. Practical Implementation and Deployment Considerations
Preprocessing steps crucially impact performance. For semantic segmentation, precise rasterization to high DPI and zero-padding/cropping ensure uniform input dimensions (Sarkar et al., 2019). Strip slicing with overlap enables globally consistent segmentation. In heading-driven systems, XML layout features and OCR on rendered pages substantially reduce false positives and enhance boundary/level assignment (Wehnert et al., 31 Aug 2025).
Postprocessing includes instance mask smoothing (convex hulls), DOM-tree assembly from masks or entities, and emission of HTML or TEI for downstream use (Sarkar et al., 2019, Wehnert et al., 31 Aug 2025).
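The final emission step — serializing an assembled section tree to nested markup — can be sketched generically (HTML here; the dict schema is an illustrative assumption, not either system's format):

```python
from html import escape

def tree_to_html(node, depth=1):
    """Serialize a section tree (dicts with "title" and "children") to
    nested HTML: one <section> per node, with an <h1>..<h6> heading
    whose rank follows the node's depth."""
    tag = f"h{min(depth, 6)}"
    inner = "".join(tree_to_html(c, depth + 1) for c in node.get("children", []))
    return f"<section><{tag}>{escape(node['title'])}</{tag}>{inner}</section>"
```

Nesting `<section>` elements preserves the containment relations directly, so downstream consumers can recover the hierarchy from the markup alone.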
System selection depends on document properties. Where detailed TOC metadata is available and trustworthy, TOC-based headline parsing offers unmatched speed and depth. For complex, poorly structured, or handwritten/scanned pages, LLM-refined methods with signal from both OCR and PDF structure provide state-of-the-art hierarchical extraction at increased computational expense (Wehnert et al., 31 Aug 2025).
7. Strengths, Limitations, and Recommendations
| Approach | Strengths | Limitations |
|---|---|---|
| TOC-Based PageParser | Near-perfect precision, supports deep hierarchies when metadata is full | Zero recall for headings absent from the TOC; brittle to TOC errors |
| Structural Parsers | Fast, metadata-independent, strong for 2–3 levels | Sharp degradation on deeper or non-academic layouts |
| LLM-Refined PageParser | Discovers missing headings, deep and flexible hierarchy, high recall | Higher compute, prompt/LLM dependence, risk of hallucination |
For large-scale document ingestion, structural parsers like Marker or PDFstructure suffice for shallow segmentation. For accuracy-critical or deep hierarchical extraction, especially with incomplete metadata, structure-aware LLM-based refinement is currently preferred (Wehnert et al., 31 Aug 2025) with OCR and XML features. High-resolution semantic segmentation (as in (Sarkar et al., 2019)) delivers robust results for visually-structured forms and scanned documents.
References
- "Document Structure Extraction using Prior based High Resolution Hierarchical Semantic Segmentation" (Sarkar et al., 2019)
- "DocParser: Hierarchical Structure Parsing of Document Renderings" (Rausch et al., 2019)
- "HiPS: Hierarchical PDF Segmentation of Textbooks" (Wehnert et al., 31 Aug 2025)