Hierarchical PDF Segmentation

Updated 17 November 2025
  • Hierarchical PDF segmentation is the process of extracting nested document elements from PDFs and organizing them into a structured tree hierarchy.
  • It integrates semantic segmentation, instance detection, and LLM-based parsing to tackle varied layouts in scanned forms, legal texts, and textbooks.
  • Practical implementations use advanced CNN backbones, heuristic rules, and TOC extraction to accurately determine multi-level containment and reading order.

Hierarchical PDF segmentation is the automated extraction of the nested, multi-level logical structure of documents encoded in the PDF format, producing a representation where document elements (e.g., sections, paragraphs, tables, fields) are organized into a hierarchy that reflects their containment and reading order relations. This task encompasses segmentation of both visually complex scanned documents and digitally typeset files, requiring systems to operate with minimal external metadata to infer rich, arbitrarily deep structural trees. Solutions combine semantic segmentation, instance detection, layout analysis, and relation modeling, leveraging both classic computer vision and state-of-the-art machine learning techniques, including CNNs, instance-segmentation backbones, weak supervision pipelines, and, most recently, LLM-enabled structure inference.

1. Hierarchical Structure Representations

All hierarchical PDF segmentation approaches output representations that encode both entities and their relations in a tree or tree-like structure. DocParser (Rausch et al., 2019) formalizes each document page as a pair $T = (E, R)$, where $E = \{E_j\}$ is the set of detected entities, each with a semantic type $c_j$, bounding box $B_j$, and confidence $P_j$, and $R = \{(E_i, E_j, \Psi_k)\}$ is a set of typed binary relations. The core structural relations include $\mathit{parent\_of}$, signifying containment/nesting, and $\mathit{followed\_by}$, signifying reading order among siblings. The hierarchy is thus an ordered rooted tree, constrained so that every non-top-level entity has exactly one parent and siblings are totally ordered.
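
A minimal illustration of this $T = (E, R)$ representation in Python; the class and field names are hypothetical and chosen for readability (the paper specifies the formalism, not an implementation), and the assembly assumes well-formed relations in which each sibling group forms a single followed_by chain:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical classes mirroring T = (E, R); names are illustrative, not from the paper.
@dataclass
class Entity:
    entity_id: int
    semantic_type: str                           # c_j, e.g. "section", "table", "caption"
    bbox: Tuple[float, float, float, float]      # B_j as (x0, y0, x1, y1)
    confidence: float                            # P_j
    children: List["Entity"] = field(default_factory=list)

def build_tree(entities: List[Entity],
               parent_of: List[Tuple[int, int]],
               followed_by: List[Tuple[int, int]]) -> List[Entity]:
    """Attach children via parent_of links, then order each sibling list via followed_by."""
    by_id: Dict[int, Entity] = {e.entity_id: e for e in entities}
    child_ids = set()
    for parent_id, child_id in parent_of:
        by_id[parent_id].children.append(by_id[child_id])
        child_ids.add(child_id)
    successor = dict(followed_by)                # entity id -> id of its next sibling
    for entity in entities:
        if entity.children:
            ids = {c.entity_id for c in entity.children}
            # Assumes a well-formed chain: exactly one head (no predecessor) per sibling group.
            heads = ids - {successor[i] for i in ids if i in successor and successor[i] in ids}
            order, current = [], next(iter(heads))
            while current in ids:
                order.append(current)
                current = successor.get(current, -1)
            entity.children = [by_id[i] for i in order]
    return [e for e in entities if e.entity_id not in child_ids]   # top-level entities
```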

In form-centric pipelines such as that of Sarkar et al. (Sarkar et al., 2019), document structure is decoded into four explicit levels: TextRuns/Widgets (leaves), TextBlocks/ChoiceGroup Titles (blocks), TextFields/ChoiceFields (fields), and ChoiceGroups (containers), each corresponding to a CNN segmentation head. The hierarchy is made explicit during annotation via per-instance masks and containment links, and is enforced in the model via multi-head supervision.

In textbook and legal document contexts (Wehnert et al., 31 Aug 2025), the output is a nested segmentation over text spans, with hierarchy levels tied to sectioning structure (e.g., chapters, sections, subsections), and boundaries inferred from headings detected via TOC, layout, or LLMs.

2. Algorithmic Approaches

The three dominant families of algorithms for hierarchical PDF segmentation are:

| Approach | Key Characteristics | Exemplars |
| --- | --- | --- |
| Hierarchical Semantic Segmentation | Strip-wise, high-res, multi-level semantic masks, CNN-based | Sarkar et al. (Sarkar et al., 2019) |
| Instance Segmentation + Heuristics | Per-entity detection (boxes/masks) + rule-based tree assembly | DocParser (Rausch et al., 2019) |
| Headline Detection & Parsing | Heading detectors via TOC, layout, or LLM, then tree induction | HiPS (Wehnert et al., 31 Aug 2025) |

Hierarchical Semantic Segmentation: Sarkar et al. employ a deep CNN operating on high-resolution overlapping image strips, where the segmentation mask of one strip acts as a prior for the next. The model jointly outputs semantic segmentation masks for each hierarchy level, enforcing level consistency and enabling fine-grained compositional parsing (Sarkar et al., 2019). This strategy addresses challenges of continuity for structures spanning multiple image tiles and resolves both visual segmentation and containment simultaneously.

Instance Segmentation & Heuristics: DocParser detects all entities (tables, figures, headings, etc.) as instances using a Mask R-CNN backbone, then applies a grammar-aware cascade of deterministic heuristic rules to infer nesting (parent–child) and sibling order relations based on geometric containment, domain grammars, and layout (Rausch et al., 2019). Weak supervision is leveraged during training using noisy annotations from reverse-LaTeX rendering, ensuring scalability even with limited ground-truth data.
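
The geometric core of such a rule cascade can be sketched as follows; the threshold value and the restriction to pure geometry (ignoring the domain grammar) are simplifying assumptions, not DocParser's exact rules:

```python
from typing import Dict, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def containment_ratio(inner: Box, outer: Box) -> float:
    """Fraction of `inner`'s area that lies inside `outer`."""
    x0, y0 = max(inner[0], outer[0]), max(inner[1], outer[1])
    x1, y1 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    inner_area = max(1e-9, (inner[2] - inner[0]) * (inner[3] - inner[1]))
    return inter / inner_area

def _area(b: Box) -> float:
    return (b[2] - b[0]) * (b[3] - b[1])

def assign_parents(boxes: Dict[int, Box], threshold: float = 0.8) -> Dict[int, Optional[int]]:
    """For each entity, pick as parent the candidate that contains it best (max containment),
    provided the score clears the threshold; ties go to the tighter (smaller) candidate,
    which approximates a directness filter."""
    parents: Dict[int, Optional[int]] = {}
    for child_id, child_box in boxes.items():
        best_id, best_score = None, threshold
        for cand_id, cand_box in boxes.items():
            if cand_id == child_id:
                continue
            score = containment_ratio(child_box, cand_box)
            if score > best_score or (score == best_score and best_id is not None
                                      and _area(cand_box) < _area(boxes[best_id])):
                best_id, best_score = cand_id, score
        parents[child_id] = best_id
    return parents
```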

Section Headline Detection & Parsing: The HiPS system (Wehnert et al., 31 Aug 2025) approaches hierarchical segmentation for complex books by first detecting section titles via TOC extraction, document layout parsing, or LLM-based refinement (sometimes with OCR features). It assigns hierarchy levels to each detected heading and infers section boundaries by matching headings to text locations via normalized, substring, or fuzzy criteria. Tree structure is then induced by the heading sequence and assigned levels.
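
A condensed sketch of the heading-to-text matching step, assuming difflib-based fuzzy comparison and a simple numbering-stripping normalization (HiPS may use different normalization and similarity criteria):

```python
import difflib
import re

def normalize(s: str) -> str:
    """Lowercase, strip leading section numbering (e.g. '3.1.2'), punctuation, extra spaces."""
    s = re.sub(r"^\s*\d+(\.\d+)*[.\)]?\s*", "", s.lower())
    s = re.sub(r"[^\w\s]", " ", s)
    return re.sub(r"\s+", " ", s).strip()

def match_heading(heading: str, line: str, fuzzy_threshold: float = 0.85) -> bool:
    """Three criteria in order of strictness: exact normalized match, substring, fuzzy ratio."""
    h, l = normalize(heading), normalize(line)
    if not h or not l:
        return False
    if h == l or h in l or l in h:
        return True
    return difflib.SequenceMatcher(None, h, l).ratio() >= fuzzy_threshold

def locate_headings(headings: list, lines: list) -> dict:
    """Map each detected heading to the index of the first matching text line after the
    previous match, so that matches respect document order."""
    positions, start = {}, 0
    for heading in headings:
        for i in range(start, len(lines)):
            if match_heading(heading, lines[i]):
                positions[heading] = i
                start = i + 1
                break
    return positions
```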

3. Model Architectures and System Design

Strip-based Hierarchical Segmentation (Sarkar et al., 2019):

  • Input: Overlapping horizontal strips of the grayscale image, each of size $S_h \times w$, concatenated with a binary prior mask ($C$ channels).
  • Image Encoder: Stacked 2D convolutions, ReLU activations, and downsampling. Multiple skip connections allow fine detail recovery.
  • Context Encoder: Four blocks of bi-directional 1D dilated convolutions applied vertically and horizontally, capturing long-range dependencies critical for hierarchical parsing of forms.
  • Decoder: U-Net–style upsampling with skip connections. Each upsampled representation passes through $K$ independent 1×1 conv "heads", each yielding a $C_k$-class softmax mask per hierarchy level.
  • Prior Propagation: At inference, the predicted mask’s bottom overlap region is used as the prior for the next strip, maintaining structure continuity across segments.
  • Loss: Multi-head cross-entropy, optionally augmented by a prior-consistency MSE across the overlap; empirically, cross-entropy supervision suffices ($\lambda_p = 0$).
  • Post-processing: Convex hull smoothing for instance crispness; DOM-tree assembly via containment of polygons.
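
The inference loop described in the list above can be sketched roughly as follows. This is PyTorch-style pseudocode under stated assumptions: the strip height, overlap, single-channel prior, and the model interface (a list of per-level logit tensors) are placeholders, not the paper's exact configuration.

```python
import torch

def segment_page(model, page: torch.Tensor, strip_h: int = 600, overlap: int = 200,
                 num_levels: int = 4) -> list:
    """Run a multi-head strip model over a full page image of shape (1, H, W).
    The bottom `overlap` rows of each prediction seed the prior channel of the next strip."""
    _, H, W = page.shape
    step = strip_h - overlap
    outputs = [[] for _ in range(num_levels)]          # per-level mask pieces
    prior = torch.zeros(1, strip_h, W)                 # empty prior for the first strip
    for top in range(0, H, step):
        strip = page[:, top:top + strip_h, :]
        if strip.shape[1] < strip_h:                   # zero-pad the final strip
            pad = torch.zeros(1, strip_h - strip.shape[1], W)
            strip = torch.cat([strip, pad], dim=1)
        x = torch.cat([strip, prior], dim=0).unsqueeze(0)   # (1, image + prior, strip_h, W)
        with torch.no_grad():
            level_logits = model(x)                    # assumed: list of K tensors, one per level
        level_masks = [lg.argmax(dim=1)[0] for lg in level_logits]
        for k, m in enumerate(level_masks):
            keep = m if top == 0 else m[overlap:]      # drop rows already emitted previously
            outputs[k].append(keep)
        # Binary prior for the next strip: bottom overlap of the finest-level mask, placed at
        # the top of the next strip's prior channel.
        bottom = (level_masks[0][-overlap:] > 0).float()
        prior = torch.cat([bottom, torch.zeros(strip_h - overlap, W)], dim=0).unsqueeze(0)
    return [torch.cat(chunks, dim=0)[:H] for chunks in outputs]
```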

DocParser Instance Approach (Rausch et al., 2019):

  • Backbone: ResNet-101 with Feature Pyramid Network (FPN).
  • Entity Detection: Mask R-CNN paradigm, producing bounding boxes, classes, and masks.
  • Hierarchy Rules: No learned relation module; instead, four levels of containment/orientation heuristics (geometric overlap, domain grammar, directness filtering, max-IoU parent selection) yield $\mathit{parent\_of}$ and $\mathit{followed\_by}$ links.
  • Supervision: Loss sums classification, box regression, and mask segmentation; used for both strong and weak supervision.
  • Weak Supervision: LaTeX SyncTeX–derived boxes as noisy labels.
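
A minimal illustration of how $\mathit{followed\_by}$ links among siblings could be derived from geometry alone; the actual DocParser rules also consult the document grammar and column layout, so this is a simplified single-column sketch:

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def reading_order(sibling_boxes: Dict[int, Box]) -> List[Tuple[int, int]]:
    """Order siblings top-to-bottom, then left-to-right, and emit one followed_by link per
    consecutive pair. Multi-column pages would need column grouping before this sort."""
    ordered = sorted(sibling_boxes.items(), key=lambda kv: (kv[1][1], kv[1][0]))
    ids = [entity_id for entity_id, _ in ordered]
    return list(zip(ids, ids[1:]))
```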

Headline-Driven Segmentation (HiPS, (Wehnert et al., 31 Aug 2025)):

  • TOC-Based: Extracts hierarchy directly from PDF outline metadata; matching to text by normalized/fuzzy string search.
  • Layout-Based: Font-size, spacing, and margin heuristics (e.g., PDFstructure, Marker).
  • LLM-Based: XML and OCR-based candidate lines are filtered and ranked by a prompting strategy using instruction-tuned LLMs (GPT-5, Llama 3), assigning level and purging non-headings.
  • Section Boundary Detection: Algorithmic scan that creates a new segment whenever a heading is matched; level assignment comes either directly from the TOC or from the LLM.
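
Given the located headings and their levels, a stack-based pass suffices to turn the flat sequence into a nested section tree. The structure below is a hedged sketch (plain dictionaries, 1-based levels), not HiPS's actual data model:

```python
from typing import List, Tuple

def build_section_tree(headings: List[Tuple[str, int]]) -> List[dict]:
    """Turn a sequence of (title, level) pairs, in document order, into a nested tree.
    Each heading becomes a child of the closest preceding heading with a shallower level."""
    roots: List[dict] = []
    stack: List[dict] = []                      # currently open sections, shallowest first
    for title, level in headings:
        node = {"title": title, "level": level, "children": []}
        while stack and stack[-1]["level"] >= level:
            stack.pop()                         # close sections at the same or deeper level
        (stack[-1]["children"] if stack else roots).append(node)
        stack.append(node)
    return roots

# Example: chapters > sections > subsections
tree = build_section_tree([
    ("1 Introduction", 1),
    ("1.1 Scope", 2),
    ("1.2 Definitions", 2),
    ("2 Liability", 1),
    ("2.1 General Principles", 2),
    ("2.1.1 Fault", 3),
])
```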

4. Datasets and Evaluation Protocols

Document Forms Dataset (Sarkar et al., 2019):

  • Scale: 52,490 human-annotated forms, ∼1800×1000px.
  • Annotations: Hierarchical mask and bounding box labels for TextRuns, Widgets, TextBlocks, ChoiceGroups, etc.; explicit containment relations.

DocParser Datasets (Rausch et al., 2019):

  • Manually annotated set: 362 arXiv papers (up to 30 pages each), 23 entity categories, complete nesting and reading-order annotation.
  • Weakly annotated set: 127,472 arXiv papers with 14-category weak labels derived from SyncTeX.

HiPS Evaluation (Wehnert et al., 31 Aug 2025):

  • Application to legal textbooks and complex academic books; public data and code.

Metrics Used:

| Metric | Formalization/Reference | Context |
| --- | --- | --- |
| Pixel mean Intersection-over-Union (MIoU) | $\text{MIoU} = \frac{1}{C}\sum_{c=1}^{C} \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}$ | Semantic segmentation (Sarkar et al., 2019) |
| Object-level precision/recall | $\mathrm{TP}$ at $\mathrm{IoU} \geq \tau$ | Instance mask match (Sarkar et al., 2019) |
| mAP (IoU = 0.5, 0.65) | mean Average Precision | Entity detection (Rausch et al., 2019) |
| Relation F1 | Triple matching on $(E_i, E_j, \Psi)$ | Relation structure (Rausch et al., 2019) |
| Tree edit distance | $d_{\mathrm{ETD}}(T_{\text{pred}}, T_{\text{gt}})$ | Hierarchy match (Wehnert et al., 31 Aug 2025) |
| $P_k$, WindowDiff | Beeferman / Pevzner–Hearst segmentation metrics | Section boundary detection (Wehnert et al., 31 Aug 2025) |
| Edit-distance-relaxed $P_{ED}$, $R_{ED}$ | Tolerant match allowing $ED \le 2$ | Title detection (Wehnert et al., 31 Aug 2025) |
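
For concreteness, the pixel MIoU in the table above can be computed from per-class confusion counts as in this NumPy sketch; skipping classes absent from both maps is an assumed convention, not something the papers prescribe:

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Pixel MIoU = mean over classes of TP_c / (TP_c + FP_c + FN_c).
    `pred` and `target` are integer class maps of identical shape."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        tp = np.logical_and(pred_c, target_c).sum()
        fp = np.logical_and(pred_c, ~target_c).sum()
        fn = np.logical_and(~pred_c, target_c).sum()
        denom = tp + fp + fn
        if denom > 0:                       # skip classes absent from both maps
            ious.append(tp / denom)
    return float(np.mean(ious))
```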

5. Quantitative Performance and Comparative Results

State-of-the-art results are reported by each system in its target domain.

  • Sarkar et al. (Sarkar et al., 2019): On forms, Highresnet yields MIoU of 92.7% (TextRun), 83.0% (ChoiceGroup), substantially outperforming DeepLabV3+ and MFCN baselines. Object-level F1 for TextRuns improves with increased model complexity (Lowres→Highres: 62.6→73.2). On generalization sets, table MIoU and F1 match or surpass literature benchmarks.
  • DocParser (Rausch et al., 2019): Weak-supervision improves mAP from 49.9% (no WS) to 69.4% (full WS+fine-tune), and relation F1 from 0.453 to 0.615 (+35.8%). On ICDAR-2013, DocParser reaches F1=0.9292 (prior art: 0.9144).
  • HiPS (Wehnert et al., 31 Aug 2025): TOC-based parsing has precision $P_{ED} > 0.98$ and recall $R_{ED}$ of 0.6–0.95 depending on metadata; XML+OCR+GPT-5 achieves $P_{ED} \approx 0.95$, $R_{ED} \approx 0.90$, significantly outperforming structure-based tools on deep hierarchies and boundary accuracy as measured by $P_k$, WindowDiff, and tree edit distance.

6. Practical Implementation and Deployment Considerations

Preprocessing steps crucially impact performance. For semantic segmentation, precise rasterization to high DPI and zero-padding/cropping ensure uniform input dimensions (Sarkar et al., 2019). Strip slicing with overlap (e.g., $S_h = 600$, $O_h = 200$) enables globally consistent segmentation. In heading-driven systems, XML layout features and OCR on rendered pages substantially reduce false positives and enhance boundary/level assignment (Wehnert et al., 31 Aug 2025).
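
A minimal preprocessing sketch along these lines, assuming pdf2image for rasterization (the papers do not prescribe a specific library) and the strip parameters quoted above:

```python
import numpy as np
from pdf2image import convert_from_path   # requires poppler; an assumed choice of rasterizer

def page_strips(pdf_path: str, dpi: int = 300, strip_h: int = 600, overlap: int = 200):
    """Rasterize each page to grayscale at high DPI and cut it into overlapping horizontal
    strips, zero-padding the final strip so every strip has a uniform height."""
    step = strip_h - overlap
    for page in convert_from_path(pdf_path, dpi=dpi):
        gray = np.asarray(page.convert("L"), dtype=np.float32) / 255.0   # (H, W) in [0, 1]
        H, W = gray.shape
        for top in range(0, H, step):
            strip = gray[top:top + strip_h]
            if strip.shape[0] < strip_h:
                strip = np.pad(strip, ((0, strip_h - strip.shape[0]), (0, 0)))
            yield strip                     # shape (strip_h, W)
```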

Postprocessing includes instance mask smoothing (convex hulls), DOM-tree assembly from masks or entities, and emission of HTML or TEI for downstream use (Sarkar et al., 2019, Wehnert et al., 31 Aug 2025).
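
The convex-hull smoothing step can be illustrated with OpenCV as below; this is a generic implementation of the idea, not the papers' exact post-processing code:

```python
import cv2
import numpy as np

def convex_hull_smooth(mask: np.ndarray) -> np.ndarray:
    """Replace each connected component of a binary instance mask with its convex hull,
    removing ragged pixel-level edges before DOM-tree assembly."""
    mask_u8 = (mask > 0).astype(np.uint8) * 255
    contours, _ = cv2.findContours(mask_u8, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    smoothed = np.zeros_like(mask_u8)
    for contour in contours:
        hull = cv2.convexHull(contour)
        cv2.fillPoly(smoothed, [hull], 255)
    return smoothed
```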

System selection depends on document properties. Where detailed TOC metadata is available and trustworthy, TOC-based headline parsing offers unmatched speed and depth. For complex, poorly structured, or handwritten/scanned pages, LLM-refined methods with signal from both OCR and PDF structure provide state-of-the-art hierarchical extraction at increased computational expense (Wehnert et al., 31 Aug 2025).

7. Strengths, Limitations, and Recommendations

| Approach | Strengths | Limitations |
| --- | --- | --- |
| TOC-Based PageParser | Near-perfect precision; supports deep hierarchies when metadata is complete | No recall for headings missing from the TOC; brittle to TOC errors |
| Structural Parsers | Fast, metadata-independent, strong for 2–3 levels | Sharp degradation on deeper or non-academic layouts |
| LLM-Refined PageParser | Discovers missing headings; deep and flexible hierarchy; high recall | Higher compute; prompt/LLM dependence; risk of hallucination |

For large-scale document ingestion, structural parsers like Marker or PDFstructure suffice for shallow segmentation. For accuracy-critical or deep hierarchical extraction, especially with incomplete metadata, structure-aware LLM-based refinement is currently preferred (Wehnert et al., 31 Aug 2025) with OCR and XML features. High-resolution semantic segmentation (as in (Sarkar et al., 2019)) delivers robust results for visually-structured forms and scanned documents.
