DocTree Extraction: Hierarchical Document Parsing

Updated 18 May 2026

DocTree extraction is the automated recovery of a document’s hierarchical structure by mapping elements like headings, sections, and paragraphs into a rooted, labeled tree.
Approaches span deterministic DOM-based methods, neural sequence modeling, and multimodal vision-language fusion, achieving high precision and recall across diverse benchmarks.
Applications include enhanced information retrieval, robust document conversion, and form analysis, with ongoing research tackling scalability and complex layout challenges.

Document Tree (DocTree) Extraction is the automated recovery of a document’s hierarchical structure as a rooted, labeled tree whose nodes represent logical or semantic elements such as headings, sections, paragraphs, figures, and tables, and whose edges represent their parent–child relations. This structural representation underpins information retrieval, document understanding, form analysis, and robust document conversion (e.g., PDF to Markdown). Approaches span deterministic algorithms leveraging typographic and layout cues, sequence or graph modeling with neural architectures, multimodal vision-language models, and hybrid pipelines integrating rule-based and learned components. The field has evolved rapidly from DOM-centric extraction in webpages to robust, large-scale, multimodal parsing of raw PDFs, driven by the demands of scholarly, financial, and business documents.

1. Formal Task Definition and Problem Scope

At its core, DocTree extraction seeks to parse a structured or semi-structured document $D$ (which may be an HTML file, PDF, or scanned image) into a directed tree $T=(V,E)$, where:

$V$ is a finite set of nodes, each corresponding to a semantic, physical, or logical block (e.g., heading, paragraph, table, figure).
$E$ encodes parent–child relations, reflecting hierarchical containment (e.g., “section 2.1” as a child of “section 2”).

Essential properties include:

Rootedness: A synthetic or natural root node serves as the unique ancestor.
Reading Order Compatibility: A linearization of the tree (pre-order traversal) corresponds to the intended reading sequence.
Heterogeneity: Nodes carry class labels (type, level), geometric or style attributes (font size, layout), and optionally semantic content.
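The definition above can be made concrete with a small data structure. The following is a minimal sketch, not taken from any of the cited systems: the `Node` class, its field names, and the example headings are all hypothetical, chosen only to show a rooted, labeled tree whose pre-order traversal reproduces the reading order.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """One DocTree node: a typed block plus its ordered children (hypothetical schema)."""
    label: str                      # e.g. "root", "heading", "paragraph"
    text: str = ""
    level: int = 0                  # heading depth; 0 for the root and non-headings
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child (this creates one parent-child edge) and return it."""
        self.children.append(child)
        return child


def preorder(node: Node) -> Iterator[Node]:
    """Yield nodes parent-first, children left to right."""
    yield node
    for child in node.children:
        yield from preorder(child)


# A synthetic root with "section 2.1" as a child of "section 2".
root = Node("root")
sec2 = root.add(Node("heading", "2 Methods", level=1))
sec21 = sec2.add(Node("heading", "2.1 Data", level=2))
sec21.add(Node("paragraph", "We collect PDFs."))
sec2.add(Node("heading", "2.2 Model", level=2))

# Linearizing the tree in pre-order recovers the intended reading sequence.
reading_order = [n.text for n in preorder(root) if n.text]
```

Here `reading_order` lists the blocks in the order a reader would meet them, which is the reading-order compatibility property stated above.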

Recent frameworks (e.g., READoc (Li et al., 2024), HELD (Cao et al., 2021), MoDora (Xu et al., 26 Feb 2026), Detect-Order-Construct (Wang et al., 2024)) recast the objective as an end-to-end mapping from unstructured input (PDF, scan, web page) to a canonical tree-based representation suitable for downstream applications.

2. Key Models and Extraction Algorithms

The diversity of document genres and encodings has yielded a spectrum of methodologies.

DOM-Based Content Extraction

In HTML/webpage analysis, the document is first parsed into a DOM tree $T=(N,E)$, where each node $n$ is a tag or text node. López et al. (2012) introduced a bottom-up scoring mechanism using the chars-nodes ratio (CNR):

$\mathrm{CNR}(n) = \frac{\textrm{Number of nonblank characters in subtree}}{\textrm{Number of nodes in subtree}}$

Non-content nodes (e.g., script and style elements, navigation elements) are pruned. Subtrees with maximal CNR are greedily merged into cohesive “content blocks”, and the largest block is selected as the main content. This method achieves $94.4\%$ mean recall, together with high mean precision, for main-content extraction.
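To illustrate the CNR score, here is a minimal sketch that computes it bottom-up over a hand-built tree. The `DomNode` class, the `SKIPPED_TAGS` pruning list, and the example pages are assumptions made for illustration. The greedy merging of maximal-CNR subtrees into content blocks is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed pruning list; the original method's exact list may differ.
SKIPPED_TAGS = {"script", "style", "nav"}


@dataclass
class DomNode:
    tag: str                                   # "#text" marks a text node
    text: str = ""
    children: List["DomNode"] = field(default_factory=list)


def subtree_stats(node: DomNode) -> Tuple[int, int]:
    """Return (nonblank characters, node count) for the subtree after pruning."""
    chars = sum(1 for c in node.text if not c.isspace())
    nodes = 1
    for child in node.children:
        if child.tag in SKIPPED_TAGS:
            continue                           # pruned subtrees contribute nothing
        child_chars, child_nodes = subtree_stats(child)
        chars += child_chars
        nodes += child_nodes
    return chars, nodes


def cnr(node: DomNode) -> float:
    """Chars-nodes ratio: nonblank characters divided by nodes in the subtree."""
    chars, nodes = subtree_stats(node)
    return chars / nodes


article = DomNode("div", children=[
    DomNode("p", children=[DomNode("#text", "Hello world")]),
    DomNode("script", children=[DomNode("#text", "var x = 1;")]),
])
menu = DomNode("ul", children=[
    DomNode("li", children=[DomNode("#text", "Home")]),
    DomNode("li", children=[DomNode("#text", "About")]),
])
```

In this toy page the `article` subtree packs more text into fewer nodes than the `menu` subtree, so it receives the higher CNR, which is the intuition behind using the ratio to separate main content from navigation.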

Sequence/Binary-Insertion and Neural Hierarchy Modeling

HELD (Cao et al., 2021) and similar approaches construct the DocTree sequentially: given a list of document objects sorted in reading order, candidate insertion positions for each object are enumerated along the rightmost branch of the evolving tree. Each “put-or-skip” decision is a binary classification, modeled by a neural network that encodes context features (a Bi-LSTM over the object, its candidate parent, and siblings, plus formatting embeddings). Traversal strategies (root-to-leaf, traversal-all) balance efficiency and accuracy. The two-step variant first extracts headings and then attaches non-headings; it is evaluated on Chinese financial documents, English financial documents, and arXiv papers.
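The insertion loop can be sketched as follows. This is an illustration of rightmost-branch insertion only, under several assumptions: the `Obj` class and its `level` field are hypothetical, `toy_put` is a hand-written rule standing in for the learned put-or-skip classifier, and the branch is scanned from the deepest node upward for brevity rather than with either traversal strategy named above.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Obj:
    text: str
    level: int                         # hypothetical depth cue: 1 = top heading, 99 = body text
    children: List["Obj"] = field(default_factory=list)


def rightmost_branch(root: Obj) -> List[Obj]:
    """Path from the root to the most recently inserted node."""
    branch = [root]
    while branch[-1].children:
        branch.append(branch[-1].children[-1])
    return branch


def toy_put(parent: Obj, obj: Obj) -> bool:
    """Stand-in for the neural classifier: 'put' under any shallower node."""
    return parent.level < obj.level


def build_tree(objects: List[Obj],
               put: Callable[[Obj, Obj], bool] = toy_put) -> Obj:
    root = Obj("ROOT", level=0)
    for obj in objects:                # objects arrive in reading order
        for candidate in reversed(rightmost_branch(root)):
            if put(candidate, obj):    # put here; otherwise skip to the next candidate
                candidate.children.append(obj)
                break
        else:
            root.children.append(obj)  # every candidate skipped: fall back to the root
    return root


objects = [Obj("1 Intro", 1), Obj("Text A", 99), Obj("1.1 Scope", 2),
           Obj("Text B", 99), Obj("2 Methods", 1)]
tree = build_tree(objects)
```

Because only nodes on the rightmost branch are candidates, each object is compared against at most one path of the tree rather than against every earlier object, which is what keeps this style of construction cheap on long documents.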

Multimodal and Vision-Language Approaches

The Multimodal Tree Decoder (MTD) (Hu et al., 2022) employs a unified model for table-of-contents (ToC) extraction:

A multimodal encoder fuses visual (RoIAlign on ResNet-34), semantic (BERT), and layout features.
A classifier selects headings using BiGRU with focal loss.
A tree-structured decoder, based on transformer attention and recurrent decoding, predicts parent/sibling/identity relations among headings.

On scientific papers, MTD is evaluated with TEDS (tree-edit-distance similarity) and F1. Ablations underline the importance of combining modalities: removing any single stream lowers TEDS.

End-to-End Detection-Order-Construct Paradigm

Detect-Order-Construct (DOC) (Wang et al., 2024) tackles hierarchical document structure analysis (HDSA) by:

Detecting page and object regions (Mask2Former, DINO for graphical objects; transformer-based text-line grouping).
Predicting intra- and inter-region reading order via transformer-based relation heads.
Constructing the tree among headings with a greedy tree-insertion, driven by parent and sibling relation scores.

On the Comp-HRDoc benchmark, DOC is scored with Micro-STEDS, including on end-to-end full-hierarchy reconstruction, and outperforms previous MTD-style baselines.
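The construction step can be illustrated with a simplified greedy procedure. The sketch below is not the published algorithm: it uses only parent scores (sibling scores are omitted), it assumes the score matrix layout described in the docstring, and it restricts candidates to the rightmost branch so that the result stays consistent with reading order. The scores in the example are invented.

```python
from typing import List


def construct(parent_score: List[List[float]]) -> List[int]:
    """Greedily pick a parent for each heading, in reading order.

    parent_score[i][0] is the score that the root is the parent of heading i;
    parent_score[i][j + 1] is the score that heading j (j < i) is its parent.
    Returns parents[i], the index of heading i's parent, with -1 for the root.
    """
    parents: List[int] = []
    for i in range(len(parent_score)):
        # Candidates: the root plus the previous heading and its ancestors.
        branch = [-1]
        if i > 0:
            chain = []
            node = i - 1
            while node != -1:
                chain.append(node)
                node = parents[node]
            branch += reversed(chain)
        best = max(branch, key=lambda j: parent_score[i][j + 1])
        parents.append(best)
    return parents


headings = ["1", "1.1", "1.2", "2"]
scores = [
    [1.0],                    # "1":   only the root is available
    [0.1, 0.9],               # "1.1": prefers heading "1"
    [0.1, 0.7, 0.2],          # "1.2": prefers heading "1" over "1.1"
    [0.8, 0.1, 0.05, 0.05],   # "2":   prefers the root
]
parents = construct(scores)
```

A greedy choice like this depends entirely on the quality of the upstream relation scores: a heading that is misdetected or mis-scored early is attached wrongly and changes the candidate set for everything after it.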

Hybrid and Rule-Based Pipelines

Pipelines such as NovaLAD (Ulla, 23 Feb 2026) integrate optimized YOLO-based element/layout detection, parallel vision-language (ViT/LLM) enrichment, and deterministic grouping by layout, producing DocTrees as JSON, Markdown, and knowledge graphs. On DP-Bench, NovaLAD is evaluated with TEDS and NID while running on CPU.

Rule-based systems (e.g., parsing directory pages (Shrivastava et al., 2021)) rely on hand-crafted segment classifiers, geometric features, and bottom-up dominance traversal to construct reading trees from visually irregular layouts. These methods, while less generalizable, remain robust for single-page, highly structured templates.

3. Multimodal Feature Representation

DocTree systems fuse diverse signals:

Semantic: Text embeddings from transformer encoders (BERT, RoBERTa) or recurrent encoders (BiGRU).
Layout: Bounding box coordinates, font/indentation, grid-partitioning, and typographic style.
Visual: CNN or layout-deep features on cropped regions (e.g., images, bold/italic) and RoI-aligned outputs.
Joint Embeddings: Feature fusion is typically realized via concatenation and gated units (as in DocStruct (Wang et al., 2020)), with learned gates modulating the influence of visual vs. textual information.

The integration of these representations is imperative for handling forms, complex layouts, and ambiguous labeling. For key–value form understanding, asymmetric relation scoring and negative sampling (see DocStruct) further disambiguate hierarchical attachment.
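Gated fusion of two feature streams can be sketched in a few lines. The version below is a generic formulation, not necessarily the exact one used in DocStruct: a sigmoid gate computed from the concatenated features interpolates, per dimension, between the textual and visual vectors. The function names, weight shapes, and example values are assumptions.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(text: Vector, visual: Vector,
                 w_gate: List[Vector], b_gate: Vector) -> Vector:
    """Return g * text + (1 - g) * visual, with g = sigmoid(W [text; visual] + b).

    text and visual must have the same length d; w_gate has d rows of length 2d.
    """
    joint = text + visual                       # concatenation [text; visual]
    gate = [sigmoid(sum(w * x for w, x in zip(row, joint)) + b)
            for row, b in zip(w_gate, b_gate)]
    return [g * t + (1.0 - g) * v for g, t, v in zip(gate, text, visual)]


text_vec = [1.0, 0.0]
visual_vec = [0.0, 1.0]
zero_weights = [[0.0] * 4, [0.0] * 4]

# With zero weights and bias the gate is 0.5, so the streams are averaged.
balanced = gated_fusion(text_vec, visual_vec, zero_weights, [0.0, 0.0])

# A large positive bias pushes the gate toward 1, favoring the textual stream.
text_heavy = gated_fusion(text_vec, visual_vec, zero_weights, [20.0, 20.0])
```

In a trained model the gate weights are learned, so the network can lean on visual cues where text is uninformative (for example, short form labels) and on text elsewhere.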

4. Datasets, Benchmarks, and Evaluation Protocols

Evaluation protocols have evolved toward holistic, multi-level assessments:

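TEDS (Tree-Edit-Distance Similarity): $\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max(|T_a|, |T_b|)}$, where $|T|$ is the number of nodes in $T$, provides a global tree-structure similarity measure, penalizing node misattachment and content mismatches (Hu et al., 2022, Li et al., 2024).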
F1 over Relations/Headings: Measures exact assignment of hierarchical or key–value relations.
REDS (Reading Edit Distance Score): Quantifies order reconstruction fidelity for blocks or graphical regions (Wang et al., 2024).
NID (Normalized Indel Distance): Edit-based flat sequence evaluation, complementary to TEDS (Ulla, 23 Feb 2026).
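As a worked illustration of TEDS, the sketch below computes a unit-cost ordered tree edit distance with a memoized forest recursion and normalizes it by the size of the larger tree. It is a didactic implementation under stated assumptions: trees are nested `(label, children)` tuples, every insert, delete, or relabel costs 1, and no attempt is made at the efficiency of production algorithms. Benchmarks may also use content-aware relabeling costs rather than the exact-match cost used here.

```python
from functools import lru_cache
from typing import Tuple

Tree = Tuple[str, tuple]        # (label, children), children is a tuple of Trees
Forest = Tuple[Tree, ...]


def size(forest: Forest) -> int:
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_dist(f: Forest, g: Forest) -> int:
    """Unit-cost edit distance between two ordered forests."""
    if not f and not g:
        return 0
    if not f:
        return size(g)          # insert everything in g
    if not g:
        return size(f)          # delete everything in f
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete = forest_dist(f[:-1] + f_kids, g) + 1
    # Insert the rightmost root of g.
    insert = forest_dist(f, g[:-1] + g_kids) + 1
    # Match the two rightmost roots, relabeling if their labels differ.
    match = (forest_dist(f_kids, g_kids)
             + forest_dist(f[:-1], g[:-1])
             + (f_label != g_label))
    return min(delete, insert, match)


def teds(a: Tree, b: Tree) -> float:
    """TEDS = 1 - EditDist(a, b) / max(|a|, |b|)."""
    return 1.0 - forest_dist((a,), (b,)) / max(size((a,)), size((b,)))


gold = ("root", (("h1", (("h2", ()), ("h2", ()))), ("h1", ())))
pred = ("root", (("h1", (("h2", ()),)), ("h1", ())))
score = teds(gold, pred)
```

The predicted tree is missing one `h2` node, so the edit distance is 1 against a reference of 5 nodes and the score is 1 - 1/5. Two identical trees score 1.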

Reference datasets include HierDoc (ToC-labeled scientific papers) (Hu et al., 2022), Comp-HRDoc (full document tree reconstruction) (Wang et al., 2024), the manually and automatically annotated DocParser datasets (arXiv renderings) (Rausch et al., 2019), ESGDoc (complex financial reports) (Wang et al., 2023), READoc (PDF→Markdown, 2,233 documents) (Li et al., 2024), and DP-Bench (document parsing benchmark) (Ulla, 23 Feb 2026).

5. Challenges, Limitations, and Error Analysis

Key challenges persist:

Hierarchical Disambiguation: Long documents with variable heading conventions, implicit structure, or missing typographic cues (e.g., GitHub READMEs) reduce global TEDS, and expert models fail to generalize to them (Li et al., 2024).
Layout Complexity: Multi-column, irregular, or visually complex documents require models to jointly consider geometric, reading-order, and typographic cues, a source of persistent performance drops (6–7 TEDS points in multi-column layouts).
Scalability: Pairwise modeling of all heading pairs is intractable on long documents; node-centric or local-tree approaches as in CMM (Wang et al., 2023) and HELD (Cao et al., 2021) mitigate this.
Error Propagation: Early-stage detection/classification errors (e.g., heading misrecognition) often propagate catastrophically during tree assembly (Wang et al., 2024).
Data Scarcity and Weak Supervision: Manual annotation is expensive; weakly supervised learning from LaTeX source (DocParser) (Rausch et al., 2019) improves entity-relation F1 and reduces the amount of annotated data required.

6. Applications and Downstream Impact

DocTree extraction underlies critical document intelligence tasks:

Content Extraction and Filtering: DOM-based CNR scoring has been applied to web scraping and content-targeted mobile rendering (López et al., 2012).
Passage Retrieval and QA: Tree-structured representations enable hierarchy-aware passage retrieval (e.g., HELD’s hierarchy-based BM25 features, which improve recall@1 (Cao et al., 2021)).
Form Understanding: Structured parsing of forms and contracts is driven by multimodal key–value extraction, as in DocStruct (Wang et al., 2020).
RAG and Knowledge Graph Construction: NovaLAD’s pipeline outputs RAG-ready document chunks and knowledge graphs, accelerating downstream generative AI (Ulla, 23 Feb 2026).
Benchmarking: The READoc benchmark characterizes gaps in realistic, unified DocTree extraction, finding that single-page or pipeline methods underperform on global structure, and VLMs, although flexible, are inefficient and brittle in layout-diverse settings (Li et al., 2024).
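One simple way a DocTree can feed retrieval or RAG pipelines is to attach each paragraph's heading path to its text before indexing. The sketch below is a generic illustration, not the feature set of HELD or the output format of NovaLAD: the dictionary schema (`type`, `text`, `children`) and the `" > "` separator are assumptions.

```python
from typing import Dict, Iterator, Tuple


def chunks(node: Dict, path: Tuple[str, ...] = ()) -> Iterator[Dict]:
    """Walk the tree in pre-order, yielding one chunk per paragraph
    together with the headings that enclose it."""
    if node["type"] == "heading":
        path = path + (node["text"],)          # extend the path for descendants
    elif node["type"] == "paragraph":
        yield {"path": " > ".join(path), "text": node["text"]}
    for child in node.get("children", []):
        yield from chunks(child, path)


doc = {"type": "root", "children": [
    {"type": "heading", "text": "2 Methods", "children": [
        {"type": "paragraph", "text": "Overview of the pipeline."},
        {"type": "heading", "text": "2.1 Data", "children": [
            {"type": "paragraph", "text": "We collect PDFs."},
        ]},
    ]},
]}

retrieval_units = list(chunks(doc))
```

Each unit carries its position in the hierarchy, so a retriever can match a query against the section titles as well as the paragraph body.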

7. Directions for Ongoing and Future Research

Emerging trends and open problems include:

Holistic Modeling: End-to-end networks jointly inferring entities, reading order, and hierarchy (Detect-Order-Construct, READoc) promise to close the gap between specialized and general-purpose systems (Wang et al., 2024, Li et al., 2024).
Scalable, Local-Context Architectures: Algorithms that model node-centric contexts (CMM pipeline) or incremental tree building (HELD) offer linearly scaling solutions for massive documents (Wang et al., 2023, Cao et al., 2021).
Multimodal and Multilingual Extension: Robustness to non-Latin scripts, low-quality scans, and non-academic layouts remains limited; future work includes multimodal pretraining and domain expansion (Hu et al., 2022).
Weak and Semi-supervised Learning: Exploiting weak annotations from markup source/text alignment dramatically reduces annotation costs and improves generalization (Rausch et al., 2019).
Hybrid Pipelines: Integrating deterministic layout parsing with neural tree generation offers efficiency and robustness, particularly in production or hybrid AI scenarios (Ulla, 23 Feb 2026, Xu et al., 26 Feb 2026).
Richer Graph Structures: Beyond tree hierarchies, documents with cross-references, footnotes, and parallel structures motivate generalization to DAGs and richer graphs, an area not yet addressed in current methods (Wang et al., 2024).

DocTree extraction now occupies a central role in document understanding, with increasingly sophisticated, multimodal, and scalable models driving advances in information access and downstream document analysis.