Format Graph Construction

Updated 18 May 2026

Format graph construction recovers the full hierarchical structure of a document, including parent–child and reading-order relations, from raw or semi-structured formats.
Methodologies range from rule-based heuristics to sequential insertion with learned models and hybrid multimodal approaches that integrate text, layout, and visual cues.
Empirical studies report high accuracy on tasks such as ToC extraction and passage retrieval, underscoring the role of format graph construction in document understanding and generative AI applications.

Format graph construction refers to the class of algorithms and system architectures that recover the full hierarchical, document-level structure (typically represented as a tree or forest) from raw, semi-structured, or unstructured digital formats such as PDF, HTML, or document images. The resulting format graph (more commonly called a document tree or DocTree in recent literature) encodes not only the reading sequence but also the parent–child and left–right ordering relations that organize headings, paragraphs, lists, tables, figures, and other semantic units. Format graph construction underpins downstream document understanding, passage retrieval, information extraction, and reflow for generative AI models.

1. Formal Definitions and Core Taxonomy

In contemporary research, format graph construction is variously termed DocTree construction, Document Structured Extraction (DSE), hierarchical document structure analysis (HDSA), or Table of Contents (ToC) extraction (Wang et al., 2024, Li et al., 2024, Hu et al., 2022). The canonical output is a rooted, ordered tree $T = (V, E)$:

Nodes ($V$): Semantic units such as headings of various levels, paragraphs or text blocks, tables, figures, lists, formulas, and code blocks.
Edges ($E$): Directed parent–child relations that encode nesting (e.g., section–subsection) and sibling order, sometimes supplemented by layout-specific edges for headers, footers, and sidebars.

In formal terms, given an input document $D$ (e.g., raw PDF, HTML, or OCR fragment set), format graph construction amounts to learning a function $f : D \to T$, where $T$ captures both the correct reading order and the hierarchical, nested structure of the document (Li et al., 2024, Wang et al., 2024, Xu et al., 26 Feb 2026).

Node type taxonomies typically include five core categories: Heading, PlainText, Formula, Table, CodeBlock (Li et al., 2024), sometimes extended to figures, images, lists, captions, footers, and supplementary blocks (Xu et al., 26 Feb 2026, Wang et al., 2024, Hu et al., 2022, Rausch et al., 2019).
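To make the definition concrete, the following is a minimal sketch of a rooted, ordered tree with typed nodes. The class names (`NodeType`, `DocNode`), the synthetic `ROOT` type, and the `level` field are assumptions made for illustration; they are not taken from any of the cited systems.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Iterator, List, Optional


class NodeType(Enum):
    ROOT = "Root"            # synthetic root; an assumption of this sketch
    HEADING = "Heading"
    PLAIN_TEXT = "PlainText"
    FORMULA = "Formula"
    TABLE = "Table"
    CODE_BLOCK = "CodeBlock"


@dataclass
class DocNode:
    node_type: NodeType
    text: str = ""
    level: Optional[int] = None  # heading depth, when applicable
    children: List["DocNode"] = field(default_factory=list)  # ordered siblings

    def add(self, child: "DocNode") -> "DocNode":
        """Append a child as the new rightmost sibling and return it."""
        self.children.append(child)
        return child

    def reading_order(self) -> Iterator["DocNode"]:
        """Pre-order traversal: a parent first, then its children left to right."""
        yield self
        for child in self.children:
            yield from child.reading_order()


root = DocNode(NodeType.ROOT)
intro = root.add(DocNode(NodeType.HEADING, "1 Introduction", level=1))
intro.add(DocNode(NodeType.PLAIN_TEXT, "Documents have structure."))
method = root.add(DocNode(NodeType.HEADING, "2 Method", level=1))
method.add(DocNode(NodeType.FORMULA, "f : D -> T"))

order = [node.text for node in root.reading_order()]
```

In this representation the edge set $E$ is implicit in the `children` lists, and a pre-order traversal recovers the reading order from the hierarchy.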

2. Methodological Paradigms

Three principal paradigms dominate format graph construction:

a. Rule-based and Heuristic Tree Construction

Methods such as those in (López et al., 2012, Shrivastava et al., 2021, Wang et al., 2023, Ulla, 23 Feb 2026) construct initial trees using geometric cues and hand-written heuristics:

HTML DOM-based: Model the document as a DOM tree $(N, E)$ , compute per-node statistics (e.g., chars-nodes ratio, CNR), prune nodes (e.g., scripts, styles), and aggregate high-CNR nodes into blocks via lowest-common-ancestor merging (López et al., 2012).
Reading-Order and Font-size Trees: Text blocks are sorted in reading order (left-to-right, top-to-bottom); each block is made a child of the nearest prior block with larger font size, building an initial tree (CMM, (Wang et al., 2023)).
Spatial Clustering/Assignment: Page elements are assigned to layout regions (columns, multi-column, row groups) by geometric clustering, then grouped into segments or subtrees (NovaLAD, (Ulla, 23 Feb 2026)).

These approaches rely on hand-crafted rules (e.g., for parent assignment or block merging), spatial/visual cues (font size, bounding-box position), and limited statistical thresholds.
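The font-size rule described above (each block becomes a child of the nearest preceding block with a larger font) can be sketched with a stack. This is an illustrative reading of the heuristic, not the CMM implementation; the `(text, font_size)` input format and the function name `font_size_parents` are hypothetical.

```python
from typing import List, Optional, Tuple

Block = Tuple[str, float]  # (text, font size in points); hypothetical input format


def font_size_parents(blocks: List[Block]) -> List[Optional[int]]:
    """For each block in reading order, return the index of its parent.

    The parent is the nearest earlier block with a strictly larger font size;
    None means the block attaches to the document root.
    """
    parents: List[Optional[int]] = []
    stack: List[int] = []  # indices whose font sizes are strictly decreasing
    for i, (_, size) in enumerate(blocks):
        # Blocks that are not larger than the current one cannot be its parent,
        # nor the parent of anything that follows it.
        while stack and blocks[stack[-1]][1] <= size:
            stack.pop()
        parents.append(stack[-1] if stack else None)
        stack.append(i)
    return parents


blocks = [
    ("1 Introduction", 16.0),
    ("Body text A", 10.0),
    ("1.1 Background", 13.0),
    ("Body text B", 10.0),
    ("2 Method", 16.0),
    ("Body text C", 10.0),
]
parents = font_size_parents(blocks)  # [None, 0, 0, 2, None, 4]
```

The stack keeps the pass linear in the number of blocks, which matches the intent of using such a rule only to build an initial tree that later stages may refine.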

b. Sequential or Pairwise Insertion with Learned Models

Frameworks such as HELD (Cao et al., 2021) and tree-decoder architectures (Hu et al., 2022, Wang et al., 2024) treat tree construction as a sequence of insertion or link-prediction decisions:

Sequential Insertion (HELD): Given a list of detected objects, each is inserted at the optimal position among the current tree’s legal candidate slots (down the rightmost path), using a “put-or-skip” binary classifier (e.g., BiLSTM-based). Traversal and candidate scoring strategies control trade-offs between inference quality and computational cost.
Pairwise Link Prediction: For each candidate node, predict its parent and/or sibling via neural scoring functions, often using multimodal encoders (textual, layout, vision features) and dependency parsing analogs (Wang et al., 2024, Hu et al., 2022).
Tree-Decoder Models: Heading entities are encoded multimodally, then a sequential decoder with (hard/soft) attention predicts parent/sibling/identity links, building the tree in document order (Hu et al., 2022).
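The sequential-insertion idea can be sketched as a loop over the rightmost path of the partial tree, with a binary "put-or-skip" decision at each candidate parent. In the sketch below the learned classifier is replaced by a hand-written rule (`toy_put`) purely for illustration, and the `Node` class, the `level` feature, and the bottom-up traversal order are all assumptions rather than details of HELD.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    text: str
    level: int  # 0 = root, larger = deeper; a hypothetical stand-in feature
    children: List["Node"] = field(default_factory=list)


def rightmost_path(root: Node) -> List[Node]:
    """Nodes from the root down to the last inserted leaf."""
    path = [root]
    while path[-1].children:
        path.append(path[-1].children[-1])
    return path


def insert_sequentially(items: List[Node], put: Callable[[Node, Node], bool]) -> Node:
    """Insert items in reading order; `put(parent, item)` is the put-or-skip decision."""
    root = Node("ROOT", 0)
    for item in items:
        # Candidates are visited bottom-up here; other traversal orders are possible.
        for candidate in reversed(rightmost_path(root)):
            if put(candidate, item) or candidate is root:  # root is the fallback
                candidate.children.append(item)
                break
    return root


def toy_put(parent: Node, child: Node) -> bool:
    """Placeholder for a learned classifier: accept any shallower parent."""
    return parent.level < child.level


items = [
    Node("1 Intro", 1),
    Node("1.1 Scope", 2),
    Node("text", 3),
    Node("2 Method", 1),
    Node("text2", 3),
]
tree = insert_sequentially(items, toy_put)
```

Restricting candidates to the rightmost path means each insertion inspects at most as many slots as the current tree depth, rather than every earlier node.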

c. Hybrid and Multimodal Methods

Recent systems leverage deep multimodal fusion:

Multimodal Feature Encoding: Jointly encode semantic (BERT or similar), layout (2D geometry, font), and visual (CNN) representations for each region or OCR fragment, often with gated fusion (Wang et al., 2020, Hu et al., 2022).
Graph Neural Networks: Local node-centric subtrees are modeled with BiGRU (for reading order context) and Graph Attention Networks (GATs) to integrate local and long-range context, supporting scalable refinement actions (keep/delete/move) (Wang et al., 2023).
Relation Prediction and Dependency Parsing: Tree construction formulated as parallel parent/sibling prediction (Construct phase) with cross-entropy losses over possible edges, often optimized end-to-end (Wang et al., 2024).
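Parent prediction with a cross-entropy objective can be illustrated without any neural machinery: each node has one score per candidate parent, a softmax turns the scores into a distribution, and the loss is the negative log-probability of the gold parent. The scores below are made-up numbers standing in for the output of a learned scorer, and the convention that candidate 0 is the root is an assumption of this sketch.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one node's candidate-parent scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def parent_cross_entropy(score_rows: Sequence[Sequence[float]],
                         gold_parents: Sequence[int]) -> float:
    """Mean negative log-likelihood of the gold parent of each node.

    score_rows[i][j] scores candidate j as the parent of node i, where
    candidate 0 is the root and candidate j > 0 is node j - 1.
    """
    losses = [-math.log(softmax(row)[gold])
              for row, gold in zip(score_rows, gold_parents)]
    return sum(losses) / len(losses)


def predict_parents(score_rows: Sequence[Sequence[float]]) -> List[int]:
    """Pick the highest-scoring candidate parent for every node."""
    return [max(range(len(row)), key=row.__getitem__) for row in score_rows]


# Hypothetical scores: node i may attach to the root or to any earlier node.
score_rows = [
    [0.0],            # node 0: only the root is available
    [0.1, 2.0],       # node 1: root or node 0
    [0.2, 1.5, 0.3],  # node 2: root, node 0, or node 1
]
gold_parents = [0, 1, 1]
predicted = predict_parents(score_rows)
loss = parent_cross_entropy(score_rows, gold_parents)
```

Because every node's parent is scored independently, all rows can be evaluated in parallel, which is what makes this formulation attractive for end-to-end training.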

Many frameworks partition the task into modular phases: object detection, classification (heading detection), reading order, and hierarchical tree construction. Single-pass or pipeline approaches dominate, with downstream coreference and entity linking sometimes included.

3. Evaluation Protocols, Benchmarks, and Metrics

Evaluation of format graph construction emphasizes both structural and content correctness:

Tree Edit Distance Similarity (TEDS):

$\text{TEDS}(T_1, T_2) = 1 - \frac{\text{TED}(T_1,T_2)}{\max(|T_1|,|T_2|)}$

where TED is the tree-edit distance, and $|T|$ is the number of nodes (Hu et al., 2022, Li et al., 2024, Wang et al., 2024, Wang et al., 2023).

Edit Distance Similarity (EDS): Normalized Levenshtein distance over text spans or tables (Li et al., 2024).
Kendall’s Tau Distance Similarity (KTDS): Measures how closely the predicted reading order agrees with the gold order, based on pairwise ordering agreement (Li et al., 2024).
Vocabulary-level F1: Word-level overlap of predicted and gold text tokens (Li et al., 2024).
Downstream Metrics: Passage retrieval accuracy (mAP, recall@k), entity detection mAP (COCO-style), hierarchical-reconstruction micro/macro STEDS (Wang et al., 2024).
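The three similarity metrics above can be sketched in a few lines for small inputs. The tree edit distance here is a simple memoized recursion with unit costs, adequate for toy trees but not an optimized algorithm, and the normalizations used for EDS and KTDS are common choices that may differ from a given benchmark's exact definitions. The nested-tuple tree format and all function names are assumptions of this sketch.

```python
from functools import lru_cache
from itertools import combinations

# A tree is (label, children), children being a tuple of trees; a forest is a tuple of trees.


def forest_size(forest) -> int:
    return sum(1 + forest_size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_edit_distance(f1, f2) -> int:
    """Unit-cost ordered edit distance between two forests."""
    if not f1:
        return forest_size(f2)
    if not f2:
        return forest_size(f1)
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    return min(
        forest_edit_distance(f1[:-1] + kids1, f2) + 1,  # delete rightmost root of f1
        forest_edit_distance(f1, f2[:-1] + kids2) + 1,  # insert rightmost root of f2
        forest_edit_distance(kids1, kids2)              # match the two rightmost roots
        + forest_edit_distance(f1[:-1], f2[:-1])
        + (label1 != label2),                           # relabel cost if they differ
    )


def teds(t1, t2) -> float:
    """TEDS = 1 - TED / max(|T1|, |T2|), with |T| the number of nodes."""
    distance = forest_edit_distance((t1,), (t2,))
    return 1 - distance / max(forest_size((t1,)), forest_size((t2,)))


def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def eds(a: str, b: str) -> float:
    """1 minus Levenshtein distance normalized by the longer string (assumed form)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def ktds(predicted_order, gold_order) -> float:
    """1 minus the fraction of item pairs whose relative order disagrees (assumed form)."""
    rank = {item: i for i, item in enumerate(gold_order)}
    ranks = [rank[item] for item in predicted_order]
    pairs = list(combinations(range(len(ranks)), 2))
    if not pairs:
        return 1.0
    discordant = sum(ranks[i] > ranks[j] for i, j in pairs)
    return 1 - discordant / len(pairs)


gold = ("root", (("h1", (("p", ()),)), ("h1", ())))
pred = ("root", (("h1", (("p", ()), ("p", ()))),))
score = teds(gold, pred)  # one deletion plus one insertion over 4 nodes: 0.5
```

In the example, the prediction misses the second heading and attaches an extra paragraph under the first, so two unit edits on four-node trees give a TEDS of 0.5.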

Public benchmarks include READoc (PDF→Markdown, 2,233 documents) (Li et al., 2024), Comp-HRDoc (1,500 documents, complex layouts, hierarchical ToC, (Wang et al., 2024)), HierDoc (650 arXiv-style papers, (Hu et al., 2022)), ESGDoc (annual reports, 1,093 docs, (Wang et al., 2023)), and DP-Bench (structured JSON, Markdown, NID, and TEDS metrics, (Ulla, 23 Feb 2026)).

4. Scalability and Complexity Analysis

Different approaches yield distinct computational and memory profiles:

Rule-based methods (e.g., font-size trees, spatial clustering) typically run in near $O(N)$ time, where $N$ is the number of layout elements (Wang et al., 2023, López et al., 2012, Ulla, 23 Feb 2026). The CMM pipeline is designed for $O(N)$ document-level complexity by constraining node-centric subtrees.
Pairwise or exhaustive models (e.g., MTD, pairwise insertion) require $O(N^2)$ runtime and memory for all possible parent–child pairs, which can be prohibitive for multi-hundred-page documents (Hu et al., 2022, Wang et al., 2023). CMM and similar strategies avoid this via local subtree extraction.
End-to-end deep models (Detect-Order-Construct, (Wang et al., 2024)) parallelize detection, ordering, and construction; practical GPU-based training handles thousands of documents with up to 50 pages efficiently.
CPU-optimized pipelines (NovaLAD) exploit thread pools for parallel YOLO detection, image classification, and layout grouping, keeping per-document inference time low (e.g., 8.5 s on DP-Bench) (Ulla, 23 Feb 2026).
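The gap between exhaustive pairwise scoring and locally restricted candidates is easy to quantify with a rough count. The fixed-size window below is only a stand-in for local restriction in general; it is an assumption of this sketch and not the mechanism used by any cited system.

```python
def exhaustive_pairs(n: int) -> int:
    """(parent, child) candidates when every earlier element may be a parent."""
    return n * (n - 1) // 2


def windowed_pairs(n: int, k: int) -> int:
    """Candidates when each element considers only its k nearest predecessors."""
    return sum(min(i, k) for i in range(n))


n_elements = 1000
all_pairs = exhaustive_pairs(n_elements)       # 499500
local_pairs = windowed_pairs(n_elements, 20)   # 19790
```

For 1,000 layout elements, the exhaustive scheme scores roughly 25 times as many candidate pairs as the windowed one, and the ratio grows linearly with document length.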

Most modern systems incorporate scalable data structures (e.g., R-trees for candidate lookup), early pruning strategies, and model ablation to balance accuracy with throughput on large-scale corpora.

5. Empirical Findings and Comparative Results

Rigorous comparisons on public and proprietary benchmarks demonstrate:

MTD achieves TEDS=87.2% and F1=88.1% on scientific ToC extraction (HierDoc), and heading-detection F1 of 96.1% (Hu et al., 2022).
CMM yields 88.1% TEDS on HierDoc, but greatly outperforms MTD on ESGDoc full-length reports (TEDS 33.2% vs. 12.8%) and scales efficiently to 500+ page documents (Wang et al., 2023).
HELD reports state-of-the-art results on variable-depth logical hierarchy extraction, with accuracies of 97.3%, 73.0%, and 95.8% on Chinese, English financial, and arXiv data, respectively (Cao et al., 2021).
Detect-Order-Construct achieves micro-STEDS up to 86.05% on Comp-HRDoc and 95.04% on simpler HRDoc instances, outperforming prior pipelines and jointly optimizing detection, reading order, and hierarchy (Wang et al., 2024).
NovaLAD attains 96.49% TEDS and 98.51% NID on DP-Bench, surpassing both commercial and open-source baselines in CPU-only settings (Ulla, 23 Feb 2026).
Weak supervision, multimodal fusion, and bottom-up structure refinement substantially improve relation parsing F1 and detection (DocParser, DocStruct) (Rausch et al., 2019, Wang et al., 2020).
Ablation studies show multimodal fusion and graph-based context integration yield large TEDS and F1 gains over pure text or layout features (Hu et al., 2022, Wang et al., 2023, Wang et al., 2020).

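Headline scores from selected systems are summarized below. The metric differs by system, so rows are not directly comparable:

System	Dataset	Metric	Score (%)
MTD	HierDoc	TEDS	87.2
CMM	HierDoc	TEDS	88.1
CMM	ESGDoc (full)	TEDS	33.2
HELD	arXiv	Accuracy	95.8
Detect-Order-Construct	Comp-HRDoc	Micro-STEDS	86.1
NovaLAD	DP-Bench	TEDS	96.5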

6. Advanced Topics and Limitations

Despite significant progress, several open problems remain:

Cross-Page Hierarchy: Most systems process pages in isolation; few treat the document as a single structured object, hindering robust parent–child linking across page boundaries (Li et al., 2024).
Multi-Modality & Visual Semantics: Highly variable table, formula, float, and sidebar layouts demand tight integration of vision, layout, and text modules.
Deep Layout Complexity: Multi-column, float-heavy, or highly decorative formats cause degradation in recall and tree similarity, especially as ToC depth increases (Li et al., 2024).
Error Propagation: Page object detection and reading-order prediction errors propagate to hierarchy construction; robust end-to-end joint models remain scarce (Wang et al., 2024).
Non-Tree Structures: Structures that require a directed acyclic graph (e.g., cross-reference links, multi-parent nodes) are generally not supported; most systems assume a strict tree.
Resource Efficiency: Vision-language model (VLM) approaches have high memory and compute costs; CPU-optimized pipelines (NovaLAD) demonstrate speed but may underperform on ambiguous structure detection.

Unified loss functions (e.g., differentiable tree-edit distance objectives) and graph-structured logic representations (beyond trees) are targets for future research (Li et al., 2024, Wang et al., 2024).

7. Applications and Impact

Format graph construction enables a suite of downstream tasks:

Passage-level Retrieval & Information Extraction: Accurate tree paths and passage metadata enable substantial mAP and recall@k improvements in passage retrieval, supporting semantic search and QA (Cao et al., 2021, Li et al., 2024).
Summarization and Generative AI: Hierarchical segmentation and node metadata allow efficient chunking for LLMs and retrieval-augmented generation (Ulla, 23 Feb 2026).
Knowledge Base Construction: Structured, layout- and semantics-aware graphs can be directly converted to knowledge graphs, JSON, Markdown, and other structured outputs (Ulla, 23 Feb 2026, Li et al., 2024).
Small-Screen, Accessibility, and Reflow: Accurate region- and block-level hierarchy extraction is central to document reflow for small-screen or accessible reading (López et al., 2012).
Document Understanding in Finance, Government, and Enterprise: Organization chart extraction, form understanding, and legal/financial directory parsing (e.g., directory blocks, key–value relations) depend on robust tree-structured recovery (Wang et al., 2020, Shrivastava et al., 2021).
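As an illustration of how a recovered hierarchy supports retrieval and LLM chunking, the sketch below flattens a tree into chunks that each carry their heading path as metadata. The `(kind, text, children)` node layout, the kind names, and the `" > "` path separator are hypothetical choices for this example only.

```python
from typing import Dict, List, Tuple

# A node is (kind, text, children); this layout is a hypothetical stand-in
# for whatever tree structure a parser actually produces.
Node = Tuple[str, str, list]


def chunks_with_paths(node: Node, path: Tuple[str, ...] = ()) -> List[Dict[str, str]]:
    """Flatten a document tree into content chunks annotated with their heading path."""
    kind, text, children = node
    chunks: List[Dict[str, str]] = []
    if kind == "heading":
        path = path + (text,)  # descendants inherit this heading in their path
    elif kind != "root":
        chunks.append({"path": " > ".join(path), "kind": kind, "text": text})
    for child in children:
        chunks.extend(chunks_with_paths(child, path))
    return chunks


doc: Node = ("root", "", [
    ("heading", "1 Introduction", [
        ("paragraph", "Documents have structure.", []),
    ]),
    ("heading", "2 Method", [
        ("heading", "2.1 Parsing", [
            ("table", "Table 1: results", []),
        ]),
    ]),
])
chunks = chunks_with_paths(doc)
```

Each chunk can then be indexed or passed to a model together with its path (for example `2 Method > 2.1 Parsing`), so that a retrieved passage keeps the context that flat text splitting would lose.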

Recent unified benchmarks (READoc, Comp-HRDoc, DP-Bench) and code releases (NovaLAD, MoDora, MTD) catalyze ongoing improvement and foster rigorous standardization across the field (Ulla, 23 Feb 2026, Li et al., 2024, Hu et al., 2022, Xu et al., 26 Feb 2026).