Docling Document Conversion Toolkit
- Docling is an open-source, modular document conversion toolkit that leverages state-of-the-art AI models for layout analysis and table structure detection.
- It employs a sequential, page-wise pipeline integrating DocLayNet, TableFormer, and OCR to accurately extract and serialize document structures.
- The toolkit’s extensibility and integration with NLP workflows enable seamless deployment in retrieval-augmented generation and business automation applications.
Docling is an end-to-end, open-source document conversion toolkit that leverages state-of-the-art AI models for layout analysis and table structure recognition, supporting robust PDF and multi-format-to-JSON/Markdown conversion. It is architected for extensibility, efficiency, and modular integration, and has been widely adopted across the open-source and research communities for document understanding, data extraction, and downstream natural language processing workflows (Auer et al., 19 Aug 2024, Livathinos et al., 27 Jan 2025, Livathinos et al., 15 Sep 2025).
1. System Architecture and Data Flow
Docling implements a modular, straight-line, page-wise processing pipeline. The system architecture decomposes document conversion into sequential processing units, each modular and independently extensible.
- Parser Backends: Support multiple input formats. For PDF, options include docling-parse (qpdf-based, extracting token geometry and rendering bitmaps) and pypdfium2. Additional backends ingest HTML (via BeautifulSoup), Markdown (via Marko), and office formats (DOCX, PPTX, XLSX via python-docx, python-pptx, openpyxl). Each yields either low-level content (tokens + bounding boxes + bitmap) or high-level elements.
- Model Pipeline: Processes each page in parallel. Core models include the DocLayNet layout analysis module (based on real-time DETR variants) and TableFormer table structure recognition (transformer-based Im2Seq), with optional OCR (EasyOCR/Tesseract) for non-tokenized bitmaps.
- Assembly & Post-Processing: Aggregates per-page predictions into a unified document object, applies language detection (Watson NLP), reading order correction, figure-caption linkage, and metadata extraction (title, authors, references).
- Serialization: Outputs structured representations—lossless JSON (preserving all semantics and geometry) or lossy Markdown/HTML (retaining only text and hierarchy) (Auer et al., 19 Aug 2024, Livathinos et al., 27 Jan 2025).
The core document object model, DoclingDocument, is a tree/graph of block-level elements (TextBlock, Image, Table, ListItem, etc.) with relations encoding page assignment, ordering, and hierarchical structure.
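A minimal usage sketch of this conversion and serialization flow, assuming the public Python API of current Docling releases (import paths and method names may differ slightly across versions):

```python
from docling.document_converter import DocumentConverter

# Convert a source file (PDF, DOCX, HTML, ...) into a DoclingDocument.
converter = DocumentConverter()
result = converter.convert("report.pdf")
doc = result.document

# Lossless serialization: preserves semantics and geometry.
doc_json = doc.export_to_dict()

# Lossy serialization: text and hierarchy only.
doc_md = doc.export_to_markdown()
print(doc_md[:500])
```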
2. Algorithms, Models, and Training
2.1 Layout Analysis: DocLayNet and Successors
- DocLayNet: Implements a RT-DETR (real-time DETR) variant with a CNN backbone and transformer decoder. Input is a resized (typically 800×800 px) bitmap; outputs are bounding boxes and class labels for up to 13-17 canonical layout types (Title, Paragraph, Table, Figure, Formula, etc.) (Livathinos et al., 27 Jan 2025).
- Losses: Standard DETR bipartite matching (Hungarian), combining cross-entropy for class labels, L1 regression for box geometry, and Generalized IoU:

$$\mathcal{L} = \sum_{i=1}^{N}\Big[-\log \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\big(\lambda_{\mathrm{L1}}\,\lVert b_i - \hat{b}_{\sigma(i)}\rVert_1 + \lambda_{\mathrm{GIoU}}\,\mathcal{L}_{\mathrm{GIoU}}(b_i, \hat{b}_{\sigma(i)})\big)\Big],$$

where $\sigma$ denotes the Hungarian assignment of predictions to ground-truth elements.
- Advanced Models: Newer models (heron, heron-101, egret-m/l/x) leverage RT-DETRv2 and D-FINE architectures, trained on a heterogeneous corpus of 150,000 documents labeled with up to 17 classes, yielding mAP improvements of up to 23.9% over the DocLayNet baseline. For example, heron-101 (RT-DETRv2 with a ResNet-101-vd backbone) attains 78.0% mAP at 28 ms/image on an A100 (Livathinos et al., 15 Sep 2025).
- Post-Processing: Includes score thresholding (a confidence cutoff on detection scores), element categorization, PDF-cell alignment, wrapper logic, and overlap resolution by rule-based heuristics; a small illustrative sketch follows.
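A self-contained sketch of these heuristics (illustrative thresholds and logic, not the shipped implementation): low-confidence detections are dropped, then overlapping boxes are resolved by keeping the higher-scoring candidate.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    score: float
    box: tuple  # (x0, y0, x1, y1) in page coordinates

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def postprocess(dets, score_thr=0.5, iou_thr=0.8):
    """Drop low-confidence boxes, then suppress overlapping duplicates.
    Threshold values are illustrative defaults, not Docling's settings."""
    kept = []
    for d in sorted(dets, key=lambda d: d.score, reverse=True):
        if d.score < score_thr:
            continue
        if all(iou(d.box, k.box) < iou_thr for k in kept):
            kept.append(d)
    return kept
```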
2.2 Table Structure Recognition: TableFormer
- Architecture: Vision transformer encoder (Swin or ViT-based) processes table crops; a lightweight transformer decoder emits Optimized Table Structure Language (OTSL) sequences, describing cell boundaries, row/column spans, and header flags.
- Objective: Sequence-level cross-entropy over OTSL tokens:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p\big(y_t \mid y_{<t},\, \mathbf{F}\big),$$

where $y_{1:T}$ is the OTSL token sequence and $\mathbf{F}$ is the image feature map (Auer et al., 19 Aug 2024).
- Performance: On key benchmarks, TableFormer achieves cell-adjacency F1 ≈ 0.89 (ICDAR2013) and structure-level F1 ≈ 0.92 (fast-flavor variant) (Livathinos et al., 27 Jan 2025).
- Integration: Detections from DocLayNet are cropped and passed to TableFormer; the resulting logical structure is associated with detected PDF coordinates and text.
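The hand-off from layout detection to table structure recognition can be sketched as follows; table_model, pdf_cells, and the attribute names are placeholders rather than the actual pipeline objects:

```python
def inside(inner, outer):
    """True if box `inner` lies within box `outer` (both as x0, y0, x1, y1)."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def recognize_tables(page_image, detections, pdf_cells, table_model):
    """For each detected Table region, crop the page bitmap, run the
    structure model, and attach the predicted OTSL structure to the PDF
    text cells that fall inside the region. `table_model` and `pdf_cells`
    stand in for the TableFormer wrapper and parsed PDF tokens."""
    tables = []
    for det in detections:
        if det.label != "Table":
            continue
        x0, y0, x1, y1 = map(int, det.box)
        crop = page_image.crop((x0, y0, x1, y1))   # PIL-style crop
        structure = table_model.predict(crop)      # OTSL token sequence (placeholder call)
        cells = [c for c in pdf_cells if inside(c.bbox, det.box)]
        tables.append({"bbox": det.box, "structure": structure, "cells": cells})
    return tables
```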
2.3 End-to-End VLMs: SmolDocling
- Architecture: SmolDocling is a 256M-parameter vision-language model combining a SigLIP vision encoder with a SmolLM-2 transformer decoder. It autoregressively emits "DocTags," an XML-style markup fully encoding content, hierarchy, and location for all block types (including code, formulas, tables, and charts); a hedged inference sketch is shown after this list (Nassar et al., 14 Mar 2025).
- Unified Training: Trained with a next-token log-likelihood loss over the DocTags vocabulary:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta\big(y_t \mid y_{<t},\, \mathbf{I}\big),$$

where $y_{1:T}$ is the DocTags sequence and $\mathbf{I}$ is the page image.
- Benchmarking: Outperforms or matches much larger LVLMs (up to 27× size) in OCR, layout, table, chart, and formula conversion across diverse domains (Nassar et al., 14 Mar 2025).
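A hedged inference sketch using the Hugging Face transformers API; the checkpoint identifier and prompt string are assumptions taken as representative, and exact usage should be checked against the SmolDocling model card:

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model_id = "ds4sd/SmolDocling-256M-preview"   # assumed repo id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("page.png")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Convert this page to docling."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Autoregressive generation of the DocTags markup for the page.
out = model.generate(**inputs, max_new_tokens=2048)
doctags = processor.batch_decode(out, skip_special_tokens=False)[0]
print(doctags)
```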
3. Datasets, Preprocessing, and Evaluation
3.1 Datasets
- DocLayNet: ∼80,000-150,000 human-annotated pages (public and proprietary) covering a diverse set of layouts and block types. Canonical-DocLayNet variant achieves 17-class coverage (Auer et al., 19 Aug 2024, Livathinos et al., 15 Sep 2025).
- Specialized Tables: PubTables-1M, TableBank, FinTabNet, WordScape, WikiTableSet.
- Charts, Code, and Formula: SynthChartNet, PlotQA, FigureQA, SynthCodeNet, SynthFormulaNet.
3.2 Preprocessing
- Page images resized (commonly 640–800 px wide, 72–150 dpi).
- On-the-fly augmentations: geometric transforms, random scaling, horizontal flips (an illustrative transform recipe follows this list).
- Tokens: Text blocks and table crops supplied both as bitmap and raw PDF coordinates for cross-alignment.
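An illustrative preprocessing recipe under these settings (a torchvision-based sketch, not the actual training code; bounding-box targets would need matching geometric transforms, omitted here):

```python
import torchvision.transforms as T

# Illustrative values: resize to the detector input resolution and apply
# the light geometric augmentations mentioned above during training.
train_transform = T.Compose([
    T.Resize((800, 800)),                          # detector input size
    T.RandomHorizontalFlip(p=0.5),
    T.RandomAffine(degrees=2, scale=(0.9, 1.1)),   # small jitter / random scaling
    T.ToTensor(),
])

eval_transform = T.Compose([
    T.Resize((800, 800)),
    T.ToTensor(),
])
```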
3.3 Evaluation
- Layout: mAP@[0.5:0.95] (mean average precision averaged over IoU thresholds from 0.50 to 0.95) across all box classes,

$$\mathrm{mAP} = \frac{1}{|C|} \sum_{c \in C} \frac{1}{10} \sum_{\tau \in \{0.50, 0.55, \ldots, 0.95\}} \mathrm{AP}_c(\tau),$$

and F1-score per class.
- Tables: Cell-level adjacency F1, TEDS (structure + content).
- OCR: Edit distance, BLEU, METEOR, and F1 (a minimal edit-distance sketch is shown after this list).
- Efficiency: Throughput benchmarks (pages/s), per-model and system memory/latency, linear complexity in number of detected elements (Auer et al., 19 Aug 2024, Livathinos et al., 27 Jan 2025, Nassar et al., 14 Mar 2025, Livathinos et al., 15 Sep 2025).
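For the OCR metrics, a minimal normalized edit-distance implementation (one common convention, normalizing by reference length) looks like this:

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance between predicted and reference text,
    normalized by the reference length (0.0 = exact match)."""
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))                 # previous DP row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                              # deletion
                        dp[j - 1] + 1,                          # insertion
                        prev + (pred[i - 1] != ref[j - 1]))     # substitution
            prev = cur
    return dp[n] / max(n, 1)
```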
4. Implementation, Extensibility, and Best Practices
4.1 Software Modularization
Pure-Python implementation structured as several modules:
- docling-parse: PDF and markup backends.
- docling-core: Document object model and type definitions.
- pipelines: Model pipelines for standard and custom flows.
- ibm_models / external: Wrappers for deployed models.
4.2 Extending Docling
- Custom Pipelines: Subclass the pipeline base and override individual stages. All per-page models must implement __call__(Iterator[Page]) → Iterator[Page], emitting enriched predictions (see the sketch after this list).
- Schema Extension: Register new element classes (e.g., "Formula"), add new prediction fields, and populate them during post-processing.
- Integration: Prebuilt adapters for LangChain, LlamaIndex, and spaCy facilitate direct use in vector databases, retrieval-augmented generation (RAG), and NLP workflows (Livathinos et al., 27 Jan 2025).
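A minimal sketch of a custom per-page enrichment stage against the interface described above; only the __call__(Iterator[Page]) → Iterator[Page] contract is taken from the document, while the class, field, and attribute names are hypothetical:

```python
from typing import Iterator

class KeywordTagger:
    """Illustrative per-page model: flags pages that mention given keywords.
    `Page.cells`, `Page.predictions`, and the `custom` field are assumed
    names for illustration, not the shipped API."""

    def __init__(self, keywords: list[str]):
        self.keywords = [k.lower() for k in keywords]

    def __call__(self, pages: Iterator["Page"]) -> Iterator["Page"]:
        for page in pages:
            text = " ".join(cell.text for cell in page.cells).lower()
            page.predictions.custom = {
                "keyword_hits": [k for k in self.keywords if k in text]
            }
            yield page
```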
4.3 Best Practices
- Curate canonical datasets; filter noisy supervision rigorously.
- Use aggressive augmentation and warm-start backbones.
- Tune score thresholds and apply overlap-resolution post-processing.
- Select model variants to optimize latency/accuracy per deployment environment (e.g., egret-m for latency, heron-101 for accuracy) (Livathinos et al., 15 Sep 2025).
5. Performance, Resource Requirements, and Benchmarking
| Model/Hardware | Accuracy (metric) | Inference Latency (A100, batch) | Pages/s (CPU/M3/GPU) | Peak Mem (GB) |
|---|---|---|---|---|
| old-docling (RT-DETR v1) | 0.541 (mAP, canonical DocLayNet) | 28 ms/image (200) | 0.60 – 2.45 (CPU) | 2.4/6.2 |
| heron-101 (RT-DETR v2) | 0.780 (mAP, canonical DocLayNet) | 28 ms/image (200) | 7.5 (GPU) | 3.0 |
| TableFormer (fast flavor) | 0.92 (cell-F1) | 0.11 s/page (L4) | 9.0 (GPU, table) | — |
| SmolDocling (VLM, 256M) | 0.231 (layout mAP) | 0.35 s/page (A100) | — | 0.49 |
Benchmarks are based on evaluation on multi-thousand-page corpora and a range of devices. TableFormer incurs 2–6 s/table on CPU, but this is substantially reduced on GPU (Auer et al., 19 Aug 2024, Livathinos et al., 27 Jan 2025, Nassar et al., 14 Mar 2025, Livathinos et al., 15 Sep 2025).
6. Applications, Integrations, and Community
6.1 Downstream Use Cases
- Retrieval-Augmented Generation: Document chunking, passage embedding, and indexing for RAG pipelines via LlamaIndex and LangChain (a simplified preparation sketch follows this list).
- Data Preparation: Integration with IBM Data Prep Kit for visual+semantic dataset construction.
- Knowledge Base Extraction: ESG-KPI extraction, statements framework for tabular reasoning.
- End-to-End Business Automation: SmolDocling supports key-value extraction, form parsing, and code documentation within a single VLM (Auer et al., 19 Aug 2024, Livathinos et al., 27 Jan 2025, Nassar et al., 14 Mar 2025).
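A simplified RAG-preparation sketch: the conversion call follows the Docling API, while the chunking and embedding steps are illustrative stand-ins for the prebuilt LlamaIndex/LangChain adapters (the splitting rule and embedding model name are assumptions):

```python
from docling.document_converter import DocumentConverter
from sentence_transformers import SentenceTransformer

# Convert and serialize the document, then build naive passages by
# splitting the Markdown export on second-level headings.
converter = DocumentConverter()
markdown = converter.convert("annual_report.pdf").document.export_to_markdown()
passages = [p.strip() for p in markdown.split("\n## ") if p.strip()]

# Embed the passages with a generic sentence-embedding model; in a real
# RAG pipeline these vectors would be written to a vector store.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(passages)
print(len(passages), vectors.shape)
```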
6.2 Community and Ecosystem
- Open-source MIT license; >10,000 GitHub stars within its initial launch period. Significant community engagement, with contributions to backends, performance, and language support (Livathinos et al., 27 Jan 2025).
- All trained model checkpoints and documentation are released on Hugging Face under permissive licensing (Livathinos et al., 15 Sep 2025).
6.3 Limitations and Future Directions
- Moderate recall for visually ambiguous or complex layout types may still require task-specific post-processing.
- Error cases include well-formedness of complex tag structures in end-to-end VLM outputs; future work includes structure-constrained decoding and improved high-resolution processing (Nassar et al., 14 Mar 2025, Livathinos et al., 15 Sep 2025).
7. Summary and Comparative Impact
Docling defines the reference architecture for efficient, extensible, and accurate document-to-structure conversion in the open-source ecosystem. By combining strong modularity, a powerful Python API, and state-of-the-art models (DocLayNet, TableFormer, and advanced layout detectors), Docling achieves high-precision layout and table recovery, robust performance across hardware, and seamless integration in core NLP and RAG pipelines (Auer et al., 19 Aug 2024, Livathinos et al., 27 Jan 2025, Livathinos et al., 15 Sep 2025). The emergence of end-to-end compact VLMs (e.g., SmolDocling) extends the paradigm to fully unified, parameter-efficient page conversion, broadening practical deployment and enabling structure-aware information retrieval and document analysis at unprecedented scale.