Docling: AI Document Conversion Toolkit

Updated 6 May 2026

Docling is a Python toolkit designed for high-fidelity document conversion, offering robust layout analysis and table recognition across diverse formats.
It integrates modular parsing backends, transformer-based models, and OCR to accurately process both born-digital and scanned documents.
The toolkit’s extensibility and performance make it ideal for applications in RAG workflows, regulatory review, and enterprise data extraction.

Docling is an open-source, MIT-licensed Python toolkit designed for high-fidelity, AI-driven document conversion, enabling transformation of a broad range of formats (PDF, scanned images, Office files, HTML) into structured, machine-consumable representations. Docling integrates state-of-the-art layout analysis and table structure recognition models in a modular, extensible pipeline suitable for large-scale academic, business, regulatory, or question-answering (QA) workflows. It is recognized for efficiency, rapid adoption, and extensibility, operating reliably on commodity CPUs and GPUs and supporting integration with Retrieval-Augmented Generation (RAG) pipelines, annotation systems, and hybrid agentic frameworks (Livathinos et al., 27 Jan 2025, Auer et al., 2024, Livathinos et al., 15 Sep 2025, Santos et al., 30 Mar 2026, Yashwant et al., 17 Oct 2025, Gupta et al., 30 Apr 2026, Kocbek et al., 18 Dec 2025).

1. Architecture and Modular Pipeline

Docling’s architecture centers on a multi-stage pipeline with tight separation between backend parsing, model processing, and output assembly (Auer et al., 2024, Livathinos et al., 27 Jan 2025). Its core stages are:

Parsing Backend: Extracts text tokens (with 2D coordinates), embedded images, and renders each page as a bitmap (for vision tasks). Two main backends are provided: docling-parse (qpdf-based, maximal quality) and pypdfium (optimized for speed and memory).
Model Pipeline: Sequentially applies specialized models for document layout analysis (DocLayNet), table structure recognition (TableFormer), and optional OCR (EasyOCR/Tesseract). Models are implemented as callable pipeline steps, allowing custom insertions or reordering.
Post-Processing & Assembly: Combines predictions into a global document representation (the Pydantic DoclingDocument), resolves reading order, links figures with captions, suppresses noise regions (e.g., headers/footers), and serializes to Markdown or JSON.
Extensibility: The pipeline supports easy integration of custom models (e.g., figure classifiers, schema mappers) via Python entry points (Auer et al., 2024, Livathinos et al., 27 Jan 2025).

The system efficiently handles both born-digital and scanned documents, and includes robust support for batch processing, metadata extraction, and performance tuning.
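
The idea of models as callable pipeline steps can be illustrated with a minimal sketch. This is not Docling's actual pipeline code: the `Page` class, the step functions, and `run_pipeline` are hypothetical names introduced here, and the "models" are trivial stand-ins. It only shows why a list of callables makes insertion and reordering of steps straightforward.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Page:
    """Hypothetical per-page state handed from one pipeline step to the next."""
    number: int
    lines: List[str]
    regions: List[Dict[str, str]] = field(default_factory=list)


# A step is any callable that accepts a Page and returns a Page.
Step = Callable[[Page], Page]


def layout_step(page: Page) -> Page:
    # Stand-in for a layout model: one region per line, footers flagged by prefix.
    for line in page.lines:
        label = "page_footer" if line.startswith("Page ") else "text"
        page.regions.append({"label": label, "text": line})
    return page


def tag_headers_step(page: Page) -> Page:
    # Custom step: relabel very short text regions as section headers.
    for region in page.regions:
        if region["label"] == "text" and len(region["text"].split()) <= 3:
            region["label"] = "section_header"
    return page


def suppress_noise_step(page: Page) -> Page:
    # Stand-in for post-processing: drop regions marked as page footers.
    page.regions = [r for r in page.regions if r["label"] != "page_footer"]
    return page


def run_pipeline(page: Page, steps: List[Step]) -> Page:
    # Steps run strictly in list order, so inserting or reordering is a list edit.
    for step in steps:
        page = step(page)
    return page


page = Page(number=1, lines=["Intro", "Docling converts documents to structured data.", "Page 1"])
result = run_pipeline(page, [layout_step, tag_headers_step, suppress_noise_step])
print([(r["label"], r["text"]) for r in result.regions])
```

Here the custom `tag_headers_step` is slotted between layout detection and noise suppression without touching either neighbour, which is the property the modular design aims for.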

2. Core AI Models: Layout Analysis and Table Recognition

The document understanding capabilities are driven by two open-source optimized models:

DocLayNet (Layout Analysis): Based on the RT-DETR(v2) and DFINE model families, the DocLayNet-trained layout model performs real-time, transformer-based object detection on page images to identify blocks such as paragraphs, tables, list items, figures, captions, and section headers. Recent upgrades (“heron”, an RT-DETRv2-R50 model, and “heron-101”, an RT-DETRv2-R101 model) have improved mAP by over 23 pp, to ~78% mAP@[.5:.95] at 28 ms/image on an A100 GPU (Livathinos et al., 15 Sep 2025). Training draws on a heterogeneous corpus (DocLayNet, DocLayNet-v2, WordScape), and careful post-processing merges, re-labels, and hierarchically nests layout elements.
TableFormer (Table Structure Recognition): Adopts a vision-transformer encoder plus a structure-decoding transformer head emitting sequences in Optimized Table Structure Language (OTSL). The system decodes image table crops and detected text tokens into structured representations (JSON tables with rowSpan and colSpan). Benchmarks on PubTabNet report cell-level F1 ≈ 0.96, EM ≈ 0.82 (Auer et al., 2024, Livathinos et al., 27 Jan 2025).
OCR Integration: For documents lacking a reliable text layer, Docling enables OCR modules (EasyOCR, Tesseract), merging outputs to handle complex, low-quality scans.

These models are modular and independently replaceable or extensible via the pipeline configuration interface.
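
To make the step from an OTSL sequence to cells with rowSpan and colSpan concrete, here is a simplified decoder sketch. It assumes a five-token vocabulary ("C" for a cell, "L" for a cell merged with its left neighbour, "U" for a cell merged with its upper neighbour, "X" for the interior of a two-dimensional span, "NL" for end of row); these token names and the function `otsl_to_cells` are assumptions for illustration, not TableFormer's actual output format or decoding code.

```python
from typing import Dict, List


def otsl_to_cells(tokens: List[str]) -> List[Dict[str, int]]:
    """Decode a flat OTSL-style token sequence into cells with spans (simplified)."""
    # Split the flat sequence into grid rows at every "NL" token.
    rows: List[List[str]] = []
    current: List[str] = []
    for tok in tokens:
        if tok == "NL":
            rows.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        rows.append(current)

    cells = []
    for r, row in enumerate(rows):
        for c, tok in enumerate(row):
            if tok != "C":
                continue  # "L", "U" and "X" only extend a cell that starts elsewhere
            # Count consecutive "L" tokens to the right to get the column span.
            col_span = 1
            while c + col_span < len(row) and row[c + col_span] == "L":
                col_span += 1
            # Count consecutive "U" tokens below to get the row span.
            row_span = 1
            while (
                r + row_span < len(rows)
                and c < len(rows[r + row_span])
                and rows[r + row_span][c] == "U"
            ):
                row_span += 1
            cells.append({"row": r, "col": c, "rowSpan": row_span, "colSpan": col_span})
    return cells


# A 2x3 grid: the first cell spans two columns, the last column spans two rows.
print(otsl_to_cells(["C", "L", "C", "NL", "C", "C", "U", "NL"]))
```

The sketch does not validate the sequence (for example, it does not check that "X" tokens sit inside a well-formed rectangle); a production decoder would need those checks, plus the matching of detected text tokens to cells.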

3. Unified Document Representation and Outputs

Docling’s pipeline produces a unified, tree-structured DoclingDocument, encapsulating:

Hierarchical Page Elements: Each page contains an ordered list of elements (Paragraph, Heading, ListItem, Table, Figure, Caption, etc.), each with type, text/content or child elements, bounding boxes, provenance metadata (page, backend), and optional reading-order indices.
Chunking and Export: Documents can be chunked for RAG or agentic retrieval systems, with chunk boundaries respecting semantic and layout structure. Output formats include enriched Markdown (with tables, figures as images/captions, native headers, code blocks), JSON, and HTML.
Metadata and Enrichment: Breadcrumb metadata, section hierarchies, and embedded image captions (via Vision-LLMs) are included where requested. Tables and formulas are converted and normalized for downstream analyses.

This representation is designed for optimal downstream compatibility: enabling accurate semantic search, QA, regulatory review, or further machine annotation (Santos et al., 30 Mar 2026, Kocbek et al., 18 Dec 2025, Gupta et al., 30 Apr 2026).
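
Structure-aware chunking with breadcrumb metadata can be sketched as follows. The `Element` class and `chunk_with_breadcrumbs` function are hypothetical and far simpler than the DoclingDocument model; the sketch assumes a flat, reading-ordered list of headings and paragraphs and only illustrates how chunk boundaries can follow section structure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Element:
    """Hypothetical flattened document element in reading order."""
    kind: str       # "heading" or "paragraph"
    text: str
    level: int = 0  # heading depth (1 = top level); unused for paragraphs


def chunk_with_breadcrumbs(elements: List[Element]) -> List[Dict[str, str]]:
    chunks: List[Dict[str, str]] = []
    trail: List[str] = []  # current heading path, outermost first
    body: List[str] = []   # paragraphs collected under the current heading

    def flush() -> None:
        # Emit the collected paragraphs as one chunk tagged with the heading path.
        if body:
            chunks.append({"breadcrumb": " > ".join(trail), "text": "\n".join(body)})
            body.clear()

    for el in elements:
        if el.kind == "heading":
            flush()                    # a heading always closes the previous chunk
            del trail[el.level - 1:]   # drop headings at the same or deeper level
            trail.append(el.text)
        else:
            body.append(el.text)
    flush()
    return chunks


doc = [
    Element("heading", "Guide", 1),
    Element("paragraph", "Intro text."),
    Element("heading", "Setup", 2),
    Element("paragraph", "Install it."),
    Element("paragraph", "Configure it."),
    Element("heading", "Usage", 2),
    Element("paragraph", "Run it."),
]
for chunk in chunk_with_breadcrumbs(doc):
    print(chunk)
```

Because each chunk carries its heading path, a retriever can index the breadcrumb alongside the text, so a passage keeps its section context when it is retrieved in isolation.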

4. Applications in Retrieval-Augmented Generation and QA

Docling serves as the document-conversion backbone in several RAG-enabled QA frameworks (Santos et al., 30 Mar 2026, Kocbek et al., 18 Dec 2025, Gupta et al., 30 Apr 2026). Key applications and empirical findings:

PDF→RAG Pipelines: Docling’s hierarchical extraction and metadata enrichment yield marked improvements in downstream QA accuracy. For domain-specific QA on administrative documents, Docling with hierarchy-aware splitting and VLM image descriptions achieves 94.1% LLM-judged accuracy (vs. 86.9% for naïve PDFLoader and 97.1% for human-curated Markdown). Structure- and metadata-aware chunking contributes more to accuracy than converter choice alone (Santos et al., 30 Mar 2026).
Biomedical MM-RAG: In visually dense domains, Docling supports both text-centric and visual retrieval (OCR-free). For mid-size base models, conversion to text (multi-modal summaries) outperforms image-only retrieval; for frontier vision-LLMs, late-interaction visual retrieval (ColFlor/ColPali) achieves parity or slight advantages (e.g., GPT-5+ColFlor reaches 0.828 accuracy, with high efficiency). Chunking and table/figure summaries are generated automatically using Docling’s pipeline (Kocbek et al., 18 Dec 2025).
Chartered Accountancy RAG: A layout-aware Docling extraction ensures structure is preserved in retrieval, significantly improving table-centric query precision and retrieval robustness for legal/financial documents. Maintaining table fidelity yields ~12% improvement for table queries. Docling consistently outperforms alternatives such as Dolphin or PaddleOCR on layout and structure tasks (Gupta et al., 30 Apr 2026).
Invoice Information Extraction: The “CV-first” Docling parser, using a multimodal transformer (DocFormer-style), provides layout-anchored key field extraction; its outputs serve as inputs to hybrid LLM-based QA or validation services (Yashwant et al., 17 Oct 2025).
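
The table-fidelity point above can be illustrated with a small sketch: if a table is serialized with its row and column structure intact, a retrieved chunk still shows which value belongs to which header. The function `table_to_markdown` is a hypothetical helper written for this illustration and is not Docling's exporter.

```python
from typing import List, Sequence


def table_to_markdown(header: Sequence[str], rows: List[Sequence[str]]) -> str:
    """Serialize a simple table (no spanning cells) as a Markdown pipe table."""

    def fmt(cells: Sequence[str]) -> str:
        # Escape literal pipes so cell text cannot break the column layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


print(table_to_markdown(["Item", "Rate"], [["Service A", "18%"], ["Service B", "5%"]]))
```

A flattened rendering such as "Item Rate Service A 18% Service B 5%" loses exactly the alignment that table-centric queries depend on, which is one plausible reading of why layout-aware extraction helps in the studies cited above.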

Docling is widely integrated with frameworks such as LangChain, LlamaIndex, and spaCy, with built-in adapters for API, CLI, and microservice deployment (Livathinos et al., 27 Jan 2025, Auer et al., 2024).

5. Evaluation, Benchmarks, and Empirical Performance

Performance evaluations span speed, accuracy, and downstream pipeline efficacy:

| Component | mAP@[.5:.95] | Cell F1 (tables) | CPU (s/page) | GPU (ms/page) | Highlights |
| --- | --- | --- | --- | --- | --- |
| DocLayNet (heron) | ~0.78 | — | — | 28–44 | Fast layout, >23 pp over v1 |
| TableFormer | — | ~0.96 | 1.7 | 400 | Lines/tabular/nested cases |
| Full Docling | — | — | 0.79–3.1 | 114–490 | Mac, x86, L4 GPU |
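
Because the table reports CPU cost in seconds per page and GPU cost in milliseconds per page, comparing rows requires a common unit. The helpers below are a hypothetical convenience for converting either figure to pages per second; they encode arithmetic only and make no claim about how the benchmarks were measured.

```python
def ms_to_seconds(milliseconds: float) -> float:
    """Convert a latency in milliseconds to seconds."""
    return milliseconds / 1000.0


def pages_per_second(seconds_per_page: float) -> float:
    """Convert a per-page latency in seconds to throughput in pages per second."""
    if seconds_per_page <= 0:
        raise ValueError("seconds_per_page must be positive")
    return 1.0 / seconds_per_page


# Example with two figures from the table: 0.79 s/page (CPU) and 28 ms/page (GPU).
print(round(pages_per_second(0.79), 2))
print(round(pages_per_second(ms_to_seconds(28)), 1))
```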

Scalability: Docling converts 1.3–2.4 pages/s on modern CPUs (M3 Max) and needs up to 480 ms/page on an Nvidia L4 GPU. Models operate within 7 GB of RAM, making the toolkit suitable for local, cloud, and large-batch pipelines (Auer et al., 2024, Livathinos et al., 27 Jan 2025).
Downstream QA Impact: RAG-QA pipelines with Docling preprocessing achieve up to 94.1% accuracy with enriched chunking, close to human-curated text (Santos et al., 30 Mar 2026, Kocbek et al., 18 Dec 2025).
Layout Model Upgrades: Incorporating RT-DETRv2 and DFINE architectures, mAP improved from 54.1% (old-docling) to 78% (heron-101), with no sacrifice in throughput (28 ms/image on A100) (Livathinos et al., 15 Sep 2025).

Benchmark datasets include DocLayNet, PubTabNet, PubLayNet, FUNSD, and domain-specific sets for biomedical and regulatory texts.
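
The mAP@[.5:.95] figures above rest on intersection-over-union (IoU) between predicted and ground-truth boxes, evaluated at ten thresholds from 0.50 to 0.95 in steps of 0.05 (the COCO convention). The sketch below computes IoU for axis-aligned boxes and shows how a single detection fares across those thresholds. It is not a full mAP implementation, which additionally needs per-class precision-recall curves over ranked detections; `matched_fraction` is a hypothetical helper for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) with x0 <= x1, y0 <= y1

# The ten IoU thresholds 0.50, 0.55, ..., 0.95 (rounded to avoid float drift).
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matched_fraction(iou_value: float) -> float:
    """Fraction of the ten thresholds at which this IoU counts as a match."""
    return sum(iou_value >= t for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)


predicted = (0.0, 0.0, 2.0, 2.0)
ground_truth = (1.0, 0.0, 3.0, 2.0)
print(iou(predicted, ground_truth), matched_fraction(0.72))
```

Averaging over the stricter thresholds is why mAP@[.5:.95] rewards tight box localization and is typically much lower than mAP at IoU 0.5 alone.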

6. Extensibility, Licensing, and Integrations

Docling emphasizes modular extensibility (Auer et al., 2024, Livathinos et al., 27 Jan 2025):

Custom Models: The pipeline can load arbitrary detection, classification, or NLP models via subclassing pipeline steps or registering via Python entry points. Examples include hybrid LLM post-correction, advanced figure classifiers, or domain-specific extractors.
Open-Source License: The MIT license permits commercial and academic use, including redistribution, modification, and embedding in proprietary workflows, provided that the copyright and license notice are retained in copies of the software.
API, CLI, and Language Bindings: Docling can be invoked via a Python API, a command-line interface (the docling command), or as a microservice. Out-of-the-box integrations support large agentic and data-processing frameworks (Ray, Dask, Bee Agent, Data-Prep-Kit), as well as annotation environments via Pydantic models and REST APIs.
Community and Benchmarking: The project has seen broad adoption and provides extensive documentation, along with publicly released trained checkpoints and conversion scripts (notably the Hugging Face ds4sd/models space) (Livathinos et al., 15 Sep 2025, Auer et al., 2024).
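
For the Python API route, a minimal conversion helper might look like the sketch below. It assumes a recent Docling release in which `DocumentConverter` is importable from `docling.document_converter` and the conversion result exposes `document.export_to_markdown()`; the wrapper name `convert_to_markdown` is hypothetical. The import is placed inside the function so that defining the helper does not require Docling to be installed.

```python
def convert_to_markdown(source: str) -> str:
    """Convert a local path or URL to Markdown using Docling's default pipeline."""
    # Lazy import: Docling (and its model downloads) is only needed when called.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()        # default backends and models
    result = converter.convert(source)     # runs parsing, models, and assembly
    return result.document.export_to_markdown()
```

Calling `convert_to_markdown("report.pdf")` (with a file of your own) should return the document as a Markdown string. Pipeline options, OCR engines, and other export formats are configured on the converter and are left out of this sketch.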

7. Research Directions and Specialized Variants

Several orthogonal research and system extensions leverage or complement Docling:

End-to-End Vision-Language Alternatives: SmolDocling, a compact 256M-parameter VLM, processes entire pages end-to-end, emitting DocTags markup (content + normalized bounding boxes) and achieving competitive accuracy (e.g., F1 = 0.80 text, F1 = 0.95 equations) relative to models 10–27× larger. SmolDocling’s purely autoregressive, holistic strategy contrasts with Docling’s modular pipeline and achieves high throughput (0.35 s/page on A100) (Nassar et al., 14 Mar 2025).
Hybrid Validation Frameworks: RaV-IDP builds on Docling extraction, introducing a reconstruction-as-validation loop that measures fidelity label-free, invoking GPT-4.1 vision fallback when fidelity is below calibrated thresholds, improving error recovery and delivering robust QA accuracy (ANLS = 0.4224 vs. 0.3910 for top open-source alternatives) (Jha, 26 Apr 2026).
General-purpose Annotation Backends: The LAB project (distinct from Docling) demonstrates the use of NLP-driven APIs for transcription and glossing in linguistics, facilitating model adaptation and integration into professional annotation software (Neubig et al., 2018).
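
The reconstruction-as-validation loop described for RaV-IDP above can be sketched in a few lines. This is an illustration of the control flow only: the fidelity measure here is a character-level similarity from Python's `difflib`, swapped in as a stand-in for whatever measure RaV-IDP uses, and the function names, the block-list extractor interface, and the 0.9 threshold are all assumptions.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# Hypothetical extractor interface: reference text in, list of text blocks out.
Extractor = Callable[[str], List[str]]


def fidelity(reference: str, blocks: List[str]) -> float:
    """Rebuild text from extracted blocks and score its similarity to the reference."""
    reconstructed = "\n".join(blocks)
    return SequenceMatcher(None, reference, reconstructed).ratio()


def extract_with_validation(
    reference: str,
    primary: Extractor,
    fallback: Extractor,
    threshold: float = 0.9,  # assumed value; a real system would calibrate it
) -> Tuple[List[str], str, float]:
    """Use the primary extractor unless its reconstruction falls below the threshold."""
    blocks = primary(reference)
    score = fidelity(reference, blocks)
    if score >= threshold:
        return blocks, "primary", score
    # Low fidelity: route the page to the (typically costlier) fallback extractor.
    blocks = fallback(reference)
    return blocks, "fallback", fidelity(reference, blocks)


page_text = "Total: 42 EUR\nDue: 1 June"
faithful = lambda text: text.split("\n")  # keeps every line
lossy = lambda text: ["Total"]            # drops most of the content
print(extract_with_validation(page_text, lossy, faithful)[1])
```

No ground-truth labels are involved: the check compares the extraction against the source itself, which is what makes this style of validation label-free.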

Taken together, these works suggest that Docling’s modular pipeline remains a widely used foundation for document understanding, while end-to-end models such as SmolDocling and validation-centric frameworks such as RaV-IDP extend the range of tasks and domains that can be served.

References:

(Livathinos et al., 27 Jan 2025) Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion
(Auer et al., 2024) Docling Technical Report
(Livathinos et al., 15 Sep 2025) Advanced Layout Analysis Models for Docling
(Kocbek et al., 18 Dec 2025) Exploration of Augmentation Strategies in Multi-modal Retrieval-Augmented Generation for the Biomedical Domain: A Case Study Evaluating Question Answering in Glycobiology
(Santos et al., 30 Mar 2026) From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering
(Yashwant et al., 17 Oct 2025) Invoice Information Extraction: Methods and Performance Evaluation
(Gupta et al., 30 Apr 2026) Retrieval-Augmented Reasoning for Chartered Accountancy
(Jha, 26 Apr 2026) RaV-IDP: A Reconstruction-as-Validation Framework for Faithful Intelligent Document Processing
(Nassar et al., 14 Mar 2025) SmolDocling: An ultra-compact vision-LLM for end-to-end multi-modal document conversion
(Neubig et al., 2018) Towards a General-Purpose Linguistic Annotation Backend