DocLayout-YOLO: Real-time Layout Analysis

Updated 18 March 2026
  • The paper introduces DocLayout-YOLO, a framework that integrates synthetic pre-training with a novel multi-scale GL-CRM module to achieve both high accuracy and real-time inference.
  • It leverages the extensive DocSynth-300K synthetic dataset to simulate diverse document layouts, yielding measurable mAP improvements across various benchmarks.
  • Evaluations demonstrate that DocLayout-YOLO narrows the accuracy gap with multimodal approaches while maintaining over 80 FPS, which suits mobile scanning and high-volume archiving.

DocLayout-YOLO is a document layout analysis framework designed for real-time inference with high accuracy across a broad range of document types. It combines document-specific synthetic data generation with a novel multi-scale receptive-field module within the YOLO (You Only Look Once) architecture (Zhao et al., 2024). The system addresses the need for scalable and robust detection of layout elements in complex and heterogeneous documents: it surpasses prior unimodal (visual-only) YOLO variants and significantly narrows the accuracy gap to state-of-the-art multimodal approaches without incurring their computational cost.

1. Motivation and Context

Document layout analysis is foundational for document understanding in applications such as information extraction, conversion, and retrieval from scientific, financial, and commercial documents. Multimodal systems (e.g., LayoutLMv3, DiT-Cascade) set accuracy benchmarks by fusing OCR-generated text with visual representations, attaining AP50 scores near 80% on standard datasets (e.g., D⁴LA), but they are hindered by high inference latency (≤ 15 FPS on A100 GPUs) due to large transformer-based backbones. In contrast, unimodal detectors (YOLOv10, DINO) can reach real-time speeds (YOLOv10 achieves ≈ 144.9 FPS) but have historically lagged in mAP (typically 76–81%) (Zhao et al., 2024).

DocLayout-YOLO is motivated by practical domains that demand both mAP > 70% and throughput beyond 80 FPS, such as mobile scanning, live document ingestion in retrieval-augmented generation (RAG) pipelines, and high-volume archiving. The framework targets three principal challenges: (1) generalization to diverse and non-academic layouts, (2) handling multi-scale variability in layout elements, and (3) recovering the accuracy shortfall relative to multimodal systems while retaining real-time performance.

2. Synthetic Data Generation: DocSynth-300K

DocLayout-YOLO is pre-trained on DocSynth-300K, a synthetic data corpus constructed to reflect the statistical and geometric diversity of real-world documents (Zhao et al., 2024). Document layout synthesis is framed as a two-dimensional bin-packing problem and solved with the Mesh-candidate BestFit algorithm:

  • Each page is modeled as a bin of size $W \times H$.
  • Candidate elements $e_i$ (text blocks, tables, figures) are drawn from an element pool, each with its own size $w_i \times h_i$.
  • The algorithm greedily selects placements that maximize the filled area: at each step, it tests candidate elements against meshgrid cells $g_j$, computes the fill rate $fr_{ij} = \frac{w_i h_i}{w_{g_j} h_{g_j}}$, and places the highest-scoring candidate if $fr_{ij} > \tau$, enforcing non-overlap and boundary constraints.
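
The greedy selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the free "cells" are kept as a list of rectangles that is split guillotine-style after each placement (a stand-in for the paper's meshgrid), and the names `Rect` and `best_fit_layout` as well as the threshold value `tau=0.2` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: float
    y: float
    w: float
    h: float


def best_fit_layout(page_w, page_h, elements, tau=0.2):
    """Greedily place (w, h) elements on a page by maximum fill rate.

    Assumption: free space is tracked as rectangles split guillotine-style,
    which stands in for the meshgrid cells described in the text.
    """
    free = [Rect(0, 0, page_w, page_h)]  # candidate cells g_j
    remaining = list(elements)           # candidate elements e_i
    placed = []
    while remaining and free:
        best = None  # (fill_rate, element_index, cell_index)
        for i, (w, h) in enumerate(remaining):
            for j, g in enumerate(free):
                if w <= g.w and h <= g.h:            # boundary constraint
                    fr = (w * h) / (g.w * g.h)       # fill rate fr_ij
                    if best is None or fr > best[0]:
                        best = (fr, i, j)
        if best is None or best[0] <= tau:           # require fr_ij > tau
            break
        _, i, j = best
        w, h = remaining.pop(i)
        g = free.pop(j)
        placed.append(Rect(g.x, g.y, w, h))
        # Split the leftover space into two disjoint cells (non-overlap).
        if g.w - w > 0:
            free.append(Rect(g.x + w, g.y, g.w - w, h))
        if g.h - h > 0:
            free.append(Rect(g.x, g.y + h, g.w, g.h - h))
    return placed


layout = best_fit_layout(100, 100, [(100, 40), (50, 60), (50, 50), (30, 30)])
for rect in layout:
    print(rect)
```

In this toy run the 30×30 element is left unplaced because it does not fit the last remaining free cell; a higher `tau` would stop the packing even earlier.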

Element pools are sourced from M⁶Doc, which comprises ≈2.8K pages and 74 element categories. Underrepresented types (< 100 samples) are augmented using flips, contrast and brightness jitter, random cropping, edge extraction, elastic transformations, and Gaussian noise. This process ensures both element-level and layout-level diversity and yields 300,000 annotated, high-fidelity synthetic document images. Generation quality is quantified with the LayoutGAN++ metrics: the BestFit method yields an alignment score of 0.0009 and a density of 0.645, where lower alignment and higher density indicate greater realism.

Pre-training on DocSynth-300K (image longer side 1600 px, batch size 128, SGD optimizer at lr = 0.02) followed by fine-tuning on real downstream datasets produces measurable accuracy gains. For instance, on DocStructBench-Academic, pre-training with DocSynth yields 82.1% mAP versus 81.0% for PubLayNet and 81.6% for DocBank. Consistent improvements of +2.6 to +3.0 mAP are observed on D⁴LA and DocLayNet relative to prior sources (Zhao et al., 2024).

3. Model Architecture: Global-to-Local Controllable Receptive Module

DocLayout-YOLO builds on the YOLOv10 (v10m) anchor-free architecture, introducing the Global-to-Local Controllable Receptive Module (GL-CRM) as a replacement for standard CSP bottlenecks (Zhao et al., 2024). The pipeline retains the YOLO backbone, FPN/PAFPN neck, and decoupled detection head for regressing box offsets and class scores.

The Controllable Receptive Module (CRM) operates as follows:

  • For a feature map $X \in \mathbb{R}^{C \times H \times W}$, with kernel size $k$ and dilations $d_1, \dots, d_n$, convolutions at each dilation produce outputs:

$F_i = \operatorname{GELU}(\operatorname{BN}(\operatorname{Conv}(X; w, \text{dilation}=d_i))), \quad i = 1, \dots, n$

  • The outputs $F_1, \dots, F_n$ are concatenated into $\hat{F}$; a gating mask $M$ is produced using a grouped 1×1 convolution and sigmoid activation.
  • The CRM output is computed by adaptively combining gated features:

$X_{\text{CRM}} = X + \operatorname{GELU}\bigl(\operatorname{BN}(\operatorname{Conv}_{\text{out}}(M \odot \hat{F}))\bigr)$

  • GL-CRM modules are deployed at three scales: global-level (shallow layers, kernel=5, dilations=1,2,3), block-level (intermediate, kernel=3, dilations=1,2,3), and local-level (deep, standard bottleneck with no dilation). This hierarchy encodes page-wide context alongside local detail, improving detection of both large and fine-grained elements.
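
To make the effect of the kernel and dilation settings concrete, the short sketch below computes the span covered by a single dilated convolution using the standard relation k_eff = k + (k − 1)(d − 1). It only illustrates receptive-field growth for the global-level and block-level settings listed above; it does not implement the gating or the residual path, and the helper name `effective_kernel_size` is hypothetical.

```python
def effective_kernel_size(kernel: int, dilation: int) -> int:
    """Span (in pixels) covered by one dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


# Kernel size and dilations per level, as stated in the text.
LEVELS = {
    "global": (5, (1, 2, 3)),
    "block": (3, (1, 2, 3)),
}

spans = {
    name: [effective_kernel_size(k, d) for d in dilations]
    for name, (k, dilations) in LEVELS.items()
}
print(spans)  # {'global': [5, 9, 13], 'block': [3, 5, 7]}
```

A single global-level branch therefore spans up to 13 pixels of its input feature map, against 7 for the block level, which is consistent with assigning the larger kernel to the level intended to capture page-wide context.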

4. Training, Loss Functions, and Hyperparameters

DocLayout-YOLO utilizes standard YOLO training losses with modifications for the CRM modules. The complete loss is a weighted sum:

$L = \lambda_{\mathrm{bbox}} L_{\mathrm{bbox}} + \lambda_{\mathrm{cls}} L_{\mathrm{cls}} + \lambda_{\mathrm{obj}} L_{\mathrm{obj}} + \lambda_{\mathrm{iou}} L_{\mathrm{iou}}$

where

  • $L_{\mathrm{bbox}} = \sum_i \|\hat{b}_i - b_i\|_1$ (bounding-box regression)
  • $L_{\mathrm{cls}} = -\sum_{i, c} y_{i, c} \log p_{i, c}$ (cross-entropy classification)
  • $L_{\mathrm{obj}}$ and $L_{\mathrm{iou}}$ are objectness and intersection-over-union terms, unmodified from YOLOv10
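
As a worked illustration of the weighted sum, the following sketch evaluates the L1 box term and the cross-entropy term on a single toy prediction and combines them with the other two terms. The λ weights and the objectness and IoU values are placeholders chosen for the example (the source does not state them), and the function names are hypothetical.

```python
import math


def l1_box_loss(pred_boxes, true_boxes):
    """Sum over boxes of the L1 distance between predicted and true coordinates."""
    return sum(
        abs(p - t)
        for pred, true in zip(pred_boxes, true_boxes)
        for p, t in zip(pred, true)
    )


def cross_entropy(probs, one_hot, eps=1e-12):
    """-sum_{i,c} y_ic * log(p_ic), with clamping to avoid log(0)."""
    return -sum(
        y * math.log(max(p, eps))
        for p_row, y_row in zip(probs, one_hot)
        for p, y in zip(p_row, y_row)
    )


def total_loss(terms, weights):
    """Weighted sum of the named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())


terms = {
    "bbox": l1_box_loss([(0.5, 0.5, 0.2, 0.2)], [(0.5, 0.4, 0.2, 0.3)]),
    "cls": cross_entropy([(0.8, 0.2)], [(1, 0)]),
    "obj": 0.1,  # placeholder value, not from the source
    "iou": 0.3,  # placeholder value, not from the source
}
weights = {"bbox": 1.0, "cls": 1.0, "obj": 1.0, "iou": 1.0}  # assumed weights
print(round(total_loss(terms, weights), 4))
```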

Pre-training is performed for 30 epochs on DocSynth-300K. Fine-tuning hyperparameters are dataset-specific, with the image longer side between 1120 and 1600 px, learning rates between 0.02 and 0.04, and batch sizes of 16–32. Training is distributed over 8× A100 GPUs and uses standard YOLO data augmentations.

5. Evaluation and Comparative Benchmarking

DocLayout-YOLO’s efficacy is validated on the DocStructBench benchmark, which encompasses 9,310 training and 2,645 test images across four domains: Academic, Textbook, Market Analysis, and Financial. Ten object categories are annotated, including table, figure, and formula variations (Zhao et al., 2024).

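Quantitative results indicate that DocLayout-YOLO attains the highest mAP on all three benchmarks and the highest AP50 on D⁴LA and DocStructBench (on DocLayNet, DINO-4scale is marginally ahead in AP50, 93.5 vs. 93.4). At 85.5 FPS it is slower than the YOLOv10 baseline (144.9 FPS) but substantially faster than the DINO, LayoutLMv3, and DiT-Cascade models: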

Dataset          Method             mAP    AP50   FPS
D⁴LA             LayoutLMv3-B       60.0   72.6   9
D⁴LA             DiT-Cascade-L      68.2   80.1   6
D⁴LA             DINO-4scale-R50    64.7   76.9   26.7
D⁴LA             YOLOv10 (v10m)     68.6   80.7   144.9
D⁴LA             DocLayout-YOLO     70.3   82.4   85.5
DocLayNet        DINO-4scale        77.7   93.5   26.7
DocLayNet        YOLOv10            76.2   93.0   144.9
DocLayNet        DocLayout-YOLO     79.7   93.4   85.5
DocStructBench   DiT-Cascade-L      70.8   80.8   6
DocStructBench   YOLOv10            80.5   95.0   144.9
DocStructBench   DINO-4scale        80.5   95.4   26.7
DocStructBench   LayoutLMv3-B       76.5   94.9   9
DocStructBench   DocLayout-YOLO     81.8   95.8   85.5
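
The trade-off against the YOLOv10 baseline can be read directly off the table. The snippet below is a small sketch that copies the relevant rows and derives the mAP gain per dataset and the per-image latency implied by each FPS figure (1000 / FPS, in milliseconds); the dictionary layout and helper names are illustrative only.

```python
# (mAP, AP50, FPS) copied from the results table above.
RESULTS = {
    "D4LA": {"YOLOv10": (68.6, 80.7, 144.9), "DocLayout-YOLO": (70.3, 82.4, 85.5)},
    "DocLayNet": {"YOLOv10": (76.2, 93.0, 144.9), "DocLayout-YOLO": (79.7, 93.4, 85.5)},
    "DocStructBench": {"YOLOv10": (80.5, 95.0, 144.9), "DocLayout-YOLO": (81.8, 95.8, 85.5)},
}


def map_gain(dataset: str) -> float:
    """mAP of DocLayout-YOLO minus mAP of YOLOv10 on one dataset."""
    rows = RESULTS[dataset]
    return rows["DocLayout-YOLO"][0] - rows["YOLOv10"][0]


def latency_ms(fps: float) -> float:
    """Average time per image implied by a throughput figure."""
    return 1000.0 / fps


for name in RESULTS:
    print(f"{name}: +{map_gain(name):.1f} mAP over YOLOv10")
print(f"DocLayout-YOLO: {latency_ms(85.5):.1f} ms/image, "
      f"YOLOv10: {latency_ms(144.9):.1f} ms/image")
```

On these numbers, the accuracy gain of 1.3 to 3.5 mAP costs roughly 5 ms of additional latency per image.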

Ablation studies demonstrate additive improvements: GL-CRM and DocSynth-300K pre-training each provide ≈+1.0–1.7 mAP individually, and GL-CRM placement at multiple scales outperforms single-scale alternatives. Synthetic pre-training based on random or diffusion-derived layouts yields lower generalization than DocSynth-300K, especially in non-academic domains.

6. Discussion, Strengths, Limitations, and Future Work

DocLayout-YOLO advances the speed–accuracy trade-off frontier in document layout analysis by coupling a purpose-built synthetic pre-training corpus with an efficient, adaptive receptive module. This enables real-time operation (> 80 FPS) and mAP > 70% across diverse, heterogeneous document styles (Zhao et al., 2024).

However, all annotations are restricted to axis-aligned bounding boxes; highly skewed, rotated, or three-dimensional layouts are not addressed. The synthetic element pipeline is dependent on real-world design pools and may not represent fully novel graphical compositions.

Planned directions include expanding synthetic augmentation to encompass rotated/skewed elements, integrating weak text supervision for OCR-informed detection, and investigating transformer-free bin-packing solvers to dynamically augment training batches. Incorporation of transformer-based attention remains an avenue to bridge the remaining gap to high-latency, high-accuracy multimodal systems.

Earlier document layout detection systems based on YOLOv5 repurpose the architecture with minimal modifications, adopting standard detection heads and document object taxonomies (e.g., seven classes: Title, Text, Image, Caption, Image_caption, Table, Table_caption) (Sugiharto et al., 2023). These models report high precision (0.911), recall (0.971), and F1-score (0.939), but only on small, curated in-house datasets (e.g., 153 page examples). Data augmentation relies on conventional YOLO methods, and training employs standard multi-term losses.
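
As a quick sanity check on the figures quoted above, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall; the result agrees with the reported 0.939 to within about 0.001, which is the kind of difference expected when the inputs are themselves rounded to three decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.911, 0.971)
print(f"{f1:.3f}")
```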

A plausible implication is that the progression from a baseline YOLOv5S to the more tailored DocLayout-YOLO architecture, which adds synthetic pre-training and CRM modules, yields improvements in both robustness and domain generalization, particularly for complex or non-standard document layouts.

DocLayout-YOLO, by unifying highly diverse synthetic data with scalable, adaptive detection modules, situates itself as a reference architecture at the intersection of speed, usability, and accuracy for modern document understanding tasks (Zhao et al., 2024).
