DocOwl 1.5: OCR-Free Multimodal Document Understanding
- DocOwl 1.5 is an OCR-free Multimodal LLM that jointly learns to read text and parse document structure directly from image pixels, with no external OCR stage.
- It introduces a novel H-Reducer module and a unified structure learning curriculum to efficiently merge spatial layout cues and enable detailed reasoning.
- Benchmark evaluations show state-of-the-art performance across ten tasks, outperforming previous models by over 10 absolute points on key datasets.
DocOwl 1.5 is an OCR-free Multimodal LLM (MLLM) for document understanding, with technical innovations focused on unified structure learning and efficient processing of text-rich images. The framework achieves state-of-the-art accuracy on ten established document and scene-text benchmarks by combining a spatially sensitive vision-to-text adapter, large-scale structure-aware pre-training, and instruction tuning for detailed, step-by-step explanations. Developed by the X-PLUG group, DocOwl 1.5 addresses the limitations of prior MLLMs that struggled with the structural regularities of documents, tables, and web layouts, aiming for robust reading and reasoning in an end-to-end, OCR-free setting (Hu et al., 19 Mar 2024).
1. Motivation: Unified Structure in OCR-Free Document Understanding
Contemporary MLLMs such as MiniGPT-4, InstructBLIP, and mPLUG-Owl2 perform strongly in image captioning but underperform on documents, tables, and charts due to neglect of document structure—row/column layout, spatially ordered text blocks, and regular reading order. Traditional document understanding systems incorporate explicit OCR and post-hoc alignment modules, leading to brittle, multi-step pipelines. DocOwl 1.5 directly targets these deficiencies by positing structure as a central learning signal: the goal is an MLLM that reads and organizes text jointly, obviating intermediate OCR stages and manual mapping (Hu et al., 19 Mar 2024).
2. Architecture and Key Modules
DocOwl 1.5 preserves the standard “vision encoder → vision-to-text module → LLM decoder” pipeline, introducing two principal advancements:
- H-Reducer Vision-to-Text Module: This adapter compresses visual patch features while retaining layout signals. A 1×4 convolution horizontally merges adjacent ViT patches within each crop, exploiting the left-to-right organization of text lines and reducing sequence length.
- Unified Structure Learning Curriculum: The model is pre-trained via multi-domain, structure-aware parsing and multi-grained text localization, teaching not only text reading but also structural segmentation and reasoning.
Vision Input Handling and Cropping
Each input image is resized to a global 448×448 view and additionally split into up to nine 448×448 local crops. All views are encoded by a ViT-L/14 vision encoder. Each crop's output, a sequence of patch features, is fed into H-Reducer for horizontal merging and then projected into the LLM embedding space.
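To make the sequence-length savings concrete, the following back-of-the-envelope sketch computes the visual token budget, assuming 448×448 views and the ViT-L/14 patch size of 14 (so 32×32 = 1,024 patches per view); the shape-adaptive crop-selection logic itself is not reproduced here.

```python
PATCH = 14      # ViT-L/14 patch size
VIEW = 448      # side length of the global view and of each local crop
MAX_CROPS = 9   # up to nine local crops in addition to the global view

patches_per_side = VIEW // PATCH           # 32
patches_per_view = patches_per_side ** 2   # 1,024 visual tokens per 448x448 view

views = 1 + MAX_CROPS                            # global view + local crops
tokens_before_merge = views * patches_per_view   # 10,240 tokens into H-Reducer
tokens_after_merge = tokens_before_merge // 4    # 2,560 tokens after 1x4 merging

print(patches_per_view, tokens_before_merge, tokens_after_merge)
```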
H-Reducer: Horizontal Merging Mechanism
Let $V_i \in \mathbb{R}^{H \times W \times D}$ denote the ViT patch features of crop $i$, arranged on their $H \times W$ spatial grid. H-Reducer performs:

$$\bar{V}_i = \mathrm{Conv}_{1\times 4}(V_i),$$

where $\mathrm{Conv}_{1\times 4}$ is a 1×4 convolution with stride 4 along the horizontal axis, yielding merged features $\bar{V}_i \in \mathbb{R}^{H \times (W/4) \times D}$. This compressed sequence is then linearly projected to match the LLM's embedding dimension. The LLM input concatenates fixed-position tokens (e.g., `<row2_col1>`), the processed features, and user instructions. This architecture enables efficient handling of high-resolution documents without sacrificing spatial order (Hu et al., 19 Mar 2024).
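The following is a minimal PyTorch sketch of this merging-and-projection step, assuming illustrative dimensions (vit_dim=1024 for ViT-L, llm_dim=4096, and a 32×32 patch grid per 448×448 crop); the released implementation may organize the convolution and projection differently.

```python
import torch
import torch.nn as nn

class HReducer(nn.Module):
    """Merge every 4 horizontally adjacent ViT patches, then project to the LLM width."""

    def __init__(self, vit_dim=1024, llm_dim=4096, merge=4):
        super().__init__()
        # 1x4 convolution with stride 4: fuses 4 horizontally adjacent patch features.
        self.conv = nn.Conv2d(vit_dim, vit_dim, kernel_size=(1, merge), stride=(1, merge))
        # Linear projection into the LLM embedding space.
        self.proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, patches, grid=(32, 32)):
        # patches: (batch, H*W, vit_dim) ViT outputs for one crop, in row-major order
        b, n, d = patches.shape
        h, w = grid
        x = patches.transpose(1, 2).reshape(b, d, h, w)  # (b, d, H, W)
        x = self.conv(x)                                 # (b, d, H, W/4)
        x = x.flatten(2).transpose(1, 2)                 # (b, H*W/4, d)
        return self.proj(x)                              # (b, H*W/4, llm_dim)

# One crop of 32x32 = 1,024 patch features is compressed to 256 LLM tokens.
feats = torch.randn(1, 1024, 1024)
print(HReducer()(feats).shape)  # torch.Size([1, 256, 4096])
```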
Comparison to mPLUG-DocOwl
The original mPLUG-DocOwl used a “Visual Abstractor”—a learnable set of cross-attending tokens—to connect ViT outputs with a frozen Vicuna/LLaMA LLM. LoRA adapters enable targeted fine-tuning within the LLM, while only abstractor and adapter parameters are updated during document-specific training (Ye et al., 2023).
3. Unified Structure Learning and Datasets
To infuse knowledge of structure, DocOwl 1.5 employs a multi-domain curriculum built upon two task classes:
- Structure-Aware Parsing: Tasks include
  - Document/Web Parsing: Generating plain text with `\n` and spaces to reflect document layout.
  - Table Parsing: Markdown output with explicit column and row span signals (`<COLSPAN=x>`, `<ROWSPAN=y>`).
  - Chart Parsing: Translating visual charts into structured tables.
  - Natural Image Parsing: Creating fused descriptions of image content and detected scene text.
- Multi-Grained Text Localization: At word, phrase, line, and block granularity, the model learns grounding (identify bounding boxes for given text spans, evaluated via IoU@0.5) and recognition (predict the text inside a given bounding box, measured by BLEU); a small sketch of the grounding metric follows this list.
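As a reference point for the grounding metric, here is a minimal sketch of IoU@0.5 accuracy over axis-aligned boxes in (x1, y1, x2, y2) format; the exact box normalization and matching protocol used for DocOwl 1.5 evaluation may differ.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)   # intersection rectangle
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(pred_boxes, gold_boxes, threshold=0.5):
    """Fraction of predictions whose IoU with the gold box reaches the threshold (IoU@0.5)."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(pred_boxes, gold_boxes))
    return hits / len(gold_boxes)

# Example: one hit (IoU ~0.68) and one miss (no overlap) -> 0.5
print(grounding_accuracy([(0, 0, 10, 10), (0, 0, 2, 2)],
                         [(1, 1, 11, 11), (5, 5, 9, 9)]))
```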
The training corpus, DocStruct4M, comprises approximately 4 million samples, drawn from public datasets including CCpdf, RVL-CDIP, VisualMRC, DUE-Benchmark, TURL, PubTabNet, PlotQA, FigureQA, DVQA, ChartQA, and OCR-CC. Text localization pairs are sourced from all main document and chart VQA benchmarks (Hu et al., 19 Mar 2024).
Additionally, the 25,877-example DocReason25K dataset is curated from human and GPT-annotated QA pairs, emphasizing concise final answers alongside detailed, stepwise reasoning, with average answer lengths approaching 90 tokens per sample.
4. Training Procedures and Regimes
Training proceeds in two explicit stages:
| Stage | Parameters Updated | Data Used | Iterations / Batch Size |
|---|---|---|---|
| Unified Structure Learning (Stage 1) | Vision encoder, H-Reducer, MAM (Modality Adaptive Module) | DocStruct4M (~4M samples) | 12,000 / 1,024 |
| Multi-task Fine-tuning (Stage 2) | H-Reducer, MAM, LLM | All downstream tasks + DocReason25K | 6,500 / 256 |
Each stage freezes part of the backbone to preserve capabilities acquired in pre-training: the LLM body remains frozen during structure learning, and the vision encoder is frozen during multi-task fine-tuning. Ablation studies indicate that two-stage training is more sample-efficient and yields higher accuracy than mixed one-stage training, especially when scaling from 0.5M to 4M structure samples (Hu et al., 19 Mar 2024).
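As a compact summary of the schedule, the sketch below mirrors the table as a configuration object; the field names are illustrative rather than taken from the released training scripts, and learning rates are omitted because they are not reproduced here.

```python
# Illustrative two-stage schedule mirroring the table above; names are hypothetical.
STAGES = [
    {
        "name": "unified_structure_learning",
        "trainable": ["vision_encoder", "h_reducer", "mam"],  # LLM body stays frozen
        "data": "DocStruct4M (~4M structure-aware samples)",
        "iterations": 12_000,
        "batch_size": 1_024,
    },
    {
        "name": "multi_task_finetuning",
        "trainable": ["h_reducer", "mam", "llm"],             # vision encoder frozen
        "data": "downstream task mixture + DocReason25K",
        "iterations": 6_500,
        "batch_size": 256,
    },
]

for stage in STAGES:
    print(f"{stage['name']}: {stage['iterations']} iterations at batch size {stage['batch_size']}")
```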
5. Quantitative Performance and Benchmarking
DocOwl 1.5 achieves state-of-the-art accuracy among OCR-free MLLMs with fewer than 10B parameters across ten representative tasks:
| Model | Params | DocVQA | InfoVQA | DeepForm | KLC | WTQ | TabFact | ChartQA | TextVQA | TextCaps | VisualMRC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| UReader | 7.1B | 65.4 | 42.2 | 49.5 | 32.8 | 29.4 | 67.6 | 59.3 | 57.6 | 118.4 | 221.7 |
| CogAgent | 17.3B | 81.6 | 44.5 | – | – | – | – | 68.4 | 76.1 | – | – |
| DocOwl 1.5 | 8.1B | 81.6 | 50.4 | 68.8 | 37.9 | 39.8 | 80.4 | 70.5 | 68.8 | 132.0 | 239.5 |
On five benchmarks (DocVQA, DeepForm, WTQ, TabFact, ChartQA), DocOwl 1.5 exceeds the strongest prior 7B model by more than 10 absolute points. Performance is also strong for key information extraction, chart reasoning, and scene-text recognition, establishing a new standard for structure-sensitive, OCR-free document understanding (Hu et al., 19 Mar 2024).
6. Design Choices and Ablation Analyses
Ablation studies provide empirical support for each major component:
- H-Reducer: On DocVQA, 1×4 merging outperforms both wider (1×8) and block-based (2×2) fusions as well as the original “Abstractor” from mPLUG-Owl2, confirming alignment with Western-style horizontal text layout.
- Structure Learning Curriculum: Adding structure parsing boosts DocVQA from 72.8 to 77.7, and further adding multi-grained grounding lifts scores to 81.6. Including explicit text tokens for crop location yields modest further gains.
- Training Regime: Two-stage (pretrain, then multi-task fine-tuning) is superior to joint one-stage sampling for both accuracy and GPU efficiency.
A plausible implication is that spatially-aware merging, in conjunction with deep structure-aware pretraining, is more effective than deeper but spatially-agnostic cross-attention modules found in earlier modularized designs (Hu et al., 19 Mar 2024, Ye et al., 2023).
7. Significance, Limitations, and Research Directions
DocOwl 1.5 demonstrates that unified, end-to-end learning of reading and structure can match or exceed systems that rely on OCR or manual pipeline integration. The new data resources DocStruct4M and DocReason25K, along with the H-Reducer module, establish a framework for future OCR-free research. However, limitations remain in handling layouts not well represented by horizontal text merging (e.g., vertical scripts or forms with complex spatial relations), and further work may be required for robust handling of non-standard and noisy layouts.
These advancements illuminate a path for the development of universal document understanding models that are both scalable and adaptable to new domains, provided appropriate structure and localization signals are available at scale (Hu et al., 19 Mar 2024, Ye et al., 2023).