PaddleOCR-VL: Multilingual Document Parsing

Updated 17 October 2025

PaddleOCR-VL is a compact vision–language model designed for multilingual document parsing with detailed element-level recognition across text, tables, formulas, and charts.
It employs a two-stage pipeline that decouples layout analysis from content recognition, using a NaViT-style dynamic resolution encoder and a lightweight ERNIE language model.
The model achieves competitive benchmark results with low resource consumption through native resolution processing, asynchronous execution, and efficient inference strategies.

PaddleOCR-VL is a compact vision–language model (VLM) and framework for multilingual document parsing and element-level recognition. Its core is PaddleOCR-VL-0.9B, which integrates a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model. The system supports 109 languages and parses complex elements, including text, tables, formulas, and charts. It is reported to be competitive with leading VLMs while keeping resource consumption low and inference fast. PaddleOCR-VL decouples layout analysis from content recognition and optimizes each component for practical deployment (Cui et al., 16 Oct 2025).

1. Model Architecture and Design

PaddleOCR-VL operates as a two-stage pipeline:

Stage 1: Layout analysis via PP-DocLayoutV2, which detects and classifies document elements, predicting their reading order. This separation from recognition minimizes cumulative errors and enables task-specific optimization.
Stage 2: Element recognition with PaddleOCR-VL-0.9B, which comprises:
- NaViT-style dynamic resolution visual encoder: Accepts native-resolution images, avoiding tiling and preserving detail essential for dense or variable input layouts. The encoder is initialized from a Keye-VL–style model and extracts features for all document modalities.
- MLP projector with GELU activation (merge size=2): Bridges the visual feature space to the LLM dimension; enhances positional representations (using 3D-RoPE).
- ERNIE-4.5-0.3B LLM: Provides structured textual output via autoregressive decoding, maintaining spatial and sequential integrity.

In Stage 1, the pointer network used for layout analysis produces an $N \times N$ pairwise order matrix via bilinear similarity, $LCP(i, j) = (W_q f_i)^T (W_k f_j)$, where $f_i$ is the feature of element $i$ and $W_q$, $W_k$ are learned projection matrices. This matrix feeds a deterministic win-accumulation decoder that outputs the reading order.
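
The pairwise order matrix and win-accumulation decoding described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the decoding rule (element `i` "wins" against `j` when `score[i][j] > score[j][i]`, and elements are sorted by win count) is an assumption about what win accumulation means, and the function names and toy matrices are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def matvec(w: Matrix, x: Vector) -> List[float]:
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(w_rc * x_c for w_rc, x_c in zip(row, x)) for row in w]


def pairwise_order_scores(feats: Sequence[Vector], w_q: Matrix, w_k: Matrix) -> List[List[float]]:
    """score[i][j] = (W_q f_i)^T (W_k f_j) for every ordered pair (i, j)."""
    queries = [matvec(w_q, f) for f in feats]
    keys = [matvec(w_k, f) for f in feats]
    return [[sum(a * b for a, b in zip(q, k)) for k in keys] for q in queries]


def decode_reading_order(scores: Sequence[Sequence[float]]) -> List[int]:
    """Assumed rule: i beats j when score[i][j] > score[j][i]; most wins reads first."""
    n = len(scores)
    wins = [
        sum(1 for j in range(n) if j != i and scores[i][j] > scores[j][i])
        for i in range(n)
    ]
    # Ties fall back to the original index so the output is deterministic.
    return sorted(range(n), key=lambda i: (-wins[i], i))


# Toy features [position, 1]; these hand-picked matrices make
# score[i][j] = position_j - position_i, so earlier elements win more often.
feats = [[3.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
w_q = [[0.0, 1.0], [-1.0, 0.0]]
w_k = [[1.0, 0.0], [0.0, 1.0]]
order = decode_reading_order(pairwise_order_scores(feats, w_q, w_k))
print(order)  # [1, 2, 0]
```

Sorting by accumulated wins always yields a total order, even if the pairwise predictions contain cycles, which is one plausible reason to prefer it over following pairwise links greedily.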

This modular architecture directly addresses the challenge of element recognition within complex, variable-resolution documents without incurring distortion.
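
To make the decoupling concrete, the sketch below shows how the output of a layout stage could be handed to a recognition stage. The `Region` type, the `parse_page` function, and the per-label recognizer callables are hypothetical stand-ins; in the described system a single model, PaddleOCR-VL-0.9B, handles all element types, so the dictionary of callables is only a convenient way to stub that model out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    label: str  # element class from layout analysis, e.g. "text" or "table"
    order: int  # reading-order rank predicted in stage 1
    crop: str   # placeholder for the cropped image of this region


def parse_page(
    regions: List[Region],
    recognizers: Dict[str, Callable[[str], str]],
) -> List[str]:
    """Stage 2: recognize each region independently, then emit in reading order."""
    outputs = []
    for region in sorted(regions, key=lambda r: r.order):
        recognize = recognizers[region.label]
        outputs.append(recognize(region.crop))
    return outputs


# Stub recognizers standing in for the recognition model.
stub_recognizers = {
    "text": lambda crop: f"text<{crop}>",
    "table": lambda crop: f"table<{crop}>",
}
page = [Region("table", 1, "img_b"), Region("text", 0, "img_a")]
print(parse_page(page, stub_recognizers))
```

Because each region is recognized on its own, a mistake in one element's transcription does not alter the layout or the reading order of the others.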

2. Multilingual and Multimodal Capabilities

PaddleOCR-VL supports 109 languages across diverse writing systems, covering languages such as Chinese, English, French, and Hindi and scripts such as Cyrillic, Devanagari, and Arabic. Its training corpus integrates:

- Large open-source datasets
- Synthetic data generated with varied fonts, CSS, and rendering techniques
- In-house collections from multiple document domains
- Automated annotation pipelines leveraging expert layout models and advanced VLMs

This enables highly robust parsing of printed, handwritten, vertical, and noisy text. The model also manages mixed-language and mixed-modality documents, handling reading direction and script variation effectively.

3. Performance Metrics and Benchmarking

The following results are reported on public and in-house benchmarks:

- OmniDocBench v1.5: top overall score of 92.56, lowest text edit distance (0.035), formula CDM score of 91.43, and leading table TEDS and table edit distance.
- OmniDocBench v1.0: best performance across the Chinese and English subtasks (reading order, text, formulas).
- Element-level benchmarks (olmOCR-Bench): state-of-the-art CDM scores, e.g., 0.9453 on formula blocks; high TEDS scores for table structure extraction.

PaddleOCR-VL is also reported to be competitive with or ahead of pipeline-based systems and other vision–language models under challenging conditions and on complex layouts.

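| Benchmark | Overall score | Text edit distance | Formula CDM | Table TEDS |
| --- | --- | --- | --- | --- |
| OmniDocBench v1.5 | 92.56 | 0.035 | 91.43 (0-100 scale) | leading (value not given) |
| olmOCR-Bench | not given | not given | 0.9453 (0-1 scale) | not given |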

These results highlight strong generalization and element-level recognition performance.
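
For readers unfamiliar with the text metric, the sketch below computes a normalized edit distance between a predicted and a reference string, where lower is better. Dividing the Levenshtein distance by the length of the longer string is an assumption made here for illustration; the benchmark's exact normalization and text preprocessing may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (assumed normalization)."""
    longest = max(len(pred), len(ref))
    return 0.0 if longest == 0 else levenshtein(pred, ref) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(normalized_edit_distance("PaddleOCR", "PaddleOCR"))  # 0.0
```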

4. Element Recognition: Text, Tables, Formulas, and Charts

PaddleOCR-VL supports element recognition for:

- Text: Printed, handwritten, vertical, mixed-language, and noisy text; semantic reading order preserved.
- Tables: Accurate extraction of table boundaries, cell structures, and relations, which is critical for parsing academic, financial, and structured documents into Markdown/JSON.
- Formulas: Robust parsing of mathematical expressions (printed/handwritten/vertical), conversion to LaTeX, and high character-level CDM precision.
- Charts: Recognition of chart text (label/title) and reconstruction of chart structure into machine-readable formats.

Its capacity for element-level conversion greatly improves structured data extraction for further downstream retrieval, analysis, or archiving.
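
As an illustration of element-level conversion, the sketch below turns an already-recognized table, represented as a list of rows, into Markdown and JSON. The row-list representation and both function names are assumptions for this example; they are not the model's actual output format.

```python
import json
from typing import List


def table_to_markdown(rows: List[List[str]]) -> str:
    """Render a rectangular table (first row is the header) as a Markdown pipe table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def table_to_json(rows: List[List[str]]) -> str:
    """Serialize the same table as a JSON array of header-keyed records."""
    header, *body = rows
    return json.dumps([dict(zip(header, row)) for row in body], ensure_ascii=False)


rows = [["Item", "Price"], ["Pen", "1.20"]]
print(table_to_markdown(rows))
print(table_to_json(rows))
```

This sketch does not escape `|` characters inside cells and does not handle merged cells; real table output with row or column spans would need a richer representation such as HTML.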

5. Resource Efficiency and Scalability

With only 0.9B parameters, PaddleOCR-VL attains resource efficiency via:

- Native resolution processing without tiling, reducing computation
- Lightweight LLM (ERNIE-4.5-0.3B) and minimal-overhead MLP projector
- Asynchronous execution at the pipeline level, enabling parallelism and maximum batch throughput
- Compatibility with efficient inference backends (vLLM, SGLang, FastDeploy) for low-latency GPU deployment
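
The idea behind pipeline-level asynchronous execution can be sketched with a producer-consumer queue: the layout stage keeps feeding pages while the recognition stage drains them in batches. This sketch uses `asyncio` purely for illustration; the actual system may rely on threads or processes instead, and the stage functions, batch size, and queue size here are all hypothetical.

```python
import asyncio
from typing import List, Optional


async def layout_stage(pages: List[str], queue: "asyncio.Queue[Optional[str]]") -> None:
    """Producer: run (stubbed) layout analysis and hand pages to the next stage."""
    for page in pages:
        await asyncio.sleep(0)  # stand-in for layout inference
        await queue.put(page)
    await queue.put(None)  # sentinel: no more pages


async def recognition_stage(
    queue: "asyncio.Queue[Optional[str]]", results: List[str], batch_size: int
) -> None:
    """Consumer: collect pages into batches and run (stubbed) recognition on each batch."""
    batch: List[str] = []
    while True:
        item = await queue.get()
        if item is not None:
            batch.append(item)
        # Flush when the batch is full or when the producer has finished.
        if batch and (item is None or len(batch) == batch_size):
            await asyncio.sleep(0)  # stand-in for one batched model call
            results.extend(f"parsed:{page}" for page in batch)
            batch = []
        if item is None:
            return


async def run_pipeline(pages: List[str], batch_size: int = 2) -> List[str]:
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue(maxsize=4)
    results: List[str] = []
    await asyncio.gather(
        layout_stage(pages, queue),
        recognition_stage(queue, results, batch_size),
    )
    return results


print(asyncio.run(run_pipeline(["p1", "p2", "p3"])))
```

The bounded queue provides back-pressure: if recognition falls behind, the layout stage waits instead of accumulating unbounded work in memory.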

Performance measurements confirm high throughput and low GPU memory consumption versus heavier VLMs.

6. Deployment and Real-World Applications

PaddleOCR-VL is designed for practical deployment:

- Automated conversion of PDFs and scanned documents to structured formats (Markdown, JSON)
- Backbone for Retrieval-Augmented Generation (RAG) in intelligent systems, enhancing downstream NLP applications
- Robust parsing of complex documents, including handwritten, mixed-language, and chart-rich content
- Scalable for production on hardware ranging from NVIDIA A100 to RTX 3060/4090D; supports batch processing and low-latency serving
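
For the RAG use case, parsed Markdown typically has to be split into retrievable chunks before indexing. The sketch below splits a Markdown string at headings; it is one simple strategy among many, the `chunk_markdown` name and the chunk dictionary layout are hypothetical, and it is not part of PaddleOCR-VL itself.

```python
from typing import Dict, List


def chunk_markdown(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into one chunk per heading, keeping the heading as the title."""
    chunks: List[Dict[str, str]] = []
    title, body = "", []

    def flush() -> None:
        # Skip the empty chunk that would otherwise precede the first heading.
        if title or any(line.strip() for line in body):
            chunks.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()
            title, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Intro\nHello.\n\n## Results\n| a | b |\n"
for chunk in chunk_markdown(doc):
    print(chunk)
```

A line that starts with `#` inside a fenced code block would be mistaken for a heading here, so production chunkers usually track fence state or use a real Markdown parser.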

Case studies document its successful deployment in environments where prior pipeline-based approaches fail due to error propagation or latency accumulation.

7. Comparative Analysis and Impact

PaddleOCR-VL is reported to outperform pipeline-based solutions and to be competitive with top-tier VLMs in both accuracy and speed. Its NaViT-style dynamic-resolution encoder sets it apart from other compact models by removing the need for the fixed-grid or tiling preprocessing common in earlier systems.
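
A rough calculation shows why native-resolution encoding matters for documents with extreme aspect ratios. The sketch below estimates the number of visual tokens passed to the language model and the aspect-ratio stretch that a square resize would introduce. The patch size of 14 pixels is an assumption that the source does not state, the merge size of 2 is taken from the projector description and interpreted here as a 2x2 spatial merge, and rounding up at the borders is also an assumption.

```python
import math


def visual_tokens(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    """Tokens after patchifying at native resolution and merging merge x merge patches."""
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    return math.ceil(rows / merge) * math.ceil(cols / merge)


def square_resize_stretch(height: int, width: int) -> float:
    """Factor by which the aspect ratio changes if the image is squashed to a square."""
    return max(height, width) / min(height, width)


# A tall 1400 x 700 receipt-like crop.
print(visual_tokens(1400, 700))          # 1250
print(square_resize_stretch(1400, 700))  # 2.0
```

Under these assumptions the token count scales with image area, so small crops stay cheap, while a square resize of the same crop would stretch glyphs by a factor of two along one axis.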

The model’s compact size and efficiency position it as a preferred choice for multimodal document parsing, offering a viable alternative to high-parameter VLMs without compromising element-level recognition fidelity.

PaddleOCR-VL constitutes a rigorous, efficient, and highly multilingual framework for document parsing, integrating advanced visual encoding and compact language modeling. Its architecture supports detailed extraction of text, tables, formulas, and charts, and achieves leading benchmark results in real-world, high-throughput settings (Cui et al., 16 Oct 2025).

References

Cui et al. (2025). PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model.
