PaddleOCR-VL: Multilingual Document Parsing
- PaddleOCR-VL is a compact vision–language model designed for multilingual document parsing with detailed element-level recognition across text, tables, formulas, and charts.
- It employs a two-stage pipeline that decouples layout analysis from content recognition, using a NaViT-style dynamic resolution encoder and a lightweight ERNIE language model.
- The model achieves competitive benchmark results with low resource consumption through native resolution processing, asynchronous execution, and efficient inference strategies.
PaddleOCR-VL is a state-of-the-art, ultra-compact vision–language model (VLM) and framework for multilingual document parsing and element-level recognition. Its centerpiece is PaddleOCR-VL-0.9B, which integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model. It supports 109 languages and parses complex elements (text, tables, formulas, and charts), remaining competitive with leading VLMs while keeping resource consumption low and inference fast. PaddleOCR-VL advances document understanding by decoupling layout analysis from content recognition and optimizing each component for practical deployment in real-world scenarios (Cui et al., 16 Oct 2025).
1. Model Architecture and Design
PaddleOCR-VL operates as a two-stage pipeline:
- Stage 1: Layout analysis via PP-DocLayoutV2, which detects and classifies document elements, predicting their reading order. This separation from recognition minimizes cumulative errors and enables task-specific optimization.
- Stage 2: Element recognition with PaddleOCR-VL-0.9B, which comprises:
  - NaViT-style dynamic resolution visual encoder: Accepts native-resolution images, avoiding tiling and preserving the detail needed for dense or variable layouts. The encoder is initialized from a Keye-VL–style model and extracts features for all document modalities.
  - MLP projector with GELU activation (merge size = 2): Bridges the visual feature space to the language model's embedding dimension; positional representations are enhanced with 3D-RoPE.
  - ERNIE-4.5-0.3B language model: Produces structured textual output via autoregressive decoding, maintaining spatial and sequential integrity.
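- The pointer network used in layout analysis (Stage 1) produces an N×N pairwise order matrix for N detected elements, computed by bilinear similarity between element representations. This matrix feeds a deterministic win-accumulation decoder that outputs the reading order.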
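The bilinear scoring and win-accumulation decoding can be sketched as follows. This is a minimal illustration, not the PP-DocLayoutV2 implementation: the function names, the weight matrix `w`, the form `s[i][j] = q_i^T W k_j`, and the convention that `s[i][j] > s[j][i]` means element `i` precedes element `j` are all assumptions.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def bilinear_scores(queries: List[Vector], keys: List[Vector], w: Matrix) -> Matrix:
    """Pairwise scores s[i][j] = q_i^T W k_j (assumed form of the bilinear similarity)."""
    def score(q: Vector, k: Vector) -> float:
        return sum(q[a] * w[a][b] * k[b] for a in range(len(q)) for b in range(len(k)))
    return [[score(q, k) for k in keys] for q in queries]


def decode_reading_order(scores: Matrix) -> List[int]:
    """Win accumulation: element i earns a win over j when s[i][j] > s[j][i].

    Elements are sorted by descending win count; ties keep their input order,
    which keeps the decoding deterministic.
    """
    n = len(scores)
    wins = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and scores[i][j] > scores[j][i]:
                wins[i] += 1
    return sorted(range(n), key=lambda i: -wins[i])
```

For three elements whose pairwise scores favour element 2 over both others and element 0 over element 1, `decode_reading_order` returns `[2, 0, 1]`.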
This modular architecture directly addresses the challenge of element recognition within complex, variable-resolution documents without incurring distortion.
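To illustrate how native-resolution encoding combined with a merge size of 2 affects sequence length, the sketch below counts the visual tokens passed to the language model for a given image size. The patch size of 14 pixels, the round-up rule for partial cells, and the function name are assumptions made for illustration; the source does not state them.

```python
import math


def visual_token_count(width: int, height: int, patch: int = 14, merge: int = 2) -> int:
    """Tokens reaching the language model after merging merge x merge neighbouring patches.

    `patch=14` is an assumed patch size; `merge=2` follows the merge size named above.
    Each image side is rounded up to a whole number of merged cells.
    """
    cell = patch * merge  # pixels covered by one merged token along each side
    return math.ceil(width / cell) * math.ceil(height / cell)
```

Under these assumptions, a 560×280 crop yields 200 tokens, a quarter of the 800 patches the encoder processes before merging, and the aspect ratio of the crop is preserved rather than being forced onto a fixed grid.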
2. Multilingual and Multimodal Capabilities
PaddleOCR-VL supports 109 languages across diverse writing systems, including languages such as Chinese, English, French, Hindi, and Arabic and scripts such as Cyrillic and Devanagari. Its training corpus integrates:
- Large open-source datasets
- Synthetic data generated with varied fonts, CSS, and rendering techniques
- In-house collections from multiple document domains
- Automated annotation pipelines leveraging expert layout models and advanced VLMs
This enables highly robust parsing of printed, handwritten, vertical, and noisy text. The model also manages mixed-language and mixed-modality documents, handling reading direction and script variation effectively.
3. Performance Metrics and Benchmarking
PaddleOCR-VL achieves outstanding results on public and internal benchmarks:
- OmniDocBench v1.5: Top overall score of 92.56, lowest text edit distance (0.035), formula CDM score of 91.43, and leading table TEDS/edit distance.
- OmniDocBench v1.0: Best performance across the Chinese and English subtasks (reading order, text, formulas).
- Element-level benchmarks (olmOCR-Bench): State-of-the-art CDM scores, e.g., 0.9453 on formula blocks; high TEDS scores for table structure extraction.
Additionally, PaddleOCR-VL shows competitive or leading accuracy under challenging conditions and on complex layouts when compared with pipeline-based systems and other vision–language models.
| Benchmark | Overall Score | Text Edit Distance | Formula CDM | Table TEDS |
|---|---|---|---|---|
| OmniDocBench v1.5 | 92.56 | 0.035 | 91.43 | High |
| olmOCR-Bench | n/a | n/a | 0.9453 | n/a |
These results highlight strong generalization and element-level recognition performance.
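A text edit distance of this kind is commonly computed as a Levenshtein distance normalized by string length, so 0 means an exact match. The sketch below assumes normalization by the longer of the two strings; the exact normalization and text preprocessing used by each benchmark may differ, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance divided by the longer string's length (assumed normalization)."""
    longest = max(len(pred), len(ref))
    return 0.0 if longest == 0 else levenshtein(pred, ref) / longest
```

For instance, a prediction of `"abcd"` against a reference of `"abcf"` has one substitution over four characters, giving 0.25.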
4. Element Recognition: Text, Tables, Formulas, and Charts
PaddleOCR-VL supports element recognition for:
- Text: Printed, handwritten, vertical, mixed-language, and noisy text; semantic reading order preserved.
- Tables: Accurate extraction of table boundaries, cell structures, and relations—critical for parsing academic, financial, and structured documents into Markdown/JSON.
- Formulas: Robust parsing of mathematical expressions (printed/handwritten/vertical), conversion to LaTeX, and high character-level CDM precision.
- Charts: Recognition of chart text (label/title) and reconstruction of chart structure into machine-readable formats.
Its capacity for element-level conversion greatly improves structured data extraction for further downstream retrieval, analysis, or archiving.
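As a rough illustration of element-level conversion, the sketch below assembles recognized elements into Markdown and JSON in reading order. The `Element` record, its field names, the kind labels, and the assumption that tables arrive as Markdown and formulas as LaTeX are hypothetical; they do not describe PaddleOCR-VL's actual output schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class Element:
    kind: str      # "text", "table", "formula", or "chart" (hypothetical labels)
    order: int     # reading-order index assigned by layout analysis
    content: str   # recognized content for this element


def to_markdown(elements: List[Element]) -> str:
    """Join elements in reading order; formulas are wrapped in $$ ... $$."""
    parts = []
    for el in sorted(elements, key=lambda e: e.order):
        if el.kind == "formula":
            parts.append(f"$$\n{el.content}\n$$")
        else:
            parts.append(el.content)
    return "\n\n".join(parts)


def to_json(elements: List[Element]) -> str:
    """Serialize the same elements, sorted by reading order, as a JSON array."""
    ordered = sorted(elements, key=lambda e: e.order)
    return json.dumps([asdict(e) for e in ordered], ensure_ascii=False)
```

Keeping the element type alongside the content lets downstream retrieval or analysis code treat tables and formulas differently from running text.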
5. Resource Efficiency and Scalability
With only 0.9B parameters, PaddleOCR-VL attains resource efficiency via:
- Native resolution processing without tiling, reducing computation
- Lightweight LLM (ERNIE-4.5-0.3B) and minimal-overhead MLP projector
- Asynchronous execution at the pipeline level, enabling parallelism and maximum batch throughput
- Compatibility with efficient inference backends (vLLM, SGLang, FastDeploy) for low-latency GPU deployment
Performance measurements confirm high throughput and low GPU memory consumption versus heavier VLMs.
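Pipeline-level asynchrony can be sketched with a bounded queue between the two stages, so that layout analysis of later pages overlaps with recognition of earlier ones. Everything here is a hypothetical stand-in: `detect_layout` and `recognize_batch` are placeholder callables rather than PaddleOCR APIs, and the queue size and batching rule are assumptions, since the source does not specify the actual scheduling.

```python
import queue
import threading
from typing import Callable, List

_DONE = object()  # sentinel marking the end of the region stream


def run_pipeline(
    pages: List[str],
    detect_layout: Callable[[str], List[str]],
    recognize_batch: Callable[[List[str]], List[str]],
    batch_size: int = 4,
) -> List[str]:
    """Overlap stage 1 (layout) and stage 2 (recognition) using two threads."""
    regions: queue.Queue = queue.Queue(maxsize=64)
    results: List[str] = []

    def stage1() -> None:
        # Producer: emit regions page by page, then signal completion.
        for page in pages:
            for region in detect_layout(page):
                regions.put(region)
        regions.put(_DONE)

    def stage2() -> None:
        # Consumer: group regions into batches and recognize each batch.
        batch: List[str] = []
        while True:
            item = regions.get()
            if item is not _DONE:
                batch.append(item)
            if batch and (item is _DONE or len(batch) == batch_size):
                results.extend(recognize_batch(batch))
                batch = []
            if item is _DONE:
                return

    workers = [threading.Thread(target=stage1), threading.Thread(target=stage2)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results
```

Batches may span page boundaries, which keeps the recognizer busy even when individual pages contain few regions. The sketch omits error handling; a production version would need to propagate exceptions raised in either thread.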
6. Deployment and Real-World Applications
PaddleOCR-VL is designed for practical deployment:
- Automated conversion of PDFs and scanned documents to structured formats (Markdown, JSON)
- Backbone for Retrieval-Augmented Generation (RAG) in intelligent systems, enhancing downstream NLP applications
- Robust parsing of complex documents, including handwritten, mixed-language, and chart-rich content
- Scalable for production on hardware ranging from NVIDIA A100 to RTX 3060/4090D; supports batch processing and low-latency serving
Case studies document its successful deployment in environments where prior pipeline-based approaches fail due to error propagation or latency accumulation.
7. Comparative Analysis and Impact
PaddleOCR-VL outperforms pipeline-based solutions and is competitive with top-tier VLMs in both accuracy and speed. Its dynamic NaViT-style encoder sets it apart among compact models, removing the need for the fixed-grid or tiling preprocessing prevalent in legacy systems.
The model’s compact size and efficiency position it as a preferred choice for multimodal document parsing, offering a viable alternative to high-parameter VLMs without compromising element-level recognition fidelity.
PaddleOCR-VL constitutes a rigorous, efficient, and highly multilingual framework for document parsing, integrating advanced visual encoding and compact language modeling. Its architecture supports detailed extraction of text, tables, formulas, and charts, and achieves leading benchmark results in real-world, high-throughput settings (Cui et al., 16 Oct 2025).