Qianfan-OCR: Unified Vision-Language Model

Updated 19 March 2026
  • Qianfan-OCR is a unified vision-language model integrating OCR, document parsing, layout analysis, and advanced document understanding within a single 4B-parameter architecture.
  • It employs a modular design featuring a high-resolution Vision Transformer, a cross-modal MLP adapter, and a Qwen3-4B language backbone to achieve state-of-the-art results on diverse benchmarks.
  • The novel Layout-as-Thought mechanism enables explicit retrieval of spatial document structure, enhancing tasks such as table extraction and complex layout analysis.

Qianfan-OCR is a 4-billion-parameter end-to-end vision-language model (VLM) that unifies optical character recognition (OCR), document parsing, explicit layout analysis, and advanced document understanding within a single neural architecture. Developed by Baidu and released in 2026, it supports direct image-to-Markdown conversion and a wide spectrum of prompt-driven document tasks such as table extraction, chart analysis, document-level question answering, and key information extraction (KIE). Qianfan-OCR achieves state-of-the-art or near state-of-the-art performance across multiple structured OCR, document intelligence, and multi-language benchmarks, while introducing novel architectural and training methodologies, notably the Layout-as-Thought mechanism for explicit layout representation (Dong et al., 11 Mar 2026).

1. Unified Model Architecture and Training Pipeline

Qianfan-OCR features a modular, unified VLM pipeline comprising a high-resolution Vision Transformer (Qianfan-ViT), a cross-modal MLP adapter, and a Qwen3-4B language modeling backbone. The full model contains 4B parameters, of which 3.6B are non-embedding parameters.

  • Vision Encoder (Qianfan-ViT): Inputs of arbitrary resolution are partitioned into up to 16 tiles of 448×448 pixels. Each tile is processed by a 24-layer Vision Transformer (hidden size 1024, 16 heads) with patch size 14×14 (yielding 256 tokens per tile). This AnyResolution tiling enables practical handling of full document pages.
  • Cross-Modal Adapter: A two-layer MLP with GELU activation (mapping 1024-dim visual features to 2560-dim language token space). Initially trained for alignment, it is subsequently fine-tuned end-to-end with the rest of the system.
  • LLM Backbone (Qwen3-4B): 36 transformer layers (hidden size 2560) with Grouped-Query Attention (GQA: 32 query heads, 8 key-value heads) and RMSNorm, supporting input contexts of up to 32K tokens (extendable to 131K).
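
The tile and token figures above imply a simple visual-token budget, worked through in the rough sketch below. The 2×2 token merge that takes the 1024 patches of a tile down to 256 tokens is an assumption made here to reconcile the stated numbers; the text does not describe the reduction step. The function and constant names are illustrative only.

```python
# Visual token budget implied by the stated configuration.
TILE = 448            # tile side in pixels (from the text)
PATCH = 14            # patch side in pixels (from the text)
MAX_TILES = 16        # maximum number of tiles per image (from the text)
MERGE = 2             # ASSUMPTION: 2x2 token merging, chosen to match 256 tokens/tile
CONTEXT = 32 * 1024   # 32K-token context window (from the text)


def visual_tokens(num_tiles: int) -> int:
    """Return the number of visual tokens produced for `num_tiles` tiles."""
    if not 1 <= num_tiles <= MAX_TILES:
        raise ValueError(f"num_tiles must be in [1, {MAX_TILES}]")
    patches_per_side = TILE // PATCH                  # 448 / 14 = 32
    patches_per_tile = patches_per_side ** 2          # 32 * 32 = 1024
    tokens_per_tile = patches_per_tile // MERGE ** 2  # 1024 / 4 = 256
    return num_tiles * tokens_per_tile


if __name__ == "__main__":
    full_page = visual_tokens(MAX_TILES)
    print(full_page)            # 4096 visual tokens for a 16-tile page
    print(full_page / CONTEXT)  # 0.125 of the 32K context
```

Under these assumptions a maximally tiled page consumes one eighth of the 32K context, leaving the rest for the prompt, the optional layout block, and the generated output.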

Multi-stage Curriculum:

Training proceeds in four distinct stages, each with targeted objectives:

  1. Cross-Modal Alignment: Adapter-only training on generic image-caption pairs and simple OCR instances (50B tokens, LR 1e-3).
  2. Foundational OCR: Full model training over diverse OCR data (2T tokens)—document OCR (45%), scene OCR (25%), captions, handwriting, formulas, tables, and multilingual scenarios.
  3. Domain-Specific Enhancement: Full model is further trained (800B tokens) on complex tables, formula recognition, chart understanding, KIE, multilingual and complex document data; 30% of batches remain general VLM tasks to prevent catastrophic forgetting.
  4. Instruction Tuning: The model is tuned on millions of prompt-engineered tasks, spanning parsing, table extraction, QA, layout analysis, and KIE, across multi-page and multi-lingual instances.
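
The stage-3 rule that 30% of batches remain general VLM tasks can be pictured as a per-batch source draw. The sketch below is one possible reading, not the authors' data loader: the source names `general` and `domain` and the function `batch_sources` are hypothetical, and independent per-batch sampling is an assumption (a fixed interleaving pattern would satisfy the same ratio).

```python
import random

GENERAL_FRACTION = 0.30  # from the text: share of general VLM batches in stage 3


def batch_sources(num_batches: int, seed: int = 0) -> list[str]:
    """Label each batch as 'general' or 'domain' (hypothetical source names)."""
    rng = random.Random(seed)  # seeded so the plan is reproducible
    return [
        "general" if rng.random() < GENERAL_FRACTION else "domain"
        for _ in range(num_batches)
    ]


if __name__ == "__main__":
    plan = batch_sources(10_000)
    # The observed share approaches 0.30 as the number of batches grows.
    print(plan.count("general") / len(plan))
```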

AdamW is used throughout (β₁=0.9, β₂=0.95, weight decay 0.05, cosine LR schedule), with batch sizes and learning rates specified for each stage.
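
For reference, a cosine learning-rate schedule of the kind named above can be written in a few lines. This is a generic sketch: the floor `min_lr` and the absence of warmup are assumptions, since the text gives only the schedule type and the stage-1 peak rate of 1e-3.

```python
import math

# Optimizer settings quoted in the text, collected as plain data.
ADAMW = {"beta1": 0.9, "beta2": 0.95, "weight_decay": 0.05}


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from `peak_lr` at step 0 to `min_lr` at `total_steps`."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that out-of-range steps return the boundary values.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, cosine_lr(step, total_steps=1000, peak_lr=1e-3))
```

The rate starts at the peak, passes through half the peak at the midpoint, and reaches the floor at the final step.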

Ablation of these stages on the sibling Qianfan-VL-8B model indicates that each contributes additively to final OCR accuracy (Dong et al., 11 Mar 2026, Dong et al., 19 Sep 2025).

2. Layout-as-Thought Mechanism

End-to-end OCR models typically lose explicit access to document layout (e.g., bounding boxes, reading order, structural tags) that multi-stage pipeline systems expose. Qianfan-OCR introduces the Layout-as-Thought mechanism, which enables two-phase response generation when the special prompt token <think> is present.

  • Operational Flow: Upon the <think> trigger, the model first emits a block-structured layout description:
    <think>
      <box>[[x₁,y₁,x₂,y₂]]</box>
      <label>paragraph_title</label>
      <brief>Introduction</brief>
      ...
    </think>
    Here, coordinates are normalized to [0, 999] and encoded via dedicated <COORD_*> tokens. After layout emission, the model produces the downstream output (OCR, structured Markdown, or QA result).
  • Technical Realization: The process can be schematized as:
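    # Schematic only: vision_encoder, adapter, tokenize, lm, and EOS stand in
    # for the model's components and its end-of-response marker.
    def generate_with_thought(image, prompt):
        vis = vision_encoder(image)
        tokens = adapter(vis) + tokenize(prompt)
        # Phase 1: decode the layout block up to the closing tag.
        thought = lm.generate(tokens, until="</think>")
        # Phase 2: condition on the layout block and decode the final answer.
        out = lm.generate(tokens + tokenize(thought), until=EOS)
        return thought, out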
  • Impact: This mechanism restores bounding box, type, and reading order information on demand, which is crucial for document post-processing and downstream tasks that require spatial grounding. On OmniDocBench v1.5, Layout-as-Thought increases table extraction metrics (Table TEDS +0.19, Table TEDS-S +0.18) and yields substantial gains on high-entropy (complex) layouts while imposing mild overhead or slight performance regression on simple homogeneous pages.
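
A consumer of the two-phase output has to separate the layout block from the final answer and map the normalized coordinates back to pixels. The sketch below shows one way to do that. It assumes the decoded text looks like the example above, with integer coordinates inside `<box>[[x1,y1,x2,y2]]</box>`, and that 0 and 999 map to the two edges of the page; both points are assumptions, and `parse_layout` is a hypothetical helper rather than part of any published API.

```python
import re

# ASSUMPTION: after detokenization each layout entry matches the example format.
ENTRY_RE = re.compile(
    r"<box>\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]</box>\s*"
    r"<label>(.*?)</label>\s*"
    r"<brief>(.*?)</brief>",
    re.DOTALL,
)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def parse_layout(response, width, height):
    """Return (layout entries with pixel boxes, final answer text)."""
    match = THINK_RE.search(response)
    if match is None:
        # No layout phase was requested: the whole response is the answer.
        return [], response.strip()
    entries = []
    for x1, y1, x2, y2, label, brief in ENTRY_RE.findall(match.group(1)):
        # ASSUMPTION: 0 maps to the left/top edge, 999 to the right/bottom edge.
        box = (
            int(x1) * width / 999,
            int(y1) * height / 999,
            int(x2) * width / 999,
            int(y2) * height / 999,
        )
        entries.append({"box": box, "label": label.strip(), "brief": brief.strip()})
    return entries, response[match.end():].strip()


if __name__ == "__main__":
    demo = (
        "<think>\n"
        "  <box>[[0,0,999,999]]</box>\n"
        "  <label>paragraph_title</label>\n"
        "  <brief>Introduction</brief>\n"
        "</think>\n"
        "# Introduction"
    )
    print(parse_layout(demo, width=1000, height=2000))
```

Entries are returned in the order they appear in the layout block, so the list order can serve as the reading order.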

3. Prompt-Driven End-to-End OCR and Structured Output

Qianfan-OCR supports fully prompt-conditioned operation, enabling seamless mapping from document image to highly structured output formats, notably Markdown.

  • Workflow: A user supplies a document image and a natural-language or instruction-style prompt (e.g., "Convert to Markdown," "Extract tables," "What date is listed in the contract?"). The model then emits Markdown, table, or JSON output in a single pass, or in two phases (layout block first, then the answer) when <think> is used.
  • Output Capabilities: The system natively emits:
    • Markdown with headings, paragraphs, fenced LaTeX (formula), Markdown/HTML tables, and image placeholders.
    • Structured table extractions, e.g.:
      | Column A | Column B | Column C |
      |----------|----------|----------|
      |   1.23   |   foo    |    bar   |
    • Natural language responses for chart or document QA.
    • JSON-encoded KIE results.
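
Downstream code usually needs the emitted table as structured rows rather than text. The following sketch converts a pipe-delimited Markdown table, such as the one shown above, into a list of dictionaries. It is a minimal illustration with a hypothetical helper name; it does not handle escaped pipes inside cells or HTML tables.

```python
def parse_markdown_table(text: str) -> list[dict[str, str]]:
    """Turn a pipe-delimited Markdown table into a list of row dicts."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the |---|---| separator row (only dashes and colons).
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


if __name__ == "__main__":
    table = """
    | Column A | Column B | Column C |
    |----------|----------|----------|
    |   1.23   |   foo    |    bar   |
    """
    print(parse_markdown_table(table))
    # [{'Column A': '1.23', 'Column B': 'foo', 'Column C': 'bar'}]
```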

This enables a unified solution for a variety of enterprise and research document intelligence workloads, removing the need for separate OCR, parsing, and reasoning stages.

4. Quantitative Benchmark Performance

Qianfan-OCR is evaluated on standard OCR and document understanding benchmarks, both as an end-to-end system and compared to multi-stage pipelines and large generalist VLMs.

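| Model | OmniDocBench v1.5 Overall | OCRBench | OlmOCR Bench | CCOCR Overall | DocVQA | ChartQA | KIE Mean |
|-------|---------------------------|----------|--------------|---------------|--------|---------|----------|
| Qianfan-OCR | 93.12 | 880 | 79.8 | 79.3 | 92.8 | 88.1 | 87.9 |
| Qwen3-VL-4B | n/a | 873 | 79.2 | 76.5 | 94.9 | 83.3 | 83.5 |
| PaddleOCR-VL 1.5 | 94.50 | n/a | 80.0 | n/a | n/a | n/a | n/a |
| Gemini-3.1-Pro | n/a | n/a | n/a | n/a | n/a | n/a | 79.2 |
| Seed-2.0 | n/a | n/a | n/a | n/a | n/a | n/a | 78.0 |
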
  • On OmniDocBench v1.5, Qianfan-OCR leads among end-to-end models (93.12) and is competitive with specialist pipelines (Dong et al., 11 Mar 2026).
  • On OlmOCR Bench, Qianfan-OCR outperforms Qwen3-VL-4B (79.8 vs. 79.2).
  • For general OCR (OCRBench/CCOCR), Qianfan-OCR shows the highest published numbers among VLMs at comparable scale.
  • On document understanding (DocVQA, ChartQA, CharXiv), two-stage pipelines collapse, particularly on spatially complex QA, while Qianfan-OCR maintains robust performance.
  • For KIE, Qianfan-OCR achieves the highest mean score across public benchmarks, exceeding Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B.

Throughput: With W8A8 quantization, Qianfan-OCR processes 1.024 pages/sec on an NVIDIA A100, on par with traditional pipelines. At native precision it runs at roughly half the speed of those pipelines.
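
To put the throughput figure in operational terms, the short calculation below converts pages per second into pages per hour and into wall-clock time for a batch job. It assumes a single GPU running at the quoted rate with no queuing or I/O overhead, which is an idealization; the helper names are illustrative.

```python
PAGES_PER_SEC = 1.024  # from the text: W8A8-quantized model on one NVIDIA A100


def pages_per_hour(pages_per_sec: float = PAGES_PER_SEC) -> float:
    """Sustained pages per hour at a constant rate."""
    return pages_per_sec * 3600.0


def hours_for(pages: int, pages_per_sec: float = PAGES_PER_SEC) -> float:
    """Idealized wall-clock hours to process `pages` on one device."""
    if pages_per_sec <= 0:
        raise ValueError("pages_per_sec must be positive")
    return pages / pages_per_sec / 3600.0


if __name__ == "__main__":
    print(round(pages_per_hour(), 1))        # about 3686.4 pages per hour
    print(round(hours_for(1_000_000), 1))    # about 271.3 hours for one million pages
```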

5. Experimental Analysis and Ablations

  • Training Stage Ablation: On the Qianfan-VL-8B model, each stage of the curriculum (alignment, foundational, domain-enhancement, instruction) contributes distinct, additive improvements to OCR accuracy.
  • Layout-as-Thought Ablation: Activating <think> slightly reduces the overall OmniDocBench score (from 93.12 to 92.64), but boosts table metrics and benefits high-entropy page layouts.
  • Failure Modes: Layout-as-Thought adds unnecessary computational cost and may marginally reduce performance on simple, single-column or homogeneous layouts. At present, layout reasoning is rigid (supervised structure), not adaptive to downstream goals. Its extensions to KIE and QA tasks remain unquantified.

6. Context, Innovations, and Extensions

Qianfan-OCR is the first end-to-end 4B-parameter VLM to match or exceed specialist OCR pipelines on structured parsing tasks, with unified support for layout recovery and downstream QA/analytics within a single neural solution. Its design draws on domain-enhancement strategies developed for the larger Qianfan-VL family (Dong et al., 19 Sep 2025), demonstrating the enduring value of staged curriculum learning and high-precision synthetic data for OCR, as well as architectural scalability (from edge to cloud deployments).

Notable architectural elements and potential improvements:

  • The Layout-as-Thought mechanism is a conceptual analog of chain-of-thought prompting in text-based LMs, adapted to layout grounding. The current implementation is deterministic and fixed-form; future directions include reinforcement learning for task-adaptive or free-form structure generation.
  • Further gains are anticipated from model/data scaling, advanced distillation/pruning for lighter variants, and extension to video, 3D text, or highly stylized and cursive scripts.
  • Applications include automated industrial processing (KYC, invoice/contract review), academic publishing (PDF-to-Markdown/LaTeX), grading or exam analysis, and chart analytics.

Qianfan-OCR’s public availability on the Baidu AI Cloud Qianfan platform facilitates wide adoption in research and enterprise. Its demonstrated ability to unify high-performance OCR, layout analysis, and advanced document understanding under a single architecture establishes a new methodological foundation for document AI (Dong et al., 11 Mar 2026, Dong et al., 19 Sep 2025).
