HunyuanOCR: Open-Source Vision-Language OCR

Updated 26 November 2025
  • HunyuanOCR is a commercial-grade, open-source VLM that integrates core OCR functions and vision-language tasks in a unified, end-to-end architecture.
  • It combines a native resolution Vision Transformer and a lightweight language model with an adaptive MLP connector to efficiently process and preserve fine-grained text features.
  • The model employs RL-enhanced, data-driven training and a streamlined inference pipeline, achieving superior benchmark performance and efficient deployment in diverse applications.

HunyuanOCR is a commercial-grade, open-source, and lightweight vision-language model (VLM) dedicated to Optical Character Recognition (OCR) and associated vision-language tasks. At approximately 1 billion parameters, it offers a unified, end-to-end paradigm that integrates core OCR functions—spotting, parsing, information extraction (IE), visual question answering (VQA), and text-image translation—within an efficient architecture. This approach eliminates traditional pipeline dependencies and error propagation, advancing both research and industrial deployment (Team et al., 24 Nov 2025).

1. Model Composition and Architectural Design

HunyuanOCR combines a Native Resolution Vision Transformer backbone (Hunyuan-ViT, 0.4B parameters) and a lightweight LLM (Hunyuan-0.5B, 0.5B parameters), linked via an adaptive MLP connector.

  • Native Vision Transformer Backbone: The encoder is built upon SigLIP-v2-400M and supports arbitrary image resolutions. An image of height $H$ and width $W$ is partitioned into $L$ patches while adaptively preserving its aspect ratio. Each patch $x_i \in \mathbb{R}^{p \times p \times C}$ is linearly embedded:

$$h^{(0)} = [\,x_1 W_p;\ \ldots;\ x_L W_p\,] \in \mathbb{R}^{L \times d}$$

Multi-head self-attention and MLP blocks operate across $N$ layers. At each layer $\ell$, with $H = h^{(\ell-1)}$:

$$Q = H W^Q, \quad K = H W^K, \quad V = H W^V$$

$$\text{Attention}(H) = \text{softmax}\!\left(QK^\top / \sqrt{d}\right) V$$

Native patch attention avoids resizing, preserving fine-grained text features vital for handling long lines and geometrically distorted scans.

  • Adaptive MLP Connector: This bridge performs learnable pooling, reducing the spatial sequence length from $L$ to $L'$ to suit LLM input. For encoder output $h \in \mathbb{R}^{L \times d}$, the adapter applies a two-layer MLP per token:

$$h' = W_2\, \sigma(W_1 h)$$

where $W_1 \in \mathbb{R}^{d \times r}$, $W_2 \in \mathbb{R}^{r \times d}$, $\sigma$ is GeLU, and $r \ll d$.

  • Lightweight LLM: Hunyuan-0.5B utilizes XD-RoPE, decomposing positional embeddings into four subspaces (text, height, width, time), facilitating native alignment for 1D sequences, 2D layouts, and spatiotemporal context.
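The encoder–adapter interface described above can be made concrete in a short PyTorch sketch. The layer sizes (d=1024, r=256, p=14), the average-pooling stand-in for the learnable pooling, and all module names are illustrative assumptions, not the released HunyuanOCR configuration.

```python
# Minimal sketch of the native-resolution encoder and adaptive MLP connector.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchEmbed(nn.Module):
    """Linearly embeds flattened p x p x C patches: h0 = [x_1 W_p; ...; x_L W_p]."""

    def __init__(self, patch_size=14, in_chans=3, dim=1024):
        super().__init__()
        self.proj = nn.Linear(patch_size * patch_size * in_chans, dim)

    def forward(self, patches):            # patches: (L, p*p*C), aspect ratio preserved upstream
        return self.proj(patches)          # (L, d)


class EncoderBlock(nn.Module):
    """One of N layers: multi-head self-attention followed by an MLP."""

    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, h):                  # h: (1, L, d)
        x = self.norm1(h)
        h = h + self.attn(x, x, x, need_weights=False)[0]   # softmax(QK^T / sqrt(d)) V
        return h + self.mlp(self.norm2(h))


class AdaptiveConnector(nn.Module):
    """Two-layer MLP adapter h' = W2 * GeLU(W1 h); pooling shortens L to L'."""

    def __init__(self, dim=1024, bottleneck=256, pool=4):
        super().__init__()
        # Simple average pooling stands in for the learnable pooling of the paper.
        self.pool = nn.AvgPool1d(kernel_size=pool, stride=pool)
        self.w1 = nn.Linear(dim, bottleneck)   # W1 in R^{d x r}
        self.w2 = nn.Linear(bottleneck, dim)   # W2 in R^{r x d}

    def forward(self, h):                  # h: (1, L, d)
        h = self.pool(h.transpose(1, 2)).transpose(1, 2)     # (1, L', d)
        return self.w2(F.gelu(self.w1(h)))                   # fed to the lightweight LLM
```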

2. Data Strategy and Training Procedure

The model is trained with data-driven and reinforcement learning (RL) techniques, emphasizing high-quality, diverse samples and structured instruction.

  • OCR Training Corpus: Pre-training leverages approximately 200M image-text pairs across nine domains (documents, street views, receipts, screenshots, video frames, etc.) and over 130 languages. Sources encompass:
    • Web-crawled images with manual/auto annotations
    • Synthetic element-level data via an extended SynthDog framework (supporting controllable fonts, bidirectional text, warping, noise, etc.)
    • Cross-task annotation reuse
  • Four-Stage Supervised Pre-Training:

| Stage | Description | Token Count |
|-------|-------------|-------------|
| 1 | Vision-language alignment (ViT + adapter only) | 50B |
| 2 | Joint multimodal learning (all parameters) | 300B |
| 3 | Context extension (up to 32k tokens) | 80B |
| 4 | Application-oriented SFT | 24B |

  • RL Post-Training: After supervised pre-training, the model is refined with Group Relative Policy Optimization (GRPO), maximizing

$$L_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim D,\; \{o_i\} \sim \pi_{\theta_{\text{old}}}} \left[ \frac{1}{G} \sum_{i=1}^G \min\!\left( r_i A_i,\ \text{clip}(r_i, 1-\epsilon, 1+\epsilon)\, A_i \right) - \beta\, D_{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right]$$

where $r_i = \pi_\theta(o_i \mid q)/\pi_{\theta_{\text{old}}}(o_i \mid q)$ and $A_i$ is the group-aggregated advantage. Distinct reward structures are applied for spotting, parsing, VQA, and translation.
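The objective above can be computed per prompt group as in the following sketch. The summed sequence log-probabilities, the within-group reward normalization for $A_i$, and the sample-based KL estimate are simplifying assumptions for clarity, not the exact training recipe.

```python
# Illustrative GRPO objective for one prompt q with G sampled outputs o_i.
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """logp_*: (G,) summed log-probs of the G outputs under the current,
    behaviour, and reference policies; rewards: (G,) task-specific rewards."""
    # Group-aggregated advantage: reward normalised within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Probability ratio r_i = pi_theta(o_i|q) / pi_theta_old(o_i|q).
    ratio = torch.exp(logp_new - logp_old)

    # Clipped surrogate, averaged over the G samples.
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()

    # KL penalty against the frozen reference policy (sample-based estimate).
    kl = (logp_new - logp_ref).mean()

    # L_GRPO is maximised; return its negative for a gradient-descent optimiser.
    return -(surrogate - beta * kl)
```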

  • Data Augmentation and Curriculum: Geometric warping replicates folds, perspective distortion, blur, noise, and lighting variation. Question–Answer pairs are auto-generated and verified via cross-model consistency and manual curation. The curriculum transitions from basic alignment to instruction-based SFT.
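A minimal sketch of the kind of geometric warping and degradation described above is given below; the jitter range, blur kernel, and noise level are illustrative parameters, not the paper's augmentation settings.

```python
# Perspective warp plus blur/noise degradation for OCR training images.
import cv2
import numpy as np

def warp_and_degrade(img, max_shift=0.08, blur_ksize=3, noise_sigma=8.0):
    h, w = img.shape[:2]
    # Random perspective: jitter the four corners to mimic folds and camera tilt.
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = (np.random.rand(4, 2) - 0.5) * 2 * max_shift * np.float32([w, h])
    dst = src + jitter.astype(np.float32)
    M = cv2.getPerspectiveTransform(src, dst)
    out = cv2.warpPerspective(img, M, (w, h), borderMode=cv2.BORDER_REPLICATE)

    # Mild blur and additive Gaussian noise to mimic scan/lighting degradation.
    out = cv2.GaussianBlur(out, (blur_ksize, blur_ksize), 0)
    noise = np.random.normal(0, noise_sigma, out.shape)
    return np.clip(out.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```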

3. End-to-End Unified Pipeline

HunyuanOCR operates on a streamlined, single-pass inference paradigm driven by natural language prompts. This approach entirely removes pre-processing modules (layout analysis, segmentation) and post-processing logic. Task instructions direct the model to perform text spotting, parsing, IE, VQA, or translation in arbitrary mixtures.

  • For example: “Detect and recognize text…output <ref>…</ref><quad>(x1,y1),(x2,y2)</quad>” yields bounding boxes and recognized text in one step.

This unified paradigm mitigates error propagation, lowers engineering complexity, and seamlessly supports instruction chaining for multi-step or hybrid tasks. A plausible implication is a reduction in deployment friction and simplified integration in multi-modal workflows.
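Because the model emits structured tags directly, downstream consumption reduces to simple string parsing. The sketch below assumes the <ref>/<quad> format quoted above; the exact tag grammar of the released model may differ.

```python
# Illustrative parser for spotting outputs of the form
# "<ref>text</ref><quad>(x1,y1),(x2,y2)</quad>".
import re

SPOT_RE = re.compile(r"<ref>(.*?)</ref>\s*<quad>(.*?)</quad>", re.S)
COORD_RE = re.compile(r"\((-?\d+)\s*,\s*(-?\d+)\)")

def parse_spotting(output: str):
    """Turn tagged spans into a list of {text, box} dictionaries."""
    results = []
    for text, quad in SPOT_RE.findall(output):
        points = [(int(x), int(y)) for x, y in COORD_RE.findall(quad)]
        results.append({"text": text.strip(), "box": points})
    return results

# Example:
# parse_spotting("<ref>TOTAL $42.00</ref><quad>(12,34),(298,60)</quad>")
# -> [{"text": "TOTAL $42.00", "box": [(12, 34), (298, 60)]}]
```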

4. Benchmark Results and Empirical Performance

HunyuanOCR demonstrates superior performance over a range of commercial solutions, traditional pipelines, and larger VLMs.

| Task | Competitor Results | HunyuanOCR |
|------|--------------------|------------|
| ICDAR 2025 DIMT, Small Model Track (complex layout translation) | — | 1st Place |
| OCRBench COMET Score | Qwen3-VL-235B: 920; Gemini-2.5-Pro: 858 | 860 |
| Text Spotting (900 imgs) | PaddleOCR: 53.38%; BaiduOCR API: 61.9%; Qwen3-VL-235B: 53.62%; Seed-1.6-Vision: 59.23% | 70.92% |
| Cards IE Accuracy | Qwen3-VL-235B: 75.59% | 92.29% |
| Receipts IE Accuracy | Qwen3-VL-235B: 78.40% | 92.53% |
| Video Subtitles IE | Qwen3-VL-235B: 50.74% | 92.87% |
| DoTA Translation (en→zh) | Qwen3-VL-4B: 78.45 | 83.48 |
| DocML Translation (other→zh) | Qwen3-VL-4B: 70.29 | 73.62 |
| DocML Translation (other→en) | Qwen3-VL-4B: 70.38 | 73.38 |

Standard precision, recall, and F₁ definitions apply: $\text{Precision} = \text{TP}/(\text{TP}+\text{FP})$, $\text{Recall} = \text{TP}/(\text{TP}+\text{FN})$, $\text{F}_1 = 2 \cdot \text{Precision} \cdot \text{Recall}/(\text{Precision} + \text{Recall})$.
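As a quick worked example of these metrics (the counts below are hypothetical, not reported evaluation figures):

```python
# Precision, recall, and F1 from true-positive / false-positive / false-negative counts.
def prf1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# e.g. 709 correctly spotted instances, 180 false detections, 111 misses:
# prf1(709, 180, 111) -> (~0.798, ~0.865, ~0.830)
```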

5. Deployment Efficiency and Scalability

The model is delivered as a vLLM-based inference server, achieving top-tier production throughput (tokens/sec) and sub-100 ms per image latency on a single GPU. Its 1B parameter footprint fits within 8–12 GB GPU memory, permitting edge and on-device deployments. In comparison, general VLMs exceeding 2B parameters typically require more than 24 GB and incur higher latency.
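A hedged sketch of offline inference through vLLM is shown below; the model identifier "tencent/HunyuanOCR", the prompt wording, and the assumption that the released checkpoint is supported by vLLM's multimodal input path are all illustrative, so consult the official release for the exact usage.

```python
# Minimal vLLM offline-inference sketch for a prompt-driven OCR request.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="tencent/HunyuanOCR", max_model_len=8192)   # ~1B params on a single GPU
params = SamplingParams(temperature=0.0, max_tokens=1024)

image = Image.open("receipt.jpg")
prompt = "Detect and recognize text, output <ref>...</ref><quad>...</quad>."

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=params,
)
print(outputs[0].outputs[0].text)
```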

This efficiency renders HunyuanOCR suitable for diverse practical OCR services while minimizing infrastructure cost. A plausible implication is expanded applicability in memory-constrained, latency-sensitive environments.

6. Industrial and Research Impact

HunyuanOCR addresses both research and production needs across a wide spectrum of text-centric vision-language tasks:

  • Automated document parsing (invoices, forms, legal texts)
  • End-to-end IE in receipts and cards
  • VQA for manuals and scientific charts
  • Real-time translation of multilingual signage, posters, and academic papers

The open-source release on HuggingFace and GitHub facilitates extension for new languages, domains, and pipeline integrations (e.g., Retrieval-Augmented Generation). This suggests acceleration of research progress and broader industrial adoption.

In summary, HunyuanOCR exemplifies a compact, RL-enhanced, end-to-end VLM with demonstrated superiority over larger and more resource-intensive systems in OCR, parsing, IE, and text-image translation, offering efficient and robust deployment for real-world scenarios (Team et al., 24 Nov 2025).
