HunyuanOCR: Open-Source Vision-Language OCR
- HunyuanOCR is a commercial-grade, open-source VLM that integrates core OCR functions and vision-language tasks in a unified, end-to-end architecture.
- It combines a native resolution Vision Transformer and a lightweight language model with an adaptive MLP connector to efficiently process and preserve fine-grained text features.
- The model employs RL-enhanced, data-driven training and a streamlined inference pipeline, achieving superior benchmark performance and efficient deployment in diverse applications.
HunyuanOCR is a commercial-grade, open-source, and lightweight vision-language model (VLM) dedicated to Optical Character Recognition (OCR) and associated vision-language tasks. At approximately 1 billion parameters, it offers a unified, end-to-end paradigm that integrates core OCR functions—spotting, parsing, information extraction (IE), visual question answering (VQA), and text-image translation—within a single efficient architecture. This approach eliminates traditional pipeline dependencies and the error propagation they introduce, advancing both research and industrial deployment (Team et al., 24 Nov 2025).
1. Model Composition and Architectural Design
HunyuanOCR combines a Native Resolution Vision Transformer backbone (Hunyuan-ViT, 0.4B parameters) and a lightweight LLM (Hunyuan-0.5B, 0.5B parameters), linked via an adaptive MLP connector.
- Native Vision Transformer Backbone: The encoder is built upon SigLIP-v2-400M, supporting arbitrary image resolutions. Images of height $H$ and width $W$ are partitioned into $N$ patches of size $P \times P$, adaptively preserving aspect ratio. Each patch $x_i$ is linearly embedded:
$$z_i^{(0)} = E\,\mathrm{vec}(x_i) + e_i^{\mathrm{pos}}, \qquad i = 1, \dots, N.$$
Multi-head self-attention and MLP blocks operate across $L$ layers. At each layer $\ell$, for $\ell = 1, \dots, L$:
$$\tilde{z}^{(\ell)} = z^{(\ell-1)} + \mathrm{MHSA}\big(\mathrm{LN}(z^{(\ell-1)})\big), \qquad z^{(\ell)} = \tilde{z}^{(\ell)} + \mathrm{MLP}\big(\mathrm{LN}(\tilde{z}^{(\ell)})\big).$$
Native patch attention avoids resizing, preserving fine-grained text features vital for handling long lines and geometrically distorted scans.
- Adaptive MLP Connector: This bridge performs learnable pooling, reducing the spatial sequence of visual tokens to a shorter sequence suitable for LLM input. For each encoder output token $z_i \in \mathbb{R}^{d_v}$, the adapter applies a two-layer MLP:
$$h_i = W_2\,\sigma(W_1 z_i + b_1) + b_2,$$
where $W_1 \in \mathbb{R}^{d_h \times d_v}$, $W_2 \in \mathbb{R}^{d_t \times d_h}$, $\sigma$ is GeLU, and $d_t$ matches the LLM embedding dimension (see the sketch after this list).
- Lightweight LLM: Hunyuan-0.5B utilizes XD-RoPE, decomposing positional embeddings into four subspaces (text, height, width, time), facilitating native alignment for 1D sequences, 2D layouts, and spatiotemporal context.
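The connector referenced above can be illustrated with a brief PyTorch sketch. This is a minimal illustration rather than the released implementation: the pooling factor and dimensions (SigLIP-style 1152-d vision features, a 1024-d LLM embedding width, 2x2 token merging) are assumptions chosen for readability.

```python
# Minimal sketch (not the released implementation) of an adaptive MLP connector:
# merges neighboring ViT patch tokens, then projects each merged token into the
# LLM embedding space with a two-layer GeLU MLP. Dimensions are illustrative.
import torch
import torch.nn as nn

class AdaptiveMLPConnector(nn.Module):
    def __init__(self, d_vision=1152, d_llm=1024, d_hidden=2048, pool=2):
        super().__init__()
        self.pool = pool  # merge pool x pool neighboring patches into one token
        self.proj = nn.Sequential(
            nn.Linear(d_vision * pool * pool, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_llm),
        )

    def forward(self, tokens, grid_h, grid_w):
        # tokens: (batch, grid_h * grid_w, d_vision) from the native-resolution ViT
        b, _, d = tokens.shape
        p = self.pool
        x = tokens.view(b, grid_h, grid_w, d)
        # group each p x p patch neighborhood into a single, wider token
        x = x.view(b, grid_h // p, p, grid_w // p, p, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (grid_h // p) * (grid_w // p), p * p * d)
        return self.proj(x)  # (batch, reduced_len, d_llm)

# Example: a 32x48 patch grid (e.g., a wide document line) is reduced 4x before the LLM.
feats = torch.randn(1, 32 * 48, 1152)
out = AdaptiveMLPConnector()(feats, grid_h=32, grid_w=48)
print(out.shape)  # torch.Size([1, 384, 1024])
```

The point of the sketch is that pooling happens over the 2D patch grid before projection, so the token count handed to the LLM shrinks quadratically with the merge factor while each merged token keeps its local spatial context.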
2. Data Strategy and Training Procedure
The model is trained with data-driven and reinforcement learning (RL) techniques, emphasizing high-quality, diverse samples and structured instruction.
- OCR Training Corpus: Pre-training leverages 200M image-text pairs across nine domains (documents, street views, receipts, screenshots, video frames, etc.) and over 130 languages. Sources encompass:
- Web-crawled images with manual/auto annotations
- Synthetic element-level data via an extended SynthDog framework (supporting controllable fonts, bidirectional text, warping, noise, etc.)
- Cross-task annotation reuse
- Four-Stage Supervised Pre-Training:
| Stage | Description | Token Count |
|-------|-------------|-------------|
| 1 | Vision-language alignment (ViT + adapter only) | 50B |
| 2 | Joint multimodal learning (all parameters) | 300B |
| 3 | Context extension (up to 32k tokens) | 80B |
| 4 | Application-oriented SFT | 24B |
- Reinforcement Learning: The Group Relative Policy Optimization (GRPO) framework is introduced with task-specific rewards. For each query $q$, a group of $G$ responses $\{o_1, \dots, o_G\}$ is sampled from the old policy and the clipped surrogate objective is maximized:
$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\Big(r_i(\theta)\,\hat{A}_i,\ \mathrm{clip}\big(r_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right],$$
where $r_i(\theta) = \pi_\theta(o_i \mid q)\,/\,\pi_{\theta_{\mathrm{old}}}(o_i \mid q)$ and $\hat{A}_i = \big(R_i - \mathrm{mean}(R_{1:G})\big)/\mathrm{std}(R_{1:G})$ is the group-aggregated advantage. Distinct reward structures are applied for spotting, parsing, VQA, and translation (a minimal sketch of the group-relative advantage computation follows this list).
- Data Augmentation and Curriculum: Geometric warping replicates folds, perspective distortion, blur, noise, and lighting variation. Question–Answer pairs are auto-generated and verified via cross-model consistency and manual curation. The curriculum transitions from basic alignment to instruction-based SFT.
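As referenced above, the core of GRPO is the group-relative advantage: several responses are sampled per prompt, scored with a task-specific reward, and normalized within the group. The PyTorch sketch below is illustrative only; the reward functions and group size used by HunyuanOCR are not reproduced here.

```python
# Illustrative sketch of GRPO's group-relative advantage (not the HunyuanOCR training code).
# For each prompt, G candidate responses are sampled and scored by a task-specific reward
# (e.g., geometric overlap plus edit distance for spotting); advantages are the rewards
# normalized within each group.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts, G) -- one task-specific reward per sampled response
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)  # (num_prompts, G)

# Example: one prompt, a group of G = 4 sampled outputs with rewards in [0, 1].
rewards = torch.tensor([[0.9, 0.4, 0.7, 0.1]])
adv = group_relative_advantages(rewards)
# The clipped objective then weights each response's policy-probability ratio
# pi_theta / pi_old by its advantage, as in the GRPO formula above.
print(adv)
```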
3. End-to-End Unified Pipeline
HunyuanOCR operates on a streamlined, single-pass inference paradigm driven by natural language prompts. This approach entirely removes pre-processing modules (layout analysis, segmentation) and post-processing logic. Task instructions direct the model to perform text spotting, parsing, IE, VQA, or translation in arbitrary mixtures.
- For example: “Detect and recognize text…output <ref>…</ref><quad>(x1,y1),(x2,y2)</quad>” yields bounding boxes and recognized text in one step.
This unified paradigm mitigates error propagation, lowers engineering complexity, and seamlessly supports instruction chaining for multi-step or hybrid tasks. A plausible implication is a reduction in deployment friction and simplified integration in multi-modal workflows.
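A hedged sketch of how a single-pass spotting response in the <ref>/<quad> format shown above might be consumed downstream; the exact serialization (number of quad points, separators) may differ in the released model.

```python
# Minimal sketch of parsing a single-pass spotting response in the <ref>/<quad> format.
# Only illustrates the idea; not tied to the model's exact output grammar.
import re

SPOT_PATTERN = re.compile(
    r"<ref>(?P<text>.*?)</ref>\s*<quad>(?P<quad>[^<]*)</quad>", re.DOTALL
)

def parse_spotting(response: str):
    results = []
    for m in SPOT_PATTERN.finditer(response):
        points = [tuple(map(int, p)) for p in re.findall(r"\((\d+),\s*(\d+)\)", m.group("quad"))]
        results.append({"text": m.group("text").strip(), "quad": points})
    return results

# Example with a hypothetical response string:
resp = "<ref>TOTAL 42.00</ref><quad>(120,88),(310,92)</quad>"
print(parse_spotting(resp))
# [{'text': 'TOTAL 42.00', 'quad': [(120, 88), (310, 92)]}]
```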
4. Benchmark Results and Empirical Performance
HunyuanOCR demonstrates superior performance over a range of commercial solutions, traditional pipelines, and larger VLMs.
| Task / Benchmark | Reported baselines | HunyuanOCR |
|---|---|---|
| ICDAR 2025 DIMT Challenge (Small Model Track), complex-layout document translation | — | 1st place |
| OCRBench (overall score) | Qwen3-VL-235B: 920; Gemini-2.5-Pro: 858 | 860 |
| Text Spotting (900 imgs) | PaddleOCR: 53.38%; BaiduOCR API: 61.9%; Qwen3-VL-235B: 53.62%; Seed-1.6-Vision: 59.23% | 70.92% |
| Cards IE Accuracy | Qwen3-VL-235B: 75.59% | 92.29% |
| Receipts IE Accuracy | Qwen3-VL-235B: 78.40% | 92.53% |
| Video Subtitles IE | Qwen3-VL-235B: 50.74% | 92.87% |
| DoTA Translation (en→zh) | Qwen3-VL-4B: 78.45 | 83.48 |
| DocML Translation (other→zh) | Qwen3-VL-4B: 70.29 | 73.62 |
| DocML Translation (other→en) | Qwen3-VL-4B: 70.38 | 73.38 |
Standard precision, recall, and F₁ definitions apply: $P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$, $F_1 = \frac{2PR}{P + R}$.
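For concreteness, a generic helper computing these metrics from raw counts (the benchmark-specific matching protocol that produces TP/FP/FN, e.g., overlap thresholds for spotting, is not shown):

```python
# Generic precision / recall / F1 from true-positive, false-positive, false-negative counts.
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf1(tp=709, fp=180, fn=291))  # hypothetical counts -> (~0.798, 0.709, ~0.751)
```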
5. Deployment Efficiency and Scalability
The model is delivered as a vLLM-based inference server, achieving top-tier production throughput (tokens/sec) and sub-100 ms per image latency on a single GPU. Its 1B parameter footprint fits within 8–12 GB GPU memory, permitting edge and on-device deployments. In comparison, general VLMs exceeding 2B parameters typically require more than 24 GB and incur higher latency.
This efficiency renders HunyuanOCR suitable for diverse practical OCR services while minimizing infrastructure cost. A plausible implication is expanded applicability in memory-constrained, latency-sensitive environments.
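As a deployment illustration, a vLLM server exposes an OpenAI-compatible endpoint that can be queried with image-plus-prompt requests. The snippet below is a sketch under assumptions: the model identifier, port, and prompt are placeholders, and the exact serving recipe should be taken from the official release.

```python
# Illustrative client for a vLLM OpenAI-compatible endpoint serving the model
# (e.g., started with `vllm serve <model-id>`); model id, port, and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("receipt.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="tencent/HunyuanOCR",  # placeholder model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text",
             "text": "Extract the merchant name, date, and total amount as JSON."},
        ],
    }],
    temperature=0.0,
)
print(resp.choices[0].message.content)
```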
6. Industrial and Research Impact
HunyuanOCR addresses both research and production needs across a wide spectrum of text-centric vision-language tasks:
- Automated document parsing (invoices, forms, legal texts)
- End-to-end IE in receipts and cards
- VQA for manuals and scientific charts
- Real-time translation of multilingual signage, posters, and academic papers
The open-source release on HuggingFace and GitHub facilitates extension for new languages, domains, and pipeline integrations (e.g., Retrieval-Augmented Generation). This suggests acceleration of research progress and broader industrial adoption.
In summary, HunyuanOCR exemplifies a compact, RL-enhanced, end-to-end VLM with demonstrated superiority over larger and more resource-intensive systems in OCR, parsing, IE, and text-image translation, offering efficient and robust deployment for real-world scenarios (Team et al., 24 Nov 2025).