HunyuanOCR: Open-Source Vision-Language OCR
- HunyuanOCR is a commercial-grade, open-source VLM that integrates core OCR functions and vision-language tasks in a unified, end-to-end architecture.
- It combines a native resolution Vision Transformer and a lightweight language model with an adaptive MLP connector to efficiently process and preserve fine-grained text features.
- The model employs RL-enhanced, data-driven training and a streamlined inference pipeline, achieving superior benchmark performance and efficient deployment in diverse applications.
HunyuanOCR is a commercial-grade, open-source, and lightweight vision-language model (VLM) dedicated to Optical Character Recognition (OCR) and associated vision-language tasks. At approximately 1 billion parameters, it offers a unified, end-to-end paradigm that integrates core OCR functions—spotting, parsing, information extraction (IE), visual question answering (VQA), and text-image translation—within a single efficient architecture. This approach eliminates traditional pipeline dependencies and the error propagation they introduce, advancing both research and industrial deployment (Team et al., 24 Nov 2025).
1. Model Composition and Architectural Design
HunyuanOCR combines a Native Resolution Vision Transformer backbone (Hunyuan-ViT, 0.4B parameters) and a lightweight LLM (Hunyuan-0.5B, 0.5B parameters), linked via an adaptive MLP connector.
- Native Vision Transformer Backbone: The encoder is built upon SigLIP-v2-400M, supporting arbitrary image resolutions. Images of height $H$ and width $W$ are partitioned into $N$ patches of size $P \times P$, adaptively preserving aspect ratio. Each patch $x_i$ is linearly embedded:
$$z_i^{(0)} = E\,\mathrm{vec}(x_i) + e_i^{\mathrm{pos}}, \qquad i = 1, \dots, N.$$
Multi-head self-attention and MLP blocks operate across $L$ layers. At each layer $\ell$, for $\ell = 1, \dots, L$:
$$\tilde{z}^{(\ell)} = z^{(\ell-1)} + \mathrm{MHSA}\big(\mathrm{LN}(z^{(\ell-1)})\big), \qquad z^{(\ell)} = \tilde{z}^{(\ell)} + \mathrm{MLP}\big(\mathrm{LN}(\tilde{z}^{(\ell)})\big).$$
Native patch attention avoids resizing, preserving fine-grained text features vital for handling long lines and geometrically distorted scans.
- Adaptive MLP Connector: This bridge performs learnable pooling, reducing the spatial sequence of visual tokens to a shorter sequence suitable for LLM input. For each encoder output token $z_i \in \mathbb{R}^{d_v}$, the adapter applies a two-layer MLP:
$$h_i = W_2\,\sigma(W_1 z_i + b_1) + b_2,$$
where $W_1 \in \mathbb{R}^{d_h \times d_v}$, $W_2 \in \mathbb{R}^{d_t \times d_h}$, $\sigma$ is GeLU, and $d_t$ matches the LLM embedding dimension (see the sketch after this list).
- Lightweight LLM: Hunyuan-0.5B utilizes XD-RoPE, decomposing positional embeddings into four subspaces (text, height, width, time), facilitating native alignment for 1D sequences, 2D layouts, and spatiotemporal context.
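The connector referenced above can be illustrated with a brief PyTorch sketch. This is a minimal illustration rather than the released implementation: the pooling factor and dimensions (SigLIP-style 1152-d vision features, a 1024-d LLM embedding width, 2x2 token merging) are assumptions chosen for readability.

```python
# Minimal sketch (not the released implementation) of an adaptive MLP connector:
# merges neighboring ViT patch tokens, then projects each merged token into the
# LLM embedding space with a two-layer GeLU MLP. Dimensions are illustrative.
import torch
import torch.nn as nn

class AdaptiveMLPConnector(nn.Module):
    def __init__(self, d_vision=1152, d_llm=1024, d_hidden=2048, pool=2):
        super().__init__()
        self.pool = pool  # merge pool x pool neighboring patches into one token
        self.proj = nn.Sequential(
            nn.Linear(d_vision * pool * pool, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_llm),
        )

    def forward(self, tokens, grid_h, grid_w):
        # tokens: (batch, grid_h * grid_w, d_vision) from the native-resolution ViT
        b, _, d = tokens.shape
        p = self.pool
        x = tokens.view(b, grid_h, grid_w, d)
        # group each p x p patch neighborhood into a single, wider token
        x = x.view(b, grid_h // p, p, grid_w // p, p, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (grid_h // p) * (grid_w // p), p * p * d)
        return self.proj(x)  # (batch, reduced_len, d_llm)

# Example: a 32x48 patch grid (e.g., a wide document line) is reduced 4x before the LLM.
feats = torch.randn(1, 32 * 48, 1152)
out = AdaptiveMLPConnector()(feats, grid_h=32, grid_w=48)
print(out.shape)  # torch.Size([1, 384, 1024])
```

The point of the sketch is that pooling happens over the 2D patch grid before projection, so the token count handed to the LLM shrinks quadratically with the merge factor while each merged token keeps its local spatial context.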
2. Data Strategy and Training Procedure
The model is trained with data-driven and reinforcement learning (RL) techniques, emphasizing high-quality, diverse samples and structured instruction.
- OCR Training Corpus: Pre-training leverages 200M image-text pairs across nine domains (documents, street views, receipts, screenshots, video frames, etc.) and over 130 languages. Sources encompass:
- Web-crawled images with manual/auto annotations
- Synthetic element-level data via an extended SynthDog framework (supporting controllable fonts, bidirectional text, warping, noise, etc.)
- Cross-task annotation reuse
- Four-Stage Supervised Pre-Training:
| Stage | Description | Token Count |
|-------|-------------|-------------|
| 1 | Vision-language alignment (ViT + adapter only) | 50B |
| 2 | Joint multimodal learning (all parameters) | 300B |
| 3 | Context extension (up to 32k tokens) | 80B |
| 4 | Application-oriented SFT | 24B |
- Reinforcement Learning: The Group Relative Policy Optimization (GRPO) framework is introduced with task-specific rewards. For each query $q$, a group of $G$ responses $\{o_1, \dots, o_G\}$ is sampled from the old policy and the clipped surrogate objective is maximized:
$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\Big(r_i(\theta)\,\hat{A}_i,\ \mathrm{clip}\big(r_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right],$$
where $r_i(\theta) = \pi_\theta(o_i \mid q)\,/\,\pi_{\theta_{\mathrm{old}}}(o_i \mid q)$ and $\hat{A}_i = \big(R_i - \mathrm{mean}(R_{1:G})\big)/\mathrm{std}(R_{1:G})$ is the group-aggregated advantage. Distinct reward structures are applied for spotting, parsing, VQA, and translation (a minimal sketch of the group-relative advantage computation follows this list).
- Data Augmentation and Curriculum: Geometric warping replicates folds, perspective distortion, blur, noise, and lighting variation. Question–Answer pairs are auto-generated and verified via cross-model consistency and manual curation. The curriculum transitions from basic alignment to instruction-based SFT.
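As referenced above, the core of GRPO is the group-relative advantage: several responses are sampled per prompt, scored with a task-specific reward, and normalized within the group. The PyTorch sketch below is illustrative only; the reward functions and group size used by HunyuanOCR are not reproduced here.

```python
# Illustrative sketch of GRPO's group-relative advantage (not the HunyuanOCR training code).
# For each prompt, G candidate responses are sampled and scored by a task-specific reward
# (e.g., geometric overlap plus edit distance for spotting); advantages are the rewards
# normalized within each group.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts, G) -- one task-specific reward per sampled response
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)  # (num_prompts, G)

# Example: one prompt, a group of G = 4 sampled outputs with rewards in [0, 1].
rewards = torch.tensor([[0.9, 0.4, 0.7, 0.1]])
adv = group_relative_advantages(rewards)
# The clipped objective then weights each response's policy-probability ratio
# pi_theta / pi_old by its advantage, as in the GRPO formula above.
print(adv)
```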
3. End-to-End Unified Pipeline
HunyuanOCR operates on a streamlined, single-pass inference paradigm driven by natural language prompts. This approach entirely removes pre-processing modules (layout analysis, segmentation) and post-processing logic. Task instructions direct the model to perform text spotting, parsing, IE, VQA, or translation in arbitrary mixtures.
- For example: “Detect and recognize text…output <ref>…</ref><quad>(x1,y1),(x2,y2)</quad>” yields bounding boxes and recognized text in one step.
This unified paradigm mitigates error propagation, lowers engineering complexity, and seamlessly supports instruction chaining for multi-step or hybrid tasks. A plausible implication is a reduction in deployment friction and simplified integration in multi-modal workflows.
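A hedged sketch of how a single-pass spotting response in the <ref>/<quad> format shown above might be consumed downstream; the exact serialization (number of quad points, separators) may differ in the released model.

```python
# Minimal sketch of parsing a single-pass spotting response in the <ref>/<quad> format.
# Only illustrates the idea; not tied to the model's exact output grammar.
import re

SPOT_PATTERN = re.compile(
    r"<ref>(?P<text>.*?)</ref>\s*<quad>(?P<quad>[^<]*)</quad>", re.DOTALL
)

def parse_spotting(response: str):
    results = []
    for m in SPOT_PATTERN.finditer(response):
        points = [tuple(map(int, p)) for p in re.findall(r"\((\d+),\s*(\d+)\)", m.group("quad"))]
        results.append({"text": m.group("text").strip(), "quad": points})
    return results

# Example with a hypothetical response string:
resp = "<ref>TOTAL 42.00</ref><quad>(120,88),(310,92)</quad>"
print(parse_spotting(resp))
# [{'text': 'TOTAL 42.00', 'quad': [(120, 88), (310, 92)]}]
```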
4. Benchmark Results and Empirical Performance
HunyuanOCR demonstrates superior performance over a range of commercial solutions, traditional pipelines, and larger VLMs.
| Task / Benchmark | Reported baselines | HunyuanOCR |
|---|---|---|
| ICDAR 2025 DIMT Challenge (Small Model Track), complex-layout document translation | — | 1st place |
| OCRBench (overall score) | Qwen3-VL-235B: 920; Gemini-2.5-Pro: 858 | 860 |
| Text Spotting (900 imgs) | PaddleOCR: 53.38%; BaiduOCR API: 61.9%; Qwen3-VL-235B: 53.62%; Seed-1.6-Vision: 59.23% | 70.92% |
| Cards IE Accuracy | Qwen3-VL-235B: 75.59% | 92.29% |
| Receipts IE Accuracy | Qwen3-VL-235B: 78.40% | 92.53% |
| Video Subtitles IE | Qwen3-VL-235B: 50.74% | 92.87% |
| DoTA Translation (en→zh) | Qwen3-VL-4B: 78.45 | 83.48 |
| DocML Translation (other→zh) | Qwen3-VL-4B: 70.29 | 73.62 |
| DocML Translation (other→en) | Qwen3-VL-4B: 70.38 | 73.38 |
Standard precision, recall, and F₁ definitions apply: $P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$, $F_1 = \frac{2PR}{P + R}$.
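For concreteness, a generic helper computing these metrics from raw counts (the benchmark-specific matching protocol that produces TP/FP/FN, e.g., overlap thresholds for spotting, is not shown):

```python
# Generic precision / recall / F1 from true-positive, false-positive, false-negative counts.
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf1(tp=709, fp=180, fn=291))  # hypothetical counts -> (~0.798, 0.709, ~0.751)
```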
5. Deployment Efficiency and Scalability
The model is delivered as a vLLM-based inference server, achieving top-tier production throughput (tokens/sec) and sub-100 ms per image latency on a single GPU. Its 1B parameter footprint fits within 8–12 GB GPU memory, permitting edge and on-device deployments. In comparison, general VLMs exceeding 2B parameters typically require more than 24 GB and incur higher latency.
This efficiency renders HunyuanOCR suitable for diverse practical OCR services while minimizing infrastructure cost. A plausible implication is expanded applicability in memory-constrained, latency-sensitive environments.
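As a deployment illustration, a vLLM server exposes an OpenAI-compatible endpoint that can be queried with image-plus-prompt requests. The snippet below is a sketch under assumptions: the model identifier, port, and prompt are placeholders, and the exact serving recipe should be taken from the official release.

```python
# Illustrative client for a vLLM OpenAI-compatible endpoint serving the model
# (e.g., started with `vllm serve <model-id>`); model id, port, and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("receipt.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="tencent/HunyuanOCR",  # placeholder model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text",
             "text": "Extract the merchant name, date, and total amount as JSON."},
        ],
    }],
    temperature=0.0,
)
print(resp.choices[0].message.content)
```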
6. Industrial and Research Impact
HunyuanOCR addresses both research and production needs across a wide spectrum of text-centric vision-language tasks:
- Automated document parsing (invoices, forms, legal texts)
- End-to-end IE in receipts and cards
- VQA for manuals and scientific charts
- Real-time translation of multilingual signage, posters, and academic papers
The open-source release on HuggingFace and GitHub facilitates extension for new languages, domains, and pipeline integrations (e.g., Retrieval-Augmented Generation). This suggests acceleration of research progress and broader industrial adoption.
In summary, HunyuanOCR exemplifies a compact, RL-enhanced, end-to-end VLM with demonstrated superiority over larger and more resource-intensive systems in OCR, parsing, IE, and text-image translation, offering efficient and robust deployment for real-world scenarios (Team et al., 24 Nov 2025).