
HunyuanOCR Technical Report (2511.19575v1)

Published 24 Nov 2025 in cs.CV and cs.AI

Abstract: This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow "OCR expert models" and inefficient "General VLMs". 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.

Summary

  • The paper presents a unified OCR system that consolidates text spotting, document parsing, information extraction, VQA, and text image translation within a single efficient architecture.
  • Experimental results demonstrate state-of-the-art performance, including 70.92% accuracy in text spotting and robust parsing across complex layouts and over 130 languages.
  • The novel architecture and reinforcement learning training pipeline enable efficient, on-device deployment while reducing computational costs and error propagation.

HunyuanOCR: A Unified Lightweight Vision-Language Model for Efficient and Robust OCR

Motivation and Objectives

HunyuanOCR (2511.19575) introduces a commercial-grade, open-source, production-ready OCR system centered on a 1B-parameter vision-language model (VLM). The design goals are twofold: (1) to unify previously fragmented OCR tasks—spanning text spotting, document parsing, information extraction (IE), visual question answering (VQA), and text image translation—within a single compact and efficient architecture, and (2) to address long-standing limitations of pipeline-based and large-scale general VLM systems, notably error propagation, inefficiency, and high deployment and computational cost.

The model targets robust performance across complex layouts, long-context documents, and multilingual scenarios (covering over 130 languages), delivering both state-of-the-art (SOTA) task scores and practical deployment efficiency (Figure 1).

Figure 1: Comparative performance of HunyuanOCR and various SOTA models across core OCR tasks.

System Architecture

HunyuanOCR is based on a modular three-component design (a minimal compositional sketch follows the list):

  • Native-Resolution Visual Encoder: A 0.4B-parameter SigLIP-v2-400M backbone, optimized to preserve aspect ratios and capture fine-grained detail via ViT-style global attention. This ensures scalability to arbitrary resolutions and robustness to extreme aspect ratios, which is essential for long texts and degraded documents.
  • Adaptive MLP Connector: A learnable pooling adapter performing spatially adaptive, task-aware feature compression and selection. This bridges high-resolution visual features and the LLM, reducing redundant information without sacrificing critical local semantics.
  • Lightweight LLM: A 0.5B Hunyuan LLM with XD-RoPE (decomposed into text, height, width, time subspaces) providing native alignment between 1D textual, 2D layout, and 3D spatiotemporal modalities. This supports structured reasoning for complex layouts and multi-page visual text understanding.
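
For intuition, here is a minimal, hypothetical sketch of how the three components compose in a forward pass (PyTorch-style). The module names, pooling factor, and dimensions are illustrative stand-ins rather than the released implementation; inference with the actual model goes through the HuggingFace/vLLM stack.

```python
import torch
import torch.nn as nn

class AdaptiveMLPConnector(nn.Module):
    """Learnable pooling + projection bridging visual tokens to the LLM space (sketch)."""
    def __init__(self, vision_dim: int, llm_dim: int, pool: int = 4):
        super().__init__()
        self.pool = pool  # number of neighboring visual tokens merged into one
        self.proj = nn.Sequential(
            nn.Linear(vision_dim * pool, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:  # (B, N, D_vis)
        b, n, d = vis_tokens.shape
        n = n - n % self.pool                                     # drop the remainder for simplicity
        grouped = vis_tokens[:, :n].reshape(b, n // self.pool, d * self.pool)
        return self.proj(grouped)                                 # (B, N // pool, D_llm)

class OCRVLMSketch(nn.Module):
    """Native-resolution ViT -> adaptive MLP connector -> lightweight LLM (end to end)."""
    def __init__(self, vision_encoder: nn.Module, connector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # stand-in for SigLIP-v2-400M
        self.connector = connector            # stand-in for the adaptive MLP connector
        self.llm = llm                        # stand-in for the 0.5B Hunyuan LLM with XD-RoPE

    def forward(self, image_patches: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vis = self.vision_encoder(image_patches)    # (B, N_vis, D_vis)
        vis = self.connector(vis)                   # compressed visual tokens in LLM space
        seq = torch.cat([vis, text_embeds], dim=1)  # visual tokens prepended to the prompt
        return self.llm(seq)                        # decoded answer representation
```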

The architecture enables true end-to-end training and inference, eliminating the layout analysis and other pre-/post-processing dependencies typical of OCR-targeted VLMs and pipelines (Figure 2).

Figure 2: HunyuanOCR’s end-to-end framework integrating native-resolution visual encoding, adaptive feature connection, and lightweight language modeling for a broad OCR task suite.

Data Pipeline and Task Unification

A pivotal aspect of HunyuanOCR is a large-scale, highly curated multimodal dataset (>200M samples) comprising real-world, web-mined, and synthetic data, with explicit augmentation for typographic variance, layout complexity, language diversity, and realistic document distortions (Figure 3).

Figure 3: Schematic of HunyuanOCR’s data production chain, including multilingual and structural document synthesis, realistic image warping, and cross-task label re-use.

All core OCR tasks are cast into a unified instruction-driven setting, allowing flexible, prompt-level control and natural adaptation to new formats. Supported task formats include (an illustrative prompt sketch follows the list):

  • Text Spotting: Structured prompts yield both normalized bounding box localization and text content extraction.
  • Document Parsing: Native support for formulas (LaTeX), tables (HTML/Markdown), charts/diagrams (Markdown/Mermaid), and arbitrary text extraction in reading order.
  • Information Extraction: Domain-adaptive extraction spans 30+ granular document types and supports both targeted (single-field) and parallel (multi-field/JSON) modes.
  • Video Subtitle Extraction: Generalizes to diverse subtitle placements and styles across video frames and aspect ratios.
  • VQA: Open-ended, composed reasoning across documents, charts, figures, and natural scene inscriptions.
  • Text Image Translation: Robust end-to-end translation across 14+ source languages in document and scene settings.
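
To make the instruction-driven interface concrete, here is a small, hypothetical example of how the different tasks could be expressed as prompts paired with a single image. The exact wording, coordinate conventions, and output schemas of HunyuanOCR's released prompt set may differ, so treat these templates as illustrative assumptions only.

```python
# Hypothetical instruction templates illustrating the unified, prompt-driven task interface.
TASK_PROMPTS = {
    "spotting": (
        "Detect all text in the image. For each line, return the transcription "
        "and its bounding box with coordinates normalized to the image size."
    ),
    "parsing": (
        "Parse the document into Markdown in natural reading order. "
        "Use HTML for tables, LaTeX for formulas, and Mermaid for flowcharts."
    ),
    "ie": (
        "Extract the fields ['seller', 'date', 'total_amount'] from this receipt "
        "and return them as a single JSON object."
    ),
    "vqa": "Answer based on the image: what is the total price?",
    "translation": "Translate all text in the image into English, preserving the structure.",
}

def build_request(task: str, image_path: str) -> dict:
    """Package one image and one task instruction as a single end-to-end request."""
    return {"image": image_path, "prompt": TASK_PROMPTS[task]}

if __name__ == "__main__":
    print(build_request("ie", "receipt_001.jpg"))
```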

Pretraining and Reinforcement Learning

The training pipeline leverages a staged process:

  1. Supervised Pretraining: Four sequential phases—vision-language alignment (encoder adapter freeze), multimodal joint training (full model), long-context extension (up to 32k tokens for whole-document reasoning), and application-oriented SFT (with standardized instructions for all tasks).
  2. Online Reinforcement Learning: Task-specific reward models combine rule-based metrics (e.g., edit distance, IoU) and LLM-as-a-judge strategies (for translation and VQA). The RL algorithm is Group Relative Policy Optimization (GRPO), using group-based advantage estimation and KL regularization for stable refinement. Output format and structural fidelity are strictly verified to maximize downstream utility and reward-signal relevance (Figure 4); a minimal sketch of the reward and advantage computation appears after this list.

    Figure 4: Training curve for RL, showing steady improvement in mean reward and increasing share of optimal outputs as RL progresses.
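
As a rough illustration of the verifiable-reward setup described above, the sketch below mixes IoU-based localization and edit-distance-based text fidelity into a spotting reward, then standardizes rewards within a group of sampled rollouts as GRPO's relative advantage does. The weighting, group size, and exact reward terms used by HunyuanOCR are not given in this summary and are assumptions here.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def edit_similarity(pred: str, ref: str) -> float:
    """1 minus normalized Levenshtein distance, in [0, 1]."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 1.0
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (pred[i - 1] != ref[j - 1]))
            prev = cur
    return 1.0 - dp[n] / max(m, n)

def spotting_reward(pred_box, pred_text, gt_box, gt_text, w_loc=0.5):
    """Hypothetical rule-based reward mixing localization and text fidelity."""
    return w_loc * iou(pred_box, gt_box) + (1.0 - w_loc) * edit_similarity(pred_text, gt_text)

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one group of sampled rollouts."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: one prompt, a group of 4 sampled outputs scored by the verifiable reward.
rewards = [spotting_reward((10, 10, 90, 30), t, (12, 11, 92, 31), "Total: 42.50 USD")
           for t in ["Total: 42.50 USD", "Total: 42.50", "Tota1 42 USD", ""]]
print(group_relative_advantages(rewards))
```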

Evaluation and Results

Text Spotting

HunyuanOCR consistently outperforms all compared commercial APIs, open-source pipelines, and even large general-purpose VLMs, with a notable margin—e.g., it achieves 70.92% overall on the internal multi-domain spotting benchmark, with leading results across all categories and especially large gains on artistic and dense document images (Figures 5–7).

Figure 5: Highly robust text spotting on stylized/complex fonts.

Figure 6: Precise detection and transcription in dense document layouts.

Figure 7: Demonstrated robustness to document layout complexity and occlusions.

Document Parsing

On OmniDocBench and its Wild variant, HunyuanOCR achieves new SOTA results at 94.1/85.2 overall (digital/real-capture), outperforming much larger general VLMs (Gemini-2.5-Pro, Qwen3-VL-235B) and specialized systems (PaddleOCR-VL, MonkeyOCR, dots.ocr), especially on formula and table parsing. DocML evaluation confirms strong cross-lingual transfer and robustness (Figures 8–13).

Figure 8: Structured extraction from figures within documents.

Figure 9: Structurally faithful table parsing performance.

Figure 10: Consistency and accuracy in multi-format tabular context parsing.

Figure 11: Strong performance in real-world (photographed, distorted, multilingual) Wild-OmniDocBench conditions.

Figure 12: Effective parsing in a broad spectrum of real-world visual and document scenarios.

Figure 13: Robust reading for multilingual scenarios, including low-resource and complex typographical settings.

Information Extraction, VQA, and Subtitle Extraction

HunyuanOCR delivers SOTA on all IE categories (e.g., 92.3% card extraction, 92.5% receipt parsing) and robustly handles fine-grained, complex fields in semi-structured, low-SNR visual scenes. Video subtitle extraction demonstrates high recall/precision across unconstrained frames and layouts (Figures 14–16).

Figure 14: Parallel and domain-adaptive information extraction from receipts.

Figure 15: Faithful, language-aware video subtitle extraction in noisy real-world content.

Figure 16: Accurate document and chart VQA, involving multi-table and figure-based reasoning.

Text Image Translation

On DoTA, DocML, and a broad document translation suite, HunyuanOCR approaches or surpasses the performance of models with 4–8x more parameters. For English-to-Chinese, it attains a COMET score of 83.48, only marginally below models as large as Gemini-2.5-Flash (85.60), despite its compact design (Figure 17).

Figure 17: Semantic and structural preservation in translation for multi-language books, technical content, and forms.

Strong/Contradictory Claims

  • Unified End-to-End Efficiency: HunyuanOCR demonstrates that a 1B-parameter VLM can match or outperform 8B+ general VLMs and specialized OCR pipelines on all critical document intelligence tasks. This contradicts the prior assumption that a large parameter count or a strongly modularized system design is necessary for robust performance in complex layouts or real-world scenes.
  • RL Significance in OCR: The use of RL for lightweight (<3B) OCR-targeted VLMs, especially with fine-grained, verifiable reward formulations, produces substantial real-world generalization improvements. This decisively closes the performance gap for open-ended and highly structured understanding tasks, especially compared to purely supervised regimes.

Practical and Theoretical Implications

From a practical perspective, HunyuanOCR ensures feasible on-device deployment and production-friendly inference cost, supporting a wide range of applications (e.g., industrial automation, archival digitization, RAG for LLMs, document translation in edge/enterprise scenarios). Its unified, instruction-driven interface simplifies integration and eliminates engineering overhead from legacy pipelined approaches.

Theoretically, HunyuanOCR substantiates the viability of efficient, end-to-end VLMs for covering the entire document intelligence spectrum, providing empirical evidence for the effectiveness of data-centric scaling, multimodal context expansion, and RL-based task adaptation in OCR. The model’s ability to unify visual, structural, and textual reasoning into a converged representation furthers prospects for cross-modal cognitive modeling.

Another notable implication is the demonstrated transferability of vision-language alignment and reinforcement learning beyond open-ended reasoning toward highly structured, schema-constrained document inference—a validation of RLVR/LLM-judge reward design in applied production settings.

Future Directions

Extensions may include:

  • Further compression and quantization for ultra-low-resource/edge deployment.
  • Context window expansion for cross-document and multi-page reasoning.
  • Modular extension for hybrid multimodal pipelines (e.g., layered integration with advanced MT systems for improved translation).
  • Research on task balancing and transfer under constrained supervision, especially for under-represented scripts and layouts.
  • Investigation into curriculum learning and hierarchical reward design to improve controllability in open-domain scenarios.

Conclusion

HunyuanOCR establishes a new paradigm for OCR research and deployment: a compact, efficient, yet high-accuracy unified model, validated across key document intelligence benchmarks and diverse real-world application settings. The integration of data-centric supervised training, robust synthetic augmentation, and tailored reinforcement learning manifests in demonstrable advantages for both robustness and usability. The publication, code, data, and deployment toolkit are made broadly accessible, lowering the barrier for industrial and academic adoption and further research.

Explain it Like I'm 14

What is this paper about?

This paper introduces HunyuanOCR, a small but powerful computer system that can read and understand text in images. This includes things like scanned documents, photos of signs, charts, tables, receipts, and even subtitles from videos. It’s “open-source” (free for developers to use), “lightweight” (about 1 billion parameters, which is small for modern AI), and fast enough to run on devices with limited computing power.

What questions did the researchers try to answer?

In simple terms, they wanted to know:

  • Can we build one compact AI model that does many OCR jobs well (like finding text, understanding document layouts, answering questions about images, and translating) without needing a complicated chain of separate tools?
  • Can an “end-to-end” design (one model that goes straight from image to answer) reduce errors and make deployment easier than traditional multi-step pipelines?
  • Does feeding the model lots of high-quality, carefully designed training data—and using reinforcement learning (a kind of “practice with feedback”)—significantly boost performance for OCR tasks?

How did they build and train the system?

The model’s parts

Think of HunyuanOCR like a small team where each member has a role:

  • Vision Transformer (ViT): This is the “eyes.” It looks at the image by splitting it into small patches (like turning a picture into puzzle pieces) and understands what’s in it. It preserves the image’s original shape, so long pages and wide receipts don’t get squashed or distorted.
  • MLP adapter: This is the “translator” between vision and language. It compresses and organizes what the eyes saw so the language part can understand it efficiently.
  • Lightweight LLM: This is the “brain.” It reads what the eyes saw and produces answers in text, understands page layouts, follows instructions, and reasons about content. A special technique called XD-RoPE helps the brain keep track of text order, page width/height, and even time for video frames—so it understands complex documents and multi-column pages better.

What the model can do

HunyuanOCR is designed to handle many common OCR tasks in one shot:

  • Text spotting: Find text in an image and report both the text and where it is.
  • Document parsing: Turn a page image into clean, structured content—plain text, plus HTML for tables and LaTeX for formulas—following the natural reading order.
  • Information extraction (IE): Pull out specific fields (like “Total Amount” on a receipt or a name on an ID card) and output them as structured data (like JSON).
  • Text-centric VQA: Answer questions about the text in an image (for example, “What is the total price?” or “How many items are listed?”).
  • Text image translation: Read text in one language from an image and translate it into Chinese or English; supports more than a dozen source languages and handles complex layouts.

The data they used

They built a huge, high-quality training set of about 200 million image–text pairs. It includes:

  • Real-world images from nine scenarios (documents, street signs, ads, screenshots, cards/invoices, handwritten notes, game interfaces, video frames, artistic typography), covering more than 130 languages.
  • Synthetic images they generated to create tricky cases (like long pages, rotated text, blurry scans, shadows, or unusual fonts) so the model learns to handle tough situations.
  • Automatically created question–answer pairs for training the VQA and parsing tasks, with checks to ensure quality and correctness.

Training steps

They trained the model in four stages:

  1. Align vision and language: Teach the “eyes” and “brain” to communicate.
  2. Full joint learning: Train all parts together on many tasks (spotting, parsing, translation, VQA).
  3. Long-context support: Extend how much text the model can handle at once (up to 32,000 tokens), helpful for long documents.
  4. Application-oriented fine-tuning: Use carefully curated, real-world data and standard instruction formats to make the model’s outputs consistent and reliable.

Reinforcement learning (RL): practice with feedback

After pretraining, they used RL to further improve results:

  • For tasks with exact answers (like spotting text and parsing), they rewarded the model when it matched the correct output format and content.
  • For open-ended tasks (like translation or some questions), a judging model scored the outputs, and HunyuanOCR learned to get higher scores over time.
  • They used a method called GRPO, which samples multiple answers, compares them, and updates the model in the direction of better performance, while keeping outputs within valid formats and length limits.

What did they find?

  • Strong performance with a small model: HunyuanOCR (about 1B parameters) outperforms many larger models (like some 4B+ models) and even some commercial OCR services.
  • Top results in benchmarks and competitions: It ranked first in the ICDAR 2025 DIMT Challenge (Small Model Track) and achieved state-of-the-art results on OCRBench among models under 3B parameters.
  • End-to-end design reduces errors: Because it doesn’t rely on separate pre-processing steps (like layout detection), mistakes don’t pile up, and the system is easier to deploy and maintain.
  • Versatility without heavy hardware: It handles spotting, parsing, IE, VQA, and translation in one model, with low latency and lower cost, making it suitable for devices with limited resources.
  • Reinforcement learning helps OCR: They show, for the first time at this scale in industry, that RL can noticeably improve accuracy and stability for OCR-specific tasks.

Why does it matter?

  • Easier, cheaper OCR solutions: Companies and apps can run advanced OCR on smaller hardware, reducing costs and making on-device use more practical.
  • Better handling of real-world complexity: The model deals with messy layouts, long documents, mixed languages, and low-quality images, which are common outside controlled environments.
  • One model, many tasks: This simplifies development—no need for complicated pipelines or multiple specialized models for each step.
  • Helps other AI systems: Cleanly parsed and translated text from images feeds into search, learning tools, assistants, and LLM training, unlocking content from books, archives, forms, and more.
  • Open-source foundation: Researchers and developers can build on HunyuanOCR, improving OCR tech for education, healthcare, offices, and everyday apps.

In short, HunyuanOCR shows that a small, end-to-end vision–LLM, trained on high-quality data and refined with smart feedback, can read and understand text in images across many tasks—fast, accurately, and at low cost. This makes advanced OCR more accessible and useful in the real world.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a focused list of what remains missing, uncertain, or unexplored in the paper, written to guide future research and replication.

  • Reproducibility of RL post-training: the GRPO objective in Eq. (1) is malformed, and key hyperparameters (group size G, sampling temperature/top‑p, clipping ε, KL coefficient β, reference policy details, rollout length, reward scaling/normalization, early stopping criteria) are not specified.
  • Reward design validity: no evidence that LLM-as-a-judge scores for translation and VQA correlate with human judgments (e.g., Pearson/Spearman vs BLEU/COMET/chrF, STS-style human ratings), nor how biases across languages/scripts are mitigated.
  • Spotting reward coverage: the RL reward combines IoU and edit distance but is limited to axis-aligned boxes; how are rotated, curved, or polygonal text regions handled, and what is the impact on curved/vertical text and artistic typography?
  • Schema adherence under RL: the paper penalizes schema violations with zero reward, but does not quantify how often outputs violate the required formats, nor strategies to reduce format errors (e.g., constrained decoding, structured decoders).
  • Ablation studies: no controlled ablations isolating the contributions of (a) synthetic vs real data, (b) long-context training, (c) adapter compression ratio, (d) RL vs SFT alone, (e) visual encoder choice (SigLIP-v2 vs alternatives), and (f) XD‑RoPE vs standard RoPE.
  • Per-language performance: claims of support for “hundreds of languages” and coverage of >130 languages lack per-language metrics on public benchmarks (e.g., ICDAR-MLT, LSVT, ArT, ReCTS, CTW, ScriptNet datasets) and stratified results for low-resource scripts (Arabic, Indic, Khmer, Thai, Amharic, etc.).
  • Reading order fidelity: no quantitative evaluation of reading-order accuracy in multi-column, RTL/LTR mixed, vertical CJK layouts, or documents with floating elements and sidebars.
  • Formula-to-LaTeX accuracy: the paper does not report renderability rates, semantic equivalence (e.g., via CAS-based normalization), or robustness to noisy scans for math/chemical formula parsing.
  • Table-to-HTML fidelity: lacks evaluation of structural correctness (headers, spanning cells, nested tables), semantic accuracy (cell content), and rendering equivalence across varied table types (rotated, skewed, dense).
  • Chart parsing reliability: no metrics or datasets reported for chart-to-Mermaid/Markdown conversion (type classification accuracy, data series extraction correctness, axis/legend parsing).
  • Document translation quality: translation into CN/EN is claimed, but no standardized metrics (BLEU, COMET, spBLEU), human evaluation protocols, or coverage of domain‑specific jargon and mixed-language code‑switching are provided.
  • General scene-text translation: robustness across lighting, blur, occlusions, and stylized fonts remains unevaluated with standard datasets (e.g., ST‑TexT, ICDAR 2019 ReCTS‑Translation).
  • VQA evaluation rigor: the paper does not report ANLS or task-specific metrics/datasets (DocVQA, InfographicVQA, ChartQA, TextCaps) nor controls for leakage when generating QA pairs from spotting/parsing outputs.
  • Multi-page and cross-page reasoning: XD‑RoPE claims alignment in time/space, but there is no evaluation on multi-page documents or temporal reasoning in video subtitle extraction beyond single-frame tests.
  • Video OCR temporal modeling: subtitle extraction is described for screenshots; it remains open whether sequence-level modeling (temporal smoothing, subtitle line continuity, timing alignment) is implemented and evaluated.
  • Long-context inference feasibility: although trained to 32k context, there are no latency, memory, and throughput measurements for long inputs on typical edge/server hardware, nor guidance on context management (chunking, pagination).
  • Native-resolution ViT scalability: global attention over native-resolution patches is quadratic in sequence length; the paper does not specify patch limits, maximum input resolutions, token counts before/after the MLP connector, or the accuracy–efficiency trade-off from adaptive pooling.
  • On-device deployment specifics: claims of on-device suitability are not backed by concrete resource profiles (VRAM/RAM, CPU/GPU/NPUs), quantization support (INT8/FP8), energy consumption, or real-time performance under common mobile SoCs.
  • vLLM serving performance: no reproducible benchmarks (tokens/sec, concurrency, latency percentiles) across batch sizes and sequence lengths; missing comparison against alternatives (TensorRT‑LLM, TGI, OVMS).
  • Benchmarks and baselines comparability: the custom 900‑image spotting benchmark lacks public release, annotation protocol, and detailed baseline prompting/decoding settings, risking unfair comparisons to commercial APIs and general VLMs.
  • Error analysis and failure modes: no qualitative/quantitative breakdown of typical errors (reading order mistakes, mislocalized boxes, hallucinated tables/formulas, mistranslations), nor mitigation strategies.
  • Data sources, licensing, and privacy: web-crawled and synthetic datasets (200M pairs) are not itemized with licenses, deduplication practices, PII handling, or contamination checks against evaluation/competitions (e.g., ICDAR DIMT).
  • Data quality controls: automated QA generation and multi-model verification are described, but label noise rates, human QA sample sizes, inter-annotator agreement, and the final acceptance criteria are not reported.
  • Domain generalization in IE: structured extraction is claimed across ~30 document types, but there is no zero‑shot/low‑shot evaluation on unseen document templates, forms with nested fields, or handwritten entries.
  • Confidence and calibration: the model does not output calibrated confidence scores for recognition/IE/translation, limiting downstream use in production pipelines and human-in-the-loop settings.
  • Safety and robustness: no tests for adversarial prompts, harmful content handling, watermark/noise attacks, or resilience to instruction perturbations and layout adversaries.
  • Interoperability of output schemas: custom tags (<ref>, <quad>) may hinder integration; there is no mapping/compatibility with common OCR/IE standards (ALTO XML, hOCR, COCO‑Text, PDFMiner structured outputs).
  • Training compute footprint: the paper omits details on training hardware, total GPU hours, energy cost, and scaling behavior—critical for reproducibility and environmental impact assessment.
  • Model licensing and release completeness: while the model is “open-sourced,” specifics of license terms, available weights (pre‑RL vs post‑RL), training/evaluation code, and data access are not clarified.
  • Ethical and fairness considerations: no analysis of demographic/language/script bias, accessibility for visually impaired use cases, or downstream societal impacts of errors in critical domains (healthcare, finance, legal).

Glossary

  • Adaptive MLP Connector: A learnable pooling module that compresses visual features and projects them into the LLM’s input space. "Adaptive MLP Connector acts as a bridge between the visual and linguistic domains, implementing a core learnable pooling operation."
  • Annealing training: A staged fine-tuning schedule with gradually reduced learning rates to stabilize learning. "We conduct annealing training using carefully curated, human-annotated real-world data"
  • Application-oriented SFT: Supervised fine-tuning targeted at specific downstream applications and standardized outputs. "long-context pre-training, and application-oriented SFT."
  • Bidirectional text layouts (LTR/RTL): Text rendering directions supporting both left-to-right and right-to-left scripts. "bidirectional text layouts (LTR/RTL)"
  • Bounding box: A rectangular region used to localize objects or text in images via coordinates. "specifies the bounding box of the text region"
  • Consistency Verification: A cross-checking step to validate generated QA pairs before inclusion in training. "Consistency Verification and Data Refinement."
  • Context window: The maximum sequence length (in tokens) the model can attend to in a single pass. "we extend the model's context window to 32K"
  • End-to-end paradigm: A design where the system learns to map inputs to outputs directly without intermediate modules. "Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis)."
  • Error propagation: The accumulation and amplification of mistakes across stages in a pipeline. "This fundamentally resolves error propagation common in traditional pipelines"
  • Group Relative Policy Optimization (GRPO): A reinforcement learning algorithm that updates policies using group-based relative advantages. "We adopt the Group Relative Policy Optimization (GRPO) algorithm as our main reinforcement learning framework."
  • Hard Sample Retrieval: Automated mining of challenging instances to improve model robustness. "Hard Sample Retrieval."
  • Hidden Markov Models (HMMs): Probabilistic models for sequence data often used historically in OCR and speech. "Hidden Markov Models (HMMs)"
  • Information extraction (IE): Structured extraction of key fields and entities from documents or images. "information extraction"
  • Intersection over Union (IoU): A metric for overlap between predicted and ground-truth regions used for matching boxes. "Intersection over Union (IoU)"
  • KL-divergence: A measure of difference between probability distributions used for regularization in RL. "the KL-divergence term for regularization."
  • LaTeX: A typesetting system used to represent mathematical formulas. "represent it using LaTeX format."
  • Layout analysis: Detecting and structuring elements (text, tables, figures) in a document image. "pre-processing modules (e.g., layout analysis)"
  • LLM-as-a-judge: Using an LLM to score or evaluate generated outputs for reward signals. "LLM-as-a-judge approach."
  • Long-context pre-training: Training to handle very long sequences for tasks like long-document parsing. "Long-context Pre-training"
  • Mermaid: A text-based diagramming syntax for flowcharts and similar charts. "use Mermaid format for flowcharts"
  • MLP adapter: A multilayer perceptron bridge that aligns vision features to the LLM space. "via an MLP adapter."
  • Multimodal LLM (MLLM): Models that process and reason over multiple modalities, e.g., vision and text. "multimodal LLMs (MLLMs)."
  • Native Resolution Visual Encoder: A vision encoder that preserves input image aspect ratio and resolution via adaptive patching. "Native Resolution Visual Encoder (Hunyuan-ViT)"
  • Normalized edit distance: A string similarity measure scaled to [0,1] for evaluating text accuracy. "normalized edit distance"
  • OCRBench: A benchmark suite for evaluating OCR-related capabilities of VLMs. "state-of-the-art (SOTA) results on OCRBench"
  • OmniDocBench: A document parsing benchmark used to compare model performance. "on the OmniDocBench for document parsing."
  • Pipeline effect: Degradation arising from multi-stage cascades where early errors harm later stages. "pipeline effect."
  • Reinforcement Learning (RL): Optimization via reward signals to align outputs with goals or preferences. "Reinforcement Learning (RL)"
  • Reinforcement Learning with Verifiable Rewards (RLVR): An RL setup where rewards are computed by deterministic, checkable metrics. "we adopt Reinforcement Learning with Verifiable Rewards (RLVR)"
  • Retrieval-augmented generation (RAG): Systems that incorporate retrieved documents into the generation process. "retrieval-augmented generation (RAG) systems."
  • Right-to-left (RTL): A script direction used by languages like Arabic and Hebrew. "right-to-left (RTL) reading order."
  • RoPE (Rotary Positional Embedding): A method for encoding positional information in transformer attention. "conventional RoPE"
  • SigLIP-v2-400M: A pre-trained vision encoder model variant used as the backbone. "SigLIP-v2-400M pre-trained model"
  • vLLM: A high-throughput inference engine for LLM serving and deployment. "deployment solution based on vLLM"
  • Vision Transformer (ViT): A transformer-based architecture for image encoding using patch embeddings. "Vision Transformer (ViT)"
  • Vision-Language Model (VLM): A model jointly trained to understand and generate across vision and text modalities. "Vision-Language Model (VLM)"
  • Visual question answering (VQA): Answering natural-language questions about images that contain text. "visual question answering (VQA)"
  • Warping Synthesis Pipeline: An augmentation pipeline simulating geometric and photometric distortions. "Warping Synthesis Pipeline"
  • XD-RoPE: An extension of RoPE that allocates positional dimensions across text, height, width, and time. "It incorporates XD-RoPE"

Practical Applications

Immediate Applications

Below are applications that can be deployed now, leveraging the open-source HunyuanOCR model and its vLLM-based high-performance deployment (a minimal serving sketch follows the list).

  • OCR-as-a-Service replacement for pipelines (software, enterprise IT)
    • Use HunyuanOCR’s end-to-end model to consolidate text spotting, parsing, information extraction (IE), VQA, and translation into a single inference step, replacing multi-model cascades (e.g., PaddleOCR/EasyOCR stacks).
    • Tools/products/workflows: A unified OCR API, on-prem/edge microservice, SDKs for Python/Node/Java; plug-ins for document platforms (SharePoint/Google Drive/Notion).
    • Assumptions/dependencies: vLLM or similar inference stack; prompt templates; GPU/CPU resources sized to throughput needs; licensing review for commercial use.
  • Document ingestion for RAG pipelines (software, knowledge management, legal)
    • Parse PDFs/images into structured Markdown/HTML/LaTeX with reading order, extract figure positions, and produce JSON for tables; index outputs in a vector database for retrieval.
    • Tools/products/workflows: “Image-to-RAG” ingestion pipeline; connectors for Elastic/OpenSearch/Pinecone; automated chunking aligned to 32k context.
    • Assumptions/dependencies: Reliable parsing prompts; downstream embedding/indexing; handling of long documents via chunking.
  • Accounts payable and KYC automation (finance, insurance, telecom)
    • Extract fields from invoices/receipts/cards/certificates into machine-readable JSON; validate entries in ERP/CRM systems.
    • Tools/products/workflows: IE templates for common documents; batch ingest; reconciliation and validation scripts; dashboard for human-in-the-loop corrections.
    • Assumptions/dependencies: Domain-specific key lists; schema validation; PII handling (secure storage, access controls).
  • Healthcare record digitization and translation (healthcare)
    • Convert scanned forms and reports into structured outputs; translate patient-facing materials (e.g., consent forms) between English/Chinese and supported source languages.
    • Tools/products/workflows: Hospitals’ document digitization pipelines; patient portal translation; mapping to HL7/FHIR fields.
    • Assumptions/dependencies: HIPAA/GDPR compliance; domain lexicon for medical terms; sometimes specialized post-validation for critical fields.
  • Multilingual image-to-text translation for localization (media, e-commerce, education)
    • Translate posters, signage, product packaging, screenshots, and document pages into English/Chinese while preserving structure (HTML for tables, LaTeX for equations).
    • Tools/products/workflows: “Visual localization” tools for websites/apps; semi-automated translation review with human translators; batch translation for catalog images.
    • Assumptions/dependencies: Language coverage (currently >14 source languages for translation; recognition supports many more); LLM-as-a-judge bias awareness.
  • Video subtitle extraction (media, creator tools)
    • Extract subtitles from frames/screenshots in various aspect ratios; integrate with downstream ASR or subtitle styling tools.
    • Tools/products/workflows: Subtitle pre-processors in VOD platforms; bulk frame extraction; timeline alignment with video editors.
    • Assumptions/dependencies: Accurate frame sampling; layout-aware text extraction; potential noise in stylized/hard overlays.
  • Chart, table, and formula conversion (education, publishing, developer tools)
    • Convert tables to HTML, formulas to LaTeX, and charts to Mermaid/Markdown for editing, republishing, and paper aids.
    • Tools/products/workflows: “Image-to-editable-doc” utilities; conversion plug-ins for LaTeX/Word; e-learning content production.
    • Assumptions/dependencies: Prompt templates for element types; human QA for complicated charts; symbol sets for STEM domains.
  • On-device OCR for privacy-preserving apps (mobile, accessibility)
    • Deploy the 1B model to mobile/edge devices for offline OCR and translation (e.g., scanning bills, menus, classroom notes).
    • Tools/products/workflows: Mobile SDKs; quantized models; caching for frequent prompts; accessibility apps for reading signage/documents aloud.
    • Assumptions/dependencies: Hardware acceleration (GPU/NPU) or CPU optimizations; power/performance tuning; model quantization.
  • Enterprise UI text extraction and testing (software QA)
    • Extract UI text from screenshots to verify localization completeness, style guides, and detect truncation/overlaps.
    • Tools/products/workflows: CI/CD checks for UI screenshots; diffing against expected strings; bug tracker integration.
    • Assumptions/dependencies: Screenshot capture consistency; mapping to design specs; tolerance for minor OCR errors.
  • Corpus creation from scanned books and archives (academia, libraries, public policy)
    • Parse long documents into high-quality corpora for LLM training or digital archives, with consistent structure and reading order.
    • Tools/products/workflows: Batch digitization pipeline; metadata enrichment; OCR+IE into canonical formats; cross-lingual archiving.
    • Assumptions/dependencies: OCR accuracy on low-quality scans; archive-specific schemas; intellectual property clearance.
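
As a starting point for the serving workflows above (e.g., the OCR-as-a-Service and RAG-ingestion items), here is a minimal, hypothetical vLLM snippet. The model identifier, prompt wording, and sampling settings are assumptions to adapt from the official HuggingFace model card and deployment guide, not a verified configuration.

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Hypothetical model id and prompt; check the official model card for the real chat template.
llm = LLM(model="tencent/HunyuanOCR", trust_remote_code=True, max_model_len=32768)

image = Image.open("invoice_page.png")
prompt = "Parse the document into Markdown in reading order. Use HTML for tables."

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=2048),
)
print(outputs[0].outputs[0].text)  # structured parse of the page
```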

Long-Term Applications

The following applications require further research, scaling, integration, domain adaptation, or validation before widespread deployment.

  • Self-improving OCR via RL in-the-loop (software, research)
    • Continually fine-tune domain-specific OCR using RL with verifiable rewards and LLM-as-a-judge, improving accuracy for niche formats (e.g., pathology reports, shipping manifests).
    • Tools/products/workflows: RL pipelines with GRPO; domain reward models; automated hard-sample mining.
    • Assumptions/dependencies: Robust reward functions; avoiding judge model bias; cost of RL runs; reliable ground truth acquisition.
  • Full document translation with layout preservation and editability (publishing, localization)
    • Generate fully editable documents (e.g., DOCX/LaTeX) with precise layout replication and cross-language typographic conventions.
    • Tools/products/workflows: End-to-end “image-to-docx” with semantic blocks; style transfer; post-edit toolkits.
    • Assumptions/dependencies: Advanced layout reconstruction beyond current HTML/Markdown/LaTeX; comprehensive language coverage; typographic rules.
  • Real-time AR translation and guidance (robotics, consumer devices, accessibility)
    • Provide real-time translation and reading aid via AR glasses or mobile AR, including signage, packaging, and dynamic displays.
    • Tools/products/workflows: Low-latency edge inference; multi-frame stabilization; AR UX; speech output; on-device caches.
    • Assumptions/dependencies: Hardware acceleration; low-latency pipelines; battery constraints; robust tracking in motion and low light.
  • Industrial robotics reading and compliance (manufacturing, logistics, energy)
    • Robots read labels, safety notices, and shipping instructions in warehouses/factories; verify compliance documents attached to equipment.
    • Tools/products/workflows: Perception modules that fuse OCR with 3D sensing; task planning based on read content; audit trails.
    • Assumptions/dependencies: Integration with perception stacks; domain-specific vocabularies; extreme environment robustness.
  • National-scale digitization of public records (policy, archives)
    • Standardized end-to-end digitization and translation of court records, statutes, permits, and historical documents across languages.
    • Tools/products/workflows: Government-certified OCR appliances; auditability/traceability; public API access with rate limits; privacy-by-design.
    • Assumptions/dependencies: Procurement and certification; PII redaction; accessibility standards; language policy and data sovereignty.
  • Scientific figure-to-data extraction (academia, pharma)
    • Convert charts/plots in PDFs into machine-readable data tables and metadata for meta-analysis and synthesis.
    • Tools/products/workflows: Figure parsing combined with semantic understanding; unit normalization; dataset curation workflows.
    • Assumptions/dependencies: Accurate chart type detection and data extraction beyond current Mermaid/Markdown descriptions; domain ontologies.
  • Cross-lingual compliance monitoring (finance, healthcare, regulated industries)
    • Monitor multilingual documents and signage for regulatory adherence (e.g., required disclosures, labeling).
    • Tools/products/workflows: OCR+NLP compliance checks; alerting; audit logs; evidence capture.
    • Assumptions/dependencies: Policy-specific rule sets; high recall on critical fields; legal validation.
  • Low-resource language expansion and evaluation (education, global NGOs)
    • Extend translation and recognition to more low-resource languages; build open evaluation suites for fair performance assessment.
    • Tools/products/workflows: Community data collection; synthetic data expansion; unbiased benchmarks; multilingual lexicons.
    • Assumptions/dependencies: Data availability and quality; ethical data sourcing; model fairness and bias mitigation.
  • Privacy-preserving OCR with formal guarantees (healthcare, legal, consumer)
    • On-device OCR with differential privacy or secure enclaves; provable guarantees about PII non-exfiltration.
    • Tools/products/workflows: DP training/inference; secure hardware integration; policy-compliant analytics pipelines.
    • Assumptions/dependencies: Mature DP techniques for VLMs; performance trade-offs; certification costs.
  • End-to-end multimedia understanding (media tech)
    • Combine OCR with ASR, vision, and reasoning to answer complex queries about video content (e.g., “Summarize text-heavy scenes and translate on-screen text”).
    • Tools/products/workflows: Multimodal pipelines with time-aware modeling (leveraging XD-RoPE’s time subspace); content indexing/search.
    • Assumptions/dependencies: Stable multi-frame OCR; synchronized metadata; scalable storage/indexing.
  • Autonomous document agents (enterprise software)
    • Agents that read, parse, extract, validate, and route documents across departments with minimal human intervention.
    • Tools/products/workflows: Workflow orchestration; integration with ERP/CRM; exception handling; SLA monitoring.
    • Assumptions/dependencies: High-quality IE on diverse templates; robust validation rules; governance and human oversight.
  • Standardization and benchmarking of RL for OCR (academia, standards bodies)
    • Establish open benchmarks and reward designs for verifiable OCR tasks; shared protocols for LLM-as-a-judge evaluation.
    • Tools/products/workflows: Public leaderboards; common reward schemas; reproducible RL pipelines.
    • Assumptions/dependencies: Community consensus; reproducibility infrastructure; sustainability of evaluation compute.

These applications leverage HunyuanOCR’s key innovations: an end-to-end architecture that simplifies deployment and reduces error propagation, a native-resolution visual encoder with adaptive token pooling, a long-context lightweight LLM with spatial-temporal alignment (XD-RoPE), a high-quality multilingual data pipeline, and demonstrated RL-based post-training gains for OCR-specific tasks.

Open Problems

We found no open problems mentioned in this paper.
