Typhoon OCR: VLM for Thai & English Documents
- Typhoon OCR is an open-source vision-language model designed for high-fidelity document extraction in Thai and English, addressing challenges posed by non-Latin scripts.
- It leverages a multi-stage Thai-focused data pipeline and unified markup supervision to accurately reconstruct text and layout from complex documents.
- Its architecture integrates transformer backbones with quantization-aware training, achieving competitive OCR performance with reduced computational resources.
Typhoon OCR is an open-source family of vision-language models (VLMs) specifically designed for high-fidelity document extraction in Thai and English. The primary goal of Typhoon OCR is to address the substantial gap in support for non-Latin scripts, such as Thai, within the landscape of frontier VLM-based optical character recognition (OCR). Thai presents unique challenges due to script characteristics (complex glyphs, lack of explicit word boundaries) and the prevalence of unstructured, heterogeneous documents. Typhoon OCR leverages careful backbone selection, a Thai-focused multi-stage data engineering pipeline, and unified supervision to provide end-to-end text transcription, layout reconstruction, and document-level structural consistency, all while maintaining a compact computational profile (Nonesung et al., 21 Jan 2026).
1. Model Architecture and Training Paradigms
Typhoon OCR comprises several model variants, fine-tuned from large vision-language transformer backbones. Typhoon OCR V1 utilizes the Qwen2.5-VL architecture in both 3B and 7B parameter configurations, while Typhoon OCR V1.5 is based on the 2B-parameter Qwen3-VL. Each backbone combines a vision encoder (a vision transformer that processes image patch embeddings) with multimodal transformer layers (cross-attention to fuse visual and textual information) followed by an autoregressive transformer decoder for sequence generation.
The input pipeline rasterizes document pages to a fixed width of 1,800 px while preserving aspect ratio; this standardization facilitates both efficient batch processing and prediction stability. The decoder supports extremely long output sequences (up to 17,000 tokens for V1, 16,384 for V1.5), allowing a full page of dense text and markup to be generated in a single pass. Fine-tuning is performed in a fully supervised fashion, updating all model parameters without adapters or LoRA. V1.5 incorporates quantization-aware training, exposing the model to low-precision (int8, int4) arithmetic so that inference under aggressive quantization preserves accuracy.
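A minimal sketch of this resolution-aware width standardization, assuming PIL-style image handling (the function and constant names are illustrative, not from the released code):

```python
from PIL import Image

TARGET_WIDTH = 1800  # fixed rasterization width described for Typhoon OCR

def preprocess_page(path: str) -> Image.Image:
    """Cap a page image at 1,800 px width, preserving aspect ratio.

    Images already narrower than the target are kept at native scale,
    mirroring the resolution-aware behavior described for V1.5.
    """
    img = Image.open(path).convert("RGB")
    if img.width <= TARGET_WIDTH:
        return img  # keep native resolution; only downscale when needed
    scale = TARGET_WIDTH / img.width
    return img.resize((TARGET_WIDTH, round(img.height * scale)), Image.LANCZOS)
```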
The training objective is the standard autoregressive cross-entropy over token sequences:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t}, \mathbf{v}\right),$$

where $\mathbf{v}$ is the document image embedding and $y_{1:T}$ encodes both natural language text and layout markup (Markdown, HTML tags). No explicit bounding-box regression loss is used; the model learns text-region organization implicitly through mixed markup supervision (Nonesung et al., 21 Jan 2026).
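In PyTorch-style code, the objective reduces to a masked token-level cross-entropy. This is a sketch assuming standard teacher forcing and a `pad_id` for padding tokens, not the authors' training code:

```python
import torch
import torch.nn.functional as F

def ocr_loss(logits: torch.Tensor, targets: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Autoregressive cross-entropy over mixed text/markup tokens.

    logits:  (batch, seq_len, vocab) decoder outputs conditioned on the image
    targets: (batch, seq_len) token ids covering both text and layout markup
    """
    # Shift so that position t predicts token t+1 (teacher forcing).
    shifted_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    labels = targets[:, 1:].reshape(-1)
    # Padding tokens are excluded from the loss.
    return F.cross_entropy(shifted_logits, labels, ignore_index=pad_id)
```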
2. Multi-Stage Thai-Focused Data Pipeline
The challenge of low-resource Thai domains is addressed through a multi-stage data engineering pipeline.
Stage 1: Conventional OCR
Text extraction leverages open-source OCR (PaddleOCR, Tesseract) and any available PDF text layers, generating high-precision transcription for relatively clean sources.
Stage 2: VLM-Based Restructuring
OCR output is post-processed by open VLMs (Qwen3-VL, Dots.OCR), prompted to insert detailed layout markup (e.g., tables, lists, headings) and logically group document sections.
Stage 3: Automated Consistency Checks
Rule-based agents validate the transcription for missing or duplicated content, enforce monotonic reading order, and ensure visual/textual alignment.
Stage 4: Human Verification
A portion of the dataset undergoes human review to discard samples with unrecoverable inconsistencies.
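The Stage 3 checks lend themselves to a concrete illustration. A minimal sketch using token-overlap heuristics; the thresholds and rules below are assumptions for exposition, not the paper's actual rule-based agents:

```python
import re

def check_consistency(markup: str, raw_text: str,
                      min_coverage: float = 0.9, max_growth: float = 1.5) -> bool:
    """Illustrative Stage-3-style validation of VLM-restructured markup.

    Flags samples whose restructured output drops or duplicates content
    relative to the Stage 1 transcription. (A full implementation would
    also validate monotonic reading order against line positions.)
    """
    # Strip HTML tags and Markdown punctuation to compare plain text.
    plain = re.sub(r"<[^>]+>|[#*|`\-]", " ", markup)
    markup_tokens, raw_tokens = plain.split(), raw_text.split()

    # Missing content: too few Stage 1 tokens survive restructuring.
    coverage = len(set(markup_tokens) & set(raw_tokens)) / max(len(set(raw_tokens)), 1)
    if coverage < min_coverage:
        return False
    # Duplicated content: restructured text should not balloon in length.
    return len(markup_tokens) <= max_growth * len(raw_tokens)
```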
The pipeline is further enhanced by systematic synthetic-data augmentation. V1 incorporates CoSyn-400K for structured tabular layouts (8.3% of 77,029 pages), while V1.5 expands with Thai-translated VQA (2.2%), DocLayNet structural layouts (6.4%), and a diverse 37.6% mix of synthetic documents. These synthetic sets employ random Thai vocabularies (rendered in varied fonts using PyThaiNLP), chart images (ChartCap), Southeast Asian scene images (SEA-VL), LaTeX-sourced equations, and Augraphy-based image augmentations (blur, noise, distortion).
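The degradation step can be illustrated with a minimal stand-in; the actual pipeline uses the Augraphy library, and the blur radius and noise level below are arbitrary assumptions:

```python
import numpy as np
from PIL import Image, ImageFilter

def degrade(img: Image.Image, blur_radius: float = 1.5,
            noise_std: float = 12.0) -> Image.Image:
    """Apply Augraphy-style degradations (blur + additive noise) to a page."""
    blurred = img.filter(ImageFilter.GaussianBlur(blur_radius))
    arr = np.asarray(blurred).astype(np.float32)
    arr += np.random.normal(0.0, noise_std, size=arr.shape)  # sensor-like noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```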
Corpus composition for V1 (Structure Mode, 77,029 pages) spans infographics (45.6%), financial reports (7.2%), books (5.6%), handwritten (5.5%), scanned mixed types (6.2%), invoices/bills (8.7%), long-tail forms (13%), and synthetic CoSyn (8.3%). V1.5 unifies the corpus, expanding it to 155,403 pages with 53.7% V1 real documents, 2.2% VQA, 6.4% DocLayNet, and 37.6% synthetic sources (Nonesung et al., 21 Jan 2026).
3. Layout Reconstruction and Document Consistency
Typhoon OCR models are supervised to generate sequences blending Markdown and HTML markup that express both document text and structural regions. The approach relies on implicit text-region segmentation: the model emits explicit block tags (`<table>`, `<figure>`), but bounding-box localization is learned via token-level supervision rather than any direct region regression.
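An illustrative training target in this mixed form might look as follows; the exact tag inventory and formatting conventions are assumptions, not the released schema:

```python
# Hypothetical target sequence: Markdown for headings and flowing text,
# HTML block tags for structural regions.
target = (
    "# รายงานประจำปี 2567\n\n"  # Thai: "Annual Report 2024"
    "Revenue grew year over year, as summarized below.\n\n"
    "<table>\n"
    "  <tr><th>Quarter</th><th>Revenue</th></tr>\n"
    "  <tr><td>Q1</td><td>1,204</td></tr>\n"
    "</table>\n\n"
    "<figure>Bar chart of quarterly revenue</figure>\n"
)
```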
During inference, the output markup is post-processed to reconstruct the original layout:
- Tokenize the generated markup into text lines with inferred $y$-coordinates.
- Sort lines by vertical center ($y_c$).
- Merge consecutive lines $\ell_i$ and $\ell_{i+1}$ into the same structural block if $|y_c^{(i+1)} - y_c^{(i)}| < \tau$ and their horizontal overlap exceeds a threshold.
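A sketch of this merge procedure; the threshold values ($\tau$ and the overlap ratio) are illustrative, since the source does not specify them numerically:

```python
from dataclasses import dataclass

@dataclass
class Line:
    text: str
    y_center: float  # inferred vertical center
    x0: float        # horizontal extent
    x1: float

def merge_lines(lines: list[Line], tau: float = 12.0,
                min_overlap: float = 0.5) -> list[list[Line]]:
    """Group generated text lines into structural blocks.

    Lines are sorted by vertical center; a line joins the previous block
    when the vertical gap is below tau and the horizontal overlap ratio
    (relative to the narrower line) exceeds min_overlap.
    """
    blocks: list[list[Line]] = []
    for line in sorted(lines, key=lambda l: l.y_center):
        if blocks:
            prev = blocks[-1][-1]
            overlap = min(line.x1, prev.x1) - max(line.x0, prev.x0)
            width = min(line.x1 - line.x0, prev.x1 - prev.x0)
            if (line.y_center - prev.y_center) < tau \
                    and width > 0 and overlap / width > min_overlap:
                blocks[-1].append(line)
                continue
        blocks.append([line])
    return blocks
```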
Structural consistency with reference layouts is quantified by the intersection-over-union (IoU) between predicted and ground-truth region extents:

$$\mathrm{IoU}(B_{\mathrm{pred}}, B_{\mathrm{gt}}) = \frac{|B_{\mathrm{pred}} \cap B_{\mathrm{gt}}|}{|B_{\mathrm{pred}} \cup B_{\mathrm{gt}}|}$$

Blocks whose IoU falls below a fixed threshold are flagged as likely misaligned (Nonesung et al., 21 Jan 2026).
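A direct implementation of this check for axis-aligned block extents (the flagging threshold itself is not given in the source):

```python
def region_iou(a: tuple[float, float, float, float],
               b: tuple[float, float, float, float]) -> float:
    """Intersection-over-union of two axis-aligned regions (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```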
4. Model Efficiency and Deployment Considerations
Typhoon OCR V1.5 achieves substantial reductions in model size and resource requirements (2B parameters, vs. up to 7B in V1). Quantization-aware training allows int8 inference, yielding 50–60% reduction in memory and latency relative to fp16, with minimal loss in extraction fidelity.
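The mechanism behind quantization-aware training can be sketched as per-tensor fake quantization with a straight-through estimator; this is a simplified illustration, as the actual scheme used in V1.5 is not detailed in the source:

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Simulate low-precision (e.g. int8) arithmetic during training.

    Values are scaled to the integer range, rounded, and de-quantized;
    the straight-through estimator passes gradients through the rounding.
    """
    qmax = 2 ** (bits - 1) - 1                            # 127 for int8
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    x_q = q * scale
    # Forward pass uses the quantized value; backward uses the identity.
    return x + (x_q - x).detach()
```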
The V1.5 “image-only” inference mode eliminates reliance on external metadata such as PDF text anchors; document extraction requires only a single call with the page image as input. Resolution-aware preprocessing retains images already below 1,800 px in width at native scale, resizing only when necessary to cap input width. This minimizes computational variance across real-world documents and stabilizes model accuracy during training and inference (Nonesung et al., 21 Jan 2026).
5. Quantitative Evaluation and Comparative Results
Performance evaluation encompasses BLEU (n-gram overlap), ROUGE-L (longest common subsequence), and Levenshtein-based character error rate (CER), defined as edit distance normalized by reference length:

$$\mathrm{CER}(\hat{y}, y) = \frac{\mathrm{Lev}(\hat{y}, y)}{|y|}$$
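These metrics are standard; for concreteness, a self-contained CER implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)
```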
Selected Structure Mode results (Thai, V1, Table 1); each cell reports BLEU / ROUGE-L / Levenshtein (higher / higher / lower is better):

| Category | Typhoon OCR 3B | GPT-4o | Gemini 2.5 Flash |
|---|---|---|---|
| Financial Reports | 0.90 / 0.93 / 0.07 | 0.25 / 0.51 / 0.56 | 0.52 / 0.70 / 0.35 |
| Government Forms | 0.92 / 0.96 / 0.05 | 0.25 / 0.45 / 0.57 | 0.74 / 0.87 / 0.15 |
| Books | 0.64 / 0.71 / 0.32 | 0.34 / 0.49 / 0.59 | 0.47 / 0.59 / 0.47 |
V1.5 (2B) achieves higher average BLEU (0.644) and ROUGE-L (0.774) scores and a lower Levenshtein distance (0.251) than Gemini 2.5 Pro (0.605 / 0.743 / 0.289), GPT-5 (0.459 / 0.618 / 0.390), and V1 7B (0.558 / 0.686 / 0.332). Gains are strongest for structured forms and reports; for infographics and handwritten input the gap narrows, with V1.5 approaching the proprietary state of the art (Nonesung et al., 21 Jan 2026).
Ablation studies indicate that width standardization (1,800 px) directly stabilizes and improves training accuracy compared to variable resolutions. The comparable performance between 3B and 7B variants on government forms suggests that data curation and aligned fine-tuning dominate over raw parameter count for these structured tasks.
6. Error Analysis and Future Research Directions
Typhoon OCR’s residual errors are concentrated on complex figures or severely degraded (e.g., motion-blurred, occluded) document images, indicating these categories as limiting cases for current approaches. This suggests a need for further noise modeling, explicit figure reasoning, or enhanced augmentation in future releases.
The design and empirical results of Typhoon OCR demonstrate that careful adaptation via Thai-centered data engineering and unified sequence supervision enables open, compact VLMs to match or surpass larger, proprietary frontier models, with major implications for resource democratization and reproducible research in multilingual document understanding (Nonesung et al., 21 Jan 2026).