LightOnOCR-2-1B: 1B-Param Multilingual OCR

Updated 21 January 2026

The paper presents a 1B-parameter end-to-end model that directly transduces page images into clean, ordered text sequences with integrated image placeholders, bypassing traditional OCR pipelines.
It leverages a native-resolution ViT encoder, a 2-layer MLP projector, and a Qwen3-based multilingual decoder to achieve robust accuracy on OCR and image localization benchmarks.
Training on a 43-million-page dataset with RL fine-tuning and task-arithmetic merging enhances both robustness and inference speed, setting new performance standards.

LightOnOCR-2-1B is a 1-billion-parameter end-to-end multilingual vision–LLM for high-throughput, robust OCR and document image understanding. Unlike traditional OCR pipelines that disentangle text extraction and layout analysis, LightOnOCR-2-1B directly transduces page images (such as PDFs) into clean, naturally ordered text sequences interleaved with structured placeholders for images and, in specialized variants, spatial localization coordinates. It achieves state-of-the-art accuracy and speed on standard OCR and image localization benchmarks, while being significantly smaller than previous leading models. Model checkpoints, datasets, and evaluations are released under permissive licenses to enable broad adoption and reproducibility (Taghadouini et al., 20 Jan 2026).

1. Model Structure and Parameterization

The LightOnOCR-2-1B architecture integrates three primary subsystems:

Vision Encoder: Native-resolution Vision Transformer (ViT) initialized from Mistral-Small-3.1. Accepts page images up to 1540 px on the longest edge; uses $16 \times 16$ px patches, ensuring spatial structure is preserved. The ViT backbone consists of approximately 24 layers, each with 1024 hidden dimensions and 16 attention heads.
Multimodal Projector: A 2-layer MLP with GELU activations maps ViT features into the embedding space of the LLM decoder. A factor-2 spatial merge operation groups $2 \times 2$ patches, reducing visual token count by 4 $\times$ for tractability.
LLM Decoder: A Qwen3-based, pretrained multilingual decoder (handling 151k tokens) emits a linearized, Markdown-like sequence. The sequence interleaves text and image placeholders (for non-text regions), with image-break/end tokens removed. Decoder layers match ViT in count (24) and size (1024 hidden, 16 heads).

Model initialization leverages pretrained weights (Mistral-Small-3.1 for vision, Qwen3 for language) for strong low-level and multilingual capabilities. The overall parameter count remains at 1 billion.

2. Training Data Composition and Distillation

Training proceeds on a 43-million-page mixture, about 2.5 $\times$ larger than the v1 dataset and derived primarily by distillation from a robust teacher model (Qwen3-VL-235B-A22B-Instruct). The mixture emphasizes:

Scanned and noisy documents: Increased robustness through exposure to rotated, degraded, and noisy images.
Multilingual corpora: Strong representation for European languages, particularly French (using Wikipedia, open corpora, and PDFA scans).
Scientific PDFs: Includes distillation from rendered arXiv pages and the nvpdftex pipeline, which compiles TeX sources to create PNG page images with aligned text and precise region bounding boxes (capturing headers, tables, figures, formulas).
Additional annotations: Cropped regions annotated with GPT-4o to simulate varied layouts; blank-page examples (3% of mix) to prevent hallucinated loops; and public OCR datasets (OlmOCR v0.3, PDFA) for further diversity.

Preprocessing and normalization encompass strict Markdown/LaTeX sanitization, standardization of special cases, loop and repetition filtering, LaTeX-to-Markdown/HTML conversion with KaTeX validation, and pruning lengthy or incomplete samples. Relevant datasets for OCR and bounding boxes are released as lightonai/LightOnOCR-mix-0126 and lightonai/LightOnOCR-bbox-mix-0126, respectively.

3. Pretraining and Optimization Strategies

3.1 Localization-aware Pretraining (Resume Strategy)

To incorporate figure localization, a second phase resumes pretraining from the supervised base checkpoint, increasing exposure to bbox-annotated pages drawn from nvpdftex and teacher model traces. This stage instructs the model to emit normalized (relative to [0,1000]) bounding-box coordinates in tandem with each image placeholder.

3.2 Reinforcement Learning from Vision–Language Rewards (RLVR)

After supervised runs, LightOnOCR-2-1B undergoes RL fine-tuning via GRPO, yielding two primary checkpoints:

OCR-focused RLVR: Rewards include OlmOCR unit test performance, penalties for repetitive outputs, KaTeX/Markdown consistency, and presence of structural document markers (headers/footers).
BBox-focused RLVR: Directly optimizes mean IoU and ID overlap between predicted and ground-truth bounding boxes, computed as

$R_{\mathrm{bbox}} = \underbrace{\frac{1}{|I_\cap|}\sum_{i\in I_\cap}\mathrm{IoU}(\mathcal{B}^{\mathrm{pred}}_i, \mathcal{B}^{\mathrm{gt}}_i)}_{\text{mean IoU over matched IDs}} \;\times\; \underbrace{\frac{|I_\cap|}{\max(|I_{\mathrm{gt}}|,|I_{\mathrm{pred}}|)}}_{\text{ID-overlap factor}}$

where $\mathrm{IoU}(B_p, B_{gt}) = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}$ .

Both RLVR phases use KL-regularized GRPO (β=0.01), group-scaled token rewards, and AdamW optimizer; 28 rollouts per prompt (OCR), 14 (bbox), implemented in HuggingFace TRL and vLLM.

4. Output Sequencing and Structured Prediction

Model outputs form a single Markdown-like stream, interleaving:

Text tokens (including both plain text and LaTeX formula spans);
Image placeholders: ![image](image_N.png);
Optional normalized bounding-box coordinates, formatted as ![image](image_N.png)x1,y1,x2,y2 where $x_1, y_1, x_2, y_2 \in [0,1000]$ .

Example excerpt:

1	... introduction ... ![image](image_3.png)x123,456,789,900 ... conclusion.

Training applies loss masking to coordinates when unavailable. Inference always generates coordinates with image placeholders in bbox-enabled variants.

5. Robustness, Model Ensembling, and Task Merging

5.1 Checkpoint Averaging

The final base model results from "souper" averaging over the last 5 DDP checkpoints. This method empirically increases stability and mean performance across seeds.

5.2 Task-Arithmetic Merging

A weight-space interpolation technique enables explicit toggling between optimized OCR accuracy and image localization capability:

$\theta_{\mathrm{merge}} = \theta_{\mathrm{bbox}} + \alpha\,(\theta_{\mathrm{ocr}} - \theta_{\mathrm{bbox}}),\quad \alpha\in[0,1]$

For $\alpha \in [0.1, 0.4]$ , the merged model retains near state-of-the-art localization while recovering most OCR accuracy (e.g., $\alpha = 0.1$ yields 80.9% OCR, IoU $\approx 0.677$ ). This approach accommodates diverse downstream application needs.

Task Merging Recipe

Step	Operation
1	$\theta_\text{base} \leftarrow$ average_last5_checkpoints()
2	$\theta_\text{ocr} \leftarrow$ RLVR_OCR( $\theta_\text{base}$ )
3	$\theta_\text{bbox} \leftarrow$ resume_pretrain_bbox( $\theta_\text{base}$ ); $\theta_\text{bbox} \leftarrow$ RLVR_bbox( $\theta_\text{bbox}$ )
4	$\theta_\text{ocr\_soup} \leftarrow \theta_\text{base} + 0.4 \cdot (\theta_\text{ocr} - \theta_\text{base})$
5	$\theta_\text{bbox\_soup} \leftarrow \theta_\text{bbox} + 0.1 \cdot (\theta_\text{ocr\_soup} - \theta_\text{bbox})$

6. Empirical Results and Comparative Evaluation

6.1 OCR Benchmarking (OlmOCR-Bench)

On OlmOCR-Bench (with headers/footers excluded), LightOnOCR-2-1B achieves a new state-of-the-art (SOTA) single-pass score:

Model	Size	Overall
Chandra-9B	9B	81.7%
olmOCR-2-8B	8B	80.4%
LightOnOCR-2-1B	1B	83.2%

Performance gains are particularly marked for arXiv pages, older scanned mathematics collections, and tables. RLVR yields a +1.4 pp improvement above the base checkpoint, and task-arithmetic merges restore OCR quality in bbox-enhanced models.

6.2 Image Localization (LightOnOCR-bbox-bench)

Comparison on the image localization benchmark:

Subset	Metric	Chandra-9B	LightOnOCR-2-1B-bbox
OlmOCR(290)	F₁@0.5	0.75	0.78
	Mean IoU	0.71	0.70
	Count Acc.	75.2%	83.8%
arXiv(565)	F₁@0.5	0.81	0.83
	Mean IoU	0.77	0.77
	Count Acc.	81.8%	85.0%

LightOnOCR-2-1B-bbox matches or outperforms the 9B baseline on F₁ and count accuracy, despite being 9 $\times$ smaller.

6.3 Inference Efficiency

Throughput evaluated on NVIDIA H100 80 GB:

Model	Params	Dtype	Throughput (pages/sec)
LightOnOCR-2-1B	1B	BF16	5.71
olmOCR-2-8B	8B	FP8	3.28
Chandra-9B	9B	BF16	1.70

LightOnOCR-2-1B infers at $\sim$ 1.7 $\times$ the speed of the 8B model and $\sim$ 3.3 $\times$ faster than the 9B baseline.

7. Availability and Licensing

All LightOnOCR-2-1B model checkpoints and derivatives are distributed under the Apache 2.0 license at https://huggingface.co/collections/lightonai/lightonocr-2. The training mixtures and bbox-supervised datasets follow their upstream licensing terms:

lightonai/LightOnOCR-mix-0126 (PDFA-derived)
lightonai/LightOnOCR-bbox-mix-0126 (nvpdftex-derived)
lightonai/LightOnOCR-bbox-bench (code and data under Apache 2.0)

Supplementary evaluation scripts and a blog post are made available for broader scientific use and benchmarking.

Markdown Upgrade to Chat

References (1)

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to LightOnOCR-2-1B.