LightOnOCR-bbox-bench: Document Image Localization
- LightOnOCR-bbox-bench is a benchmark for image figure localization comprising 2,350 annotated bounding boxes from diverse scientific and archival documents.
- It uses normalized 4-tuple bounding boxes on a fixed 1000×1000 grid to ensure consistent evaluation across varied document layouts and resolutions.
- The evaluation protocol employs metrics such as F@0.5 and mean IoU, supporting end-to-end vision–language models and enabling reproducible comparisons.
LightOnOCR-bbox-bench is a publicly released benchmark designed for evaluating end-to-end document image localization, with a specific focus on the detection and spatial localization of embedded image figures within complex documents. Developed in the context of the LightOnOCR-2-1B project, it provides a rigorous, large-scale testbed for assessing the ability of vision–language models to output both sequence-level transcriptions and precise, normalized bounding-box predictions for visual objects on document pages (Taghadouini et al., 20 Jan 2026).
1. Dataset Composition and Annotation Protocol
LightOnOCR-bbox-bench consists of 855 document pages divided into two primary subsets: a 290-page set manually derived from the OlmOCR-Bench and a 565-page arXiv subset composed of scientific article PDFs compiled via nvpdftex. The manual OlmOCR subset provides challenging cases, including scanned documents, multi-column layouts, tables, and legacy as well as contemporary scientific PDFs featuring dense typography and mathematical notation. The arXiv subset predominantly includes modern and legacy scientific articles.
The dataset is primarily in English, with minor representation of French and other Latin-script documents. Each page contains at least one image figure; bounding-box annotations amount to approximately 1,100 manually reviewed boxes (OlmOCR) and about 1,250 automatically generated boxes (arXiv), totaling roughly 2,350 image boxes across the benchmark. Annotations are defined as normalized axis-aligned rectangles, with coordinates scaled into the integer range $[0, 1000]$ for both the $x$ and $y$ axes. The annotation format attaches bounding-box coordinates directly to the image figure placeholder, e.g.:
| Image ID | Bounding box |
|---|---|
| 1 | 150,200,620,810 |
This example specifies the top-left corner at $(150, 200)$ and the bottom-right corner at $(620, 810)$ on a $1000 \times 1000$ grid, independent of image pixel dimensions (Taghadouini et al., 20 Jan 2026).
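A minimal parsing sketch for such annotations, assuming an (image ID, coordinate string) pairing as in the example; the `ImageBox` type and `parse_box` helper are illustrative, not part of the released benchmark tooling:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ImageBox:
    image_id: int
    x1: int  # top-left x on the 0-1000 grid
    y1: int  # top-left y
    x2: int  # bottom-right x
    y2: int  # bottom-right y

def parse_box(image_id: int, coords: str) -> ImageBox:
    """Parse a '150,200,620,810'-style coordinate string into a box."""
    x1, y1, x2, y2 = (int(v) for v in coords.split(","))
    # Enforce the normalized-grid invariants from the annotation protocol.
    assert 0 <= x1 <= x2 <= 1000 and 0 <= y1 <= y2 <= 1000
    return ImageBox(image_id, x1, y1, x2, y2)

print(parse_box(1, "150,200,620,810"))
# ImageBox(image_id=1, x1=150, y1=200, x2=620, y2=810)
```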
2. Output Specification and Normalization
The output format is strictly normalized: every bounding box is encoded as a 4-tuple $(x_1, y_1, x_2, y_2)$, where all values are integers in $[0, 1000]$, facilitating model output consistency across documents of varying resolutions and aspect ratios.
Models evaluated on LightOnOCR-bbox-bench must emit canonical image placeholders with appended normalized coordinates, providing direct comparability across architectures and obviating the need for post-processing or geometric normalization at evaluation time.
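A minimal sketch of the corresponding pixel-to-grid conversion, assuming axis-aligned boxes in page-pixel coordinates (the function name and example page dimensions are illustrative):

```python
def normalize_box(x1: float, y1: float, x2: float, y2: float,
                  page_w: float, page_h: float) -> tuple[int, int, int, int]:
    """Map a pixel-space box onto the fixed 1000x1000 evaluation grid."""
    def scale(v: float, extent: float) -> int:
        return max(0, min(1000, round(v / extent * 1000)))
    return (scale(x1, page_w), scale(y1, page_h),
            scale(x2, page_w), scale(y2, page_h))

# A box on a 1700x2200 px page maps to resolution-independent coordinates:
print(normalize_box(255, 440, 1054, 1782, 1700, 2200))  # (150, 200, 620, 810)
```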
3. Benchmark Task Definition and Evaluation Protocol
The central task is image localization: given a document page, models must detect all embedded image figures and precisely predict their bounding boxes in normalized coordinates. The protocol stipulates a single-pass, end-to-end pipeline, translating raw pixels directly into tokenized outputs, without recourse to test-time augmentations such as rotations or retries. Two task variants are evaluated independently: performance on the manual OlmOCR subset (290 pages) and on the automated arXiv subset (565 pages).
A true positive is defined as a predicted box with Intersection over Union (IoU) exceeding $0.5$ relative to a ground-truth box, with matching image IDs. This criterion aligns with standard object detection conventions and ensures both spatial alignment and correct instance identification.
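A minimal sketch of this criterion; the IoU computation is standard, while the `is_true_positive` helper and its ID-matching signature are illustrative:

```python
Box = tuple[int, int, int, int]  # (x1, y1, x2, y2) on the 0-1000 grid

def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_id: int, pred: Box, gt_id: int, gt: Box,
                     thr: float = 0.5) -> bool:
    """A prediction counts only if the image IDs match and IoU exceeds 0.5."""
    return pred_id == gt_id and iou(pred, gt) > thr
```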
4. Quantitative Metrics and Comparative Results
LightOnOCR-bbox-bench employs several complementary metrics (a computational sketch follows the list):
- Intersection over Union (IoU): Per box, $\mathrm{IoU}(B_p, B_g) = \frac{|B_p \cap B_g|}{|B_p \cup B_g|}$, the ratio of intersection area to union area.
- Precision: $P = TP / N_{\mathrm{pred}}$; Recall: $R = TP / N_{\mathrm{gt}}$, where $TP$ = true positives at IoU $> 0.5$, $N_{\mathrm{pred}}$ = number of predicted boxes, $N_{\mathrm{gt}}$ = number of ground-truth boxes.
- F@0.5: The F-score $F = 2PR / (P + R)$ at IoU threshold $0.5$.
- Mean IoU: Averaged over matched box pairs.
- Count Accuracy: Fraction of pages where the number of predicted boxes exactly matches the ground truth.
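A minimal sketch of how these metrics combine, assuming ID-keyed boxes per page and the `iou` helper from the Section 3 sketch; pooling true positives across pages before computing F@0.5 is an assumption about the aggregation:

```python
def page_stats(preds: dict, gts: dict, thr: float = 0.5):
    """preds/gts map image_id -> (x1, y1, x2, y2). Returns per-page tallies."""
    matched_ious = [iou(preds[i], gts[i]) for i in preds.keys() & gts.keys()]
    tp = sum(v > thr for v in matched_ious)
    return tp, len(preds), len(gts), matched_ious, len(preds) == len(gts)

def benchmark_metrics(pages: list) -> dict:
    """pages: list of (preds, gts) pairs over a benchmark subset."""
    tp = n_pred = n_gt = count_hits = 0
    all_ious: list[float] = []
    for preds, gts in pages:
        t, p, g, ious, exact = page_stats(preds, gts)
        tp, n_pred, n_gt = tp + t, n_pred + p, n_gt + g
        all_ious += ious
        count_hits += exact
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gt if n_gt else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"F@0.5": f,
            "mean IoU": sum(all_ious) / len(all_ious) if all_ious else 0.0,
            "count acc": count_hits / len(pages)}
```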
A tabulation of the main results for LightOnOCR-2-1B-bbox and the 9B-parameter Chandra-9B baseline follows:
| Model | Subset | F@0.5 | Mean IoU | Count Acc. (%) |
|---|---|---|---|---|
| Chandra-9B | OlmOCR | 0.75 | 0.71 | 75.2 |
| LightOnOCR-2-1B-bbox | OlmOCR | 0.78 | 0.70 | 83.8 |
| LightOnOCR-2-1B-bbox-soup | OlmOCR | 0.76 | 0.67 | 80.7 |
| Chandra-9B | arXiv | 0.81 | 0.77 | 81.8 |
| LightOnOCR-2-1B-bbox | arXiv | 0.83 | 0.77 | 85.0 |
| LightOnOCR-2-1B-bbox-soup | arXiv | 0.82 | 0.76 | 85.1 |
Despite being approximately 9× smaller (1B vs 9B parameters), LightOnOCR-2-1B-bbox matches or outperforms the larger model on both F@0.5 and count accuracy, with comparable mean IoU (Taghadouini et al., 20 Jan 2026).
5. Training Procedure and Model Specialization
LightOnOCR-2-1B-bbox is trained via a staged pipeline:
- Supervised pretraining: Input images are processed at 200 DPI with maximum edge 1540 px. Augmentations include erosion/dilation, affine transforms, grid distortions, and explicit blank-page training. The model is initialized from a text-only OCR checkpoint (LightOnOCR-2-1B-base).
- Resume strategy for box learning: Box-annotated pages are progressively added (from the lightonai/LightOnOCR-bbox-mix-0126 dataset), and the model is trained to emit “…x1,y1,x2,y2” outputs while retaining transcription fidelity.
- Reinforcement learning with verifiable rewards (RLVR): IoU-based rewards optimize localization. The reward per page is $R = \frac{1}{\max(N_{\mathrm{pred}},\, N_{\mathrm{gt}})} \sum_{i \in \mathcal{M}} \mathrm{IoU}\big(B^{(i)}_{\mathrm{pred}}, B^{(i)}_{\mathrm{gt}}\big)$, where $\mathcal{M}$ is the set of matched image IDs. This formulation rewards localization accuracy and penalizes both under- and over-detection, since unmatched boxes add nothing to the sum while enlarging the denominator. RLVR at this stage uses GRPO with AdamW (learning rate $4 \times 10^{-5}$, KL coefficient $0.01$) and 14 rollouts per page; a code sketch of the reward follows this list.
- Checkpoint averaging (“soup”): The last five supervised checkpoints are averaged to yield LightOnOCR-2-1B-bbox-soup.
- Task-arithmetic merging: Interpolating between OCR-only and bbox-specialized checkpoints enables trade-offs between text and box localization accuracy: $\theta_{\mathrm{merged}} = \theta_{\mathrm{OCR}} + \lambda\,(\theta_{\mathrm{bbox}} - \theta_{\mathrm{OCR}})$, with $\lambda$ ranging up to $0.4$ (see the sketch after this list).
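A minimal sketch of the reconstructed per-page reward, assuming ID-based matching as in Section 3 and the `iou` helper from that sketch:

```python
def page_reward(preds: dict, gts: dict) -> float:
    """Sum of IoUs over ID-matched boxes, normalized by max(#pred, #gt):
    missed figures and spurious predictions both shrink the reward.
    Every benchmark page contains at least one figure, so the denominator
    is nonzero on ground-truth pages; the extra 1 guards the general case."""
    matched = preds.keys() & gts.keys()
    return sum(iou(preds[i], gts[i]) for i in matched) / max(len(preds), len(gts), 1)
```

And a sketch of checkpoint souping and task-arithmetic merging over PyTorch state dicts; the function names and the uniform-average soup are assumptions consistent with the description above:

```python
import torch

def soup(states: list[dict[str, torch.Tensor]]) -> dict[str, torch.Tensor]:
    """Uniform average of checkpoints, e.g. the last five supervised ones."""
    return {k: sum(s[k] for s in states) / len(states) for k in states[0]}

def task_arithmetic_merge(ocr: dict[str, torch.Tensor],
                          bbox: dict[str, torch.Tensor],
                          lam: float = 0.4) -> dict[str, torch.Tensor]:
    """theta_merged = theta_ocr + lam * (theta_bbox - theta_ocr).
    lam = 0 recovers pure OCR; larger lam favors box localization."""
    return {k: ocr[k] + lam * (bbox[k] - ocr[k]) for k in ocr}
```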
6. Position within the Broader OCR Localization Landscape
LightOnOCR-bbox-bench extends the scope and scale of document object localization benchmarks by providing a large multilingual and multi-domain dataset, a normalized output protocol integrated with end-to-end vision–language modeling, and a unified evaluation pipeline. In comparison to specialized benchmarks such as SSIG-SegPlate (designed for license plate character segmentation) (Gonçalves et al., 2016), LightOnOCR-bbox-bench targets the broader task of arbitrary figure/image localization in heterogeneous document layouts. Both benchmarks use IoU-based measures, though LightOnOCR-bbox-bench does not incorporate a localization-sensitive penalty such as the Jaccard-Centroid coefficient; instead, it aligns with the dominant object detection evaluation protocols.
A plausible implication is that metrics such as Jaccard-Centroid, which directly couple box centering with downstream recognition accuracy, may augment the sensitivity of LightOnOCR-bbox-bench if extended to fine-grained or rotated object localization scenarios. However, the strict normalization and single-pass generation constraints of LightOnOCR-bbox-bench reflect an emphasis on joint layout/text modeling and document-level inference as required by contemporary end-to-end OCR pipelines (Taghadouini et al., 20 Jan 2026, Gonçalves et al., 2016).
7. Applications and Prospective Developments
By establishing state-of-the-art results for figure/image localization at a significantly reduced parameter count and high throughput, LightOnOCR-bbox-bench provides a strong performance reference and a fair, reproducible comparison ground for a new generation of multilingual document understanding models. Its open release under permissive licenses supports reproducibility and further research, and its structure encourages future exploration of:
- Hybrid metrics that connect spatial precision with document semantics.
- Benchmarking across additional languages, denser image layouts, or rotated figure detection.
- Task interpolation strategies for balancing disparate layout/text objectives in large, multimodal models.
- Transfer learning for downstream tasks such as table extraction or formula localization, using the same output format and normalization scheme.
LightOnOCR-bbox-bench stands as a reference benchmark for holistic document image localization, robust to diverse scientific and archival content, and tailored to the evaluation of unified vision–LLM architectures (Taghadouini et al., 20 Jan 2026).