OCRBench v2: Multimodal OCR Benchmark
- OCRBench v2 is a benchmark for text-centric evaluation of multimodal models, covering 23 subtasks across 31 diverse real-world scenarios.
- It expands task diversity 4× over its predecessor by including tasks such as handwritten extraction, structured element parsing, and logical reasoning in bilingual formats.
- The benchmark employs task-specific metrics such as IoU, TEDS, and BLEU to diagnose performance gaps and guide improvements in document understanding.
OCRBench v2 is a large-scale, text-centric benchmark targeting the comprehensive evaluation of large multimodal models (LMMs) on visual text localization and reasoning tasks. Designed to address the shortcomings of earlier OCR benchmarks, it expands both task diversity and scenario coverage, thereby revealing persisting challenges in LMM capabilities for practical document understanding and scene text reasoning (Fu et al., 2024).
1. Motivation and Objectives
OCRBench v2 was motivated by the limitations observed in previous OCR-centric evaluation suites, which emphasized text recognition on clean or standardized inputs while omitting fine-grained localization, complex parsing, and reasoning tasks. Existing benchmarks had largely saturated in performance metrics, failing to diagnose critical weaknesses in areas such as handwritten extraction, structured element parsing, and logical reasoning. OCRBench v2 targets these concerns by:
- Expanding the number of evaluated tasks 4× compared to its predecessor, OCRBench v1, increasing from 5 to 23 subtasks.
- Addressing challenging capabilities such as text localization, extraction from handwriting, reasoning with visual elements, and structured data parsing.
- Ensuring scenario diversity (31 domains) to reflect real-world applications including scientific documents, outdoor scenes, forms, charts, and UI screenshots.
- Introducing bilingual evaluation (English and Chinese), with task parity across a core subset.
The primary objective is to enable robust, real-world assessment of LMMs for OCR and higher-order document reasoning, bridging the gap between laboratory setups and production requirements (Fu et al., 2024).
2. Dataset Composition and Scenario Coverage
OCRBench v2 contains over 9,500 images sourced from 31 distinct scenarios and paired with 10,000 human-verified instruction–response pairs. Data was curated from 81 public academic datasets supplemented by private sources, engineered to ensure broad coverage of:
- Document types: receipts, scientific papers, schematics, reports, resumes, emails, ASCII art.
- Capture conditions: indoor/outdoor scenes, various lighting, rotated or artistic fonts, occlusions, and dot-matrix printing.
- Difficulty: Balanced inclusion of high-difficulty samples, identified by manual screening and low OCR-engine performance, ensures that fine-grained perception and complex layouts are represented throughout all subtasks.
- Multilinguality: A comprehensive bilingual subset, with eight subtasks in both English and Chinese, supports rigorous multilingual evaluation (Fu et al., 2024).
Annotation protocols involved converting ground truths into LMM-compatible prompts with structured JSON, HTML, or Markdown output specifications, and all instruction–QA pairs underwent multiple rounds of human verification.
3. Task Taxonomy and Subtask Structure
OCRBench v2 operationalizes 23 subtasks, grouped under eight core capability categories:
| Capability | Example Subtasks | Output Formats |
|---|---|---|
| Text Recognition | Fine-grained region OCR, full-page | Text strings, lines+bboxes |
| Text Referring | Text grounding, VQA with position | Bounding boxes, JSON |
| Text Spotting | End-to-end text spotting | List (bbox, text) pairs |
| Relation Extraction | Key-value extraction, handwritten | JSON key-value maps |
| Element Parsing | Table/chart/document/formula parsing | HTML, LaTeX, JSON |
| Mathematical Calculation | Counting, math QA | Integers, math expr |
| Visual Text Understanding | Cognition VQA, diagram QA, classif. | Labels, text answers |
| Knowledge Reasoning | Reasoning VQA, science QA, UI agent | Free-form, action seqs |
Each task is formulated as instruction + image input, sometimes requiring explicit localization (bounding box in [0,1000]×[0,1000]) or structured semantic extraction (Fu et al., 2024).
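As an illustration of the [0,1000]×[0,1000] localization convention, the following minimal sketch maps pixel-space boxes into that range and back. The corner order `(x_min, y_min, x_max, y_max)`, the rounding to integers, the clamping, and the helper names `normalize_box` and `denormalize_box` are assumptions made for this example, not details confirmed by the benchmark.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def normalize_box(box: Box, width: int, height: int,
                  scale: int = 1000) -> Tuple[int, int, int, int]:
    """Map a pixel-space box to integer coordinates in [0, scale]."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")

    def clamp(value: float) -> int:
        # Round to the nearest integer and keep it inside [0, scale].
        return max(0, min(scale, round(value)))

    x0, y0, x1, y1 = box
    return (
        clamp(x0 / width * scale),
        clamp(y0 / height * scale),
        clamp(x1 / width * scale),
        clamp(y1 / height * scale),
    )


def denormalize_box(box: Tuple[int, int, int, int], width: int, height: int,
                    scale: int = 1000) -> Box:
    """Map a box in [0, scale] coordinates back to pixel space."""
    x0, y0, x1, y1 = box
    return (x0 * width / scale, y0 * height / scale,
            x1 * width / scale, y1 * height / scale)


if __name__ == "__main__":
    # A 400x200 image with a box covering its central region.
    print(normalize_box((100, 50, 300, 150), width=400, height=200))
```

Normalizing to a fixed range makes predicted coordinates independent of image resolution, so models that resize their inputs can still be scored against the same ground truth.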
4. Evaluation Protocols and Metrics
OCRBench v2 employs metric types tailored to task families:
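- Parsing-based outputs (tables/charts/docs): Tree-Edit-Distance-based Similarity (TEDS), $\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max(|T_a|, |T_b|)}$, where $T_a$ and $T_b$ are the predicted and ground-truth trees.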
- Localization and spotting tasks: Intersection-over-Union (IoU) between predicted and ground truth bounding boxes.
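- Extraction tasks: $F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$, where precision and recall are defined over matched key-value pairs.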
- Long-form and free-text QA: BLEU, METEOR, character-level $F_1$, and normalized edit distance.
- Counting tasks: Score is 0 for pathological predictions; otherwise it is based on the normalized difference between the predicted and ground-truth counts.
- Short/long answer VQA: Exact match (for MCQ), substring containment for short spans, and Average Normalized Levenshtein Similarity (ANLS) for longer outputs (Fu et al., 2024).
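Two of the metric families above, box IoU for localization and F1 over key-value pairs for extraction, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the benchmark's official scoring code: boxes are assumed to be `(x_min, y_min, x_max, y_max)`, a key-value pair counts as matched only when both key and value agree exactly, and the function names `box_iou` and `key_value_f1` are hypothetical.

```python
from typing import Dict, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def key_value_f1(pred: Dict[str, str], gold: Dict[str, str]) -> float:
    """F1 over key-value pairs; exact match on key and value is an assumption."""
    matched = sum(1 for key, value in pred.items() if gold.get(key) == value)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))
    print(key_value_f1(
        {"name": "A", "total": "10", "date": "x"},
        {"name": "A", "total": "10", "date": "y", "tax": "1"},
    ))
```

In the second call, two of three predicted pairs are correct (precision 2/3) and two of four reference pairs are recovered (recall 1/2), giving an F1 of 4/7.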
Private 1,500-image test sets (unseen during design) ensure robust zero-shot assessment.
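The string-similarity scoring named in the metric list (normalized edit distance and ANLS) can be sketched as follows. The 0.5 threshold is the value conventionally used for ANLS; whether OCRBench v2 applies the same threshold, and the lowercasing and whitespace stripping done here, are assumptions of this example rather than confirmed details.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming over two rolling rows."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_similarity(pred: str, gold: str) -> float:
    """1 minus the edit distance divided by the longer string's length."""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    longest = max(len(pred), len(gold))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, gold) / longest


def anls(preds: Sequence[str], golds: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over a set of questions."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not preds:
        return 0.0
    total = 0.0
    for pred, answers in zip(preds, golds):
        # Score against the closest of the accepted reference answers.
        best = max((normalized_similarity(pred, g) for g in answers),
                   default=0.0)
        # Similarities at or below the threshold count as wrong.
        total += best if best > threshold else 0.0
    return total / len(preds)


if __name__ == "__main__":
    print(anls(["total 42", "zzz"], [["total 42"], ["abc"]]))
```

The threshold keeps near-miss answers (small OCR slips) partially rewarded while treating unrelated strings as outright errors.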
5. Baseline Performance and Key Findings
OCRBench v2 benchmarked 38 LMMs (31 open-source and 7 closed-source), encompassing models such as Qwen-VL, InternVL2, Gemini-Pro, GPT-4V, and Claude 3.5.
Key empirical findings:
- Most LMMs (36 of 38) scored below 50 on a normalized 100-point scale, even when aggregated across all capabilities.
- Top open-source: Qwen2-VL-8B (≈51.4); Top closed-source: Gemini-Pro (≈51.9).
- Highest performance appears in basic text recognition and VQA tasks (60–80%), dropping steeply for text referring and spotting (<20%), structured parsing (<40%), and reasoning/mathematics (<60%).
- Five pervasive error patterns were diagnosed:
  - Substantial (>30%) accuracy drop for rare/low-frequency text types (e.g., artistic, dot-matrix, symbols).
  - Deficits in fine-grained spatial localization (IoU often <0.2) and dense text spotting.
  - Systematic errors in parsing documents with complex/overlapping layouts.
  - Incomplete or malformed outputs for charts, nested tables, multi-element docs.
  - High error rates (>50%) in visual reasoning, math QA, and logic-intensive subtasks (Fu et al., 2024).
On the private test split, performance trends were highly consistent with the public data, validating benchmark reliability.
6. Context, Comparative Analysis, and Significance
OCRBench v2 markedly broadens capability evaluation over its predecessor and other benchmarks such as DocVQA, TextVQA, ChartQA, and Table-VQA: leading models report saturated performance (>90%) on those benchmarks but fall below 50% on OCRBench v2. This suggests a new standard for stress-testing practical document and scene-text LMM competence. Task diversity, ranging from pixel-level extraction to world-knowledge reasoning, substantially exceeds that of prior benchmarks. Bilingual coverage furthers its utility for multilingual LMM assessment.
Comparative table excerpt:
| Benchmark | Multilingual | Distortions | Task Types | Size (images) |
|---|---|---|---|---|
| OCRBench v1 | ✓ | ✓ | 5 (Recog, QA) | ~1,000 |
| OCRBench v2 | ✓ | ✓ | 23 (full taxonomy) | ~9,500 |
| DocVQA/TextVQA | ✓ | ✓ | QA | <2,000 |
| OmniDocBench | ✗ | ✓ | Parsing, Grounding, Extraction? | ~1,400 |
A plausible implication is that OCRBench v2 will shape future research on OCR-centric LMMs by exposing robustness gaps that prior benchmarks cannot, especially for localization, structured element parsing, and document-level reasoning (Fu et al., 2024).
7. Prospects and Suggested Research Directions
OCRBench v2 highlights persistent limitations in state-of-the-art LMMs, even as raw OCR accuracy improves. Suggested avenues for model development include:
- High-resolution visual encoders and context integration via sparse attention or region-of-interest mechanisms.
- End-to-end training paradigms unifying OCR and language modeling to leverage joint visual-text context.
- Enhanced decoders for structured output domains, especially HTML, Markdown, and graph-based schemas.
- Specialized modules for mathematical/logical inference in visually-grounded scenarios.
- Domain-adapted and multilingual instruction tuning protocols.
Benchmark resources (data splits, evaluation code) and documentation are publicly available at https://github.com/Yuliang-Liu/MultimodalOCR to promote reproducibility and further research (Fu et al., 2024).