OmniParsingBench: Unified Multimodal Parsing
- OmniParsingBench is a multimodal parsing benchmark that unifies evaluation across documents, images, charts, audio, and video using a three-tier parsing paradigm.
- It decomposes parsing tasks into hierarchical levels—detection, recognition, and interpreting—to ensure high-level semantic claims are grounded in verifiable low-level evidence.
- The benchmark employs standardized metrics (IoU, mAP, CER, WER) across six domain-specific modules to assess both perception and cognition performance.
OmniParsingBench is a multimodal parsing benchmark introduced in the "Logics-Parsing-Omni Technical Report" to unify evaluation across documents, static images, and audio-visual streams, while operationalizing a three-level parsing paradigm that bridges low-level perception and high-level cognition (An et al., 10 Mar 2026). It starts from the premise that multimodal parsing has been fragmented into separate pipelines for OCR/layout, chart parsing, video captioning, and related tasks, and that high-level semantic descriptions often lack rigorous grounding in detected facts and are therefore prone to hallucination. The benchmark couples hierarchical task decomposition with an evidence anchoring requirement: semantic outputs must cite the factual sub-outputs from which they derive, such as boxes, timestamps, and symbols, so that unstructured signals can be transformed into standardized knowledge that is locatable, enumerable, and traceable (An et al., 10 Mar 2026).
1. Motivation and benchmark objective
OmniParsingBench was created to address three stated goals: to unify evaluation across documents, static images, and audio-visual streams; to evaluate a three-level parsing paradigm that bridges low-level perception and high-level cognition; and to enforce evidence anchoring by requiring all semantic answers to cite the factual sub-outputs from which they derive (An et al., 10 Mar 2026). In the associated framework, multimodal parsing is not treated as a single monolithic prediction problem. Instead, it is decomposed into hierarchical levels that progress from spatial-temporal grounding, to symbolization and attribute extraction, to semantic and logical interpretation.
The benchmark is therefore aligned with the broader Omni Parsing framework rather than with a single modality-specific task family. In the source report, the framework integrates three hierarchical levels: Holistic Detection, which achieves precise spatial-temporal grounding of objects or events to establish a geometric baseline for perception; Fine-grained Recognition, which performs symbolization and attribute extraction on localized objects to complete structured entity parsing; and Multi-level Interpreting, which constructs a reasoning chain from local semantics to global logic (An et al., 10 Mar 2026). A pivotal advantage claimed for this organization is the evidence anchoring mechanism, which enforces strict alignment between high-level semantic descriptions and low-level facts.
A plausible implication is that OmniParsingBench is intended not merely to compare end-task outputs, but to diagnose whether high-level reasoning quality is supported by verifiable low-level perception. That emphasis distinguishes it from benchmarks in which captioning or question answering can be scored without checking whether the underlying claims are grounded in identifiable evidence.
2. Domain coverage and held-out test composition
OmniParsingBench consists of six domain-specific modules, each with a held-out test split, and no training or fine-tuning is performed on these test sets (An et al., 10 Mar 2026). The benchmark spans page images, static natural scenes, charts and geometry diagrams, audio clips, general video, and text-rich long-form video.
| Module | Test composition | Modality emphasis |
|---|---|---|
| Document Parsing (LogicsDocBench) | 900 page-level PDF images covering OCR/layout, tables, formulas | Documents |
| Natural Image Parsing | 1,000 general scenes + 1,000 knowledge-aware scenes with unambiguous real-world entities | Images |
| Graphics Parsing | 500 charts + 500 single-image geometry problems | Charts and geometry |
| Audio Parsing | 2,434 audio clips, sampled from AudioSet-strong and LibriSpeech | Speech and acoustic events |
| Natural Video Parsing | 1,100 general video segments + 725 camera-motion clips | General video |
| Text-Rich Video Parsing | 452 long-form instructional videos with frame-level OCR, segment-level ASR, and global captions | Long-form instructional video |
The dataset composition is itself part of the benchmark’s design argument. Document Parsing covers OCR/layout, tables, and formulas. Natural Image Parsing includes both “general” scenes and “knowledge-aware” scenes with unambiguous real-world entities such as landmarks, brands, and species. Graphics Parsing is divided into a chart submodule with line/bar/pie charts and flowcharts, and a geometry submodule with single-image geometry problems. The audio and video modules extend the benchmark beyond conventional document-vision evaluation by incorporating speech, acoustic events, camera motion, frame-level OCR, segment-level ASR, and global narrative outputs (An et al., 10 Mar 2026).
This breadth matters because the benchmark is intended to test a unified parsing stack rather than isolated subsystems. The held-out, read-only test design also makes the reported setting explicitly evaluative rather than adaptive.
3. Hierarchical taxonomy and benchmark tasks
OmniParsingBench evaluates each domain along three hierarchical parsing levels, denoted L1, L2, and L3 (An et al., 10 Mar 2026). These levels are both a taxonomy and a task decomposition.
At L1 (Holistic Detection), the benchmark scores spatial or temporal grounding of bounding boxes or segments for objects, text blocks, chart regions, video events, camera shots, and audio events. The task is defined as follows: given raw input, localize all relevant instances with bounding-box or timestamp outputs and assign a coarse category label (An et al., 10 Mar 2026). In effect, L1 establishes the geometric or temporal substrate on which later parsing stages depend.
At L2 (Fine-grained Recognition), the benchmark scores symbol extraction and attribute extraction. The reported examples include OCR text, table cells, code blocks, object labels, chart data points, geometric primitives, acoustic labels, and speaker IDs. The task is defined as extracting structured symbols, normalized coordinates for geometry, or speech transcripts and acoustic tags on each L1 region (An et al., 10 Mar 2026). Here the benchmark moves from mere localization to structured entity parsing.
At L3 (Multi-level Interpreting), the benchmark scores semantic captioning, question answering, chart interpretations rendered as HTML or Mermaid code, geometric reasoning, video narratives, course summaries, and logical induction, including chart trends, geometry proofs, and audio-visual causal chains (An et al., 10 Mar 2026). The task is defined as using L1 and L2 results to produce a structured, evidence-anchored interpretation in JSON, such as a global caption, QA-style reasoning output, or code/table reverse rendering.
The benchmark’s internal categories make clear that “parsing” is being used in a broad sense. It includes not only region detection and token recognition, but also semantic normalization and logically structured explanation. This suggests a deliberate attempt to bridge perception-oriented evaluation and cognition-oriented evaluation within one benchmark schema.
4. Evidence anchoring and the JSON evaluation workflow
The defining procedural feature of OmniParsingBench is its evidence anchoring mechanism (An et al., 10 Mar 2026). Every L3 output must reference the exact L1/L2 facts that justify each semantic statement; the report gives the example “Box#3: ‘Bar in April = 42’.” Bench scripts then verify anchoring by matching cited identifiers—box IDs, timestamps, and symbol IDs—against the ground-truth L1/L2 annotations. Failure to anchor a semantic claim is counted as an L3 error (An et al., 10 Mar 2026).
This requirement changes the meaning of semantic correctness. A semantically plausible answer is insufficient if it does not cite the factual sub-outputs from which it derives. In that sense, the benchmark treats hallucination as a grounding failure rather than merely as an incorrect final answer.
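The anchoring check described above can be illustrated with a minimal Python sketch. The field names (`text`, `evidence`), the identifier format (`box#3`, `sym#7`), and the function `unanchored_claims` are assumptions made for illustration; the report's actual JSON schema and bench scripts are not reproduced here.

```python
from typing import Dict, List, Set


def unanchored_claims(l3_claims: List[Dict], known_ids: Set[str]) -> List[str]:
    """Return the text of every L3 claim that cites no evidence
    or cites an identifier absent from the ground-truth L1/L2 set."""
    failures = []
    for claim in l3_claims:
        cited = claim.get("evidence", [])
        if not cited or any(ref not in known_ids for ref in cited):
            failures.append(claim["text"])
    return failures


# Hypothetical ground-truth identifiers taken from L1/L2 annotations.
ground_truth_ids = {"box#1", "box#2", "box#3", "sym#7"}

# Hypothetical L3 output: each claim lists the L1/L2 identifiers it relies on.
claims = [
    {"text": "Bar in April = 42", "evidence": ["box#3"]},
    {"text": "Sales peak in April", "evidence": ["box#3", "box#9"]},  # box#9 is unknown
    {"text": "The chart shows a seasonal trend", "evidence": []},     # no citation
]

failed = unanchored_claims(claims, ground_truth_ids)
print(failed)
```

In this sketch, a claim that cites nothing and a claim that cites a nonexistent identifier are both counted as L3 errors, which mirrors the rule that failure to anchor a semantic claim is penalized regardless of how plausible the claim sounds.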
The evaluation procedure is correspondingly standardized. For each sample, the model produces a single JSON object containing L1, L2, and L3 outputs. The benchmark scripts then extract detections for IoU and mAP, text for CER and WER, and semantic outputs for Exact-Match (EM), Knowledge-Reference Rate (KR), and Logical Consistency (LC). Domain-level Perception, Cognition, and Overall scores are computed, and a unified leaderboard ranks models by the sum of their Overall scores across domains (An et al., 10 Mar 2026).
Because the JSON contains all three levels, the benchmark supports cross-level consistency checks rather than isolated module scoring. A plausible implication is that the benchmark encourages architectures that can retain traceability from raw evidence to final semantic claims, instead of post hoc explanation layers detached from the perception stack.
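The leaderboard rule, ranking models by the sum of their per-domain Overall scores, can be sketched as follows. The model names and score values are placeholders, not reported results, and `rank_models` is a hypothetical helper rather than part of the benchmark's tooling.

```python
from typing import Dict, List, Tuple


def rank_models(overall: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Sum each model's per-domain Overall scores and sort in descending order."""
    totals = {model: sum(scores.values()) for model, scores in overall.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Placeholder values for illustration only (not taken from the report).
placeholder_scores = {
    "model_a": {"document": 80.0, "audio": 50.0},
    "model_b": {"document": 70.0, "audio": 65.0},
}

ranking = rank_models(placeholder_scores)
print(ranking)
```

A sum over domains weights every domain equally, so a model that is strong in one domain but weak in another can still be outranked by a more balanced one, as the placeholder example shows.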
5. Metrics and score aggregation
OmniParsingBench groups metrics by level and domain, then aggregates them into two composite scores, Perception for L1+L2 and Cognition for L3. The reported Overall score per domain is the simple average of Perception and Cognition (An et al., 10 Mar 2026).
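For L1 detection, the benchmark uses Intersection-over-Union for each predicted box $B_p$ and ground-truth box $B_g$:
$$\mathrm{IoU}(B_p, B_g) = \frac{|B_p \cap B_g|}{|B_p \cup B_g|}$$
It also reports Recall@$\tau$ and Precision@$\tau$, where a prediction counts as a true positive when its IoU with a ground-truth instance reaches the threshold $\tau$:
$$\mathrm{Recall@}\tau = \frac{TP_\tau}{TP_\tau + FN_\tau}, \qquad \mathrm{Precision@}\tau = \frac{TP_\tau}{TP_\tau + FP_\tau}$$
Mean Average Precision is defined through the per-category average precision $AP_c$ over the category set $C$:
$$\mathrm{mAP} = \frac{1}{|C|} \sum_{c \in C} AP_c$$
For L2 recognition, OCR text accuracy uses Character Error Rate:
$$\mathrm{CER} = \frac{S + D + I}{N}$$
where $S$, $D$, and $I$ are substitutions, deletions, and insertions, and $N$ is ground-truth length. ASR accuracy uses Word Error Rate, with the same quantities counted over words rather than characters:
$$\mathrm{WER} = \frac{S_w + D_w + I_w}{N_w}$$
Acoustic event detection uses segment-level F1, with $P$ and $R$ the segment-level precision and recall:
$$F_1 = \frac{2 P R}{P + R}$$
For L3 interpreting, the benchmark uses Exact-Match for QA, computed over $M$ questions with predicted answers $\hat{a}_i$ and reference answers $a_i$,
$$\mathrm{EM} = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}[\hat{a}_i = a_i]$$
together with a Knowledge-Reference Rate (KR) for images and a Logical Consistency Score (LC) obtained via LLM-based QA.
The report defines the aggregations so that Perception combines the L1 and L2 metrics, Cognition combines the L3 metrics, and
$$\mathrm{Overall} = \frac{\mathrm{Perception} + \mathrm{Cognition}}{2}$$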
These choices are consequential. They make L3 explicitly measurable as a cognition-oriented layer rather than as an unstructured free-text side task, while still keeping the underlying perception metrics visible.
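The L1 and L2 metrics named in this section have standard definitions that can be sketched in a few lines of Python. The sketch below assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` and computes CER and WER from Levenshtein edit distance; the function names are illustrative, and any text normalization the benchmark applies before scoring is not modeled.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), an assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Minimum number of substitutions, deletions, and insertions (S + D + I)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance over characters / reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: edit distance over whitespace tokens / reference word count."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(cer("april", "apral"))
print(wer("the bar in april", "the bar in may"))
```

Both error rates can exceed 1.0 when the hypothesis is much longer than the reference, since insertions are counted against the reference length.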
6. Representative results, ablations, and relation to adjacent benchmarks
The report provides representative domain-level results for Logics-Parsing-Omni on OmniParsingBench (An et al., 10 Mar 2026). For Natural Image, the scores are Overall 62.46, Perception 50.53, and Cognition 74.38. For Graphics, they are 87.43, 82.67, and 92.19. For Document, the benchmark reports Overall 84.90 and Perception 84.90, with no L3. For Audio, the scores are 53.75, 70.03, and 37.48. For Natural Video, they are 43.78, 51.76, and 35.79. For Text-Rich Video, they are 66.78, 54.52, and 79.03.
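As a quick consistency check, the figures above can be compared against the stated aggregation rule that Overall is the simple average of Perception and Cognition. The sketch below uses only the numbers quoted in this section; Document is omitted because it has no L3 score, and the helper name `max_deviation` is illustrative.

```python
# (Overall, Perception, Cognition) as quoted above for Logics-Parsing-Omni.
reported = {
    "Natural Image": (62.46, 50.53, 74.38),
    "Graphics": (87.43, 82.67, 92.19),
    "Audio": (53.75, 70.03, 37.48),
    "Natural Video": (43.78, 51.76, 35.79),
    "Text-Rich Video": (66.78, 54.52, 79.03),
}


def max_deviation(scores: dict) -> float:
    """Largest gap between a reported Overall and the mean of Perception and Cognition."""
    return max(abs(overall - (perception + cognition) / 2)
               for overall, perception, cognition in scores.values())


for domain, (overall, perception, cognition) in reported.items():
    print(f"{domain}: reported {overall:.2f}, mean {(perception + cognition) / 2:.3f}")

print(max_deviation(reported) < 0.01)
```

Every reported Overall matches the mean of its two components to within two-decimal rounding, which is consistent with the aggregation rule as stated.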
The same source states that Logics-Parsing-Omni outperforms the open-weight baselines Qwen3-Omni and Qwen3-VL, as well as GPT-5.2, in all six domains, and matches or exceeds proprietary Gemini-3-Pro in most L3 cognitive measures (An et al., 10 Mar 2026). The reported Graphics ablation is especially central to the benchmark's rationale: removing structured L1/L2 supervision severely degrades L3 logical reasoning, with the Cognition score falling from 92.19 to approximately 80. In the benchmark's framing, this underlines the necessity of evidence anchoring.
A common source of confusion is nomenclature. In surrounding benchmark literature, OmniDocBench has also been referred to as OmniParsingBench, but it denotes a document-centered benchmark for PDF parsing with 981 PDF pages from nine distinct sources, 19 layout categories, and 15 attribute labels (Ouyang et al., 2024). By contrast, the benchmark in the Logics-Parsing-Omni report is explicitly multimodal, spanning documents, images, graphics, audio, and video (An et al., 10 Mar 2026). The overlap in naming suggests a shared interest in unified parsing evaluation, but the scopes are not identical.
There is also a useful historical parallel with BenchCLAMP, a benchmark for constrained LLM parsing across seven semantic parsing datasets and two syntactic parsing datasets (Roy et al., 2022). BenchCLAMP formalizes output validity with context-free grammars and constrained decoding, whereas OmniParsingBench formalizes multimodal grounding with evidence-anchored JSON outputs. BenchCLAMP’s own discussion of an ideal unified parsing benchmark called “OmniParsingBench” proposed multi-regime splits, formal grammars, standardized metrics, and shared constrained-decoding tooling (Roy et al., 2022). This suggests a convergence between text-centric parsing research and multimodal parsing research around a common objective: standardized, cross-task evaluation with explicit validity constraints, even though the technical mechanisms differ.
Taken together, these results and comparisons position OmniParsingBench as a benchmark centered on traceable multimodal knowledge extraction rather than on isolated captioning or recognition accuracy alone. Its distinctive contribution is the requirement that semantic interpretation remain explicitly tethered to verifiable low-level evidence.