Logics-Parsing-Omni: Unified Multimodal Parsing
- The paper introduces a framework that converts unstructured signals into standardized and traceable knowledge using a progressive, hierarchical parsing paradigm.
- It employs a three-level taxonomy—holistic detection, fine-grained recognition, and multi-level interpreting—to tightly couple low-level evidence with high-level cognition.
- Empirical results on OmniParsingBench show significant gains in graphics reasoning, document parsing, and audio-video analysis, underlining its practical impact.
Logics-Parsing-Omni is a unified multimodal parsing framework and model for converting unstructured inputs from documents, images, audio, and video into standardized, locatable, enumerable, and traceable knowledge. It is introduced as the model instantiation of the broader Omni Parsing framework, whose central objective is to connect fine-grained perception with high-level cognition through a progressive parsing paradigm rather than treating OCR, ASR, captioning, and reasoning as isolated tasks. The framework organizes parsing into hierarchical levels, enforces evidence anchoring between semantic outputs and low-level facts, and evaluates performance with OmniParsingBench across six domains (An et al., 10 Mar 2026).
1. Conceptual scope and problem setting
Logics-Parsing-Omni is motivated by a recurrent fragmentation in multimodal AI. Document pipelines often recover layout and text but lose figure or chart semantics; image captioning can produce fluent descriptions without precise grounding; audio and video captioning can ignore speaker identity, acoustic events, camera motion, or temporal structure; and high-level reasoning is often not anchored to verifiable low-level facts. The framework is designed for settings in which downstream systems require representations that are explicitly tied to coordinates or timestamps, decomposed into discrete entities or events, and traceable back to evidence (An et al., 10 Mar 2026).
Within this formulation, the term parsing is broader than OCR, ASR, or dense captioning. The framework describes a transformation from raw signals into structured knowledge by moving from geometry or time grounding to symbol extraction or attribute parsing and then to logic or reasoning. All outputs are normalized into a structured JSON format so that documents, images, audio, and video or audio-visual streams can be represented in a consistent schema.
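As a minimal sketch of what such a consistent schema could look like, the following Python snippet serializes one spatially grounded item and one temporally grounded item into the same JSON structure. The field names (`item_id`, `bbox`, `time_span`, `content`, `interpretation`) are hypothetical and chosen for illustration; they are not the schema released with the paper.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ParsedItem:
    """One grounded entity or event. All field names are hypothetical."""
    item_id: str
    modality: str                     # "document", "image", "audio", or "video"
    category: str                     # e.g. "table" or "speaker_segment"
    bbox: Optional[list] = None       # [x1, y1, x2, y2] for spatial grounding
    time_span: Optional[list] = None  # [start_s, end_s] for temporal grounding
    content: Optional[str] = None     # recognized symbols, e.g. OCR or ASR text

    def __post_init__(self):
        # Every item must be locatable in space or in time.
        if self.bbox is None and self.time_span is None:
            raise ValueError(f"{self.item_id}: needs a bbox or a time_span")


def to_json(items, interpretation):
    """Serialize grounded items plus a global interpretation into one JSON string."""
    payload = {"items": [asdict(i) for i in items], "interpretation": interpretation}
    return json.dumps(payload, sort_keys=True)


items = [
    ParsedItem("t1", "document", "table", bbox=[40, 120, 560, 380], content="Q1 | 12.4"),
    ParsedItem("s1", "audio", "speaker_segment", time_span=[0.0, 4.2], content="Welcome back."),
]
record = json.loads(to_json(items, "A report page and a spoken greeting."))
```

The design choice illustrated here is that documents and audio share one record type and differ only in which grounding field is populated, which is one way to keep outputs "locatable" across modalities.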
A common misconception is that Logics-Parsing-Omni is merely a larger OCR/ASR system or a generic multimodal captioner. The framework explicitly distinguishes itself from both. Its stated target is neither low-level extraction without meaning nor fluent description without grounded structure; rather, it aims at evidence-based logical induction from unstructured multimodal signals (An et al., 10 Mar 2026).
2. Unified taxonomy and progressive parsing hierarchy
The Omni Parsing framework is defined by a three-level hierarchy that applies across modalities:
- L1-Holistic Detection
- L2-Fine-grained Recognition
- L3-Multi-level Interpreting
At Holistic Detection, the system performs spatio-temporal grounding. In documents and images, this includes objects, text regions, tables, formulas, charts, and figures. In audio, it includes acoustic events and speaker segments. In video, it includes temporal segments, events, shots, scene boundaries, and camera motion. This stage establishes the geometric or temporal baseline for subsequent interpretation.
At Fine-grained Recognition, the model performs symbolization and attribute extraction. The examples given include OCR for documents, images, and videos; ASR for audio and video; chart axes, legends, labels, formulas, and tables; speaker IDs and acoustic event tags; camera motion categories; and knowledge-aware identity recognition for landmarks, brands, and species. This stage converts grounded regions and segments into structured entity-level facts.
At Multi-level Interpreting, the model constructs reasoning chains from local semantics to global logic. Local objects, text, or events are combined into segment-level meaning, and segment-level meaning is then combined into document-, scene-, or video-level interpretation. The framework describes the output at this level as a global narrative or structured explanation that captures trends, causal relations, summaries, and higher-order logic (An et al., 10 Mar 2026).
This layered taxonomy commits the framework to a precise claim: perception and cognition are not parallel modules but hierarchically linked stages. A plausible implication is that the framework treats multimodal reasoning as a constrained continuation of grounded parsing, rather than as a separate free-form generation problem.
3. Evidence anchoring and model architecture
A pivotal mechanism is evidence anchoring, defined as a strict alignment between high-level semantic descriptions and low-level facts. In practice, the framework states that global descriptions are generated only after local entities or events are explicitly grounded; verified knowledge tags are attached only when visual evidence is unambiguous; audio or video narratives are assembled from timestamped chunks, speaker labels, OCR spans, and acoustic tags; and charts or geometry are reverse-rendered into structured forms before reasoning. The claimed effect is improved factuality, interpretability, retrieval usefulness, trustworthiness, hallucination resistance, and traceability (An et al., 10 Mar 2026).
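One way to picture evidence anchoring is as a post-hoc consistency check: every high-level claim must cite at least one grounded low-level item, and every cited item must exist. The sketch below assumes a hypothetical representation in which claims carry an `evidence` list of item ids; it illustrates the idea and is not the paper's mechanism, which is enforced through data construction and training rather than a validator.

```python
def unanchored_claims(claims, evidence_ids):
    """Return ids of claims that cite no evidence or cite unknown evidence ids."""
    known = set(evidence_ids)
    bad = []
    for claim in claims:
        refs = claim.get("evidence", [])
        if not refs or any(ref not in known for ref in refs):
            bad.append(claim["id"])
    return bad


# Hypothetical grounded items: an OCR span, a temporal segment, a speaker label.
evidence_ids = ["ocr_3", "seg_1", "spk_A"]

claims = [
    {"id": "c1", "text": "Speaker A reads the slide title.", "evidence": ["spk_A", "ocr_3"]},
    {"id": "c2", "text": "The audience applauds.", "evidence": []},
    {"id": "c3", "text": "A chart shows rising sales.", "evidence": ["chart_9"]},
]

print(unanchored_claims(claims, evidence_ids))  # prints ['c2', 'c3']
```

Here `c2` fails because it cites nothing and `c3` fails because it cites an item that was never grounded.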
The released model is initialized from Qwen3-Omni-30B-A3B and trained in a two-stage progressive SFT pipeline. In Stage 1: Panoramic Cognitive Foundation, the model is trained on a large 16M-sample supervised corpus to build atomic capabilities including detection, OCR/ASR, captioning, basic parsing, and broad multimodal knowledge; this stage includes around 12.6M samples of knowledge-intensive image QA. In Stage 2: Unified Parsing Alignment, the model is trained on a 5M-sample high-quality, balanced dataset to align outputs toward standardized JSON while preserving both structured extraction and fluent semantic generation.
Both stages use standard autoregressive SFT, that is, next-token prediction.
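In generic notation, the standard next-token objective can be written as follows. The symbols are assumptions made for illustration, not the paper's own notation: $x$ denotes the multimodal input and prompt, $y = (y_1, \dots, y_T)$ the target token sequence, and $\theta$ the trainable parameters.

$$
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, x\right)
$$

Under this reading, Stage 1 and Stage 2 differ in the data distribution over $(x, y)$ rather than in the form of the loss.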
The reported implementation details are specific: the system is built with Megatron-SWIFT; all Qwen3-Omni-30B-A3B components are unfrozen except the talker; and long-context training uses a global batch size of 32, a maximum sequence length of 56k, video sampling at 2 FPS, and a learning-rate schedule with warmup 0.05 and cosine decay (An et al., 10 Mar 2026).
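The warmup and cosine-decay schedule can be sketched as below. This is a common recipe (linear warmup, then cosine decay) and is assumed here rather than taken from the paper: the text does not say whether 0.05 is a fraction of total steps, and the base learning rate is left as a parameter because its value does not appear in this text. The function name `lr_at` and the `min_lr` argument are hypothetical.

```python
import math


def lr_at(step, total_steps, base_lr, warmup_ratio=0.05, min_lr=0.0):
    """Linear warmup followed by cosine decay (assumed recipe, not confirmed by the paper)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly up to base_lr over the warmup steps.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# With base_lr=1.0 the return value can be read as a multiplier on any base rate.
schedule = [lr_at(s, total_steps=1000, base_lr=1.0) for s in range(1001)]
```

With `total_steps=1000`, the multiplier reaches 1.0 at the end of the 50 warmup steps and then decays smoothly toward `min_lr`.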
4. Data construction and benchmark design
The standardized corpus spans four domains: document, image, audio, and video. The document portion includes page-level document parsing, tables, formulas, embedded figures or illustrations, and multilingual content, using public datasets such as olmOCR-mix, FinTabNet, TNCR, PubTabNet, and ChEBI-20-MM, together with in-house data. The image portion covers natural images, knowledge-aware natural images, graphics such as charts, flowcharts, and geometric figures, and multi-image difference tasks. The audio portion includes diarized speaker-attributed transcription, acoustic event detection, and unified semantic chunks with timestamps, speaker IDs, ASR, acoustic tags, and scene descriptions. The video portion includes general natural video, camera-aware video, and text-rich educational or course video, with temporal segmentation, OCR on keyframes, audio-video fusion, camera motion taxonomy, and in-depth structured captions for course videos (An et al., 10 Mar 2026).
To evaluate this breadth, the paper introduces OmniParsingBench, whose major domains are Document, Natural Image, Graphics, Audio, Natural Video, and Text-Rich Video. The benchmark aggregates domain-specific metrics into Perception for L1 and L2, Cognition for L3, and then an Overall score.
The paper gives explicit metric definitions for natural images, audio, natural video, and text-rich video. For natural images, a perception instance counts as correct only when it satisfies the IoU criterion and its semantic or text fields are verified as correct.
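The natural-image correctness rule described above combines a localization test with a content test. A minimal sketch follows, assuming axis-aligned `[x1, y1, x2, y2]` boxes and exact string matching as a stand-in for field verification. The threshold is a parameter because its value does not appear in this text, and the 0.5 used in the example is purely illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def instance_correct(pred, gold, iou_threshold):
    """Correct only if the box overlaps enough AND the text field matches.

    Exact string equality is a simplifying assumption for 'verified' fields.
    """
    return iou(pred["bbox"], gold["bbox"]) >= iou_threshold and pred["text"] == gold["text"]


gold = {"bbox": [1, 1, 10, 10], "text": "EXIT"}
well_placed = {"bbox": [0, 0, 10, 10], "text": "EXIT"}     # IoU = 81 / 100
wrong_text = {"bbox": [0, 0, 10, 10], "text": "EXLT"}      # same box, OCR error

results = [instance_correct(p, gold, iou_threshold=0.5) for p in (well_placed, wrong_text)]
```

The second prediction is rejected despite perfect localization overlap, which reflects the requirement that both grounding and recognition be right for an instance to count.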
The benchmark therefore evaluates not only extraction quality but also structure understanding, scene or document reasoning, and cross-level consistency. A plausible implication is that OmniParsingBench is intended to measure whether a model preserves the connection between local evidence and global interpretation, not merely whether it generates plausible summaries (An et al., 10 Mar 2026).
5. Empirical performance and ablation results
The reported results indicate strong performance across modalities. The following scores are explicitly given for OmniParsingBench and associated evaluations:
| Domain | Reported score | Additional note |
|---|---|---|
| Natural Image | Overall 62.46 | Cognition 74.38 |
| Graphics | Overall 87.43 | Cognition 92.19 |
| Document | 84.90 | on LogicsDocBench |
| Audio | Overall 53.75 | |
| Natural Video | Overall 43.78 | |
| Text-Rich Video | Overall 66.78 | Cognition 79.03 |
For natural images, the model is reported as the best among open-weight models, with General caption: 90.36 and Knowledge reference: 58.40. For graphics, the system reports Perception avg: 82.67 and Cognition avg: 92.19, with particularly strong component scores including Element: 96.16, Relation: 92.27, and Reasoning: 88.13. For document parsing, the paper also reports Overall 92.54 on OmniDocBench v1.5, describing the result as very competitive with specialized OCR systems and best or near-best on formula and table metrics. For audio, the model reports Perception avg: 70.03 and Cognition avg: 37.48. For natural video, a highlighted result is Camera parsing accuracy 60.69. For text-rich video, the paper reports gains in structured reports, OCR extraction, and long-form educational video parsing (An et al., 10 Mar 2026).
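The Graphics and Audio figures above are numerically consistent with the Overall score being the unweighted mean of the Perception and Cognition averages. That is an observation from the reported numbers, not a formula stated in this text, and the short check below makes the arithmetic explicit under that assumption.

```python
# Scores as reported above; the averaging rule itself is an assumption being tested.
reported = {
    "Graphics": {"perception": 82.67, "cognition": 92.19, "overall": 87.43},
    "Audio": {"perception": 70.03, "cognition": 37.48, "overall": 53.75},
}


def mean_overall(perception, cognition):
    """Unweighted mean of the two sub-scores (assumed aggregation rule)."""
    return (perception + cognition) / 2.0


# Gap between the assumed mean and the reported Overall, per domain.
gaps = {
    name: abs(mean_overall(s["perception"], s["cognition"]) - s["overall"])
    for name, s in reported.items()
}

for name, gap in gaps.items():
    print(f"{name}: gap to reported Overall = {gap:.3f}")
```

Both gaps fall within two-decimal rounding, so the simple-mean reading fits these two domains. The other domains cannot be checked this way because their Perception averages are not given here.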
The comparative claim is that the model substantially outperforms Qwen3-Omni-30B-A3B on nearly all tasks and, on many, approaches or surpasses proprietary systems. The paper especially emphasizes graphics reasoning, audio parsing, and text-rich video parsing.
The ablation on graphics cognition is central to the framework’s empirical argument. Three settings are compared: the Qwen3-Omni-30B-A3B baseline, + Graphics (Caption Only), and + Graphics (Parsing + Caption). The reported overall results are 79.50, 79.45, and 92.15, respectively, so adding parsing yields +12.70 over caption-only, while caption-only training leaves the baseline essentially unchanged (a difference of 0.05). The paper further reports that chart logical reasoning rises from 73.97 to 90.87 and geometric quantitative relations improve from 80.39 to 96.08. The accompanying interpretation is explicit: free-form captions alone are insufficient, whereas structured parsing plus captioning materially improves cognitive metrics (An et al., 10 Mar 2026).
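The ablation deltas can be recomputed directly from the reported scores. The snippet below is only worked arithmetic on the numbers quoted above; the dictionary keys are labels chosen here for readability.

```python
# Overall graphics-cognition scores for the three ablation settings, as reported.
overall = {"baseline": 79.50, "caption_only": 79.45, "parsing_plus_caption": 92.15}

# Component scores quoted as before/after pairs.
components = {
    "chart_logical_reasoning": (73.97, 90.87),
    "geometric_quantitative_relations": (80.39, 96.08),
}

caption_vs_baseline = round(overall["caption_only"] - overall["baseline"], 2)
parsing_vs_caption = round(overall["parsing_plus_caption"] - overall["caption_only"], 2)
component_gains = {name: round(after - before, 2) for name, (before, after) in components.items()}

print(f"caption-only vs baseline: {caption_vs_baseline:+.2f}")
print(f"parsing+caption vs caption-only: {parsing_vs_caption:+.2f}")
for name, gain in component_gains.items():
    print(f"{name}: {gain:+.2f}")
```

The caption-only setting moves the overall score by -0.05, while adding parsing supervision moves it by +12.70, with component gains of +16.90 and +15.69. This contrast is what supports the argument that structured parsing, not additional captions, drives the improvement.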
6. Relation to adjacent systems, misconceptions, and limitations
The name Logics-Parsing-Omni can invite terminological confusion. In the cited literature, at least three distinct research threads use closely related language. Logics-Parsing is presented as an end-to-end LVLM-based document parsing framework built on Qwen2.5-VL-7B-Instruct, with a two-stage strategy of supervised fine-tuning and layout-centric reinforcement learning (LC-RL) for layout analysis and reading-order inference (Chen et al., 24 Sep 2025). By contrast, Logics-Parsing-Omni generalizes beyond page images to documents, images, audio, and video, and organizes the task through the L1-L2-L3 taxonomy rather than through document-only layout objectives (An et al., 10 Mar 2026).
A second nearby but separate line concerns logical or formal semantic parsing from natural language into logic. “Statistical Parsing for Logical Information Retrieval” introduces a pipeline in which an LLM preprocesses and reranks, a typed slot grammar deterministically compiles disambiguated sentences into logical form, and a Quantified Boolean Bayesian Network performs proof-traceable reasoning (Coppola, 12 Feb 2026). “NL2LOGIC: AST-Guided Translation of Natural Language into First-Order Logic with LLMs” similarly separates semantic parsing from deterministic code generation through an abstract syntax tree (Putra et al., 29 Jan 2026). These works address logical form construction and solver-ready reasoning, whereas Logics-Parsing-Omni addresses multimodal parsing and evidence-backed structured knowledge across heterogeneous media.
A third adjacent perspective is logical computational linguistics, which promotes grammar as logic and parsing as deduction, emphasizing proof-theoretic guarantees rather than statistical dependencies (Morrill et al., 19 Apr 2026). This suggests that the term logics in contemporary titles can refer either to symbolic proof structure or to grounded, traceable knowledge induction. Logics-Parsing-Omni belongs to the latter category: its central claims concern evidence anchoring, hierarchical parsing, and multimodal traceability, not a formal sequent calculus.
The paper also states several limitations. Localization ambiguity can create false negatives in evaluation; benchmark metrics are recall-oriented, with limited explicit hallucination quantification; knowledge-aware entity parsing remains challenging; the current schema under-explores higher-level affective or aesthetic semantics; and future work is flagged on efficiency and cross-modal continual learning (An et al., 10 Mar 2026).
These caveats are important when interpreting the reported gains. The framework’s empirical evidence supports the claim that fine-grained structural parsing improves higher-level cognition, but it does not claim to have solved hallucination measurement, localization ambiguity, or all forms of semantic abstraction. A plausible implication is that the framework is strongest where cognition can be operationalized as reasoning over explicitly grounded entities, events, text spans, and temporal segments.