Omni Parsing: Unified Document Parsing

Updated 4 July 2026

Omni Parsing is a unified parsing paradigm that converts heterogeneous, unstructured inputs into structured, machine-readable representations, integrating text, tables, figures, and formulas.
It employs end-to-end methods to generate single-sequence outputs in formats like Markdown, HTML, or LaTeX, preserving reading order and hierarchical structure.
The approach extends to multimodal inputs, leveraging techniques such as reinforcement learning and prompt-guided decoding to optimize structural fidelity, speed, and robustness.

Omni Parsing denotes a unified parsing paradigm that converts heterogeneous, unstructured inputs into structured, machine-readable representations while preserving content, structure, and, where relevant, reading order or spatiotemporal grounding. In recent document-parsing work, it is defined as unified, end-to-end parsing of scanned documents that jointly recovers paragraphs and headers, tables, figures, and mathematical formulas across diverse page layouts, languages, and distributions, often by directly translating a page image into a single Markdown/HTML/LaTeX sequence (Wang et al., 17 Oct 2025). Closely related work extends the same unifying idea to visually-situated text parsing, GUI screenshots, and multimodal streams spanning documents, images, audio, and video, with outputs designed to be locatable, enumerable, and traceable (Yu et al., 22 Feb 2025, Lu et al., 2024, An et al., 10 Mar 2026).

1. Conceptual scope

In the literature, Omni Parsing is characterized less by a single architecture than by a common objective: collapsing heterogeneous subproblems into one structured prediction interface. For scanned documents, this means recovering text hierarchy, tables, formulas, figures, and reading order in one pass or one coordinated framework. For visually-situated text parsing, it means unifying text spotting, key information extraction, table recognition, and layout analysis within one encoder-decoder and one sequence format. For multimodal parsing, it expands to a progressive pipeline that links local perception to higher-level interpretation.

Domain	Representative formulation	Structured output
Scanned and digital documents	Page image to a single structured parse	Markdown/HTML/LaTeX or XML/JSON-like sequence
Visually-situated text	Unified parsing of spotting, KIE, tables, layout	Structured points, polygons, text, HTML
GUI screenshots	Screen parsing for agent grounding	DOM-like elements, Set-of-Marks, local semantics
Multimodal streams	Progressive parsing across documents, images, audio, video	Unified JSON and anchored knowledge tuples

This common framing is explicit in "Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing" (Wang et al., 17 Oct 2025), "OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal LLMs" (Yu et al., 22 Feb 2025), "OmniParser for Pure Vision Based GUI Agent" (Lu et al., 2024), and "Logics-Parsing-Omni Technical Report" (An et al., 10 Mar 2026). A plausible implication is that Omni Parsing is best understood as a family of unification strategies rather than a single benchmark task.

2. Document-centric formulations

The document-parsing literature supplies the most concrete and technically mature use of the term. Infinity-Parser instantiates omni parsing by directly translating a scanned page image into a document-level Markdown/HTML/LaTeX sequence that interleaves text, tables, figures, and formulas in the correct reading order; layout cues are encoded implicitly as structural tokens, and reading order is enforced by sequence order and by a dedicated order-preservation reward (Wang et al., 17 Oct 2025). The model is built by fine-tuning Qwen2.5-VL-7B via reinforcement learning, without altering the underlying visual backbone or language decoder.

Other document systems adopt different factorizations while retaining the same objective. Dolphin-v2 uses a two-stage pipeline in which Stage 1 jointly performs document type classification and layout analysis, and Stage 2 chooses between holistic page-level parsing for photographed documents and element-wise parallel parsing for digital-born documents. Its Stage-1 outputs include finer-grained element detection with 21 categories, reading-order prediction, and optional semantic attributes such as author information and document metadata (Feng et al., 5 Feb 2026). This architecture explicitly rejects a one-size-fits-all inference path: photographed pages are handled holistically to avoid axis-aligned cropping failures under skew, curvature, or perspective distortion, whereas digital-born pages are parsed through anchor-guided parallel extraction.
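
The type-aware routing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Dolphin-v2's implementation: the `Anchor` and `Stage1Result` containers, the `route` function, the document-type labels, and the job-name strings are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Anchor:
    # Hypothetical container for one Stage-1 layout element.
    category: str                    # e.g. "paragraph", "table", "formula"
    bbox: Tuple[int, int, int, int]  # absolute pixel box (x0, y0, x1, y1)


@dataclass
class Stage1Result:
    doc_type: str                    # assumed labels: "photographed" or "digital"
    anchors: List[Anchor] = field(default_factory=list)  # already in reading order


def route(stage1: Stage1Result) -> List[str]:
    """Return the Stage-2 parsing jobs implied by a Stage-1 result."""
    if stage1.doc_type == "photographed":
        # Holistic branch: one page-level job, so distorted regions are never cropped.
        return ["parse_page"]
    # Digital-born branch: one job per anchor; these jobs could run in parallel.
    return [f"parse_{a.category}@{a.bbox}" for a in stage1.anchors]
```

The point of the sketch is only that the anchors are consumed in one branch and bypassed in the other; how each job is actually prompted and decoded is outside its scope.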

"Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training" formalizes the same end-to-end goal as a mapping from a document image $I$ to a structured textual sequence $Y$ , with $p_\theta(Y\mid I)=\prod_t p_\theta(y_t \mid I, y_{<t})$ , and uses explicit structure tokens such as <doc>, <paragraph>, <table>, <tr>, and <td> to encode grammar, nesting, and reading order (Li et al., 25 Mar 2026). Youtu-Parsing takes a decoupled but still unified route: a NaViT-style dynamic-resolution ViT extracts shared page features once, and a prompt-guided Youtu-LLM-2B performs layout analysis plus region-prompted decoding for text, formulas, tables, charts, seals, and hierarchical structures (Yin et al., 28 Jan 2026).

A recurring distinction therefore emerges within document-centric omni parsing. Some systems are single-sequence generators with no explicit box output at inference time, as in Infinity-Parser and DocHumming, the latter being the model of Li et al. (Wang et al., 17 Oct 2025, Li et al., 25 Mar 2026). Others retain explicit anchors or regions, as in Dolphin-v2 and Youtu-Parsing (Feng et al., 5 Feb 2026, Yin et al., 28 Jan 2026). This suggests that “omni” refers to the breadth of jointly modeled structure, not to a mandatory absence or presence of layout primitives.

3. Unified representations and structured outputs

A defining feature of Omni Parsing is the replacement of task-specific outputs with a common representational substrate. In Infinity-Parser, the substrate is a single autoregressive sequence containing Markdown headings and paragraphs, HTML or Markdown tables, LaTeX formulas, and the reading order encoded by segment order (Wang et al., 17 Oct 2025). In DocHumming, the output is likewise a single linearized sequence with explicit structure tokens, optimized to remain semantically accurate and structurally valid over long contexts (Li et al., 25 Mar 2026).
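
What "structurally valid" means for a linearized sequence with explicit structure tokens can be illustrated with a small well-formedness check. The sketch below is an assumption-laden example rather than any system's actual validator: the tag names follow those quoted for the end-to-end formulation (`<doc>`, `<paragraph>`, `<table>`, `<tr>`, `<td>`), while the closing-tag convention and the function name `is_well_nested` are hypothetical.

```python
from typing import List

# Opening tags named in the text; the matching "</...>" closing tags are an assumption.
STRUCTURE_TAGS = {"doc", "paragraph", "table", "tr", "td"}


def is_well_nested(tokens: List[str]) -> bool:
    """Check that structure tokens open and close in properly nested order."""
    stack: List[str] = []
    for tok in tokens:
        if tok.startswith("</") and tok.endswith(">"):
            name = tok[2:-1]
            if name in STRUCTURE_TAGS:
                # A closing tag must match the most recently opened tag.
                if not stack or stack[-1] != name:
                    return False
                stack.pop()
        elif tok.startswith("<") and tok.endswith(">"):
            name = tok[1:-1]
            if name in STRUCTURE_TAGS:
                stack.append(name)
        # Any other token is treated as ordinary content and ignored.
    return not stack  # every opened tag must have been closed
```

A single misplaced or missing structure token makes the whole sequence fail this check, which is one way to see why such tokens carry more weight than ordinary content tokens.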

The visually-situated text parsing (VsTP) literature makes this unification particularly explicit. "OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition" casts text spotting, key information extraction (KIE), and table recognition as point-conditioned text generation with a shared encoder-decoder, unified objective, prompt-based conditioning, and a coordinate vocabulary of 1000 bins (Wan et al., 2024). "OmniParser V2" refines this into Structured-Points-of-Thought (SPOT), a two-stage schema in which Stage 1 emits a Structured Points Sequence consisting of center-point tokens interleaved with structural tags such as {address}, {tr}, {line}, or {para}, and Stage 2 emits polygons or quadrilaterals plus character-level transcription for each point (Yu et al., 22 Feb 2025). The point tokens act as explicit bridges between localization and recognition.
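
A coordinate vocabulary of 1000 bins implies that continuous positions are discretized before they can be emitted as tokens. The sketch below shows one plausible way to do this. The text states only the vocabulary size, so the uniform binning, the clamping rule, and the function names are assumptions.

```python
from typing import Tuple

NUM_BINS = 1000  # size of the coordinate vocabulary mentioned in the text


def quantize_point(x: float, y: float, width: int, height: int) -> Tuple[int, int]:
    """Map a pixel-space point to a pair of discrete coordinate-token indices."""
    def to_bin(value: float, extent: int) -> int:
        # Scale to [0, NUM_BINS) and clamp so the right/bottom edge stays in range.
        idx = int(value / extent * NUM_BINS)
        return min(max(idx, 0), NUM_BINS - 1)
    return to_bin(x, width), to_bin(y, height)


def dequantize_point(bx: int, by: int, width: int, height: int) -> Tuple[float, float]:
    """Recover the pixel position at the centre of each bin."""
    return (bx + 0.5) / NUM_BINS * width, (by + 0.5) / NUM_BINS * height
```

Under this scheme the localization error introduced by quantization is at most half a bin width per axis, independent of the image resolution.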

Dolphin-v2 generalizes the notion of a promptable structural unit through anchors. An anchor contains categorical type, absolute bounding-box coordinates in pixels, optional attributes, and implicit positional features because anchors are emitted as an ordered sequence. These anchors are then converted into task-specific prompts such as paragraph, table, formula, or code prompts, or bypassed entirely in the photographed-document branch (Feng et al., 5 Feb 2026). Logics-Parsing-Omni extends the same logic across modalities by representing parsed entities as anchored knowledge tuples $\kappa=(s,p,o,\alpha)$, where $\alpha$ includes spatial and temporal anchors plus evidence pointers, thereby standardizing multimodal knowledge into a spatiotemporal-anchored graph or triple form (An et al., 10 Mar 2026).
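
As a concrete reading of the tuple $\kappa=(s,p,o,\alpha)$, the following sketch models it as a small immutable record. Only the four-part shape and the three kinds of content in $\alpha$ come from the text. Every field name, the unit choices, and the `is_traceable` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EvidenceAnchor:
    # All field names are hypothetical; the text only says that alpha holds
    # spatial anchors, temporal anchors, and evidence pointers.
    bbox: Optional[Tuple[float, float, float, float]] = None  # spatial anchor
    time_span: Optional[Tuple[float, float]] = None           # temporal anchor, in seconds
    evidence_ids: Tuple[str, ...] = ()                        # pointers to parsed elements


@dataclass(frozen=True)
class KnowledgeTuple:
    subject: str            # s
    predicate: str          # p
    obj: str                # o
    anchor: EvidenceAnchor  # alpha


def is_traceable(k: KnowledgeTuple) -> bool:
    """A tuple is traceable if it carries at least one anchor or evidence pointer."""
    a = k.anchor
    return a.bbox is not None or a.time_span is not None or bool(a.evidence_ids)
```

For example, `KnowledgeTuple("speaker", "mentions", "quarterly revenue", EvidenceAnchor(time_span=(12.0, 15.5)))` would tie a statement extracted from audio to the interval in which it was heard.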

Across these formulations, the representational problem is not merely serialization. The structured sequence or anchored tuple has to carry locality, hierarchy, and inter-element relations. In document parsing, this includes reading order, merged cells, and formula syntax; in layout analysis, it includes paragraph-line-word grouping; in multimodal parsing, it includes evidence anchoring and traceability.

4. Optimization, supervision, and decoding

Omni Parsing systems differ sharply in how they enforce structural fidelity. Infinity-Parser introduces LayoutRL, a reinforcement learning framework trained with Group Relative Policy Optimization (GRPO). Its multi-aspect reward is

$R_{\mathrm{Multi\text{-}Aspect}} = R_{\mathrm{dist}} + R_{\mathrm{count}} + R_{\mathrm{order}},$

where $R_{\mathrm{dist}}$ is derived from normalized edit distance, $R_{\mathrm{count}}$ penalizes paragraph-count mismatch, and $R_{\mathrm{order}}$ penalizes inversions in matched segment order after Hungarian matching; the main results use RL directly on Qwen2.5-VL-7B without supervised fine-tuning (Wang et al., 17 Oct 2025). This design directly optimizes content fidelity, segmentation accuracy, and reading-order preservation at page level.
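
A rough sketch of how the three reward terms might be computed for one predicted page is given below. It is not the LayoutRL implementation. The exact functional forms are assumptions (each term is mapped to $[0,1]$ as one minus a normalized penalty), and a greedy nearest-segment matcher is used as a simple stand-in for the Hungarian matching that the paper describes.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def greedy_match(pred: List[str], ref: List[str]) -> List[int]:
    """For each predicted segment, pick the closest unused reference index.
    Greedy stand-in for Hungarian matching; -1 means no reference was left."""
    unused = set(range(len(ref)))
    matches = []
    for p in pred:
        if not unused:
            matches.append(-1)
            continue
        best = min(sorted(unused), key=lambda j: normalized_edit_distance(p, ref[j]))
        unused.discard(best)
        matches.append(best)
    return matches


def count_inversions(order: List[int]) -> int:
    """Number of index pairs that appear in the wrong relative order."""
    return sum(1 for i in range(len(order))
               for j in range(i + 1, len(order)) if order[i] > order[j])


def multi_aspect_reward(pred: List[str], ref: List[str]) -> float:
    # Content term: similarity of the full page text (assumed form).
    r_dist = 1.0 - normalized_edit_distance("\n".join(pred), "\n".join(ref))
    # Count term: relative paragraph-count mismatch, capped at 1 (assumed form).
    r_count = 1.0 - min(abs(len(pred) - len(ref)) / max(len(ref), 1), 1.0)
    # Order term: fraction of inverted pairs among matched segments (assumed form).
    matched = [m for m in greedy_match(pred, ref) if m >= 0]
    max_inv = len(matched) * (len(matched) - 1) // 2
    r_order = 1.0 - (count_inversions(matched) / max_inv if max_inv else 0.0)
    return r_dist + r_count + r_order
```

With these forms a perfect prediction scores 3.0, and a prediction containing the right paragraphs in reversed order keeps its count term but loses the entire order term.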

DocHumming instead emphasizes structure at the token level. Its Document-Aware Training Recipe couples a progressive training paradigm—first short-context element-level training, then long-context document-level training—with Structure-Token Aware Optimization:

$L_{\mathrm{structured}} = -\sum_t \alpha_t y_t \log P(x_t \mid x_{<t}),$

where the weighting terms are set separately for structure tokens and for Stage 2 training (Li et al., 25 Mar 2026). The stated motivation is that errors on structural tokens propagate and trigger repetition; the weighted loss is therefore used to improve tag fidelity, reduce malformed hierarchies, and curb repetition and hallucination.
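
The effect of a structure-aware weight can be seen in a toy computation of the loss for one sequence. The sketch makes two assumptions that go beyond the text: $y_t$ is read as the indicator of the reference token (so only the reference token's probability enters the sum), and $\alpha_t$ takes a hypothetical value of 2.0 on structure tokens and 1.0 elsewhere.

```python
import math
from typing import Sequence


def structure_weighted_nll(target_probs: Sequence[float],
                           is_structure: Sequence[bool],
                           structure_weight: float = 2.0) -> float:
    """Weighted negative log-likelihood over one target sequence.

    target_probs[t] is the model probability of the reference token at step t.
    structure_weight is a hypothetical value for alpha_t on structure tokens;
    all other tokens use alpha_t = 1 in this sketch.
    """
    if len(target_probs) != len(is_structure):
        raise ValueError("inputs must have the same length")
    loss = 0.0
    for p, structural in zip(target_probs, is_structure):
        alpha = structure_weight if structural else 1.0
        loss -= alpha * math.log(p)
    return loss
```

With equal model confidence on every token, a mistake on a structure token costs `structure_weight` times as much as the same mistake on a content token, which is the mechanism by which the loss prioritizes tag fidelity.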

Efficiency-oriented work targets the decoding bottleneck. "Efficient Document Parsing via Parallel Token Prediction" introduces Parallel-Token Prediction (PTP), which inserts learnable register tokens after each regular token during training and optimizes a corresponding training objective, allowing multiple future tokens to be generated in parallel per step without changing the backbone architecture (Li et al., 16 Mar 2026). Youtu-Parsing pushes this further with token parallelism and query parallelism. Up to 64 candidate tokens are proposed per iteration via mask tokens and verified by a second forward pass, while up to five region queries can be decoded simultaneously from shared page features; the verification rule accepts the longest exact-match prefix, preserving equivalence to standard autoregressive decoding (Yin et al., 28 Jan 2026).
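
The longest exact-match prefix rule can be stated in a few lines. In this sketch, `proposed` stands for the candidate tokens drafted in parallel and `verified` for the tokens the second forward pass would produce at the same positions. Both names and the list-of-integers interface are assumptions made for illustration.

```python
from typing import List, Sequence


def accept_longest_prefix(proposed: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Keep proposed tokens up to the first disagreement with the verifier."""
    accepted: List[int] = []
    for p, v in zip(proposed, verified):
        if p != v:
            break  # everything from the first mismatch onward is discarded
        accepted.append(p)
    return accepted
```

Because every accepted token equals what the verifier would have produced, the accepted prefix is the same as the output of token-by-token decoding. The speedup comes from the cases where many candidates are accepted in a single iteration.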

These approaches illustrate three distinct optimization philosophies. LayoutRL encodes page-level structure as a verifiable reward. Structure-token weighting encodes structural importance in the training loss. Parallel token and query decoding encode the assumption that many document outputs are high-certainty transcription problems rather than open-ended generation tasks. A plausible implication is that omni parsing research is increasingly organized around structured supervision and decoding control rather than around architectural novelty alone.

5. Evaluation regimes and empirical profile

Benchmarking in Omni Parsing is dominated by metrics that score both content and structure. Infinity-Parser reports on OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet, using normalized edit distance for text, formulas, and reading order, TEDS for tables, and table edit distance. On OmniDocBench, Infinity-Parser-7B reports Overall Edit for English and Chinese, Table TEDS, and Read Order Edit; on olmOCR-Bench it reports an overall score; and on PubTabNet and FinTabNet it reports TEDS-S and TEDS (Wang et al., 17 Oct 2025).

Dolphin-v2 emphasizes robustness under photographed and distorted conditions. On OmniDocBench v1.5 it reports an Overall score that improves on the original Dolphin. On RealDoc-160, it reports Edit Distance for English, for Chinese, and on average, and the paper states an error reduction on photographed documents (Feng et al., 5 Feb 2026). The key empirical claim is not merely better OCR, but better type-aware routing between holistic and anchor-guided parsing.

DocHumming extends evaluation to real-world capture. On OmniDocBench it reports Overall, Text Edit, Formula CDM, Table TEDS, and Reading Order Edit. On Wild-OmniDocBench, it reports its Origin-to-Wild Overall drop and compares it with the corresponding drops for PaddleOCR-VL and MinerU2.5; the paper also reports practical per-page latency on text-dense pages for the 1B model (Li et al., 25 Mar 2026). Youtu-Parsing reports OmniDocBench v1.5 Overall and olmOCR-bench Overall scores, together with token-parallel speedups for tables and formulas (Yin et al., 28 Jan 2026).

Although these systems differ in model size, output format, and benchmark version, the metric pattern is consistent. Omni parsing is evaluated not only by transcription quality but by table topology, formula fidelity, reading order, and robustness under layout and capture shift. This metric design is itself part of the field’s definition.

6. Extensions beyond page documents

The same unifying logic appears outside scanned-document parsing. In GUI agents, OmniParser converts a screenshot into a Set-of-Marks overlay plus a local semantics list generated from OCR text and icon-functional captions. The system merges detector and OCR boxes, assigns numeric IDs, and prompts GPT-4V to select a box ID for action grounding. On ScreenSpot, OmniParser with local semantics and the interactable detector is compared on average accuracy against a GPT-4V baseline; on AITW it is compared against GPT-4V + history (Lu et al., 2024). Here, omni parsing means comprehensive, pure vision-based screen parsing into actionable, semantically grounded elements.
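
The merge-and-number step can be illustrated with a short sketch. It is not OmniParser's implementation: the IoU threshold of 0.9, the rule of keeping the OCR box when a detector box overlaps it, and the output dictionary layout are all assumptions chosen to keep the example small.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_marks(ocr: List[Tuple[Box, str]], icons: List[Tuple[Box, str]],
                overlap_threshold: float = 0.9) -> List[Dict]:
    """Merge OCR and detector boxes, then assign numeric IDs.
    The threshold and the rule of preferring OCR boxes are assumptions."""
    merged = [(box, text, "text") for box, text in ocr]
    for box, caption in icons:
        # Keep a detector box only if it does not duplicate a box already kept.
        if all(iou(box, kept[0]) < overlap_threshold for kept in merged):
            merged.append((box, caption, "icon"))
    return [{"id": i, "box": box, "kind": kind, "semantics": sem}
            for i, (box, sem, kind) in enumerate(merged)]
```

The resulting list plays both roles described above: the `id` and `box` fields are what would be drawn on the overlay, and the `semantics` field is what would be listed alongside the screenshot in the prompt.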

Logics-Parsing-Omni broadens the concept further into a unified multimodal parsing framework with three hierarchical levels: Holistic Detection, Fine-grained Recognition, and Multi-level Interpreting. It covers document pages, natural images, graphics, audio, natural video, and text-rich video, and evaluates them in OmniParsingBench under both perception and cognition criteria. Reported aggregate results cover Overall scores for Natural Image, Graphics, Audio, Natural Video, and Text-Rich Video (An et al., 10 Mar 2026). The framework’s pivotal mechanism is evidence anchoring, which forces high-level descriptions to remain aligned with low-level facts.

Earlier work foreshadows this broader usage. Object-oriented Neural Programming parses documents into a predesigned object-oriented ontology through a sequential decision process that blends symbolic and differentiable operations and can be trained with supervised learning, reinforcement learning, or a hybrid of the two (Lu et al., 2017). "A generalized parsing framework for Abstract Grammars" uses the term in an even more abstract sense: a single parser operates over CFGs, generalized CFG-like systems, and Minimalist Grammars through a compact grammar interface (Harasim et al., 2017). These formulations are not document parsers in the contemporary vision-language sense, but they share the same unifying principle: one parsing engine, multiple structured targets.

7. Limits, misconceptions, and open directions

A frequent misconception is that Omni Parsing must mean a single end-to-end decoder with no intermediate structure. The surveyed literature does not support that restriction. Infinity-Parser and DocHumming are single-sequence parsers without box output at inference time, but Dolphin-v2 explicitly remains two-stage, and Youtu-Parsing performs layout analysis plus region-prompted decoding from reusable page features (Wang et al., 17 Oct 2025, Feng et al., 5 Feb 2026, Li et al., 25 Mar 2026, Yin et al., 28 Jan 2026). What unifies them is the attempt to cover heterogeneous content and structure within one coordinated system, not adherence to one decomposition.

Another misconception is that omni parsing is solved once text OCR becomes strong. The reported failure modes are structural and distributional. Infinity-Parser still struggles on extremely noisy or old scans, very complex tables with merged cells and rotations, and dense mathematical layouts (Wang et al., 17 Oct 2025). Dolphin-v2 identifies classification errors on borderline photographed pages, extreme distortions or blur, and high-density forms with large memory pressure (Feng et al., 5 Feb 2026). DocHumming identifies irregular or interleaved layouts such as posters and newspapers, ultra-high-resolution pages requiring downsampling or tiling, and per-page latency constraints on dense pages (Li et al., 25 Mar 2026).

The current open directions are correspondingly structural. Infinity-Parser proposes more expressive rewards such as hierarchical tree consistency, semantic table-cell alignment, and formula-specific metrics, together with hierarchical parsing and multi-page context modeling (Wang et al., 17 Oct 2025). DocHumming points toward resolution-adaptive modeling, cross-tile consistency, and faster backbones and decoding (Li et al., 25 Mar 2026). Logics-Parsing-Omni identifies knowledge-grounded data synthesis, contrastive identifier training, and continual cross-modal learning as future work for evidence-based multimodal parsing (An et al., 10 Mar 2026). This suggests that the frontier of Omni Parsing lies in scaling beyond token accuracy toward global coherence, cross-element reasoning, and robust grounding under real-world variability.