- The paper presents OmniParser V2, a unified visually-situated text parsing framework that uses a two-stage Structured-Points-of-Thought (SPOT) prompting strategy to consolidate multiple tasks.
- The two-stage SPOT strategy and the token-router-based shared decoder decouple structure prediction, spatial localization, and text recognition, leading to state-of-the-art or competitive results across text spotting, KIE, table recognition, and layout analysis benchmarks.
- The Structured-Points-of-Thought (SPOT) strategy demonstrates generality by improving text processing performance when integrated into Multimodal Large Language Models (MLLMs).
The paper presents OmniParser V2, a unified framework for visually-situated text parsing (VsTP) that consolidates tasks such as scene text spotting, key information extraction (KIE), table recognition, and layout analysis into a single encoder‐decoder architecture. The central innovation is the Structured-Points-of-Thought (SPOT) prompting strategy, which decomposes the parsing process into two distinct stages to improve both performance and interpretability.
The first stage generates a structured points sequence that encodes center-point tokens marking the spatial locations of text instances, interleaved with structural tokens (e.g., task-specific markup such as {address} or {tr}). Because image coordinates are normalized and quantized into discrete tokens, this intermediate representation provides an effective abstraction between the raw visual input and high-level document structure. In the second stage, a shared decoder built on a token-router architecture (a simplified mixture-of-experts, MoE, design) produces the polygon outlines (or bounding boxes) and the content sequences for text recognition. Decoupling spatial localization from textual transcription shortens the generated sequences and reduces the associated error accumulation, which in turn improves generalization.
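To make the first stage concrete, here is a minimal sketch of how center points might be normalized, quantized, and interleaved with structural tokens. The 1,000-bin vocabulary, the `<coord_k>` token names, and the `build_structured_points_sequence` helper are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of coordinate-to-token quantization for the Stage-1
# structured points sequence. Bin count and token naming are assumptions.

NUM_BINS = 1000  # assumed quantization granularity

def quantize_point(x: float, y: float, img_w: int, img_h: int) -> tuple[str, str]:
    """Normalize an (x, y) center point to [0, 1) and map each axis to a discrete token."""
    bx = min(int(x / img_w * NUM_BINS), NUM_BINS - 1)
    by = min(int(y / img_h * NUM_BINS), NUM_BINS - 1)
    return f"<coord_{bx}>", f"<coord_{by}>"

def build_structured_points_sequence(instances, img_w, img_h):
    """Stage 1: interleave structural tokens (e.g. "{address}", "{tr}") with
    quantized center-point tokens, one triple per text instance."""
    seq = []
    for inst in instances:  # inst: {"tag": "{address}", "center": (cx, cy)}
        tx, ty = quantize_point(*inst["center"], img_w, img_h)
        seq += [inst["tag"], tx, ty]
    return seq

# Example: two fields in a 1280x960 receipt image
print(build_structured_points_sequence(
    [{"tag": "{address}", "center": (640.0, 120.0)},
     {"tag": "{total}", "center": (900.0, 700.0)}],
    img_w=1280, img_h=960))
# ['{address}', '<coord_500>', '<coord_125>', '{total}', '<coord_703>', '<coord_729>']
```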
Key design elements and contributions include:
- Unified Modeling Paradigm: Instead of relying on task-specific branches and external post-processing (e.g., OCR engines), the framework uses a shared encoder based on Swin-B and a token-router-based shared decoder to handle all VsTP tasks in an end-to-end manner. This reduces model complexity and mitigates modal isolation.
- Token-Router-Based Shared Decoder: The decoder routes each token to one of three specialized feed-forward networks (FFNs) according to its type: structural, detection, or recognition (a minimal sketch of this routing appears after this list). This explicit supervision of token types improves training convergence and reduces redundancy compared with native MoE decoders.
- Two-Stage SPOT Prompting: The first stage focuses on structured points sequence generation—capturing center points alongside task-related structural tokens—while the second stage predicts the geometric contours and semantic content. Ablation studies indicate that this intermediate representation is critical for achieving higher performance on tasks like text spotting, as it explicitly decouples complex coordinate and recognition learning.
- Pre-training Strategies: Spatial-window prompting and prefix-window prompting are introduced to enrich spatial and semantic representations. The spatial-window strategy guides the decoder to focus on specific image regions via fixed or random windows, whereas the prefix-window strategy uses character-based prompts to refine the prediction of text instances, thereby improving the model's robustness to diverse layouts (an illustrative sketch of both prompting schemes follows this list).
- Generalization to Multimodal LLMs (MLLMs): The paper further explores integrating SPOT prompting within MLLMs to enhance text localization and recognition performance. Through careful supervised fine-tuning with SPOT-style prompts (varying in length as normal, short, or long SPOT), the authors demonstrate significant improvements in point-based (Pos) and transcription-based (Trans) metrics on benchmarks such as ICDAR 2015, Total-Text, and CTW1500. Although a performance gap remains compared to the specialized OmniParser V2 framework, the results underline the potential for end-to-end multimodal reasoning without external OCR dependencies.
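As referenced in the Token-Router-Based Shared Decoder item above, the following is a simplified PyTorch sketch of hard routing over three type-specific FFNs. The hidden sizes, the `TokenRouterFFN` name, and the use of precomputed token-type ids are assumptions made for illustration, not the authors' exact module.

```python
import torch
import torch.nn as nn

STRUCT, DETECT, RECOG = 0, 1, 2  # assumed token-type ids

class TokenRouterFFN(nn.Module):
    """Decoder FFN block that hard-routes tokens to type-specific experts."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(3)  # structural / detection / recognition FFNs
        ])

    def forward(self, hidden: torch.Tensor, token_types: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); token_types: (batch, seq) with values in {0, 1, 2}
        out = torch.zeros_like(hidden)
        for type_id, expert in enumerate(self.experts):
            mask = token_types == type_id          # tokens assigned to this expert
            if mask.any():
                out[mask] = expert(hidden[mask])   # process only the selected tokens
        return out

# Toy usage: a batch of 2 sequences, 6 tokens each
layer = TokenRouterFFN()
hidden = torch.randn(2, 6, 512)
token_types = torch.tensor([[0, 0, 1, 1, 2, 2],
                            [0, 1, 1, 2, 2, 2]])
print(layer(hidden, token_types).shape)  # torch.Size([2, 6, 512])
```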
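For the pre-training strategies item, the sketch below shows one plausible way to build spatial-window and prefix-window prompts as token sequences. All token names (`<window>`, `<prefix>`, `<coord_k>`) and the exact prompt layout are assumptions; only the general idea of conditioning decoding on an image region or a character range follows the paper's description.

```python
import random

NUM_BINS = 1000  # same assumed coordinate vocabulary as in the earlier sketch

def quantize(v: float, size: int) -> int:
    return min(int(v / size * NUM_BINS), NUM_BINS - 1)

def spatial_window_prompt(img_w, img_h, window=None):
    """Prompt tokens asking the decoder to parse only text whose center lies
    inside `window` = (x0, y0, x1, y1); a random window is drawn if none is given."""
    if window is None:
        x0 = random.uniform(0, img_w / 2)
        y0 = random.uniform(0, img_h / 2)
        window = (x0, y0, x0 + img_w / 2, y0 + img_h / 2)
    x0, y0, x1, y1 = window
    return ["<window>",
            f"<coord_{quantize(x0, img_w)}>", f"<coord_{quantize(y0, img_h)}>",
            f"<coord_{quantize(x1, img_w)}>", f"<coord_{quantize(y1, img_h)}>",
            "</window>"]

def prefix_window_prompt(char_lo: str, char_hi: str):
    """Prompt tokens asking the decoder to transcribe only instances whose
    first character falls in the range [char_lo, char_hi]."""
    return ["<prefix>", char_lo, char_hi, "</prefix>"]

print(spatial_window_prompt(1280, 960, window=(0, 0, 640, 480)))
# ['<window>', '<coord_0>', '<coord_0>', '<coord_500>', '<coord_500>', '</window>']
print(prefix_window_prompt("a", "m"))
# ['<prefix>', 'a', 'm', '</prefix>']
```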
The experimental results are comprehensive. Quantitative evaluations indicate that OmniParser V2 achieves state-of-the-art or competitive results across multiple benchmarks:
- Text Spotting: Gains of +0.6% and +0.4% over prior models on Total-Text and CTW1500 under lexicon-based evaluation, with substantial improvements in end-to-end (E2E) metrics across strong, weak, and generic lexicon settings on ICDAR 2015.
- Key Information Extraction: Improved field-level F1 scores on CORD and tree-edit-distance-based accuracies on SROIE demonstrate superior localization and recognition compared with prior generation-based methods.
- Table Recognition: The decoupling of table structure recognition from cell content transcription results in higher Tree-Edit-Distance-based Similarity (TEDS) scores on PubTabNet and FinTabNet datasets, with reported improvements even when using shorter decoder lengths (1,500 tokens versus 4,000 in some baselines).
- Layout Analysis: On the HierText dataset, OmniParser V2 delivers gains in Panoptic Quality (PQ) for word, line, and paragraph grouping tasks relative to baseline models, underscoring its capability in capturing hierarchical document structures.
Overall, the framework's design significantly simplifies the processing pipeline by eliminating the need for multiple task-specific architectures and objectives, while the two-stage SPOT prompting and the token-router-based shared decoder jointly contribute to improved accuracy, a smaller model (a reported 23.6% reduction in size), and enhanced robustness. The demonstrated generality of SPOT prompting when applied to MLLMs further broadens the applicability of the approach to multimodal document understanding.