- The paper presents OmniParser V2, a unified visually-situated text parsing framework that uses a two-stage Structured-Points-of-Thought (SPOT) prompting strategy to consolidate multiple tasks.
- The two-stage SPOT strategy and the token-router-based shared decoder decouple structure prediction, spatial localization, and text recognition, leading to state-of-the-art or competitive results across text spotting, KIE, table recognition, and layout analysis benchmarks.
- The Structured-Points-of-Thought (SPOT) strategy demonstrates generality by improving text processing performance when integrated into Multimodal Large Language Models (MLLMs).
The paper presents OmniParser V2, a unified framework for visually-situated text parsing (VsTP) that consolidates tasks such as scene text spotting, key information extraction (KIE), table recognition, and layout analysis into a single encoder‐decoder architecture. The central innovation is the Structured-Points-of-Thought (SPOT) prompting strategy, which decomposes the parsing process into two distinct stages to improve both performance and interpretability.
The first stage generates a structured points sequence that encodes center-point tokens marking the spatial locations of text instances, interleaved with structural tokens (e.g., task-specific markup such as {address} or {tr}). Because image coordinates are normalized and quantized into discrete tokens, this intermediate representation provides an effective abstraction between the raw visual input and high-level document structure. In the second stage, a shared decoder built on a token-router architecture (a simplified mixture-of-experts, MoE, design) produces the polygon outlines (or bounding boxes) and the content sequences for text recognition. Decoupling spatial localization from textual transcription shortens the generated sequences and reduces the associated error accumulation, which in turn improves generalization.
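To make the first stage concrete, here is a minimal sketch of how center points might be normalized, quantized, and interleaved with structural tokens. The 1,000-bin vocabulary, the `<coord_k>` token names, and the `build_structured_points_sequence` helper are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of coordinate-to-token quantization for the Stage-1
# structured points sequence. Bin count and token naming are assumptions.

NUM_BINS = 1000  # assumed quantization granularity

def quantize_point(x: float, y: float, img_w: int, img_h: int) -> tuple[str, str]:
    """Normalize an (x, y) center point to [0, 1) and map each axis to a discrete token."""
    bx = min(int(x / img_w * NUM_BINS), NUM_BINS - 1)
    by = min(int(y / img_h * NUM_BINS), NUM_BINS - 1)
    return f"<coord_{bx}>", f"<coord_{by}>"

def build_structured_points_sequence(instances, img_w, img_h):
    """Stage 1: interleave structural tokens (e.g. "{address}", "{tr}") with
    quantized center-point tokens, one triple per text instance."""
    seq = []
    for inst in instances:  # inst: {"tag": "{address}", "center": (cx, cy)}
        tx, ty = quantize_point(*inst["center"], img_w, img_h)
        seq += [inst["tag"], tx, ty]
    return seq

# Example: two fields in a 1280x960 receipt image
print(build_structured_points_sequence(
    [{"tag": "{address}", "center": (640.0, 120.0)},
     {"tag": "{total}", "center": (900.0, 700.0)}],
    img_w=1280, img_h=960))
# ['{address}', '<coord_500>', '<coord_125>', '{total}', '<coord_703>', '<coord_729>']
```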
Key design elements and contributions include:
- Unified Modeling Paradigm: Instead of relying on task-specific branches and external post-processing (e.g., OCR engines), the framework uses a shared encoder based on Swin-B and a token-router-based shared decoder to handle all VsTP tasks in an end-to-end manner. This reduces model complexity and mitigates modal isolation.
- Token-Router-Based Shared Decoder: The decoder routes each token to one of three specialized feed-forward networks (FFNs) according to its type: structural, detection, or recognition (a minimal sketch of this routing appears after this list). This explicit supervision of token types improves training convergence and reduces redundancy compared with native MoE decoders.
- Two-Stage SPOT Prompting: The first stage focuses on structured points sequence generation—capturing center points alongside task-related structural tokens—while the second stage predicts the geometric contours and semantic content. Ablation studies indicate that this intermediate representation is critical for achieving higher performance on tasks like text spotting, as it explicitly decouples complex coordinate and recognition learning.
- Pre-training Strategies: Spatial-window prompting and prefix-window prompting are introduced to enrich spatial and semantic representations. The spatial-window strategy guides the decoder to focus on specific image regions via fixed or random windows, whereas the prefix-window strategy uses character-based prompts to refine the prediction of text instances, thereby improving the model's robustness to diverse layouts (an illustrative sketch of both prompting schemes follows this list).
- Generalization to Multimodal LLMs (MLLMs): The paper further explores integrating SPOT prompting within MLLMs to enhance text localization and recognition performance. Through careful supervised fine-tuning with SPOT-style prompts (varying in length as normal, short, or long SPOT), the authors demonstrate significant improvements in point-based (Pos) and transcription-based (Trans) metrics on benchmarks such as ICDAR 2015, Total-Text, and CTW1500. Although a performance gap remains compared to the specialized OmniParser V2 framework, the results underline the potential for end-to-end multimodal reasoning without external OCR dependencies.
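As referenced in the Token-Router-Based Shared Decoder item above, the following is a simplified PyTorch sketch of hard routing over three type-specific FFNs. The hidden sizes, the `TokenRouterFFN` name, and the use of precomputed token-type ids are assumptions made for illustration, not the authors' exact module.

```python
import torch
import torch.nn as nn

STRUCT, DETECT, RECOG = 0, 1, 2  # assumed token-type ids

class TokenRouterFFN(nn.Module):
    """Decoder FFN block that hard-routes tokens to type-specific experts."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(3)  # structural / detection / recognition FFNs
        ])

    def forward(self, hidden: torch.Tensor, token_types: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); token_types: (batch, seq) with values in {0, 1, 2}
        out = torch.zeros_like(hidden)
        for type_id, expert in enumerate(self.experts):
            mask = token_types == type_id          # tokens assigned to this expert
            if mask.any():
                out[mask] = expert(hidden[mask])   # process only the selected tokens
        return out

# Toy usage: a batch of 2 sequences, 6 tokens each
layer = TokenRouterFFN()
hidden = torch.randn(2, 6, 512)
token_types = torch.tensor([[0, 0, 1, 1, 2, 2],
                            [0, 1, 1, 2, 2, 2]])
print(layer(hidden, token_types).shape)  # torch.Size([2, 6, 512])
```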
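For the pre-training strategies item, the sketch below shows one plausible way to build spatial-window and prefix-window prompts as token sequences. All token names (`<window>`, `<prefix>`, `<coord_k>`) and the exact prompt layout are assumptions; only the general idea of conditioning decoding on an image region or a character range follows the paper's description.

```python
import random

NUM_BINS = 1000  # same assumed coordinate vocabulary as in the earlier sketch

def quantize(v: float, size: int) -> int:
    return min(int(v / size * NUM_BINS), NUM_BINS - 1)

def spatial_window_prompt(img_w, img_h, window=None):
    """Prompt tokens asking the decoder to parse only text whose center lies
    inside `window` = (x0, y0, x1, y1); a random window is drawn if none is given."""
    if window is None:
        x0 = random.uniform(0, img_w / 2)
        y0 = random.uniform(0, img_h / 2)
        window = (x0, y0, x0 + img_w / 2, y0 + img_h / 2)
    x0, y0, x1, y1 = window
    return ["<window>",
            f"<coord_{quantize(x0, img_w)}>", f"<coord_{quantize(y0, img_h)}>",
            f"<coord_{quantize(x1, img_w)}>", f"<coord_{quantize(y1, img_h)}>",
            "</window>"]

def prefix_window_prompt(char_lo: str, char_hi: str):
    """Prompt tokens asking the decoder to transcribe only instances whose
    first character falls in the range [char_lo, char_hi]."""
    return ["<prefix>", char_lo, char_hi, "</prefix>"]

print(spatial_window_prompt(1280, 960, window=(0, 0, 640, 480)))
# ['<window>', '<coord_0>', '<coord_0>', '<coord_500>', '<coord_500>', '</window>']
print(prefix_window_prompt("a", "m"))
# ['<prefix>', 'a', 'm', '</prefix>']
```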
The experimental results are comprehensive. Quantitative evaluations indicate that OmniParser V2 achieves state-of-the-art or competitive results across multiple benchmarks:
- Text Spotting: Gains of +0.6% and +0.4% over prior models on Total-Text and CTW1500 under lexicon-based evaluation, with substantial improvements in end-to-end (E2E) metrics across strong, weak, and generic lexicon settings on ICDAR 2015.
- Key Information Extraction: Improved field-level F1 scores on CORD and tree-edit-distance-based accuracies on SROIE demonstrate superior localization and recognition compared with prior generation-based methods.
- Table Recognition: The decoupling of table structure recognition from cell content transcription results in higher Tree-Edit-Distance-based Similarity (TEDS) scores on PubTabNet and FinTabNet datasets, with reported improvements even when using shorter decoder lengths (1,500 tokens versus 4,000 in some baselines).
- Layout Analysis: On the HierText dataset, OmniParser V2 delivers gains in Panoptic Quality (PQ) for word, line, and paragraph grouping tasks relative to baseline models, underscoring its capability in capturing hierarchical document structures.
Overall, the framework's design significantly simplifies the processing pipeline by eliminating the need for multiple task-specific architectures and objectives, while the two-stage SPOT prompting and the token-router-based shared decoder jointly contribute to improved accuracy, a smaller model (a reported 23.6% reduction in size), and enhanced robustness. The demonstrated generality of SPOT prompting when applied to MLLMs further broadens the applicability of the approach to multimodal document understanding.