
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting (2505.14059v1)

Published 20 May 2025 in cs.CV

Abstract: Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present \textit{Dolphin} (\textit{\textbf{Do}cument Image \textbf{P}arsing via \textbf{H}eterogeneous Anchor Prompt\textbf{in}g}), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin

Summary

Document Image Parsing via Heterogeneous Anchor Prompting: An Analytical Perspective

In document image parsing, extracting structured content from tightly intertwined elements such as text paragraphs, figures, formulas, and tables remains a difficult task. Dolphin, a recent multimodal document image parsing model, is a notable advance on the inefficiencies and layout-degradation issues common to existing approaches. This summary analyzes Dolphin's methodology and evaluates its contributions to the domain.

Dolphin distinguishes itself through its analyze-then-parse paradigm, which diverges from the traditional reliance on either multiple expert models or autoregressive generation alone. In its first stage, the model generates layout elements in reading order; these then serve as anchors for parallel content parsing in the second stage. Each anchor is paired with a task-specific prompt, allowing efficient parsing of heterogeneous content types. This approach combines the benefits of the two predominant trajectories in document parsing: integration-based pipelines and vision-language models (VLMs). In comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin demonstrates state-of-the-art performance across a variety of parsing tasks, coupled with significant efficiency gains.
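The two-stage flow described above can be sketched in Python. Note that `analyze_layout`, `parse_element`, and the prompt strings are hypothetical stand-ins for the model's stage-one and stage-two calls, not Dolphin's actual API; this is a minimal illustration of the analyze-then-parse control flow, assuming stage one emits anchors in reading order:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stage-one call: returns layout anchors in reading order.
# A real implementation would run the vision-language model on the page image.
def analyze_layout(page_image):
    return [
        {"type": "paragraph", "bbox": (0, 0, 100, 40)},
        {"type": "table", "bbox": (0, 50, 100, 90)},
        {"type": "formula", "bbox": (0, 95, 100, 110)},
    ]

# Task-specific prompts keyed by element type (illustrative values only).
PROMPTS = {
    "paragraph": "Read the text in this region.",
    "table": "Parse the table in this region.",
    "formula": "Recognize the formula in this region.",
}

# Hypothetical stage-two call: parses one element given its anchor and prompt.
def parse_element(page_image, anchor):
    prompt = PROMPTS[anchor["type"]]
    return {"type": anchor["type"], "content": f"<parsed with: {prompt}>"}

def parse_page(page_image):
    anchors = analyze_layout(page_image)  # stage 1: analyze layout
    # Stage 2: parse all elements in parallel; map preserves anchor order,
    # so the output stays in reading order.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda a: parse_element(page_image, a), anchors))

if __name__ == "__main__":
    for element in parse_page(page_image=None):
        print(element["type"], element["content"])
```

Because the stage-two calls are independent given their anchors, they parallelize trivially, which is the source of the efficiency gains discussed below.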

The development of Dolphin was underpinned by the creation of a large-scale dataset containing over 30 million samples, strategically designed for multi-granularity parsing tasks. This dataset ensures robust training and enables Dolphin to achieve impressive results in both page-level and element-level parsing environments. The enhancement in parsing accuracy and efficiency is largely attributed to Dolphin's lightweight architecture and parallel parsing mechanism, which surmounts the typical bottlenecks faced by conventional methods.

The paper rigorously evaluates Dolphin on benchmarks such as Fox-Page and Dolphin-Page, highlighting its strength in parsing complex documents with interleaved elements. Notably, Dolphin achieves an edit distance of 0.0575, outperforming competing models and confirming its ability to parse documents containing mixed elements like tables and formulas. Dolphin is also considerably more efficient, reaching 0.1729 FPS, nearly double the throughput of its fastest competitor.
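The edit-distance score cited here is a normalized Levenshtein distance between predicted and ground-truth text, where lower is better and 0 is a perfect match. A minimal implementation of the metric (an illustrative sketch, not the paper's actual evaluation code) looks like:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute),
    # keeping only one previous row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, gt: str) -> float:
    # Normalize by the longer string so the score lies in [0, 1].
    if not pred and not gt:
        return 0.0
    return levenshtein(pred, gt) / max(len(pred), len(gt))

print(normalized_edit_distance("kitten", "sitting"))  # 3 edits / 7 chars ≈ 0.4286
```

On this scale, a page-level score of 0.0575 means roughly 6% of characters need to be edited to recover the ground truth.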

The implications of this research are multifaceted. Practically, Dolphin provides a streamlined solution for document parsing in sectors demanding high efficiency and accuracy, such as academia and enterprise. Theoretically, the heterogeneous anchor prompting method introduces a promising avenue for future exploration in multimodal model design and instruction tuning. It suggests potential pathways for integrating parallel element parsing strategies into broader applications, a paradigm shift that could significantly enhance processing capabilities in large-scale document handling scenarios.

Looking forward, further developments in AI-enabled document parsing may include expanding multilingual capabilities, integrating handwriting recognition, and refining the system for more specialized document types, such as historical manuscripts. The Dolphin model lays a solid foundation for these advancements, emphasizing the importance of a unified architecture that leverages both comprehensive analysis and efficient parsing strategies.

Dolphin represents an important contribution to the evolution of document parsing technologies, offering a novel perspective on how multimodal models can be optimized for diverse content structures, thereby addressing longstanding challenges in the domain.
