
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting (2505.14059v1)

Published 20 May 2025 in cs.CV

Abstract: Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present \textit{Dolphin} (\textit{\textbf{Do}cument Image \textbf{P}arsing via \textbf{H}eterogeneous Anchor Prompt\textbf{in}g}), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin

Summary

Document Image Parsing via Heterogeneous Anchor Prompting: An Analytical Perspective

In document image parsing, extracting structured content from tightly intertwined elements such as text paragraphs, figures, formulas, and tables remains a difficult task. Dolphin, a recent multimodal document image parsing model, is a notable advance on the inefficiencies and layout-degradation issues common to existing approaches. This summary analyzes Dolphin's methodology and evaluates its contributions to the domain.

Dolphin distinguishes itself through its analyze-then-parse paradigm, which diverges from the traditional reliance on either multiple expert models or autoregressive generation alone. In its first stage, the model generates layout elements in reading order; these then serve as anchors for parallel content parsing in the second stage. Each anchor is paired with a task-specific prompt, allowing efficient parsing of heterogeneous content types. This approach combines the benefits of the two predominant trajectories in document parsing: integration-based pipelines and vision-language models (VLMs). In comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin demonstrates state-of-the-art performance across a variety of parsing tasks, coupled with significant efficiency gains.
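The two-stage flow described above can be sketched in Python. Note that `analyze_layout`, `parse_element`, and the prompt strings are hypothetical stand-ins for the model's stage-one and stage-two calls, not Dolphin's actual API; this is a minimal illustration of the analyze-then-parse control flow, assuming stage one emits anchors in reading order:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stage-one call: returns layout anchors in reading order.
# A real implementation would run the vision-language model on the page image.
def analyze_layout(page_image):
    return [
        {"type": "paragraph", "bbox": (0, 0, 100, 40)},
        {"type": "table", "bbox": (0, 50, 100, 90)},
        {"type": "formula", "bbox": (0, 95, 100, 110)},
    ]

# Task-specific prompts keyed by element type (illustrative values only).
PROMPTS = {
    "paragraph": "Read the text in this region.",
    "table": "Parse the table in this region.",
    "formula": "Recognize the formula in this region.",
}

# Hypothetical stage-two call: parses one element given its anchor and prompt.
def parse_element(page_image, anchor):
    prompt = PROMPTS[anchor["type"]]
    return {"type": anchor["type"], "content": f"<parsed with: {prompt}>"}

def parse_page(page_image):
    anchors = analyze_layout(page_image)  # stage 1: analyze layout
    # Stage 2: parse all elements in parallel; map preserves anchor order,
    # so the output stays in reading order.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda a: parse_element(page_image, a), anchors))

if __name__ == "__main__":
    for element in parse_page(page_image=None):
        print(element["type"], element["content"])
```

Because the stage-two calls are independent given their anchors, they parallelize trivially, which is the source of the efficiency gains discussed below.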

The development of Dolphin was underpinned by the creation of a large-scale dataset containing over 30 million samples, strategically designed for multi-granularity parsing tasks. This dataset ensures robust training and enables Dolphin to achieve impressive results in both page-level and element-level parsing environments. The enhancement in parsing accuracy and efficiency is largely attributed to Dolphin's lightweight architecture and parallel parsing mechanism, which surmounts the typical bottlenecks faced by conventional methods.

The paper rigorously evaluates Dolphin on benchmarks such as Fox-Page and Dolphin-Page, highlighting its strength in parsing complex documents with interleaved elements. Notably, Dolphin achieves an edit distance of 0.0575, outperforming competing models and confirming its ability to parse documents containing mixed elements like tables and formulas. Dolphin is also considerably more efficient, reaching 0.1729 FPS, nearly double the throughput of its fastest competitor.
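The edit-distance score cited here is a normalized Levenshtein distance between predicted and ground-truth text, where lower is better and 0 is a perfect match. A minimal implementation of the metric (an illustrative sketch, not the paper's actual evaluation code) looks like:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute),
    # keeping only one previous row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, gt: str) -> float:
    # Normalize by the longer string so the score lies in [0, 1].
    if not pred and not gt:
        return 0.0
    return levenshtein(pred, gt) / max(len(pred), len(gt))

print(normalized_edit_distance("kitten", "sitting"))  # 3 edits / 7 chars ≈ 0.4286
```

On this scale, a page-level score of 0.0575 means roughly 6% of characters need to be edited to recover the ground truth.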

The implications of this research are multifaceted. Practically, Dolphin provides a streamlined solution for document parsing in sectors demanding high efficiency and accuracy, such as academia and enterprise. Theoretically, the heterogeneous anchor prompting method introduces a promising avenue for future exploration in multimodal model design and instruction tuning. It suggests potential pathways for integrating parallel element parsing strategies into broader applications, a paradigm shift that could significantly enhance processing capabilities in large-scale document handling scenarios.

Looking forward, further developments in AI-enabled document parsing may include expanding multilingual capabilities, integrating handwriting recognition, and refining the system for more specialized document types, such as historical manuscripts. The Dolphin model lays a solid foundation for these advancements, emphasizing the importance of a unified architecture that leverages both comprehensive analysis and efficient parsing strategies.

Dolphin represents an important contribution to the evolution of document parsing technologies, offering a novel perspective on how multimodal models can be optimized for diverse content structures, thereby addressing longstanding challenges in the domain.
