
Infinity-Doc-400K: Multimodal Document Dataset

Updated 24 October 2025
  • Infinity-Doc-400K is a large-scale, multimodal corpus with 400K scanned documents from magazines, academic papers, reports, and more, designed for layout-aware document parsing.
  • The dataset provides high-quality Markdown annotations that capture text blocks, tables, and hierarchical structures, combining pixel-perfect synthetic labels with realistic, noisy real-world scans.
  • It underpins reinforcement learning approaches in models like Infinity-Parser, enhancing structural extraction and reading order preservation for robust document understanding.

The Infinity-Doc-400K dataset is a large-scale, multimodal corpus that serves as a foundational resource for layout-aware document parsing. Designed to address the limitations of existing training data in document understanding, Infinity-Doc-400K enables robust model generalization across diverse document types, visual layouts, and semantic structures. Constructed to support reinforcement learning-based approaches, the dataset provides richly annotated ground truth for scanned page images, delivering holistic supervision necessary for extracting structured representations such as text blocks, tables, formulas, and reading order from heterogeneous, real-world documents.

1. Size, Domains, and Composition

Infinity-Doc-400K comprises 400,066 scanned documents, drawn from both real-world and synthetic sources:

| Domain | Doc Count | Notable Features |
|---|---|---|
| Magazines | ≈180,000 | High variability in visual design |
| Academic Papers | ≈71,700 | Dense technical layouts, formulas |
| Financial Reports | ≈58,000 | Structured tables, multi-column |
| Synthetic Docs | ≈69,000 | HTML-rendered, perfectly labeled |
| Books | ≈11,300 | Hierarchical headings, narrative text |
| Medical Reports | ≈5,000 | Specialized notation, charts |
| Web Pages | ≈5,000 | Dynamically generated structures |

The synthetic documents are algorithmically generated using content curated from sources like CC3M, general web resources, and Wikipedia, rendered via browser-driven HTML templates. Real-world scans are collected from publicly accessible web sources and automatically annotated at scale.
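The synthetic branch's key property is that the page image and its Markdown label are generated from the same structured content, so they are aligned by construction. A minimal sketch of that idea (an assumption about the pipeline shape; the actual system renders HTML templates in a browser to produce page images):

```python
# Build an HTML page and its Markdown ground truth from one content source,
# so the label matches the rendered page by construction.
def make_pair(title: str, paragraphs: list[str]) -> tuple[str, str]:
    html = ["<html><body>", f"<h1>{title}</h1>"]
    md = [f"# {title}", ""]
    for p in paragraphs:
        html.append(f"<p>{p}</p>")
        md += [p, ""]
    html.append("</body></html>")
    # In the real pipeline the HTML string would be rendered to a scanned-page
    # image (e.g. via a headless browser); here we just return both strings.
    return "\n".join(html), "\n".join(md).rstrip()
```

Because both outputs derive from the same content list, segment boundaries and reading order in the Markdown label need no manual verification.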

This compositional diversity ensures a broad array of content types, layout styles, and complexities, creating a rigorous training and benchmarking environment for layout-aware parsing models.

2. Annotation and Holistic Supervision

Every scanned page in Infinity-Doc-400K is paired with a high-quality Markdown ground-truth file. These annotations precisely encode text content, region boundaries, hierarchical structure (e.g., distinctions among headers, paragraphs, tables, and equations), and explicit reading order. For samples in the synthetic branch, annotations are derived directly from the HTML source, ensuring pixel-perfect segment alignment between rendered images and Markdown ground truth.

The real-world segment, though annotated automatically at scale (resulting in some noisy or imperfect pseudo-labels), introduces essential annotation variability encountered in practical document analysis scenarios. This mix of precise synthetic supervision and realistic annotation noise provides a robust generalization training environment—models learn both to extract canonical structural features and to handle imperfections inherent in scanned documents.
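As a hypothetical illustration (the exact annotation schema is not specified in the text), a ground-truth Markdown file for a simple page might encode headers, paragraphs, and tables in reading order like this:

```markdown
## Quarterly Results

Revenue grew 12% year over year, driven by the EMEA region.

| Region | Q1 | Q2 |
|--------|----|----|
| EMEA   | 40 | 44 |
| APAC   | 31 | 30 |

Operating costs remained flat across both quarters.
```

The linear order of blocks in the file carries the reading-order supervision, while the block types (heading, paragraph, table) carry the structural hierarchy.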

3. Role in Layout-Aware Reinforcement Learning

Infinity-Doc-400K underpins the training of Infinity-Parser, a vision-LLM optimized using the LayoutRL framework. LayoutRL operationalizes multi-aspect reward signals, directly supported by the dataset’s structural annotations, for reinforcement learning. Specifically, model outputs are evaluated using:

  • Normalized Edit Distance Reward ($R_{\text{dist}}$):

$$R_{\text{dist}} = 1 - \frac{D(y, \hat{y})}{\max(N, M)}$$

where $D(y, \hat{y})$ is the Levenshtein distance between reference $y$ and prediction $\hat{y}$, and $N$, $M$ are their respective lengths.

  • Paragraph Count Reward ($R_{\text{count}}$):

$$R_{\text{count}} = 1 - \frac{|N_Y - N_{\hat{Y}}|}{N_Y}$$

where $N_Y$ and $N_{\hat{Y}}$ are the paragraph counts of the reference and the prediction.

  • Order Reward ($R_{\text{order}}$):

Penalizes segment order inversion to preserve sequence integrity.

The final reward is the sum of the three terms:

$$R_{\text{multi-aspect}} = R_{\text{dist}} + R_{\text{count}} + R_{\text{order}}$$

Although computational constraints limited RL training to a 43K-document subset, the dataset’s breadth and supervision quality are leveraged through both supervised pretraining and RL finetuning phases. This regime enforces simultaneous fidelity to content, structure, and layout ordering.
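The reward terms above can be sketched directly from their definitions. This is a minimal illustration, not the paper's implementation: the order reward's exact formula is not given in the text, so the pairwise-inversion form below is an assumption, as is the paragraph-level matching.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance D(y, y_hat)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def dist_reward(ref: str, pred: str) -> float:
    """R_dist = 1 - D(y, y_hat) / max(N, M)."""
    if not ref and not pred:
        return 1.0
    return 1.0 - levenshtein(ref, pred) / max(len(ref), len(pred))

def count_reward(ref_paras: list[str], pred_paras: list[str]) -> float:
    """R_count = 1 - |N_Y - N_Yhat| / N_Y (assumes N_Y > 0)."""
    return 1.0 - abs(len(ref_paras) - len(pred_paras)) / len(ref_paras)

def order_reward(ref_paras: list[str], pred_paras: list[str]) -> float:
    """Assumed form: fraction of reference-paragraph pairs whose relative
    order is preserved in the prediction (1.0 means no inversions)."""
    pos = {p: i for i, p in enumerate(pred_paras)}
    shared = [p for p in ref_paras if p in pos]
    if len(shared) < 2:
        return 1.0
    pairs = inversions = 0
    for i in range(len(shared)):
        for j in range(i + 1, len(shared)):
            pairs += 1
            if pos[shared[i]] > pos[shared[j]]:
                inversions += 1
    return 1.0 - inversions / pairs

def multi_aspect_reward(ref_paras: list[str], pred_paras: list[str]) -> float:
    """R_multi-aspect = R_dist + R_count + R_order (unit weights, as in the text)."""
    ref, pred = "\n\n".join(ref_paras), "\n\n".join(pred_paras)
    return (dist_reward(ref, pred)
            + count_reward(ref_paras, pred_paras)
            + order_reward(ref_paras, pred_paras))
```

A perfect prediction scores 3.0 (each term saturating at 1.0), while missing paragraphs, textual errors, and swapped segments each reduce their respective term independently.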

4. Addressing Diversity and Annotation Quality Challenges

Document parsing traditionally suffers from a scarcity of data that reflects the true structural complexity and layout diversity of real-world domains. Infinity-Doc-400K strategically mitigates this through a dual-channel design: the synthetic branch supplies perfectly aligned, noiseless labels, while the real-world branch introduces large-scale, imperfect annotation and visual noise. This duality fosters models capable of learning robust features transferable across domain boundaries and tolerant of annotation inconsistencies, leading to superior generalization.

The synthetic segment ensures coverage of “edge-case” layouts and rare structural types that may be underrepresented in naturally occurring datasets, without compromising annotation precision. The real-world segment teaches the model to parse documents exhibiting practical imperfections, such as scanning artifacts or unreliable segment boundaries, which are common in mass digitization efforts.

5. Benchmarking and Model Performance Impact

Infinity-Parser, trained on Infinity-Doc-400K via LayoutRL, achieves state-of-the-art performance across multiple public benchmarks, including OmniDocBench, PubTabNet, FinTabNet, and olmOCR-Bench. Notably, the Infinity-Parser-7B model achieved an OmniDocBench normalized edit distance score of 0.104, outperforming both specialized systems (MinerU, Mathpix) and general-purpose VLMs (GPT-4o, Qwen2-VL).

For table recognition tasks, the model demonstrates competitive or superior results on metrics such as TEDS (structural accuracy) and content edit distance, maintaining high performance in multi-language scenarios and on documents with complex tables. Additional analyses reveal that layout-aware RL produces smoother learning trajectories and enhances page-level structure extraction compared to standard supervised fine-tuning (SFT). This suggests marked improvements in global document understanding.

6. Significance in Document Intelligence

Infinity-Doc-400K represents a qualitative leap in the availability of high-resolution, multimodal annotated corpora for research in document parsing and layout analysis. Its scale, diversity, and exhaustive annotation framework directly contribute to the capabilities of next-generation vision-LLMs, facilitating robust end-to-end extraction of structured information from highly variable scanned sources.

By supporting reinforcement learning paradigms tailored to structural understanding, Infinity-Doc-400K enables advances in segment alignment accuracy, reading order preservation, and hierarchical structure detection. A plausible implication is that such datasets will become central to the development of universal document intelligence architectures capable of automatic conversion of arbitrary scan images into semantically rich digital formats, catalyzing progress in downstream applications such as financial extraction, academic search, and accessible publishing.

7. Release and Future Outlook

Infinity-Doc-400K, along with associated code and models, is intended for open release to foster reproducible research and facilitate advancement in layout-aware document parsing (Wang et al., 17 Oct 2025). The dataset establishes new benchmarks for both the scale and quality of annotation in document understanding, providing a platform for innovation in vision-language modeling, reinforcement learning, and information extraction across heterogeneous domains. Future directions may involve expanding the annotation schema, further increasing domain diversity, and exploring unsupervised or semi-supervised learning leveraging the corpus’s extensive coverage and hierarchical tags.
