DocTron-Formula: Unified OCR Framework
- The paper presents a unified framework that leverages large-scale vision-language transformers to achieve state-of-the-art OCR on complex scientific formulas.
- It introduces the CSFormula dataset, a diverse collection of formula examples across multiple scientific disciplines and document layouts.
- The method simplifies recognition by eliminating task-specific engineering, using an end-to-end transformer for accurate, context-sensitive LaTeX transcription.
DocTron-Formula is a unified framework for mathematical formula recognition and transcription in complex and structurally diverse scientific documents. Built upon general vision-language models (VLMs), DocTron-Formula departs from traditional task-specific approaches by leveraging large-scale multimodal pre-trained architectures, fine-tuned on an extensive, structurally challenging dataset (CSFormula). The system achieves state-of-the-art results on formula OCR and transcends prior domain-specific limitations, demonstrating robust generalization to a wide variety of scientific domains, layouts, and document granularities (Zhong et al., 1 Aug 2025).
1. Architecture and Model Design
DocTron-Formula utilizes a vision-language transformer backbone with the following pipeline:
- Input Preparation: Document images are resized so that both height and width are multiples of 28, facilitating efficient patching and downstream transformer processing.
- Vision Encoder: The model applies a Vision Transformer (ViT) backbone with 2D rotary positional encoding (2D-RoPE) and local window attention. These mechanisms capture spatial relations among image patches while maintaining efficiency at high resolution.
- Token Grouping and Refinement: Visual tokens extracted from densely overlapping patches are grouped spatially and refined by a two-layer multilayer perceptron (MLP), reducing redundancy and enabling global scene understanding.
- Cross-modal Token Concatenation: The sequence of grouped vision tokens is concatenated with embedded instruction (text) tokens to form a unified multimodal stream:

  $$Z_0 = [\,V;\, T\,],$$

  where $V$ denotes the grouped vision tokens and $T$ the embedded instruction tokens.
- Transformer Fusion: The joint token stream is processed through a stack of transformer blocks, each combining multi-head self-attention (MSA), RMSNorm layer normalization, and a feed-forward network (FFN) in the standard pre-norm arrangement:

  $$Z'_\ell = Z_{\ell-1} + \mathrm{MSA}(\mathrm{RMSNorm}(Z_{\ell-1})), \qquad Z_\ell = Z'_\ell + \mathrm{FFN}(\mathrm{RMSNorm}(Z'_\ell)).$$
- Autoregressive Generation: Formula transcription is treated as a conditional language-modeling task, producing LaTeX code token by token:

  $$p(y \mid V, T) = \prod_{t=1}^{|y|} p\bigl(y_t \mid V, T, y_{<t}\bigr),$$

  where $y_t$ are LaTeX tokens, $V$ the visual tokens, $T$ the instruction tokens, and $y_{<t}$ the previously generated outputs.
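The input-preparation step above can be sketched concisely. The paper states only that both dimensions are made multiples of 28; rounding up is one reasonable choice (the exact rounding direction is an assumption here), and the function below works on raw width/height pairs rather than a specific image library:

```python
def resize_dims(width: int, height: int, multiple: int = 28) -> tuple[int, int]:
    """Round each dimension up to the nearest multiple of 28 so the image
    tiles cleanly into ViT patches. The rounding direction (up vs. nearest)
    is not specified in the paper; rounding up avoids cropping content."""
    def round_up(x: int) -> int:
        return ((x + multiple - 1) // multiple) * multiple
    return round_up(width), round_up(height)

# Example: a 100x57 crop becomes 112x84, both multiples of 28.
new_w, new_h = resize_dims(100, 57)
```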
2. The CSFormula Dataset
A crucial advancement underpinning DocTron-Formula is the CSFormula dataset, which addresses major shortcomings of extant formula recognition corpora:
- Structural Breadth: CSFormula includes formulas not only at line level but also embedded within paragraphs and over full document pages. This exposes the model to multi-line formulas, natural paragraph context, and real-world layout variability.
- Scale and Diversity: The dataset is sourced from over 5.8 million pages, post-processed for deduplication, and covers disciplines such as mathematics, physics, and chemistry. Representative dataset sizes are 741,016 line-level, 135,575 paragraph-level, and 131,876 page-level training examples (each with 1,000 held-out test samples).
- Symbolic and Layout Variance: The dataset captures deep nesting, superscripting, microsymbols, varied domain-specific notation, and realistic document noise—from scanned images to mixed graphical content.
This diversity enables the model to generalize across scientific domains and practical document digitization scenarios.
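To make the dataset's structure concrete, a CSFormula-style record might look like the sketch below. The paper reports the three granularities and split sizes but does not publish a schema, so the field names and the example paths here are hypothetical:

```python
# Hypothetical record layout for a CSFormula-style corpus; the field names
# and paths are illustrative, not the released schema.
sample = {
    "image_path": "pages/physics/000123.png",   # hypothetical path
    "granularity": "paragraph",                 # "line" | "paragraph" | "page"
    "discipline": "physics",
    "latex": r"\frac{\partial u}{\partial t} = \alpha \nabla^2 u",
}

# Training-set sizes per granularity, as reported in the paper
# (each granularity also has a 1,000-sample held-out test set).
SPLIT_SIZES = {
    "line": 741_016,
    "paragraph": 135_575,
    "page": 131_876,
}
```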
3. Methodological Distinctions and Model Training
- No Task-Specific Architectural Engineering: Unlike prior systems using structural decoders, explicit tree-based parsing heads, or formula-specific visual modules, DocTron-Formula strictly fine-tunes general VLMs (such as Qwen2.5-VL) with standard supervised learning on the CSFormula data.
- Window Attention and Grouping: Attention is performed over local windows to maintain scalability for large images. Token grouping before fusion reduces sequence length and redundancy, aiding efficient full-page and paragraph-level processing.
- Loss and Optimization: The system's generative loss is an autoregressive negative log-likelihood over LaTeX token sequences, conditioned on both image and textual context, allowing it to learn formula structure and context-sensitive symbol resolution.
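The autoregressive negative log-likelihood can be illustrated with a toy computation. This is a minimal stand-in, not the training code: the real decoder conditions each step on visual tokens, instruction tokens, and prior LaTeX tokens, whereas here the per-step next-token distributions are simply given:

```python
import math

def autoregressive_nll(step_probs: list, target: list) -> float:
    """Negative log-likelihood of a LaTeX token sequence under per-step
    next-token distributions. Toy stand-in for the decoder objective:
    each dict maps candidate tokens to their predicted probability."""
    assert len(step_probs) == len(target)
    return -sum(math.log(probs[tok]) for probs, tok in zip(step_probs, target))

# Two decoding steps, each assigning probability 0.5 to the target token:
loss = autoregressive_nll([{"\\frac": 0.5}, {"{": 0.5}], ["\\frac", "{"])
# loss = 2 * ln(2) ≈ 1.386
```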
4. Evaluation Metrics and Empirical Results
Performance is assessed using two complementary metrics:
| Metric | Description | DocTron-Formula (CSFormula) |
|---|---|---|
| Edit Distance (ED) | Minimum character-level operations (insert/delete/substitute) to reach ground-truth LaTeX | 0.164 (mean; lower is better) |
| Character Detection Matching (CDM) | Visual matching score on rendered images; measures symbolic/structural fidelity | 0.873 (higher is better) |
On CSFormula, DocTron-Formula attains superior results relative to both task-specific models (e.g., UniMERNet) and generic vision-language model baselines (e.g., GPT-4o, Gemini-2.5-flash, Qwen2.5-VL). Its low ED reflects accurate LaTeX structure reproduction, while its high CDM demonstrates robust handling of visual and layout variance.
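The ED metric is the classic Levenshtein distance over LaTeX characters. A minimal dynamic-programming implementation is sketched below; note that the normalization used to report a mean score in [0, 1] is not stated in the paper, so dividing by reference length is an assumption here:

```python
def edit_distance(pred: str, ref: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(ref) + 1))
    for i, pc in enumerate(pred, 1):
        curr = [i]
        for j, rc in enumerate(ref, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (pc != rc)))   # substitution
        prev = curr
    return prev[-1]

def normalized_ed(pred: str, ref: str) -> float:
    """One common normalization (by reference length); the paper's exact
    normalization is not specified, so treat this as illustrative."""
    return edit_distance(pred, ref) / max(len(ref), 1)

# Example: "kitten" -> "sitting" requires 3 edits.
d = edit_distance("kitten", "sitting")  # 3
```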
5. Applications and Scientific Impact
DocTron-Formula advances the automated understanding and digital processing of scientific documents:
- Scientific Literature OCR: Efficient, accurate transcription of complex formulas from diverse document formats enables scalable mathematical digitization.
- Semantic Indexing and Search: Accurately recognized LaTeX enables downstream tasks such as formula search, semantic retrieval, and citation-aware information extraction across STEM disciplines.
- Automated Editing and Knowledge Management: High-fidelity recognition supports reusability of existing scientific content and automated knowledge base construction.
- Unified Document Analysis: Fine-tuning generic VLMs sidesteps brittle domain-specific engineering, enabling robust formula understanding across layout granularities (line, paragraph, page) and scientific domains without re-architecting.
A plausible implication is that this approach, based on general vision-language model fine-tuning and large-scale heterogeneous training data, establishes a template for future efforts in mathematical and scientific document analysis, potentially replacing ad hoc formula-specific architectures for most practical scenarios.
6. Technical Contributions and Broader Paradigm Shift
DocTron-Formula's technical contributions lie in its evidence that:
- Large, general-purpose VLMs can be straightforwardly adapted, via supervised fine-tuning and careful data curation, to outperform models purpose-built for formula recognition.
- Rich, structurally challenging datasets such as CSFormula are critical enablers for domain generalization and robustness.
- The end-to-end, transformer-based recognition pipeline requires no handcrafted feature engineering, showing extensibility to broader forms of scientific structure beyond formulas.
This suggests a paradigm shift from specialized OCR/structural models toward flexible, foundation-model-based recognition with tight integration of visual and textual cues in scientific document analysis.