DocTron-Formula Framework

Updated 23 June 2026

DocTron-Formula is a unified OCR framework designed to recognize complex mathematical formulas in diverse scientific documents.
It leverages a multimodal transformer architecture with integrated vision-language tokens to convert document images into accurate LaTeX representations.
Benchmark results on the CSFormula dataset show lower edit distances and higher character detection matching compared to specialized OCR systems.

Optical Character Recognition (OCR) for mathematical formula recognition remains a foundational challenge in scientific document analysis. The DocTron-Formula framework represents a unified, end-to-end approach to formula recognition that leverages general vision-LLMs (VLMs) without architectural specialization. It combines a scalable fine-tuning regimen with the newly introduced CSFormula dataset to achieve state-of-the-art (SOTA) performance in handling the structural diversity and complexity inherent to real-world scientific documents (Zhong et al., 1 Aug 2025).

1. Core Architecture and Design Principles

DocTron-Formula is instantiated atop Qwen2.5-VL, a multimodal transformer model combining a Vision Transformer (ViT) with a LLM. The vision encoder module resizes document images to dimensions $H' \times W'$ (multiples of 28), partitions them into $14 \times 14$ patches, and applies 2D Rotary Positional Encoding (2D-RoPE) alongside “window” self-attention. The resultant $P_0$ tokens are grouped in adjacent $2 \times 2$ neighborhoods and processed through a two-layer multilayer perceptron to yield $P$ tokens of dimension $D$ , matching the LLM hidden size.

DocTron-Formula eschews task-specific modules such as detection heads or grammar decoders, relying instead on the LLM’s generative capacity. The Transformer stack receives as input the concatenated vision tokens $(v_1, \ldots, v_P)$ and textual instruction tokens $(t_1, \ldots, t_N)$ , fully integrated through cross-modal self-attention at every layer:

$\begin{aligned} x_0 &= [v_1, \ldots, v_P, t_1, \ldots, t_N] \ x'_l &= \mathrm{MSA}(\mathrm{RMSNorm}(x_{l-1})) + x_{l-1} \ x_l &= \mathrm{FFN}(\mathrm{RMSNorm}(x'_l)) + x'_l \end{aligned}$

where MSA denotes multi-head self-attention and FFN a position-wise feed-forward network.

The training objective is the autoregressive negative log-likelihood of the target LaTeX token sequence:

$L_{\mathrm{rec}} = -\sum_{i=1}^{K} \log p(y_i\,|\,V, T, y_{<i})$

No explicit localization or detection loss is incorporated ( $14 \times 14$ 0, $14 \times 14$ 1 in the generalized objective $14 \times 14$ 2).

2. Construction and Structure of the CSFormula Dataset

CSFormula is a large-scale dataset curated from approximately 5.8 million StackExchange pages, explicitly designed to capture diverse structural and notational formulaic content. The dataset supports recognition at three granularity levels: line, paragraph, and page.

Level	Train set	Test set	Example content
Line	741,016	1,000	$14 \times 14$ 3
Paragraph	135,575	1,000	Inline/interleaved formulae
Page	131,876	1,000	Multi-line, figures, tables

The dataset spans mathematics, physics, chemistry, engineering, and applied sciences. Emphasis is placed on deep parse trees (nested radicals, fractions), diverse symbol sets (chemical/reaction notation, statistical symbols), multiline alignments, and the interplay of formulae within paragraph and page layouts. Representative samples illustrate both simple expressions and complex page-level scientific document segments.

3. Training Pipeline and Optimization

DocTron-Formula is initialized from Qwen2.5-VL (7B parameters) and trained using supervised fine-tuning (SFT) across combined line-, paragraph-, and page-level samples. The training prompt is consistently: “Please convert the following image of a scientific document to LaTeX.” No data augmentation is used except for standard resizing; cropping and rotation have no measurable benefit.

Key hyperparameters include the AdamW optimizer with weight decay (0.1), a peak learning rate of $14 \times 14$ 4 (linear warmup and decay), batch size of 32 images per GPU, and five epochs over the training set. The loss objective remains purely the generative negative log-likelihood ( $14 \times 14$ 5); no detection or segmentation modules are trained.

Ablation experiments reveal that training with only line-level data impairs performance on structured or page layouts. Full joint training across levels enables best generalization: line ED=0.121, paragraph ED=0.123, and page ED=0.272.

4. Quantitative Results and Comparative Performance

Evaluation utilizes Edit Distance (ED, lower is better) and Character Detection Matching (CDM, higher is better). ED reflects the token operations required to match predicted LaTeX to reference; CDM compares rendered images at the character level.

Model / Dataset	Im2LaTeX-160k	UniMER Avg.	CSFormula Avg.
UniMERNet (SOTA)	0.240	0.103	0.679
Qwen2.5-VL	0.310	0.303	0.472
GPT-4o	0.434	0.545	0.402
Gemini-2.5-flash	0.424	0.531	0.394
Mathpix	0.449	0.516	0.457
DocTron-Formula	0.245	0.098	0.164

Model / Dataset	Im2LaTeX-160k	UniMER Avg.	CSFormula Avg.
UniMERNet	0.991	0.965	0.524
Qwen2.5-VL	0.971	0.911	0.622
GPT-4o	0.929	0.783	0.536
Gemini-2.5-flash	0.973	0.882	0.732
Mathpix	0.969	0.949	0.733
DocTron-Formula	0.985	0.961	0.873

DocTron-Formula achieves lower ED and higher CDM than all baselines, including specialized systems (UniMERNet), commercial tools (Mathpix), and leading open/closed VLMs (GPT-4o, Gemini-2.5-flash) (Zhong et al., 1 Aug 2025). This pattern holds across Im2LaTeX-160k, UniMER (various print quality/handwriting), and CSFormula splits.

5. Robustness Across Noise, Layout, and Notation

DocTron-Formula’s performance generalizes robustly across noisy and heterogeneous input. For noise resilience, the Screen-Captured (SCE) UniMER subset shows DocTron-Formula at ED=0.182, compared to UniMERNet at 0.224, indicating superior resistance to compression artifacts and illumination variance.

On complex page layouts (CSFormula, page-level), DocTron-Formula attains ED=0.251 and CDM=0.774. This demonstrates superior parsing of documents containing tables, interleaved text, multiple math environments, and no intermediate segmentation.

Training on a multidisciplinary corpus (mathematics, physics, chemistry) facilitates strong zero-shot handling of rare or domain-specific symbols, such as “⧧” or “⇌”, and highly specialized notation from particle physics.

6. Qualitative Assessment and Model Ablations

Qualitative analysis confirms recovery of deep, multi-line and tensor expressions, e.g.,

$14 \times 14$ 6

DocTron-Formula produces coherent LaTeX even for full-page documents, preserving section structure, aligned environments, and floats.

Ablations show that DocTron-Formula trained with joint line/paragraph/page data yields the best all-level accuracy. Increasing model size to 7B parameters is beneficial, though even 3B parameter versions surpass the 7B Qwen2.5-VL baseline, highlighting the effect of domain data adaptation.

7. Contextual Impact and Paradigm Shifts

DocTron-Formula establishes a new paradigm in formula recognition by demonstrating that general VLMs, when fine-tuned on a large, structurally varied dataset, can match or exceed specialized OCR systems in both accuracy and robustness. It dispenses with the need for engineered encoders, decoders, or post-processing logic, relying solely on end-to-end generative architectures and cross-modal self-attention. CSFormula provides the first extensive, public benchmark spanning line, paragraph, and page layout complexities for formula OCR and general document analysis. These contributions redefine the methodological landscape for automated scientific document understanding (Zhong et al., 1 Aug 2025).

Markdown Report Issue Upgrade to Chat

References (1)

DocTron-Formula: Generalized Formula Recognition in Complex and Structured Scenarios (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to DocTron-Formula Framework.

DocTron-Formula Framework

1. Core Architecture and Design Principles

2. Construction and Structure of the CSFormula Dataset

3. Training Pipeline and Optimization

4. Quantitative Results and Comparative Performance

5. Robustness Across Noise, Layout, and Notation

6. Qualitative Assessment and Model Ablations

7. Contextual Impact and Paradigm Shifts

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

DocTron-Formula Framework

1. Core Architecture and Design Principles

2. Construction and Structure of the CSFormula Dataset

3. Training Pipeline and Optimization

4. Quantitative Results and Comparative Performance

5. Robustness Across Noise, Layout, and Notation

6. Qualitative Assessment and Model Ablations

7. Contextual Impact and Paradigm Shifts

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research