DocTron-Formula Framework
- DocTron-Formula is a unified OCR framework designed to recognize complex mathematical formulas in diverse scientific documents.
- It leverages a multimodal transformer architecture with integrated vision-language tokens to convert document images into accurate LaTeX representations.
- Benchmark results on the CSFormula dataset show lower edit distances and higher character detection matching compared to specialized OCR systems.
Optical Character Recognition (OCR) for mathematical formula recognition remains a foundational challenge in scientific document analysis. The DocTron-Formula framework is a unified, end-to-end approach to formula recognition that builds on general-purpose vision-language models (VLMs) without architectural specialization. It combines a scalable fine-tuning regimen with the newly introduced CSFormula dataset to achieve state-of-the-art (SOTA) performance in handling the structural diversity and complexity inherent to real-world scientific documents (Zhong et al., 1 Aug 2025).
1. Core Architecture and Design Principles
DocTron-Formula is instantiated atop Qwen2.5-VL, a multimodal transformer model combining a Vision Transformer (ViT) with an LLM. The vision encoder resizes document images to dimensions $H \times W$ (both multiples of 28), partitions them into patches, and applies 2D Rotary Positional Encoding (2D-RoPE) alongside windowed self-attention. The resulting tokens are grouped in adjacent neighborhoods and processed through a two-layer multilayer perceptron to yield tokens of dimension $d$, matching the LLM hidden size.
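To make the resizing and grouping step concrete, the sketch below counts how many vision tokens an image would produce. The patch size of 14 and the 2×2 merge neighborhood are assumptions taken from the usual Qwen2.5-VL convention (they are not stated in the text above), and `snap_to_unit` and `vision_token_count` are hypothetical helper names.

```python
PATCH = 14            # assumed ViT patch size (not stated in the text above)
MERGE = 2             # assumed 2x2 neighborhood merged by the two-layer MLP
UNIT = PATCH * MERGE  # 28, the multiple named in the text


def snap_to_unit(side: int) -> int:
    """Round a side length to the nearest multiple of 28 (minimum 28)."""
    return max(UNIT, (side + UNIT // 2) // UNIT * UNIT)


def vision_token_count(height: int, width: int) -> dict:
    """Return the resized shape, raw patch count, and merged token count."""
    h, w = snap_to_unit(height), snap_to_unit(width)
    patches = (h // PATCH) * (w // PATCH)
    tokens = patches // (MERGE * MERGE)  # one token per merged neighborhood
    return {"resized": (h, w), "patches": patches, "tokens": tokens}


info = vision_token_count(1000, 700)
print(info)
```

Under these assumptions the number of tokens passed to the LLM grows linearly with image area, which is why page-level inputs are considerably more expensive than line-level crops.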
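DocTron-Formula eschews task-specific modules such as detection heads or grammar decoders, relying instead on the LLM's generative capacity. The Transformer stack receives as input the concatenated vision tokens $V$ and textual instruction tokens $T$, fully integrated through cross-modal self-attention at every layer. In the standard pre-norm residual form, layer $l$ computes
$$\tilde{H}^{(l)} = H^{(l-1)} + \mathrm{MSA}\big(\mathrm{Norm}(H^{(l-1)})\big), \qquad H^{(l)} = \tilde{H}^{(l)} + \mathrm{FFN}\big(\mathrm{Norm}(\tilde{H}^{(l)})\big), \qquad H^{(0)} = [V; T],$$
where MSA denotes multi-head self-attention, FFN a position-wise feed-forward network, and Norm a normalization layer.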
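The training objective is the autoregressive negative log-likelihood of the target LaTeX token sequence $y_1, \dots, y_N$, conditioned on the vision tokens $V$ and instruction tokens $T$:
$$\mathcal{L}_{\text{NLL}} = -\sum_{t=1}^{N} \log p_\theta\big(y_t \mid y_{<t}, V, T\big)$$
No explicit localization or detection loss is incorporated ($\lambda_{\text{loc}} = 0$, $\lambda_{\text{det}} = 0$ in the generalized objective $\mathcal{L} = \mathcal{L}_{\text{NLL}} + \lambda_{\text{loc}}\mathcal{L}_{\text{loc}} + \lambda_{\text{det}}\mathcal{L}_{\text{det}}$).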
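As a small worked illustration of the autoregressive negative log-likelihood, the sketch below sums the negative log-probabilities assigned to each target token. The four-token vocabulary and the probability values are made up for illustration, and `sequence_nll` is a hypothetical helper rather than anything from the DocTron-Formula code.

```python
import math


def sequence_nll(step_probs, target_ids):
    """Sum of -log p(y_t | y_<t, image, prompt) over the target sequence.

    step_probs[t] is the predicted distribution over the vocabulary at
    step t, already conditioned on the image, prompt, and earlier targets.
    """
    return -sum(math.log(step_probs[t][y]) for t, y in enumerate(target_ids))


# Toy vocabulary (hypothetical): 0="\frac", 1="{", 2="}", 3="x"
probs = [
    [0.70, 0.10, 0.10, 0.10],  # step 0: model favors "\frac"
    [0.10, 0.80, 0.05, 0.05],  # step 1: model favors "{"
    [0.05, 0.05, 0.10, 0.80],  # step 2: model favors "x"
]
target = [0, 1, 3]

loss = sequence_nll(probs, target)
print(f"NLL = {loss:.3f}")  # equals -ln(0.7 * 0.8 * 0.8)
```

Because the loss is defined only over output tokens, the same objective applies unchanged to line-, paragraph-, and page-level targets.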
2. Construction and Structure of the CSFormula Dataset
CSFormula is a large-scale dataset curated from approximately 5.8 million StackExchange pages, explicitly designed to capture diverse structural and notational formulaic content. The dataset supports recognition at three granularity levels: line, paragraph, and page.
| Level | Train set | Test set | Example content |
|---|---|---|---|
| Line | 741,016 | 1,000 | Single isolated formula |
| Paragraph | 135,575 | 1,000 | Inline/interleaved formulae |
| Page | 131,876 | 1,000 | Multi-line, figures, tables |
The dataset spans mathematics, physics, chemistry, engineering, and applied sciences. Emphasis is placed on deep parse trees (nested radicals, fractions), diverse symbol sets (chemical/reaction notation, statistical symbols), multiline alignments, and the interplay of formulae within paragraph and page layouts. Representative samples illustrate both simple expressions and complex page-level scientific document segments.
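A quick arithmetic check on the split sizes shows how the three granularity levels are balanced in the training set. The counts are copied from the table; the variable names are illustrative only.

```python
# Training-set sizes per granularity level, copied from the table.
train_counts = {"line": 741_016, "paragraph": 135_575, "page": 131_876}

total = sum(train_counts.values())
shares = {level: count / total for level, count in train_counts.items()}

print(f"total training samples: {total:,}")
for level, share in shares.items():
    print(f"{level:>9}: {share:.1%}")
```

Line-level samples make up roughly three quarters of the training data, with paragraph and page samples contributing about 13% each.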
3. Training Pipeline and Optimization
DocTron-Formula is initialized from Qwen2.5-VL (7B parameters) and trained using supervised fine-tuning (SFT) across combined line-, paragraph-, and page-level samples. The training prompt is consistently: “Please convert the following image of a scientific document to LaTeX.” No data augmentation is used except for standard resizing; cropping and rotation have no measurable benefit.
Key hyperparameters include the AdamW optimizer with weight decay 0.1, a learning rate that rises to its peak through linear warmup and then decays, a batch size of 32 images per GPU, and five epochs over the training set. The loss objective remains purely the generative negative log-likelihood; no detection or segmentation modules are trained.
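The warmup-then-decay schedule can be sketched as a plain function of the step index. This is a minimal sketch assuming the decay phase is linear down to zero; `peak_lr`, `warmup_steps`, and `total_steps` are placeholders, since the actual values are not recoverable from the text.

```python
def lr_at_step(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup to peak_lr, then linear decay to zero (assumed shape)."""
    if step < warmup_steps:
        # Ramp up so that the last warmup step reaches peak_lr exactly.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = (step - warmup_steps) / decay_steps
    return peak_lr * max(0.0, 1.0 - progress)


# Illustrative values only: 100 steps, 10 warmup steps, peak of 1.0.
schedule = [lr_at_step(s, total_steps=100, warmup_steps=10, peak_lr=1.0) for s in range(100)]
print(schedule[0], schedule[9], schedule[55])
```

In a real training run the same function would typically be wrapped in the framework's scheduler interface and paired with AdamW at weight decay 0.1.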
Ablation experiments reveal that training with only line-level data impairs performance on structured or page layouts. Full joint training across levels enables best generalization: line ED=0.121, paragraph ED=0.123, and page ED=0.272.
4. Quantitative Results and Comparative Performance
Evaluation utilizes Edit Distance (ED, lower is better) and Character Detection Matching (CDM, higher is better). ED reflects the token operations required to match predicted LaTeX to reference; CDM compares rendered images at the character level.
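The ED metric can be illustrated with a token-level Levenshtein distance. The sketch below assumes whitespace tokenization and normalization by the longer sequence length, which is one common convention that keeps scores in [0, 1]; the benchmark's exact tokenization and normalization may differ.

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def normalized_ed(pred, ref) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    if not pred and not ref:
        return 0.0
    return edit_distance(pred, ref) / max(len(pred), len(ref))


pred = r"\frac { a } { b }".split()
ref = r"\frac { a } { c }".split()
print(normalized_ed(pred, ref))  # one substitution over seven tokens
```

CDM is not sketched here because it requires rendering both LaTeX strings and matching characters in image space.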
| Model (ED, lower is better) | Im2LaTeX-160k | UniMER Avg. | CSFormula Avg. |
|---|---|---|---|
| UniMERNet (SOTA) | 0.240 | 0.103 | 0.679 |
| Qwen2.5-VL | 0.310 | 0.303 | 0.472 |
| GPT-4o | 0.434 | 0.545 | 0.402 |
| Gemini-2.5-flash | 0.424 | 0.531 | 0.394 |
| Mathpix | 0.449 | 0.516 | 0.457 |
| DocTron-Formula | 0.245 | 0.098 | 0.164 |
| Model (CDM, higher is better) | Im2LaTeX-160k | UniMER Avg. | CSFormula Avg. |
|---|---|---|---|
| UniMERNet | 0.991 | 0.965 | 0.524 |
| Qwen2.5-VL | 0.971 | 0.911 | 0.622 |
| GPT-4o | 0.929 | 0.783 | 0.536 |
| Gemini-2.5-flash | 0.973 | 0.882 | 0.732 |
| Mathpix | 0.969 | 0.949 | 0.733 |
| DocTron-Formula | 0.985 | 0.961 | 0.873 |
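On CSFormula, DocTron-Formula achieves the lowest ED (0.164) and the highest CDM (0.873) of all systems compared, ahead of the specialized UniMERNet, the commercial Mathpix, and the general VLMs Qwen2.5-VL, GPT-4o, and Gemini-2.5-flash (Zhong et al., 1 Aug 2025). It also has the lowest ED on UniMER (various print quality/handwriting). UniMERNet keeps a narrow lead elsewhere: 0.240 vs. 0.245 ED on Im2LaTeX-160k, and 0.991 vs. 0.985 and 0.965 vs. 0.961 CDM on Im2LaTeX-160k and UniMER, respectively.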
5. Robustness Across Noise, Layout, and Notation
DocTron-Formula’s performance generalizes robustly across noisy and heterogeneous input. For noise resilience, the Screen-Captured (SCE) UniMER subset shows DocTron-Formula at ED=0.182, compared to UniMERNet at 0.224, indicating superior resistance to compression artifacts and illumination variance.
On complex page layouts (CSFormula, page-level), DocTron-Formula attains ED=0.251 and CDM=0.774. The model parses documents containing tables, interleaved text, and multiple math environments directly, without any intermediate segmentation step.
Training on a multidisciplinary corpus (mathematics, physics, chemistry) facilitates strong zero-shot handling of rare or domain-specific symbols, such as “⧧” or “⇌”, and highly specialized notation from particle physics.
6. Qualitative Assessment and Model Ablations
Qualitative analysis confirms recovery of deeply nested, multi-line, and tensor expressions.
DocTron-Formula produces coherent LaTeX even for full-page documents, preserving section structure, aligned environments, and floats.
Ablations show that DocTron-Formula trained with joint line/paragraph/page data yields the best all-level accuracy. Increasing model size to 7B parameters is beneficial, though even 3B parameter versions surpass the 7B Qwen2.5-VL baseline, highlighting the effect of domain data adaptation.
7. Contextual Impact and Paradigm Shifts
DocTron-Formula establishes a new paradigm in formula recognition by demonstrating that general VLMs, when fine-tuned on a large, structurally varied dataset, can match or exceed specialized OCR systems in both accuracy and robustness. It dispenses with the need for engineered encoders, decoders, or post-processing logic, relying solely on end-to-end generative architectures and cross-modal self-attention. CSFormula provides the first extensive, public benchmark spanning line, paragraph, and page layout complexities for formula OCR and general document analysis. These contributions redefine the methodological landscape for automated scientific document understanding (Zhong et al., 1 Aug 2025).