
Formula OCR: Automated Math Recognition

Updated 23 June 2026
  • Formula OCR is an automated process that transforms complex mathematical expressions in images into structured markup languages such as LaTeX and Mathpix-Markdown.
  • State-of-the-art systems employ unified vision–language models with high-capacity encoders and autoregressive decoders to achieve near-human transcription accuracy of intricate formulas.
  • Reinforcement learning with syntactic unit tests and comprehensive benchmarks ensures these models deliver high formula accuracy and compilable LaTeX outputs across diverse document types.

Formula OCR refers to the automated recognition and structured transcription of mathematical expressions from images or scanned documents, converting visual formula representations into markup languages such as LaTeX or Mathpix-Markdown. The task is a core challenge in digitizing scientific literature because mathematical content is structurally complex, dense, and full of domain-specific notation. Recent advances in large-scale vision–language models (VLMs), progressive multi-stage training, and fine-grained data curation have produced systems that approach human-level accuracy in extracting formulas from complex, real-world documents (Wei et al., 2024, Wu et al., 2 Mar 2026, Wang et al., 24 Apr 2026, Zhong et al., 1 Aug 2025).

1. Architectural Foundations in Formula OCR

State-of-the-art formula OCR is dominated by unified VLM-based models that process arbitrary document images and generate formatted markup strings. Models such as GOT, FireRed-OCR, TexOCR, and DocTron-Formula employ encoder–decoder or autoregressive vision–language transformer architectures:

  • Vision Encoders: Modern approaches use high-capacity backbones (e.g., ViTDet, ViT, ResNet; parameter counts from 80M to 7B) to extract dense two-dimensional spatial features from high-resolution input images (typically 1024×1024 pixels) (Wei et al., 2024, Wu et al., 2 Mar 2026, Zhong et al., 1 Aug 2025). These encoders compactly summarize the spatial detail needed to resolve nested formula structure, such as fractions, superscripts, integrals, and matrices.
  • Language Decoders: Long-context decoders (Qwen-0.5B, Qwen3-VL, or autoregressive LLMs) support output sequences exceeding thousands of tokens, a necessity for page-level transcriptions containing multi-line, nested mathematics (Wei et al., 2024, Wu et al., 2 Mar 2026, Wang et al., 24 Apr 2026).
  • Bridging Vision and Language: A linear projection or connector layer (e.g., 1024×1024) mediates between visual embeddings and the language decoder to ensure matched dimensionality (Wei et al., 2024).
  • Spatial Reasoning: Positional encoding (2D-RoPE) and windowed attention mechanisms capture fine-grained layout structure, crucial for distinguishing symbols that differ only by position, such as subscript/superscript, or embedded constructs within dense equations (Zhong et al., 1 Aug 2025).

No dedicated tree-structured decoder or grammar module is required, as transformer models learn structural regularities through large-scale autoregressive fine-tuning on markup tokens (Zhong et al., 1 Aug 2025).
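
The connector step described above can be illustrated with a minimal sketch in plain Python. The function names `make_connector` and `project`, the random initialisation, and the toy dimensions are assumptions made for illustration; none of the cited systems is claimed to be implemented this way.

```python
import random


def make_connector(d_vision, d_model, seed=0):
    """Build a hypothetical linear connector: a (d_vision x d_model) matrix and a bias."""
    rng = random.Random(seed)
    scale = 1.0 / d_vision ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(d_model)]
               for _ in range(d_vision)]
    bias = [0.0] * d_model
    return weights, bias


def project(visual_tokens, weights, bias):
    """Map each visual token (length d_vision) to the decoder width (length d_model)."""
    d_vision, d_model = len(weights), len(bias)
    projected = []
    for token in visual_tokens:
        if len(token) != d_vision:
            raise ValueError("token width does not match connector input width")
        projected.append([
            bias[j] + sum(token[i] * weights[i][j] for i in range(d_vision))
            for j in range(d_model)
        ])
    return projected


# Toy usage: 4 visual tokens of width 8 are mapped to a decoder width of 6.
weights, bias = make_connector(d_vision=8, d_model=6)
tokens = [[0.1 * (i + j) for j in range(8)] for i in range(4)]
embedded = project(tokens, weights, bias)
print(len(embedded), len(embedded[0]))
```

The projected tokens have the same width as the decoder's text embeddings, so they can be placed in the same input sequence as the instruction tokens.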

2. Data Preparation and Representation

High-precision formula OCR requires diverse, structurally rich data that reflects the full document distribution:

  • Geometry + Semantics Data Factories: To address the rarity and diversity of formula layouts, data curation pipelines cluster document images by visual layout (single-column, formula-heavy, dense tables) and semantic tags (language, genre, scanned/born-digital) (Wu et al., 2 Mar 2026). Stratified sampling ensures rare or complex formula types are properly weighted.
  • Annotation Normalization: Formula ground truths are systematically re-annotated into a unified Markdown+LaTeX style (inline as “$...$”, display as “$$...$$”) regardless of source representation (e.g., MathML, HTML tokens) (Wu et al., 2 Mar 2026).
  • Synthetic Data Generation: Render-based synthesis pipelines sample complex expressions (from external collections like latex-formulas-80M), nesting fractions, limits, and summations to produce ground truth–aligned images and formulas (Wu et al., 2 Mar 2026).
  • CSFormula Dataset: Over 5.8 million StackExchange pages were crawled and filtered to create CSFormula, providing paired images and LaTeX across line, paragraph, and page levels for fine-tuning and robust evaluation (Zhong et al., 1 Aug 2025).
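
As a rough illustration of annotation normalization, the sketch below rewrites a few common LaTeX math delimiters into a single convention. It assumes the target style is `$...$` for inline and `$$...$$` for display math; the function name `normalize_math_delimiters` is hypothetical, and MathML or HTML sources would need a separate converter that is not shown.

```python
import re

# Assumed target convention: inline math as $...$, display math as $$...$$.
# The lookbehinds skip LaTeX line breaks such as "\\[2pt]".
_DISPLAY_ENV = re.compile(
    r"\\begin\{equation\*?\}(.+?)\\end\{equation\*?\}", re.DOTALL)
_DISPLAY_BRACKETS = re.compile(r"(?<!\\)\\\[(.+?)(?<!\\)\\\]", re.DOTALL)
_INLINE_PARENS = re.compile(r"(?<!\\)\\\((.+?)(?<!\\)\\\)", re.DOTALL)


def normalize_math_delimiters(text):
    """Rewrite \\(...\\), \\[...\\] and equation environments to $ / $$ style."""
    def display(match):
        return "$$" + match.group(1).strip() + "$$"

    def inline(match):
        return "$" + match.group(1).strip() + "$"

    text = _DISPLAY_ENV.sub(display, text)
    text = _DISPLAY_BRACKETS.sub(display, text)
    text = _INLINE_PARENS.sub(inline, text)
    return text


print(normalize_math_delimiters(r"Let \(x^2\) and \[ \frac{a}{b} \]"))
```

Replacement callbacks are used instead of replacement strings so that backslashes inside the formula body are copied through unchanged.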

3. Training Objectives and Optimization

Formula OCR models are primarily trained via standard autoregressive cross-entropy losses over markup tokens, augmented by reinforcement learning (RL) approaches to directly enforce structural and syntactic validity:

  • Autoregressive Decoding: Training objective is

$$\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{<t}, H_0)$$

where $H_0$ is the encoder output and $y_t$ is the token at position $t$ (Wei et al., 2024, Zhong et al., 1 Aug 2025).

  • Supervised Fine-Tuning (SFT): All models perform large-scale SFT on image–markup pairs, including millions of formulas in diverse environments and languages (Wei et al., 2024, Zhong et al., 1 Aug 2025).
  • Reinforcement Learning with Verifiable Rewards: RL stages use group-based policy optimization. Each decoded output is run through binary “unit tests” (LaTeX compilation, delimiter matching, structural checks), and a reward is assigned for each test passed.
  • Format-Constrained Policy Optimization: FireRed-OCR introduces “Format-Constrained GRPO,” which computes composite rewards from LaTeX compilation, closure, and table structure checks during RL (Wu et al., 2 Mar 2026).
  • No Explicit Auxiliary Losses: The leading frameworks avoid specialized auxiliary objectives (e.g., bounding box loss, explicit grammar trees); general vision–language training suffices with appropriate data curation (Wei et al., 2024, Zhong et al., 1 Aug 2025).
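
To make the unit-test reward idea concrete, here is a minimal sketch that scores a decoded string with three purely syntactic checks and then converts a group of rewards into group-normalized advantages, in the general style of GRPO. The specific checks, the equal weighting, and all function names are assumptions for illustration; this is not the reward used by TexOCR or FireRed-OCR, and a real pipeline would also attempt an actual LaTeX compilation, which is omitted here.

```python
import re
from statistics import mean, pstdev


def braces_balanced(latex):
    """True if unescaped { and } are balanced and never close before opening."""
    depth, i = 0, 0
    while i < len(latex):
        ch = latex[i]
        if ch == "\\":
            i += 2  # skip the escaped character, e.g. \{ or \}
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
        i += 1
    return depth == 0


def environments_matched(latex):
    """True if every \\begin{name} is closed by a matching \\end{name}."""
    stack = []
    for kind, name in re.findall(r"\\(begin|end)\{([^}]*)\}", latex):
        if kind == "begin":
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack


def left_right_paired(latex):
    """True if \\left and \\right occur in properly nested pairs."""
    depth = 0
    for token in re.findall(r"\\(left|right)(?![a-zA-Z])", latex):
        depth += 1 if token == "left" else -1
        if depth < 0:
            return False
    return depth == 0


UNIT_TESTS = (braces_balanced, environments_matched, left_right_paired)


def reward(latex):
    """Fraction of binary unit tests passed (equal weights are an assumption)."""
    return sum(test(latex) for test in UNIT_TESTS) / len(UNIT_TESTS)


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


samples = [r"\frac{a}{b}", r"\frac{a}{b", r"\left( x \right)"]
rewards = [reward(s) for s in samples]
print(rewards, group_advantages(rewards))
```

Samples that pass more checks than the group average receive a positive advantage, and the malformed sample receives a negative one, which is the signal a group-based policy update would use.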

4. Input Modalities, Region Prompts, and Output Formatting

  • Preprocessing: Images are normalized to a fixed resolution; for ultra-high-resolution documents, windowed tiling and merging are applied (Wei et al., 2024).
  • Region-Guided OCR: Interactive or fine-grained region recognition is enabled through:
    • Coordinate-based prompts ([x1,y1,x2,y2]) for bounding-box extraction
    • Color-based cues (e.g., drawing a red/green/blue frame) to guide attention (Wei et al., 2024)
  • Instruction Prompts: Models accept short instruction sequences indicating required outputs (e.g., “Recognize the formula. Output in Mathpix-Markdown.”) (Wei et al., 2024, Zhong et al., 1 Aug 2025).
  • Output Formats: Systems generate formulas in Mathpix-Markdown, canonical LaTeX (with strict delimiter distinction), or other domain-specific formats. Display and inline formulas are handled distinctly depending on document context (Wei et al., 2024, Wu et al., 2 Mar 2026).
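
The windowed tiling mentioned under preprocessing can be sketched as follows. The tile size of 1024, the overlap of 128 pixels, and the function name `tile_boxes` are assumptions chosen for illustration; the cited systems may tile differently, and the merging of per-tile outputs is not shown.

```python
def tile_boxes(width, height, tile=1024, overlap=128):
    """Return (x1, y1, x2, y2) windows that cover a width x height image."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap

    def starts(length):
        # A single window suffices when the image fits inside one tile.
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, stride))
        positions.append(length - tile)  # last window sits flush with the edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


for box in tile_boxes(2048, 1024):
    print(box)
# (0, 0, 1024, 1024)
# (896, 0, 1920, 1024)
# (1024, 0, 2048, 1024)
```

Each box uses the same [x1,y1,x2,y2] corner convention as the coordinate-based region prompts above, so a tile could in principle be passed to the model as a region of interest.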

Representative output for an integral image:

`%%%%5%%%%`

(Wei et al., 2024)

5. Experimental Benchmarks and Quantitative Results

Performance of Formula OCR systems is evaluated with structural and character-level metrics, as well as document-level compilation success:

| Model | Formula Accuracy (FA) | FormulaCDM (%) | Compilation Success (%) | Edit Distance (ED) |
|---|---|---|---|---|
| TexOCR (SFT+RLVR) | 85.9 | | High | |
| FireRed-OCR-2B | | 91.71 | | |
| DocTron-Formula (7B) | | 87.3 (CSFormula) | | 0.164 (avg) |
| GOT (multi-crop, 1K) | | | | 0.159 |

  • Dynamic Resolution: Sliding window or multi-crop input strategies yield +11.6 F1 improvement for formulas by recovering small/high-density regions (Wei et al., 2024).
  • RL Gains: Reinforcement learning stages show substantial improvements over SFT alone (+11 points FA in TexOCR), particularly in structural and syntax-sensitive scenarios (e.g., page-to-LaTeX compilation) (Wang et al., 24 Apr 2026).
  • Benchmark Leadership: On OmniDocBench and TexOCR-Bench, FireRed-OCR and TexOCR achieve state-of-the-art results, with FireRed-OCR-2B exceeding previous end-to-end and pipeline systems (FormulaCDM 91.71%) (Wu et al., 2 Mar 2026, Wang et al., 24 Apr 2026). DocTron-Formula achieves the lowest normalized Edit Distance (ED=0.164) and the highest CDM on CSFormula (Zhong et al., 1 Aug 2025).
  • Structural Robustness: RL with unit tests addresses hard problems such as sub/superscript misplacement, delimiter mismatches, and operator omission (Wang et al., 24 Apr 2026).
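
For reference, a normalized edit distance of the kind reported above can be computed as in the sketch below. It assumes a character-level Levenshtein distance divided by the length of the longer string; the benchmarks cited here may tokenize or normalize differently, so the numbers are not directly comparable.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(prediction, reference):
    """Edit distance scaled to [0, 1] by the longer string's length (assumed)."""
    if not prediction and not reference:
        return 0.0
    return edit_distance(prediction, reference) / max(len(prediction), len(reference))


print(normalized_edit_distance("x^2", "x^{2}"))  # 0.4
```

The example also shows a weakness of string-level metrics: `x^2` and `x^{2}` render identically yet score 0.4, which is one reason rendering-aware metrics such as CDM are reported alongside ED.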

6. Error Profiles, Open Challenges, and Future Directions

Error analysis across these systems identifies several dominant challenges:

  • Dense Layout Ambiguity: Overlapping or closely-packed symbols at page level can cause index misassociation, especially under heavy formulas or matrices (Zhong et al., 1 Aug 2025).
  • Delimiter Errors: Unmatched or stray delimiters (“$”, “$$”) can break semantic parsing or compilation. RL-based unit tests penalize such cases (Wang et al., 24 Apr 2026).
  • Operator Omissions: Missing or corrupted operators (e.g., “\cdot”, “\times”) lead to semantic errors; sequence-level alignment rewards help mitigate these failures (Wang et al., 24 Apr 2026).
  • Notation Diversity: Rare or domain-specific notation, such as Dirac delta or chemical reaction arrows, remains a source of occasional misrecognition. Exposure to multidisciplinary data reduces, but does not eliminate, these errors (Zhong et al., 1 Aug 2025).
  • Compilability and Usability: Traditional OCR methods often fail to recover structurally correct, compilable LaTeX; models explicitly optimized for unit-test–driven RLVR show the best results in end-to-end usability (Wang et al., 24 Apr 2026).

Anticipated directions include lightweight grammar-checker integration, joint modeling of tables/figures with formulas, and further data augmentation with low-resource styles (e.g., handwritten, scanned archives).

7. Implications and Applications

Formula OCR has immediate utility in automating scientific knowledge extraction, digitizing legacy collections, enabling interactive mathematical interfaces, and populating semantic research repositories (Zhong et al., 1 Aug 2025). Recent evidence shows that sufficiently large and diverse VLMs, fine-tuned with structurally challenging data and RL-style syntax enforcement, can match or exceed the performance of prior specialized pipelines and hand-crafted systems (Zhong et al., 1 Aug 2025, Wang et al., 24 Apr 2026).

Formula OCR, as an integrated subdomain of OCR-2.0 frameworks, now encompasses recognition across mixed document types (text, tables, geometry, music, chemistry), delivering not only transcription accuracy but also executable, structurally faithful scientific markup (Wei et al., 2024, Wang et al., 24 Apr 2026).
