Vision-to-LaTeX Systems Overview

Updated 23 June 2026

Vision-to-LaTeX systems are neural architectures that transform visual inputs (e.g., formulas, tables) into syntactically correct and executable LaTeX markup.
They employ encoder–decoder frameworks with CNNs, transformers, and attention mechanisms to capture spatial relationships and complex document layouts.
Key challenges include ensuring cross-page consistency, managing deep nesting errors, and optimizing both visual fidelity and LaTeX compilability.

Vision-to-LaTeX systems are neural models that transduce images containing mathematical formulas, scientific tables, or full-page document layouts into syntactically correct LaTeX markup. These models address the complex problem of mapping two-dimensional visual structures—with fine-grained spatial relationships and diverse symbol sets—into linear sequences of LaTeX tokens, preserving both meaning and renderability. Vision-to-LaTeX has become central to robust mathematical OCR, automated scientific document reconstruction, and data extraction from academic literature, with models now evaluated on metrics including BLEU, structure-aware edit distance, visual similarity, and executable compilability (Singh, 2018, Kayal et al., 2021, Kayal et al., 2022, Gurgurov et al., 2024, Sundararaj et al., 2024, Li et al., 28 Jul 2025, Ling et al., 22 Sep 2025, Wang et al., 24 Apr 2026, Deng et al., 2016).

1. Architectural Foundations

Vision-to-LaTeX architectures are predominantly encoder-decoder models that map an input image $x \in \mathbb{R}^{H \times W \times C}$ to a target LaTeX token sequence $y=(y_1, \ldots, y_\tau)$ such that $y$ is syntactically executable and semantically equivalent to the visual content.

Encoders. Early works adopted CNN stacks (e.g., 5-layer CNN with $2\times2$ max-pooling and tanh activations), flattening spatial feature maps for downstream sequence modeling (Singh, 2018). More recent systems leverage deep vision backbones: ResNet-101 truncations (Kayal et al., 2022), ViT-style patch embeddings (Ling et al., 22 Sep 2025), and hierarchical Swin Transformers (Gurgurov et al., 2024). Vision Transformers and non-local attention blocks have demonstrated superior ability to capture long-range dependencies in images of dense tabular or mathematical objects (Sundararaj et al., 2024, Kayal et al., 2021).

Decoders. The decoder is typically a stack of LSTM layers (Singh, 2018), a transformer decoder (Kayal et al., 2022, Gurgurov et al., 2024), or a large-scale causal LLM (e.g., GPT-2, Qwen3-VL-2B) (Wang et al., 24 Apr 2026). For mathematical or tabular data, the decoder autoregressively emits tokens from a rich LaTeX vocabulary, guided by an attention mechanism and conditioned on both the previous tokens and a visual context.
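Autoregressive emission can be illustrated with a minimal greedy decoding loop. This is only a sketch: `next_token_probs` is a hypothetical callable standing in for any decoder (LSTM, transformer, or LLM) that returns a token-to-probability mapping given the prefix and the visual context, and the `<bos>`/`<eos>` markers and `max_len` default are assumptions rather than values taken from the cited systems.

```python
def greedy_decode(next_token_probs, visual_context,
                  bos="<bos>", eos="<eos>", max_len=64):
    """Greedily decode a LaTeX token sequence.

    next_token_probs(prefix, visual_context) -> dict[str, float]
    is a hypothetical stand-in for a trained decoder step.
    """
    tokens = [bos]
    # Emit at most max_len tokens after the start marker.
    while len(tokens) - 1 < max_len:
        probs = next_token_probs(tokens, visual_context)
        # Greedy choice: the single most probable next token.
        best = max(probs, key=probs.get)
        if best == eos:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start marker


# Toy "decoder" that replays a fixed formula, for illustration only.
_TARGET = ["x", "^", "2", "<eos>"]

def toy_decoder(prefix, visual_context):
    return {_TARGET[len(prefix) - 1]: 0.9, "y": 0.1}

decoded = greedy_decode(toy_decoder, visual_context=None)
```

Beam search or sampling would replace the `max` step; greedy selection is used here only because it is the simplest instance of autoregressive decoding.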

Attention. Attention modules range from classical MLP-based soft attention (Singh, 2018) and Show-Attend-Tell variants, to hierarchical and coarse-to-fine approaches that restrict the attention focus for computational efficiency (Deng et al., 2016). SOTA models extract cross-attention maps not only for guiding generation but also for interpretability and, in some frameworks, for bootstrapping iterative refinement (Li et al., 28 Jul 2025).
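The core of soft attention is a softmax over per-location scores followed by a weighted sum of image features. The sketch below swaps in dot-product scoring for the MLP-based scoring used in the classical models, purely to keep the example short; the function name and the toy vectors are illustrative assumptions.

```python
import math

def soft_attention(query, features):
    """Return (weights, context) for dot-product soft attention.

    query:    list[float] of length d (e.g. a decoder state).
    features: list of d-dimensional vectors, one per image location.
    Note: dot-product scoring is a simplification; classical models
    score each location with a small MLP instead.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    # Numerically stable softmax: subtract the max before exponentiating.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted average of the feature vectors.
    dim = len(query)
    context = [sum(w * feat[i] for w, feat in zip(weights, features))
               for i in range(dim)]
    return weights, context


weights, context = soft_attention([1.0, 0.0], [[5.0, 0.0], [0.0, 5.0]])
```

In this toy call the first location aligns with the query, so it receives almost all of the attention mass and the context vector is close to that location's feature.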

Fusion and Refinement. Advanced systems integrate attention-guided feedback and iterative refinement, rendering predicted LaTeX and comparing against the original image to identify and rectify errors in subsequent inference rounds (Li et al., 28 Jul 2025). In high-fidelity Table2LaTeX conversion, multimodal transformers with dual-reward RL explicitly optimize both code structure and rendered visual similarity metrics (Ling et al., 22 Sep 2025).

2. Datasets and Supervised Objectives

Vision-to-LaTeX systems rely on large-scale paired datasets: formula images mapped to LaTeX (Im2LaTeX-100K (Singh, 2018, Deng et al., 2016)), table images with structure and content markup (ICDAR-TSRD/TCRD (Kayal et al., 2021), Tab-To-Tex (Kayal et al., 2022)), and full-page arXiv PDFs with automatically aligned LaTeX source (TexOCR-Train (Wang et al., 24 Apr 2026)). Recent benchmarks introduce “hard” curated subsets for rigorous stress-testing, such as Img2LaTeX-Hard-1K (Li et al., 28 Jul 2025).

Tokenization and Normalization. The output sequence is typically tokenized into command-level, macro-level, or character-level units, preserving LaTeX’s syntactic and semantic structure. Canonical preprocessing involves normalization (AST-based, elimination of rare macros), padding for fixed input size, and adding special tokens for line breaks and delimiters (Singh, 2018, Kayal et al., 2022).
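A command-level tokenizer can be sketched with a single regular expression. The pattern and the `<bos>`/`<eos>` special tokens below are assumptions for illustration, not the preprocessing of any cited system; real pipelines typically add normalization steps that this sketch omits.

```python
import re

# Control words (\frac), control symbols (\\, \{), then any other
# single non-whitespace character. Whitespace is discarded.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|\S")

def tokenize_latex(source):
    """Split a LaTeX string into command-level tokens."""
    return TOKEN_RE.findall(source)

def add_special_tokens(tokens, bos="<bos>", eos="<eos>"):
    """Wrap a token list in (hypothetical) start/end markers."""
    return [bos, *tokens, eos]


tokens = tokenize_latex(r"\frac{a}{b} + x_1")
# tokens == ['\\frac', '{', 'a', '}', '{', 'b', '}', '+', 'x', '_', '1']
```

Treating `\frac` as one unit rather than five characters keeps target sequences short and keeps each vocabulary entry aligned with a LaTeX command.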

Training Objectives. The standard training criterion is the length-normalized cross-entropy of the target sequence: $\mathcal{L}_{\mathrm{CE}} = -\frac{1}{\tau} \sum_{t=1}^{\tau} \log P_r(y_t \mid y_{<t}, a)$, where $a$ is the embedded visual context and $\tau$ is the target length. Advanced variants add regularization (L2 weight decay, attention regularization) and, in RL-augmented systems, policy gradients with document-level rewards for structure, visual similarity, and compilability (Wang et al., 24 Apr 2026, Ling et al., 22 Sep 2025).
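As a small worked instance of the cross-entropy criterion, the function below averages the negative log-probabilities that a model assigned to each correct target token. The function name and the example probabilities are illustrative assumptions; in practice this is computed by a deep-learning framework over batched logits.

```python
import math

def sequence_cross_entropy(step_probs):
    """Length-normalized cross-entropy of one target sequence.

    step_probs[t] is the probability the model assigned to the correct
    token y_t given the previous tokens and the visual context.
    Probabilities must be strictly positive (math.log raises otherwise).
    """
    if not step_probs:
        raise ValueError("step_probs must not be empty")
    return -sum(math.log(p) for p in step_probs) / len(step_probs)


# A model that assigns probability 0.5 to every correct token
# incurs a loss of log(2) per token.
loss = sequence_cross_entropy([0.5, 0.5, 0.5])
```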

3. Task Decomposition and Specialized Pipelines

For challenging domains such as tables and full-page layouts, the vision-to-LaTeX problem is decomposed into sub-tasks to improve tractability.

Table Structure vs. Content Decoding. ICDAR-2021 and related works split tabular recognition into (1) Table Structure Reconstruction (TSR)—predicting the flat/environment structure, column/row alignments, cell delimiters—and (2) Table Content Reconstruction (TCR)—emitting cell-level content tokens, including math symbols and LaTeX macros (Kayal et al., 2021, Kayal et al., 2022). This mirrors the intrinsic grammar of the LaTeX tabular environment; joint or staged decoding offers flexibility but can be brittle to upstream errors.
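The structure/content split can be made concrete on the LaTeX side. The sketch below separates a simple tabular body into a structure sequence, with each cell replaced by a hypothetical `<cell>` placeholder, and a list of cell contents. It assumes plain rows separated by `\\` and cells separated by `&`; it does not handle `\multicolumn`, `\multirow`, or escaped `\&`, and it is not the annotation scheme of the cited datasets.

```python
def split_structure_content(tabular_body):
    """Split a simple tabular body into (structure_tokens, cell_contents)."""
    structure, cells = [], []
    for row in tabular_body.split(r"\\"):
        row = row.strip()
        # Rules that precede a row belong to the structure sequence.
        while row.startswith(r"\hline"):
            structure.append(r"\hline")
            row = row[len(r"\hline"):].strip()
        if not row:
            continue
        for index, cell in enumerate(row.split("&")):
            if index:
                structure.append("&")
            structure.append("<cell>")   # placeholder for the content
            cells.append(cell.strip())
        structure.append(r"\\")
    return structure, cells


structure, cells = split_structure_content(r"a & b \\ \hline c & $x^2$ \\")
```

Here `structure` describes the grid (two rows of two cells with a rule between them) and `cells` holds `a`, `b`, `c`, and `$x^2$` in reading order, which is the information the two sub-tasks predict separately.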

Page-Level and Document-Level Reconstruction. TexOCR addresses page-to-LaTeX mapping with additional parsing for sections, floats, and bibliography blocks, aligning visual regions with their source LaTeX and distinguishing compilation subregions (Wang et al., 24 Apr 2026).

Complex Structures. Models have been extended to multi-line, deeply nested, or structurally ambiguous layouts (multirow/multicolumn, nested tables, equation arrays), with explicit reward shaping or architectural mechanisms (e.g., gated attention for dynamic cross-modal weighting, parameter-efficient fine-tuning for handwriting adaptation) (Gurgurov et al., 2024, Ling et al., 22 Sep 2025, Kayal et al., 2022).

4. Evaluation Metrics and SOTA Performance

Standard evaluation tracks both syntactic fidelity and semantic/visual correctness of LaTeX predictions.

Textual Metrics. Corpus-level BLEU-4, masked token accuracy, normalized Levenshtein distance, and exact match accuracy on token sequences are ubiquitous (Singh, 2018, Kayal et al., 2021, Kayal et al., 2022, Sundararaj et al., 2024).
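Two of these textual metrics are easy to state in code. The sketch below computes token-level Levenshtein distance, a normalized variant, and exact-match rate. Dividing by the longer sequence length is one common normalization and is assumed here; individual benchmarks may normalize differently.

```python
def levenshtein(pred, ref):
    """Edit distance between two token sequences (or strings)."""
    previous = list(range(len(ref) + 1))
    for i, p in enumerate(pred, 1):
        current = [i]
        for j, r in enumerate(ref, 1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (p != r)))  # substitution
        previous = current
    return previous[-1]

def normalized_edit_distance(pred, ref):
    """Edit distance divided by the longer length (0.0 means identical)."""
    longest = max(len(pred), len(ref))
    return levenshtein(pred, ref) / longest if longest else 0.0

def exact_match_rate(preds, refs):
    """Fraction of predictions identical to their reference."""
    if len(preds) != len(refs) or not refs:
        raise ValueError("need equally sized, non-empty lists")
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


distance = levenshtein(["\\frac", "{", "a", "}"], ["\\frac", "{", "b", "}"])
```

Operating on token lists rather than raw characters means a wrong command such as `\alpha` for `\beta` counts as one error instead of several.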

Visual/Structural Metrics. Structural correctness is gauged via TEDS-Structure (tree-edit-distance on LaTeX ASTs), row/column accuracy, label matching, and recall on multirow/multicolumn tokens (Ling et al., 22 Sep 2025, Kayal et al., 2021). Visual fidelity leverages CW-SSIM (complex-wavelet SSIM) and match rates assessed by re-rendering the predicted LaTeX and computing pixel-wise or region-level similarities (Ling et al., 22 Sep 2025, Li et al., 28 Jul 2025). Compilation success—fraction of generated outputs that compile with no errors—is critical for practical usability (Wang et al., 24 Apr 2026).
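The re-render-and-compare idea can be illustrated with the simplest possible pixel-wise score. The sketch below is not CW-SSIM; it is a plain agreement rate between two binarized rasters, and it assumes both images have already been rendered and resized to the same shape by some external step not shown here.

```python
def pixel_match_rate(rendered_pred, rendered_ref):
    """Fraction of pixels that agree between two equally sized rasters.

    Each image is a list of rows; each row is a list of 0/1 pixels.
    This is a crude stand-in for perceptual metrics such as CW-SSIM.
    """
    if len(rendered_pred) != len(rendered_ref) or any(
        len(a) != len(b) for a, b in zip(rendered_pred, rendered_ref)
    ):
        raise ValueError("images must have identical shapes")
    total = sum(len(row) for row in rendered_pred)
    if total == 0:
        raise ValueError("images must not be empty")
    same = sum(pa == pb
               for row_a, row_b in zip(rendered_pred, rendered_ref)
               for pa, pb in zip(row_a, row_b))
    return same / total


score = pixel_match_rate([[0, 1], [1, 1]], [[0, 1], [0, 1]])
```

A raw pixel score like this is sensitive to small shifts in the rendering, which is one motivation for using shift-tolerant metrics in practice.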

Page/Document-Level Metrics. TexOCR-Bench introduces metrics for complex text preservation, formula and table accuracy, section/citation/reference alignment, document-level character similarity, and executable compilability (Wang et al., 24 Apr 2026).

Model/Framework	Key Metric/Score	Reference
Visual Attention CNN-RNN	BLEU = 89.0% (Im2LaTeX-140K)	(Singh, 2018)
MASTER + Transformer (VCGroup)	Structure EM = 74%, Content EM = 55% (ICDAR-2021)	(Kayal et al., 2021)
FGRT Gated Transformer	Structure EM = 70.35%, Content EM = 49.69% (Tab-To-Tex)	(Kayal et al., 2022)
Swin+GPT2+LoRA (handwriting)	BLEU = 0.67; 243 M parameters (full) + 3.1 M (LoRA)	(Gurgurov et al., 2024)
Vision Transformer (formulas)	BLEU = 0.557, token accuracy = 0.873	(Sundararaj et al., 2024)
A²R² (iterative VLM refinement)	BLEU-4 = 70.4, CW-SSIM = 93.5 (Img2LaTeX-Hard-1K)	(Li et al., 28 Jul 2025)
Table2LaTeX-RL (dual-reward RL)	CW-SSIM = 0.6145, TEDS-Structure = 0.9218 (complex tables)	(Ling et al., 22 Sep 2025)
TexOCR (page compilability RL)	Compilation = 95.2% (SFT+RL), Overall = 75.0% (9-dim score)	(Wang et al., 24 Apr 2026)

Performance has steadily improved with the transition to transformer-based architectures, gated and iterative feedback mechanisms, and RL with verifiable/executable rewards. For tables, reported exact match reaches roughly 70–74% on structure and 50–55% on content, depending on the benchmark (Kayal et al., 2021, Kayal et al., 2022, Ling et al., 22 Sep 2025).

5. Advances in Training and Optimization

Modern vision-to-LaTeX systems incorporate several training and optimization innovations:

Pretraining and Curriculum: Pretraining on synthetic data, fine-tuning on target domains (e.g., handwriting via LoRA), and multimodal pretraining on real and synthetic sources all stabilize convergence and facilitate domain transfer (Gurgurov et al., 2024, Deng et al., 2016, Sundararaj et al., 2024).
Parameter-Efficient Adaptation: LoRA enables adaptation to low-resource domains with minimal parameter overhead, essential for handwriting and niche symbol coverage (Gurgurov et al., 2024).
Reinforcement Learning with Verifiable Rewards: RL stages optimize functional criteria (e.g., successful compilation, correct label-reference structure) by executing unit tests on generated LaTeX (Wang et al., 24 Apr 2026). Dual-reward RL optimizes both TEDS-Structure and visual CW-SSIM, improving robustness on complex, nested or long-form tables (Ling et al., 22 Sep 2025).
Inference-Time Refinement: Attention-guided, visually-grounded refinement (A²R²) plugs into VLMs without further training and iteratively fixes errors, effectively boosting visual and text-level alignment beyond standard best-of-N prompting (Li et al., 28 Jul 2025).
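The dual-reward idea from the reinforcement-learning item above can be sketched as a function of two scores in [0, 1], one structural (such as TEDS-Structure) and one visual (such as CW-SSIM). The cutoff values and the unweighted sum are illustrative assumptions, not the settings of any cited system; the `binary` flag contrasts thresholded rewards with a continuous alternative.

```python
def dual_reward(structure_score, visual_score, *,
                structure_cutoff=0.9, visual_cutoff=0.6, binary=True):
    """Combine a structure score and a visual score into one reward.

    Both scores must lie in [0, 1]. The cutoffs are hypothetical
    defaults chosen for illustration only.
    """
    for name, value in (("structure_score", structure_score),
                        ("visual_score", visual_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    if binary:
        # Thresholded variant: each criterion contributes 0 or 1.
        return (float(structure_score >= structure_cutoff)
                + float(visual_score >= visual_cutoff))
    # Continuous variant: keeps fine-grained differences between outputs.
    return structure_score + visual_score


reward = dual_reward(0.95, 0.5)   # structure passes, visual does not
```

With the binary variant, two outputs scoring 0.61 and 0.99 on the visual metric earn the same reward, whereas the continuous variant distinguishes them.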

6. Limitations, Error Analysis, and Prospects

Despite substantial progress, vision-to-LaTeX systems face several persistent challenges:

Lack of Global Consistency: Most systems operate at the line, formula, table, or page granularity, lacking mechanisms for cross-page consistency (e.g., global label resolution, table of contents, chapter numbering) (Wang et al., 24 Apr 2026).
Ambiguity and Error Modes: Visually similar glyphs (“l” vs. “1”), rare LaTeX macros, parentheses/delimiter mismatches, and deep nesting (e.g., multi-level fractions) remain failure points—especially for handwritten, scanned, or low-contrast inputs (Sundararaj et al., 2024, Gurgurov et al., 2024, Li et al., 28 Jul 2025).
Evaluation and Reward Shaping: Binary thresholds (e.g., compile or fail, TEDS-Structure cutoff) may underexploit fine-grained distinctions. Incorporation of continuous, differentiable visual fidelity signals is an open direction (Ling et al., 22 Sep 2025).
Computation and Scalability: Visual-in-the-loop verification (RL reward evaluation via LaTeX compilation and image rendering) is computationally intensive and currently feasible only for smaller datasets or select hard instances (Ling et al., 22 Sep 2025).
Task Decomposition Fragility: Two-stage (structure/content) pipelines are brittle—errors in structure decoding can propagate and invalidate otherwise correct content predictions (Kayal et al., 2022).
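Delimiter mismatches, one of the error modes listed above, can be caught with a cheap stack-based check before attempting compilation. The sketch below verifies only unescaped braces and `\left`/`\right` pairs; it is an assumption-laden filter rather than a LaTeX parser, and it deliberately ignores bare parentheses and brackets, which need not balance in valid LaTeX (for example in half-open intervals).

```python
import re

# Control words, control symbols (so \{ is one token), other characters.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|\S")

OPENERS = {"}": "{", r"\right": r"\left"}

def delimiters_balanced(source):
    """Return True if braces and \\left/\\right pairs nest correctly."""
    stack = []
    for token in TOKEN_RE.findall(source):
        if token in ("{", r"\left"):
            stack.append(token)
        elif token in OPENERS:
            # A closer must match the most recent unclosed opener.
            if not stack or stack[-1] != OPENERS[token]:
                return False
            stack.pop()
    return not stack


ok = delimiters_balanced(r"\frac{a}{\left( b + c \right)}")
broken = delimiters_balanced(r"\frac{a}{b")
```

Passing this check does not guarantee that the output compiles, but failing it is a strong signal that a prediction needs repair or re-decoding.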

Future work emphasizes document-level context integration, scalable reward evaluation (e.g., differentiable renderers), joint decoding architectures for structure and content, and coupling to LaTeX grammars or ASTs for guaranteed syntactic validity (Kayal et al., 2022, Ling et al., 22 Sep 2025, Wang et al., 24 Apr 2026).

7. Benchmarks, Datasets, and Systematic Evaluation

The field is propelled by open, large-scale benchmarks:

Im2LaTeX-100K: Rendered formula images from KDD and arXiv (single-line, normalized) (Singh, 2018).
Tab-To-Tex, TSRD/TCRD: Table image-structure/content pairs, with rich annotation of spanning cells, math macros, and diverse layout styles (Kayal et al., 2021, Kayal et al., 2022).
Img2LaTeX-Hard-1K: Curated set of 1,100 challenging formula images with complex layout/nesting and difficult symbol discrimination (Li et al., 28 Jul 2025).
TexOCR-Train: 404 K page image to compilable LaTeX/BibTeX pairs from arXiv, systematically aligned and annotated for structure, floats, and sectioning (Wang et al., 24 Apr 2026).

Standardized suites (TexOCR-Bench) and multi-objective evaluation (structure, visual match, compilation) are central to characterizing system performance and tracking progress across models and tasks (Wang et al., 24 Apr 2026, Ling et al., 22 Sep 2025, Kayal et al., 2021). The ongoing release of code, data, and evaluation infrastructure accelerates further research and practical deployment.

Vision-to-LaTeX now draws together vision, language, and symbolic computation, and is converging on document-understanding benchmarks that demand both fine-grained visual parsing and deep linguistic/structural grounding. The integration of RL with verifiable rewards, iterative visual feedback, and pre-trained multimodal LLMs has advanced the field, yet complete, cross-document, domain-adaptive, and fully compilable LaTeX generation remains an open problem (Li et al., 28 Jul 2025, Ling et al., 22 Sep 2025, Wang et al., 24 Apr 2026).