IM2LATEX-100K Dataset Overview

Updated 7 September 2025
  • IM2LATEX-100K is a large-scale benchmark with over 100K image–LaTeX pairs for converting images of math formulas into LaTeX markup.
  • The dataset employs rigorous preprocessing, including cropping, downsampling, bucketing, and tokenization, to standardize inputs for model training.
  • Extensions like im2latexv2 introduce multi-font rendering and canonicalization to improve model generalization and reduce LaTeX ambiguity.

The IM2LATEX-100K dataset is a large-scale benchmark designed for the task of converting images of mathematical expressions into corresponding LaTeX markup sequences—a central problem in mathematical expression recognition (MER) and optical character recognition (OCR) for STEM documents. Sourced from LaTeX-formatted content of arXiv High Energy Physics – Theory preprints, IM2LATEX-100K provides aligned image–LaTeX pairs representing a broad spectrum of mathematical notation complexity and layout. It is widely utilized for algorithm development, training, and performance benchmarking for end-to-end deep learning systems targeting math formula transcription.

1. Construction and Composition

The IM2LATEX-100K dataset consists of 103,556 distinct mathematical formula entries. These entries were extracted from the LaTeX sources of over 60,000 arXiv preprints, primarily in the theoretical physics domain. Each entry comprises a LaTeX formula string and its corresponding greyscale PNG rendering, initially produced via pdfLaTeX and subsequently converted for uniformity and accessibility.

The dataset is split into three canonical partitions:

  • Training set: 83,883 formulas
  • Validation set: 9,319 formulas
  • Test set: 10,354 formulas

Images are uniformly rendered at high resolution (1654 × 2339 pixels) and later subjected to preprocessing (see §2). Formula LaTeX strings span a wide range of lengths, with sequence lengths from 38 to 997 characters (mean ≈ 118, median ≈ 98) and a token vocabulary of size 583 after canonical tokenization. The dataset’s diversity in formula structure, symbol usage, and spatial complexity makes it suitable for developing architectures capable of robust two-dimensional spatial reasoning.
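
The 583-token vocabulary arises from splitting each formula into LaTeX control sequences and single characters. The following is a minimal regex-based sketch in the spirit of the WYGIWYS tokenizer referenced in §2; the real rule set differs in details such as whitespace and ligature handling, so treat this as illustrative only.

```python
# Minimal LaTeX tokenizer sketch (regex-based); illustrative, not the
# exact WYGIWYS rules used to build the official vocabulary.
import re

TOKEN_RE = re.compile(
    r"\\[a-zA-Z]+"   # control words, e.g. \frac, \alpha
    r"|\\."          # control symbols, e.g. \{ or \,
    r"|\S"           # any other single non-space character: {, }, ^, _, digits
)

def tokenize(latex: str) -> list[str]:
    """Split a LaTeX formula into a flat token stream."""
    return TOKEN_RE.findall(latex)

print(tokenize(r"\frac{a_3}{b}"))
# ['\\frac', '{', 'a', '_', '3', '}', '{', 'b', '}']
```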

2. Dataset Preprocessing and Batch Organization

Preprocessing is focused on maximizing both computational efficiency and input consistency:

  • Cropping and Downsampling: Images are cropped to the tightest bounding rectangle containing the formula, removing extraneous whitespace, and then downsampled to half their original dimensions to reduce computational load.
  • Bucketing and Padding: For mini-batch parallelization, images are grouped into size buckets (either 15 or 20 distinct (width, height) combinations, depending on the implementation). Within each bucket, images are zero-padded, preserving central alignment, to standardize shape; see the sketch below.
  • Sequence Tokenization and Padding: LaTeX formula strings are parsed and tokenized according to established rules (e.g., the WYGIWYS tokenizer). Within each mini-batch, sequences are right-padded to the batch’s maximum length.

This preprocessing pipeline enables effective GPU utilization for large-scale model training, while minimizing memory overhead due to image and sequence size heterogeneity.
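
A minimal NumPy sketch of the bucketing and padding steps described above; the bucket shapes here are illustrative assumptions, whereas real implementations derive their 15 or 20 buckets from the data distribution.

```python
# Illustrative bucketing/padding sketch; BUCKETS is made up for this example.
import numpy as np

BUCKETS = [(128, 32), (256, 32), (256, 64), (512, 64), (512, 128)]  # (w, h)

def assign_bucket(img: np.ndarray):
    """Return the smallest bucket that fits a cropped greyscale image."""
    h, w = img.shape
    for bw, bh in BUCKETS:
        if w <= bw and h <= bh:
            return bw, bh
    return None  # larger than every bucket; such images are typically dropped

def pad_to_bucket(img: np.ndarray, bucket) -> np.ndarray:
    """Zero-pad an image to its bucket shape, keeping the formula centered."""
    bw, bh = bucket
    h, w = img.shape
    top, left = (bh - h) // 2, (bw - w) // 2
    out = np.zeros((bh, bw), dtype=img.dtype)  # pad value 0, as described above
    out[top:top + h, left:left + w] = img
    return out

img = np.random.randint(0, 256, size=(40, 200), dtype=np.uint8)  # toy "formula"
padded = pad_to_bucket(img, assign_bucket(img))  # shape (64, 256)
```

Mini-batches are then drawn from within a single bucket, so every image in a batch shares one shape.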

3. Ground Truth Ambiguity and Canonicalization

A salient characteristic of the original IM2LATEX-100K is that the same rendered mathematical expression may be represented by semantically equivalent but syntactically distinct LaTeX ground truths (e.g., "a_3", "a_{3}", or with various optional curly brackets and whitespace). This redundancy introduces ambiguity in both model learning and evaluation—e.g., two predictions may render identically while differing at the markup level.

To address this, recent work has introduced normalization pipelines that canonicalize all LaTeX representations to a unique minimal form (e.g., "a_{3}" and "a_3" both map to "a_3", and the ordering of sub-/superscript chains is unified). Arrays and alignment environments are also standardized, and malformed constructs may be removed. The output of this process is an enhanced dataset ("im2latexv2") with reduced token-sequence diversity for visually identical formulas, thereby improving both model convergence and the fidelity of edit-distance–based evaluation metrics (Schmitt-Koopmann et al., 21 Apr 2024).
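
A toy sketch of such normalization, using regex-level rules that cover only the simple cases named above; the actual im2latexv2 pipeline parses the markup properly rather than pattern-matching it.

```python
# Toy canonicalization sketch; real pipelines parse LaTeX rather than
# regex-match it, so these rules only illustrate the idea.
import re

def canonicalize(latex: str) -> str:
    s = latex.strip()
    s = re.sub(r"\s+", " ", s)                 # collapse insignificant whitespace
    s = re.sub(r"([_^])\{(\w)\}", r"\1\2", s)  # a_{3} -> a_3, x^{2} -> x^2
    # Unify sub-/superscript ordering: a^2_i -> a_i^2.
    s = re.sub(r"\^(\{[^{}]*\}|\w)_(\{[^{}]*\}|\w)", r"_\2^\1", s)
    return s

assert canonicalize("a_{3}") == canonicalize("a_3") == "a_3"
assert canonicalize("x^{2}_{i}") == canonicalize("x_i^2") == "x_i^2"
```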

4. Dataset Extensions: Font Variation and Generalization

The original images in IM2LATEX-100K were rendered with a single font, which restricted model generalization to real-world use cases featuring diverse typographic styles. To overcome this, recent variants ("im2latexv2") extend each formula to multiple rendered images using 30 distinct fonts for training and up to 59 fonts for validation and test sets—29 of which occur only in the latter to explicitly evaluate generalization to unseen styles. This font augmentation ensures that models learn to transcribe mathematical structure, not just font-dependent symbol shapes, closing the gap between synthetic benchmarks and real document layouts (Schmitt-Koopmann et al., 21 Apr 2024).
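
As an illustration of how such multi-font renders can be produced, the sketch below compiles one formula under several LaTeX math font packages via pdflatex. The package list and template are common choices assumed here for illustration; they are not the exact configuration used to build im2latexv2.

```python
# Sketch of multi-font formula rendering; font packages and template are
# illustrative assumptions, not the im2latexv2 authors' exact setup.
import pathlib
import subprocess

FONT_PACKAGES = ["mathpazo", "fourier", "eulervm"]  # small illustrative subset

TEMPLATE = r"""\documentclass[border=2pt]{standalone}
\usepackage{%s}
\begin{document}
$ %s $
\end{document}
"""

def render(formula: str, font_pkg: str, out_dir: str = "renders") -> None:
    """Compile one formula with one font package into out_dir."""
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    tex = out / f"{font_pkg}.tex"
    tex.write_text(TEMPLATE % (font_pkg, formula))
    subprocess.run(
        ["pdflatex", "-interaction=batchmode",
         "-output-directory", str(out), str(tex)],
        check=True,
    )

for pkg in FONT_PACKAGES:
    render(r"\int_0^\infty e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2}", pkg)
```

The resulting PDFs would still need rasterization to greyscale PNGs (e.g., with pdftoppm) to match the dataset’s image format.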

5. Use in Deep Learning Model Development

IM2LATEX-100K serves as the principal training and evaluation resource for image-to-LaTeX models. Notable models and their approaches include:

  • Encoder–Decoder Models: Architectures typically combine a convolutional neural network (CNN) feature extractor (frequently VGG or ResNet variants) with a sequence decoder (bidirectional LSTM or fully convolutional networks), frequently augmented by soft attention mechanisms to enable spatial-to-token alignment.
    • The addition of 2D sinusoidal positional encodings to CNN feature maps preserves spatial relationships crucial for parsing superscripts, subscripts, and fraction bars (see the sketch after this list).
    • Attention mechanisms weighted by learned context vectors allow the decoder to focus on relevant image regions during token prediction (Wang et al., 2019, Yan et al., 2020).
  • Decoding Innovations: The dataset is a testbed for advanced decoding techniques, such as k-step look-ahead modules that recursively expand search trees to evaluate future log-likelihoods for improved token selection, and auxiliary loss terms that calibrate end-of-sequence (EOS) probabilities, especially on tasks with moderate sequence lengths (mean ≈ 64.86 tokens) (Wang et al., 2020).
  • Fully Convolutional Sequence Models: Methods such as ConvMath eschew recurrent LSTM decoders in favor of stacked convolutional decoders with multi-layer attention, allowing full parallelization and improved efficiency. This approach leverages the variable-length, variable-width characteristics of the IM2LATEX-100K images and sequences, further supported by batch image bucketing (Yan et al., 2020).
  • Convolutional Vision Transformers (CvTs): A data-centric approach utilizes CvT encoders to harness both local spatial detail (symbol strokes) and global structure (formula layout). By coupling this with canonicalized ground truth and font augmentation, models such as MathNet achieve state-of-the-art edit accuracy over multiple test sets, including im2latex-100k, im2latexv2, and real-world extracted formulae (Schmitt-Koopmann et al., 21 Apr 2024).
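
The 2D positional encoding mentioned in the first bullet can be sketched as follows: half of the channels encode the row index and half the column index, each with the standard sinusoidal scheme. The even channel split is a common convention assumed here, not a detail fixed by any one of the cited papers.

```python
# 2D sinusoidal positional encoding sketch: rows use the first d/2 channels,
# columns the last d/2, mirroring the 1D Transformer encoding.
import numpy as np

def positional_encoding_2d(h: int, w: int, d: int) -> np.ndarray:
    """Return an (h, w, d) encoding to add to a CNN feature map."""
    assert d % 4 == 0, "channel count must be divisible by 4"

    def pe_1d(length: int, dim: int) -> np.ndarray:
        pos = np.arange(length)[:, None]                       # (length, 1)
        freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
        enc = np.zeros((length, dim))
        enc[:, 0::2] = np.sin(pos * freq)
        enc[:, 1::2] = np.cos(pos * freq)
        return enc

    pe_y = pe_1d(h, d // 2)                                    # row positions
    pe_x = pe_1d(w, d // 2)                                    # column positions
    out = np.zeros((h, w, d))
    out[:, :, : d // 2] = pe_y[:, None, :]                     # broadcast over w
    out[:, :, d // 2 :] = pe_x[None, :, :]                     # broadcast over h
    return out

features = np.random.rand(8, 32, 256)                          # toy (H, W, C) map
features = features + positional_encoding_2d(8, 32, 256)
```

The encoding is simply added to the feature map before attention, letting the decoder distinguish, for example, a symbol above a fraction bar from one below it.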

6. Benchmarking, Metrics, and Impact

Commonly used evaluation metrics on IM2LATEX-100K include:

  • BLEU Score: Cumulative 4-gram BLEU to quantify overlap between predicted and ground truth LaTeX token streams, a standard for sequence generation tasks.
  • Image-based Metrics: Column-wise Levenshtein (edit) distance on rendered images (binarized arrays), and exact match rates (including variants ignoring whitespace). These directly assess the semantic and visual accuracy of model outputs.
  • Edit Score: Defined as

\text{Edit Score} = \left(1 - \frac{\mathrm{lev}(\mathrm{GT},\ \mathrm{PRE})}{\max\left(\mathrm{len}(\mathrm{GT}),\ \mathrm{len}(\mathrm{PRE})\right)}\right) \cdot 100\%

using canonicalized LaTeX token streams (lev: Levenshtein distance; GT: ground truth; PRE: prediction) (Schmitt-Koopmann et al., 21 Apr 2024).
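
The formula translates directly into code. A self-contained sketch over token lists (the function names are mine):

```python
# Direct transcription of the Edit Score formula above, using token-level
# Levenshtein distance over canonicalized LaTeX token streams.
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between token sequences."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ta != tb)))    # substitution
        prev = cur
    return prev[-1]

def edit_score(gt_tokens, pred_tokens) -> float:
    denom = max(len(gt_tokens), len(pred_tokens))
    if denom == 0:
        return 100.0
    return (1 - levenshtein(gt_tokens, pred_tokens) / denom) * 100

print(edit_score(["a", "_", "3"], ["a", "_", "{", "3", "}"]))  # 60.0
```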

Successive model generations on IM2LATEX-100K have yielded progressive improvements across these metrics: a notable BLEU increase from approximately 87–88% (before sequence-level training) to 90.28% with policy-gradient reinforcement learning (Wang et al., 2019), and a further reduction of token-level edit error to 5.3% through normalization and multi-font training (Schmitt-Koopmann et al., 21 Apr 2024). This progression highlights both architectural and data-driven advances.

7. Practical Applications and Contributions

The IM2LATEX-100K dataset underpins practical advances in:

  • STEM document digitization: Automated conversion of scanned or PDF mathematical content to editable LaTeX.
  • Formula-centric OCR: Enhanced recognition accuracy for complex mathematical notation, supporting academic archiving and structured search.
  • Model generalization studies: Evaluation of model robustness to typographic variation and markup style, facilitated by the im2latexv2 extension.

By providing a challenging and realistic benchmark, IM2LATEX-100K has driven architectural innovation in image-to-text modeling, motivated data-centric improvements (normalization and font augmentation), and established a robust foundation for both academic research and practical deployment in mathematical expression recognition.
