IM2LATEX-100K: Math OCR Dataset
- IM2LATEX-100K is a large-scale dataset for mathematical expression recognition, featuring over 103K paired image–formula samples extracted from arXiv.
- It employs a rigorous preprocessing pipeline including bucket-based image grouping and AST-based LaTeX tokenization to ensure visual and textual consistency.
- The dataset serves as a critical benchmark for evaluating models with metrics like BLEU, edit distance, and exact match accuracy, driving advances in mathematical OCR.
IM2LATEX-100K is a large-scale dataset for mathematical expression recognition, enabling end-to-end machine learning approaches to the challenging task of converting images of rendered LaTeX formulas into their corresponding markup sequences. Sourced from arXiv.org, it serves as a critical benchmark for models that bridge the gap between visual mathematical expressions and structured, machine-readable representations. The dataset pairs grayscale formula images with their LaTeX markup, covers a wide range of structural constructs, undergoes robust preprocessing, and is accompanied by standardized evaluation protocols that facilitate reproducible research in mathematical OCR.
1. Dataset Construction and Composition
IM2LATEX-100K comprises 103,556 distinct mathematical expressions, extracted from the LaTeX source files of over 60,000 arXiv papers, predominantly from the High Energy Physics–Theory (hep-th) category (Yan et al., 2020, Wang et al., 2019). Each formula is rendered to PNG format via a pipeline of LaTeX → PDF → grayscale PNG, with images exceeding 480 pixels in width discarded to maintain consistency.
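The rendering step can be reproduced with standard tools. The sketch below is illustrative only (the standalone preamble, file names, and the ImageMagick invocation are assumptions, not the dataset's original scripts): it compiles a formula to PDF with pdflatex and converts the result to a tightly cropped grayscale PNG.

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical standalone preamble; the original pipeline renders full arXiv sources.
TEMPLATE = r"""\documentclass[preview]{standalone}
\usepackage{amsmath,amssymb}
\begin{document}
$\displaystyle %s$
\end{document}
"""

def render_formula(latex: str, out_png: Path, density: int = 200) -> None:
    """Render one formula: LaTeX -> PDF (pdflatex) -> cropped grayscale PNG (ImageMagick)."""
    with tempfile.TemporaryDirectory() as tmp:
        tex_path = Path(tmp) / "formula.tex"
        tex_path.write_text(TEMPLATE % latex)
        # LaTeX -> PDF
        subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "-output-directory", tmp, str(tex_path)],
            check=True, capture_output=True,
        )
        # PDF -> grayscale, tightly trimmed PNG
        subprocess.run(
            ["convert", "-density", str(density), str(Path(tmp) / "formula.pdf"),
             "-colorspace", "Gray", "-trim", str(out_png)],
            check=True, capture_output=True,
        )

# Example: render_formula(r"\int_0^\infty e^{-x^2}\,dx", Path("sample.png"))
```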
The dataset is split as follows:
| Split | Expressions (ConvMath) | Expressions (MI2LS) |
|---|---|---|
| Train | 65,995 | 83,883 |
| Validation | 8,181 | 9,319 |
| Test | 8,301 | 10,354 |
| Total | 82,477 | 103,556 |
The split sizes differ between the two papers: the splits reported in (Yan et al., 2020) sum to 82,477 expressions, while (Wang et al., 2019) follows the original split covering all 103,556 expressions; both cite 103,556 as the overall size of the source dataset.
Formula lengths range from 38 to 997 characters per expression (mean ≈118, median ≈98). All major LaTeX constructs appear, including fractions, superscripts/subscripts, roots, integrals, summations, and matrix environments.
2. Image and Annotation Characteristics
All images are grayscale PNGs with variable dimensions (width ≤480 px; arbitrary height), representing tightly-cropped mathematical formulas (Yan et al., 2020, Wang et al., 2019). For computational efficiency and consistent batching:
- Images are grouped into discrete size 'buckets' (15 buckets in Yan et al., 2020, e.g., 160×32 and 256×64; 20 in Wang et al., 2019) and padded with a white background to each bucket's canonical rectangle size (see the padding sketch after this list).
- No additional resizing or cropping is performed.
- For multi-formula batch training, formulas are centered within their buckets.
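A minimal sketch of the bucket-padding step described above, using Pillow; the bucket list here is illustrative rather than the exact 15 or 20 buckets defined in the papers:

```python
from PIL import Image

# Illustrative bucket sizes (width, height); the papers define 15 and 20 buckets respectively.
BUCKETS = [(160, 32), (192, 32), (256, 64), (320, 64), (480, 128)]

def pad_to_bucket(img: Image.Image) -> Image.Image:
    """Center a cropped formula image on a white canvas sized to the smallest
    bucket that fits it (sketch of the padding step, not the original code)."""
    w, h = img.size
    for bw, bh in BUCKETS:
        if w <= bw and h <= bh:
            canvas = Image.new("L", (bw, bh), color=255)  # white background
            canvas.paste(img, ((bw - w) // 2, (bh - h) // 2))
            return canvas
    raise ValueError(f"image of size {img.size} exceeds the largest bucket")
```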
Ground-truth LaTeX is stored in plain UTF-8 text, one file per sample, using the tokenization scheme of Deng et al. (2016): for IM2LATEX-100K, the vocabulary includes 583 unique tokens (Yan et al., 2020) or 483 (after minimal token normalization and special <START>/<END> markers) (Wang et al., 2019). Annotation files encode a direct one-to-one mapping between each image and its LaTeX representation. No token-level or region-level spatial alignments are provided; alignment learning is delegated to model attention mechanisms.
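Assuming the per-sample layout described above (one PNG and one UTF-8 text file per formula, sharing an identifier), loading an image–formula pair might look like the following; the directory names and file extensions are assumptions, not the dataset's canonical packaging:

```python
from pathlib import Path
from PIL import Image

def load_pair(image_dir: Path, formula_dir: Path, sample_id: str):
    """Return one (image, formula) pair for a given sample identifier."""
    image = Image.open(image_dir / f"{sample_id}.png").convert("L")   # grayscale
    formula = (formula_dir / f"{sample_id}.txt").read_text(encoding="utf-8").strip()
    return image, formula
```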
3. Preprocessing Pipeline and Tokenization
The preprocessing workflow ensures consistency between visual input and textual output:
- Image preprocessing performs tight bounding-box cropping, optional downsampling for memory efficiency, and bucket-based padding. Original full-page render resolutions (e.g., 1654×2339 px) are typically halved.
- LaTeX preprocessing tokenizes and normalizes expressions to standardize semantically equivalent but syntactically divergent formulations (a minimal tokenizer sketch follows this list). The pipeline uses:
- Parsing to an abstract syntax tree (AST) using KaTeX (Wang et al., 2019).
- Token emission with reserved minimal units (e.g., '\psi', '\frac', '_', '+').
- Special sequence delimiters (<START>, <END>).
- The resulting sequence lengths exhibit a long-tail distribution, with a notable ∼3.4% of instances exceeding 150 tokens—a known challenge for neural sequence models (Wang et al., 2019).
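The sketch below approximates the minimal-unit tokenization with a regular expression rather than a full KaTeX AST parse; it illustrates the token granularity described above, not the papers' implementation:

```python
import re

# Commands like \frac or \alpha, escaped symbols, grouping/script characters, or any other glyph.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|[{}^_]|\S")

def tokenize(latex: str) -> list:
    """Approximate minimal-unit tokenization of a LaTeX formula, wrapped in
    <START>/<END> markers (regex sketch; the papers parse to a KaTeX AST instead)."""
    return ["<START>"] + TOKEN_RE.findall(latex) + ["<END>"]

# tokenize(r"\frac{\psi_1}{2} + x^{2}")
# -> ['<START>', '\\frac', '{', '\\psi', '_', '1', '}', '{', '2', '}',
#     '+', 'x', '^', '{', '2', '}', '<END>']
```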
4. Evaluation Metrics
IM2LATEX-100K is evaluated using multiple complementary metrics to account for the polymorphic nature of LaTeX and the 2D structure of mathematical expressions (Yan et al., 2020, Wang et al., 2019):
- BLEU Score (Papineni et al., 2002): Corpus-level, 4-gram cumulative BLEU measures n-gram overlap between predicted and ground-truth LaTeX token sequences.
- Column-wise Edit Distance: Rendered outputs are converted to bitmaps, serialized column-wise, and compared using the Levenshtein distance. The normalized distance (edit operations per column) penalizes insertion, deletion, or substitution errors.
- Exact Match Accuracy: Proportion of test samples for which the model prediction, once rendered, is pixel-identical to the ground-truth image. A variant ignores entirely blank (whitespace) columns.
Formally, for a test set of $N$ examples, exact match can be expressed as

$$\text{ExactMatch} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\operatorname{render}(\hat{y}_i) = \operatorname{render}(y_i)\right],$$

where $\hat{y}_i$ represents the predicted LaTeX sequence and $y_i$ the ground truth for the $i$-th example.
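A sketch of the image-based metrics as described above: rendered outputs are binarized, serialized column-wise, and compared with Levenshtein distance; exact match compares the column sequences directly, optionally ignoring blank columns. The binarization threshold and helper names are assumptions:

```python
import numpy as np

def columns(bitmap: np.ndarray, threshold: int = 128) -> list:
    """Serialize a grayscale bitmap column-wise into strings of 0/1 pixels."""
    binary = (bitmap < threshold).astype(np.uint8)        # 1 = ink, 0 = background
    return ["".join(map(str, col)) for col in binary.T]

def levenshtein(a: list, b: list) -> int:
    """Standard edit distance (insert/delete/substitute) over column strings."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def column_edit_distance(pred: np.ndarray, gold: np.ndarray) -> float:
    """Normalized column-wise edit distance (edit operations per ground-truth column)."""
    p, g = columns(pred), columns(gold)
    return levenshtein(p, g) / max(len(g), 1)

def exact_match(pred: np.ndarray, gold: np.ndarray, ignore_blank: bool = True) -> bool:
    """Pixel-level exact match of rendered outputs, optionally dropping blank columns."""
    p, g = columns(pred), columns(gold)
    if ignore_blank:
        p = [c for c in p if c != "0" * pred.shape[0]]
        g = [c for c in g if c != "0" * gold.shape[0]]
    return p == g
```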
5. Data Challenges and Ambiguities
IM2LATEX-100K’s complexity arises from several domain-specific challenges:
- LaTeX Polymorphism: Multiple syntactic LaTeX sequences can represent identical rendered expressions (e.g., alternate ordering of scripts or styling commands). This motivates the use of image-based, in addition to sequence-based, evaluation (Wang et al., 2019).
- Rare Symbols and Long Expressions: Vocabulary sparsity grows with rarely used tokens (\alpha, \psi, \oslash) and extended sequences, stressing model generalization. Only about 3.4% of test expressions exceed 150 tokens, a common sequence-length limit, but these long formulas have a disproportionate impact on error rates (Wang et al., 2019).
- 2D Spatial Structure: The dataset spans formulas containing higher-order vertical/horizontal relationships, such as matrices or nested scripts. Maintaining spatial locality throughout the encoding and decoding pipeline is critical; techniques like 2D positional encoding are employed for this purpose (Wang et al., 2019, Yan et al., 2020).
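As an illustration of the 2D positional encoding technique cited above (not the exact formulation of either paper), a common construction splits the channel dimension between sinusoidal row and column encodings:

```python
import numpy as np

def positional_encoding_1d(length: int, dim: int) -> np.ndarray:
    """Standard sinusoidal encoding of shape (length, dim)."""
    pos = np.arange(length)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / dim)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def positional_encoding_2d(height: int, width: int, dim: int) -> np.ndarray:
    """2D encoding of shape (height, width, dim): half the channels encode the
    row index, half the column index (sketch of the idea, not the papers' exact form)."""
    assert dim % 4 == 0, "dim must be divisible by 4"
    pe_h = positional_encoding_1d(height, dim // 2)   # (H, dim/2)
    pe_w = positional_encoding_1d(width, dim // 2)    # (W, dim/2)
    pe = np.zeros((height, width, dim))
    pe[:, :, : dim // 2] = pe_h[:, None, :]
    pe[:, :, dim // 2 :] = pe_w[None, :, :]
    return pe

# The encoding is added to (or concatenated with) the CNN encoder's feature map before attention.
```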
6. Successor Dataset: IM2LATEXv2 and Its Improvements
The MathNet project introduced “im2latexv2,” a significant data-centric revision. This variant applies a systematic LaTeX normalization function to canonicalize the markup, removing all extraneous font, spacing, and non-structural commands (Schmitt-Koopmann et al., 2024):
- Font Diversity: Original IM2LATEX-100K used only Computer Modern. IM2LATEXv2 renders each formula in up to 59 fonts (30 for training, 29 held out for validation/test), introducing >2 million images for training alone, thus improving cross-font generalization.
- Canonical Labels: The normalization eliminates redundant or optional LaTeX constructs, standardizes bracing and sub/superscript usage, merges synonyms (e.g., '\le' → '\leq'), and purges malformed arrays, shrinking the vocabulary from ~500 to 320 tokens (a token-level sketch of this idea follows at the end of this section).
- Data Cleaning: Blank renders, empty formulas, and entries with fatal normalization errors are removed, yielding a “purer” dataset—final image–formula pairs: 74,245 train, 8,243 validation, 10,118 test (Schmitt-Koopmann et al., 2024).
- Directory and Metadata Structure: Each image is linked to rich metadata including font identifiers, DPI (fixed at 600), and image dimensions. Directory layout follows split → images/ (by font) and formulas.lst (formula index, canonical LaTeX).
This revision addresses label noise, model convergence, and robustness to font idiosyncrasies, yielding a benchmark that is both more realistic and more consistent than the original (Schmitt-Koopmann et al., 2024).
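A token-level sketch of the canonicalization idea (synonym merging and removal of non-structural commands); the rule tables below are illustrative examples, not the MathNet normalization:

```python
# Illustrative rule tables -- only '\le' -> '\leq' is taken from the description above;
# the remaining entries are hypothetical examples of the same kind of rule.
SYNONYMS = {r"\le": r"\leq", r"\ge": r"\geq", r"\ne": r"\neq"}
DROP = {r"\rm", r"\mathrm", r"\displaystyle", r"\,", r"\!", r"\;"}

def normalize(tokens: list) -> list:
    """Map synonym commands to a canonical form and drop non-structural
    font/spacing commands (sketch of the canonicalization idea)."""
    out = []
    for tok in tokens:
        if tok in DROP:
            continue
        out.append(SYNONYMS.get(tok, tok))
    return out

# normalize(['a', r'\le', r'\mathrm', '{', 'b', '}']) -> ['a', '\\leq', '{', 'b', '}']
```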
7. Benchmark Usage and Model Performance
Numerous neural architectures for mathematical expression recognition benchmark their results on IM2LATEX-100K:
- ConvMath employs a fully convolutional encoder-decoder pipeline with multi-layer attention, achieving state-of-the-art accuracy and strong parallel computation efficiency (Yan et al., 2020).
- MI2LS integrates 2D positional encoding and sequence-level reinforcement learning, attaining notable improvements in BLEU, edit distance, and exact match (for instance, BLEU: 90.28%, exact match: 82.33% with sequence-level training) (Wang et al., 2019).
Comparative performance indicates:
| Model | BLEU | Image Edit Distance | Exact Match Accuracy |
|---|---|---|---|
| WYGIWYS (attentional seq2seq) | 87.73% | 87.60% | 77.46% |
| Double Attention | 88.42% | 88.57% | 79.81% |
| MI2LS (w/ seq-level RL) | 90.28% | 92.28% | 82.33% |
Performance declines modestly but remains robust for longer expressions, and the use of sophisticated attention-based and convolutional sequence models continues to drive improvements.
IM2LATEX-100K and its successor, IM2LATEXv2, constitute foundational resources for evaluating mathematical OCR pipelines. Their careful annotation, structural diversity, and rigorous evaluation protocols enable systematic study of crucial challenges in mathematical expression recognition, with ongoing refinements such as LaTeX canonicalization and font augmentation further improving their relevance to real-world use cases (Wang et al., 2019, Yan et al., 2020, Schmitt-Koopmann et al., 2024).