MER-17M: Large-Scale Math Expression Dataset
- MER-17M is a comprehensive dataset of 17.7M image–LaTeX pairs capturing complex, multi-line mathematical formulas for advanced recognition research.
- Its fully automated pipeline—from harvesting and extraction to quality filtering—ensures high diversity and precise structural representation.
- The dataset enables robust benchmarking and improved model training, facilitating progress in math OCR, specialized tokenization, and document analysis.
MER-17M is a large-scale dataset consisting of approximately 17.7 million image–LaTeX pairs, designed to advance the field of mathematical expression recognition (MER) by providing rich and diverse samples, particularly of complex, multi-line, and long mathematical formulas. Its creation addresses the shortcomings of existing public datasets, which are primarily composed of simple, single-line expressions, and provides a new standard for training and benchmarking MER models, especially those dealing with high structural and compositional complexity (Bai et al., 14 Dec 2025).
1. Automated Construction Pipeline
The construction of MER-17M follows a fully automated, multi-stage pipeline:
- Document Harvesting: Over one million LaTeX source PDFs and TeX files were crawled from public repositories such as arXiv.
- Math-Code Extraction: Custom parsing scripts based on the LaTeX abstract syntax tree (AST) were used to extract raw LaTeX tokens from all math environments, including inline ($...$) and displayed (\[...\]) math as well as block constructs such as equation and align.
- Cleaning and Normalization: Non-mathematical macros (such as \cite, \ref, and footnotes) were stripped, and mismatched braces were normalized to eliminate spurious text and prevent misalignment between images and labels (see the sketch below).
- Rendering: Each normalized LaTeX snippet was rendered into a tightly cropped grayscale image using pdfTeX or LuaTeX, typically at 600 dpi, then downsampled under the CMER-Fit dynamic-resolution scheme to accommodate varied spatial layouts.
- Quality Filtering: Samples with failed rendering, blank or near-blank output, trivial length (e.g. single symbol), or malformed LaTeX were automatically discarded.
- Diversity Checks: A feature-level nearest neighbor analysis using hashes and embedding distances was applied to prune near-duplicate images (>90% similarity), thereby maximizing structural diversity.
- Benchmark Hold-Out: From the resulting set of approximately 17.7 million image–LaTeX pairs, 2,000 pairs were randomly sampled (stratified by formula difficulty score) to create the CMER-Bench benchmark; the remaining pairs formed the final MER-17M corpus (Bai et al., 14 Dec 2025).
Annotation relied solely on the extracted LaTeX code, which served directly as unambiguous ground truth for each rendered image.
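The cleaning-and-normalization step could be approximated as in the following minimal sketch; the macro list, helper name, and the choice to reject (rather than repair) unbalanced braces are illustrative assumptions, not the authors' released code.

```python
import re

# Non-mathematical macros to strip (illustrative subset).
NON_MATH_MACROS = [r"\\cite", r"\\ref", r"\\eqref", r"\\label", r"\\footnote"]

def clean_latex(snippet: str) -> str | None:
    """Strip non-mathematical macros; return None for snippets with unbalanced braces."""
    for macro in NON_MATH_MACROS:
        # Remove the macro together with one brace-delimited argument, e.g. \cite{foo}.
        snippet = re.sub(macro + r"\*?\s*\{[^{}]*\}", "", snippet)
    depth = 0
    for ch in snippet:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if depth < 0:            # closing brace with no opener
            return None
    return snippet.strip() if depth == 0 else None

print(clean_latex(r"E = mc^2 \label{eq:emc}"))   # -> 'E = mc^2'
```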
2. Dataset Statistics and Distributional Properties
MER-17M exhibits uniquely broad coverage in token length, spatial structure, and mathematical constructs.
A. Volume and Splits
- Raw sample count: ~17,700,000 image–LaTeX pairs
- CMER-Bench (hold-out for evaluation): 2,000 samples
- Training pool: ~17,698,000 samples
There is no fixed validation split; subsets can be drawn as needed.
B. LaTeX Token Length Distribution
| LaTeX Token Length | Samples in MER-17M |
|---|---|
| 0–20 | 85,900 |
| 21–150 | 5,500,000 |
| 151–300 | 7,800,000 |
| 301–450 | 3,000,000 |
| >450 | 1,400,000 |
Compared to major previous datasets such as IM2LATEX-100K (75K samples), Pix2tex (234K), and UniMER-1M (1.1M), MER-17M increases the number of medium and long formulas by an order of magnitude.
C. Line-Count (Spatial) Distribution
| Number of Lines | Samples in MER-17M |
|---|---|
| 1 | 504,300 |
| 2 | 15,600,000 |
| 3 | 1,300,000 |
| 4 | 143,300 |
| 5 | 60,600 |
| >5 | 43,500 |
According to this distribution, roughly 97% of formulas are multi-line and close to 10% span three or more lines, whereas existing corpora consist of over 90% single-line expressions.
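The token-length and line-count statistics above could be reproduced over raw LaTeX strings along the following lines; the tokenization regex and the use of \\ as the line separator are simplifying assumptions rather than the paper's exact counting rules.

```python
import re
from collections import Counter

def latex_tokens(src: str) -> list[str]:
    # Commands (\frac, \begin), escaped characters, alphanumeric runs, or single symbols.
    return re.findall(r"\\[a-zA-Z]+|\\.|[A-Za-z0-9]+|\S", src)

def length_bucket(n: int) -> str:
    for hi, name in [(20, "0-20"), (150, "21-150"), (300, "151-300"), (450, "301-450")]:
        if n <= hi:
            return name
    return ">450"

def line_count(src: str) -> int:
    # A formula with k explicit LaTeX line breaks (\\) spans k + 1 lines.
    return src.count("\\\\") + 1

corpus = [r"\frac{a+b}{c}", r"\begin{align}x&=1\\y&=2\end{align}"]
length_hist = Counter(length_bucket(len(latex_tokens(s))) for s in corpus)
line_hist = Counter(line_count(s) for s in corpus)
print(length_hist, line_hist)
```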
D. Expression Types
The dataset encompasses a wide variety of mathematical constructs, including:
- Arithmetic and single-symbol displays
- Fractions (including nested), radicals, and roots
- Integrals (single and multiple), summations, products, limits
- Matrices and arrays (\begin{matrix}, pmatrix, etc.)
- Multi-line systems and aligned derivations (align, gather)
- Mixed text and math (\text{...})
This range ensures coverage of both elementary and highly structured mathematical forms.
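One way to check this coverage on individual samples is a simple construct tagger; the category-to-pattern mapping below is an illustrative assumption rather than the dataset's official taxonomy.

```python
import re

# Illustrative mapping from construct category to a detection pattern.
CONSTRUCT_PATTERNS = {
    "fraction":     r"\\[td]?frac",
    "radical":      r"\\sqrt",
    "big_operator": r"\\(int|oint|sum|prod|lim)(?![a-zA-Z])",
    "matrix":       r"\\begin\{[a-zA-Z]*matrix\}",
    "multi_line":   r"\\begin\{(align|gather)\*?\}",
    "text":         r"\\text\{",
}

def tag_constructs(latex: str) -> set[str]:
    """Return the set of construct categories detected in a LaTeX string."""
    return {name for name, pattern in CONSTRUCT_PATTERNS.items()
            if re.search(pattern, latex)}

print(tag_constructs(r"\begin{align}f(x)&=\int_{0}^{1}t^{2}\,dt\end{align}"))
# e.g. {'big_operator', 'multi_line'} (set ordering varies)
```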
3. Image–LaTeX Pairing and Example Content
Each data point in MER-17M consists of:
- A rendered, tight-bounds grayscale image of a mathematical expression (with dynamic resolution handling for complex, multi-line layouts)
- An extracted and normalized LaTeX string serving as the label
Examples range from simple to highly complex:
- \sqrt{\frac{a + b}{c}}
- \frac{d}{dx}\bigl(\sin x\,e^x\bigr) = \sin x\,e^x + e^x\cos x
- \begin{align} f(x) &= \int_{0}^{1} t^{2}\,dt \\[4pt] &= \tfrac13 \end{align}
- \begin{pmatrix} \frac{\partial u}{\partial x} & \frac{\partial u}{\partial y} \\[3pt] \frac{\partial v}{\partial x} & \frac{\partial v}{\partial y} \end{pmatrix}
These examples are representative of the types of content and complexity found throughout the corpus.
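In the spirit of the rendering stage described earlier, a single snippet could be turned into a tight grayscale crop roughly as follows, assuming pdflatex and pdftoppm are installed; the standalone wrapper is an assumed detail, and the CMER-Fit downsampling step is not reproduced.

```python
import pathlib
import subprocess
import tempfile

TEMPLATE = r"""\documentclass[border=2pt]{standalone}
\usepackage{amsmath}
\begin{document}
$\displaystyle %s$
\end{document}
"""

def render_formula(latex: str, out_png: str, dpi: int = 600) -> None:
    """Render a LaTeX math snippet to a tightly cropped grayscale PNG."""
    with tempfile.TemporaryDirectory() as tmp:
        tex = pathlib.Path(tmp, "formula.tex")
        tex.write_text(TEMPLATE % latex)
        # The standalone class already produces a tight bounding box around the content.
        subprocess.run(["pdflatex", "-interaction=nonstopmode",
                        "-output-directory", tmp, str(tex)],
                       check=True, capture_output=True)
        subprocess.run(["pdftoppm", "-png", "-gray", "-r", str(dpi),
                        "-singlefile", str(pathlib.Path(tmp, "formula.pdf")),
                        out_png.removesuffix(".png")],
                       check=True)

render_formula(r"\sqrt{\frac{a + b}{c}}", "example.png")
```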
4. Structured Mathematical Language (SML) and Specialized Tokenization
MER-17M introduces key advances for modeling the rich 2D and hierarchical structure of mathematical expressions beyond standard LaTeX.
A. Mathematical Byte-Pair Encoding (BPE) Tokenizer
A custom tokenization approach was developed:
- The vocabulary contains atomic LaTeX commands and environments (e.g., \frac, \sqrt, \begin{align}), ensuring these units remain intact.
- Byte-Pair Encoding (BPE) is run over a 3-million-sample subset (CMER-3M), yielding a vocabulary of roughly 32,000 tokens and mitigating the uninformative fragmentation produced by generic text tokenizers (see the sketch below).
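A sketch of how such a math-aware BPE tokenizer could be trained with the Hugging Face tokenizers library; the protected-command list and the training-file name are illustrative assumptions, and the authors' actual implementation may differ.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import BpeTrainer

# Commands/environments that must survive as single tokens (illustrative subset).
ATOMIC = [r"\frac", r"\sqrt", r"\int", r"\sum", r"\begin{align}", r"\end{align}",
          r"\begin{pmatrix}", r"\end{pmatrix}"]

tok = Tokenizer(BPE(unk_token="[UNK]"))
tok.pre_tokenizer = WhitespaceSplit()
tok.add_special_tokens(ATOMIC)   # special tokens are never split by the BPE model

trainer = BpeTrainer(vocab_size=32_000,
                     special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"] + ATOMIC)
tok.train(["cmer3m_latex.txt"], trainer)   # hypothetical dump of CMER-3M label strings

print(tok.encode(r"\frac{a+b}{c}").tokens)
```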
B. Structured Mathematical Language (SML)
SML provides a linearized, token-sequence representation while explicitly marking 2D and hierarchical structures using control tokens:
- Parsing: The LaTeX code is parsed into a tree whose nonterminals correspond to mathematical constructs (e.g., fractions, subscripts).
- Linear Serialization: Preorder traversal serializes this tree, inserting control tokens such as <FracStart>, <FracMid>, and <FracEnd> for fractions. Formal grammar productions allow such constructs to be defined recursively:
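For example (the paper's exact grammar is not reproduced here), the fraction production could take the form:

```
Expr ::= Sym | Frac | ...
Frac ::= <FracStart> Expr <FracMid> Expr <FracEnd>
```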
Analogous productions are created for superscript/subscript, matrix, integral, and alignment environments.
Because SML sequences are just token arrays, transformer-based models can process them as usual but are provided with explicit structural guidance.
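As a minimal illustration (not the paper's code) of preorder serialization from a parse tree into SML tokens, assuming a simple node representation in which only the fraction control tokens above are taken from the description:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # e.g. "frac", "sym"
    value: str = ""                # terminal symbol for "sym" nodes
    children: list["Node"] = field(default_factory=list)

def to_sml(node: Node) -> list[str]:
    """Preorder traversal that emits control tokens around structured constructs."""
    if node.kind == "sym":
        return [node.value]
    if node.kind == "frac":
        num, den = node.children
        return ["<FracStart>", *to_sml(num), "<FracMid>", *to_sml(den), "<FracEnd>"]
    # Analogous branches would handle superscripts, matrices, alignments, etc.
    raise ValueError(f"unhandled node kind: {node.kind}")

# \frac{a+b}{c} as a tiny parse tree
tree = Node("frac", children=[
    Node("sym", "a+b"),            # flattened numerator for brevity
    Node("sym", "c"),
])
print(to_sml(tree))   # ['<FracStart>', 'a+b', '<FracMid>', 'c', '<FracEnd>']
```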
5. Comparative Analysis and Benchmark Integration
MER-17M provides a new foundation for state-of-the-art (SOTA) MER systems by virtue of its scale and structural diversity.
- Dataset scale comparison: MER-17M offers 17.7M samples versus 75K (IM2LATEX-100K), 234K (Pix2tex), and 1.1M (UniMER-1M). For expressions with 21–150 tokens, MER-17M provides 5.5M samples (UniMER: 342K); for formulas exceeding 450 tokens, it provides 1.4M (UniMER: 75K). Roughly 97% of MER-17M samples are multi-line, whereas prior datasets are more than 90% single-line.
- Benchmarking: The CMER-Bench benchmark, derived from MER-17M, categorizes samples into three difficulty tiers (Easy, Moderate, Complex) and supports rigorous evaluation of both specialized and general-purpose models.
- Model performance: CMERNet, an encoder–decoder architecture trained on the 3M-sample CMER-3M slice (sourced from MER-17M), achieves on CMER-Bench:
- BLEU: 0.765 (Easy), 0.721 (Moderate), 0.557 (Complex)
- CDM: 0.966, 0.845, 0.690
These results significantly surpass those of previous dedicated and general-purpose multimodal models.
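For reference, corpus-level BLEU over predicted versus ground-truth LaTeX token sequences can be computed as in the sketch below using NLTK; the paper's exact BLEU configuration and its CDM implementation are not detailed here, so this is only an illustrative evaluation harness.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def tokenize(latex: str) -> list[str]:
    # Simplest possible tokenization: whitespace split (an assumption, not the paper's tokenizer).
    return latex.split()

references  = [r"\frac { a } { b } + c", r"\sqrt { x + 1 }"]
predictions = [r"\frac { a } { b } + c", r"\sqrt { x - 1 }"]

refs = [[tokenize(r)] for r in references]      # each hypothesis may have several references
hyps = [tokenize(p) for p in predictions]
score = corpus_bleu(refs, hyps, smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")
```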
6. Applications and Future Implications
MER-17M is engineered to support a broad spectrum of mathematical expression recognition research:
- Robust recognition models: The inclusion of deeply nested, multi-line, and high-token-count expressions allows models to generalize beyond trivial or linearly structured formulas.
- Benchmarking and evaluation: CMER-Bench, based on stratified sampling from MER-17M, enables transparent comparison across difficulty tiers.
- Tokenization research: The specialized BPE tokenizer and SML format facilitate future research in expressing and learning mathematical structure.
- A plausible implication is that future advances in mathematical OCR, document understanding, and automated theorem proving may increasingly depend on datasets of the scale, structural variety, and granularity provided by MER-17M.
MER-17M thus serves as a pivotal resource for ongoing and future developments in mathematical expression recognition, providing an unprecedented volume and diversity of mathematical content and enabling sophisticated modeling approaches previously unattainable with existing datasets (Bai et al., 14 Dec 2025).