CMER-Bench Benchmark

Updated 20 February 2026

CMER-Bench is a benchmark suite that evaluates optical recognition of complex mathematical expressions using explicit difficulty tiers to mirror real-world document challenges.
It employs a multi-stage pipeline combining automated scoring and expert verification to stratify expressions into Easy, Moderate, and Complex levels.
Evaluation reports BLEU, ROUGE, token-level Levenshtein distance, and the layout-aware CDM score to assess model performance.

CMER-Bench is a benchmark suite dedicated to evaluating mathematical expression recognition (MER) on complex expressions. Designed to diagnose and advance optical mathematical recognition systems, it stratifies real-world expressions from scientific literature into explicit difficulty tiers and provides a controlled evaluation environment that exposes the limitations of current models, especially on deeply nested, multi-line, and semantically rich mathematical constructs (Bai et al., 14 Dec 2025).

1. Motivation and Distinctive Features

CMER-Bench originates from the observation that progress in MER has been measured predominantly on simple, single-line expressions, while real scientific documents contain expressions with deep structural nesting, long token sequences, and multi-line layouts (such as matrices, aligned equations, and nested integrals) that are under-represented in prior benchmarks (e.g., IM2LATEX-100K, Pix2tex, UniMER-1M). The benchmark's central aims are to:

Stress-test algorithms under varying degrees of expression complexity,
Reveal systematic weaknesses in the handling of hierarchical, two-dimensional layouts, and
Inform dataset construction and model design toward robust, real-world deployment.

CMER-Bench consists of 2,000 images of typeset mathematical expressions drawn from a pool of over one million scientific documents, with each sample assigned to one of three explicit difficulty levels: Easy, Moderate, or Complex. Earlier benchmarks lack this stratification, which is operationalized here through quantitative and expert-driven procedures (Bai et al., 14 Dec 2025).

2. Construction Pipeline and Difficulty Stratification

The benchmark creation follows a multi-stage, reproducible pipeline:

Extraction and Rendering: LaTeX snippets are parsed from source documents, syntactically cleansed (e.g., removal of spurious \cite commands), and rendered into high-resolution images.
Automated Scoring: Expressions receive a continuous 'difficulty score' computed by aggregating LaTeX token length, nesting depth (parse tree height), and line count.
Human Verification and Bucketing: Domain experts vet borderline cases, ensuring alignment between visual layout and semantic content, and assign each sample to a difficulty tier defined as follows:
- Easy: ≤ 20 tokens, single-line, shallow structure.
- Moderate: 21–150 tokens, up to two lines, modest nesting.
- Complex: > 150 tokens or ≥ 3 lines, deep nesting, multi-row environments with extensive use of fraction, summation, integral, and matrix constructs.

Each tier is populated with roughly 667 expressions, ensuring balanced coverage and diagnostic utility across the full range of expression complexities encountered in real document corpora (Bai et al., 14 Dec 2025).
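The tier rules above can be sketched as a small rule function. This is only an illustration under stated assumptions: the source does not give the scoring formula, so the weights in `difficulty_score` are placeholders, `brace_depth` is a crude stand-in for parse-tree height, and `assign_tier` uses only the token and line thresholds listed above (a short two-line expression is treated as Moderate, which is an assumption).

```python
from dataclasses import dataclass


@dataclass
class ExprStats:
    n_tokens: int  # LaTeX token count
    n_lines: int   # number of rendered lines
    depth: int     # nesting depth (proxy for parse-tree height)


def brace_depth(latex: str) -> int:
    """Crude nesting proxy: maximum depth of unescaped curly braces."""
    depth = max_depth = 0
    i = 0
    while i < len(latex):
        ch = latex[i]
        if ch == "\\":
            i += 2  # skip the character after a backslash (e.g. \{ or \})
            continue
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth = max(depth - 1, 0)
        i += 1
    return max_depth


def difficulty_score(s: ExprStats, w_tokens: float = 1.0,
                     w_depth: float = 10.0, w_lines: float = 25.0) -> float:
    """Weighted sum of the three signals; the weights are hypothetical."""
    return w_tokens * s.n_tokens + w_depth * s.depth + w_lines * s.n_lines


def assign_tier(s: ExprStats) -> str:
    """Apply the token/line thresholds from the tier definitions."""
    if s.n_tokens > 150 or s.n_lines >= 3:
        return "Complex"
    if s.n_tokens <= 20 and s.n_lines == 1:
        return "Easy"
    return "Moderate"
```

In the described pipeline, samples near a threshold would additionally go to human reviewers; that step is not modeled here.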

3. Dataset Statistics and Contextual Positioning

Table 1 and Table 2 summarize the scale and coverage of the associated training corpora (MER-17M and CMER-3M) that underpin CMER-Bench. Notably, CMER-3M is a balanced 3.1M-sample subset of the 17.7M-pair MER-17M corpus, with stratification by token length and line count to ensure comprehensive representation of complex mathematical phenomena.

Table 1. Sample counts by LaTeX token length (K = thousand, M = million).
Length (tokens)	IM2LATEX	Pix2tex	UniMER-1M	CMER-3M	MER-17M
0–20	<0.1K	3.6K	477.8K	85.9K	85.9K
21–150	43.4K	137.5K	342.6K	620.2K	5,500K
151–300	28.1K	70.6K	103.1K	858.1K	7,800K
301–450	3.6K	16.8K	62.6K	855.6K	3,000K
>450	0.1K	5.8K	75.6K	641.2K	1,400K
Total	75.3K	233.8K	1.1M	3.1M	17.7M

Table 2. Sample counts by number of lines per expression (K = thousand, M = million).
Lines	IM2LATEX	Pix2tex	UniMER-1M	CMER-3M	MER-17M
1	71.7K	221.8K	964.7K	505.7K	504.3K
2	0.3K	6.0K	45.6K	2.0M	15.6M
3–4	2.4K	5.2K	36.8K	459.9K	1.4M
5+	0.8K	1.9K	14.7K	47.1K	104.1K

A distinctive feature is that CMER-Bench's Complex tier is explicitly sourced from the most structurally rich portion of CMER-3M, offering hundreds of multi-line fractions, block matrices, and compound integrals.
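To make the shift toward long expressions concrete, the share of samples above 150 tokens can be computed directly from the token-length table. The snippet below is a small worked calculation, not part of the benchmark tooling; the dictionary and function names are made up for illustration, and the counts are copied from the table in thousands.

```python
# Counts in thousands, copied from the token-length table.
TOKEN_LENGTH_COUNTS_K = {
    "UniMER-1M": {"0-20": 477.8, "21-150": 342.6, "151-300": 103.1,
                  "301-450": 62.6, ">450": 75.6},
    "CMER-3M": {"0-20": 85.9, "21-150": 620.2, "151-300": 858.1,
                "301-450": 855.6, ">450": 641.2},
}

# Buckets that meet the Complex token criterion (more than 150 tokens).
LONG_BUCKETS = ("151-300", "301-450", ">450")


def long_share(counts: dict) -> float:
    """Fraction of samples that exceed 150 tokens."""
    return sum(counts[b] for b in LONG_BUCKETS) / sum(counts.values())


for name, counts in TOKEN_LENGTH_COUNTS_K.items():
    print(f"{name}: {long_share(counts):.1%} of samples exceed 150 tokens")
```

By this count roughly 23% of UniMER-1M and roughly 77% of CMER-3M fall above 150 tokens.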

4. Evaluation Metrics and Protocols

To rigorously assess both token-level transcription and structural comprehension, CMER-Bench mandates reporting four complementary metrics:

BLEU (1–4): n-gram precision with brevity penalty, sensitive to exact LaTeX string overlap.
ROUGE (1, 2, L): Token recall and longest common subsequence recall, emphasizing coverage of correct symbol sequences.
Levenshtein (Edit) Distance: Minimum edit operations (insertions, deletions, substitutions) at the token level (lower is better).
CDM (Character Detection Matching): An image-level, bounding-box-aware F1 score; prediction and reference are both rendered, their characters are matched spatially, and correct geometric layout is rewarded even when the two LaTeX strings differ but are functionally equivalent.

This multi-metric protocol triangulates model performance and prevents overfitting to linearized string similarity alone (Bai et al., 14 Dec 2025).
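Two of the string-level metrics above are simple enough to sketch directly. The following is a minimal illustration of token-level edit distance and LCS-based ROUGE-L recall; the regex tokenizer is an assumption made for this example and is not necessarily the tokenization used by the benchmark's official scripts.

```python
import re

# Assumed tokenization: a LaTeX command (\frac), an escaped symbol (\{),
# or any single non-whitespace character.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|\S")


def tokenize(latex: str) -> list:
    return TOKEN_RE.findall(latex)


def edit_distance(ref: list, hyp: list) -> int:
    """Token-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(ref: list, hyp: list) -> float:
    """LCS length divided by reference length (0.0 for an empty reference)."""
    return lcs_length(ref, hyp) / len(ref) if ref else 0.0
```

For example, `x^{2}` and `x^2` render identically but differ by two tokens under this tokenizer, which is the kind of mismatch a layout-aware metric such as CDM is meant to tolerate.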

5. Baseline Performance and Diagnostic Findings

A comprehensive evaluation in Bai et al. demonstrates clear stratification of model capabilities:

Model / Tier	BLEU ↑	ROUGE-L ↑	Edit Dist. ↓	CDM ↑
UniMERNet (Easy)	0.476	0.739	94.7	0.938
UniMERNet (Complex)	0.409	0.750	483.3	0.680
GPT-4o (Complex)	0.208	0.590	1905.0	0.635
CMERNet (Easy)	0.765	0.861	51.6	0.966
CMERNet (Complex)	0.557	0.745	474.2	0.690

Key observations:

The two dedicated MER models listed both handle the Easy tier reasonably well (BLEU ≈ 0.48–0.77), and CMERNet leads on every reported metric except Complex-tier ROUGE-L, where UniMERNet is marginally ahead (0.750 vs. 0.745).
Performance on Complex deteriorates sharply for both dedicated MER and MLLM baselines, most visibly in edit distance and CDM.
CMERNet, with only 125M parameters and structure-aware enhancements, reaches roughly 2.7 times the Complex BLEU of GPT-4o (0.557 vs. 0.208) and exceeds UniMERNet by 36% (0.557 vs. 0.409).
CDM scores reveal some robustness to LaTeX rephrasings but reinforce CMERNet's lead (Bai et al., 14 Dec 2025).
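The relative BLEU comparisons on the Complex tier follow from simple arithmetic on the table values. The snippet below is a worked check; the variable and function names are illustrative only.

```python
# Complex-tier BLEU values copied from the results table.
complex_bleu = {"UniMERNet": 0.409, "GPT-4o": 0.208, "CMERNet": 0.557}


def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`."""
    return (new - base) / base


ratio_vs_gpt4o = complex_bleu["CMERNet"] / complex_bleu["GPT-4o"]
gain_vs_unimernet = relative_gain(complex_bleu["CMERNet"], complex_bleu["UniMERNet"])

print(f"CMERNet / GPT-4o BLEU ratio: {ratio_vs_gpt4o:.2f}")    # 2.68
print(f"CMERNet gain over UniMERNet: {gain_vs_unimernet:.1%}")  # 36.2%
```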

6. Supporting Resources: MER-17M, CMER-3M, SML, and Tokenizer Design

Recognition of complex expressions necessitates training on large, syntactically and semantically diverse corpora. MER-17M, sourced from more than 1M arXiv papers, yields 17.7M image–LaTeX pairs after aggressive filtering for quality and variety. CMER-3M, a balanced 3.1M-sample subset of MER-17M, preserves coverage across line counts and token lengths, so that models are not disproportionately tuned to short or simple samples.

To resolve LaTeX string ambiguity and improve structural alignment, Bai et al. introduce Structured Mathematical Language (SML), a representation in which LaTeX expressions are parsed into explicit syntax trees, linearized with structural tokens, and combined into decoder-compatible target streams. This explicit hierarchy enhances model supervision on multi-dimensional relationships (Bai et al., 14 Dec 2025).
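The general idea of linearizing a syntax tree with structural tokens can be illustrated with a toy example. This is not the SML grammar from the paper: the `Node` class, the `<label>` and `</label>` token format, and the example tree are all invented here purely to show how explicit open and close markers make hierarchy visible in a flat decoder target.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def linearize(node: Node) -> List[str]:
    """Depth-first traversal; internal nodes emit open and close markers."""
    if not node.children:
        return [node.label]
    out = [f"<{node.label}>"]
    for child in node.children:
        out.extend(linearize(child))
    out.append(f"</{node.label}>")
    return out


def height(node: Node) -> int:
    """Tree height, counting a single leaf as 1."""
    return 1 + max((height(c) for c in node.children), default=0)


# Hypothetical tree for \frac{a}{\sqrt{b}}
tree = Node("frac", [Node("a"), Node("sqrt", [Node("b")])])
print(" ".join(linearize(tree)))  # <frac> a <sqrt> b </sqrt> </frac>
```

Unlike raw LaTeX, where several strings can render identically, a tree-derived stream of this kind has one canonical form per structure, which is the ambiguity the source says SML is meant to resolve.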

A dedicated Byte-Pair Encoding (BPE) tokenizer, trained on CMER-3M and incorporating ~500 mathematics-specific tokens, prevents fragmentation of multi-character LaTeX commands, further supporting semantic preservation in both training and inference.
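The effect of atomic, mathematics-specific tokens can be sketched with a pre-tokenization step that keeps protected LaTeX commands whole before any subword splitting happens. The protected list below is a tiny hypothetical sample, not the roughly 500 tokens mentioned above, and the character-level fallback merely stands in for the BPE pieces a real tokenizer would produce.

```python
import re

# Hypothetical sample of protected, mathematics-specific tokens.
PROTECTED = [r"\begin{pmatrix}", r"\end{pmatrix}", r"\frac", r"\sum", r"\int", r"\alpha"]


def build_splitter(protected):
    """Regex matching any protected token, longest first.

    Commands ending in a letter get a lookahead so that \\int does not
    match inside a longer command such as \\intercal.
    """
    alts = []
    for tok in sorted(protected, key=len, reverse=True):
        guard = r"(?![A-Za-z])" if tok[-1].isalpha() else ""
        alts.append(re.escape(tok) + guard)
    return re.compile("(" + "|".join(alts) + ")")


def pre_tokenize(latex, protected=PROTECTED):
    """Keep protected tokens atomic; fall back to characters elsewhere."""
    splitter = build_splitter(protected)
    protected_set = set(protected)
    pieces = []
    for chunk in splitter.split(latex):
        if not chunk:
            continue
        if chunk in protected_set:
            pieces.append(chunk)
        else:
            pieces.extend(ch for ch in chunk if not ch.isspace())
    return pieces


print(pre_tokenize(r"\frac{\alpha}{2}"))
```

Without the protected list, a command such as `\alpha` could be fragmented into several subword pieces, which is the failure mode the dedicated tokenizer is described as avoiding.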

7. Insights, Limitations, and Best Practices

Findings from CMER-Bench suggest:

Prior MER models and MLLMs fail to generalize to deeply nested, multi-line structures despite apparent robustness on simpler tasks, underlining CMER-Bench’s necessity as a discriminative benchmark.
Large-scale, balanced pretraining (CMER-3M), explicit structural modeling (SML), and atomic tokenization drive significant performance gains without necessarily increasing parameter count.
CMER-Bench's 2,000 samples, while diverse, may limit representation at the extreme end of expression complexity; future expansions should increase scale, incorporate handwritten samples, and introduce fine-grained complexity annotations (e.g., matrix size, nesting depth).
The CDM metric, while valuable for layout fidelity, may under-penalize subtle LaTeX syntax errors, indicating the need for a unified metric capturing structural, visual, and semantic correctness.
Recommended practices include stratified evaluation by difficulty, multi-metric reporting, dynamic input resolution (e.g., CMER-Fit), and pretraining on balanced corpora to ensure robust model generalization.

In summary, CMER-Bench, together with associated data, representations, and model innovations, establishes a new standard for MER evaluation and provides a robust scaffold for the next generation of complex expression recognition research (Bai et al., 14 Dec 2025).

References

Bai et al. (2025). Complex Mathematical Expression Recognition: Benchmark, Large-Scale Dataset and Strong Baseline.
