MathNet: Benchmark, MER & GNN Systems
- MathNet is a term for disparate research systems in mathematical AI, including a multimodal benchmark, a printed mathematical expression recognition framework, and a Haar-like wavelet GNN.
- The Olympiad benchmark version offers a large-scale, multilingual dataset with tasks for full-solution generation, retrieval, and retrieval-augmented reasoning.
- The MER and GNN implementations employ data-centric normalization and multiresolution wavelet transforms to achieve significant performance gains over previous methods.
“MathNet” is a polysemous research designation used for several unrelated systems in mathematical AI and graph machine learning. In current arXiv usage, it refers at least to a global multimodal benchmark for Olympiad-level mathematical reasoning and retrieval (Alshammari et al., 20 Apr 2026), a data-centric framework for printed mathematical expression recognition (MER) together with revised datasets and a CvT–Transformer model (Schmitt-Koopmann et al., 2024), and a graph neural network framework based on Haar-like wavelet multiresolution analysis (Zheng et al., 2020). The shared name does not denote a single lineage or interoperable platform; rather, it labels distinct research programs centered on mathematically structured data.
1. Disambiguation and scope
The term “MathNet” is used in three technically separate senses in the supplied literature.
| Usage | Problem setting | Core contribution |
|---|---|---|
| MathNet benchmark | Mathematical reasoning, retrieval, and RAG | Multimodal, multilingual Olympiad benchmark and dataset |
| MathNet MER | Printed mathematical expression recognition | Data-centric normalization, multi-font datasets, and MER model |
| MathNet GNN | Graph representation and learning | Haar-like wavelet multiresolution GNN with convolution and pooling |
The benchmark-oriented MathNet targets mathematical problem solving and retrieval over Olympiad corpora. The MER-oriented MathNet addresses image-to-LaTeX transcription of printed formulae. The GNN-oriented MathNet operates on generic graphs and introduces a multiresolution basis for graph convolution and hierarchical pooling. A frequent misconception is that these papers describe successive versions of one framework; the available record indicates instead that they are separate contributions appearing in different subfields and years (Alshammari et al., 20 Apr 2026).
2. MathNet as a benchmark for mathematical reasoning and retrieval
In its 2026 usage, MathNet is a “high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems” (Alshammari et al., 20 Apr 2026). It comprises 30,676 Olympiad-level problems, each paired with an expert-written solution, and is sourced from 1,595 PDF volumes spanning 25,000+ pages of officially published contest booklets from 1985 to 2025. The coverage extends across 47 countries, 143 distinct contests, and 17 languages, with English accounting for 74% and the remaining languages, including Portuguese, Italian, Spanish, French, Polish, German, Romanian, Korean, Dutch, Russian, Mongolian, Slovene, Macedonian, Serbian, Hungarian, and others, accounting for 26%.
The corpus is organized over four broad domains: Geometry (32%), Algebra (32%), Discrete Mathematics (23%), and Number Theory (20%). It also includes a fine-grained taxonomy with 68+ problem-type labels, such as “Plane Geometry → Triangles → Euler line” and “Combinatorics → Invariants.” The dataset is explicitly multimodal: statements and solutions are represented in native LaTeX, embedded images and diagrams are preserved, and text blocks are extracted via multilingual OCR (dots-ocr) and LLM-based normalization (GPT-4.1).
MathNet defines three benchmark tasks. MathNet-Solve evaluates full-solution generation from a problem statement with optional image input. MathNet-Retrieve evaluates math-aware retrieval using anchor problems, a synthetic equivalent positive, and adversarial hard negatives. MathNet-RAG evaluates retrieval-augmented problem solving through three inference settings—Zero-Shot, Embed-RAG, and Expert-RAG—using 35 real anchor problems paired with expert-curated structurally similar neighbors. The benchmark is described as the largest open, high-quality multimodal and multilingual Olympiad-level mathematics dataset together with the first rigorous benchmark for math-aware retrieval and retrieval-augmented reasoning (Alshammari et al., 20 Apr 2026).
3. Task design, similarity taxonomy, and empirical findings in the 2026 benchmark
A central feature of the benchmark is its explicit taxonomy of mathematical similarity. It distinguishes invariance, defined as strict equivalence via renaming, reformulation, or isomorphism; structural resonance, defined by shared lemmas or reductions such as a common use of Cauchy-Schwarz; and thematic affinity, defined as disciplinary proximity without structural equivalence (Alshammari et al., 20 Apr 2026). This taxonomy underlies both the synthetic retrieval task and the real retrieval-augmented setting.
For MathNet-Solve, the 30,676 problems are split into train (23,776), test (6,400), and test-hard (500). The output is a full solution in LaTeX, judged by a GPT-5 verifier, and the metric is binary accuracy, with a solution counted as correct when the verifier's score clears the grading threshold. On the 6,400-problem test set, gemini-3.1-pro-preview achieves 78.4% ± 1.0%, with domain breakdowns of Algebra 83.7%, Number Theory 82.2%, Geometry 74.6%, and Discrete Math 75.6%. GPT-5 achieves 69.3% ± 1.1%, with Algebra 80.3%, Number Theory 73.6%, Geometry 61.1%, and Discrete Math 65.3%. gemini-3-flash-preview reaches 70.4% ± 1.1%. The top-to-bottom gap is substantial: gemini-3.1-pro-preview outperforms ministral-3B (4.4%) by approximately 74 percentage points.
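The margins quoted next to each accuracy can be sanity-checked with a few lines of arithmetic. The sketch below assumes a 95% normal-approximation interval for a proportion over the 6,400 test problems. That interval method is an assumption made here for illustration, not something the source states, and the function name `normal_margin` is hypothetical.

```python
import math

def normal_margin(accuracy, n, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion.

    Assumption: z=1.96 (95% level); the paper's own interval method is not given here.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)

# Accuracies as reported for the 6,400-problem test split.
reported = {
    "gemini-3.1-pro-preview": 0.784,
    "GPT-5": 0.693,
    "gemini-3-flash-preview": 0.704,
}

# Margins in percentage points, rounded to one decimal place.
margins = {name: round(100 * normal_margin(acc, 6400), 1) for name, acc in reported.items()}

# Gap between the strongest and weakest reported models, in percentage points.
top_to_bottom_gap = round(78.4 - 4.4, 1)
```

Under this assumption the computed margins come out at 1.0, 1.1, and 1.1 percentage points, which matches the reported values, and the top-to-bottom gap is 74.0 points.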
For MathNet-Retrieve, the dataset contains 10,000 anchor problems, each paired with one synthetic equivalent positive and three adversarial hard negatives, giving 40,000 items in total. The task is to rank four candidates so that the equivalent variant is surfaced, and the metric is Recall@k under cosine similarity of embeddings. Reported overall results are low at R@1: gemini-embedding-001 obtains R@1 4.83% and R@5 68.88%; qwen3-embedding-4B obtains R@1 4.96% and R@5 64.95%; text-embedding-ada-002 obtains R@1 1.94% and R@5 42.02%. Even the best models exceed 80% only at R@10. The paper states that retrieval of strict invariants remains near random at R@1.
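Recall@k under cosine similarity can be sketched as follows. This is a minimal illustration with NumPy, not the benchmark's evaluation code: the function name, the array layout, and the toy embeddings are all assumptions.

```python
import numpy as np

def recall_at_k(anchor_emb, candidate_emb, positive_idx, k):
    """Fraction of anchors whose positive candidate ranks in the top k.

    anchor_emb:    (n_anchors, d) array of anchor embeddings
    candidate_emb: (n_candidates, d) array of candidate embeddings
    positive_idx:  length-n_anchors sequence, index of each anchor's positive
    """
    # Normalize rows so that the dot product equals cosine similarity.
    a = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    c = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)
    sims = a @ c.T
    # Indices of the k most similar candidates for every anchor.
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = (top_k == np.asarray(positive_idx)[:, None]).any(axis=1)
    return float(hits.mean())

# Toy data: two anchors, four candidates (hypothetical 2-D embeddings).
anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
candidates = np.array([[0.9, 0.1], [0.1, 0.9], [1.0, 1.0], [-1.0, 0.0]])
positives = [0, 2]

r_at_1 = recall_at_k(anchors, candidates, positives, k=1)
r_at_2 = recall_at_k(anchors, candidates, positives, k=2)
```

In the toy data the second anchor's positive is outranked by a lexically closer distractor, so recall is 0.5 at k=1 and 1.0 at k=2. The reported results show the same pattern at much larger scale: the positive is usually found within the top few candidates but rarely ranked first.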
For MathNet-RAG, 35 real anchor problems are each paired with one expert-curated structurally similar neighbor. Under human grading, DeepSeek-V3.2-Speciale scores 84.8% in Zero-Shot, 89.5% in Embed-RAG, and 97.3% in Expert-RAG; gemini-3-pro-preview scores 89.1%, 92.9%, and 87.5%; GPT-5 scores 76.8%, 75.2%, and 86.6%. The largest reported Expert-RAG gain is +12.5 percentage points for DeepSeek-V3.2-Speciale. The reported qualitative conclusion is that generative LLMs solve complex Olympiad problems at approximately 70–80% but still struggle in Geometry and Discrete Math, while embedding models capture lexical overlap but fail to encode symbolic structure. The benchmark therefore identifies retrieval quality as the primary bottleneck for retrieval-augmented generation (Alshammari et al., 20 Apr 2026).
4. MathNet as a data-centric framework for printed mathematical expression recognition
In its 2024 usage, MathNet denotes a MER framework motivated by two data pathologies in conventional image-to-LaTeX benchmarks: LaTeX variability and single-font bias (Schmitt-Koopmann et al., 2024). The paper reports that in im2latex-100k, approximately 34.8% of the 500-token vocabulary is redundant or irrelevant for the visual appearance of the formula. This creates label noise and ambiguous supervision, because visually identical mathematical expressions may map to multiple LaTeX strings. The benchmark also renders expressions in only one font, and the authors report that all tested MER systems plunged in performance when evaluated on a different font.
The proposed remedy is explicitly data-centric. The workflow is: train a baseline on existing data, analyze consistent error patterns, apply targeted normalization or augmentation to labels or images, and repeat until all major error classes are explained. The core normalization pipeline maps every LaTeX expression to a canonical form by stripping math-font commands, removing explicit spacing commands, eliminating optional braces, merging multiple sub- and superscripts on the same base, replacing synonyms by a single token, filtering non-math tokens, canonicalizing arrays, and respelling formulae according to a deterministic tokenizer. The paper gives the example that the variants `x_2^{3}+y`, `x_{2}^{3}+y`, and `x_2^3 + y` all normalize to `x_{2}^{3}+y`.
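A few of the normalization steps can be illustrated with a token-level sketch. It covers only four rules (dropping font commands, dropping spacing commands, mapping synonyms, and bracing bare sub- and superscripts). The token lists are short assumed samples, and `normalize_latex` is a hypothetical name rather than the paper's implementation. Array canonicalization, script merging, and removal of redundant braces are not attempted.

```python
import re

# Assumed, non-exhaustive rule tables; the paper's actual rule set is larger.
FONT_COMMANDS = {r"\mathrm", r"\mathbf", r"\mathit", r"\mathcal", r"\mathsf", r"\boldsymbol"}
SPACING_COMMANDS = {r"\,", r"\;", r"\:", r"\!", r"\quad", r"\qquad"}
SYNONYMS = {r"\le": r"\leq", r"\ge": r"\geq", r"\ne": r"\neq"}

# A token is a control word, a control symbol, or any single non-space character.
TOKEN = re.compile(r"\\[A-Za-z]+|\\.|\S")
CONTROL_WORD = re.compile(r"\\[A-Za-z]+")

def normalize_latex(latex):
    """Map a LaTeX string to a canonical token sequence (partial sketch)."""
    tokens = TOKEN.findall(latex)  # whitespace is discarded here
    tokens = [t for t in tokens if t not in FONT_COMMANDS and t not in SPACING_COMMANDS]
    tokens = [SYNONYMS.get(t, t) for t in tokens]

    # Wrap a bare single-token sub/superscript argument in braces.
    out, i = [], 0
    while i < len(tokens):
        out.append(tokens[i])
        if tokens[i] in ("_", "^") and i + 1 < len(tokens) and tokens[i + 1] != "{":
            out.extend(["{", tokens[i + 1], "}"])
            i += 1
        i += 1

    # Join tokens, keeping one space only where a control word precedes a letter.
    pieces = []
    for j, tok in enumerate(out):
        pieces.append(tok)
        nxt = out[j + 1] if j + 1 < len(out) else ""
        if CONTROL_WORD.fullmatch(tok) and nxt[:1].isalpha():
            pieces.append(" ")
    return "".join(pieces)

canonical = normalize_latex("x_2^3 + y")
```

With these rules, `x_2^3 + y` and `x_{2}^{3}+y` both map to `x_{2}^{3}+y`. Removing a font command leaves its argument braces in place, so `\mathbf{b}` becomes `{b}`; a fuller pipeline would also strip those optional braces.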
This normalization is used to build im2latexv2, a canonical and multi-font revision of im2latex-100k. Each normalized formula is rendered at 600 DPI in 30 distinct base fonts in training and in 59 fonts total across validation and test, with 29 fonts appearing only at test time. After filtering blank images, invalid or disallowed LaTeX, and failed renderings, the final multi-font training set contains 74,245 formulae rendered in 30 fonts, or approximately 2.2 million images. The paper also introduces realFormula, a real-world test set extracted from arXiv papers using FormulaNet. From approximately 250,000 detected formula images, 200 were sampled, then filtered to 121 annotated images, including 110 single-line and 11 multi-line formulae, with approximately 43 using math-font styles.
The model architecture consists of preprocessing and augmentation, a CvT encoder with 3 stages, and a Transformer decoder with 4 layers, 8 heads, and model dimension 512. The encoder outputs a grid of 512-dimensional feature vectors; the decoder uses a standard causal mask, omits relative-positional encodings, and predicts over a 400-token vocabulary. Training uses a cross-entropy token loss and the Adam optimizer with a batch size of 36, on a single NVIDIA V100 (32 GB) for approximately 100 epochs. The paper positions these architectural choices as secondary to the data-centric intervention.
5. Datasets, performance, and limitations of the MER-oriented MathNet
The quantitative results of the MER-oriented MathNet are reported across four test sets (Schmitt-Koopmann et al., 2024). On im2latex-100k rendered at 600 DPI, MathNet achieves Edit = 94.7% and EM = 63.4%, compared with WYGIWYS at Edit = 88.6% and EM = 78.6%. The paper emphasizes edit score rather than exact match, arguing that edit similarity better measures manual correction burden. Relative to WYGIWYS, the edit error decreases from 11.4% to 5.3%, corresponding to 53.5% fewer errors.
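The edit score and the relative error reduction can be made concrete with a short sketch. The definition used here, one minus the token-level Levenshtein distance divided by the longer sequence length, is an assumption about how such a score is commonly computed and may differ from the paper's evaluation script, for example in how scores are aggregated over a corpus. All function names are illustrative.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]

def edit_score(pred_tokens, ref_tokens):
    """Per-example edit score in [0, 1]; 1.0 means an exact match."""
    longest = max(len(pred_tokens), len(ref_tokens))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred_tokens, ref_tokens) / longest

def relative_error_reduction(old_score, new_score):
    """Share of the old edit error (100 - score) removed by the new system."""
    old_error = 100.0 - old_score
    new_error = 100.0 - new_score
    return (old_error - new_error) / old_error

# One wrong token out of eleven.
score = edit_score(list("x_{2}^{3}-y"), list("x_{2}^{3}+y"))

# WYGIWYS (88.6) versus MathNet (94.7) on im2latex-100k, as reported.
reduction = round(100 * relative_error_reduction(88.6, 94.7), 1)
```

A prediction with a single wrong token out of eleven still scores about 0.91 on edit score while counting as a complete failure under exact match. This is the property behind the argument that edit score tracks manual correction effort. The reduction computed from the two reported scores is 53.5%.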
On im2latexv2, MathNet reaches Edit = 97.2% and EM = 83.9%, whereas WYGIWYS falls to Edit = 37.2% and i2l-strips/i2l-nopool are approximately 76.0%. The paper reports that the error rate is reduced from 24% to 2.8%, an 88.3% relative improvement over the prior state of the art. On realFormula, MathNet obtains Edit = 88.3%, versus 27.5% for WYGIWYS and approximately 65.1% for i2l-strips/nopool. The realFormula breakdown reports 93.3% for single-line without arrays, 84.1% for single-line with arrays, 89.5% with math-font commands versus 94.1% without them, and 71.2% for multi-line formulae, rising to 96.2% with simple y-cut preprocessing. On InftyMDB-1 at 600 DPI, MathNet scores 89.2%, compared with 17.3% for WYGIWYS and approximately 63.3% for i2l-strips/nopool.
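The y-cut preprocessing mentioned for multi-line formulae is not specified further here. A common reading is a horizontal projection cut: split the image wherever a run of fully blank pixel rows separates two inked regions. The sketch below implements that reading on a NumPy array. The function name `y_cut`, the convention that 0 is background and 1 is ink, and the threshold are all assumptions.

```python
import numpy as np

def y_cut(image, ink_threshold=0.5):
    """Split a 2-D grayscale image into horizontal bands separated by blank rows.

    Assumed convention: pixel values above ink_threshold count as ink.
    Returns a list of sub-arrays, one per band, in top-to-bottom order.
    """
    has_ink = (image > ink_threshold).any(axis=1)
    bands, start = [], None
    for row, inked in enumerate(has_ink):
        if inked and start is None:
            start = row                      # a new band begins
        elif not inked and start is not None:
            bands.append(image[start:row])   # band ends at the first blank row
            start = None
    if start is not None:
        bands.append(image[start:])          # band runs to the bottom edge
    return bands

# Toy 7x4 "page" with two inked lines separated by blank rows.
page = np.zeros((7, 4))
page[1:3, 1] = 1.0
page[5, 2] = 1.0
lines = y_cut(page)
```

Each band could then be passed to the recognizer as a single-line formula. A practical version would likely need a minimum gap height, since a naive cut can also split tall single-line constructs such as fractions at the whitespace around the fraction bar.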
The ablation studies are central to the paper’s interpretation. Training at 100, 200, 300, and 600 DPI yields Edit scores of 78.2, 93.5, 96.9, and 98.0 respectively. Training on raw im2latex-100k gives Edit = 78.2%; on im2latexv2 with normalization only, 90.4%; and on full im2latexv2 with normalization plus 30 fonts, 97.2%. The authors state that two-thirds of the gain comes purely from label normalization and one-third from font variation, while architecture contributes a few extra percentage points. Arrays remain a major failure mode: only 4.8% of formulae contain arrays, yet they account for 52.6% of MathNet errors on im2latexv2, and removing arrays halves the overall edit-error rate from 2.8% to 1.4%.
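The attribution of gains and the array figures can be checked from the numbers quoted above. The second calculation assumes that every formula carries equal weight in the overall edit error. That is a simplification adopted for this sketch, not a statement about the paper's averaging. Function names are illustrative.

```python
def gain_shares(raw, norm_only, full):
    """Split the total improvement into a normalization share and a font share."""
    total = full - raw
    return (norm_only - raw) / total, (full - norm_only) / total

def error_without_subset(overall_error, subset_error_share, subset_fraction):
    """Error rate on the remaining data after removing a subset.

    Assumption: all examples are weighted equally, so the subset holds
    subset_error_share of the error mass and subset_fraction of the examples.
    """
    return overall_error * (1.0 - subset_error_share) / (1.0 - subset_fraction)

# Edit scores: raw im2latex-100k, normalization only, normalization plus 30 fonts.
norm_share, font_share = gain_shares(78.2, 90.4, 97.2)

# Arrays: 4.8% of formulae, 52.6% of errors, overall edit error 2.8%.
no_array_error = round(error_without_subset(2.8, 0.526, 0.048), 1)
```

The shares come out at roughly 0.64 and 0.36, in line with the two-thirds and one-third split stated by the authors. The estimated error without arrays is 1.4%, matching the reported halving.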
The stated limitations concern multi-line training, math-font styles, arrays and alignment, handwritten MER, and the gap between preliminary generative multimodal methods and state of the art. The paper also identifies accessibility applications as an immediate goal, specifically semi-automatic tagging of mathematics in PDFs by integrating FormulaNet and MathNet to produce fully tagged, screen-reader–friendly STEM documents (Schmitt-Koopmann et al., 2024).
6. MathNet as Haar-like wavelet multiresolution analysis for graph learning
In its 2020 usage, MathNet is a GNN framework that introduces multiresolution Haar-like wavelets with interrelated convolution and pooling strategies for graph representation and learning (Zheng et al., 2020). The method begins from an input undirected graph $\mathcal{G} = (V, E)$ with $N$ nodes and constructs a coarse-grained chain

$$\mathcal{G} = \mathcal{G}_0 \to \mathcal{G}_1 \to \cdots \to \mathcal{G}_J$$

by clustering nodes into successively coarser super-nodes, where $\mathcal{G}_j$ has $N_j$ nodes. At each level $j$, it recursively builds an orthonormal Haar-like basis $\Phi_j$ as a combination of vertically extended coarse basis vectors and detail vectors. By construction, each $\Phi_j$ is orthonormal and sparse.

This basis supports fast forward and adjoint graph transforms. A graph signal $f$ is decomposed into coefficients over the multiresolution basis, and under dyadic clustering the total complexity of the transforms is $\mathcal{O}(N)$. The paper integrates this construction into a wavelet-domain graph convolution:

$$X^{\mathrm{out}} = \sigma\!\left(\Phi\,\Lambda\,\Phi^{\top} X W\right),$$

where $X$ is the input feature matrix, $W$ is a learnable weight matrix, and $\Lambda$ is a learnable diagonal wavelet-domain filter; $\Phi$ is the basis at the current level and $\sigma$ is a pointwise nonlinearity. Because the basis is sparse and the transform is linear-time, each graph convolution costs $\mathcal{O}(N)$ rather than $\mathcal{O}(N^2)$.

The hierarchical pooling strategy discards detail coefficients and preserves only the first $N_{j+1}$ low-pass coefficients. In matrix form,

$$X_{j+1} = \tilde{\Phi}_j^{\top} X_j,$$

where $\tilde{\Phi}_j$ consists of the first $N_{j+1}$ columns of $\Phi_j$, with complexity $\mathcal{O}(N_j)$ per pooling layer. A typical network alternates HaarConv and HaarPool blocks from the finest level to the coarsest, after which a final vector is passed to an MLP for classification or regression. The implementation pads level-wise bases to a common width so that a single diagonal filter $\Lambda$ can be learned per layer regardless of individual graph sizes.
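The basis construction, wavelet-domain convolution, and low-pass pooling described in this section can be sketched for a two-level chain. The code uses dense NumPy matrices for readability, so it does not reproduce the linear-time fast transforms. The particular detail vectors, the ReLU nonlinearity, the toy clustering, and all function names are assumptions chosen for illustration rather than the paper's exact construction.

```python
import numpy as np

def haar_block(k):
    """k x k orthonormal basis on one cluster: a constant column, then k-1 zero-mean detail columns."""
    block = np.zeros((k, k))
    block[:, 0] = 1.0 / np.sqrt(k)
    for m in range(1, k):
        detail = np.zeros(k)
        detail[m - 1] = 1.0
        detail[m:] = -1.0 / (k - m)          # balances the single +1 entry
        block[:, m] = detail * np.sqrt((k - m) / (k - m + 1))
    return block

def extend_basis(coarse_basis, clusters):
    """Lift a coarse-level basis to the finer level.

    clusters[c] lists the fine-node indices merged into coarse node c.
    The first len(clusters) columns are the low-pass (extended coarse) vectors.
    """
    n_fine = sum(len(members) for members in clusters)
    columns = []
    for ell in range(len(clusters)):         # vertically extended coarse vectors
        col = np.zeros(n_fine)
        for c, members in enumerate(clusters):
            col[members] = coarse_basis[c, ell] / np.sqrt(len(members))
        columns.append(col)
    for members in clusters:                 # detail vectors inside each cluster
        local = haar_block(len(members))[:, 1:]
        for m in range(local.shape[1]):
            col = np.zeros(n_fine)
            col[members] = local[:, m]
            columns.append(col)
    return np.stack(columns, axis=1)

def haar_conv(phi, features, filt, weight):
    """Adjoint transform, diagonal filtering, forward transform, then ReLU."""
    spectral = phi.T @ (features @ weight)
    return np.maximum(phi @ (filt[:, None] * spectral), 0.0)

def haar_pool(phi, features, n_coarse):
    """Keep only the first n_coarse (low-pass) coefficients."""
    return phi[:, :n_coarse].T @ features

# Toy chain: 6 fine nodes grouped into 3 super-nodes (hypothetical clustering).
clusters = [[0, 1], [2, 3, 4], [5]]
phi = extend_basis(haar_block(len(clusters)), clusters)
features = np.arange(12.0).reshape(6, 2)
conv_out = haar_conv(phi, features, np.ones(6), np.eye(2))
pooled = haar_pool(phi, features, len(clusters))
```

With an all-ones filter and an identity weight matrix, the convolution returns the non-negative input unchanged, which confirms that the basis is orthonormal. Pooling maps the six-node feature matrix to one row per super-node.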
Empirically, the paper reports 10-fold averaged classification accuracies of 78.3 ± 1.6% on PROTEINS, 62.5 ± 3.9% on ENZYMES, 82.5 ± 3.6% on D&D, and 89.6 ± 2.5% on MUTAG. On three 15,000-graph PointPattern datasets, MathNet achieves test accuracies of 97.4 ± 0.34%, 96.0 ± 0.59%, and 92.7 ± 0.72%, compared with best baseline values of 92.9%, 89.3%, and 85.1%. On QM7 graph regression, it attains 42.7 ± 0.92 kcal/mol MAE, versus 43.6 ± 0.98 for GCNConv+SAGPool. The paper attributes stability and scalability to sparse Haar bases, linear-time transforms, and the absence of expensive Laplacian eigendecompositions or global sorting (Zheng et al., 2020).
7. Comparative interpretation and recurring misconceptions
The most important interpretive point is nominal rather than architectural: “MathNet” is not a unified framework spanning Olympiad reasoning, MER, and graph learning. The literature instead uses the same name for benchmark design, document understanding, and graph representation learning in separate problem domains. Any attempt to compare “MathNet” systems therefore requires prior disambiguation by year, task, and citation.
Within the 2026 benchmark, a common misunderstanding is that strong generative mathematical performance implies strong retrieval. The reported results contradict that view: problem-solving accuracies can reach 78.4% on MathNet-Solve, while R@1 on MathNet-Retrieve remains near random, around 4.83% to 4.96% for the best embedding models (Alshammari et al., 20 Apr 2026). This suggests that current reasoning models and current embedding models encode different aspects of mathematical competence.
Within the MER literature, another common misunderstanding is that architecture search is the primary driver of progress. The ablation evidence instead assigns most of the gain to label normalization and multi-font rendering, with two-thirds of the improvement attributed to normalization and one-third to font variation (Schmitt-Koopmann et al., 2024). A plausible implication is that canonicalization and realistic rendering conditions can be more decisive than marginal backbone changes when the target is image-to-LaTeX fidelity.
Within the GNN literature, MathNet is not presented as a generic spectral model dependent on costly eigendecomposition. Its stated contribution is precisely the opposite: a sparse multiresolution construction supporting fast forward and adjoint transforms, local operations, and linear-time complexity under dyadic clustering (Zheng et al., 2020). Across the three usages, the shared conceptual thread is not a common codebase but an emphasis on explicitly structured mathematical representations, whether problems, formulae, or graphs.