
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Published 20 Apr 2026 in cs.AI, cs.DL, cs.IR, and cs.LG | (2604.18584v1)

Abstract: Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.

Summary

  • The paper introduces MathNet, a large multilingual benchmark for Olympiad-level math, featuring over 30,000 problems and a detailed equivalence taxonomy.
  • It employs a novel three-track design—problem solving, math-aware retrieval, and retrieval-augmented problem solving—to evaluate deep mathematical reasoning and analogical matching.
  • Experimental results show that state-of-the-art generative models solve most problems but still fail on a substantial share (best overall accuracy 78.4%), while embedding models struggle with math-aware retrieval, signaling a need for representations that capture mathematical structure.

MathNet: A Multimodal, Multilingual Benchmark for Mathematical Reasoning and Problem Retrieval

Motivation and Benchmark Design

Mathematical reasoning—particularly at the Olympiad level—remains a central benchmark for evaluating the advanced deductive and analogical capacities of LLMs, large multimodal models (LMMs), and associated retrieval systems. Prior datasets constrained systematic research due to their limited scale, language coverage, and lack of rigorous annotation. The MathNet benchmark addresses these gaps by introducing a high-quality corpus of over 30,000 Olympiad-level math problems, curated official solutions, and a detailed equivalence taxonomy, supporting rigorous evaluation across mathematical reasoning and retrieval.

MathNet collects official problem booklets from 47 countries spanning two decades of competitions, resulting in a highly diverse and multilingual dataset (17 languages, 143 competitions). The benchmark is structured into three primary tracks: (1) Problem Solving, (2) Math-Aware Retrieval, and (3) Retrieval-Augmented Problem Solving (RAG). Each track is designed to probe a distinct facet of mathematical generalization, symbolic manipulation, and analogical retrieval.

Dataset Construction and Annotation

Unlike prior benchmarks sourced from informal community forums (e.g., AoPS), MathNet is extracted exclusively from officially published national and international Olympiad materials. The data ingestion pipeline leverages advanced multilingual OCR (dots-ocr), LLM-based extraction (Gemini-2.5, GPT-4.1), and robust, multi-stage verification (rule-based check, GPT-4.1, and human review), ensuring alignment, provenance, and content integrity. This pipeline enables the extraction of problem-solution pairs, structured metadata, figures, and language labels with high fidelity.

MathNet introduces a fine-grained taxonomy of mathematical similarity: strict invariance, structural resonance, and thematic affinity. This taxonomy is central to the Math-Aware Retrieval tasks and the construction of synthetic and expert-validated problem pairs.

Key Dataset Components

  • MathNet-Solve: 30,676 Olympiad problems with solutions for direct problem solving.
  • MathNet-Retrieve: 40,000 problem pairs (10k anchors, 1 positive, 3 hard negatives each) for rigorous evaluation of math-aware retrieval. Positives are generated via invariant-preserving LLM-led transformations; negatives leverage near-miss adversarial variants.
  • MathNet-RAG: 70 problems (35 anchors, each paired with one expert-curated analogous problem) for evaluation of RAG protocols, selecting for structural resonance and analogical proximity.
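
The pair count for MathNet-Retrieve follows directly from the per-anchor layout. The Python sketch below is a minimal illustration under stated assumptions: the `RetrieveItem` class, its field names, and the toy problems are hypothetical and are not taken from MathNet's released schema.

```python
from dataclasses import dataclass, field


@dataclass
class RetrieveItem:
    """Hypothetical record for one MathNet-Retrieve anchor (illustrative schema)."""
    anchor: str                                   # original Olympiad problem text
    positive: str                                 # one mathematically equivalent variant
    hard_negatives: list[str] = field(default_factory=list)  # near-miss variants

    def pairs(self) -> list[tuple[str, str, int]]:
        """Return (anchor, candidate, label) triples; label 1 marks the equivalent."""
        out = [(self.anchor, self.positive, 1)]
        out += [(self.anchor, neg, 0) for neg in self.hard_negatives]
        return out


def total_pairs(num_anchors: int, positives: int = 1, negatives: int = 3) -> int:
    """Number of pairs when every anchor has the same number of candidates."""
    return num_anchors * (positives + negatives)


# Toy item invented for illustration; it is not a MathNet problem.
item = RetrieveItem(
    anchor="Find all real x with x^2 - 5x + 6 = 0.",
    positive="Find all real t with t^2 + 6 = 5t.",
    hard_negatives=[
        "Find all real x with x^2 - 5x - 6 = 0.",
        "Find all real x with x^2 + 5x + 6 = 0.",
        "Find all integer x with x^2 - 5x + 6 < 0.",
    ],
)
print(len(item.pairs()))    # 4
print(total_pairs(10_000))  # 40000
```

With 10,000 anchors and four candidates each, the sketch reproduces the 40,000 pairs quoted above.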

Benchmark Tasks and Evaluation Protocols

Problem Solving

The core generative benchmark evaluates models' capacity for mathematical deduction as expressed in Olympiad problem solving. Performance is measured via an IMO-style rubric (0–7 points), binarized for accuracy, with both automatic (GPT-5) and human expert grading. Fine-grained breakdowns by domain (Algebra, Geometry, Combinatorics, Number Theory), language, and modality (text/image) are provided.
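
The binarization step can be sketched in a few lines of Python. The threshold of 6 follows the statement later in this summary that a 6 or 7 counts as correct; the function names and the sample scores are assumptions made for illustration, not the authors' grading code.

```python
def binarize(score: int, threshold: int = 6) -> int:
    """Map a 0-7 rubric score to 1 (correct) or 0 (incorrect)."""
    if not 0 <= score <= 7:
        raise ValueError(f"rubric score out of range: {score}")
    return 1 if score >= threshold else 0


def accuracy(scores: list[int], threshold: int = 6) -> float:
    """Fraction of problems whose rubric score reaches the threshold."""
    if not scores:
        raise ValueError("no scores given")
    return sum(binarize(s, threshold) for s in scores) / len(scores)


# Made-up judge scores for six problems.
sample_scores = [7, 6, 5, 0, 7, 3]
print(accuracy(sample_scores))  # 0.5
```

Note that a solution scored 5 contributes nothing to accuracy under this scheme, so partial progress is invisible in the headline number.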

Math-Aware Retrieval

This task evaluates embedding-based retrieval systems on their ability to identify structurally or symbolically equivalent problems (not just those with semantic or lexical similarity). The metric is Recall@k over the MathNet-Retrieve set, with analysis of embedding similarity distributions for positives and hard negatives.
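
As a rough illustration of the metric, the sketch below computes Recall@k from precomputed similarity scores when each anchor has exactly one positive. The function names and the scores are invented for this example, and the candidate pool is limited to four items per anchor for brevity; the pool used in the paper may be larger.

```python
def rank_of_positive(scores: list[float], positive_index: int) -> int:
    """1-based rank of the positive candidate; ties count against the positive."""
    target = scores[positive_index]
    ahead = sum(
        1 for i, s in enumerate(scores) if i != positive_index and s >= target
    )
    return 1 + ahead


def recall_at_k(all_scores: list[list[float]], positive_indices: list[int], k: int) -> float:
    """Fraction of anchors whose positive appears within the top k candidates."""
    hits = sum(
        1
        for scores, pos in zip(all_scores, positive_indices)
        if rank_of_positive(scores, pos) <= k
    )
    return hits / len(all_scores)


# Three anchors; index 0 is the positive, indices 1-3 are hard negatives.
all_scores = [
    [0.91, 0.95, 0.80, 0.70],  # positive ranked 2nd
    [0.88, 0.60, 0.55, 0.50],  # positive ranked 1st
    [0.70, 0.90, 0.85, 0.80],  # positive ranked 4th
]
positives = [0, 0, 0]
for k in (1, 2, 4):
    print(k, recall_at_k(all_scores, positives, k))
```

Only the second anchor counts as a hit at k = 1, while all three count at k = 4, which mirrors the gap between low Recall@1 and high Recall@5 or Recall@10 discussed in the results.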

Retrieval-Augmented Problem Solving (RAG)

In this protocol, LLMs are presented with analogous solved problems retrieved from the MathNet corpus to assess the enhancement of problem-solving accuracy via retrieved context. Three inference regimes are evaluated: zero-shot (problem only), Embed-RAG (retrieved via embedding), and Expert-RAG (expert-paired problem). Both LLM and human grading measure downstream impact.
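
The three regimes differ only in where the reference example comes from. The Python sketch below makes that explicit under stated assumptions: the prompt wording, the `build_prompt` helper, and the toy problems are hypothetical and do not reproduce the paper's actual prompts.

```python
from typing import Optional, Tuple


def build_prompt(problem: str, reference: Optional[Tuple[str, str]] = None) -> str:
    """Build a solver prompt; `reference` is an optional (problem, solution) pair."""
    parts = []
    if reference is not None:
        ref_problem, ref_solution = reference
        parts.append("Here is a related solved problem.")
        parts.append(f"Problem: {ref_problem}")
        parts.append(f"Solution: {ref_solution}")
        parts.append("")
    parts.append("Now solve the following problem.")
    parts.append(f"Problem: {problem}")
    return "\n".join(parts)


# Toy problems invented for illustration; they are not MathNet items.
target = "Prove that n^3 - n is divisible by 6 for every integer n."
retrieved = (  # what an embedding retriever might return (surface match)
    "Show that n^2 + n is even for every integer n.",
    "n^2 + n = n(n + 1) is a product of consecutive integers, so it is even.",
)
expert = (  # what an expert might pair with the target (structural match)
    "Prove that n^5 - n is divisible by 5 for every integer n.",
    "By Fermat's little theorem, n^5 is congruent to n modulo 5.",
)

prompts = {
    "zero_shot": build_prompt(target),
    "embed_rag": build_prompt(target, retrieved),
    "expert_rag": build_prompt(target, expert),
}
for name, text in prompts.items():
    print(f"--- {name} ---\n{text}\n")
```

Keeping the prompt template fixed across regimes means any accuracy difference can be attributed to the quality of the supplied reference rather than to prompt wording.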

Experimental Results

Problem Solving

LLMs and LMMs with advanced reasoning capabilities (Gemini-3.1-Pro, Gemini-2.5, GPT-5) dominate the benchmark, achieving up to 78.4% overall accuracy (Gemini-3.1-Pro). Algebra remains the easiest, with geometry and discrete mathematics registering the largest error gaps. Even at the frontier, models demonstrate significant failure rates on proof and process-oriented problems—evidence that Olympiad-level mathematical reasoning is not trivial for present architectures.

Strong Numerical Results

  • Gemini-3.1-Pro: 78.4% total accuracy, with 83.7% in Algebra and 75.6% in Discrete Mathematics.
  • GPT-5: 69.3% accuracy overall, but only 56.3% on Geometry.
  • Lower-tier models trail the state of the art by more than 70 percentage points.

Math-Aware Retrieval

Embedding models, including Gemini-embedding-001 and Qwen3-embedding-4B, achieve only ~5% Recall@1, even for strictly invariant pairs, which points to a serious weakness in capturing deep mathematical equivalence. Recall@5 and Recall@10 are much higher (near or above 80%), indicating that structurally equivalent problems do appear among the top candidates but are ranked below lexically similar, mathematically irrelevant distractors.

Key Observations

  • Embedding-based retrieval lags far behind generative problem solving on structurally-encoded tasks.
  • Non-equivalent, near-miss negatives are often assigned higher embedding similarity than true equivalents—revealing a misalignment in learned representations.
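
One way to quantify the misalignment described in the second observation is to compare each positive's similarity with that of its hardest negative. The sketch below is an assumption-laden illustration: the function names and all scores are made up, and the paper's own analysis of similarity distributions may be computed differently.

```python
def similarity_margin(pos_score: float, neg_scores: list[float]) -> float:
    """Positive similarity minus the highest negative similarity."""
    return pos_score - max(neg_scores)


def inversion_rate(items: list[tuple[float, list[float]]]) -> float:
    """Share of anchors where some hard negative outscores the true equivalent."""
    inverted = sum(1 for pos, negs in items if similarity_margin(pos, negs) < 0)
    return inverted / len(items)


# (positive score, [hard-negative scores]) for four hypothetical anchors.
items = [
    (0.82, [0.90, 0.75, 0.60]),  # inverted: a near-miss scores higher
    (0.88, [0.70, 0.65, 0.61]),  # correctly ordered
    (0.79, [0.81, 0.80, 0.50]),  # inverted
    (0.93, [0.92, 0.40, 0.30]),  # correctly ordered, but by a thin margin
]
print(inversion_rate(items))  # 0.5
```

A margin near zero, as in the last item, is worth tracking separately because small perturbations would flip the ranking.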

Retrieval-Augmented Problem Solving

While RAG can, in principle, yield large accuracy gains (e.g., DeepSeek-V3.2-Speciale improves from 84.8% to 97.3% under Expert-RAG), the effectiveness is highly sensitive to retrieval quality. Embed-RAG, powered by current embeddings, frequently fails to select genuinely useful context, sometimes offering no improvement or even degradation compared to zero-shot. Only when the retrieved context matches the structural core of the problem (Expert-RAG) are consistent improvements observed across models and grading modalities.
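
For the DeepSeek-V3.2-Speciale figures quoted above, the size of the gain depends on whether it is read in percentage points or relative to the zero-shot baseline. The short sketch below computes both readings; the helper names are assumptions, and the source does not state which convention its "up to 12%" figure uses.

```python
def absolute_gain(before: float, after: float) -> float:
    """Gain in percentage points."""
    return after - before


def relative_gain(before: float, after: float) -> float:
    """Gain as a percentage of the baseline accuracy."""
    return (after - before) / before * 100.0


zero_shot, expert_rag = 84.8, 97.3  # accuracies (%) quoted in the text
print(round(absolute_gain(zero_shot, expert_rag), 1))  # 12.5
print(round(relative_gain(zero_shot, expert_rag), 1))  # 14.7
```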

Implications for Mathematical AI

These results expose a clear dichotomy in mathematical AI capabilities: autoregressive LLMs achieve competent deductive performance with appropriate training, while the compositional and analogical structuring required for robust math-aware retrieval is poorly encoded in embedding-based architectures. This shortcoming signals the inadequacy of current semantic and lexical pretraining paradigms for deeply mathematical tasks. The limited gains of multimodal augmentation further emphasize that most SOTA visual-linguistic models are not integrating symbolic visual content at a level necessary for advanced mathematical reasoning.

The findings imply that future advances demand explicit symbolic or structured architectural innovations—beyond autoregressive next-token prediction—in both generative and retrieval paradigms. Integrating techniques from symbolic AI, neuro-symbolic models, or explicit theorem-proving strategies may be required for progress on generalization, analogy, and retrieval in mathematical domains.

Future Directions

MathNet sets a new standard and resource for measuring mathematical reasoning and analogical retrieval at scale. It opens avenues for research on:

  • Designing embedding architectures invariant under symbolic transformations.
  • Training generative models on analogical reasoning protocols, including contrastive and structure-aware objectives.
  • Integrating retrieval and generation to enable verifiable and self-explaining mathematical problem solvers.
  • Building multilingual and multimodal models that can operate robustly over diagrammatic mathematical content.

Conclusion

MathNet delivers the largest and most diverse multilingual, multimodal Olympiad-level reasoning benchmark to date. It demonstrates that even state-of-the-art generative models, though highly capable, remain well short of saturating proof-based mathematical reasoning, and that existing retrieval systems are severely limited in math-aware analogical search. The MathNet corpus, associated tasks, and baseline analyses are intended to support rigorous, fine-grained investigations into the next generation of AI systems for mathematics and formal reasoning (2604.18584).

Explain it Like I'm 14

What is this paper about?

This paper introduces MathNet, a very large, carefully checked collection of hard math contest problems (Olympiad level) from all over the world. The authors use it to test two big things in AI:

  • How well AI models can solve tough math problems.
  • How well AI models can find other problems that are “the same idea” in math, even if they look different (this is called math‑aware retrieval).

They also test whether giving an AI a solved, similar problem helps it solve a new one (retrieval‑augmented generation, or RAG).

What questions are the researchers trying to answer?

  • Can today’s top AI models actually solve Olympiad‑level problems across algebra, geometry, number theory, and discrete math?
  • Can AI search systems recognize when two problems are mathematically equivalent or share the same key idea, even if the wording, symbols, or language are different?
  • Does showing a model a closely related, solved example meaningfully boost its problem‑solving accuracy?

How did they build MathNet and run the tests?

To make this fair and high‑quality, the team:

  • Collected official contest booklets from 47 countries, across 17 languages, covering about two decades of competitions (30,676 problems with expert solutions).
  • Converted PDFs into text and diagrams, then used a mix of AI and human checks to match each problem with its correct solution and make sure nothing was misread or invented.

They created three parts (think of them as different “games” to play with the data):

  1. MathNet‑Solve (Problem Solving)
  • 30K+ problems with solutions for testing how well models solve them.
  • Models write full solutions; a “judge” model scores them from 0–7. A 6 or 7 counts as correct.
  2. MathNet‑Retrieve (Math‑Aware Retrieval)
  • For 10,000 “anchor” problems, they generated:
    • 1 equivalent version (same math, different look),
    • 3 “hard negatives” (look similar on the surface, but actually different).
  • Retrieval models must return the truly equivalent one. Success is measured by Recall@k (did the right match show up in the top k results?).
  3. MathNet‑RAG (Retrieval‑Augmented Problem Solving)
  • 35 pairs of real Olympiad problems chosen by experts because they share structure/ideas.
  • They test three settings:
    • Zero‑shot: the model just gets the new problem.
    • Embed‑RAG: the model also gets a retrieved similar problem and its official solution (found by an embedding retriever).
    • Expert‑RAG: the model gets the expert‑paired similar problem and its solution (a “perfect” retrieval).

Helpful analogies for the tech:

  • Embeddings: Turning a piece of text into a long number “fingerprint,” so the computer can compare two texts by how close their fingerprints are.
  • Cosine similarity: A way to measure how similar two fingerprints are—like checking how close two arrows point in the same direction.
  • Math‑aware retrieval: Not just matching words, but recognizing the same math idea in disguise—like recognizing the same song when played on different instruments or in another key.
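
To make the “arrows” picture concrete, here is a small Python sketch of cosine similarity on tiny two-number fingerprints. Real embeddings have hundreds or thousands of numbers, and the vectors below are made up purely for illustration.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0  (perpendicular)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

The length of each arrow does not matter, only its direction, which is why the first pair scores a perfect 1.0 even though one vector is twice as long.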

They also define three types of “sameness” between problems:

  • Invariance: Same problem in disguise (e.g., different variable names or an equivalent formula).
  • Resonance: Different problems that use the same key trick or lemma.
  • Affinity: Same general topic but not necessarily the same method.

What did they find?

  • Problem solving: Top models do well but still make plenty of mistakes.
    • The best model (Gemini‑3.1‑Pro) scored about 78% overall. GPT‑5 was about 69%.
    • Algebra was the easiest; geometry and discrete math were the hardest for most models.
    • Even strong models can struggle with full, correct reasoning on Olympiad tasks.
  • Math‑aware retrieval: This is surprisingly hard.
    • At top‑1 (the single best guess), even the best embedding models only got around 5% right.
    • At larger cutoffs (top‑5 or top‑10), results improved a lot, but top‑1 is what really tests if the model “gets it.”
    • Models often matched by surface words (“triangle,” “polynomial”) instead of true math structure, so near‑miss problems got ranked higher than truly equivalent ones.
  • Retrieval‑augmented solving (RAG): Quality of retrieval is everything.
    • When the retrieved example was truly well‑matched (Expert‑RAG), multiple models improved, and one model (DeepSeek‑V3.2‑Speciale) saw gains up to about 12%, reaching the highest scores in that setting.
    • When retrieval came from a general embedding model (Embed‑RAG), improvements were inconsistent and sometimes worse than zero‑shot—because the “similar” problem wasn’t actually helpful.

Why this matters: It shows that today’s AI can often solve hard problems, but its “math search engine” side is weak—finding the right helpful example is still a major bottleneck.

Why is this important?

  • For students and teachers: Better math‑aware search could help you find the right example or idea, even if your problem is written differently.
  • For contest organizers and researchers: Helps detect when a new problem is too close to an old one and connects ideas across languages and notations.
  • For AI progress: Good retrieval is essential for RAG systems. If retrieval doesn’t truly understand math structure, it feeds the model the wrong examples and hurts performance.

Bottom line and future impact

  • MathNet is the largest, high‑quality Olympiad dataset with expert solutions across many countries and languages, and it includes the first benchmark focused on math‑aware problem retrieval.
  • The results highlight a gap: solving is getting strong, but retrieving truly equivalent or structurally related problems is still weak.
  • This dataset and benchmark should push research toward:
    • Embeddings that encode mathematical structure (not just matching words),
    • Better RAG systems that actually help reasoning,
    • Closer integration of symbolic math tools with LLMs.

In short, the paper gives the community a global, reliable “playground” to build AI that not only solves hard math but also finds the right math ideas—no matter how they’re written or drawn.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, consolidated list of concrete gaps and open questions the paper leaves unresolved, intended to guide follow-up research.

  • Synthetic equivalence validation: The MathNet-Retrieve positives and hard negatives are generated by GPT-5; the paper does not quantify human verification rates or noise levels. How often are “equivalent” variants truly equivalent, and how often do “hard negatives” accidentally preserve equivalence?
  • Narrow retrieval scope (invariance only): The retrieval benchmark targets strict equivalence (invariance) but omits retrieval for structural resonance and affinity. A benchmark with expert-labeled strategy-level and thematic similarity (including graded similarity) is needed.
  • Single-positive recall metric: Each anchor has exactly one positive. This makes Recall@k brittle and underestimates systems that retrieve other valid equivalents. Provide multiple positives per anchor and report precision/NDCG/MRR to better reflect ranking quality.
  • Cross-lingual math-aware retrieval is untested: The dataset is multilingual, but MathNet-Retrieve appears monolingual. Construct cross-language equivalent/resonant pairs (e.g., anchor in Spanish, positive in Russian) to evaluate language-agnostic mathematical retrieval.
  • Cross-modal retrieval is untested: The benchmark does not evaluate retrieving between textual, symbolic, and diagrammatic formulations (e.g., a geometry problem text vs a figure-centric variant). Design cross-modal equivalents and evaluate retrievers that fuse text and figure representations.
  • Missing baselines for symbolic/structured retrieval: Only general-purpose dense embeddings are reported. Evaluate math-specific retrievers using MathML/LaTeX parsing, expression trees (ASTs), CAS normalization, graph matching, sparse lexical (BM25/SPLADE), late interaction (ColBERT), and hybrid architectures.
  • Underspecified “formula-aware baseline”: The discussion references strong performance from a formula-aware baseline, but no methods or numbers are provided. What was the approach, and how does it compare across domains and k?
  • Limited RAG set size and diversity: MathNet-RAG has only 35 anchor–pair sets (70 problems), raising variance concerns and limiting domain/language coverage. Expand to hundreds of pairs, report per-domain/language results, and include inter-annotator agreement for similarity labels.
  • RAG ablations are minimal: Only k=1 retrieved context and a single retriever (gemini-embedding-001) are tested. How do gains vary with k, different retrievers, re-ranking, and mixture-of-retrievers strategies?
  • Realism of RAG inputs: RAG supplies the retrieved problem together with its official solution. Evaluate more realistic settings (problem-only, hints-only, noisy/partial solutions) and measure susceptibility to misleading or off-target retrieved solutions.
  • Grading reliability on MathNet-Solve: Problem-solving is graded by GPT-5 with a 0–7 score and a threshold ≥6, but no agreement with human graders is reported for the large-scale benchmark. Quantify inter-rater reliability, bias across model families, and robustness to style variations.
  • Potential training contamination: Many models likely saw official booklets during pretraining. Provide contamination audits, time-aware splits (e.g., train ≤2015, test >2015), and “clean” held-out subsets to assess true generalization.
  • Extraction pipeline fidelity: The OCR+LLM extraction pipeline (dots-ocr + LLM normalization/judging) lacks an end-to-end error rate report. Release a fully human-verified gold subset, and report per-language and per-modality extraction error rates and an error taxonomy.
  • Diagram fidelity and utilization: The impact and quality of diagrams are not quantified. Provide vectorized (e.g., SVG) or graph representations, evaluate diagram alignment accuracy, and analyze when and why images help or harm LMM performance.
  • Deduplication and split leakage: The deduplication methodology is not described in detail. Quantify near-duplicate rates and verify that paraphrases or variants of the same problem (or solution) do not straddle train/test splits.
  • Similarity taxonomy reliability: Invariance/resonance/affinity labels lack reported inter-annotator agreement. Establish detailed guidelines, measure agreement, and explore finer-grained or continuous similarity scales.
  • External-corpus generalization: Retrieval is evaluated only within MathNet. Test transfer to external corpora (AoPS, textbooks, arXiv), real shortlist vetting, and literature search scenarios to assess out-of-domain robustness and practical utility.
  • No supervised retriever training experiments: The paper evaluates off-the-shelf embeddings only. Train and evaluate retrieval models using supervised contrastive learning on MathNet-Retrieve (with careful validation to avoid overfitting to synthetic patterns).
  • Equivalence hardness: Many positives appear to be trivial renamings or algebraic restatements. Introduce harder equivalents (e.g., nontrivial isomorphisms, domain changes, geometric transformations, cross-modal restatements) and analyze failure modes.
  • Retrieval metrics and diagnostics: Beyond Recall@k, include MRR, NDCG, pairwise accuracy, and calibration curves; release per-anchor difficulty indicators and similarity threshold analyses to diagnose mis-rankings.
  • Proof validity vs final answers: For proof-style problems, LLM-judge scoring may conflate plausibility with correctness. Integrate formal proof verification (e.g., Lean/Coq) for a subset and evaluate models’ ability to produce formally checkable proofs.
  • Country/language/topic biases: The corpus skews toward English (≈74%) and certain domains. Report per-country/competition performance, balance splits by geography/language, and study transfer across styles and national curricula.
  • Temporal generalization: No analysis of performance across eras (e.g., pre-2000 vs 2010s vs 2020s). Construct chronological splits to probe drift in styles and techniques and assess models’ temporal robustness.
  • Multimodal failure analysis: The claim that visual augmentation yields limited gains lacks per-problem breakdowns. Provide error analyses highlighting specific diagram types (e.g., angle chasing vs locus problems) and common failure modes.
  • Availability of aligned translations: It is unclear whether non-English problems have aligned English translations. Provide high-quality bilingual pairs to enable fair cross-lingual evaluation and training.
  • Licensing and usage constraints: Clarify permissions for redistribution and model training on official national materials, and provide guidance for compliant academic and commercial use.
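
Several items above ask for rank-sensitive metrics beyond Recall@k. As a minimal sketch, assuming one positive per anchor and using invented ranks, the code below shows how MRR and NDCG separate two systems that Recall@1 cannot tell apart. With a single relevant item the ideal DCG is 1, so NDCG reduces to the discounted gain of the positive.

```python
import math


def recall_at_k(ranks: list[int], k: int) -> float:
    """Share of anchors whose positive is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: list[int]) -> float:
    """Average of 1 / rank of the positive (ranks are 1-based)."""
    if not ranks or any(r < 1 for r in ranks):
        raise ValueError("ranks must be a non-empty list of positive integers")
    return sum(1.0 / r for r in ranks) / len(ranks)


def ndcg_single_positive(ranks: list[int], k: int) -> float:
    """NDCG@k when each anchor has exactly one relevant item."""
    gains = [1.0 / math.log2(r + 1) if r <= k else 0.0 for r in ranks]
    return sum(gains) / len(gains)


# Hypothetical ranks of the true equivalent for four anchors, two systems.
system_a = [1, 2, 4, 2]
system_b = [1, 4, 4, 4]
for name, ranks in (("A", system_a), ("B", system_b)):
    print(
        name,
        recall_at_k(ranks, 1),
        mean_reciprocal_rank(ranks),
        round(ndcg_single_positive(ranks, 4), 3),
    )
```

Both systems have Recall@1 of 0.25, but system A places its misses closer to the top and is rewarded by MRR (0.5625 versus 0.4375).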

Practical Applications

Immediate Applications

Below are actionable use cases that can be deployed now by leveraging MathNet’s corpus, taxonomy (Invariance/Resonance/Affinity), benchmarks, and curation pipeline, with attention to the paper’s findings that retrieval remains a bottleneck and that RAG gains depend on retrieval quality.

  • Benchmarking and CI for AI/math models (Software/AI industry; Academia)
    • Tools/products/workflows: Integrate MathNet-Solve and MathNet-Retrieve into continuous evaluation pipelines and leaderboards; track domain-specific scores (algebra, number theory, geometry, discrete math); add retrieval dashboards (Recall@k) for embedding releases.
    • Assumptions/dependencies: Access to MathNet and grading harness; compute budget for large-scale evaluation; respect dataset licenses.
  • Math-aware search for education platforms (Education/EdTech)
    • Tools/products/workflows: Deploy a “find similar problems” feature using MathNet-Retrieve (top-k surfacing) with human-in-the-loop curation to mitigate low Recall@1; index problems by the similarity taxonomy to return invariant and resonant examples with official solutions.
    • Assumptions/dependencies: Current embeddings work best at higher k; teacher review recommended to filter near-miss negatives.
  • Novelty and duplication checker for Olympiad committees (Policy/Education; Academic competitions)
    • Tools/products/workflows: Use the taxonomy and retrieval to flag candidate duplicates (invariance) and near-duplicates (resonance) across historical archives before shortlist finalization; provide reviewer dashboards with provenance and side-by-side notations.
    • Assumptions/dependencies: Retrieval is imperfect at rank-1; requires human adjudication; ensure compliance with competition privacy policies.
  • RAG-based tutoring that emphasizes analogical learning (Education; Daily life)
    • Tools/products/workflows: Build tutoring assistants that present a target problem plus a structure-aligned worked example (Expert-RAG pattern) and prompt students to map solution ideas; include multilingual variants for accessibility.
    • Assumptions/dependencies: Retrieval quality matters; start with instructor-curated resonant pairs while embeddings improve; avoid over-reliance on model-generated steps without teacher oversight.
  • Embedding diagnostics for IR/enterprise search teams (Software/AI industry)
    • Tools/products/workflows: Use MathNet-Retrieve as a “unit test” suite for embeddings to detect lexical-overlap bias; track similarity distribution drift and failure modes on hard negatives; set gate criteria before shipping embeddings for math-heavy domains.
    • Assumptions/dependencies: Availability of internal evaluation harness; willingness to tune/train domain-specific embeddings.
  • Document digitization for technical archives (Software; Libraries/Archives; Government)
    • Tools/products/workflows: Adopt the OCR+LLM verification pipeline (dots-ocr + LLM normalization + rule-based and LLM judging + human spot-check) to convert scanned, multilingual STEM booklets into structured, LaTeX-friendly corpora.
    • Assumptions/dependencies: OCR quality for low-resolution scans; LLM costs; data governance for sensitive documents.
  • Plagiarism and equivalence-aware similarity checks in math coursework (Education/Academic integrity)
    • Tools/products/workflows: Use invariance-aware retrieval to detect solution or problem reuse that is algebraically reformulated; feed top-k candidates to instructors for verification.
    • Assumptions/dependencies: False positives from near-misses; requires faculty review to avoid over-penalization; adhere to student privacy policies.
  • Curriculum and assessment design across languages (Education/EdTech; Policy)
    • Tools/products/workflows: Build balanced problem sets by topic and difficulty using MathNet’s ontology; provide aligned multilingual versions; calibrate tests with domain-wise difficulty stats.
    • Assumptions/dependencies: Localization quality; educator validation; accessibility guidelines.
  • Retrieval-quality-aware RAG guardrails for math agents (Software/AI industry)
    • Tools/products/workflows: Add pre-inference checks on similarity distributions (e.g., equivalent vs. near-miss gap); fallback to zero-shot or chain-of-thought when retrieval looks noisy; log provenance and uncertainty signals for human review.
    • Assumptions/dependencies: Engineering effort to wire retrieval diagnostics; clear UX for fallbacks.
  • Training data curation with hard negatives for retriever fine-tuning (Software/AI industry; Academia)
    • Tools/products/workflows: Use MathNet’s synthetic hard negatives to fine-tune embedding models for math structure; monitor Recall@1/5 uplift and confusion on near-misses.
    • Assumptions/dependencies: Access to instruction-tuned LLMs for generation may be required; hyperparameter tuning and compute resources.
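
For the retriever fine-tuning use case, one common choice of objective is an InfoNCE-style contrastive loss over the positive and its hard negatives. The source does not specify a training objective, so the loss, the temperature value, and the similarity scores below are assumptions chosen for illustration.

```python
import math


def info_nce_loss(pos_sim: float, neg_sims: list[float], temperature: float = 0.05) -> float:
    """-log softmax of the positive among [positive + negatives], computed stably."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    peak = max(logits)  # subtract the max before exponentiating to avoid overflow
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# Hypothetical cosine similarities between an anchor and its candidates.
well_separated = info_nce_loss(0.90, [0.20, 0.10, 0.00])
near_miss_wins = info_nce_loss(0.60, [0.80, 0.30, 0.10])
uninformative = info_nce_loss(0.50, [0.50, 0.50, 0.50])

print(well_separated < near_miss_wins)  # True
print(round(uninformative, 4))          # 1.3863, i.e. log(4)
```

The loss is largest when a near-miss negative outscores the true equivalent, which is exactly the failure mode the hard negatives are meant to expose during training.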

Long-Term Applications

The following concepts require advances in math-aware embeddings, multi-modal understanding, symbolic integration, or broader ecosystem adoption before they become robust.

  • Hybrid symbolic–neural math retrievers (Software/AI industry; Academia)
    • Tools/products/workflows: Build retrievers that parse equations/graphs and reason over invariances (e.g., normalization of formulas, graph isomorphism in geometry) combined with text semantics; return proofs/lemmas as nodes in a structured graph.
    • Assumptions/dependencies: New model architectures and training data; reliable symbolic parsers; standardized math representations.
  • General-purpose math-aware search in scientific and engineering knowledge bases (Energy, Aerospace, Manufacturing, Pharma R&D)
    • Tools/products/workflows: Integrate math-equivalence retrieval into enterprise document search to find standards, identically transformed formulas, or resonant methods across manuals and papers; surface cross-domain analogs (Resonance).
    • Assumptions/dependencies: Domain adaptation; IP and security constraints; higher-quality embeddings than current state.
  • Prior-art and patent search for math-heavy inventions (Legal/IP; Software)
    • Tools/products/workflows: Equivalence-aware retrieval of claims that differ by notation or transformation; examiner dashboards with structured alignment of formulas and proof sketches.
    • Assumptions/dependencies: Legal acceptance of AI-assisted evidence; robustness to adversarial paraphrase.
  • Verified theorem-discovery assistants (Academia/Research)
    • Tools/products/workflows: Agents that mine literature for resonant problems and lemmas (Common Lemma/Structural Reduction) and propose conjectures or proof sketches with machine-checked verification; maintain provenance chains.
    • Assumptions/dependencies: Stronger retrieval + formal verification integration; human-in-the-loop research workflows.
  • Reliable automated grading for proofs at scale (Education; Policy)
    • Tools/products/workflows: Extend the 0–7 scoring protocol to classroom and exam settings with calibrated LLM judges, formal checks for key steps, and adversarial audits; multilingual support for graders.
    • Assumptions/dependencies: Bias and robustness studies; institutional approval; appeals and auditing mechanisms.
  • Cross-lingual math education platforms with adaptive analogical practice (Education/EdTech; Daily life)
    • Tools/products/workflows: Personalized practice recommending invariant/resonant problems to build transfer; difficulty ramps based on taxonomy and solution process analytics; multimodal (diagram) handling.
    • Assumptions/dependencies: Improved geometry/diagram understanding; learner modeling; privacy-by-design.
  • STEM-wide multimodal retrieval (diagrams, plots, CAD) (Software/AI industry; Engineering)
    • Tools/products/workflows: Extend math-aware retrieval to engineering schematics and scientific figures; align textual problem statements with diagrammatic constraints for robust cross-modal search.
    • Assumptions/dependencies: Advances in visual-symbolic representations; high-quality multimodal datasets.
  • Competition problem authoring tools with novelty constraints (Education; Academic competitions)
    • Tools/products/workflows: Co-authoring assistants that suggest modifications preserving difficulty while avoiding invariance/resonance with known problems; live similarity scores during drafting.
    • Assumptions/dependencies: High-precision top-1 retrieval; buy-in from committees; secure, offline deployment.
  • Safety/regulatory capability tests for high-stakes AI (Policy/Regulation)
    • Tools/products/workflows: Use MathNet-like benchmarks to assess structured reasoning and analogical generalization before deployment in critical systems; include retrieval-sensitive RAG tests and pass/fail criteria.
    • Assumptions/dependencies: Consensus on metrics and thresholds; transparent reporting; third-party auditing.
  • Finance and quant research assistants (Finance)
    • Tools/products/workflows: Retrieval of equivalent identities, transforms, or bounds across papers and codebases (e.g., risk metrics reformulations); RAG pipelines that attach provenance and caveats.
    • Assumptions/dependencies: Domain adaptation; regulatory compliance; rigorous validation to avoid subtle mathematical mismatches.
  • Open-source math-aware embedding/toolkit ecosystem (Software/AI industry; Academia)
    • Tools/products/workflows: Libraries for training/evaluating embeddings on invariance/resonance; standardized datasets, evaluation harnesses, and public leaderboards; plugins for Jupyter/Overleaf to suggest analogs with citations.
    • Assumptions/dependencies: Community maintenance; sustained funding; benchmarking governance.
  • Dynamic RAG orchestration based on retrieval confidence (Software/AI industry)
    • Tools/products/workflows: Agent frameworks that adapt top-k, switch retrievers (lexical vs. symbolic), or escalate to human review based on similarity diagnostics; continuous learning from user feedback.
    • Assumptions/dependencies: Mature retrieval-quality estimators; ops maturity for human escalation loops.
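The dynamic RAG orchestration item above can be illustrated with a minimal routing sketch. It is not taken from the paper: the function name `route`, the action labels, and the thresholds `high`, `low`, and `margin` are all hypothetical, and a real system would calibrate them against a retrieval-quality estimator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RoutingDecision:
    action: str  # "use_context", "widen_search", or "escalate"
    top_k: int   # number of retrieved problems to pass to the solver


def route(similarities: List[float],
          high: float = 0.85,
          low: float = 0.60,
          margin: float = 0.05) -> RoutingDecision:
    """Pick a RAG strategy from retrieval similarity scores.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    if not similarities:
        # Nothing retrieved: hand the problem to a human reviewer.
        return RoutingDecision("escalate", 0)

    ranked = sorted(similarities, reverse=True)
    best = ranked[0]
    runner_up = ranked[1] if len(ranked) > 1 else float("-inf")

    # Confident and unambiguous: a single retrieved problem is enough.
    if best >= high and best - runner_up >= margin:
        return RoutingDecision("use_context", 1)

    # Plausible but ambiguous: pass more candidates (capped at 5).
    if best >= low:
        return RoutingDecision("widen_search", min(5, len(ranked)))

    # Weak retrieval: injecting context is likely to mislead the solver.
    return RoutingDecision("escalate", 0)
```

Requiring a margin over the runner-up, rather than a single absolute threshold, is one way to guard against near-miss distractors that score almost as high as the true match.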

Glossary

  • Affinity: A loose thematic relatedness between problems without structural equivalence; useful for grouping by topic rather than shared methods. "Affinity refers to a broad sense of relatedness without structural equivalence."
  • Anchor problems: Base problems used to construct evaluation datasets by pairing with equivalents and adversarial variants. "MathNet-Retrieve is built from 10,000 anchor problems from MathNet-Solve."
  • Automatic grading: Scoring model-generated solutions using an automated judge rather than human evaluators. "using both automatic grading and human expert grading."
  • Binarize: Convert a numeric score into a binary correct/incorrect label using a threshold. "We binarize the score by marking outputs with score ≥ 6 as correct (fully correct or containing only minor errors) and scores < 6 as incorrect."
  • Cosine similarity: A measure of angular similarity between embedding vectors used to compare problem statements. "We compute similarities between problem statements using cosine similarity over the embedding representations."
  • Cross-modal cues: Signals shared across different modalities (e.g., text and images) that inform retrieval or reasoning. "modern IR excels at semantic paraphrase but is often blind to symbolic equivalence and cross-modal cues."
  • Deliberate reasoning: A model inference style that engages extended, structured reasoning steps. "LLMs and LMMs with deliberate reasoning, including gemini-3.1-pro-preview, gemini-3-flash-preview, gemini-2.5-flash, claude-opus-4.6, gpt-5, gpt-5-mini, gpt-5-nano, and DeepSeek-R1."
  • dots-ocr: A multilingual document parsing framework used to extract and normalize text from diverse math booklets. "using the multilingual document parsing framework dots-ocr (Li et al., 2025; Zheng et al., 2026)."
  • Embedding-based systems: Retrieval systems that represent text or problems as vectors and search via similarity. "a novel retrieval task that measures whether embedding-based systems can identify related problems based on deeper structural relationships rather than surface-level features."
  • Embedding models: Models that produce vector representations of text/statements for similarity search. "Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems."
  • Embed-RAG: A retrieval-augmented setting where context is sourced via embedding-based retrieval. "In Embed-RAG, we retrieve one related problem using gemini-embedding-001, then provide the retrieved problem and its official solution as additional context."
  • Equivalent Positives: Synthetic problem variants that are mathematically equivalent to an anchor. "Equivalent Positives. For each anchor problem, we generate one mathematically equivalent variant via variable renaming (e.g., x -> a), algebraic manipulation, and paraphrasing using GPT-5."
  • Expert-RAG: A retrieval-augmented setting where context is sourced from expert-curated structurally similar problems. "In Expert-RAG, we instead provide the expert-paired related problem from MathNet-RAG together with its official solution."
  • Formula-aware indexing: Retrieval methods that index and match mathematical formulas rather than purely text semantics. "There has been work on formula-aware indexing (Zanibbi et al., 2025; Das et al., 2025), but such systems predate LLMs and typically operate at the formula level"
  • Hard Negatives: Adversarial, non-equivalent problem variants designed to be lexically similar and confuse retrievers. "Hard Negatives. For each anchor, we also generate three adversarial hard negatives that preserve much of the surface form while changing the underlying mathematics"
  • Hallucinated content: Spurious content introduced by generative models during extraction or formatting. "to ensure that LLM editing is restricted to formatting adjustments and introduces no hallucinated content."
  • Human grading: Evaluation of solutions or performance by human experts rather than automated judges. "Results are reported as accuracy (%) ± standard error under human grading and average LLM grading"
  • Information Retrieval (IR): The field focused on searching and ranking relevant information from large corpora. "Meanwhile, modern IR excels at semantic paraphrase but is often blind to symbolic equivalence and cross-modal cues."
  • Invariance: Strict equivalence between problems under transformations such as renaming or reformulation. "Invariance refers to strict equivalence under transformation."
  • Isomorphism: A structural correspondence between domains that preserves problem form or properties. "Examples include syntactic renaming, algebraic reformulation, geometric re-characterization, or cross-domain isomorphism."
  • LaTeX: A typesetting system used to represent mathematical content in a structured, reproducible format. "in a LaTeX-friendly format."
  • LLM grading: Scoring solutions using a large language model acting as an automated judge. "Under average LLM grading, the same overall pattern holds"
  • LLM-based pipeline: A multi-stage workflow that leverages LLMs for extraction, alignment, and verification. "we designed a novel LLM-based pipeline for problem-solution alignment"
  • LLMs: Large language models capable of text-based reasoning and generation. "Recent LLMs"
  • LMMs: Large Multimodal Models that process and reason over both text and images. "large multimodal models (LMMs)"
  • Macro Avg.: Average performance across categories treated equally, used in reporting results. "Macro Avg. Micro Avg."
  • Math-Aware Retrieval: Retrieval that considers symbolic structure and equivalence, not just textual semantics. "Math-Aware Retrieval, or the ability to recognize and retrieve mathematically equivalent or related problems."
  • MathNet-RAG: The dataset for evaluating retrieval-augmented problem solving with expert-curated pairs. "MathNet-RAG: an evaluation dataset of 35 anchor problems and 35 expert-paired real problems"
  • MathNet-Retrieve: The dataset for evaluating math-aware retrieval using synthetic equivalents and hard negatives. "MathNet-Retrieve: an evaluation dataset for Math-Aware Retrieval built from 10,000 anchor problems"
  • MathNet-Solve: The main corpus of expert-written Olympiad problems and solutions used for problem solving. "MathNet-Solve: 30,676 expert-written Olympiad problems with solutions"
  • Micro Avg.: Average performance weighted by sample counts, used in reporting results. "Macro Avg. Micro Avg."
  • Multimodal: Involving multiple data types (e.g., text and images) in tasks and benchmarks. "Multimodal benchmarks integrate visual information with textual descriptions"
  • Near-miss distractors: Non-equivalent problem variants that closely resemble the anchor to mislead retrievers. "These near-miss distractors make it difficult for models to succeed by relying only on lexical similarity."
  • OCR: Optical Character Recognition, used to extract text from scanned or typeset documents. "the original OCR output"
  • Ontology: A formal categorization of problem types or concepts within the dataset. "Ontology of 68. Problem Types"
  • Oracle contexts: Evaluation setting where the model is provided with ground-truth or expert-selected relevant context. "used for Retrieval-Augmented Problem Solving under both retrieved and oracle contexts."
  • Provenance: Metadata tracking the source and origin of problems for verification and traceability. "to maintain provenance information for each problem."
  • Recall@k: A retrieval metric indicating whether a correct item appears among the top-k results. "The primary evaluation metric for our retrieval task is Recall@k"
  • Retrieval-Augmented Generation (RAG): A technique that supplements generation with retrieved supporting context. "retrieval-augmented generation (RAG)"
  • Retrieval-Augmented Problem Solving: Solving problems with the aid of retrieved, relevant examples and solutions. "Retrieval-Augmented Problem Solving under both retrieved and oracle contexts."
  • Rule-based analytical checker: A deterministic verification component used to validate extraction fidelity. "we validate the extracted problem-solution pairs through a three-stage verification process of (1) a rule-based analytical checker"
  • Semantic retrieval: Retrieval focused on meaning-level similarity rather than exact symbolic or structural equivalence. "unlike existing semantic retrieval"
  • Structural Parsing: Parsing step that analyzes document structure to align problems and solutions. "Structural Parsing"
  • Structural Reduction: A similarity mode where problems relate via reductions in structure or form. "Structural Reduction"
  • Structural Resonance: Partial similarity where problems share ideas, lemmas, or strategies without strict equivalence. "Resonance refers to partial similarity."
  • Taxonomy: A structured categorization of similarity modes or problem types to analyze performance. "A key feature of MathNet is its fine-grained taxonomy of mathematical similarity"
  • Team Selection Tests (TST): National exams used to select teams for international competitions like the IMO. "Team Selection Tests (TST)"
  • Zero Shot: An inference setting where the model receives only the target problem without additional context. "In Zero Shot, the model receives only the target problem."
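Several of the glossary terms above (cosine similarity, Recall@k, binarize, micro and macro averages) are simple computations, and a small sketch may make them concrete. The function names, the two-dimensional toy vectors, and the category labels below are illustrative assumptions, not the paper's evaluation code; only the score threshold of 6 comes from the quoted text.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recall_at_k(query: Sequence[float],
                candidates: Dict[str, Sequence[float]],
                positive_id: str,
                k: int) -> int:
    """Return 1 if the positive candidate ranks in the top k, else 0."""
    ranked = sorted(candidates,
                    key=lambda cid: cosine_similarity(query, candidates[cid]),
                    reverse=True)
    return int(positive_id in ranked[:k])


def binarize(score: int, threshold: int = 6) -> bool:
    """Map a 0-7 grade to correct/incorrect; a score >= 6 counts as correct."""
    return score >= threshold


def micro_macro_accuracy(results: Dict[str, List[bool]]) -> Tuple[float, float]:
    """Micro pools all samples; macro averages per-category accuracies.

    Assumes every category holds at least one result.
    """
    per_category = [sum(r) / len(r) for r in results.values()]
    macro = sum(per_category) / len(per_category)
    pooled = [x for r in results.values() for x in r]
    micro = sum(pooled) / len(pooled)
    return micro, macro


# Toy embeddings (made up): the hard negative sits closer to the query
# than the true equivalent, so it wins the top rank.
QUERY = [0.7, 0.7]
CANDIDATES = {
    "equivalent": [0.9, 0.1],
    "hard_negative": [0.8, 0.6],
    "unrelated": [0.0, 1.0],
}

hit_at_1 = recall_at_k(QUERY, CANDIDATES, "equivalent", k=1)
hit_at_2 = recall_at_k(QUERY, CANDIDATES, "equivalent", k=2)
micro, macro = micro_macro_accuracy({
    "algebra": [True, True, True, False],
    "geometry": [False, True],
})
```

In this toy setup the equivalent problem is missed at k=1 but found at k=2, which is the failure pattern hard negatives are meant to expose. The micro and macro averages differ because the two categories have different sizes.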
