MathArena: Benchmarking LLM Math Reasoning
- MathArena is a benchmarking framework that evaluates LLMs on new math competition problems to ensure authentic reasoning over memorization.
- It supports dual evaluation of final-answer and proof-writing tasks using automated tools and human-in-the-loop grading for high-fidelity assessment.
- Its evolving leaderboard and transparent scoring pipeline drive continuous improvement and benchmarking of LLM mathematical capabilities.
MathArena is a benchmarking framework for evaluating the mathematical reasoning and proof-writing capabilities of LLMs via real-time assessments on newly released math competition problems. Its principal innovation is the rigorous elimination of contamination—models are only tested on problems published after their last training date, ensuring that performance reflects authentic reasoning rather than memorization. MathArena is distinguished by its dual support for both final-answer and proof-writing tasks, transparent reproducibility guarantees, and an evolving leaderboard. It currently encompasses 30 models across 149 problems and multiple contests such as AIME 2024, HMMT 2025, USAMO 2025, BRUMO 2025, and SMT 2025, maintaining a continuously updated platform for documenting advances in LLM mathematical competence (Balunović et al., 29 May 2025).
1. Formal Structure and Objective
MathArena is formally specified by the 4-tuple $(\mathcal{C}, \mathcal{M}, \mathcal{E}, G)$, where:
- $\mathcal{C}$ is a dynamic set of math competitions, each indexed by $c$ with a problem set $P_c$ and corresponding official solutions.
- $\mathcal{M}$ comprises the evaluated LLMs.
- $\mathcal{E}$ denotes the real-time evaluation pipeline, which automates problem ingestion, model querying, and grading upon each new competition release.
- $G$ is the grading scheme mapping model outputs to numeric scores.
The objective is contamination-free benchmarking: only problems released post-model cutoff are submitted, enabling the disambiguation of genuine reasoning from memorized content. MathArena guarantees support for high-fidelity evaluation—including both answer and proof-based tasks—with automated and human-in-the-loop verification.
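The temporal eligibility rule can be sketched as a simple filter (a minimal illustration; the model names, cutoff dates, and function name are hypothetical, not MathArena's actual API):

```python
from datetime import date

def eligible_models(models, release_date):
    """Keep only models whose training cutoff predates the competition
    release, so no evaluated problem could appear in training data."""
    return [name for name, cutoff in models.items() if cutoff < release_date]

# Hypothetical cutoffs for illustration.
models = {
    "model-a": date(2024, 10, 1),
    "model-b": date(2025, 3, 15),
}
print(eligible_models(models, date(2025, 2, 1)))  # → ['model-a']
```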
2. Real-Time Evaluation Workflow
MathArena's pipeline activates upon official competition release, executing a series of standardized steps:
- Problem Ingestion
- Acquire official competition materials.
- Extract and normalize LaTeX statements and answers, with manual QA for accuracy.
- Model Submission
- For each model whose release date (and hence training cutoff) predates the contest, prompt it with each problem statement.
- Cap generations at 64k tokens; execute four independent runs per problem.
- Enforce output formatting: a boxed final answer for answer tasks, a full natural-language write-up for proofs.
- Scoring
- Answer-based: Parse boxed responses using a rule-based Sympy parser, then apply an LLM (Gemini-2.5-Flash) as a secondary judge; discrepancies are flagged for manual review.
- Proof-based: Anonymize responses; expert graders (two per proof) apply contest rubrics to assign points.
- Leaderboard & Transparency
- Aggregate and publish model scores and confidence intervals at https://matharena.ai with full solution-level drill-down and reviewer flags.
This operationalized workflow enables immediate, reproducible, and publicly visible assessment of LLMs against the most current and unexposed mathematical content.
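The ingestion–submission–scoring loop above can be sketched end-to-end (a minimal sketch; the function names, dict keys, and toy model are illustrative assumptions, not the framework's actual interface):

```python
def run_evaluation(problems, models, n_runs=4):
    """Minimal sketch of the MathArena loop: for each model and problem,
    sample n_runs independent generations and grade each one, then
    report the model's mean accuracy over all problems."""
    results = {}
    for model in models:
        per_problem = []
        for problem in problems:
            runs = [model["solve"](problem) for _ in range(n_runs)]
            # Grade each run: 1 if the answer matches ground truth, else 0.
            correct = [int(ans == problem["answer"]) for ans in runs]
            per_problem.append(sum(correct) / n_runs)
        results[model["name"]] = sum(per_problem) / len(per_problem)
    return results

# Toy model that always answers "42".
toy = {"name": "toy", "solve": lambda p: "42"}
print(run_evaluation([{"statement": "…", "answer": "42"}], [toy]))  # → {'toy': 1.0}
```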
3. Contamination Detection and Mitigation
MathArena addresses the fundamental challenge of contamination, a phenomenon where models may have seen, or even been tuned on, evaluation problems prior to assessment. Contamination is rigorously defined: a model $m$ is contaminated on a competition $c$ if any of $c$'s problems (or close variants) exist in $m$'s training data, or if $m$ was tuned using performance on $c$.
Detection methodology consists of:
- Temporal Check: Ensure the model's release/cutoff date precedes the contest date; flag models failing this criterion.
- Performance Anomaly: Compare each model's scores on historical contests ($s_{\text{hist}}$) and fresh contests ($s_{\text{fresh}}$), benchmarking both against the matched human performance quantile. Large discrepancies suggest contamination: results are visualized as $s_{\text{hist}}$ against $s_{\text{fresh}}$ relative to the diagonal baseline $s_{\text{hist}} = s_{\text{fresh}}$; outliers above the diagonal indicate likely contamination, as observed in AIME 2024.
To ensure uncontaminated input, competitions are ingested within 1–2 days of release, before the problems could feasibly enter any model's training or fine-tuning data.
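The anomaly heuristic can be illustrated as follows (the margin value and function name are assumptions for illustration; the actual analysis compares score distributions visually against the diagonal):

```python
def flags_contamination(hist_score, fresh_score, human_quantile, margin=10.0):
    """Flag a model whose historical-contest score exceeds both its
    fresh-contest score and the matched human quantile by a wide margin,
    the signature of likely training-data leakage."""
    return (hist_score - fresh_score > margin) and (hist_score - human_quantile > margin)

# A model scoring far above its fresh-contest level and the human baseline.
print(flags_contamination(hist_score=92.0, fresh_score=75.0, human_quantile=78.0))  # → True
```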
4. Task Types and Scoring Metrics
MathArena evaluates LLMs on two primary mathematical task types:
- Final-Answer Tasks: Problems requiring a single numeric/symbolic solution.
- Proof-Writing Tasks: Problems demanding multi-step, natural-language proofs.
Numerical-Answer Scoring:
For each problem $p$ and model $m$, four independent generations are sampled. Pass@1 accuracy is computed as
$$\text{pass@1}(m, p) = \frac{1}{4}\sum_{i=1}^{4}\mathbf{1}\!\left[a_i^{(m,p)} \equiv a_p^{*}\right],$$
where $\equiv$ denotes exact Sympy equivalence between the generated answer $a_i^{(m,p)}$ and the ground-truth answer $a_p^{*}$; reported scores are this average over the four runs.
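The scoring rule can be sketched as follows; exact rational comparison via `fractions.Fraction` stands in here for the Sympy equivalence check, a simplifying assumption for a self-contained example:

```python
from fractions import Fraction

def answers_equal(a, b):
    """Simplified stand-in for MathArena's Sympy equivalence check:
    compare answers as exact rationals when possible, else as strings."""
    try:
        return Fraction(a) == Fraction(b)
    except (ValueError, ZeroDivisionError):
        return a.strip() == b.strip()

def pass_at_1(generations, ground_truth):
    """Mean correctness over the independent runs for one problem."""
    return sum(answers_equal(g, ground_truth) for g in generations) / len(generations)

# "0.5", "1/2", and "2/4" are all equivalent to 1/2; "3" is not.
print(pass_at_1(["0.5", "1/2", "2/4", "3"], "1/2"))  # → 0.75
```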
Proof-Writing Scoring:
Each proof is graded on a 0–7 point scale per problem (a maximum of 42 points across the six USAMO 2025 problems). The total proof score for a model is the sum, over problems, of the two graders' averaged rubric scores. Grading is rubric-based, awarding points for correct steps, logical coherence, and completeness.
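Aggregation of the two blinded graders' scores can be sketched as follows (the averaging rule is an assumption consistent with the two-graders-per-proof setup; the grade values are invented):

```python
def proof_score(grades):
    """Total proof score: per problem, average the two blinded graders'
    0–7 rubric scores, then sum over all problems."""
    return sum((g1 + g2) / 2 for g1, g2 in grades)

# Six USAMO-style problems, two graders each (hypothetical grades).
grades = [(7, 7), (3, 4), (0, 1), (2, 2), (0, 0), (1, 0)]
print(proof_score(grades))  # → 13.5
```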
Statistical Analyses:
- Accuracy variance per contest, estimated across the four independent runs per problem.
- Rank confidence intervals computed via a paired permutation test on per-problem score differences.
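A paired permutation test for whether model A genuinely outperforms model B on a contest can be sketched as follows (a standard sign-flipping construction under stated assumptions, not MathArena's exact code; the score lists are invented):

```python
import random

def paired_permutation_pvalue(scores_a, scores_b, n_perm=10_000, seed=0):
    """Two-sided p-value for the mean per-problem score difference:
    randomly flip the sign of each paired difference and count how often
    the permuted mean is at least as extreme as the observed one."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_perm):
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical per-problem pass@1 scores for two models on one contest.
a = [1, 1, 1, 0.75, 1, 0.5, 1, 1]
b = [0.5, 0.25, 0.75, 0.5, 0.25, 0.5, 0.75, 0.5]
print(paired_permutation_pvalue(a, b))
```

A small p-value means the observed score gap is unlikely under the null hypothesis that the two models are interchangeable, supporting a confident leaderboard ranking.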
5. Implementation Pipeline
The framework’s implementation encompasses:
- Data Collection: Manual retrieval and normalization of competition problems, LaTeX standardization, CSV extraction of ground-truth answers, and quality assurance.
- Model Interfacing: All models are accessed via public API, using provider-recommended hyperparameters, with no further tuning. Prompts are formatted to enforce answer/proof conventions.
- Validation:
- Answer-based grading is automated and relies on double-parsing (Sympy, Gemini-2.5-Flash), resorting to manual adjudication for flagged answers.
- Proof-based grading is executed by two blinded experts per proof, following structured evaluation rubrics.
This bifurcation ensures scalability for objective answer tasks and maintains high fidelity for subjective, nuanced proof judgments.
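The double-parsing adjudication for answer tasks can be sketched as follows (the function names and toy parsers are hypothetical stand-ins for the Sympy parser and the Gemini-2.5-Flash judge):

```python
def grade_answer(response, truth, rule_parse, llm_judge):
    """Two-stage grading: a rule-based parser and an LLM judge each score
    the response; agreement is final, disagreement is flagged for humans."""
    rule_verdict = rule_parse(response) == truth
    llm_verdict = llm_judge(response, truth)
    if rule_verdict == llm_verdict:
        return {"correct": rule_verdict, "flagged": False}
    return {"correct": None, "flagged": True}

# Toy parsers standing in for the real Sympy parser and LLM judge.
result = grade_answer("\\boxed{42}", "42",
                      rule_parse=lambda r: r.strip("\\boxed{}"),
                      llm_judge=lambda r, t: t in r)
print(result)  # → {'correct': True, 'flagged': False}
```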
6. Experimental Results and Performance Landscape
MathArena’s experimental campaign to date spans 143 problems (numerical) and 6 problems (proof-based). Key findings include:
Accuracy per contest:
| Model | AIME (%) | HMMT (%) | BRUMO (%) | SMT (%) | Avg (%) |
|---|---|---|---|---|---|
| o3 (high) | 89.17 | 77.50 | 95.83 | 87.74 | 87.56 |
| o4-mini (high) | 91.67 | 82.50 | 86.67 | 88.68 | 87.38 |
| Gemini-2.5-Pro | 87.50 | 82.50 | 90.00 | 84.91 | 86.23 |
Contamination was confirmed in AIME 2024, with model scores often 10–20 points above the human baseline, consistent with leakage. In contrast, SMT 2025 (competition released post-model) yielded top model performance near 88%, indicative of strong uncontaminated reasoning.
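The Avg column in the table above is the unweighted mean of the four contest scores; for o3 (high), for example:

```python
# o3 (high) per-contest accuracies from the table above.
scores = {"AIME": 89.17, "HMMT": 77.50, "BRUMO": 95.83, "SMT": 87.74}
avg = sum(scores.values()) / len(scores)
print(round(avg, 2))  # → 87.56
```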
Proof-Writing Results (USAMO 2025):
| Model | Total /42 pts | % |
|---|---|---|
| Gemini-2.5-Pro | 10.1 | 24.0 |
| o3 | 9.2 | 21.9 |
| o4-mini | 8.1 | 19.3 |
| Human median | 15.0 | 35.7 |
The best LLMs attain below 25% of available proof points, notably lagging behind median human performance. This suggests that proof-writing—requiring symbolic manipulation, logical deduction, and narrative clarity—remains an active frontier for LLM evolution.
7. Outlook and Prospective Extensions
MathArena substantiates that real-time, uncontaminated competition-based evaluation yields reproducible, interpretable benchmarks for LLM mathematical progress. While final-answer task performance approaches saturation (88%+ on recent contests), proof-writing capabilities are both nascent and challenging.
Future roadmap items include:
- Expansion to additional advanced competitions (e.g. IMO 2025, Putnam 2025).
- Research into automated evaluation protocols for natural-language proofs, which may necessitate hybrid human/LLM grading.
- Sourcing or constructing more difficult final-answer tasks, given the predicted saturation by 2026.
- Formalization and refinement of contamination quantification via new metrics (e.g., a dedicated “contamination score”).
MathArena provides a robust living platform for documenting and interrogating mathematical reasoning advances in LLMs, fully shielded from data leakage and equipped to capture both computational and discursive dimensions of mathematical expertise (Balunović et al., 29 May 2025).