
MathArena: Benchmarking LLM Math Reasoning

Updated 9 January 2026
  • MathArena is a benchmarking framework that evaluates LLMs on new math competition problems to ensure authentic reasoning over memorization.
  • It supports dual evaluation of final-answer and proof-writing tasks using automated tools and human-in-the-loop grading for high-fidelity assessment.
  • Its evolving leaderboard and transparent scoring pipeline drive continuous improvement and benchmarking of LLM mathematical capabilities.

MathArena is a benchmarking framework for evaluating the mathematical reasoning and proof-writing capabilities of LLMs via real-time assessments on newly released math competition problems. Its principal innovation is the rigorous elimination of contamination—models are only tested on problems published after their last training date, ensuring that performance reflects authentic reasoning rather than memorization. MathArena is distinguished by its dual support for both final-answer and proof-writing tasks, transparent reproducibility guarantees, and an evolving leaderboard. It currently encompasses 30 models across 149 problems and multiple contests such as AIME 2024, HMMT 2025, USAMO 2025, BRUMO 2025, and SMT 2025, maintaining a continuously updated platform for documenting advances in LLM mathematical competence (Balunović et al., 29 May 2025).

1. Formal Structure and Objective

MathArena is formally specified by the 4-tuple

$$\text{MathArena} = (\mathcal{C},\, \mathcal{M},\, \Pi,\, G)$$

where:

  • $\mathcal{C}$ is a dynamic set of math competitions, each indexed by $C$ with problem set $\{p_1,\dots,p_{N_C}\}$ and corresponding official solutions.
  • $\mathcal{M}$ comprises the evaluated LLMs.
  • $\Pi$ denotes the real-time evaluation pipeline, which automates problem ingestion, model querying, and grading upon each new competition release.
  • $G$ is the grading scheme mapping model outputs to numeric scores.

The objective is contamination-free benchmarking: only problems released post-model cutoff are submitted, enabling the disambiguation of genuine reasoning from memorized content. MathArena guarantees support for high-fidelity evaluation—including both answer and proof-based tasks—with automated and human-in-the-loop verification.
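The 4-tuple above can be made concrete in code. The following is a minimal sketch; the dataclass and field names are illustrative, not taken from MathArena's actual codebase, and ISO date strings stand in for the release/cutoff timestamps $t_C$ and $t_M$:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Competition:
    name: str           # e.g. "AIME 2025"
    release_date: str   # ISO date, plays the role of t_C
    problems: list[str] # LaTeX statements p_1 .. p_{N_C}
    answers: list[str]  # official ground-truth solutions

@dataclass
class Model:
    name: str
    cutoff_date: str    # training-data cutoff, plays the role of t_M

@dataclass
class MathArena:
    competitions: list[Competition]                      # C
    models: list[Model]                                  # M
    pipeline: Callable[[Competition, Model], list[str]]  # Π: query a model
    grader: Callable[[str, str], float]                  # G: (output, reference) -> score

    def eligible(self, m: Model, c: Competition) -> bool:
        # Contamination-free constraint: only evaluate models whose
        # training cutoff precedes the competition release (t_M <= t_C).
        # ISO date strings compare correctly as plain strings.
        return m.cutoff_date <= c.release_date
```

The `eligible` predicate encodes the framework's central invariant: a model/competition pair enters the leaderboard only if the temporal check passes.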

2. Real-Time Evaluation Workflow

MathArena's pipeline activates upon official competition release (at time $t_C$), executing a series of standardized steps:

  1. Problem Ingestion
    • Acquire official competition materials.
    • Extract and normalize LaTeX statements and answers, with manual QA for accuracy.
  2. Model Submission
    • For each model $M$ with release date $t_M \leq t_C$, prompt with "Solve problem $p_i$".
    • Cap generations at 64k tokens; execute four independent runs per problem.
    • Enforce output formatting: $\boxed{\cdot}$ for final answers, full natural-language write-ups for proofs.
  3. Scoring
    • Answer-based: Parse boxed responses with a rule-based Sympy parser and apply an LLM judge (Gemini-2.5-Flash) as a secondary check; discrepancies are flagged for manual review.
    • Proof-based: Anonymize responses; expert graders (two per proof) apply contest rubrics to assign points.
  4. Leaderboard & Transparency
    • Aggregate and publish model scores and confidence intervals at https://matharena.ai with full solution-level drill-down and reviewer flags.

This operationalized workflow enables immediate, reproducible, and publicly visible assessment of LLMs against the most current and unexposed mathematical content.

3. Contamination Detection and Mitigation

MathArena addresses the fundamental challenge of contamination—a phenomenon where models may have seen or even been tuned on evaluation problems prior to assessment. Contamination is rigorously defined: a model $M$ is contaminated on contest $C$ if any of $C$'s problems (or close variants) appear in $M$'s training data, or if $M$ was tuned using performance on $C$.

Detection methodology consists of:

  • Temporal Check: Ensure $t_M \leq t_C$; flag models failing this criterion.
  • Performance Anomaly: Compare historical ($S_{M,C}^{\mathrm{old}}$) and fresh ($S_{M,C}^{\mathrm{new}}$) contest scores, benchmarking against the human performance quantile $S_{\alpha,C}^{\mathrm{human}}$. Large discrepancies suggest contamination:

    $$\Delta_C(M) = S_{M,C}^{\mathrm{old}} - S_{\alpha,C}^{\mathrm{human}} \gg S_{M,C}^{\mathrm{new}} - S_{\alpha,C}^{\mathrm{human}}$$

    Results are visualized as points $(S_{M,C}^{\mathrm{old}},\, S_{M,C}^{\mathrm{new}})$ relative to the diagonal $S^{\mathrm{old}} = S^{\mathrm{new}}$; outliers above the diagonal indicate likely contamination, as observed for AIME 2024.

To ensure uncontaminated input, competitions are ingested within 1–2 days of release—well before the problems could feasibly enter any model's training or fine-tuning data.
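The anomaly check above reduces to comparing a model's old-contest and new-contest margins over the human baseline. A minimal sketch, assuming a simple decision threshold (the `margin` parameter is an assumption, not a value specified by MathArena):

```python
def flag_contamination(s_old: float, s_new: float, s_human: float,
                       margin: float = 10.0) -> bool:
    """Heuristic contamination flag.

    s_old:   model score on a contest predating its cutoff (S^old_{M,C})
    s_new:   model score on a fresh, post-cutoff contest (S^new_{M,C})
    s_human: human performance quantile baseline (S^human_{alpha,C})

    Flags the model if its margin over the human baseline on the old
    contest greatly exceeds its margin on the new one.
    """
    delta_old = s_old - s_human
    delta_new = s_new - s_human
    return delta_old - delta_new > margin
```

On the AIME 2024 pattern reported in Section 6 (old scores 10–20 points above baseline, fresh scores not), this check would fire for the affected models.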

4. Task Types and Scoring Metrics

MathArena evaluates LLMs on two primary mathematical task types:

  • Final-Answer Tasks: Problems requiring a single numeric/symbolic solution.
  • Proof-Writing Tasks: Problems demanding multi-step, natural-language proofs.

Numerical-Answer Scoring:

For each problem $i$ and model $M$, four independent generations $\{\hat a_i^{(1)},\dots,\hat a_i^{(4)}\}$ are sampled. Pass@1 accuracy is computed as

$$\mathrm{acc}(M,C) = \frac{1}{N_C} \sum_{i=1}^{N_C} \mathbf{1}[\hat a_i^{(1)} = a_i^\star]$$

(exact Sympy equivalence), with reported scores averaged over the four runs.
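This metric is straightforward to implement. The sketch below substitutes a stdlib `fractions.Fraction` comparison for MathArena's actual Sympy equivalence check, so it only handles exact rational values and falls back to string matching otherwise:

```python
from fractions import Fraction

def answers_equal(pred: str, ref: str) -> bool:
    """Simplified equivalence check. MathArena uses a rule-based Sympy
    parser; this stdlib stand-in compares exact rational values."""
    try:
        return Fraction(pred) == Fraction(ref)
    except (ValueError, ZeroDivisionError):
        return pred.strip() == ref.strip()

def pass_at_1(preds: list[str], refs: list[str]) -> float:
    """acc(M, C): fraction of problems whose first-run answer
    matches the ground truth a_i* exactly."""
    assert len(preds) == len(refs) and refs
    hits = sum(answers_equal(p, r) for p, r in zip(preds, refs))
    return hits / len(refs)
```

Note that `Fraction("1/2") == Fraction("0.5")`, so equivalent numeric forms are credited even when surface strings differ.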

Proof-Writing Scoring:

Each proof $q_i$ is graded on a scale $[0, P_{\max}]$ (with $P_{\max}=7$ for USAMO 2025). The total proof score is

$$S_{\mathrm{proof}}(M) = \sum_{i=1}^{N_C} s_i(M), \qquad \mathrm{percent}(M) = \frac{S_{\mathrm{proof}}(M)}{N_C\, P_{\max}}$$

Grading is rubric-based, awarding for correct steps, logical coherence, and completeness.
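The aggregation itself is a one-liner; a sketch using the USAMO scale of 7 points per problem:

```python
P_MAX = 7  # points per problem on the USAMO rubric

def proof_scores(points: list[float], p_max: float = P_MAX) -> tuple[float, float]:
    """Return (S_proof(M), percent(M)): total rubric points awarded
    and the fraction of the N_C * P_max points available."""
    total = sum(points)
    return total, total / (len(points) * p_max)
```

For example, a model earning 10 of the 42 points available across the six USAMO 2025 problems scores roughly 23.8%, matching the scale of the results reported in Section 6.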

Statistical Analyses:

  • Accuracy variance per contest: $\mathrm{Var}(\hat{p}) = \frac{\hat{p}(1-\hat{p})}{N}$.
  • Rank confidence intervals via paired permutation test: $T = \sum_{i=1}^n \frac{x_i - y_i}{N_{c_i}}$, with $\mathrm{CI}_{95\%} = [\mathrm{rank}_{\min}, \mathrm{rank}_{\max}]$.
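Both statistics are simple to compute. The permutation test below uses the unweighted mean paired difference as its statistic—a simplification of the contest-size-weighted $T$ above, assuming equal contest sizes $N_{c_i}$:

```python
import random

def binomial_variance(p_hat: float, n: int) -> float:
    """Var(p-hat) = p-hat * (1 - p-hat) / n for a per-contest accuracy."""
    return p_hat * (1 - p_hat) / n

def paired_permutation_pvalue(x, y, n_resamples=10_000, seed=0):
    """Two-sided paired permutation test on per-problem scores of two
    models: randomly flip the sign of each paired difference and count
    how often the resampled statistic is at least as extreme."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x, y)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_resamples):
        s = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(s) >= observed:
            extreme += 1
    return extreme / n_resamples
```

Pairs of models whose p-value exceeds the significance level are treated as rank-indistinguishable, which is what widens $\mathrm{CI}_{95\%}$ to a rank interval rather than a single rank.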

5. Implementation Pipeline

The framework’s implementation encompasses:

  • Data Collection: Manual retrieval and normalization of competition problems, LaTeX standardization, CSV extraction of ground-truth answers, and quality assurance.
  • Model Interfacing: All models are accessed via public API, using provider-recommended hyperparameters, with no further tuning. Prompts are formatted to enforce answer/proof conventions.
  • Validation:
    • Answer-based grading is automated and relies on double-parsing (Sympy, Gemini-2.5-Flash), resorting to manual adjudication for flagged answers.
    • Proof-based grading is executed by two blinded experts per proof, following structured evaluation rubrics.

This bifurcation ensures scalability for objective answer tasks and maintains high fidelity for subjective, nuanced proof judgments.
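The double-parsing validation step can be sketched as a small arbitration function; `rule_check` and `llm_check` are hypothetical stand-ins for the Sympy parser and the Gemini-2.5-Flash judge:

```python
def grade_answer(pred: str, ref: str, rule_check, llm_check):
    """Double-parse validation: run the rule-based check and the LLM
    judge independently; agreement yields the verdict, disagreement
    returns None to flag the answer for manual adjudication."""
    rule_verdict = rule_check(pred, ref)
    judge_verdict = llm_check(pred, ref)
    if rule_verdict == judge_verdict:
        return rule_verdict
    return None  # flagged for human review
```

Routing only the disagreements to humans keeps the answer-grading path scalable while preserving a manual backstop for parser edge cases.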

6. Experimental Results and Performance Landscape

MathArena’s experimental campaign to date spans 143 problems (numerical) and 6 problems (proof-based). Key findings include:

Accuracy per contest:

| Model | AIME (%) | HMMT (%) | BRUMO (%) | SMT (%) | Avg (%) |
|---|---|---|---|---|---|
| o3 (high) | 89.17 | 77.50 | 95.83 | 87.74 | 87.56 |
| o4-mini (high) | 91.67 | 82.50 | 86.67 | 88.68 | 87.38 |
| Gemini-2.5-Pro | 87.50 | 82.50 | 90.00 | 84.91 | 86.23 |

Contamination was confirmed in AIME 2024, with model scores often 10–20 points above the human baseline, consistent with leakage. In contrast, SMT 2025 (competition released post-model) yielded top model performance near 88%, indicative of strong uncontaminated reasoning.

Proof-Writing Results (USAMO 2025):

| Model | Total (/42 pts) | % |
|---|---|---|
| Gemini-2.5-Pro | 10.1 | 24.0 |
| o3 | 9.2 | 21.9 |
| o4-mini | 8.1 | 19.3 |
| Human median | 15.0 | 35.7 |

The best LLMs attain below 25% of available proof points, notably lagging behind median human performance. This suggests that proof-writing—requiring symbolic manipulation, logical deduction, and narrative clarity—remains an active frontier for LLM evolution.

7. Outlook and Prospective Extensions

MathArena substantiates that real-time, uncontaminated competition-based evaluation yields reproducible, interpretable benchmarks for LLM mathematical progress. While final-answer task performance approaches saturation (88%+ on recent contests), proof-writing capabilities are both nascent and challenging.

Future roadmap items include:

  1. Expansion to additional advanced competitions (e.g. IMO 2025, Putnam 2025).
  2. Research into automated evaluation protocols for natural-language proofs, which may necessitate hybrid human/LLM grading.
  3. Sourcing or constructing more difficult final-answer tasks, given the predicted saturation by 2026.
  4. Formalization and refinement of contamination quantification via new metrics (e.g. a "contamination score" $C_{M,C}$).

MathArena provides a robust living platform for documenting and interrogating mathematical reasoning advances in LLMs, fully shielded from data leakage and equipped to capture both computational and discursive dimensions of mathematical expertise (Balunović et al., 29 May 2025).
