Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

Published 1 May 2026 in cs.CL | (2605.00674v1)

Abstract: LLMs are becoming increasingly capable mathematical collaborators, but static benchmarks are no longer sufficient for evaluating progress: they are often narrow in scope, quickly saturated, and rarely updated. This makes it hard to compare models reliably and track progress over time. Instead, we need evaluation platforms: continuously maintained systems that run, aggregate, and analyze evaluations across many benchmarks to give a comprehensive picture of model performance within a broad domain. In this work, we build on the original MathArena benchmark by substantially broadening its scope from final-answer olympiad problems to a continuously maintained evaluation platform for mathematical reasoning with LLMs. MathArena now covers a much wider range of tasks, including proof-based competitions, research-level arXiv problems, and formal proof generation in Lean. Additionally, we maintain a clear evaluation protocol for all models and regularly design new benchmarks as model capabilities improve to ensure that MathArena remains challenging. Notably, the strongest model, GPT-5.5, now reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions, showing that frontier models can now comfortably solve extremely challenging mathematical problems. This highlights the importance of continuously maintained evaluation platforms like MathArena to track the rapid progress of LLMs in mathematical reasoning.

Authors (7)

Summary

The paper introduces MathArena, a dynamic platform that updates LLM math evaluation tasks continuously, addressing the limitations of static benchmarks.
It details a robust methodology using item response theory-based imputation, multi-judge proof grading, and automated analysis for precise performance tracking.
It reveals significant performance gaps in proof validation and formalization, highlighting critical challenges and future research priorities in LLM mathematical reasoning.

MathArena: A Comprehensive Platform for LLM Mathematical Evaluation

Motivation and Framework for Evaluation Platforms

The rapid acceleration in LLM mathematical reasoning capabilities has rendered static benchmarks inadequate. Traditional benchmarks, typically released as fixed datasets with summary metrics, soon become uninformative as new LLM architectures saturate their test sets. Their narrow scope fails to capture the expanding set of real-world mathematical skills, and their lack of continuous maintenance limits their value for practitioners seeking up-to-date model comparisons. The MathArena platform addresses these shortcomings by establishing a dynamic, transparent, and adaptable framework for evaluating mathematical reasoning in LLMs.

MathArena is positioned as an evaluation platform, distinguished from benchmarks by three main properties: 1) continuously incorporating new task types as models improve, 2) regularly evaluating state-of-the-art models with consistent protocols, and 3) providing an open interface for granular and aggregate performance data, model outputs, and cost metrics.

Platform Design and Implementation

MathArena's evolution as a platform is evident in its support for longitudinal tracking of model performance, benchmark curation, and in-depth analysis tools for failure cases. As new competitions emerge and LLM capabilities shift, MathArena deprecates uninformative tasks and incorporates novel challenges across the spectrum of mathematical problem solving, from school-level competitions to proofs extracted from contemporary arXiv publications.

The platform maintains a continuous cycle of benchmark design, automated and human-centric evaluation, and interface improvements, illustrated by the timeline of task and feature additions, empirical analyses, and the introduction of research-centric capabilities.

Figure 2: The MathArena homepage surfaces aggregate and per-problem model performance with live leaderboard updates and robust filtering.

Coverage and Benchmark Categories

MathArena covers an expanded range of mathematical tasks, grouped into three major categories:

Final-Answer Benchmarks: Include school and olympiad-level problems (AIME, HMMT), curated hard sets (Apex), and visual reasoning tasks (Kangaroo). Automated answer extraction makes grading reliable and scalable. Benchmark updates track recent competitions, and visual benchmarks stress-test VLM capabilities.
Proof-Based Benchmarks: Include undergraduate and olympiad proof competitions (Putnam, Miklós Schweitzer, USAMO, IMC). The platform collaborates with organizing committees for official grading or deploys rigorous LLM and human jury pipelines for evaluation. Extensive rubric engineering and reconciliation mitigate bias in LLM judging.
Research-Level Benchmarks: Extract both final-answer and proof construction tasks from recent arXiv papers (ArXivMath, BrokenArXiv, ArXivLean). Problems undergo automated filtering for self-containedness and relevance, with manual review to enforce quality. For Lean proofs, formalization and semantic faithfulness are checked using both LLMs and specialized tools (Comparator), and models are required to provide machine-verifiable solutions.
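
The automated answer extraction mentioned for final-answer benchmarks can be illustrated with a minimal sketch. It assumes that models are asked to put their final answer in `\boxed{...}` and that answers are simple numbers; both are assumptions for illustration, and the function names `extract_boxed` and `is_correct` are hypothetical rather than part of MathArena's parser.

```python
from fractions import Fraction
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Braces are counted so that nested LaTeX such as \\frac{3}{4} survives.
    """
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed


def is_correct(response: str, gold: str) -> bool:
    """Compare the extracted answer with the reference answer."""
    answer = extract_boxed(response)
    if answer is None:
        return False
    try:
        # Numeric comparison, so "0.5" and "1/2" count as the same answer.
        return Fraction(answer) == Fraction(gold)
    except (ValueError, ZeroDivisionError):
        # Fall back to a whitespace-insensitive string comparison.
        return answer.replace(" ", "") == gold.replace(" ", "")
```

A production parser would also need to normalize LaTeX expressions (fractions, radicals, tuples); this sketch only shows the extract-then-compare shape of the step.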
Figure 4: The benchmark index offers traceability and access to scores, datasets, and benchmark provenance.

Evaluation Protocols and Aggregate Metrics

To balance coverage and cost, MathArena does not run every model on every benchmark. Instead, it uses item response theory-based imputation (following Ho et al., 28 Nov 2025) to predict missing model-benchmark entries, which enables consistent overall rankings. Evaluation protocols are standardized: automatic answer parsing for final-answer tasks, multi-judge and rubric-based grading for proofs, and LLM-supported evaluation for reliability and formal proof tasks. Where automation is insufficient, expert human judgment is used.
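
To make the imputation idea concrete, here is a minimal sketch using a one-parameter (Rasch) item response model: each model gets an ability, each benchmark a difficulty, and the predicted score is the logistic function of their difference. The choice of the Rasch model, the gradient-descent fit, the function names, and the score matrix are all assumptions for illustration; the paper's actual procedure follows Ho et al. and may differ.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(scores, n_steps=5000, lr=0.1):
    """Fit per-model ability and per-benchmark difficulty on observed cells.

    scores[i][j] is an accuracy in [0, 1], or None if model i was not run
    on benchmark j.
    """
    n_models, n_benchmarks = len(scores), len(scores[0])
    ability = [0.0] * n_models
    difficulty = [0.0] * n_benchmarks
    for _ in range(n_steps):
        grad_a = [0.0] * n_models
        grad_d = [0.0] * n_benchmarks
        for i in range(n_models):
            for j in range(n_benchmarks):
                if scores[i][j] is None:
                    continue
                # Gradient of the cross-entropy loss w.r.t. the logit.
                resid = sigmoid(ability[i] - difficulty[j]) - scores[i][j]
                grad_a[i] += resid
                grad_d[j] -= resid
        ability = [a - lr * g for a, g in zip(ability, grad_a)]
        difficulty = [d - lr * g for d, g in zip(difficulty, grad_d)]
        # Pin the mean difficulty at zero; the logit is shift-invariant.
        shift = sum(difficulty) / n_benchmarks
        ability = [a - shift for a in ability]
        difficulty = [d - shift for d in difficulty]
    return ability, difficulty


def impute(scores, ability, difficulty):
    """Keep observed scores and fill missing cells with model predictions."""
    return [
        [s if s is not None else sigmoid(ability[i] - difficulty[j])
         for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]


# Hypothetical score matrix (rows: models, columns: benchmarks).
scores = [
    [0.9, 0.7, 0.5],
    [0.8, 0.5, None],  # this model was never run on the third benchmark
    [0.6, 0.3, 0.1],
]
ability, difficulty = fit_rasch(scores)
filled = impute(scores, ability, difficulty)
overall = [sum(row) / len(row) for row in filled]
```

Averaging the filled matrix row by row gives every model an overall score on the same set of benchmarks, which is what makes rankings comparable when evaluation coverage is incomplete.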

The interval estimates are validated empirically: the protocols adopted for estimating aggregate scores and their statistical uncertainty do not introduce significant bias (Figure 6).

Figure 6: Empirical calibration demonstrates that the MathArena interval estimation method achieves near-ideal coverage.
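
An empirical calibration check of this kind can be sketched as a simulation: draw many synthetic evaluations from a known accuracy, build an interval for each, and count how often the interval contains the true value. The Wilson score interval used here is a stand-in chosen for illustration, since this summary does not specify MathArena's interval method, and the parameter values are arbitrary.

```python
import math
import random


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    phat = successes / n
    denom = 1.0 + z * z / n
    center = (phat + z * z / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def empirical_coverage(true_p: float, n: int, trials: int, seed: int = 0) -> float:
    """Fraction of simulated evaluations whose interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One simulated benchmark run: n problems, each solved with prob true_p.
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = wilson_interval(successes, n)
        hits += low <= true_p <= high
    return hits / trials


coverage = empirical_coverage(true_p=0.7, n=30, trials=2000)
```

A well-calibrated method yields an empirical coverage close to the nominal level (here roughly 95%); coverage far below it would indicate intervals that are too narrow.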

Results and Observations

Numerical Results and Robust Claims

MathArena reveals strong stratification of model performance. As of the current evaluation, frontier models such as GPT-5.5 achieve 98% on the 2026 USAMO and 74% on research-level final-answer tasks, indicating that competitive LLMs can reliably match or surpass elite human performance on the majority of olympiad and undergraduate benchmarks. Open models, however, lag by up to 20% on challenging benchmarks, and the gap between closed and open models is widest on proof and research tasks.

Figure 5: Cost-performance tradeoffs across top models show that the best-performing closed models remain substantially ahead of open competitors.

Model rankings are highly stable under benchmark ablations (Figure 1): omitting individual benchmark families changes very few of the top ranks.

Figure 1: Removing benchmark families has minimal impact on overall model ranking; top model ordering is robust.
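
A leave-one-family-out ablation of this kind can be sketched as follows. The model names, family names, and scores are invented for illustration, and ranking by the plain mean over families is an assumption rather than MathArena's actual aggregate.

```python
from statistics import mean

# Hypothetical per-family scores, invented for illustration only.
scores = {
    "model_a": {"final_answer": 0.95, "proof": 0.90, "research": 0.70},
    "model_b": {"final_answer": 0.90, "proof": 0.80, "research": 0.55},
    "model_c": {"final_answer": 0.85, "proof": 0.60, "research": 0.40},
}


def ranking(scores, omit=None):
    """Rank models by mean score over all families except `omit`."""
    averages = {
        model: mean(v for family, v in per_family.items() if family != omit)
        for model, per_family in scores.items()
    }
    return sorted(averages, key=averages.get, reverse=True)


def ablation_rankings(scores):
    """Recompute the ranking once per omitted benchmark family."""
    families = sorted({f for per_family in scores.values() for f in per_family})
    return {family: ranking(scores, omit=family) for family in families}


full = ranking(scores)
ablated = ablation_rankings(scores)
# True when no single family changes the ordering.
stable = all(order == full for order in ablated.values())
```

Comparing each ablated ordering against the full one shows whether any single benchmark family is driving the leaderboard.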

Proof quality analysis demonstrates that top models produce proofs that are not only more often correct but also more readable and structurally sound. Grading of USAMO proofs by multi-model LLM juries aligns closely with human expert judgment: the strongest grader (GPT-5.4) matches expert scores in all cases, while weaker graders (Gemini, Claude) overestimate correctness.
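
One simple way to quantify how closely an LLM grader tracks human experts is to report the exact-match rate together with the mean signed error, where a positive value means the grader is too generous. The sketch below assumes the usual 0 to 7 points per olympiad problem; the grader names and all scores are hypothetical and are not results from the paper.

```python
from statistics import mean

# Hypothetical 0-7 scores for five proofs, invented for illustration only.
human = [7, 0, 3, 7, 1]
graders = {
    "grader_x": [7, 0, 3, 7, 1],  # agrees with the human scores everywhere
    "grader_y": [7, 2, 5, 7, 1],  # awards extra points on two proofs
}


def agreement(llm_scores, human_scores):
    """Return (exact-match rate, mean signed error) against human scores."""
    pairs = list(zip(llm_scores, human_scores))
    exact = mean(1.0 if a == b else 0.0 for a, b in pairs)
    bias = mean(a - b for a, b in pairs)  # > 0 means overestimation
    return exact, bias


report = {name: agreement(s, human) for name, s in graders.items()}
```

Keeping the sign of the error, rather than only its magnitude, is what distinguishes a grader that overestimates correctness from one that is merely noisy.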

MathArena's research-reliability benchmarks uncover a critical failure mode: in the BrokenArXiv setting, every model except the top one (GPT-5.5, 72%) consistently "proves" false statements. This shows directly that confirmation bias and unreliability remain unresolved obstacles to deploying models in research.

Formal Lean proof tasks (ArXivLean) remain essentially unsolved, with top models solving only 17% of problems, reflecting both the depth and the autoformalization challenges at the research frontier.
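
For readers unfamiliar with Lean, the toy example below shows what a machine-verifiable solution looks like: a formal statement together with a proof term that the Lean 4 kernel checks. It is a deliberately trivial illustration and not an ArXivLean problem; the theorem name is arbitrary.

```lean
-- A formal statement and a proof that Lean's kernel can verify.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

A submission counts only if the whole file type-checks without placeholders such as `sorry`, which Lean flags with a warning. Research-level statements are far harder because the definitions involved must first be formalized faithfully.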

Platform Interface and Analytic Tools

MathArena's public interface supports a range of analytic workflows. Users can transition from live leaderboard summaries to inspection of individual benchmarks, side-by-side model comparisons, drilldowns into per-problem runs, and isolation of surprising failure traces.

Figure 3: The detailed leaderboard reveals additional metadata crucial for evaluating the practical model impact, including retry counts and openness.

Figure 7: The model comparison interface facilitates nuanced analysis of cost-accuracy tradeoffs between two selected models.

Figure 10: Per-problem trace inspection enables granular identification of reasoning failures and model-specific failure modes.

Figure 8: The interface for "surprising traces" highlights highly anomalous model failures warranting further investigation.

Limitations and Future Directions

Despite covering a broad slice of mathematical practice, MathArena's current scope excludes several vital dimensions: interactive workflows, higher-level mathematical creativity such as conjecture formulation, and research tool integration. Synergy with agentic, tool-augmented systems and more sophisticated multi-turn evaluation protocols are identified as key directions for future iterations. The platform also restricts tool access for research benchmarks due to contamination concerns, limiting direct real-world parity.

The reliability benchmarks, while carefully constructed, only stress certain confirmation biases and do not substitute for full correctness verification in mathematics research—a recognized open challenge in the field.

Theoretical and Practical Implications

MathArena demonstrates that continuous, platform-based evaluation architectures are required to keep pace with both the breadth and quality of LLM mathematical reasoning. The explicit exposure of LLM brittleness on reliability tasks, robust demonstration of proof-writing gains, and the persistent gap in formal theorem proving highlight targeted research priorities. As LLMs are further integrated into mathematical practice and research, platforms like MathArena provide the necessary scaffolding for meaningful, reproducible, and transparent assessment.

From a broader perspective, MathArena's paradigm may generalize to other complex cognitive domains, where skill sets and tasks are rapidly evolving and static leaderboards fail to capture real-world progress or reliability risks.

Conclusion

"Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs" (2605.00674) firmly establishes the essential role of open, dynamic platforms in mathematical AI evaluation. By continuously aggregating and analyzing diverse benchmarks with robust protocols and a transparent interface, MathArena enables precise tracking of LLM mathematical progress, supports nuanced error analysis, and reveals unresolved reliability and formalization bottlenecks. Its design and results inform not only the trajectory for future mathematical LLM development but also the broader methodology for AI capability assessment.
