MathArena AIME-2024 Benchmark
- MathArena AIME-2024 is a benchmark that uses newly released AIME problems to assess advanced LLM mathematical reasoning.
- It employs automated scraping, LaTeX parsing, semantic equivalence testing, and permutation tests to ensure data integrity.
- The evaluation reveals performance inflation in top LLMs and introduces approaches like LeaP to mitigate reasoning deficits.
MathArena AIME-2024 designates both a specific instantiation of the broader MathArena benchmarking framework and the use of the American Invitational Mathematics Examination (AIME) 2024 contest problems for the real-time, large-scale evaluation of mathematical reasoning capabilities in advanced LLMs. It establishes rigorous standards for dataset provenance, model evaluation, contamination analysis, and aggregation of quantitative and qualitative results, thereby providing a data-rich lens on both model progress and benchmarking limitations in automated mathematical reasoning.
1. MathArena Framework and Real-Time Benchmarking
MathArena is defined as a benchmark built on newly released math competition problems, systematically addressing the prevalent issue of dataset contamination in LLM training and evaluation. The methodology consists of scraping and LaTeX-parsing competition problems immediately after release, dispatching four independent queries per question to each model, parsing the boxed numerical answers, and testing semantic equivalence via Sympy, supplemented by human-and-LLM judge loops for parser edge cases (Balunović et al., 29 May 2025). For contests evaluated at release (such as AIME 2025), this protocol guarantees that no model had prior access to the test set; AIME 2024 is run through the same pipeline, enabling the contamination comparison discussed in Section 3.
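The answer-checking step can be illustrated with a minimal sketch, assuming a response contains a `\boxed{...}` value that is compared against the ground truth via Sympy; the regex and function names below are illustrative, not MathArena's actual implementation.

```python
# Hypothetical sketch of boxed-answer extraction plus Sympy equivalence checking.
import re
import sympy as sp

def extract_boxed(model_output: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a model response."""
    matches = re.findall(r"\\boxed\{([^{}]+)\}", model_output)
    return matches[-1].strip() if matches else None

def answers_equivalent(candidate: str, ground_truth: str) -> bool:
    """Treat answers as equivalent if their symbolic difference simplifies to zero."""
    try:
        return sp.simplify(sp.sympify(candidate) - sp.sympify(ground_truth)) == 0
    except (sp.SympifyError, TypeError):
        # Parser edge cases are deferred to the human/LLM judge loop in the protocol.
        return False

response = r"... therefore the requested sum is \boxed{42}."
print(answers_equivalent(extract_boxed(response), "42"))  # True
```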
Ground-truth answers are verified for consistency, and pass@1 accuracy is estimated for each participating model per problem domain and over the entire set, with 95% confidence intervals derived from the Bernoulli variance and paired permutation tests used to compare model ranks. This statistical apparatus is essential given the small number of problems per contest.
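As a concrete illustration of the aggregation, the sketch below pools 0/1 outcomes over problems and repetitions and attaches a 95% normal-approximation interval based on the Bernoulli variance; pooling all trials into a single binomial estimate is a simplifying assumption, not the paper's exact estimator.

```python
# Sketch: pass@1 estimate with a 95% CI from the Bernoulli (binomial) variance.
import math

def pass_at_1_with_ci(outcomes: list[int], z: float = 1.96) -> tuple[float, float]:
    """outcomes: per-attempt 0/1 correctness pooled over problems and runs."""
    n = len(outcomes)
    p = sum(outcomes) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, half_width

# e.g. 30 problems x 4 runs = 120 Bernoulli trials
p, hw = pass_at_1_with_ci([1] * 105 + [0] * 15)
print(f"pass@1 = {p:.3f} ± {hw:.3f}")  # 0.875 ± 0.059
```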
2. Structure and Content of the AIME 2024 Problem Set
AIME 2024 (contests I and II) comprises 30 questions, each requiring an integer answer between $0$ and $999$, distributed across four classic mathematical domains:
- Algebra (9 problems)
- Combinatorics (9 problems)
- Geometry (8 problems)
- Number Theory (6 problems)
Problem statements are presented in LaTeX and exhibit high analytic and computational demands, typified by questions on divisibility and modular constraints (e.g., identifying divisors of $9!$ satisfying a given congruence condition) and on geometric construction (e.g., computing lengths related to the incenter of a triangle with given side lengths) (Balunović et al., 29 May 2025).
3. Contamination Analysis and Statistical Inflation
Substantial performance inflation on AIME 2024 is detected by comparing each model's results with its performance on the uncontaminated AIME 2025 set. Defining the inflation as $\Delta = \mathrm{acc}_{2024} - \mathrm{acc}_{2025}$, where $\mathrm{acc}$ denotes pass@1 accuracy over the 30 questions of a contest, the top 12 models exhibit a substantial positive average $\Delta$ (Balunović et al., 29 May 2025). Nearly all top models surpass the top human quantile for 2024 by $10$–$20$ percentage points, whereas their 2025 scores align with human expectations. Paired permutation tests confirm that the gap is statistically significant for 10 of the 12 models, indicating that AIME 2024 was present in the pretraining or fine-tuning pipelines of most competitive LLMs. Use of AIME 2024 as a retrospective benchmark for mathematical reasoning is therefore unreliable.
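The significance of a model's 2024-versus-2025 gap can be sketched as a permutation test on per-problem outcomes, shuffling contest labels under the null hypothesis of equal accuracy; this label-shuffling scheme and the hypothetical outcome vectors are illustrative simplifications, not the exact paired procedure of Balunović et al.

```python
# Illustrative permutation test for the accuracy gap Delta = acc_2024 - acc_2025.
import random

def permutation_p_value(acc_2024: list[int], acc_2025: list[int],
                        n_perm: int = 10_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    n24 = len(acc_2024)
    observed = sum(acc_2024) / n24 - sum(acc_2025) / len(acc_2025)
    pooled = acc_2024 + acc_2025
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # exchange contest labels under the null
        delta = sum(pooled[:n24]) / n24 - sum(pooled[n24:]) / (len(pooled) - n24)
        hits += delta >= observed
    return (hits + 1) / (n_perm + 1)  # one-sided p-value

# Hypothetical per-problem outcomes for one model (30 problems per contest)
print(permutation_p_value([1] * 27 + [0] * 3, [1] * 20 + [0] * 10))
```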
4. Model Results and Performance Differentiation
A comprehensive evaluation of 30 models—encompassing both closed-source systems (e.g., o3-high, o4-mini-high, Gemini-2.5-Pro) and Pareto-front open LLMs (e.g., Qwen3-235B-A22B, DeepSeek-R1)—is performed. Representative Pass@1 accuracies for top-performing models are summarized below:
| Model | Overall Acc | Algebra | Comb. | Geo. | NT |
|---|---|---|---|---|---|
| o4-mini (high) | 91.67% | 94.1% | 84.3% | 81.5% | 92.0% |
| o3 (high) | 89.17% | 92.0% | 83.1% | 84.5% | 92.0% |
| Gemini-2.5-Pro | 87.50% | 96.3% | 76.2% | 81.0% | 90.9% |
| o3-mini (high) | 86.67% | 84.4% | 72.4% | 71.1% | 80.0% |
| o4-mini (medium) | 84.17% | 85.1% | 71.5% | 73.8% | 88.6% |
Each score is the average of four independent runs and is reported with a 95% confidence interval (Balunović et al., 29 May 2025). The observed model ranking is statistically robust, but contamination-driven inflation means these scores are not representative of pure reasoning capability.
5. Error Modes and Solution Analysis
Qualitative investigation reveals the spectrum of model proficiency in mathematical reasoning. Exemplary correct reasoning (o3-high) incorporates multiplicative structure analysis, divisor constraints, and stepwise modular deduction, producing the correct final answer (e.g., on Problem A1). In contrast, lower-fidelity outputs (e.g., Qwen3-30B-A3B) exhibit surface pattern matching, mis-enumeration, and misinterpretation of the modular condition, producing an incorrect answer whose enumeration includes non-divisors such as $11$ and $111$ (Balunović et al., 29 May 2025).
This dichotomy highlights persistent failure modes: incorrect problem parsing, superficial analogical reasoning, and a lack of mathematical rigor in handling modular arithmetic and combinatorial enumeration.
6. Mitigation of Reasoning Deficits: The LeaP Architecture
LeaP (Learning from Peers) introduces a collaborative approach to chain-of-thought reasoning in large reasoning models (LRMs), aimed at overcoming the "Prefix Dominance Trap", a phenomenon in which a reasoning trajectory initialized with a poor prefix (e.g., the opening tokens of an incorrect solution attempt) rarely recovers, leading to a marked reduction in Pass@1 accuracy on AIME 2024 (Luo et al., 12 May 2025).
LeaP interleaves peer-communication blocks at a fixed token interval across parallel reasoning paths, each block consisting of a Summarization stage and a Routing stage. Summaries are capped at $256$ tokens and elicited via summary trigger templates. Peer selection is based on normalized Levenshtein similarity between summaries, $\operatorname{sim}(s_i, s_j) = 1 - \operatorname{Lev}(s_i, s_j)/\max(|s_i|, |s_j|)$, where $\operatorname{Lev}$ denotes edit distance.
Routed peer summaries are appended to each path's context, so that subsequent tokens are sampled from a distribution conditioned on the peers' intermediate conclusions, $x_t^{(i)} \sim p_\theta(\cdot \mid x_{<t}^{(i)}, S_i)$, where $x_{<t}^{(i)}$ is path $i$'s own prefix and $S_i$ is its set of received peer summaries.
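The routing rule above can be made concrete with a short sketch: compute normalized Levenshtein similarity between path summaries and, in the dispersed variant, send each path the k most dissimilar peers. The helper functions and the plain-Python edit distance are assumptions consistent with the description, not the reference implementation.

```python
# Sketch of LeaP-style dispersed routing over peer summaries.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def dispersed_route(summaries: list[str], k: int = 4) -> list[list[int]]:
    """For each path, return indices of the k least similar peer summaries."""
    routes = []
    for i, s in enumerate(summaries):
        peers = sorted((similarity(s, t), j) for j, t in enumerate(summaries) if j != i)
        routes.append([j for _, j in peers[:k]])  # most dissimilar peers first
    return routes

summaries = ["Check residues of divisors of 9! modulo 10.",
             "Enumerate candidate divisors directly.",
             "Use the prime factorization of 9!."]
print(dispersed_route(summaries, k=1))  # index of the most dissimilar peer per path
```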
Fine-tuning smaller models to obtain the LeaP-T series (7B) leverages approximately 1,000 AIME problems (1984–2023), with training traces filtered for correct final answers and adequate summary length. Auxiliary losses are included for generating the summary and reflection templates.
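A hypothetical filter for this data-construction step is sketched below; the field names, the word-count proxy for token length, and the minimum-length threshold are assumptions for illustration only.

```python
# Illustrative filtering of candidate LeaP-T fine-tuning traces.

def keep_trace(trace: dict, min_words: int = 8, max_words: int = 256) -> bool:
    """Keep traces with a correct final answer and adequately sized summaries."""
    if trace["final_answer"] != trace["ground_truth"]:
        return False
    word_counts = [len(s.split()) for s in trace["summaries"]]
    return all(min_words <= n <= max_words for n in word_counts)

raw_traces = [
    {"final_answer": "42", "ground_truth": "42",
     "summaries": ["Factored the target, then kept only divisors meeting the stated condition."]},
    {"final_answer": "17", "ground_truth": "42",
     "summaries": ["Guessed from small cases."]},
]
training_set = [t for t in raw_traces if keep_trace(t)]
print(len(training_set))  # 1: the incorrect trace is discarded
```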
LeaP demonstrates empirical improvements:
| Model | Baseline Pass@1 (%) | +LeaP (Dispersed Top-4) (%) | Δ (pp) |
|---|---|---|---|
| DeepSeek-7B | 51.35 | 60.52 | +9.17 |
| DeepSeek-14B | 64.47 | 77.29 | +12.82 |
| QwQ-32B | 79.69 | 85.83 | +6.14 |
| LeaP-T-7B | — | 64.38 | — |
LeaP's multi-path, peer-reflective protocol substantially narrows the Prefix Dominance gap, yields absolute gains of roughly +5 to +13 points on AIME and GPQA, and remains robust even when a sizable fraction of paths begin from bad initializations (Luo et al., 12 May 2025).
7. Lessons, Outlook, and Implications for Automated Mathematical Reasoning
Critical analysis of MathArena AIME-2024 leads to two central conclusions:
- Traditional static benchmarks (GSM8K, MATH, AIME) are "almost certainly contaminated," leading to reported accuracy inflations on the order of $10$–$20$ percentage points and overstating LLM capabilities for mathematical reasoning.
- MathArena’s protocol of evaluating on newly released, uncontaminated competition problems ensures a forward-looking and reliable assessment. AIME 2025 scores are consistent with human expert quantiles, reinforcing methodological rigor (Balunović et al., 29 May 2025).
A plausible implication is that, as LLMs improve, MathArena will continue to add new competitions in real time, tracking genuine progress across increasingly complex domains such as university-level contests (HMMT, SMT) and proof-based olympiads (USAMO), where even top models currently score far below their AIME results.
LeaP provides a pathway to robust error correction, collective reasoning, and earlier consensus formation by enabling cross-path interaction and peer reasoning synthesis, marking a methodological milestone for LRMs (Luo et al., 12 May 2025).
MathArena AIME-2024 thus represents both a benchmark and a methodological archetype for rigorous, real-time, and contamination-free evaluation of mathematical cognition in LLM systems.