
AIME Benchmarks: Math & AI Evaluation

Updated 16 December 2025
  • AIME Benchmarks are standardized evaluation tasks that use AIME-style problems to test mathematical reasoning and AI performance with precise numeric responses.
  • They employ strict protocols including multi-seed evaluation, standardized hardware/software, and exact-match accuracy (Pass@1) to ensure statistical rigor.
  • Empirical results show that supervised fine-tuning and multi-LLM evaluator frameworks significantly boost performance, emphasizing robust and reproducible metrics.

AIME Benchmarks refer to a family of challenging mathematical reasoning and evaluation tasks inspired by the American Invitational Mathematics Examination (AIME), as well as to distinct protocols and methods (sometimes sharing the "AIME" acronym) used in broader AI evaluation contexts. In mathematical LLM evaluation, AIME-style benchmarks serve as canonical tests of upper-secondary-level problem solving with integer or short-form numeric answers. In agentic AI and code generation, "AIME" designates protocols such as the multi-LLM evaluator framework rather than a data source. The following sections outline the landscape, key protocols, pitfalls, and research insights around AIME-style evaluations and applications.

1. Canonical AIME-Like Mathematical Reasoning Benchmarks

AIME-style math benchmarks center on problems from the American Invitational Mathematics Examination, an annual North American contest of 15 integer-answer problems, administered in two variants (A and B). Standard AIME-derived evaluations utilize 30 problems covering algebra, geometry, number theory, and combinatorics, with each answer constrained to the range 0–999. The primary dataset for model assessment is the AIMO Validation AIME set (“AI-MO/aimo-validation-aime”), as reflected in the AIME’24 benchmark, which follows the distribution of roughly 10 algebra, 8 geometry, 6 number theory, and 6 combinatorics problems (Hochlehnert et al., 9 Apr 2025).
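As a concrete starting point, a minimal sketch of loading this validation set with the Hugging Face `datasets` library; the split name and the column names (`problem`, `answer`) are assumptions about the published schema rather than documented guarantees:

```python
from datasets import load_dataset

# Dataset id as cited above; the split and column names ("problem", "answer")
# are assumptions about the schema and may differ between releases.
aime24 = load_dataset("AI-MO/aimo-validation-aime", split="train")

for row in aime24.select(range(3)):
    print(row["problem"][:80], "->", row["answer"])
```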

Several companion and extended AIME-level resources exist:

  • OlymMATH-EASY subset: 100 problems at AIME difficulty, manually curated and screened for data contamination, with parallel English/Chinese presentation and automated answer verification using symbolic evaluation (Sun et al., 27 Mar 2025).
  • MathArena AIME-2024: 30 problems from the 2024 AIME, used for live, uncontaminated leaderboard evaluation and statistical comparison to human and model performance (Balunović et al., 29 May 2025).

These benchmarks emphasize rigorous formatting (exact integers, LaTeX normalization, box delimiters), and are foundational to the assessment of medium-difficulty mathematical reasoning in current LLMs. Problems are selected to avoid data contamination, especially in OlymMATH and MathArena, by leveraging original print sources and timeline-locked evaluation (Sun et al., 27 Mar 2025, Balunović et al., 29 May 2025).

2. Evaluation Protocols and Metrics

The dominant evaluation metric for AIME-style datasets is exact-match accuracy, or Pass@1, computed as

$$\mathrm{Pass}@1 = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left(\mathrm{pred}_i = \mathrm{gold}_i\right)$$

where a match is determined by robust parsing, normalization, and symbolic/numeric equivalence. For settings that sample multiple solutions per problem, Pass@k can be reported but is rarely used. Notably, a single additional correct answer shifts Pass@1 by approximately 3.3 percentage points when N = 30 (Hochlehnert et al., 9 Apr 2025).
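A minimal sketch of exact-match Pass@1 with simple answer normalization; the normalization rules (boxed-answer extraction, integer canonicalization) are illustrative and not the exact harness logic of the cited papers:

```python
import re

def normalize(ans: str) -> str:
    """Strip \\boxed{...} delimiters, commas, and leading zeros from an AIME-style answer."""
    ans = ans.strip()
    m = re.search(r"\\boxed\{([^}]*)\}", ans)
    if m:
        ans = m.group(1)
    ans = ans.replace(",", "").strip()
    # AIME answers are integers in 0-999; otherwise fall back to the raw string.
    return str(int(ans)) if ans.lstrip("-").isdigit() else ans

def pass_at_1(preds: list[str], golds: list[str]) -> float:
    """Exact-match accuracy over N problems."""
    assert len(preds) == len(golds)
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, golds))
    return hits / len(golds)

# With N = 30, each additional correct answer moves Pass@1 by 1/30, i.e. ~3.3 pp.
print(pass_at_1([r"\boxed{042}", "7"], ["42", "8"]))  # 0.5
```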

To ensure statistical rigor and eliminate spurious improvements, state-of-the-art protocols enforce:

  • Multi-seed evaluation: Report mean ± standard deviation over at least 10 different random seeds (a reporting sketch follows this list).
  • Standardized hardware/software environments: Dockerized runs on fixed-node A100 GPUs with frozen versions of the evaluation stack, e.g., LightEval v0.8.1 with a vLLM backend.
  • Prompt standardization: Instruction-tuned models use their native chat wrappers plus explicit math prompts; base models are tested with minimal or no prompt templates.
  • Hyperparameter tuning and reporting: Decoding temperature and top-p sampling are model-tuned (e.g., temperature=0.8, top_p=0.9) and fixed across tasks (Hochlehnert et al., 9 Apr 2025).
  • Statistical uncertainty reporting: Single-seed “gains” <7 pp are typically not statistically significant.
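A hedged sketch of the multi-seed reporting convention; `run_eval` stands in for a full Dockerized LightEval + vLLM run and is a hypothetical callable, not part of any specific harness:

```python
import statistics
from typing import Callable

def report_multiseed(run_eval: Callable[[int], float], n_seeds: int = 10) -> str:
    """Aggregate Pass@1 over independent seeds and report mean ± standard deviation."""
    scores = [run_eval(seed) for seed in range(n_seeds)]
    mean, std = statistics.mean(scores), statistics.stdev(scores)
    return f"Pass@1 = {100 * mean:.1f} ± {100 * std:.1f} (n={n_seeds} seeds)"

# Dummy evaluator for illustration only; a real run would launch the frozen
# evaluation stack with fixed temperature/top_p and the given seed.
print(report_multiseed(lambda seed: 0.30 + 0.01 * (seed % 3)))
```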

MathArena extends statistical controls by reporting standard error, 95% confidence intervals, and performing paired permutation tests to establish ranking significance (Balunović et al., 29 May 2025).
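A minimal sketch of a paired (sign-flip) permutation test over per-problem correctness, in the spirit of the MathArena ranking comparisons; this is an illustration, not MathArena's implementation:

```python
import random

def paired_permutation_test(a: list[int], b: list[int], n_perm: int = 10_000) -> float:
    """Two-sided p-value for the mean correctness difference between two models.

    a[i] and b[i] are 0/1 correctness of the two models on problem i; random
    sign flips of the per-problem differences simulate the null hypothesis
    that the models are interchangeable.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        permuted = sum(d if random.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            hits += 1
    return hits / n_perm
```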

3. Baselines and Model Performance on AIME Benchmarks

AIME’24 and companion datasets serve as primary leaderboards for comparing LLM performance on structured mathematical reasoning.

Summary of results (AIME’24, mean ± std over 10 seeds) (Hochlehnert et al., 9 Apr 2025):

| Model Class | Representative Model | Method | AIME’24 Pass@1 (%) |
|---|---|---|---|
| Zero-shot base | Qwen2.5-Math-1.5B | ZS | 11.3 ± 3.6 |
| Few-shot (5-shot) | Qwen2.5-Math-1.5B | FS | ~15 ± 3 |
| Supervised Finetuned | DeepSeek-R1-Distill-1.5B | SFT | 28.7 ± 4.8 |
| Reinforcement Learning | DeepScaleR (on R1-Distill) | RL | 37.0 ± 6.6 |

Supervised finetuning (SFT) robustly outperforms zero/few-shot prompting and most RL-based approaches; RL “gains” are typically modest and within or just above statistical noise. On OlymMATH-EASY, top models (o3-mini, DeepSeek-R1) achieve 80–89.7% accuracy (Sun et al., 27 Mar 2025). On MathArena’s AIME-2024, o4-mini reaches 91.7% (27.5/30) and o3 (“high”) 89.2%, with confidence intervals provided per model (Balunović et al., 29 May 2025).

Test-time data augmentation strategies such as Prompting Test-Time Scaling (P-TTS) further boost performance; for example, a 32B model fine-tuned with P-TTS (the 900-example “Full-P-TTS” setting) achieves 73.33% (AIME’24) and 53.33% (AIME’25), exceeding the S1 and S1.1 baselines by +16–26 percentage points (Bsharat et al., 10 Oct 2025).

4. Sensitivity Analyses, Pitfalls, and Best Practices

Extensive sensitivity analyses reveal extreme dependence of AIME scores on minor implementation details (Hochlehnert et al., 9 Apr 2025):

  • Random seed variance: Standard deviation in Pass@1 can reach 7–15 percentage points, especially in small models.
  • Decoding hyperparameters: Varying temperature or top_p can shift scores by up to 15 or 8 points, respectively. Best performance occurs at higher temperature but with increased instability.
  • Prompt format: Instruction content and template selection can introduce multi-point swings.
  • Hardware/software stack: Evaluating the same model with different backends or on different GPU clusters causes up to 8-point differences.
  • Statistical significance: Most reported gains from RL or minor interventions are indistinguishable from noise, requiring careful t-testing across seeds.
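For the seed-level significance check mentioned above, a short sketch using Welch's t-test on per-seed Pass@1 scores (scipy-based and purely illustrative):

```python
from scipy import stats

def seed_gain_pvalue(baseline_scores: list[float], new_scores: list[float]) -> float:
    """Welch's t-test on per-seed Pass@1 scores; returns the two-sided p-value.

    A p-value above the chosen threshold (e.g., 0.05) suggests the apparent
    gain is within seed-to-seed noise.
    """
    _, p = stats.ttest_ind(baseline_scores, new_scores, equal_var=False)
    return float(p)
```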

Best-practice recommendations include:

  • Use at least 10 seeds for small-N benchmarks.
  • Standardize and report all hyperparameters.
  • Maximize context length to preclude answer truncation.
  • Implement robust (LaTeX- and expression-aware) answer parsing (see the sketch after this list).
  • Open-source all code, logs, and evaluation artifacts.
  • Apply live, timeline-locked evaluation to prevent data leakage (Hochlehnert et al., 9 Apr 2025, Balunović et al., 29 May 2025).
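A sketch of LaTeX- and expression-aware answer checking along these lines, using sympy for symbolic equivalence; this is a simplified illustration, not the parser of any cited harness:

```python
from sympy import simplify, sympify
from sympy.parsing.latex import parse_latex  # requires the optional antlr4 runtime

def answers_equal(pred: str, gold: str) -> bool:
    """Return True if pred and gold are symbolically equivalent expressions."""
    def to_expr(s: str):
        s = s.strip().strip("$")
        try:
            return parse_latex(s)          # handle LaTeX such as \frac{1}{2}
        except Exception:
            try:
                return sympify(s)          # handle plain expressions such as 1/2
            except Exception:
                return None

    a, b = to_expr(pred), to_expr(gold)
    if a is None or b is None:
        return pred.strip() == gold.strip()  # last resort: string comparison
    try:
        return simplify(a - b) == 0
    except Exception:
        return a == b
```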

5. AIME Protocols Beyond Math: Multi-LLM Evaluation in Code Generation

In code generation and broader agentic LLM evaluation, AIME also refers to the protocol “AI system optimization via Multiple LLM Evaluators” for iterative text-based refinement (Patel et al., 4 Oct 2024). This framework addresses the suboptimality of using a single LLM as the evaluator in such refinement loops. Theoretically, a linear mixture of diverse evaluator distributions can better approximate the inaccessible oracle loss:

$$\Delta^{\Pi}_{\text{Eva-subopt}} \leq |e^*|_{\max}\, d_{TV}\!\left(\pi_e^*, \sum_{k=1}^{K} \alpha_k \pi_k \right)$$

Practically, AIME inserts K role-specific LLM evaluators (e.g., “check syntax,” “check logic”) into each refinement round. Quantitative experiments on LeetCodeHard and HumanEval show:

| Metric | Single-Eval | AIME | Improvement |
|---|---|---|---|
| Error Detection | 31.4% | 91.1% | +59.7 pp |
| Success Rate | 83.7% | 89.3% | +5.6 pp |
| Completion Rate | 76.1% | 82.9% | +6.8 pp |

Success and error detection rates increase with evaluator role diversity and the number of evaluators. Ablations demonstrate that omitting any one of the syntax, correctness, or logic evaluators degrades results by up to 12 points. Robustness analyses show that AIME is up to 16 percentage points less vulnerable to adversarial (“always-positive”) graders (Patel et al., 4 Oct 2024).
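A hedged sketch of the linear aggregation of role-specific evaluators; the role prompts and the `query_llm` helper are hypothetical stand-ins, and the paper's actual implementation may differ:

```python
from typing import Callable

def make_evaluator(role_prompt: str, query_llm: Callable[[str], float]) -> Callable[[str], float]:
    """Wrap an LLM call so that it scores a candidate under one evaluation role."""
    return lambda candidate: query_llm(f"{role_prompt}\n\nCandidate solution:\n{candidate}")

def aggregate_feedback(candidate: str,
                       evaluators: list[Callable[[str], float]],
                       weights: list[float]) -> float:
    """Linear mixture of K evaluator scores, mirroring the weighted sum in the bound above."""
    assert len(evaluators) == len(weights) and abs(sum(weights) - 1.0) < 1e-6
    return sum(w * ev(candidate) for w, ev in zip(weights, evaluators))

# Example roles mirroring the ablated criteria: syntax, correctness, logic.
```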

6. Extensions, Limitations, and Future Directions

Several studies explore and extend AIME-style evaluation and methods:

  • Generalization and Proof-Writing: MathArena argues for continuous “live” benchmarks to minimize contamination and introduces proof-writing tracks beyond answer-only formats. Future AIME-like datasets are recommended to require intermediate-step checking and to include symbolic or process-level scoring (Balunović et al., 29 May 2025, Sun et al., 27 Mar 2025).
  • Test-Time Scaling: Multiple independent reasoning trajectories and majority voting (“self-consistency”) provide significant accuracy gains over single-pass inference on AIME (a minimal voting sketch follows this list). Advanced methods, such as symbolic weak-verifiers leveraging step-wise algebraic checks, further lift performance, especially in smaller models (Gao et al., 25 Jun 2025).
  • Data Contamination and Statistical Rigour: Evidence indicates that widely available AIME 2024 problems have contaminated pretraining for several LLMs, inflating model scores by 10–20 points over uncontaminated contests (e.g., AIME 2025, BRUMO 2025). Rigorously timeline-controlled evaluation is essential (Balunović et al., 29 May 2025).
  • Generalization to Non-math Domains: The “AIME” terminology also extends to model-based imitation learning in vision-based control, where the Action Inference by Maximising Evidence algorithm (AIME) is used. Here, sample efficiency and generalization are improved via online regularization and surrogate rewards (Zhang et al., 29 Apr 2024).
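Returning to the test-time scaling item above, a minimal sketch of majority-vote self-consistency; `sample_answer` is a hypothetical stub for one stochastic reasoning pass:

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[str], str], problem: str, k: int = 16) -> str:
    """Sample k independent reasoning trajectories and return the most common final answer."""
    votes = Counter(sample_answer(problem) for _ in range(k))
    return votes.most_common(1)[0][0]
```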

Future directions identified include expanding benchmarks to multilingual and multimodal contexts, increasing scale and diversity, integrating intermediate process verification, and extending robust AIME-style evaluator frameworks to scientific and agentic reasoning tasks.

7. Summary Table: Major AIME Benchmarks and Protocols

| Setting | Key Dataset/Protocol | Notable Features |
|---|---|---|
| Mathematical Reasoning | AIME’24 | 30 problems, integer answers, Pass@1, standardized protocol (Hochlehnert et al., 9 Apr 2025) |
| Olympiad-level Math | OlymMATH-EASY | 100 AIME-level bilingual problems, symbolic verification (Sun et al., 27 Mar 2025) |
| Live/Uncontaminated Eval | MathArena AIME-2024 | Timeline-locked runs, human-comparable stats (Balunović et al., 29 May 2025) |
| Code Gen Evaluation | AIME (Multi-LLM) | Multiple role-specific LLM evaluators, linear aggregation (Patel et al., 4 Oct 2024) |
| Imitation Learning | AIME/AIME-v2 | Action inference with latent world models, sample-efficient (Zhang et al., 29 Apr 2024) |

This diversity of “AIME Benchmarks” illustrates both the centrality of the AIME format in mathematical model assessment and the growing impact of multi-criteria, statistically rigorous, and process-aware evaluation protocols in broader AI evaluation research.
