
AIME 2024–2025 Benchmark Overview

Updated 22 October 2025
  • AIME 2024–2025 Benchmark is a curated set of mathematically rigorous problems from AIME competitions designed to assess LLMs in algebra, geometry, number theory, and combinatorics.
  • It utilizes contamination-resistant curation and expert verification, ensuring objective, automated grading through standardized LaTeX formulation and singular numerical answers.
  • The benchmark underpins advances in collaborative reasoning and data-efficient distillation frameworks, and its bilingual variants highlight cross-linguistic performance gaps and model robustness.

The AIME 2024–2025 Benchmark refers to a suite of mathematical reasoning evaluations derived from the American Invitational Mathematics Examination (AIME), designed to rigorously test the capabilities of large language and reasoning models after prior benchmark saturation. Recent research efforts, anchored in the release and analysis of OlymMATH and complementary frameworks, have utilized AIME-level problems both as standalone benchmarks and as constituent elements in larger, multi-faceted evaluation sets for algorithmic reasoning, collaboration, and distillation. The benchmarking process, methodologies, and applications have directly influenced the development and assessment of advanced AI systems, with a focus on objective, rule-based performance measurement, contamination prevention, and bilingual reasoning.

1. Definition and Benchmark Structure

AIME 2024–2025 Benchmark refers to a collection of medium-difficulty mathematical problems sourced from official AIME competitions and curated sets such as OlymMATH (Sun et al., 27 Mar 2025). The problems are structured to test the baseline capabilities of LLMs in algebra, geometry, number theory, and combinatorics. Within benchmarks like OlymMATH, AIME-level problems are classified as “easy” and serve as a standardized baseline for mathematical reasoning, in contrast to more challenging “hard” subsets specially designed to exceed the typical AIME difficulty.

AIME-level problems typically feature elegant LaTeX formulations with clearly defined answer expectations. For example, representative problems include combinatorial sums:

$\sum_{k=0}^{1234}\binom{2016\times 1234}{2016k} \bmod 2017^2$

and geometry:

$\cos\angle BAC=\frac{4}{5}$

with final answers reported as boxed real numbers (e.g., $\boxed{\frac{18}{5}}$). This format ensures answers are easily parsable and objectively verifiable.
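
Because answers follow this boxed convention, extraction can be automated. The snippet below is a minimal sketch using a brace-matching scan over a model's raw output; the function name `extract_boxed_answer` is illustrative and not part of any cited benchmark harness.

```python
def extract_boxed_answer(model_output: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a model response.

    Minimal brace-matching scan; real graders also handle nested macros,
    \\fbox variants, and multiple candidate answers.
    """
    start = model_output.rfind(r"\boxed{")
    if start == -1:
        return None
    i, depth, out = start + len(r"\boxed{"), 1, []
    while i < len(model_output):
        ch = model_output[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces


print(extract_boxed_answer(r"... hence $\boxed{\frac{18}{5}}$."))  # -> \frac{18}{5}
```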

2. Problem Curation and Verification

To prevent data contamination—a critical concern for benchmark fidelity—problems for the AIME 2024–2025 Benchmark are manually selected from printed sources such as math magazines, textbooks, and official competition materials (Sun et al., 27 Mar 2025). This excludes widely available online datasets, minimizing the likelihood that contemporary LLMs have encountered the exact problem set during training. Each problem undergoes rigorous expert verification to eliminate ambiguity and guarantee compatibility with automated evaluation tools (e.g., sympy for numerical equality checks).

Answers in the AIME 2024–2025 Benchmark are restricted to singular numerical values or intervals, facilitating robust automated grading and objective scoring. Standardized presentation protocols are adopted from established datasets (e.g., the MATH dataset), further enhancing comparability and reproducibility.
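
As a concrete illustration of the sympy-based equality checking described above, the following sketch parses a LaTeX answer and compares it against a reference value. It assumes the optional `sympy.parsing.latex` backend (which requires the ANTLR runtime) is available; the helper name and error handling are illustrative, not the benchmark's actual grader.

```python
from sympy import Rational, simplify
from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime installed


def answers_equal(predicted_latex: str, reference: str) -> bool:
    """Symbolic/numerical equality check between a parsed LaTeX answer and a
    reference value (illustrative; real pipelines add timeouts and normalization)."""
    try:
        pred = parse_latex(predicted_latex)
        ref = Rational(reference)          # "3.6" and "18/5" both parse exactly
        return simplify(pred - ref) == 0
    except Exception:
        return False


print(answers_equal(r"\frac{18}{5}", "3.6"))   # True
print(answers_equal(r"\frac{18}{5}", "7/2"))   # False
```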

3. Empirical Performance and Evaluation Metrics

Empirical evaluations on AIME 2024–2025, as part of composite benchmarks like OlymMATH, MathArena, and collaborative frameworks, show significant variance in model performance based on problem difficulty and dataset contamination (Sun et al., 27 Mar 2025, Balunović et al., 29 May 2025). Pass@1 and Cons@10 (consistency) metrics are frequently employed:

  • On OlymMATH-EASY (AIME-level), state-of-the-art models such as DeepSeek-R1, QwQ-32B, and OpenAI’s o3-mini (high) achieve pass@1 scores in the upper 60–90% range, indicating that AIME-level problems remain non-trivial for even advanced models.
  • On OlymMATH-HARD, model accuracy drops substantially (e.g., DeepSeek-R1: ~21.2% on the English hard set).
  • In MathArena, evidence of “contamination” is observed, with models scoring 10–20% higher on AIME 2024 than on the newly released AIME 2025, suggesting inadvertent memorization in earlier benchmarks (Balunović et al., 29 May 2025).
  • Benchmark results are statistically validated, including permutation tests for rank significance and variance estimation:

$\text{Var}(\hat{p}) = \frac{\hat{p}(1-\hat{p})}{N}$

where $\hat{p}$ is the observed accuracy and $N$ is the number of problems.
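
To make these metrics concrete, here is a minimal sketch of how pass@1, Cons@k majority voting, and the standard error implied by the variance formula above could be computed. The function names and data layout are assumptions for illustration, not the evaluation code of the cited papers.

```python
from collections import Counter
import math


def pass_at_1(correct: list[list[bool]]) -> float:
    """Average per-problem fraction of correct samples (pass@1 estimate)."""
    return sum(sum(s) / len(s) for s in correct) / len(correct)


def cons_at_k(answers: list[list[str]], references: list[str]) -> float:
    """Cons@k: majority-vote answer over k samples per problem vs. the reference."""
    hits = sum(int(Counter(a).most_common(1)[0][0] == ref)
               for a, ref in zip(answers, references))
    return hits / len(references)


def std_error(p_hat: float, n: int) -> float:
    """Square root of Var(p_hat) = p_hat * (1 - p_hat) / N."""
    return math.sqrt(p_hat * (1 - p_hat) / n)


# e.g. 70% observed accuracy over the 30 problems of a single AIME year
print(round(std_error(0.70, 30), 3))  # ~0.084
```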

4. Collaborative and Peer-Based Reasoning Advances

Recent works leverage the AIME 2024–2025 Benchmark as an experimental platform for innovations in collaborative reasoning (Luo et al., 12 May 2025). The Learning from Peers (LeaP) framework enables parallel reasoning paths to exchange summaries every $T$ tokens, overcoming the “Prefix Dominance Trap”:

  • Dispersed Routing distributes peer insights based on normalized Levenshtein similarity (see the sketch after this list):

$\mathcal{C}_i = \operatorname{Bottom}\mbox{-}k\left\{\operatorname{similarity}(s_i, s_j) \mid j \neq i\right\}$

with:

$\operatorname{similarity}(s_i, s_j) = 1 - \frac{D_{\mathrm{lev}}(s_i, s_j)}{\max(|s_i|, |s_j|)}$

  • Empirical improvements: QwQ-32B with LeaP attains Pass@1 scores up to 85.83 on AIME 2024, surpassing even larger models such as DeepSeek-R1-671B, with smaller fine-tuned versions (LeaP-T-7B) matching performance of models double their size.
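
The dispersed-routing rule in the first bullet can be sketched directly from the formulas above: compute the normalized Levenshtein similarity between peer summaries and route each path's incoming insights from its Bottom-k most dissimilar peers. The implementation below is a minimal illustration under those definitions, not the authors' released LeaP code.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance D_lev(a, b)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def similarity(s_i: str, s_j: str) -> float:
    """similarity(s_i, s_j) = 1 - D_lev(s_i, s_j) / max(|s_i|, |s_j|)."""
    return 1.0 - levenshtein(s_i, s_j) / max(len(s_i), len(s_j), 1)


def dispersed_routing(summaries: list[str], i: int, k: int) -> list[int]:
    """Return the Bottom-k peers by similarity to path i, i.e. the most
    dissimilar summaries, so each path receives diverse peer insights."""
    scored = [(j, similarity(summaries[i], summaries[j]))
              for j in range(len(summaries)) if j != i]
    scored.sort(key=lambda t: t[1])  # ascending similarity
    return [j for j, _ in scored[:k]]
```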

AIME 2024–2025 thus serves not only as a baseline but also as a rigorous testbed for peer collaboration, error correction, and robustness in multi-path LLM reasoning.

5. Contamination-Resistant and Proof-Writing Evaluation

Dynamic, contamination-resistant evaluation pipelines, as exemplified by MathArena, utilize real-time competition releases to ensure freshness and eliminate data leakage (Balunović et al., 29 May 2025). In this schema:

  • AIME 2025 problems are designated “uncontaminated,” used to ascertain genuine model reasoning capabilities.
  • Grading protocols for answer-based tasks include automated LaTeX parsing and symbolic equivalence checks.
  • For proof-writing tasks (e.g., USAMO 2025), AIME serves as a contrast: while LLMs score highly on AIME numerical tasks (up to 87%), scores on proof tasks remain below 25%, indicating major gaps in stepwise deductive capabilities.

Systematic analysis across competitions reveals cost–performance tradeoffs, with AIME problems used both to measure numerical reasoning and to illuminate the limits of present LLM proof generation.

6. Data-Efficient Distillation and Benchmark-Driven Training

AIME 2024–2025 (alongside similar benchmarks) underpins recent advances in data-efficient distillation frameworks (Wu et al., 13 Aug 2025):

  • The Data-Efficient Distillation (DED) framework selects teacher models not by benchmark score, but by empirical transfer efficacy on AIME-level problems. For instance, QwQ-32B is often more effective than larger models for distillation purposes.
  • The framework curates ~800 highly targeted examples after filtering for correctness, token length ($\text{len} > 16\text{k}$), and diversity.
  • Diversity of reasoning paths, quantified by inter-sample Levenshtein distance, is enforced per question; only the farthest $P$ trajectories are used in training (a selection sketch follows this list).
  • Distilled student models using DED surpass their teacher models on AIME 2024/2025, demonstrating that factors such as token entropy and stable latent space representation are crucial for practical reasoning performance.
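
The diversity-filtering step in the third bullet can be illustrated with a greedy farthest-first selection over pairwise Levenshtein distances. The greedy rule and helper names below are assumptions for illustration (using the python-Levenshtein package), not necessarily the exact DED selection procedure.

```python
import Levenshtein  # pip install python-Levenshtein


def select_diverse_trajectories(trajectories: list[str], p: int) -> list[int]:
    """Greedily pick indices of up to p trajectories that are mutually farthest
    apart by edit distance, seeded with the longest trajectory."""
    n = len(trajectories)
    if n <= p:
        return list(range(n))
    selected = [max(range(n), key=lambda i: len(trajectories[i]))]
    remaining = set(range(n)) - set(selected)
    while len(selected) < p:
        # choose the candidate whose minimum distance to the selected set is largest
        best = max(remaining, key=lambda i: min(
            Levenshtein.distance(trajectories[i], trajectories[j]) for j in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```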

This suggests that scaling corpus size is neither necessary nor always beneficial: careful, benchmark-driven data selection and trajectory diversity are critical for optimal distillation and model robustness.

7. Bilingual and Multilingual Benchmarking

OlymMATH introduces side-by-side bilingual benchmarking by making every problem available in both English and Chinese (Sun et al., 27 Mar 2025). The translation pipeline involves initial LLM-based generation, subsequent multi-stage refinement, and expert human verification:

  • Bilingual benchmarking reveals consistent performance discrepancies, with LLMs performing better in English than in Chinese.
  • These findings are attributed to model pre-training corpus composition, underscoring the need for balanced language support in future mathematical reasoning tasks.

AIME 2024–2025 variants, when included in bilingual benchmarks, facilitate the study of cross-linguistic capabilities and highlight language gaps in mathematical inference.


In summary, the AIME 2024–2025 Benchmark is a pivotal resource for the objective, contamination-resistant, and cross-linguistic evaluation of LLMs in mathematical reasoning. It provides baseline challenges, illuminates the need for collaborative and distilled reasoning advancements, and grounds empirical progress assessment across algorithmic, collaborative, and multilingual AI research directions.
