AIME 2024 Math Reasoning Benchmark
- The benchmark is a systematic evaluation featuring 30 free-response, integer-constrained math problems spanning algebra, combinatorics, geometry, and number theory with strict symbolic answer standardization.
- It employs advanced methodologies including contamination diagnostics and multi-instance symbolic templates to robustly measure LLM reasoning and guard against data leakage.
- Innovative training pipelines, such as difficulty mining, long-context distillation, and offline reinforcement learning, have significantly boosted model performance on this benchmark.
The AIME 2024 Mathematical Reasoning Benchmark marks a pivotal era in evaluating advanced mathematical problem solving by large language and reasoning models (LLMs/LRMs). Anchored to the 2024 American Invitational Mathematics Examination, the benchmark exposes both the limits of current models' reasoning capabilities and the complexities introduced by data contamination and evaluation-protocol fragility, underscoring the need for rigorous, uncontaminated, and variabilized assessments. Research surrounding AIME 2024 has catalyzed both advanced training methodologies (data scaling, sophisticated RL paradigms, long-context distillation) and critical methodological innovations such as symbolic multi-instance evaluation and contamination diagnostics. The benchmark has consequently become a reference point for both meaningful progress and enduring obstacles in LLM-driven symbolic mathematics.
1. Benchmark Definition and Problem Structure
The AIME 2024 benchmark consists of 30 integer-response problems across algebra, combinatorics, geometry, and number theory, with each answer constrained to an integer in the interval [0, 999] (Balunović et al., 29 May 2025). Problems are free-response, demanding symbolic and logical reasoning through multi-step, chain-of-thought (CoT) derivations rather than factual recall. Rigorous answer-format standardization is enforced (LaTeX canonicalization; boxed integer), and automatic checker pipelines ensure evaluation objectivity.
Typical topical breakdown:
- Algebra: 9
- Combinatorics: 9
- Geometry: 8
- Number Theory: 6
Problems are transcribed and matched exactly to official exam releases to eliminate formatting drift and facilitate programmatic answer verification. Accuracy is computed as the fraction of correctly boxed integer answers across all 30 problems (pass@1 metric), and no partial credit is given (Balunović et al., 29 May 2025, Zhang et al., 2 Apr 2025).
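The scoring rule above can be sketched in a few lines: extract the first boxed integer from a completion (rejecting values outside AIME's 0–999 range) and average exact matches over the problem set. The function names and the regex are illustrative stand-ins, not MathArena's actual checker.

```python
import re


def extract_boxed_integer(completion: str):
    """Return the first \\boxed{...} integer in a model completion, or None.

    AIME answers are integers in [0, 999]; anything else is scored wrong.
    """
    m = re.search(r"\\boxed\{\s*(-?\d+)\s*\}", completion)
    if m is None:
        return None
    value = int(m.group(1))
    return value if 0 <= value <= 999 else None


def pass_at_1(completions, gold_answers):
    """Fraction of problems whose first sampled completion is exactly correct.

    No partial credit: a missing or malformed box counts as incorrect.
    """
    correct = sum(
        extract_boxed_integer(c) == g for c, g in zip(completions, gold_answers)
    )
    return correct / len(gold_answers)
```

For example, `pass_at_1(["hence \\boxed{073}."], [73])` evaluates to `1.0`, while a completion with no boxed answer scores zero for that problem.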
2. Evaluation Methodologies and Contamination Assessment
Because the AIME 2024 problems predate the release of many frontier LLMs, contamination risk is substantial: models may have seen the problems or full solutions during training, artificially inflating their measured reasoning prowess (Balunović et al., 29 May 2025, Yao et al., 17 Jul 2025). MathArena's evaluation protocol mandates model benchmarking within 24 hours of contest release, prior to public discussion, systematically timestamping evaluations relative to model release dates and flagging suspect overlaps. Chain-of-thought majority voting is eschewed in standard accuracy scoring (only the first boxed integer counts), though higher-order sampling metrics (maj@k, pass@k) are reported for robustness (Du et al., 17 Dec 2025, Balunović et al., 29 May 2025).
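The higher-order sampling metrics mentioned above can be computed as follows: `pass_at_k` is the standard unbiased estimator from n samples with c correct, and `maj_at_k` is a plurality vote over sampled answers. Both are generic sketches, not the papers' exact evaluation harnesses.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples of which c are correct:
    1 - C(n-c, k) / C(n, k), the chance that a random size-k subset
    contains at least one correct sample."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def maj_at_k(sampled_answers, gold) -> bool:
    """maj@k: does the plurality answer over k samples match the gold answer?
    Samples with no extractable answer (None) are ignored."""
    counts = Counter(a for a in sampled_answers if a is not None)
    if not counts:
        return False
    return counts.most_common(1)[0][0] == gold
```

For instance, with 4 samples of which 2 are correct, `pass_at_k(4, 2, 1)` is 0.5, matching the naive per-sample accuracy, while `pass_at_k(4, 2, 2)` is higher because two draws get two chances.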
Contamination diagnosis leverages performance drift metrics: a drift of –$0.3$ is common among leading models, indicating that a substantial share of their measured AIME 2024 performance is plausibly attributable to data leakage (Balunović et al., 29 May 2025). Benchmark integrity is further probed using variabilized (symbolic, parameterized) templates: VAR-MATH replaces numerical constants with symbolic variables, yielding structural isomorphs for multi-instance consistency checks. This exposes drastic relative drops in model accuracy under isomorphic instantiation, most severe under the strict VAR-AIME-24 protocol (Yao et al., 17 Jul 2025).
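A minimal sketch of the drift diagnostic: split per-contest accuracies at the model's training cutoff and compare the means. The interface and the flagging `threshold` are illustrative assumptions, not MathArena's published tooling.

```python
from datetime import date


def drift_and_flag(results, cutoff: date, threshold: float = -0.15):
    """Compute performance drift and flag suspected contamination.

    `results` maps contest release date -> accuracy on that contest.
    drift = mean accuracy on post-cutoff contests minus mean accuracy on
    pre-cutoff (possibly leaked) contests; a strongly negative drift
    suggests the pre-cutoff scores are inflated by data leakage.
    Returns (drift, suspect); drift is None if either side is empty.
    """
    pre = [acc for d, acc in results.items() if d < cutoff]
    post = [acc for d, acc in results.items() if d >= cutoff]
    if not pre or not post:
        return None, False
    drift = sum(post) / len(post) - sum(pre) / len(pre)
    return drift, drift <= threshold
```

With a pre-cutoff accuracy of 0.9 and a post-cutoff accuracy of 0.6, the drift is −0.3, matching the magnitude the drift metric reports for leading models above.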
3. Advances in Training: Scaling Difficult Data and Curriculum Engineering
State-of-the-art performance on AIME 2024 has been driven by engineering pipelines that explicitly mine and amplify the difficult region of the training distribution. ScaleDiff (Pei et al., 25 Sep 2025) introduces a three-stage pipeline:
- Efficient Difficulty Mining: AdaptThink, a 7B-parameter LRM, labels "difficult" instances via a two-mode policy: problems solvable without explicit CoT are labeled simple; all others, difficult. This grants near-instantaneous, forward-pass-only difficulty annotation.
- Specialized Difficult-Problem Generation: DiffGen-8B is trained exclusively on difficult seeds, yielding large pools of novel, hard competition-style problems, further filtered by AdaptThink to ensure high difficulty retention.
- Solution Generation and Filtration: Chains-of-thought for new problems are generated using Qwen3-8B (“thinking mode”), passing through both rule-based and model-based filters (against Qwen2.5-Math-7B-Instruct) for solution validity and novelty.
Fine-tuning Qwen2.5-Math-7B-Instruct on this synthesized corpus (totaling 1.7M samples) yields a substantial gain over the original dataset, lifting AIME'24 accuracy to 73.0% for ScaleDiff-7B. Notably, a monotonic scaling law is observed: augmenting difficult data boosts performance with no sign of saturation (Pei et al., 25 Sep 2025).
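The two-mode labeling rule in the pipeline above can be sketched as follows, with `try_no_cot_solve` standing in (hypothetically) for a single no-thinking-mode forward pass plus an answer check; the toy solver is purely for illustration.

```python
def mine_difficulty(problems, try_no_cot_solve):
    """Two-mode difficulty mining, sketching the AdaptThink labeling rule:
    a problem the model answers correctly *without* emitting a chain of
    thought is labeled "simple"; everything else is labeled "difficult".

    `try_no_cot_solve(problem) -> bool` is a hypothetical stand-in for a
    forward pass in no-thinking mode followed by an answer check.
    """
    return {
        p: ("simple" if try_no_cot_solve(p) else "difficult")
        for p in problems
    }


def toy_solver(problem: str) -> bool:
    # Toy stand-in: pretend short problems are solvable without CoT.
    return len(problem) < 20
```

In the real pipeline the "difficult" bucket then seeds DiffGen-8B's problem generation; here the dictionary is simply the annotation output.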
4. Model Architectures and Training Paradigms
Contemporary solutions for AIME 2024 span dense transformers, expert MoEs, and hierarchical reasoning agents. Fine-grained methodological advances include:
- Long-Context Distillation: Nemotron-Math (Du et al., 17 Dec 2025) harnesses 7.5M long-form CoT traces across three reasoning depths (high/medium/low) and two tool settings (with/without Python TIR). Fine-tuning with a sequential bucketed strategy enables 128K-token contexts at more than 2× speedup versus naïve training, with negligible accuracy loss.
- Offline Reinforcement Learning: PCL-Reasoner-V1.5 (Lu et al., 21 Jan 2026) advances model reasoning stability and sample efficiency by decoupling rollout inference from training. Binary-reward policy gradients over static datasets are optimized via a geometric-mean likelihood objective. Although policy refinement disproportionately boosts "long-CoT" problems, response lengths and reasoning depth reliably increase post-offline RL.
- Verifiable, Hierarchical Agents: Intern-S1-MO (Gao et al., 11 Dec 2025) employs a multi-agent architecture for multi-round lemma-based reasoning, integrating summary, value-propagation, and process verification stages. The OREAL-H framework applies RL over hierarchical MDPs, aligning low-level token policies and high-level action selection via conjugate reward denoising and value back-propagation through lemma dependency graphs.
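The offline-RL objective above can be sketched as a binary-reward policy gradient over a geometric-mean (i.e., length-normalized) likelihood, assuming per-token log-probs have already been gathered from static rollouts. This illustrates the shape of such an objective, not PCL-Reasoner-V1.5's exact loss or baseline.

```python
import numpy as np


def offline_rl_loss(token_logps, rewards):
    """Sketch of a binary-reward policy-gradient loss over static rollouts.

    Each response's likelihood is summarized by the log of its geometric-mean
    token probability (the mean per-token log-prob), so long chains of
    thought are not penalized merely for their length.

    `token_logps`: list of 1-D arrays of per-token log-probs, one per rollout.
    `rewards`: matching 0/1 correctness labels.
    The batch-mean reward serves as a simple advantage baseline (assumption).
    """
    baseline = float(np.mean(rewards))
    losses = []
    for logps, r in zip(token_logps, rewards):
        geo_mean_logp = float(np.mean(logps))  # log of geometric-mean likelihood
        advantage = r - baseline
        losses.append(-advantage * geo_mean_logp)  # minimize => reinforce correct rollouts
    return float(np.mean(losses))
```

Minimizing this loss raises the length-normalized likelihood of above-baseline (correct) rollouts and lowers it for below-baseline ones, which is the decoupled rollout/training dynamic the bullet describes.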
5. Empirical Performance and Compression Effects
Recent evaluations (see table below) delineate current model capabilities on AIME 2024:
| Model | Params | Accuracy (AIME24) |
|---|---|---|
| o4-mini (high) | — | 91.67% |
| o3 (high) | — | 89.17% |
| Gemini-2.5-Pro | — | 87.50% |
| ScaleDiff-7B | 7B | 73.0% ± 5.0 |
| DeepSeek-R1 (full) | 671B | 73.3% |
| DeepSeek-R1 (2.51-bit) | 671B | 76.7% |
| PCL-Reasoner-V1.5 | 32B | 90.9% |
| Nemotron-Math (maj@16) | 30B | 100% (high+TIR) |
Performance remains highly sensitive to contamination; pass@1 on AIME 2024 can overestimate generalization by 10 percentage points or more. Quantization to 2.5 bits preserves or slightly improves accuracy, whereas aggressive pruning or very low-bit quantization collapses reasoning. Notably, concise reasoning chains (the shortest 20–30% of outputs) yield near-perfect accuracy, while long, verbose outputs correspond to drastically lower correctness (Zhang et al., 2 Apr 2025).
6. Symbolic, Multi-Instance, and Bilingual Benchmarking
To mitigate contamination and pattern overfitting, VAR-MATH (Yao et al., 17 Jul 2025) systematizes symbolic problem parameterization, enforcing model consistency over multiple independently sampled instantiations of each template. Under a strict "all-or-nothing" scoring rule, models experience steep drops in accuracy, demonstrating that robust structural generalization is frequently lacking, especially in smaller (7–32B) architectures.
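The multi-instance protocol can be sketched as: sample several structurally isomorphic instantiations of a symbolic template, then score the template all-or-nothing. `template` and `sampler` are hypothetical stand-ins for VAR-MATH's actual tooling.

```python
import random


def instantiate(template, sampler, n=3, seed=0):
    """Draw n structurally isomorphic instances of a symbolic template.

    `template(params) -> (problem_text, gold_answer)` and `sampler(rng)`
    (which draws the symbolic parameters) are hypothetical interfaces.
    """
    rng = random.Random(seed)
    return [template(sampler(rng)) for _ in range(n)]


def all_or_nothing(model_answers, golds) -> bool:
    """Strict VAR-MATH-style scoring: a template counts as solved only if
    *every* instantiation is answered correctly."""
    return all(a == g for a, g in zip(model_answers, golds))
```

A model that has merely memorized one numerical instance will typically fail at least one re-instantiation, so the strict score falls even when single-instance pass@1 looks strong.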
The OlymMATH benchmark (Sun et al., 27 Mar 2025) complements AIME 2024-style evaluation with bilingual (English/Chinese) coverage, rigorously curated and formatted for automatic answer checking. On its AIME-tier subset (OlymMATH-EASY), the strongest reported English results come from o3-mini (high) at pass@1 and DeepSeek-R1 at cons@10, with correctness gated by both symbolic normalization and numerical tolerance.
7. Limitations, Best Practices, and Prospects
Despite analytical advances, AIME 2024 benchmarking is subject to the following limitations:
- Data contamination: High risk for benchmarks released prior to recent LLMs; forward-only evaluation on “future” contests is essential (Balunović et al., 29 May 2025).
- Evaluation fragility: Single-instance evaluation is fragile to sampling variability and model stochasticity; multi-instance symbolic evaluation is recommended (Yao et al., 17 Jul 2025).
- Verification deficits: Most data generation pipelines lack formal automated verification; solutions may propagate undetected errors (Pei et al., 25 Sep 2025).
- Reasoning depth vs. knowledge recall: Compression and distillation degrade knowledge-intensive performance more than reasoning per se (Zhang et al., 2 Apr 2025).
Best practices, codified by MathArena and VAR-MATH, emphasize timestamped evaluation, contamination tracking, variabilized templates, and the reporting of both strict and loose metrics. Prospective improvements include theorem-prover-based solution verification, dynamic curriculum schedules, expansion to non-mathematical reasoning domains, and integration of multi-lingual, proof-based, and multi-agent protocols.
References
- (Pei et al., 25 Sep 2025) ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning
- (Balunović et al., 29 May 2025) MathArena: Evaluating LLMs on Uncontaminated Math Competitions
- (Du et al., 17 Dec 2025) Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision
- (Lu et al., 21 Jan 2026) PCL-Reasoner-V1.5: Advancing Math Reasoning with Offline Reinforcement Learning
- (Zhang et al., 2 Apr 2025) When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks
- (Yao et al., 17 Jul 2025) VAR-MATH: Probing True Mathematical Reasoning in LLMs via Symbolic Multi-Instance Benchmarks
- (Sun et al., 27 Mar 2025) Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for LLMs
- (Yue et al., 2024) HARP: A challenging human-annotated math reasoning benchmark
- (Gao et al., 11 Dec 2025) Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving