AIME 2024 and AIME 2025 Benchmarks
- AIME 2024 and AIME 2025 are rigorous mathematical reasoning benchmarks drawn from AIME competition problems; contamination concerns around AIME 2024 have motivated live evaluation strategies such as MathArena.
- Advanced reinforcement learning, data-efficient distillation, and specialized model architectures are used to strengthen chain-of-thought reasoning and overall performance.
- Innovative inference techniques, including Parallel-Distill-Refine and value-guided search, improve the efficiency, verifiability, and scalability of automated mathematical reasoning.
The AIME 2024 and AIME 2025 benchmarks are prominent mathematical reasoning challenges used to evaluate LLMs on advanced Olympiad-style mathematics—specifically in the context of rapid advances in automated theorem proving, chain-of-thought (CoT) reasoning, reinforcement learning (RL), data-efficient distillation, and new inference and scaling paradigms. These benchmarks play a central role in tracking progress and exposing the limits of current models, and in shaping methodological innovations around reasoning, efficiency, and generalization.
1. Benchmark Composition and Contamination
AIME 2024 and AIME 2025 are derived from problems of the American Invitational Mathematics Examination (AIME), which consist of challenging, multi-step problems primarily in algebra, number theory, combinatorics, geometry, and probability. Earlier problem sets (notably AIME 2024) are widely accessible online, leading to concerns regarding data contamination—models may have encountered the problem statements during pretraining or tuning, artificially inflating scores (Balunović et al., 29 May 2025). For instance, analyses found several models scoring 10–20 points above the human baseline, with one model (QwQ-Preview-32B) scoring as much as 60% above expectations. This motivated the development of "live" evaluation frameworks such as MathArena, which exclusively use post-release problems from AIME 2025 and similar contests to guarantee uncontaminated evaluation conditions.
The miniF2F dataset (Zheng et al., 2021) exemplifies rigorous formalization of Olympiad-level problems, including AIME questions, for cross-system neural theorem proving. Problems are manually translated to multiple formal systems (Metamath, Lean, Isabelle, HOL Light), enabling exact comparisons of automated reasoning strategies.
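For concreteness, the snippet below is a minimal, purely illustrative Lean statement in the spirit of a miniF2F entry; the theorem name, hypothesis, and trivial proof are invented placeholders, not an actual dataset problem.

```lean
-- Illustrative only: a miniF2F-style formal statement, not a real dataset entry.
-- Actual entries encode full competition problems (e.g., AIME questions) as theorems
-- whose proofs an automated prover must construct.
theorem aime_style_example (n : Nat) (h : n = 12) : n + n = 24 := by
  rw [h]
```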
| Benchmark | Problem Origin | Contamination Status | Additional Notes |
|---|---|---|---|
| AIME 2024 | Historical AIME problems | Contaminated | Inflated scores; likely present in pretraining data |
| AIME 2025 | New competition release | Uncontaminated | Used by MathArena for live model testing |
| miniF2F | Formalized Olympiad | Mixed/Evolving | For neural theorem proving |
2. RL Training, Distillation, and Model Architectures
The AIME 2024/2025 benchmarks have catalyzed several reinforcement learning (RL) and distillation strategies. RL training systematically enhances chain-of-thought reasoning, response length, and accuracy, as demonstrated by Qwen2.5-32B and DeepSeek-R1-Distill-Qwen-1.5B reaching up to 39.33% on AIME 2024, rising to 86.67% when RL is combined with tool manipulation (Chen et al., 6 Mar 2025). These gains rest on careful choices of RL hyperparameters (batch size, temperature), reward shaping, dynamic KL annealing, and tool augmentation.
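As a rough illustration of the reward shaping and dynamic KL annealing mentioned above, the sketch below combines a correctness reward, a mild length penalty, and a linearly annealed KL coefficient. All function names and constants are illustrative assumptions, not values from the cited work.

```python
def kl_coefficient(step: int, total_steps: int,
                   kl_start: float = 0.05, kl_end: float = 0.005) -> float:
    """Linearly anneal the KL penalty coefficient over training (illustrative schedule)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return kl_start + frac * (kl_end - kl_start)

def shaped_reward(is_correct: bool, response_len: int, kl_to_ref: float,
                  step: int, total_steps: int,
                  len_penalty: float = 1e-4) -> float:
    """Correctness reward, mild length penalty, and annealed KL regularization."""
    base = 1.0 if is_correct else 0.0
    return base - len_penalty * response_len - kl_coefficient(step, total_steps) * kl_to_ref

# Example: the same correct 800-token answer early vs. late in training.
print(shaped_reward(True, 800, kl_to_ref=2.0, step=100, total_steps=10_000))
print(shaped_reward(True, 800, kl_to_ref=2.0, step=9_900, total_steps=10_000))
```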
Seed1.5-Thinking (Seed et al., 10 Apr 2025) uses a Mixture-of-Experts (MoE) architecture (20B activated/200B total parameters), further strengthened by RL algorithms (adaptive advantage estimation and “clip-higher” PPO tricks). It achieved 86.7% on AIME 2024 and 74.0% on AIME 2025, remaining competitive with state-of-the-art proprietary and open-source models.
Data-Efficient Distillation (DED) (Wu et al., 13 Aug 2025) demonstrated that careful teacher-model selection, targeted data curation, and exposure to diverse reasoning trajectories—rather than brute-force scaling—produce highly efficient models. Notably, just 0.8k curated examples enabled NTele-R1-32B to reach 81.87% on AIME 2024 and 77.29% on AIME 2025, outperforming models distilled on much larger datasets as well as previous distillation baselines.
AM-Thinking-v1 (Ji et al., 13 May 2025), a 32B dense model built on Qwen2.5-32B, achieved 85.3 (AIME 2024) and 74.4 (AIME 2025) via a post-training pipeline integrating supervised fine-tuning (SFT) with RL (GRPO), illustrating that mid-scale dense models can approach the results of MoE systems.
| Model/Method | Architecture | AIME 2024 (%) | AIME 2025 (%) | Training Paradigm |
|---|---|---|---|---|
| Seed1.5-Thinking | MoE (20B/200B) | 86.7 | 74.0 | RL + specialized architecture |
| NTele-R1-32B (DED) | Dense, Distill. | 81.87 | 77.29 | Data-efficient distillation |
| AM-Thinking-v1 | Dense 32B | 85.3 | 74.4 | SFT+RL (GRPO), difficulty-aware |
3. Inference and Test-Time Scaling Paradigms
Reasoning accuracy and efficiency have both improved through innovative orchestration of inference pipelines. Several paradigms have emerged:
Parallel-Distill-Refine (PDR) (Madaan et al., 1 Oct 2025): Improves accuracy and reduces latency by dividing inference into parallel candidate generation, workspace distillation (into a bounded context), and iterative refinement. PDR outperforms single-pass long CoT in both token usage and accuracy (+11% on AIME 2024, +9% on AIME 2025). Sequential Refinement (SR), a special case of PDR, offers further latency reduction without sacrificing reasoning power.
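A minimal sketch of the PDR round structure, assuming stub `generate` and `distill` callables standing in for LLM calls; the prompt wording and distillation rule are placeholders, not the paper's.

```python
from typing import Callable, List

def parallel_distill_refine(problem: str,
                            generate: Callable[[str], str],
                            distill: Callable[[List[str]], str],
                            rounds: int = 2,
                            width: int = 4) -> str:
    """Each round drafts `width` candidates in parallel, distills them into a bounded
    workspace, and the next round refines conditioned on that workspace rather than
    on full transcripts, keeping context size bounded."""
    workspace = ""
    for _ in range(rounds):
        prompt = problem if not workspace else f"{problem}\n\nDistilled workspace:\n{workspace}"
        candidates = [generate(prompt) for _ in range(width)]  # parallel calls in practice
        workspace = distill(candidates)                        # e.g. merge/summarize key steps
    return generate(f"{problem}\n\nDistilled workspace:\n{workspace}\n\nGive the final answer.")

# Toy usage with stubs standing in for an LLM:
if __name__ == "__main__":
    gen = lambda p: f"candidate answer for: {p[:30]}..."
    dist = lambda cs: " | ".join(c[:20] for c in cs)
    print(parallel_distill_refine("Find the remainder when 7^2024 is divided by 1000.", gen, dist))
```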
Multi-round Thinking (Tian et al., 25 Mar 2025): Iteratively re-prompts the model with its prior answer to reduce cognitive inertia and refine outputs, lifting AIME 2024 accuracy to roughly 83% for QwQ-32B and DeepSeek-R1, above their single-round baselines.
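A minimal sketch of this iterative re-prompting loop, with a stub `generate` callable and an invented prompt template standing in for the paper's exact wording.

```python
from typing import Callable

def multi_round_thinking(problem: str,
                         generate: Callable[[str], str],
                         rounds: int = 3) -> str:
    """Each round feeds the previous answer back as a reference and asks the model to
    re-solve the problem from scratch, breaking 'cognitive inertia' from one long trace."""
    answer = generate(problem)
    for _ in range(rounds - 1):
        prompt = (f"{problem}\n\nA previous attempt concluded: {answer}\n"
                  "Re-solve the problem independently and give a final answer.")
        answer = generate(prompt)
    return answer
```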
Temperature Scaling and Test-Time Scaling (TTS) (Wu et al., 2 Oct 2025): Test-time scaling by increasing the number of samples K eventually plateaus, but spreading the sampling budget across diverse temperature settings unlocks latent reasoning regimes. Averaged over representative benchmarks, temperature scaling yields an additional 7.3 percentage points over conventional single-temperature TTS and can close the gap with RL-trained models at inference time.
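A minimal sketch of spreading a fixed sampling budget across a temperature grid and majority-voting over the extracted answers; the temperature grid and per-temperature budget are illustrative assumptions.

```python
from collections import Counter
from typing import Callable, Sequence

def temperature_scaled_vote(problem: str,
                            sample: Callable[[str, float], str],
                            temperatures: Sequence[float] = (0.4, 0.7, 1.0, 1.3),
                            k_per_temp: int = 8) -> str:
    """Instead of drawing all samples at one temperature, draw k_per_temp samples at each
    temperature in the grid and majority-vote over the final answers."""
    answers = [sample(problem, t) for t in temperatures for _ in range(k_per_temp)]
    return Counter(answers).most_common(1)[0][0]
```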
Prompting Test-Time Scaling (P-TTS) (Bsharat et al., 10 Oct 2025): Data augmentation via prompt-space manipulation (instructional wrapping, reward/penalty framing) allows fine-tuning on small, diverse data (90–900 problems) to rival 1K-shot baselines. The Qwen-2.5-32B model achieved absolute gains of +23.33% (AIME 2024) and +26.63% (AIME 2025) over S1 baselines, demonstrating that prompt-level augmentation can yield efficient generalization.
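A minimal sketch of prompt-space augmentation in this spirit; the templates below (instructional wrapping, reward/penalty framing) are invented stand-ins, not the paper's prompts.

```python
def prompt_space_augment(problem: str) -> list:
    """Wrap the same problem in different instructional / incentive framings, multiplying
    the effective fine-tuning set without collecting new problems."""
    templates = [
        "Solve the following problem step by step, then state the final answer:\n{p}",
        "You will be rewarded for a fully justified solution and penalized for guessing.\n{p}",
        "Explain your reasoning carefully before answering:\n{p}",
        "A careless answer is worse than no answer. Work through this rigorously:\n{p}",
    ]
    return [t.format(p=problem) for t in templates]
```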
4. Verification, Pruning, and Efficiency Mechanisms
Increasing reasoning accuracy is closely tied to mitigating inefficiencies—especially in parallel scaling and verification:
Solve-Detect-Verify Framework and FlexiVe Verifier (Zhong et al., 17 May 2025): The pipeline integrates an early stopping detector (using hesitation cues and log-probabilities), a scalable generative verifier that balances “fast” and “slow” verification, and a flexible verification budget. This approach improves accuracy (from 67.5% to >83% at scale) and reduces token consumption via early exit and adaptive verification.
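A minimal sketch of the solve-detect-verify control flow under stated assumptions: the hesitation cues, log-probability threshold, and "answer emitted" heuristic are all invented placeholders for the paper's detectors.

```python
from typing import Callable, List, Optional, Tuple

HESITATION_CUES = ("wait", "hmm", "let me re-check", "actually")  # illustrative cue list

def detect_solution(tokens: List[Tuple[str, float]],
                    conf_threshold: float = -0.5) -> Optional[int]:
    """Scan a streamed trace of (token, logprob) pairs for a point where an answer has
    likely been reached: confident tokens, no nearby hesitation cue. Returns a cut index."""
    for i, (tok, logprob) in enumerate(tokens):
        window = " ".join(t for t, _ in tokens[max(0, i - 5): i + 1]).lower()
        if logprob > conf_threshold and not any(c in window for c in HESITATION_CUES):
            if "answer" in window:  # crude proxy for "final answer emitted"
                return i
    return None

def solve_detect_verify(trace: List[Tuple[str, float]],
                        fast_verify: Callable[[str], bool],
                        slow_verify: Callable[[str], bool],
                        budget: int = 1) -> bool:
    """If an early answer is detected, stop there; verify it cheaply first, and escalate
    to the slow verifier only while the verification budget allows."""
    cut = detect_solution(trace)
    end = (cut + 1) if cut is not None else len(trace)
    candidate = " ".join(t for t, _ in trace[:end])
    if fast_verify(candidate):
        return True
    return budget > 0 and slow_verify(candidate)
```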
DeepPrune (Tu et al., 9 Oct 2025): Addresses inter-trace redundancy in parallel scaling by employing a judge model (Qwen3-4B-Instruct, AUROC 0.87) trained with focal loss/oversampling, and an online greedy clustering algorithm. DeepPrune achieves >80% token reduction, maintains or improves answer accuracy (up to 91.4% token saving and increased accuracy on AIME 2025 with Qwen3-32B), and establishes a new efficiency standard.
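A minimal sketch of online greedy clustering for trace pruning, assuming a `same_answer_prob` judge score (standing in for the trained judge model) and an illustrative threshold.

```python
from typing import Callable, List

def online_greedy_prune(traces: List[str],
                        same_answer_prob: Callable[[str, str], float],
                        threshold: float = 0.8) -> List[str]:
    """Each incoming partial trace is compared, via a judge score, against the
    representative of every existing cluster; if any pair is judged likely to reach the
    same answer, the new trace is pruned, otherwise it founds a new cluster."""
    representatives: List[str] = []
    for trace in traces:
        if any(same_answer_prob(trace, rep) >= threshold for rep in representatives):
            continue  # redundant: skip finishing this trace, saving its remaining tokens
        representatives.append(trace)
    return representatives
```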
Value-Guided Search (VGS) (Wang et al., 23 May 2025): Employs a token-level value model trained on 2.5M reasoning traces. Block-wise beam search guided by VGS enabled DeepSeek-R1-Distill-1.5B to reach 45.7% accuracy (AIME 2024/2025) with an inference budget of 64, matching larger models at lower computational cost than majority voting or best-of-n.
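A minimal sketch of block-wise beam search guided by a value model, with stub `extend_block` and `value` callables standing in for the policy and the trained token-level value model.

```python
from typing import Callable, List, Tuple

def value_guided_block_beam_search(prompt: str,
                                   extend_block: Callable[[str], List[str]],
                                   value: Callable[[str], float],
                                   beam_width: int = 4,
                                   num_blocks: int = 8) -> str:
    """At each step every beam is extended by several candidate blocks (fixed-length
    chunks of tokens), all partial traces are scored by the value model, and only the
    top `beam_width` survive to the next block."""
    beams: List[Tuple[float, str]] = [(0.0, prompt)]
    for _ in range(num_blocks):
        expanded = [(value(text + block), text + block)
                    for _, text in beams
                    for block in extend_block(text)]
        beams = sorted(expanded, key=lambda x: x[0], reverse=True)[:beam_width]
    return beams[0][1]
```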
5. Alignment, RL Stability, and Off-Policy Optimization
Recent methods highlight stabilization, entropy preservation, and sample efficiency in RL for LLMs:
BAPO (Balanced Policy Optimization with Adaptive Clipping) (Xi et al., 21 Oct 2025): Identifies that conventional PPO-style clipping in RL leads to entropy collapse and gradient explosion in off-policy settings. BAPO dynamically adjusts the clipping bounds to ensure a balanced contribution of positive- and negative-advantage tokens, preserving exploration. BP-Math-7B achieved 70.8 (AIME 2024) and 62.5 (AIME 2025), outperforming SkyWork-OR1-7B; BP-Math-32B recorded 87.1 and 80.0, surpassing o3-mini and Gemini-2.5-Flash-Thinking.
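A minimal sketch of a PPO-style surrogate with asymmetric clipping bounds plus a caricatured adaptation rule; BAPO's actual bound-update is not reproduced here, and the rule below is only an assumption-labeled illustration.

```python
import torch

def clipped_surrogate_loss(ratio: torch.Tensor, advantage: torch.Tensor,
                           clip_low: float, clip_high: float) -> torch.Tensor:
    """PPO-style clipped surrogate with asymmetric bounds (clip_low, clip_high); how the
    bounds are adapted is BAPO's contribution and is only caricatured below."""
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantage
    return -torch.mean(torch.minimum(unclipped, clipped))

def adapt_bounds(ratio: torch.Tensor, advantage: torch.Tensor,
                 clip_low: float, clip_high: float, step: float = 0.005):
    """Illustrative (not the paper's) rule: widen whichever side currently contributes
    less to the surrogate, so positive- and negative-advantage tokens both keep shaping
    the gradient and exploration is preserved."""
    surrogate = ratio * advantage
    pos = surrogate[advantage > 0].sum().item()        # positive-advantage contribution
    neg = -surrogate[advantage < 0].sum().item()       # magnitude of negative side
    if pos < neg:
        clip_high += step
    else:
        clip_low += step
    return clip_low, clip_high
```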
RLEP (Reinforcement Learning with Experience Replay) (Zhang et al., 10 Jul 2025): Utilizes a blended batch of fresh rollouts and replayed successful trajectories to stabilize learning and accelerate convergence. On Qwen2.5-Math-7B, RLEP increased accuracy from 38.2% to 39.9% (AIME 2024) and from 19.8% to 22.3% (AIME 2025), reaching peak accuracy in fewer updates.
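A minimal sketch of RLEP-style batch construction, mixing fresh rollouts with replayed verified-correct trajectories; the replay fraction, pool policy, and dict schema are illustrative assumptions.

```python
import random
from typing import List

def build_rlep_batch(fresh_rollouts: List[dict],
                     replay_pool: List[dict],
                     batch_size: int,
                     replay_fraction: float = 0.25) -> List[dict]:
    """Draw a fixed fraction of each training batch from previously collected successful
    trajectories and fill the rest with fresh on-policy rollouts."""
    n_replay = min(int(batch_size * replay_fraction), len(replay_pool))
    batch = random.sample(replay_pool, n_replay) if n_replay else []
    batch += fresh_rollouts[: batch_size - n_replay]
    return batch

def update_replay_pool(replay_pool: List[dict], rollouts: List[dict],
                       max_size: int = 10_000) -> List[dict]:
    """Keep only rollouts whose final answer was verified correct, bounded in size."""
    replay_pool += [r for r in rollouts if r.get("reward", 0.0) >= 1.0]
    return replay_pool[-max_size:]
```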
6. Reasoning, Proof-Writing, and Real-World Evaluation
While answer-based performance on AIME benchmarks regularly exceeds human levels, proof-writing and rigorous logical justification lag far behind. In MathArena (Balunović et al., 29 May 2025) and “Proof or Bluff” (Petrov et al., 27 Mar 2025), models achieved as high as ~91% accuracy on answer-based AIME tasks, but less than 25% on proof-based competitions (USAMO 2025). Common failure modes include logical leaps, unjustified assumptions, and overgeneralization, with optimization artifacts (e.g., unnecessary boxed answers) arising from RL alignment traces.
This exposes the critical gap between pattern-matching numerical answers and the ability to construct valid, multi-step proofs—a subject of ongoing methodological research.
7. Future Directions and Implications
Future benchmarks and methodologies are expected to focus on:
- Uncontaminated, live-evaluated competitions (AIME, USAMO, SMT, etc., via MathArena) (Balunović et al., 29 May 2025)
- Fine-grained proof evaluation and integration of automated/human verification
- Iterative and prompt-level scaling (PDR, Multi-round, P-TTS) for enhanced efficiency and generalization
- Stable and entropy-preserving RL, off-policy optimization (BAPO, RLEP)
- Data-efficient distillation, tool manipulation, and peer communication mechanisms (DED, RL tool calls, LeaP) for robust error-correction and reflection
- Cross-domain and zero-shot generalization to avoid narrow overfitting
In sum, the AIME 2024 and AIME 2025 benchmarks have become central to the empirical study of mathematical reasoning in LLMs, spurring the development of diverse architectural, training, and inference paradigms. While answer-based accuracy has reached super-human levels on contaminated sets, uncontaminated benchmarks and proof-based tasks continue to reveal the boundaries of current methods and direct future research toward deeper, more rigorous reasoning capabilities.