AIME24: Math Reasoning Benchmark

Updated 30 September 2025
  • AIME24 is a math benchmark featuring problems from AIME 2024, challenging models with multi-step reasoning across algebra, combinatorics, geometry, and number theory.
  • It employs a graded difficulty ladder from easy to extremely hard, which aids in evaluating and fine-tuning model reasoning capabilities using curriculum learning and advanced RL strategies.
  • Methodologies such as supervised fine-tuning, agentic reinforcement learning with tool integration, and dynamic data selection have been key to enhancing LLM performance on AIME24.

AIME24 is a competition-level mathematical reasoning benchmark derived from the American Invitational Mathematics Examination (AIME) 2024 and widely adopted as a challenging testbed for evaluating the multi-step reasoning, abstraction, and problem-solving capabilities of LLMs. Its problems span a diverse range of secondary mathematics topics, including algebra, combinatorics, geometry, and number theory, and are known for demanding creativity, deep logical inference, and extended multi-step chains of thought. AIME24 is frequently used in the contemporary literature as a core evaluation suite for both pretraining and post-training (including supervised fine-tuning and reinforcement learning) of mathematical reasoners, as well as for probing generalization and robustness to problem variations.

1. Problem Structure and Difficulty Characterization

The AIME24 benchmark consists of 30 high school competition problems drawn from the two 2024 AIME exams (AIME I and II, 15 problems each), each requiring a single integer between 0 and 999 as its final answer. The problem set is explicitly constructed to span a spectrum of difficulty:

  • Easy Level: Problems requiring routine application of standard methods, solvable by base models without additional fine-tuning or reasoning strategies.
  • Medium Level: Problems that challenge stepwise reasoning and require models to articulate extended self-reflective chains of thought. Long-form reasoning (so-called "R1 reasoning style" with SFT on 500–1K trajectories) boosts accuracy from roughly 10% to 90% at this tier.
  • Hard Level: Multi-stage problems involving compound subproblems, latent reasoning steps, or high computational burden. Accuracy on these hard AIME24 items typically scales roughly logarithmically with dataset size, plateauing near 65% for strong SFT approaches.
  • Extremely Hard (Exh) Level: Problems that demand unconventional insight, geometric or combinatorial intuition, or strategic deviation from common solution methods. These are unsolved by current LLMs even with extensive fine-tuning.

This "ladder"-like structure has informed the design of curriculum learning, scaling, and RL data selection methods that align training difficulty with model capabilities (Sun et al., 16 Apr 2025).

2. Advanced Training and Optimization Strategies

Recent progress on AIME24 is driven by sophisticated training and inference pipelines that combine supervised, reinforcement, and curriculum learning with data/compute-efficient innovations:

Supervised Fine-Tuning (SFT) and R1 Reasoning

Base models fine-tuned on R1-style, long chain-of-thought trajectories across varied math categories quickly reach near-saturation performance for Easy and Medium tiers. For Hard-tier problems, performance is proportional to dataset size but subject to diminishing returns due to error accumulation across reasoning steps (Sun et al., 16 Apr 2025).
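
A minimal data-preparation sketch for this kind of SFT is shown below: each long chain-of-thought trajectory is packed into a chat-format example whose assistant turn carries the full reasoning trace followed by the final answer. The record fields and the <think> delimiters are illustrative assumptions; actual R1-style datasets use their own schemas, and the resulting messages would be rendered with a tokenizer's chat template before training.

```python
# Format R1-style long chain-of-thought trajectories for supervised fine-tuning.
# The "problem"/"reasoning"/"answer" fields and <think> tags are assumptions.

def format_r1_example(record: dict) -> list[dict]:
    """Turn one long-CoT trajectory into a chat-format SFT example."""
    target = (
        f"<think>\n{record['reasoning']}\n</think>\n\n"
        f"Final answer: {record['answer']}"
    )
    return [
        {"role": "user", "content": record["problem"]},
        {"role": "assistant", "content": target},
    ]

raw = [
    {"problem": "Find the remainder when 7^2024 is divided by 100.",
     "reasoning": "Powers of 7 mod 100 cycle with period 4 (7, 49, 43, 1), and 2024 is divisible by 4, so 7^2024 = 1 (mod 100).",
     "answer": "1"},
]
sft_dataset = [format_r1_example(r) for r in raw]
print(sft_dataset[0][1]["content"])
```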

Reinforcement Learning (RL) with Verifiable Rewards

RL approaches, including Group Relative Policy Optimization (GRPO), Proximal Policy Optimization (PPO), and their variants (SRPO, SEED-GRPO, MAGIC, CAMPO), use external verifiers to provide binary correctness feedback. The best RL frameworks ensure robust exploration, prevent entropy collapse, and use techniques such as context-aware, multi-stage policy optimization (as in MiroMind-M1 (Li et al., 19 Jul 2025)) to incrementally enhance mathematical reasoning across context lengths and response granularities.
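
As a concrete illustration of the group-relative signal these methods share, the sketch below computes GRPO-style advantages from binary verifier rewards by normalizing each sampled completion against the other completions for the same prompt. This is a minimal sketch only; production pipelines add importance-ratio clipping, KL regularization, and the entropy- and context-aware refinements named above.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages from verifiable rewards.

    `rewards` has shape (num_prompts, group_size); each entry is a binary
    verifier score (1 = final answer verified correct, 0 = incorrect).
    Each completion is normalized against the other completions sampled
    for the same prompt, so uniformly right or wrong groups carry no signal.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Two prompts, four sampled completions each.
rewards = np.array([[1.0, 0.0, 0.0, 1.0],   # mixed group: +/- advantages
                    [0.0, 0.0, 0.0, 0.0]])  # all-wrong group: zero advantage
print(grpo_advantages(rewards))
```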

RL with External Tools and Agentic Reasoning

Agentic RL frameworks (e.g., rStar2-Agent (Shang et al., 28 Aug 2025), START (Li et al., 6 Mar 2025)) integrate explicit tool-use, notably invoking Python interpreters for complex calculations, verification, and self-debugging. This enables models to surpass hallucination-prone internal reasoning, self-correct, and iteratively improve upon feedback from code execution environments.
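
The sketch below shows the basic agentic loop such frameworks build on: generate a reasoning step, detect an embedded Python tool call, execute it, and append the execution feedback to the transcript before the next step. The <python>...</python> tool-call convention, the toy generate function, and the unsandboxed subprocess execution are simplifying assumptions, not the actual protocols of rStar2-Agent or START.

```python
import re
import subprocess
import sys

def run_python(code: str, timeout: float = 10.0) -> str:
    """Execute a model-emitted snippet in a subprocess and capture its output.
    (A real agentic-RL setup would sandbox this; a subprocess is only a sketch.)"""
    result = subprocess.run([sys.executable, "-c", code],
                            capture_output=True, text=True, timeout=timeout)
    return result.stdout + result.stderr

def agentic_loop(problem: str, generate, max_turns: int = 4) -> str:
    """Alternate model reasoning with Python tool calls and their feedback."""
    transcript = problem
    for _ in range(max_turns):
        step = generate(transcript)
        transcript += "\n" + step
        call = re.search(r"<python>(.*?)</python>", step, re.DOTALL)
        if call is None:                      # no tool call: treat step as final
            return transcript
        transcript += "\n[tool output]\n" + run_python(call.group(1))
    return transcript

# Toy stand-in for an LLM: first turn emits a tool call, second turn concludes.
def toy_generate(transcript: str) -> str:
    if "[tool output]" in transcript:
        return "The interpreter confirms it, so the answer is 25."
    return "Let me verify with code. <python>print(5 ** 2)</python>"

print(agentic_loop("What is 5 squared?", toy_generate))
```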

Test-Time and Input-Time Scaling

Methods such as test-time budget forcing (Muennighoff et al., 31 Jan 2025) and input-time scaling (Huang et al., 19 Aug 2025) introduce compute- or input-centric scaling. The former controls the number of reasoning tokens generated at inference, allowing models to "reflect" or "double-check" their solutions under a controlled compute budget. The latter employs meta-cognitive persona augmentation (modifying queries with domain-relevant, irrelevant, or random persona tags), paired with training-testing co-design, to elicit stronger reasoning performance even when applied to small, minimally filtered datasets.
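
A minimal budget-forcing sketch in the spirit of s1 is shown below: decoding stops at the end-of-thinking delimiter, a continuation token such as "Wait" is appended to force further reflection, and only after the budgeted extensions is the final answer emitted. The model_generate placeholder, the </think> delimiter, and the fixed extension count are assumptions for illustration; real implementations operate on token IDs and logit masks rather than strings.

```python
def budget_forced_generate(model_generate, prompt: str,
                           extensions: int = 2, end_think: str = "</think>") -> str:
    """Force extra reflection by suppressing the end-of-thinking delimiter.

    `model_generate(text, stop)` stands in for any LLM call that decodes
    until the `stop` string (or to completion when stop is None).
    """
    text = prompt
    for _ in range(extensions):
        text += model_generate(text, stop=end_think)
        text += " Wait,"                          # overwrite the stop; keep thinking
    text += model_generate(text, stop=end_think) + end_think
    text += model_generate(text, stop=None)       # now emit the final answer
    return text

# Toy stand-in so the sketch runs end to end.
def toy_generate(text: str, stop=None) -> str:
    return " ...more checking of the algebra..." if stop else " Final answer: 073"

print(budget_forced_generate(toy_generate, "Problem: ...")[-60:])
```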

3. Data Curation, Selection, and Efficiency

Data-centric approaches for AIME24 focus on maximizing the efficacy of limited training examples through innovative curation and sampling:

  • Offline Curation: Sample selection based on difficulty alignment, diversity, and influence (PageRank-weighted DPP), seeking to prune redundancy and overrepresentation (Tang et al., 1 Sep 2025).
  • Online Explorativity Filtering: Dynamic rollout pruning based on sample-level exploration metrics, with replay for underexplored examples to speed up convergence while maintaining accuracy with only a fraction of the data (Tang et al., 1 Sep 2025, Rao et al., 22 May 2025).
  • Dynamic Data Selection: Iterative data selection frameworks (e.g., SAI-DPO (Rao et al., 22 May 2025)) monitor model weaknesses in real time, adjusting the sampling probability for knowledge clusters where the model underperforms, thereby focusing learning on the dynamic margin of competence (a minimal sampling sketch follows this list).
  • Difficulty Flipping and Influence Attribution: Using influence functions to attribute model performance to training examples, then reweighting the dataset by emphasizing high-difficulty math and low-difficulty code for cross-domain transfer, resulting in measurable gains (e.g., doubling AIME24 accuracy from 10% to 20% for Qwen2.5-7B-Instruct (Kou et al., 26 May 2025)).
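
The sketch below illustrates the dynamic-selection idea in its simplest form: sampling probabilities over knowledge clusters are recomputed from recent per-cluster accuracy so that weak areas are visited more often. The softmax-over-deficit rule and the temperature are illustrative assumptions rather than the SAI-DPO procedure itself.

```python
import numpy as np

def cluster_sampling_probs(accuracy_by_cluster: dict[str, float],
                           temperature: float = 0.5) -> dict[str, float]:
    """Upweight knowledge clusters where the model currently underperforms.

    Sampling probability is a softmax over (1 - recent accuracy) per cluster,
    so the training mix tracks the model's current margin of competence.
    """
    names = list(accuracy_by_cluster)
    deficits = np.array([1.0 - accuracy_by_cluster[n] for n in names])
    logits = deficits / temperature
    probs = np.exp(logits - logits.max())   # stable softmax
    probs /= probs.sum()
    return dict(zip(names, probs))

# The model is weakest on combinatorics, so it is sampled most often.
print(cluster_sampling_probs({"algebra": 0.8, "geometry": 0.6, "combinatorics": 0.3}))
```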

4. Inference Paradigms, Scoring, and Aggregation

AIME24 evaluation and solution selection employ advanced inference and global ranking strategies to mitigate scoring noise and non-independent decision biases:

  • Monte Carlo Tree Search with Self-Refinement (SR-MCTS): Explores diverse solution paths in the solution space, employing self-critique and rewriting capabilities of LLMs to expand beyond greedy, chain-of-thought search. The SR-MCTS framework leverages the UCB formula for node/action selection, enabling deeper and more efficient solution space traversal (Zhang et al., 3 Oct 2024).
  • Pairwise Preference Reward Models and Enhanced Borda Count (EBC): Rather than scoring single-step or greedy solutions, these models compare and synthesize entire reasoning trajectories via pairwise assessments, then aggregate global preference rankings robustly using EBC. This resolves drawbacks in previous methods afflicted by score variability and non-independent path distributions, producing more reliable answer rankings (Zhang et al., 3 Oct 2024); a simplified aggregation sketch follows this list.
  • Symbolic and Multi-instance Evaluation: The AIME24 benchmark has been transformed to VAR-AIME24 (Yao et al., 17 Jul 2025), where each problem is parameterized symbolically and evaluated over multiple instantiations. This paradigm addresses benchmark contamination and evaluation fragility—performance drops of 58.3% are observed compared to conventional AIME24, exposing superficial generalization in RL-trained models and underscoring the necessity for structurally robust reasoning.
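
As a simplified stand-in for EBC, the sketch below aggregates pairwise preferences over whole reasoning trajectories with a plain Borda count. The prefer callback is a placeholder for a pairwise preference reward model, and the toy length-based preference exists only to make the example runnable; the actual Enhanced Borda Count adds robustness refinements beyond this.

```python
from itertools import combinations

def borda_rank(candidates: list[str], prefer) -> list[str]:
    """Rank candidate reasoning trajectories from pairwise preferences.

    `prefer(a, b)` returns whichever of the two trajectories the preference
    reward model judges better; each pairwise win earns one Borda point.
    """
    scores = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        scores[prefer(a, b)] += 1
    return sorted(candidates, key=scores.get, reverse=True)

# Toy preference model: prefer the longer (more detailed) trajectory.
trajectories = ["short sketch", "medium derivation with checks", "full derivation"]
print(borda_rank(trajectories, prefer=lambda a, b: a if len(a) >= len(b) else b))
```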

5. Empirical Performance and SOTA Comparisons

AIME24 is a competitive venue for state-of-the-art reporting across model sizes and training approaches:

| Model / Strategy | Parameter Count | AIME24 Metric (Specified) |
|---|---|---|
| START (Hint-RFT) (Li et al., 6 Mar 2025) | 32B | 95.0% (accuracy, pass@1) |
| rStar2-Agent (Shang et al., 28 Aug 2025) | 14B | 80.6% (pass@1), concise outputs |
| MiroMind-M1-RL-32B (Li et al., 19 Jul 2025) | 32B | ~77.5 (accuracy, pass@1) |
| Skywork-OR1-32B (He et al., 28 May 2025) | 32B | 82.2 (avg@32) |
| Input Time Scaling (Huang et al., 19 Aug 2025) | 32B | 86.7% (DeepSeek-R1, best S-S mode) |
| s1-32B (Budget Forcing) (Muennighoff et al., 31 Jan 2025) | 32B | ~56.7% (pass@1; scalable w/ tokens) |
| SRPO (Zhang et al., 19 Apr 2025) | 32B | 50.0 (pass@1), efficient training |
| SEED-GRPO (Chen et al., 18 May 2025) | 7B | 56.7% (pass@1, G=16) |
| QuestA (Li et al., 17 Jul 2025) | 1.5B | 67.1% (pass@1, augmented RL) |

Several trends emerge:

  • Tool-augmented and agentic models (e.g., START, rStar2-Agent) reach frontier-level accuracies with smaller parameter counts due to robust code execution, self-verification, and targeted RL strategies.
  • Context-aware, multi-stage RL optimization and input-level scaling approaches (e.g., MiroMind-M1, Input Time Scaling) further elevate open models to the upper bound of performance measured on AIME24.
  • Efficient RLVR pipelines (e.g., DEPO) reach nearly full-data performance with a 1.85x training speedup, using only a 20% curated subset and adaptive online data selection (Tang et al., 1 Sep 2025).
  • On symbolic variants (VAR-AIME24), performance of RL-trained models is significantly degraded (–58.3% on average), demonstrating limitations in structural generalization compared to numeric memorization (Yao et al., 17 Jul 2025).

6. Methodological Innovations and Open Problems

Emerging lines of research are defined by:

  • Explicit Uncertainty Quantification: Methods such as SEED-GRPO modulate policy gradient updates according to semantic entropy, avoiding spurious learning from high-uncertainty prompts and concentrating optimization on confidently answered questions (Chen et al., 18 May 2025); a toy weighting sketch follows this list.
  • Adaptive Curriculum and Expert Guidance: Strategies like Adaptive Difficulty Curriculum Learning (ADCL) and Expert-Guided Self-Reformulation (EGSR) dynamically align problem difficulty with model state and inject expert-like guidance only when model rollouts fail, producing synergistic gains on AIME24 and related benchmarks (Zhang et al., 13 May 2025).
  • Data/Compute-Efficient Scaling: Developments in budget forcing, test-time scaling, and data efficiency (DEPO, SAI-DPO) are reshaping the landscape for training and deploying math reasoners under resource constraints.
  • Prototype-Based Reasoning: Abstraction to formal prototype spaces (e.g., Prolog for logic, PDDL for planning) yields additional gains in cross-domain generalization, though impact on AIME24 to date is modest (+1%) (He et al., 18 Jun 2025).
  • Benchmark Robustness and Overfitting: Contamination and evaluation fragility motivate the adoption of multi-instance, symbolic evaluations (VAR-MATH) and expose the overfitting of many RL approaches to static, numeric template benchmarks (Yao et al., 17 Jul 2025).
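
The toy sketch below illustrates the uncertainty-weighting idea behind SEED-GRPO: the spread of final answers within a rollout group serves as a cheap proxy for semantic entropy, and the resulting weight (1 for full agreement, smaller as answers diverge) would scale the policy-gradient update for that prompt. The answer-equality proxy and the normalization are assumptions for illustration, not the paper's exact procedure.

```python
import math
from collections import Counter

def semantic_entropy_weight(final_answers: list[str]) -> float:
    """Down-weight updates on prompts whose sampled answers disagree.

    Entropy of the empirical distribution of final answers across a rollout
    group is normalized to [0, 1] and inverted, so confident groups keep
    full weight and high-uncertainty groups are damped.
    """
    counts = Counter(final_answers)
    n = len(final_answers)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(n)
    return 1.0 - entropy / max_entropy if max_entropy > 0 else 1.0

print(semantic_entropy_weight(["73", "73", "73", "73"]))   # 1.0: full agreement
print(semantic_entropy_weight(["73", "12", "804", "73"]))  # ~0.25: split group
```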

7. Future Directions and Open Resources

  • Training/Testing Co-Design: Joint adaptation of training and inference strategies (as in persona-controlled input-time scaling) yields compounding performance gains.
  • Open-Source Ecosystem: Leading efforts (MiroMind-M1 (Li et al., 19 Jul 2025), Input Time Scaling (Huang et al., 19 Aug 2025), rStar2-Agent (Shang et al., 28 Aug 2025)) are fully open-sourced, offering datasets, code, and recipes for reproducibility and further innovation.
  • Broader Applications: Agentic, tool-integrated approaches point to future generalization not only in math, but also code, alignment, and scientific reasoning tasks, with the pipeline infrastructure supporting rapid scaling and low-resource fine-tuning.
  • Continued Benchmark Evolution: The AIME24 suite is expected to be continually adapted (VAR-AIME24, knowledge clustering, difficulty stratification) to address current limitations, encourage robust generalization, and better align model development with long-term progress in machine mathematical reasoning.

AIME24 remains a crucial metric for the field, anchoring the development and comparative analysis of advanced LLMs for mathematical reasoning and serving as both a proving ground and stress test for algorithmic, data-centric, and agentic innovations in LLM research.
