
AIME24 Benchmark for LLM Math Reasoning

Updated 3 December 2025
  • The AIME24 benchmark evaluates advanced mathematical reasoning in LLMs through 15 integer-answer problems spanning algebra, number theory, combinatorics, and geometry.
  • The evaluation methodology employs multi-sample metrics like Accuracy@k and pass@k, alongside symbolic variabilization (VAR-AIME24) to assess consistency and resistance to memorization.
  • Innovative protocols such as Input Time Scaling (ITS) and reinforcement learning with verifiable reward (RLVR) drive training improvements, yet challenges in true reasoning generalization remain.

The AIME24 benchmark refers to the 2024 American Invitational Mathematics Examination, which has rapidly become a standard testbed for evaluating advanced mathematical reasoning capabilities in LLMs. It serves not only as a standalone mathematical challenge but also as the nucleus for protocol innovations in contamination resistance and consistency-versus-accuracy measurement. The benchmark's composition, evaluation protocols, and impact on LLM research are summarized below.

1. Structure of the AIME24 Benchmark

AIME24 consists of 15 free-response math problems, each requiring a single integer answer between 0 and 999. The problems span a wide range of advanced high school mathematics, including:

  • Algebra: polynomial identities, functional equations
  • Number Theory: modular arithmetic, Diophantine equations
  • Combinatorics: counting under explicit and implicit constraints
  • Euclidean Geometry: computations with lengths, areas, angles

Each problem admits exactly one valid integer answer, facilitating automated evaluation. The benchmark’s choice of problems reflects themes requiring multistep symbolic reasoning and chain-of-thought (CoT) capabilities. Performance on AIME24 is typically measured under sampling-based protocols: for example, “accuracy@64” scores a problem as correct if at least one of 64 independently sampled rollouts yields the correct integer (Li et al., 19 Jul 2025).
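
For concreteness, this any-rollout-correct scoring can be sketched as follows; the function and toy data are illustrative and not taken from the cited work.

```python
from typing import List

def accuracy_at_k(sampled_answers: List[List[int]], gold: List[int]) -> float:
    """accuracy@k: a problem counts as solved if ANY of its k sampled rollouts
    yields the reference integer; returns the fraction of solved problems."""
    solved = sum(1 for answers, g in zip(sampled_answers, gold) if g in answers)
    return solved / len(gold)

# Toy run with k = 4 rollouts per problem: problem 0 is solved, problem 1 is not.
print(accuracy_at_k([[113, 204, 113, 7], [5, 5, 5, 5]], gold=[204, 73]))  # 0.5
```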

2. Evaluation Methodologies and Metrics

AIME24 has motivated the adoption of multi-sample metrics to quantify LLM reasoning abilities. Standard evaluation includes:

  • Accuracy@k: For each problem, the model is sampled $k$ times. If any of the $k$ outputs matches the correct answer, the problem is scored as correct.
  • pass@k: Used particularly for greedy decoding ($k = 1$), defined as

$$\mathrm{pass@}k = 1 - \frac{\binom{N-n}{k}}{\binom{N}{k}}$$

where $N$ is the total number of samples and $n$ is the number of correct ones (Huang et al., 19 Aug 2025).
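
This estimator is straightforward to compute; the sketch below mirrors the formula above (the helper function itself is illustrative).

```python
from math import comb

def pass_at_k(N: int, n: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn without
    replacement from N generations (n of them correct) is correct."""
    if N - n < k:  # fewer than k incorrect samples, so a correct one is guaranteed
        return 1.0
    return 1.0 - comb(N - n, k) / comb(N, k)

print(pass_at_k(N=64, n=8, k=1))  # 0.125, equals n/N when k = 1
print(pass_at_k(N=64, n=8, k=8))  # ≈ 0.68
```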

AIME24’s fully verifiable integer answers make it ideal for both supervised fine-tuning and reinforcement learning with verifiable reward (RLVR) (Li et al., 19 Jul 2025).

3. Symbolic Variabilization: VAR-AIME24

AIME24 has been extended into a contamination-resistant, symbolically variabilized format called VAR-AIME24, under the VAR-MATH evaluation framework (Yao et al., 17 Jul 2025). Key features are:

  • Template Construction: Each problem is structurally abstracted into a symbolic template, replacing all fixed numeric constants (except mathematical constants such as $\pi$, which appear in 6 of the 30 problems) with parameters. For example, instead of "Find the number of ordered triples $(x, y, z)$ of positive integers with $x + y + z = 2024$," the template

$$\bigl|\{(x,y,z)\in(\mathbb{Z}^{+})^{3} : x+y+z = n\}\bigr| = \binom{n-1}{2}$$

is created, with $n$ an integer parameter (a brute-force check of this closed form appears at the end of this section).

  • Instance Sampling: Each template is instantiated $K = 5$ times with different numeric settings, yielding 150 concrete problems from 30 templates.
  • Domain Constraints: Instantiation domains are tailored to each template, maintaining semantic properties such as positivity or integrality.
  • Consistency Metric: For each template, models only receive credit if all $K$ instantiations are solved exactly. The consistency rate is

$$\mathrm{Cons}(M) = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl[\forall j:\ A_{i,j}=1\bigr]$$

where $N$ is the number of templates and $A_{i,j}\in\{0,1\}$ is the correctness on the $j$-th instantiation of the $i$-th template.
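
In code, this all-or-nothing scoring reduces to a row-wise conjunction over the correctness matrix; the sketch below is illustrative, and its variable names are not taken from the VAR-MATH release.

```python
from typing import List

def consistency_rate(A: List[List[int]]) -> float:
    """Cons(M): fraction of templates whose instantiations are ALL answered correctly.
    A[i][j] = 1 iff the j-th instantiation of the i-th template was solved exactly."""
    return sum(all(row) for row in A) / len(A)

# Toy example: 3 templates, K = 5 instantiations each; only the first is fully solved.
A = [[1, 1, 1, 1, 1],
     [1, 1, 1, 1, 0],
     [0, 1, 1, 1, 1]]
print(consistency_rate(A))  # 0.333...
```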

This methodology reveals whether models have learned genuine mathematical reasoning or merely memorized fixed patterns, and sharply reduces benchmark contamination.
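
Returning to the template example above, the closed form can be sanity-checked by brute force for small parameter values; the snippet below is an illustrative check, not part of the VAR-MATH tooling.

```python
from itertools import product
from math import comb

def count_triples(n: int) -> int:
    """Brute-force count of ordered positive-integer triples with x + y + z = n."""
    return sum(1 for x, y, z in product(range(1, n), repeat=3) if x + y + z == n)

for n in (5, 10, 25):
    assert count_triples(n) == comb(n - 1, 2)
print(comb(2024 - 1, 2))  # closed-form count for the original n = 2024 instantiation
```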

4. Empirical Results and Model Performance

Direct evaluations on AIME24 and VAR-AIME24 have exposed substantial overfitting and reasoning limitations in current LLMs. Key findings include:

| Model | AIME24 Consistency (%) | VAR-AIME24 Consistency (%) | Relative Drop (%) |
|---|---|---|---|
| Qwen2.5-Math-7B | 10.8 | 3.3 | –69.3 |
| SimpleRL-Zoo-7B | 23.8 | 8.3 | –64.9 |
| Skywork-OR1-Math-7B | 41.5 | 24.4 | –41.2 |
| DeepSeek-R1-0528 (frontier) | 83.3 | 73.3 | –12.0 |
| OpenAI-o4-mini-high (closed) | 90.0 | 73.3 | –18.5 |
| SEED-THINK-v1.6 (frontier) | 93.3 | 86.7 | –7.1 |

Empirical analyses reveal that smaller RL-tuned models typically suffer a 40–75% drop in symbolic consistency under VAR-AIME24, indicating substantial reliance on superficial heuristics rather than robust generalization. Even top-tier models experience 7–18% losses.

5. Protocol Innovations: Input Time Scaling and Data Paradigms

AIME24 has catalyzed research into new training and evaluation paradigms such as Input Time Scaling (ITS) (Huang et al., 19 Aug 2025):

  • Input Time Scaling (ITS): Shifts computational and data investment to input prompt refinement (e.g., meta-knowledge personas) at both training and inference time. ITS uses automatic persona induction, appending relevant, irrelevant, or random persona statements to both training and test prompts (a toy sketch follows this list).
  • Training–Testing Co-Design: ITS demonstrates that persona strategies must be congruent at training and test time; mismatched application leads to severe degradation.
  • Findings on Data “Quality”: ITS experiments show that diverse, seemingly low-quality data or context can outperform hand-curated high-quality datasets (“garbage in, garbage out” does not always hold).
  • State-of-the-Art Results: With ITS, Qwen2.5-32B-Instruct achieves 76.7% pass@1, while DeepSeek-R1-Distill-Qwen-32B reaches 86.7%—well above prior open-source baselines.
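
To make the prompt-side intervention concrete, the toy sketch below illustrates persona-augmented prompting under the train-test co-design constraint. The persona strings, pool structure, and helper function are hypothetical placeholders and do not reproduce the automatic persona-induction pipeline of Huang et al. (19 Aug 2025).

```python
import random

# Hypothetical persona pools; ITS induces personas automatically, so these
# strings are placeholders for illustration only.
RELEVANT_PERSONAS = ["You are a competition mathematician who writes rigorous, step-by-step solutions."]
IRRELEVANT_PERSONAS = ["You are a marine biologist cataloguing deep-sea species."]

def build_prompt(problem: str, strategy: str = "relevant") -> str:
    """Attach a persona statement to the problem prompt. The same strategy must be
    applied at training and test time (training-testing co-design)."""
    pool = {"relevant": RELEVANT_PERSONAS, "irrelevant": IRRELEVANT_PERSONAS}[strategy]
    persona = random.choice(pool)
    return f"{persona}\n\nProblem: {problem}\n\nGive the final integer answer."

print(build_prompt("Find the number of ordered triples of positive integers with x + y + z = 30."))
```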

6. Systematic Training and RL Innovations

Training regimens for state-of-the-art models on AIME24 integrate large-scale supervised fine-tuning (SFT) with reinforcement learning using verifiable reward (RLVR) (Li et al., 19 Jul 2025):

  • Data Curation: Datasets for SFT (719K verified CoT traces) and RLVR (62K hard verifiable problems) are meticulously deduplicated and decontaminated against AIME24/25.
  • RL Algorithms: The Context-Aware Multi-Stage Policy Optimization (CAMPO) algorithm introduces a repetition penalty and staged rollout-length curriculum, enhancing learning stability and token efficiency.
  • Ablation Studies: RLVR confers substantial gains over SFT-only training (e.g., a +13% absolute improvement for 7B models).
  • Open Science: All code, training data, and evaluation protocols have been released, enabling full reproducibility.
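
The CAMPO reward shaping itself is not reproduced here, but the underlying verifiable-reward signal is simple to sketch. Assuming, purely for illustration, that the last integer in a completion is taken as the model's final answer, a binary reward can be computed as follows.

```python
import re

def verifiable_reward(completion: str, gold_answer: int) -> float:
    """Binary RLVR-style reward for AIME-style problems: 1.0 iff the final integer
    in the completion matches the reference answer (an integer in [0, 999])."""
    matches = re.findall(r"-?\d+", completion)
    if not matches:
        return 0.0
    return 1.0 if int(matches[-1]) == gold_answer else 0.0

print(verifiable_reward("... so the remainder is 073.", 73))  # 1.0
print(verifiable_reward("The answer is 110.", 73))            # 0.0
```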

7. Impact, Limitations, and Future Directions

AIME24 and its symbolic extension VAR-AIME24 have rapidly established themselves as rigorous benchmarks for mathematical reasoning, with the symbolic variant explicitly designed for contamination resistance. They:

  • Provide strong defenses against contamination and memorization via symbolic problem variation.
  • Demand reasoning that is consistent across numeric and abstract problem settings.
  • Expose persistent failures to generalize even in high-capacity models, underlining the current limits in LLM mathematical understanding.
  • Drive the development of new training-evaluation paradigms (ITS, RLVR, CAMPO) and robust assessment methodologies.

A plausible implication is that, while LLMs have achieved notable advances in standard AIME24 accuracy, true reasoning generalization—as demanded by VAR-AIME24—remains an open challenge, motivating further research into training regimes, evaluation strategies, and architectural innovations (Yao et al., 17 Jul 2025, Huang et al., 19 Aug 2025, Li et al., 19 Jul 2025).
