AMO-Bench: Rigorous Benchmark Suite

Updated 1 November 2025
  • AMO-Bench is a high-fidelity benchmark suite featuring 50 original, IMO-level problems across diverse mathematical domains for testing the advanced reasoning capabilities of LLMs.
  • It employs expert-driven design, blind quality review, and dual automated grading (parser and LLM-based) to ensure accuracy, originality, and robust evaluation.
  • Empirical findings show the best model scoring only 52.4%, highlighting significant headroom for improvement in advanced problem solving and model generalization.

AMO-Bench is a term applied across multiple domains, most notably in atomistic machine learning and advanced mathematical reasoning with Olympiad-level tasks. It designates rigorously constructed, high-fidelity benchmark suites tailored to evaluate state-of-the-art models on technically demanding tasks, whether in material property prediction, generative crystal structure modeling, or mathematical problem solving. This entry focuses on AMO-Bench as described in (An et al., 30 Oct 2025), with contextual notes on its relation to atomistic ML benchmarks and other research usages.

1. Benchmark Definition and Rationale

AMO-Bench, within the context of mathematical reasoning, is a comprehensive benchmark suite consisting of 50 entirely original, human-crafted problems at or above International Mathematical Olympiad (IMO) difficulty. Its principal motivation derives from the observed saturation on existing benchmarks (e.g., AIME24/25, HMMT, MATH500), where advanced LLMs routinely exceed 90% accuracy, rendering these datasets insufficient for tracking the upper bounds of current model performance. AMO-Bench is explicitly constructed to overcome these limitations by ensuring: (1) deep problem complexity; (2) robust measures against data contamination/memorization; and (3) an answer format compatible with automated grading.

In atomistic machine learning, AMO-Bench is employed for benchmarking forward property prediction tasks (e.g., surrogates for DFT properties), while the complementary AtomBench (Campbell et al., 17 Oct 2025) evaluates inverse generative models for atomic structures.

2. Problem Construction and Validation Protocol

The benchmark's pipeline incorporates multiple expert-driven layers:

  1. Data Creation: All problems and annotated solutions originate from mathematicians with advanced competition backgrounds.
  2. Blind Quality Review: At least three independent experts review each problem for correctness, clarity, and mathematical substance.
  3. Originality Filtering: A 10-gram overlap check and web-scale search ensure strict novelty, excluding derivative material (a minimal sketch of such a check follows this list).
  4. Difficulty Verification: Each problem undergoes IMO-standard assessment. Additionally, state-of-the-art LLMs (e.g., GPT, DeepSeek, Gemini) attempt each problem; if two or more models solve a problem in three independent trials, it is excluded.

Problems are distributed across five mathematical domains: Algebraic Equations & Inequalities (11), Functions & Sequences (13), Geometry (5), Number Theory (9), and Combinatorics (12).

3. Evaluation Methodology and Automated Grading

AMO-Bench utilizes a dual grading approach:

  • Parser-based grading: Applied to problems with numerical, set, or variable-expression answers (39/50). Final answers are required in boxed LaTeX format on a closing line of the form "### The final answer is: \boxed{...}" (a simplified grading sketch follows this list).
  • LLM-based grading: For descriptive/open-ended answers (11/50), an LLM grader (o4-mini), with majority voting over five samples, determines correctness.
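
A simplified sketch of the parser-based path is given below, assuming regex extraction of the final \boxed{...} payload and plain string comparison; the benchmark's actual parser is not reproduced here and may perform richer normalization (e.g., symbolic equivalence):

```python
import re

# Matches \boxed{...}, allowing one level of nested braces inside the payload.
BOXED_RE = re.compile(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}")

def extract_boxed(response: str) -> str | None:
    r"""Return the payload of the last \boxed{...} in a model response, if any."""
    matches = BOXED_RE.findall(response)
    return matches[-1].strip() if matches else None

def parser_grade(response: str, reference: str) -> bool:
    """Exact match after stripping whitespace; a production grader would also
    canonicalize mathematically equivalent expressions."""
    answer = extract_boxed(response)
    if answer is None:
        return False
    normalize = lambda s: re.sub(r"\s+", "", s)
    return normalize(answer) == normalize(reference)

# Example:
# parser_grade("### The final answer is: \\boxed{1382935444}", "1382935444")  # -> True
```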

A manual audit reveals a grading accuracy of 99.2% over 1,000 model outputs. Sampling is performed at k = 32 per problem (AVG@32), mitigating stochastic variance. All data and code are released at amo-bench.github.io.
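
Here AVG@32 denotes the mean accuracy over the 32 samples drawn per problem, while pass@32 (reported below) counts a problem as solved if any of its 32 samples is correct. A minimal sketch of both metrics over a boolean grade matrix, not the authors' evaluation code:

```python
import numpy as np

def avg_at_k(grades: np.ndarray) -> float:
    """grades: boolean array of shape (num_problems, k), one entry per sample.
    AVG@k is the mean accuracy over all problems and samples."""
    return float(grades.mean())

def pass_at_k(grades: np.ndarray) -> float:
    """Fraction of problems with at least one correct sample among the k drawn."""
    return float(grades.any(axis=1).mean())

# Example: grades = np.zeros((50, 32), dtype=bool) filled from grader outputs;
# avg_at_k(grades) and pass_at_k(grades) then give AVG@32 and pass@32.
```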

Problem Type          Grader   Format / Example
Numerical / Set       Parser   \boxed{1382935444}
Variable-expression   Parser   \boxed{f(n) = ...}
Descriptive           LLM      Structured text output

4. Empirical Results and Model Performance

A cohort of 26 LLMs, covering both proprietary and advanced open-source models, was assessed. The best model, GPT-5-Thinking (High), registered an accuracy of 52.4% (AVG@32), and most models fell below 40% on the full benchmark. The best open-source model trails the best proprietary model by approximately 5 percentage points. Notably, several non-reasoning, instruction-tuned models (e.g., Qwen3-Max-Instruct, LongCat-Flash) outperform specialized mathematical reasoners, indicating opportunities for further synergy between general instruction tuning and dedicated reasoning training.

Results are reported on both the full AMO-Bench and the parser-friendly subset ("AMO-Bench-P", 39 problems). On the latter, the highest score is 54.8%.

Model                      AMO-Bench AVG@32 (%)
GPT-5-Thinking (High)      52.4
Qwen3-235B-A22B-Thinking   47.8
DeepSeek-V3.1-Thinking     47.6
LongCat-Flash-Thinking     43.6
o4-mini (High)             40.2
Gemini-2.5-Pro             38.7

Additional scaling analyses reveal that model accuracy improves predictably with output token volume; top models generate significantly longer solutions on AMO-Bench (around 37,000 tokens, versus roughly 7,000 on legacy benchmarks such as AIME25). Accuracy continues to rise with the logarithm of generation length without saturating, and pass@32 rates exceed 70%, suggesting latent reasoning capacity not yet fully harnessed.
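
The reported relationship between accuracy and generation length can be inspected with a simple logarithmic fit; the snippet below is an illustrative least-squares sketch, not the authors' analysis code:

```python
import numpy as np

def fit_log_scaling(token_counts, accuracies):
    """Fit accuracy ~ a + b * ln(tokens) by least squares and return (a, b).
    A positive slope b that persists at the largest token budgets indicates
    the accuracy-versus-length curve has not yet saturated."""
    x = np.log(np.asarray(token_counts, dtype=float))
    y = np.asarray(accuracies, dtype=float)
    b, a = np.polyfit(x, y, 1)  # highest-degree coefficient first: slope, then intercept
    return float(a), float(b)
```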

5. Implications for LLM Reasoning and Generalization

AMO-Bench sharply delineates the limitations of current LLMs in advanced mathematical reasoning. With no model crossing the 60% line, substantial progress remains necessary. The design—final-answer-only, cross-validated for difficulty and originality, and robustly graded—advances benchmark integrity by eliminating contamination, bias, and overfitting pathways present in prior datasets.

The presence of human-annotated stepwise solutions for all problems (included in the release but not used in grading) supports error analysis, prompt engineering, and possible future training/fine-tuning protocols. The strong scaling law observed (logarithmic improvement as token count increases) suggests continued advances are possible through sampling, reranking, chain-of-thought prompting, or RL-based optimization.

6. Technical and Community Significance

AMO-Bench's rigorous construction and robust automated evaluation pipeline position it as a gold standard for mathematical reasoning benchmarks in LLM research. Its impact is amplified by public availability, explicit annotations, specification of problem categories, and detailed reporting. For the broader AI and benchmarking community, AMO-Bench augments the spectrum of high-fidelity tasks that demand generalization, creativity, and multi-step reasoning.

Given the saturation in earlier benchmarks, AMO-Bench is instrumental in differentiating advances among models, monitoring progress, and avoiding ephemeral gains attributable to memorization or brute sampling. The technical depth and transparency of its method set a precedent for future benchmark suites across domains.

7. Relation to Atomistic ML and Alternative "AMO-Bench" Usages

In the atomistic ML literature, AMO-Bench also denotes benchmarking suites for forward property prediction tasks—such as the evaluation of ML surrogates for DFT-calculated material properties. These are distinct from the mathematical AMO-Bench, although both share an emphasis on original, contamination-resistant datasets, rigorous metrics, and transparent public leaderboards. AtomBench (Campbell et al., 17 Oct 2025) expands this infrastructure to generative models; AMO-Bench thereby denotes a family of robust, domain-specific benchmarks across computational science.

Summary Table: AMO-Bench Features (Mathematical Reasoning)

Aspect            Characteristic
Problems          50 IMO-level, original, multi-domain
Categories        Algebra, Functions/Sequences, Geometry, Number Theory, Combinatorics
Validation        Multi-expert review, novelty search, LLM-based filtering
Answer Format     Final answer, boxed LaTeX
Grading           Parser (39/50), LLM (11/50); 99.2% overall accuracy
Top Model Score   52.4% (GPT-5-Thinking; AVG@32)
Release           Public (amo-bench.github.io)

Representative Problem (LaTeX Format)

Let x_1, x_2, \ldots, x_{2024} be positive real numbers such that x_k + x_m \geq km for any 1 \leq k < m \leq 2024.
Find the minimum value of x_1 + x_2 + \cdots + x_{2024}.
Answer: \boxed{1382935444}

Conclusion

AMO-Bench establishes an authoritative standard for benchmarking mathematical reasoning in LLMs, combining technical rigor, originality, and grading robustness well beyond prior art. Its empirical results demonstrate considerable remaining headroom for current models, and its methodology offers a template for future benchmark development in both mathematical and scientific domains.
