AMO-Bench: Rigorous Benchmark Suite

Updated 1 November 2025
  • AMO-Bench is a high-fidelity benchmark suite featuring 50 original, IMO-level problems across diverse mathematical domains for testing the advanced reasoning capabilities of LLMs.
  • It employs expert-driven design, blind quality review, and dual automated grading (parser and LLM-based) to ensure accuracy, originality, and robust evaluation.
  • Empirical findings show the best model scoring only 52.4%, highlighting significant headroom for improvement in advanced problem solving and model generalization.

AMO-Bench is a term applied across multiple domains, most notably in atomistic machine learning and advanced mathematical reasoning with Olympiad-level tasks. It designates rigorously constructed, high-fidelity benchmark suites tailored to evaluate state-of-the-art models on technically demanding tasks, whether in material property prediction, generative crystal structure modeling, or mathematical problem solving. This entry focuses on AMO-Bench as described in (An et al., 30 Oct 2025), with contextual notes on its relation to atomistic ML benchmarks and other research usages.

1. Benchmark Definition and Rationale

AMO-Bench, within the context of mathematical reasoning, is a comprehensive benchmark suite consisting of 50 entirely original, human-crafted problems at or above International Mathematical Olympiad (IMO) difficulty. Its principal motivation derives from the observed saturation on existing benchmarks (e.g., AIME24/25, HMMT, MATH500), where advanced LLMs routinely exceed 90% accuracy, rendering these datasets insufficient for tracking the upper bounds of current model performance. AMO-Bench is explicitly constructed to overcome these limitations by ensuring: (1) deep problem complexity; (2) robust measures against data contamination/memorization; and (3) an answer format compatible with automated grading.

In atomistic machine learning, AMO-Bench is employed for benchmarking forward property prediction tasks (e.g., surrogates for DFT properties), while the complementary AtomBench (Campbell et al., 17 Oct 2025) evaluates inverse generative models for atomic structures.

2. Problem Construction and Validation Protocol

The benchmark's pipeline incorporates multiple expert-driven layers:

  1. Data Creation: All problems and annotated solutions originate from mathematicians with advanced competition backgrounds.
  2. Blind Quality Review: At least three independent experts review each problem for correctness, clarity, and mathematical substance.
  3. Originality Filtering: A 10-gram overlap check and web-scale search ensure strict novelty, excluding derivative material (a minimal sketch of such a check follows this list).
  4. Difficulty Verification: Each problem undergoes IMO-standard assessment. Additionally, state-of-the-art LLMs (e.g., GPT, DeepSeek, Gemini) attempt each problem; if two or more models solve a problem in three independent trials, it is excluded.

Problems are distributed across five mathematical domains: Algebraic Equations & Inequalities (11), Functions & Sequences (13), Geometry (5), Number Theory (9), and Combinatorics (12).

3. Evaluation Methodology and Automated Grading

AMO-Bench utilizes a dual grading approach:

  • Parser-based grading: Applied to problems with numerical, set, or variable-expression answers (39/50). Final answers are required in boxed LaTeX format on a closing line of the form "### The final answer is: \boxed{...}" (a simplified grading sketch follows this list).
  • LLM-based grading: For descriptive/open-ended answers (11/50), an LLM grader (o4-mini), with majority voting over five samples, determines correctness.
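
A simplified sketch of the parser-based path is given below, assuming regex extraction of the final \boxed{...} payload and plain string comparison; the benchmark's actual parser is not reproduced here and may perform richer normalization (e.g., symbolic equivalence):

```python
import re

# Matches \boxed{...}, allowing one level of nested braces inside the payload.
BOXED_RE = re.compile(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}")

def extract_boxed(response: str) -> str | None:
    r"""Return the payload of the last \boxed{...} in a model response, if any."""
    matches = BOXED_RE.findall(response)
    return matches[-1].strip() if matches else None

def parser_grade(response: str, reference: str) -> bool:
    """Exact match after stripping whitespace; a production grader would also
    canonicalize mathematically equivalent expressions."""
    answer = extract_boxed(response)
    if answer is None:
        return False
    normalize = lambda s: re.sub(r"\s+", "", s)
    return normalize(answer) == normalize(reference)

# Example:
# parser_grade("### The final answer is: \\boxed{1382935444}", "1382935444")  # -> True
```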

A manual audit reveals a grading accuracy of 99.2% over 1,000 model outputs. Sampling is performed at k = 32 per problem (AVG@32), mitigating stochastic variance. All data and code are released at amo-bench.github.io.
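
Here AVG@32 denotes the mean accuracy over the 32 samples drawn per problem, while pass@32 (reported below) counts a problem as solved if any of its 32 samples is correct. A minimal sketch of both metrics over a boolean grade matrix, not the authors' evaluation code:

```python
import numpy as np

def avg_at_k(grades: np.ndarray) -> float:
    """grades: boolean array of shape (num_problems, k), one entry per sample.
    AVG@k is the mean accuracy over all problems and samples."""
    return float(grades.mean())

def pass_at_k(grades: np.ndarray) -> float:
    """Fraction of problems with at least one correct sample among the k drawn."""
    return float(grades.any(axis=1).mean())

# Example: grades = np.zeros((50, 32), dtype=bool) filled from grader outputs;
# avg_at_k(grades) and pass_at_k(grades) then give AVG@32 and pass@32.
```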

Problem Type          Grader   Format / Example
Numerical / Set       Parser   \boxed{1382935444}
Variable-expression   Parser   \boxed{f(n) = ...}
Descriptive           LLM      Structured text output

4. Empirical Results and Model Performance

A cohort of 26 LLMs, covering both proprietary and advanced open-source models, was assessed. The best model, GPT-5-Thinking (High), registered an accuracy of 52.4% (AVG@32), and most models fell below 40% on the full benchmark. The best open-source model trails the best proprietary model by approximately 5 percentage points. Notably, several non-reasoning, instruction-tuned models (e.g., Qwen3-Max-Instruct, LongCat-Flash) outperform specialized mathematical reasoners, indicating opportunities for further synergy between general instruction tuning and dedicated reasoning training.

Results are reported on both the full AMO-Bench and the parser-friendly subset ("AMO-Bench-P", 39 problems). On the latter, the highest score is 54.8%.

Model                      AMO-Bench AVG@32 (%)
GPT-5-Thinking (High)      52.4
Qwen3-235B-A22B-Thinking   47.8
DeepSeek-V3.1-Thinking     47.6
LongCat-Flash-Thinking     43.6
o4-mini (High)             40.2
Gemini-2.5-Pro             38.7

Additional scaling analyses reveal that model accuracy improves predictably with output token volume; top models generate significantly longer solutions on AMO-Bench (around 37,000 tokens, versus roughly 7,000 on legacy benchmarks such as AIME25). Accuracy continues to rise with the logarithm of generation length without saturating, and pass@32 rates exceed 70%, suggesting latent reasoning capacity not yet fully harnessed.
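
The reported relationship between accuracy and generation length can be inspected with a simple logarithmic fit; the snippet below is an illustrative least-squares sketch, not the authors' analysis code:

```python
import numpy as np

def fit_log_scaling(token_counts, accuracies):
    """Fit accuracy ~ a + b * ln(tokens) by least squares and return (a, b).
    A positive slope b that persists at the largest token budgets indicates
    the accuracy-versus-length curve has not yet saturated."""
    x = np.log(np.asarray(token_counts, dtype=float))
    y = np.asarray(accuracies, dtype=float)
    b, a = np.polyfit(x, y, 1)  # highest-degree coefficient first: slope, then intercept
    return float(a), float(b)
```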

5. Implications for LLM Reasoning and Generalization

AMO-Bench sharply delineates the limitations of current LLMs in advanced mathematical reasoning. With no model crossing the 60% line, substantial progress remains necessary. The design—final-answer-only, cross-validated for difficulty and originality, and robustly graded—advances benchmark integrity by eliminating contamination, bias, and overfitting pathways present in prior datasets.

The presence of human-annotated stepwise solutions for all problems (included in the release but not used in grading) supports error analysis, prompt engineering, and possible future training/fine-tuning protocols. The strong scaling law observed (logarithmic improvement as token count increases) suggests continued advances are possible through sampling, reranking, chain-of-thought prompting, or RL-based optimization.

6. Technical and Community Significance

AMO-Bench's rigorous construction and robust automated evaluation pipeline position it as a gold standard for mathematical reasoning benchmarks in LLM research. Its impact is amplified by public availability, explicit annotations, specification of problem categories, and detailed reporting. For the broader AI and benchmarking community, AMO-Bench augments the spectrum of high-fidelity tasks that demand generalization, creativity, and multi-step reasoning.

Given the saturation in earlier benchmarks, AMO-Bench is instrumental in differentiating advances among models, monitoring progress, and avoiding ephemeral gains attributable to memorization or brute sampling. The technical depth and transparency of its method set a precedent for future benchmark suites across domains.

7. Relation to Atomistic ML and Alternative "AMO-Bench" Usages

In the atomistic ML literature, AMO-Bench also denotes benchmarking suites for forward property prediction tasks—such as the evaluation of ML surrogates for DFT-calculated material properties. These are distinct from the mathematical AMO-Bench, although both share an emphasis on original, contamination-resistant datasets, rigorous metrics, and transparent public leaderboards. AtomBench (Campbell et al., 17 Oct 2025) expands this infrastructure to generative models; AMO-Bench thereby denotes a family of robust, domain-specific benchmarks across computational science.

Summary Table: AMO-Bench Features (Mathematical Reasoning)

Aspect            Characteristic
Problems          50 IMO-level, original, multi-domain
Categories        Algebra, Functions/Sequences, Geometry, Number Theory, Combinatorics
Validation        Multi-expert review, novelty search, LLM-based filtering
Answer Format     Final answer, boxed LaTeX
Grading           Parser (39/50), LLM (11/50); 99.2% overall accuracy
Top Model Score   52.4% (GPT-5-Thinking; AVG@32)
Release           Public (amo-bench.github.io)

Representative Problem (LaTeX Format)

Let x_1, x_2, \ldots, x_{2024} be positive real numbers such that x_k + x_m \geq km for any 1 \leq k < m \leq 2024.
Find the minimum value of x_1 + x_2 + \cdots + x_{2024}.
Answer: \boxed{1382935444}

Conclusion

AMO-Bench establishes an authoritative standard for benchmarking mathematical reasoning in LLMs, combining technical rigor, originality, and grading robustness well beyond prior art. Its empirical results demonstrate considerable remaining headroom for current models, and its methodology offers a template for future benchmark development in both mathematical and scientific domains.
