OlympiadBench: Bilingual STEM Multimodal Benchmark
- OlympiadBench is a large-scale bilingual dataset of 8,952 expert-annotated Olympiad-level math and physics problems, featuring both text and visual data.
- It integrates multimodal elements like diagrams and LaTeX, with rigorous evaluation protocols distinguishing numeric, symbolic, and proof-based answers.
- Benchmark results expose gaps in advanced models, spurring innovations in hybrid RL, neurosymbolic methods, and tool-augmented strategies.
OlympiadBench is a large-scale, bilingual, multimodal benchmark specifically constructed to evaluate the advanced reasoning and problem-solving capabilities of large language and multimodal models on Olympiad-level mathematics and science problems. Its design, annotation standards, and evaluation protocols position it as a pivotal resource for stress-testing generalist reasoning at the upper limits of contemporary model capability.
1. Benchmark Construction and Dataset Scope
OlympiadBench comprises 8,952 problems (as of (He et al., 2024)), with 6,524 in mathematics (73%) and 2,428 in physics (27%). Chemistry and biology content is a prospective area for extension. The problems are drawn from international and regional Olympiad competitions (including IMO, RMM, ARML, EGMO, IPhO, and APhO), together with Chinese college entrance exam mock sets and national equivalents. Theorem-proving questions account for 19% of the items (1,698 problems); the remaining 81% are open-ended, requiring numeric, expression, equation, interval, or tuple responses.
All items are released in English and Chinese, after manual OCR post-processing, expert verification, and language alignment. Multimodality is fundamental: 57% of the benchmark includes visual data (e.g., diagrams, charts, experimental setups), and the annotation of embedded LaTeX and images guarantees fidelity of mathematical notation across languages.
The problem subfield coverage is broad: in mathematics, the core areas are algebra, geometry (plane/solid), number theory, combinatorics, and conic sections; physics subfields include mechanics, electromagnetism, thermodynamics, optics, and modern physics.
2. Annotation Standards and Multimodal Characteristics
Each OlympiadBench problem is annotated with three main fields:
- Problem: The full statement (markdown), including multimodal elements.
- Solution: Expert step-by-step reasoning, in natural language, using standard notation and explicit logical justification.
- Answer: Final result, in a prescribed format (numeric, symbolic, etc.), with rigorous LaTeX.
Annotations are performed by graduate-level mathematics/physics experts, employing double annotation and adjudication for quality control. There is no automatic translation between languages; source material is kept in its native language. Diagrams are meticulously aligned to both language versions to ensure consistent context—a prerequisite for robust multimodal training and evaluation.
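As a concrete illustration, a single annotated record might look like the following sketch; the field names and metadata keys are illustrative assumptions, not the official release schema.

```python
# Hypothetical OlympiadBench-style record; keys are illustrative, not the
# official schema of the released dataset.
example_record = {
    "problem": (
        "In triangle $ABC$, ... (full markdown statement, possibly "
        "referencing an attached diagram such as figure_1.png)"
    ),
    "images": ["figure_1.png"],        # present for the ~57% of items with visual data
    "solution": (
        "Expert step-by-step reasoning in natural language with LaTeX, "
        "ending in the final boxed result."
    ),
    "final_answer": "$\\dfrac{3\\sqrt{2}}{2}$",  # prescribed-format answer in LaTeX
    "answer_type": "numeric",          # numeric / expression / equation / interval / tuple
    "subject": "maths",
    "language": "en",                  # each item is released in English and Chinese
    "is_proof": False,                 # theorem-proving items carry no short final answer
}
```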
3. Evaluation Protocols and Metrics
OlympiadBench employs an automated, expertise-level scoring pipeline (He et al., 2024). For open-ended responses, the evaluation pipeline distinguishes numeric values (compared within a specified error tolerance), symbolic expressions (checked for equivalence via SymPy), equations, and tuples/intervals (compared element-wise). The principal metric is micro-averaged accuracy (exact match), formalized as

$$\text{Accuracy} \;=\; \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\big[\hat{a}_i \equiv a_i\big]$$

over the $N$ scored problems, where $\hat{a}_i$ is the model's final answer to problem $i$ and $a_i$ the reference answer.
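A simplified sketch of the kind of answer matching this protocol implies, using SymPy for symbolic equivalence and a relative tolerance for numeric answers; the tolerance value and dispatch details are illustrative assumptions, not the official scorer:

```python
import sympy as sp

def numeric_match(pred: str, ref: str, rel_tol: float = 1e-2) -> bool:
    """Numeric answers: accept values within a relative tolerance (value is illustrative)."""
    p, r = float(sp.sympify(pred)), float(sp.sympify(ref))
    return abs(p - r) <= rel_tol * max(abs(r), 1e-12)

def symbolic_match(pred: str, ref: str) -> bool:
    """Symbolic expressions: equivalent iff their difference simplifies to zero."""
    return sp.simplify(sp.sympify(pred) - sp.sympify(ref)) == 0

def tuple_match(preds: list[str], refs: list[str]) -> bool:
    """Tuples and intervals: compare element-wise."""
    return len(preds) == len(refs) and all(
        symbolic_match(p, r) for p, r in zip(preds, refs)
    )

def micro_accuracy(per_problem_correct: list[bool]) -> float:
    """Micro-averaged exact-match accuracy over all scored problems."""
    return sum(per_problem_correct) / len(per_problem_correct)

# e.g. "2*sqrt(2)" and "sqrt(8)" are judged equivalent
print(symbolic_match("2*sqrt(2)", "sqrt(8)"))  # True
```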
In some model studies, Pass@1 is computed as the fraction of problems solved correctly on the first attempt, averaged over multiple independent runs to smooth out sampling stochasticity. Free-form answers may additionally be judged with LLM-based semantic matching (e.g., OpenAI-o4) to accept answers that are equivalent but not syntactically identical (Wang et al., 23 Apr 2025).
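As a minimal sketch (the run count and data layout are assumptions), Pass@1 averaged over independent runs can be computed as follows:

```python
from statistics import mean

def pass_at_1(per_run_correct: list[list[bool]]) -> float:
    """per_run_correct[r][i] is True iff problem i was solved on run r's single attempt.
    Pass@1 is the per-run solve rate, averaged over runs to smooth sampling noise."""
    return mean(sum(run) / len(run) for run in per_run_correct)

# Three illustrative runs over four problems -> (0.5 + 0.5 + 0.25) / 3 ≈ 0.417
print(pass_at_1([[True, False, True, False],
                 [True, True, False, False],
                 [True, False, False, False]]))
```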
The official split does not label problems by fine-grained difficulty, but all instances are at least “Olympiad level,” fundamentally exceeding the complexity of school or standardized test benchmarks.
4. Comparative Model Performance and Analysis
OlympiadBench has proven significantly more difficult than prior academic math/physics benchmarks for both open- and closed-source models. For example, early results from (He et al., 2024) place GPT-4V at 20.35% (math), 11.28% (physics), and 17.23% (overall). Other baselines, such as Qwen-VL-Max and Gemini-Pro-Vision, fall below 12%. In (Wang et al., 23 Apr 2025), the open-source multimodal models Qwen2.5-VL-72B and QvQ-Preview-72B achieve 40.4% and 33.2% respectively, while Skywork R1V2 surpasses these with 62.6% Pass@1. In small-model, inference-only regimes (e.g., Liu et al., 29 May 2025), models under 3B parameters solve barely 21% of OlympiadBench problems.
Performance for LLMs using advanced reinforcement learning or tool-augmented techniques demonstrates rapid recent progress:
- Hybrid RL (Skywork R1V2): 62.6% Pass@1, with gains attributed to mixed preference optimization (MPO) combined with Group Relative Policy Optimization (GRPO); a sketch of the group-relative advantage used in GRPO follows this list. Adapter-only fine-tuning suffices, with minimal gains from full vision-encoder updates.
- Tool Aggregation (Multi-TAG): A finetuning-free method combining CoT, Python, and WolframAlpha yields up to 44.1% exact accuracy (GPT-4o), surpassing all single-tool or simple ensemble baselines (Yao et al., 25 Jul 2025).
- Self-Confidence RL (RLSC): Fine-tuning Qwen2.5-Math-7B with self-confidence signals boosts accuracy from 15.1% to 35.9% (+20.8 pp) on OlympiadBench (Li et al., 5 Jun 2025).
- Prompt-augmented GRPO: Applying prompt augmentation in GRPO RL enables a rise to 41.9–42.3% for 1.5B models, a new state of the art for this size class (Lu et al., 3 Feb 2026).
- Monte Carlo Self-Refine (MCTSr): Iterative tree search and self-refinement with LLaMA-3 8B yields stepwise gains, but even the best configuration reaches only 7.76% on the text-only portion of OlympiadBench (Zhang et al., 2024).
- Few-shot Extreme-Difficulty Synthesis (MathSmith): Synthetic hard problem data yields incremental but significant gains over existing CoT-oriented synthetic data, up to 3 points higher in short-CoT and 0.7 in long-CoT on top-tier Qwen models (Zhan et al., 7 Aug 2025).
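Since several of the entries above rely on GRPO, the following is a minimal sketch of its group-relative advantage; reward values and group size are illustrative, and the surrounding policy-gradient machinery (clipping, KL regularization) is omitted:

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO normalizes each sampled response's reward against its own group:
    A_i = (r_i - mean(group)) / (std(group) + eps).
    Responses better than the group average receive positive advantages."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# e.g. a group of 4 rollouts for one problem, scored with a verifiable-answer
# reward (1.0 = final answer matched the reference, 0.0 = it did not)
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [1.0, -1.0, -1.0, 1.0]
```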
In all cases, even top open-weight and closed-source models are far from saturation, with many problems—especially those involving diagrams or multi-modal cues—remaining out of reach.
5. Methodological Innovations and Experimental Strategies
Multiple research directions leverage OlympiadBench to probe new RL paradigms, curriculum learning, tool use, and inference efficiency.
- Multimodal RL Curricula: Infi-MMR (Liu et al., 29 May 2025) applies a three-phase curriculum (Foundational Reasoning Activation→Cross-Modal Reasoning Adaptation→Multimodal Reasoning Enhancement), progressively transferring text-only reasoning capacities to visual multimodal settings. This staged RL approach is critical for achieving stable, high multimodal accuracy.
- Efficiency and Output Control: For small models, techniques such as temperature scaling to tune EOS emission and length-regularized RL (TLDR via GRPO) dramatically cut inference-time token consumption (by ~50%) without harming (or even improving) accuracy (Zhang et al., 12 May 2025).
- Multi-Agent Pipelines: Agentic decomposition—splitting vision parsing and symbolic solving—offers double-digit accuracy boosts for open-source vision–LLMs on diagram-heavy OlympiadBench items, though this is not universally optimal for highly optimized proprietary models (Sobhani et al., 18 Dec 2025).
- Verifiable Code (Neurosymbolic): SymCode (Nezhad et al., 29 Oct 2025) reframes problem solving as code generation for symbolic engines (SymPy), improving both the explainability and reliability of outputs; a minimal execution sketch follows this list. This approach yields accuracy gains of up to +13.6 points over even advanced Tree-of-Thoughts baselines, with errors surfacing as transparent, programmatic exceptions rather than opaque logical fallacies.
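To make the neurosymbolic pattern concrete, the sketch below executes a (here hand-written) stand-in for model-generated SymPy code and surfaces failures as exceptions; the harness and namespace restrictions are illustrative assumptions, not the SymCode implementation:

```python
import sympy as sp

def run_symbolic_program(program: str) -> str:
    """Execute model-generated SymPy code in a restricted namespace and return
    the value it binds to `answer`. Bugs (undefined symbols, unsolvable systems,
    API misuse) raise programmatic exceptions instead of silently yielding a
    plausible-looking but wrong chain of reasoning."""
    namespace = {"sp": sp}        # expose only SymPy; real use needs sandboxing
    exec(program, namespace)
    return str(namespace["answer"])

# Stand-in for what a model might emit for:
# "Solve x^2 - 5x + 6 = 0 and report the larger root."
program = """
x = sp.symbols('x')
roots = sp.solve(sp.Eq(x**2 - 5*x + 6, 0), x)
answer = max(roots)
"""

try:
    print(run_symbolic_program(program))  # -> 3
except Exception as err:                  # errors are explicit and inspectable
    print(f"programmatic failure: {err}")
```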
6. Impact and Relation to the Benchmark Ecosystem
OlympiadBench fills a critical gap left by benchmarks such as GSM8K or MATH, which are now near saturation for frontier LLMs (Gao et al., 2024). Unlike purely textual, math-only datasets (e.g., Omni-MATH), OlympiadBench stresses models with multimodal integration, bilingual context, and a wider array of problem types, including true open-ended proofs and diagram-grounded challenges.
Newer benchmarks (EEFSUVA, OIBench) have sought to diversify or go beyond OlympiadBench by targeting low-prevalence contests and algorithmic domains, but OlympiadBench remains the reference for large-scale, expert-annotated, AGI-level multimodal STEM reasoning (Khatibi et al., 23 Sep 2025, Zhu et al., 12 Jun 2025).
Contamination resistance is enhanced by sourcing problems from less accessible competitions and by explicit decontamination protocols, but the field notes ongoing risks, especially for proof- and text-only questions.
7. Common Failure Modes and Open Research Challenges
Failure analysis reveals dominant error patterns for top-tier models (He et al., 2024):
- Logical fallacies and conceptual confusion (30%)
- Hallucinated steps or conclusions (18%)
- Insufficient case analysis in combinatorics and multi-part problems (15%)
- Algebraic slips (12%) and arithmetic errors (10%)
- Neglect of diagrams or other visual features (8%)
- Omitted cases and incomplete reasoning (7%)
Advanced RL and neurosymbolic methods shift mistakes from hidden logical errors toward explicit programmatic exceptions (e.g., out-of-domain roots, API errors) (Nezhad et al., 29 Oct 2025), but overall performance remains far below human-expert level (>90%).
Partial-credit rubrics, automatic proof verification, and further expansion to chemistry, biology, and algorithmic tasks are identified as necessary extensions to better assess truly general reasoning (He et al., 2024, Gao et al., 2024).
OlympiadBench continues to drive advances in both model capability and benchmark methodology for evaluation at the limits of machine mathematical and scientific reasoning. Its sustained difficulty and multimodal demands ensure enduring relevance in the face of rapid model scaling and algorithmic innovation.