OlympicArena Benchmark
- OlympicArena is a multidisciplinary benchmark that uses Olympiad competition problems to assess high-level cognitive reasoning in AI.
- It integrates a rigorously curated, multimodal dataset with robust anti-leakage protocols and multi-staged evaluations for comprehensive testing.
- Medal-table rankings and process-level evaluations provide actionable insights into model performance, driving progress toward AI superintelligence.
OlympicArena is a multidisciplinary benchmark for evaluating advanced cognitive reasoning in artificial intelligence systems, particularly LLMs and large multimodal models (LMMs). Its problem set is derived from international Olympiad competitions, encompassing the depth, complexity, and multimodal demands required for scientific discovery and analogical reasoning. The benchmark combines a diverse, rigorously curated dataset, robust anti-leakage protocols, multi-faceted evaluation schemes, and a competitive medal-table ranking system to systematically chart progress toward AI superintelligence (Huang et al., 18 Jun 2024, Huang et al., 24 Jun 2024).
1. Benchmark Definition and Motivation
OlympicArena’s core thesis is that international Olympiad competition problems are ideal testbeds for high-level cognitive reasoning in AI. Problems from the IMO, IPhO, IChO, IBO, USACO, and related contests demand decompositional reasoning, analogical mapping, symbolic manipulation, spatial interpretation, and interdisciplinary integration, all central components of scientific discovery and “superintelligent” problem-solving. Unlike benchmarks built on fact recall or narrow question-answer pairs, Olympiad tasks require multi-step synthesis and cross-domain conceptual transfer. The stated goal is to “advance AI towards superintelligence, equipping it to address more complex challenges in science and beyond.”
2. Dataset Composition
The OlympicArena dataset comprises 11,163 problems collected from 62 national and international Olympiad-style competitions across seven main subjects: Mathematics, Physics, Chemistry, Biology, Geography, Astronomy, and Computer Science (programming). Content is bilingual (English/Chinese), and distributed over two modalities: text-only and interleaved text-image. The dataset covers 13 distinct answer types, including single-choice, multi-choice, numeric value, sets, code generation, and open-ended explanations.
Difficulty annotation is performed by a GPT-4V-based classifier, which partitions the corpus into Knowledge Recall, Concept Application, and Cognitive Reasoning tiers. Empirically, the corpus skews toward higher complexity, with roughly 77% of problems rated medium difficulty and 23% rated hard.
| Subject | Text-only (EN) | Text-only (ZH) | Multimodal (EN) | Multimodal (ZH) |
|---|---|---|---|---|
| Math | 2215 | 780 | 193 | 45 |
| Physics | 632 | 164 | 646 | 298 |
| Chemistry | 782 | 124 | 235 | 444 |
| ... | ... | ... | ... | ... |
3. Construction Pipeline and Data Integrity
Benchmark creation involved automated PDF-to-Markdown extraction (Mathpix), followed by human annotation for problem segmentation, metadata labelling, and manual solution entry. The pipeline incorporates deduplication via embedding similarity, multi-stage review, and rule-based consistency checks. A key methodological innovation is the N-gram Prediction Accuracy (NPA) protocol for data-leakage detection: for each problem, k random starting points are sampled and the model under test is asked to predict the next 5-gram; exact matches across all sampled n-grams flag potential leakage. Empirical findings indicate negligible leakage (no flagged instance was solved by GPT-4o), and all test-split ground-truth answers are withheld during evaluation.
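A minimal sketch of the NPA check described above, assuming a generic `complete(prefix)` call that returns the model's text continuation; the function name, whitespace tokenization, and sampling parameters are illustrative, not the authors' implementation:

```python
import random

def ngram_leakage_check(problem_text: str, complete, k: int = 3, n: int = 5) -> bool:
    """Flag potential training-data leakage: sample k random starting points,
    ask the model to continue the text, and check whether every predicted
    n-gram exactly matches the ground-truth continuation."""
    tokens = problem_text.split()
    if len(tokens) < 2 * n:
        return False  # too short to test reliably
    matches = 0
    for _ in range(k):
        start = random.randint(n, len(tokens) - n)
        prefix = " ".join(tokens[:start])
        target = tokens[start:start + n]            # ground-truth next n-gram
        predicted = complete(prefix).split()[:n]    # model's predicted continuation
        if predicted == target:
            matches += 1
    return matches == k  # leakage suspected only if all sampled n-grams match
```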
4. Evaluation Protocols and Metrics
OlympicArena employs both answer-level and process-level evaluation. Fixed answer types (single-choice, multi-choice, numeric value, etc.) are assessed via rule-based string matching or symbolic equivalence (SymPy); code generation problems use the pass@k metric:
$\operatorname{pass}@k := \mathbb{E}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right], \quad n=5,\; k=1,\; c=\#\text{correct samples}$
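A minimal sketch of this answer-level scoring, combining a SymPy-based equivalence check for expression answers with the pass@k estimator above; the function names and the string-match fallback are illustrative assumptions:

```python
from math import comb
import sympy

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def symbolically_equal(predicted: str, reference: str) -> bool:
    """Rule-based check for expression answers: parse both strings with SymPy
    and test whether their difference simplifies to zero."""
    try:
        diff = sympy.simplify(sympy.sympify(predicted) - sympy.sympify(reference))
        return diff == 0
    except (sympy.SympifyError, TypeError):
        return predicted.strip() == reference.strip()  # fall back to string match

# Example: 1 correct program out of n=5 samples, scored at k=1
print(pass_at_k(n=5, c=1, k=1))              # 0.2
print(symbolically_equal("2*x + x", "3*x"))  # True
```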
Open-ended formats (“Others”) are graded by GPT-4V teacher prompts. Process-level evaluation reconstructs stepwise solution chains for a subset of problems, scoring each chain as
$\text{process score} = \frac{C}{T}$
where $C$ is the number of correct steps and $T$ is the total number of steps. Agreement with human judges reached 83%.
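A minimal sketch of per-chain process scoring under that definition, assuming per-step correctness judgments are already available (e.g., from the GPT-4V grader):

```python
def process_score(step_correct: list[bool]) -> float:
    """Fraction of correct steps in one reconstructed solution chain (C / T)."""
    if not step_correct:
        return 0.0
    return sum(step_correct) / len(step_correct)

# A chain whose first three deduction steps are right but whose last two fail
print(process_score([True, True, True, False, False]))  # 0.6
```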
Multimodal reasoning is explicitly benchmarked in three settings:
- Interleaved text + images
- Image-caption substitution (text describes images)
- Text-only
“Multimodal gain” is quantified as the accuracy improvement a model achieves when images are actually provided rather than withheld, e.g.
$\Delta_{\text{MM}} = \text{Acc}_{\text{interleaved}} - \text{Acc}_{\text{text-only}}$
Significant gains under true multimodal protocols are realized principally by GPT-4o and a limited subset of LMMs.
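A minimal sketch of computing this gain from per-setting accuracies; the model names and numbers below are placeholders, not reported results:

```python
# Accuracy (%) per evaluation setting; values are illustrative placeholders.
results = {
    "model_a": {"interleaved": 41.0, "caption": 38.5, "text_only": 37.2},
    "model_b": {"interleaved": 30.1, "caption": 30.4, "text_only": 30.0},
}

for name, acc in results.items():
    gain = acc["interleaved"] - acc["text_only"]  # multimodal gain over text-only
    print(f"{name}: multimodal gain = {gain:+.2f} points")
```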
5. Medal Table Ranking and Model Performance
OlympicArena employs an Olympic-style medal table that ranks models lexicographically by subject-wise gold, silver, and bronze medal counts, with overall accuracy as the final tiebreaker (a minimal sorting sketch follows the medal table below):
- Each model is summarized by a tuple $(G, S, B, A)$: $G$ golds, $S$ silvers, $B$ bronzes, $A$ overall accuracy (%)
- Ranking: more golds first, then more silvers, then more bronzes, then higher accuracy
| Model | Gold | Silver | Bronze | Overall (%) |
|---|---|---|---|---|
| GPT-4o | 4 | 3 | 0 | 40.47 |
| Claude-3.5-Sonnet | 3 | 3 | 0 | 39.24 |
| Gemini-1.5-Pro | 0 | 0 | 6 | 35.09 |
| GPT-4V | 0 | 1 | 1 | 33.17 |
| Qwen 1.5-32B-Chat | 0 | 0 | 0 | 24.36 |
| Qwen-VL-Max | 0 | 0 | 0 | 21.41 |
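A minimal sketch of the lexicographic ranking rule stated above, applied to the medal counts and overall accuracies from the table:

```python
# (golds, silvers, bronzes, overall accuracy %) per model, as in the table above
medal_table = {
    "GPT-4o":            (4, 3, 0, 40.47),
    "Claude-3.5-Sonnet": (3, 3, 0, 39.24),
    "Gemini-1.5-Pro":    (0, 0, 6, 35.09),
    "GPT-4V":            (0, 1, 1, 33.17),
    "Qwen 1.5-32B-Chat": (0, 0, 0, 24.36),
    "Qwen-VL-Max":       (0, 0, 0, 21.41),
}

# Lexicographic order: golds, then silvers, then bronzes, then overall accuracy
ranking = sorted(medal_table.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, (g, s, b, acc)) in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {g}G {s}S {b}B, {acc:.2f}%")
```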
Per-discipline performance shows strong differentiation. For instance, GPT-4o leads in Math and Computer Science, while Claude-3.5-Sonnet surpasses GPT-4o in Physics, Chemistry, and Biology. Open-source models do not earn medals, generally scoring below 25% overall accuracy.
| Model | Math | Physics | Chemistry | Biology | Geography | Astronomy | CS | Overall |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 28.32 | 30.01 | 46.68 | 53.11 | 56.77 | 44.50 | 8.43 | 40.47 |
| Claude-3.5-Sonnet | 23.18 | 31.16 | 47.27 | 56.05 | 55.19 | 43.51 | 5.19 | 39.24 |
| Gemini-1.5-Pro | 19.99 | 28.93 | 43.80 | 49.16 | 49.67 | 38.29 | 5.37 | 35.09 |
| GPT-4V | 18.98 | 24.94 | 41.06 | 47.69 | 50.33 | 32.07 | 6.94 | 33.17 |
Process-level analysis reveals a clear positive Pearson correlation between answer-level and stepwise reasoning scores. Many models achieve higher process-level scores than answer-level scores, indicating partially correct cognitive chains even when the final answer is incorrect. Error nodes cluster towards chain termini, demonstrating error accumulation in deduction.
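A minimal sketch of measuring that correlation across problems, assuming SciPy's `pearsonr` and placeholder paired scores (final-answer correctness and process-level fractions for the same problems):

```python
from scipy.stats import pearsonr

# Placeholder paired scores per problem: final-answer correctness (0/1) and
# fraction of correct reasoning steps for the same problem.
answer_scores  = [1, 0, 0, 1, 1, 0, 1, 0]
process_scores = [0.9, 0.4, 0.6, 1.0, 0.8, 0.2, 0.7, 0.5]

r, p_value = pearsonr(answer_scores, process_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```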
Fine-grained ability breakdowns show consistent strengths in abduction and cause-and-effect reasoning, but systematic failures in induction and decomposition. On the visual side, pattern recognition is robust, while spatial analysis and symbol interpretation remain challenging.
6. Comparative Analysis and Limitations
OlympicArena exposes a substantial performance gap between proprietary and open-source models. The leading systems achieve just 33–40% accuracy across disciplines, and none approach human Olympiad benchmark norms. This gap is particularly pronounced in domains requiring procedural synthesis (Math, CS) and abstract symbolic manipulation. Even where models excel (e.g., Biology, Geography), performance remains incomplete.
Common failure modes include logical/visual reasoning errors, knowledge deficits, and misunderstanding biases. State-of-the-art multimodal models cannot reliably execute fine-grained visual reasoning in multidisciplinary contexts; long-chain deduction is hampered by error propagation and abstract symbol transformation bottlenecks. A plausible implication is that multimodal and process-level supervision must be fundamentally improved to transcend these boundaries.
7. Resources, Tools, and Future Directions
OlympicArena provides freely accessible resources to support comprehensive AI evaluation:
- Full benchmark dataset (Markdown/JSON) with labeled splits
- Streamlit-based annotation and review platform
- Rule-based and teacher-prompted evaluation suite, including process-level assessment protocols
- Online leaderboard with automatic submission tracking
The framework is positioned for regular annual updates to minimize data leakage and extend coverage. Recommendations for future work include: (i) domain-specific training on curated corpora for knowledge-intensive fields, (ii) improved multimodal pre-training and visual-text alignment (especially for non-English inputs), (iii) broader adoption of fine-grained process-level evaluation, and (iv) continuous integration of new reasoning and planning tasks (Huang et al., 18 Jun 2024, Huang et al., 24 Jun 2024).
OlympicArena thus constitutes a rigorously constructed and objectively quantified reference for “all-round” cognitive AI reasoning, with its medal-table system furnishing an interpretable, competitive landscape to drive model innovation and track progress toward superintelligence.