HiPhO Benchmark: Multimodal Physics Evaluation
- HiPhO Benchmark is an open-source framework that rigorously evaluates multimodal physical reasoning using 13 high-school physics Olympiad exams from 2024–2025.
- It employs human-aligned, step-level and answer-level scoring protocols that map model scores to gold, silver, and bronze medal thresholds akin to official Olympiad standards.
- The benchmark provides fine-grained performance analysis across diverse physics subfields and modalities, exposing challenges in variable-based and data-figure reasoning.
HiPhO is the first open-source benchmark designed to rigorously evaluate the multimodal physical reasoning abilities of LLMs and multimodal LLMs (LLMs/MLLMs) using real high-school physics Olympiad exams from 2024–2025. It uniquely enables direct, human-aligned comparison by scoring models and humans with the same official answer- and step-level marking schemes, then mapping results to gold, silver, and bronze medal thresholds identical to those used in Olympiad competitions. HiPhO comprises 13 recent theoretical Olympiad exams spanning a diverse set of contests and modalities, setting a new standard for domain-specific, multimodal AI evaluation (Yu et al., 9 Sep 2025).
1. Dataset Composition
HiPhO aggregates 13 theoretical exam papers from seven major Olympiad events conducted during 2024–2025: International Physics Olympiad (IPhO), Asian Physics Olympiad (APhO), European Physics Olympiad (EuPhO), Nordic-Baltic Physics Olympiad (NBPhO), Pan Pearl-River-Delta Physics Olympiad (PanPhO), Pan Pearl-River-Delta Mechanics Test (PanMechanics), and F=MA (U.S.). The dataset contains 360 problems and 519 subquestions, manually extracted, OCR-corrected, QA-matched to official solutions, and expert-verified.
Problems are classified along three orthogonal axes:
- Physics Field: Mechanics (62%), Electromagnetism (9%), Thermodynamics (8%), Optics (7%), Modern Physics (8%).
- Modality:
- Text-only (TO): 39%
- Text+Illustration Figure (TI): 24%
- Text+Variable Figure (TV): 28%
- Text+Data Figure (TD): 9%
- Difficulty (by exam):
- Hard: IPhO, APhO, EuPhO
- Medium: NBPhO, PanPhO
- Easy: PanMechanics, F=MA
| Olympiad Event | Modality Coverage | Difficulty |
|---|---|---|
| IPhO | TO, TI, TV, TD | Hard |
| APhO | TO, TI, TV, TD | Hard |
| EuPhO | TO, TI, TV, TD | Hard |
| NBPhO | TO, TI, TV, TD | Medium |
| PanPhO | TO, TI, TV, TD | Medium |
| PanMechanics | TO, TI | Easy |
| F=MA | TO | Easy |
This coverage allows for both international benchmarking and fine-grained analysis of performance across physics subfields, modalities, and problem complexity.
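The composition above suggests a natural per-problem record. The sketch below is purely illustrative: the class and field names (`Problem`, `SubQuestion`, `full_mark`, `marking_criteria`, and so on) are assumptions for exposition, not the released schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubQuestion:
    """One gradable subquestion with its official mark allocation."""
    prompt: str                   # subquestion text
    full_mark: float              # marks allotted by the official scheme
    reference_answer: str         # boxed final answer from official solutions
    marking_criteria: List[dict]  # step-level criteria, e.g. {"description": ..., "weight": ...}

@dataclass
class Problem:
    """One HiPhO problem, classified along the three orthogonal axes."""
    exam: str                     # e.g. "IPhO 2025"
    physics_field: str            # Mechanics | Electromagnetism | Thermodynamics | Optics | Modern Physics
    modality: str                 # "TO" | "TI" | "TV" | "TD"
    difficulty: str               # "Hard" | "Medium" | "Easy"
    statement: str                # OCR-corrected problem text
    figures: List[str] = field(default_factory=list)  # image paths; empty for TO
    subquestions: List[SubQuestion] = field(default_factory=list)
```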
2. Evaluation Methodology
HiPhO adopts domain-specific professional evaluation protocols utilizing official Olympiad marking schemes to assign both answer-level (coarse) and step-level (fine-grained) scores.
- Answer-Level Scoring: Each model's boxed final answer is first verified with a rule-based numeric/symbolic math checker (Kydlíček's Math-Verify); if exact matching fails, Gemini-2.5-Flash is invoked for an equivalence judgment. The answer score per subquestion $i$ is binary, $s_i^{\text{ans}} \in \{0, 1\}$.
- Step-Level Scoring: For each subquestion $i$, the official marking scheme defines $n_i$ independent criteria, the $k$-th weighted by $w_{i,k}$. Gemini-2.5-Flash grades each criterion as satisfied or not, $g_{i,k} \in \{0, 1\}$. The step score is $s_i^{\text{step}} = \sum_{k=1}^{n_i} w_{i,k}\, g_{i,k}$.
- Aggregation: Per subquestion, the score is $s_i = \max\left(m_i \cdot s_i^{\text{ans}},\; s_i^{\text{step}}\right)$, where $m_i$ is the full mark, in line with Olympiad conventions: full credit for a correct final answer, partial step credit otherwise. Exam scores sum over all questions.
- Modality Normalization: For each modality $m$, the Mean Normalized Score (MNS) is reported as $\mathrm{MNS}_m = \frac{1}{|Q_m|} \sum_{i \in Q_m} s_i / m_i$, where $Q_m$ is the modality-specific question set and $m_i$ the full mark of question $i$.
This rigorous protocol is fully aligned with human examiner practices, supporting fine-grained and holistic performance analysis.
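As a concrete reading of this protocol, the following minimal sketch implements the scoring pipeline under the assumptions stated above (binary per-criterion grades, max-style aggregation). The callables `rule_check` and `llm_judge` are hypothetical stand-ins for the rule-based math checker and the Gemini-2.5-Flash equivalence judge.

```python
from typing import Callable, Dict, List

def answer_score(model_answer: str, reference: str,
                 rule_check: Callable[[str, str], bool],
                 llm_judge: Callable[[str, str], bool]) -> int:
    """Answer-level score s_ans in {0, 1}: rule-based matching first,
    LLM equivalence judgment only as a fallback."""
    if rule_check(model_answer, reference):
        return 1
    return int(llm_judge(model_answer, reference))

def step_score(grades: List[int], weights: List[float]) -> float:
    """Step-level score s_step = sum_k w_k * g_k, with grades g_k in {0, 1}."""
    return sum(w * g for w, g in zip(weights, grades))

def subquestion_score(s_ans: int, s_step: float, full_mark: float) -> float:
    """Aggregation s = max(m * s_ans, s_step): full credit for a correct
    final answer, otherwise partial step credit (capped at the full mark)."""
    return max(full_mark * s_ans, min(s_step, full_mark))

def mean_normalized_score(scores: Dict[str, float],
                          full_marks: Dict[str, float],
                          modality_questions: List[str]) -> float:
    """MNS_m = (1/|Q_m|) * sum_{i in Q_m} s_i / m_i for modality m."""
    qs = modality_questions
    return sum(scores[q] / full_marks[q] for q in qs) / len(qs)
```

An exam score is then the sum of `subquestion_score` over all subquestions, exactly as a human examiner would total a paper.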
3. Human vs. Model Comparison Procedure
HiPhO implements medal-based comparison by directly reusing the exact gold, silver, and bronze cutoff scores defined by Olympiad organizers. For each contest:
- Gold threshold $\theta_{\text{gold}}$: the lowest score achieved by a gold-winning human.
- Silver threshold $\theta_{\text{silver}}$: the lowest score among gold and silver medalists.
- Bronze threshold $\theta_{\text{bronze}}$: the lowest score among all medaled competitors.
A model with exam score $S$ is assigned:
- $S \geq \theta_{\text{gold}}$: Gold
- $\theta_{\text{silver}} \leq S < \theta_{\text{gold}}$: Silver
- $\theta_{\text{bronze}} \leq S < \theta_{\text{silver}}$: Bronze
- Otherwise: No medal
This mechanism allows for direct and transparent assessment of model performance relative to top human candidates.
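A minimal sketch of this medal assignment, assuming only the threshold semantics described above; the cutoff values in the usage example are illustrative, not official.

```python
def assign_medal(score: float, gold: float, silver: float, bronze: float) -> str:
    """Map an exam score to a medal using the official human cutoffs:
    gold = lowest gold-medalist score, silver = lowest score among
    gold and silver medalists, bronze = lowest score among all medalists."""
    if score >= gold:
        return "Gold"
    if score >= silver:
        return "Silver"
    if score >= bronze:
        return "Bronze"
    return "No medal"

# Illustrative cutoffs only (not taken from any official contest):
print(assign_medal(22.7, gold=21.0, silver=16.5, bronze=12.0))  # -> Gold
```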
4. Benchmark Results
Thirty state-of-the-art models (11 closed-source MLLMs, 11 open-source MLLMs, and 8 open-source LLMs) were assessed on the HiPhO suite.
Key findings:
- Closed-source reasoning MLLMs consistently outperformed all other model classes, winning 6–12 gold medals each (e.g., Gemini-2.5-Pro: 12 gold; GPT-5: 11 gold, 2 silver; o3: 11 gold, 2 silver; Grok-4: 10 gold, 3 silver).
- Open-source MLLMs predominantly scored at or below bronze; Intern-S1 was the highest with 6 gold medals.
- Open-source LLMs were competitive on the easy contests (F=MA, PanMechanics), with GPT-OSS-120B, DeepSeek-R1, and Qwen3-235B-A22B each earning 4–5 golds, but lagged significantly on more difficult exams.
- No current model matches absolute top human contestants across all events. For example, in IPhO 2025, the highest human score was 29.2/30, while Gemini-2.5-Pro scored 22.7/29.4.
Performance declines markedly in Mean Normalized Score as visual complexity increases; for Gemini-2.5-Pro:
- TO: 86%
- TI: 81%
- TV: 75%
- TD: 67%
This suggests that variable-based and data-figure reasoning remain key bottlenecks, especially for open-source models.
5. Contributions
HiPhO introduces several methodological and practical innovations:
- Olympiad Focus & Timeliness: Integration of 13 up-to-date exams from recent contests ensures challenge relevance for both global and regional physics communities.
- Mixed-Modal Coverage: Problems encompass a spectrum from text-only to complex data-figure modalities, supporting nuanced exploration of multimodal reasoning.
- Human-Aligned Scoring: Step-level partial credit via official marking schemes and exam-level scoring, rather than answer accuracy alone, reflect actual contest standards.
- Medal-Based Comparison: Direct mapping of scores to contest medal thresholds fully aligns model evaluation with human competitive outcomes.
6. Limitations and Future Directions
Identified limitations include:
- Omission of Experimental Labs and Diagram Generation: The current evaluation excludes experimental practicals and requires no drawing, reflecting models' present inability to physically interact with apparatus or to generate graphical outputs.
- Coverage Expansion Constraints: Some major contests (e.g., USAPhO, CPhO) are absent due to incomplete medal data.
- Update Mechanisms: The need for automated ingestion of new exams and scoring thresholds is recognized to mitigate staleness and possible training contamination.
- Embodied and Generative Capabilities: Achieving true human-level parity will necessitate models capable of producing graphical sketches (trajectories, waveforms) and reasoning about embodied physical experimentation.
A plausible implication is that continued progress in both multimodal input processing and embodied output generation will be crucial for closing the gap with human contestants, particularly on vision-intensive and creative reasoning tasks.
HiPhO provides a comprehensive and rigorous foundation for benchmarking multimodal physics reasoning, establishing direct comparability with high-performing humans and promoting transparent, human-aligned continual progress in AI physical problem-solving (Yu et al., 9 Sep 2025).