HiPhO Benchmark: Multimodal Physics Evaluation

Updated 19 November 2025
  • HiPhO Benchmark is an open‐source framework that rigorously evaluates multimodal physical reasoning using 13 high-school physics Olympiad exams from 2024–2025.
  • It employs human-aligned, step-level and answer-level scoring protocols that map model scores to gold, silver, and bronze medal thresholds akin to official Olympiad standards.
  • The benchmark provides fine-grained performance analysis across diverse physics subfields and modalities, exposing challenges in variable-based and data-figure reasoning.

HiPhO is the first open-source benchmark designed to rigorously evaluate the multimodal physical reasoning abilities of LLMs and multimodal LLMs (LLMs/MLLMs) using real high-school physics Olympiad exams from 2024–2025. It uniquely enables direct, human-aligned comparison by scoring models and humans with the same official answer- and step-level marking schemes, then mapping results to gold, silver, and bronze medal thresholds identical to those in Olympiad competitions. HiPhO comprises 13 recent theoretical Olympiad exams spanning a diverse set of contests and modalities, setting a new standard for domain-specific, multimodal AI evaluation (Yu et al., 9 Sep 2025).

1. Dataset Composition

HiPhO aggregates 13 theoretical exam papers from seven major Olympiad events conducted during 2024–2025: International Physics Olympiad (IPhO), Asian Physics Olympiad (APhO), European Physics Olympiad (EuPhO), Nordic-Baltic Physics Olympiad (NBPhO), Pan Pearl-River-Delta Physics Olympiad (PanPhO), Pan Pearl-River-Delta Mechanics Test (PanMechanics), and F=MA (U.S.). The dataset contains 360 problems and 519 subquestions, manually extracted, OCR-corrected, QA-matched to official solutions, and expert-verified.

Problems are classified along three orthogonal axes:

  • Physics Field: Mechanics (8%), Electromagnetism (9%), Thermodynamics (62%), Optics (7%), Modern Physics (8%).
  • Modality:
    • Text-only (TO): 39%
    • Text+Illustration Figure (TI): 24%
    • Text+Variable Figure (TV): 28%
    • Text+Data Figure (TD): 9%
  • Difficulty (by exam):
    • Hard: IPhO, APhO, EuPhO
    • Medium: NBPhO, PanPhO
    • Easy: PanMechanics, F=MA

| Olympiad Event | Modality Coverage | Difficulty |
| --- | --- | --- |
| IPhO | TO, TI, TV, TD | Hard |
| APhO | TO, TI, TV, TD | Hard |
| EuPhO | TO, TI, TV, TD | Hard |
| NBPhO | TO, TI, TV, TD | Medium |
| PanPhO | TO, TI, TV, TD | Medium |
| PanMechanics | TO, TI | Easy |
| F=MA | TO | Easy |

This coverage allows for both international benchmarking and fine-grained analysis of performance across physics subfields, modalities, and problem complexity.
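
For concreteness, each problem can be pictured as a record carrying these three axes alongside its official marking data. The sketch below is illustrative only and assumes a hypothetical schema; the field names and the example entry are not HiPhO's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class SubQuestion:
    prompt: str
    full_mark: float            # official marks available for this subquestion
    step_weights: list[float]   # weights w_ij of the official step-level marking criteria

@dataclass
class Problem:
    event: str                  # e.g. "IPhO", "APhO", ..., "F=MA"
    physics_field: str          # Mechanics, Electromagnetism, Thermodynamics, Optics, or Modern Physics
    modality: str               # "TO", "TI", "TV", or "TD"
    difficulty: str             # "Hard", "Medium", or "Easy"
    subquestions: list[SubQuestion] = field(default_factory=list)

# A made-up example entry (not taken from the dataset):
example = Problem(
    event="IPhO",
    physics_field="Mechanics",
    modality="TV",
    difficulty="Hard",
    subquestions=[SubQuestion(prompt="Derive the oscillation period...",
                              full_mark=1.2, step_weights=[0.4, 0.4, 0.4])],
)
```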

2. Evaluation Methodology

HiPhO adopts domain-specific professional evaluation protocols utilizing official Olympiad marking schemes to assign both answer-level (coarse) and step-level (fine-grained) scores.

  • Answer-Level Scoring: Each model’s boxed final answer is verified with a rule-based numeric/symbolic math checker (“Kydlíček”). If exact matching fails, Gemini-2.5-Flash is invoked for an equivalence judgment. The answer score is $A_i \in \{0,1\}$ for each subquestion $i$.
  • Step-Level Scoring: Official marking schemes define $n$ independent criteria per subquestion, each weighted by $w_{ij}$. Gemini-2.5-Flash grades each step $s_{ij} \in [0,1]$. The step score is $S_i = \sum_{j=1}^{n} w_{ij} s_{ij}$.
  • Aggregation: The per-question score is $\max(A_i, S_i)$, in line with Olympiad conventions: full credit for a correct final answer, partial credit otherwise. Exam scores sum over all questions.
  • Modality Normalization: For modality $M$, the Mean Normalized Score (MNS) is reported:

$$MNS(M) = \frac{1}{|Q_M|} \sum_{Q \in Q_M} \frac{Score(Q)}{FullMark(Q)} \times 100\%$$

where $Q_M$ is the modality-specific question set.

This rigorous protocol is fully aligned with human examiner practices, supporting fine-grained and holistic performance analysis.
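
As a minimal sketch of how these rules compose, the snippet below assumes hypothetical helpers: `exact_match` stands in for the rule-based checker, while `llm_equivalent` and the pre-computed `step_grades` stand in for Gemini-2.5-Flash judgments. Only the aggregation arithmetic ($A_i$, $S_i$, their maximum, and MNS) follows the protocol as stated; scaling $A_i$ by the full mark is one reading of "full credit for a correct final answer".

```python
from typing import Callable

def answer_score(model_answer: str, reference: str,
                 exact_match: Callable[[str, str], bool],
                 llm_equivalent: Callable[[str, str], bool]) -> float:
    """A_i in {0, 1}: rule-based exact check first, LLM equivalence judgment as fallback."""
    if exact_match(model_answer, reference):
        return 1.0
    return 1.0 if llm_equivalent(model_answer, reference) else 0.0

def step_score(step_grades: list[float], weights: list[float]) -> float:
    """S_i = sum_j w_ij * s_ij, with each graded step s_ij in [0, 1]."""
    return sum(w * s for w, s in zip(weights, step_grades))

def subquestion_score(a_i: float, s_i: float, full_mark: float) -> float:
    """Per-question score max(A_i, S_i); A_i is scaled to the full mark (an assumed interpretation)."""
    return max(a_i * full_mark, s_i)

def mean_normalized_score(scores: list[float], full_marks: list[float]) -> float:
    """MNS(M): average of Score(Q)/FullMark(Q) over one modality's questions, in percent."""
    return 100.0 * sum(s / f for s, f in zip(scores, full_marks)) / len(scores)
```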

3. Human vs. Model Comparison Procedure

HiPhO implements medal-based comparisons directly reusing the exact cutoff scores for gold, silver, and bronze as defined by Olympiad organizers. For each contest:

  • Gold medal threshold $G$ is the lowest score achieved by a gold-winning human: $G = \min_{h \in H_g} Score_h$.
  • Silver threshold: $S = \min_{h \in H_s} Score_h$, where $H_s$ includes gold and silver medalists.
  • Bronze threshold: $B = \min_{h \in H_b} Score_h$, where $H_b$ includes all medaled competitors.

Model assignment:

  • $ExamScore \geq G$: Gold
  • $S \leq ExamScore < G$: Silver
  • $B \leq ExamScore < S$: Bronze
  • Otherwise: No medal

This mechanism allows for direct and transparent assessment of model performance relative to top human candidates.
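
The threshold rules translate directly into a small mapping function. The sketch below is an illustrative transcription of the assignment above; the cutoff values in the usage example are made up, not actual contest thresholds.

```python
def assign_medal(exam_score: float, gold: float, silver: float, bronze: float) -> str:
    """Map an exam score to a medal using the human-derived cutoffs G >= S >= B."""
    if exam_score >= gold:
        return "Gold"
    if exam_score >= silver:
        return "Silver"
    if exam_score >= bronze:
        return "Bronze"
    return "No medal"

# Usage with made-up cutoffs:
print(assign_medal(22.7, gold=21.0, silver=17.5, bronze=14.0))  # -> Gold
```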

4. Benchmark Results

Thirty state-of-the-art models—including 11 closed-source MLLMs, 11 open-source MLLMs, and 8 open-source LLMs—were assessed on the HiPhO suite.

Key findings:

  • Closed-source reasoning MLLMs consistently outperformed other model families, winning 6–12 gold medals (e.g., Gemini-2.5-Pro: 12 gold; GPT-5: 11 gold, 2 silver; o3: 11 gold, 2 silver; Grok-4: 10 gold, 3 silver).
  • Open-source MLLMs predominantly scored at or below bronze; Intern-S1 was the highest with 6 gold medals.
  • Open-source LLMs were competitive on easy contests (F=MA, PanMechanics), each earning 4–5 golds (GPT-OSS-120B, DeepSeek-R1, Qwen3-235B-A22B), but lagged significantly on more difficult exams.
  • No current model matches absolute top human contestants across all events. For example, in IPhO 2025, the highest human score was 29.2/30, while Gemini-2.5-Pro scored 22.7/29.4.

Mean Normalized Scores show a pronounced decline as modality complexity increases (scores for Gemini-2.5-Pro):

  • TO: 86%
  • TI: 81%
  • TV: 75%
  • TD: 67%

This suggests that variable-based and data-figure reasoning remain key bottlenecks, especially for open-source models.

5. Contributions

HiPhO introduces several methodological and practical innovations:

  1. Olympiad Focus & Timeliness: Integration of 13 up-to-date exams from recent contests ensures challenge relevance for both global and regional physics communities.
  2. Mixed-Modal Coverage: Problems encompass a spectrum from text-only to complex data-figure modalities, supporting nuanced exploration of multimodal reasoning.
  3. Human-Aligned Scoring: Step-level partial credit from official marking schemes and exam-level scoring, not just answer accuracy, reflect actual contest standards.
  4. Medal-Based Comparison: Direct mapping of scores to contest medal thresholds fully aligns model evaluation with human competitive outcomes.

6. Limitations and Future Directions

Identified limitations include:

  • Omission of Experimental Labs and Diagram Generation: The current evaluation excludes experimental practicals and requires no drawing, reflecting models' present inability to physically interact with apparatus or generate graphical outputs.
  • Coverage Expansion Constraints: Some major contests (e.g., USAPhO, CPhO) are absent due to incomplete medal data.
  • Update Mechanisms: The need for automated ingestion of new exams and scoring thresholds is recognized to mitigate staleness and possible training contamination.
  • Embodied and Generative Capabilities: Achieving true human-level parity will necessitate models capable of producing graphical sketches (trajectories, waveforms) and reasoning about embodied physical experimentation.

A plausible implication is that continued progress in both multimodal input processing and embodied output generation will be crucial for closing the gap with human contestants, particularly on vision-intensive and creative reasoning tasks.

HiPhO provides a comprehensive and rigorous foundation for benchmarking multimodal physics reasoning, establishing direct comparability with high-performing humans and promoting transparent, human-aligned continual progress in AI physical problem-solving (Yu et al., 9 Sep 2025).
