PhysicsMinions: Multimodal Physics Solver
- PhysicsMinions is a coevolutionary, multimodal, multi-agent system designed to solve complex Olympiad-level physics problems.
- It integrates Visual, Logic, and Review Studios to extract diagrams, perform symbolic computations, and verify solutions through iterative refinement.
- The framework enhances base model performance, achieving gold-medal-level accuracy by combining structured feedback and dual-stage verification.
PhysicsMinions is a coevolutionary, multimodal, multi-agent system designed to achieve state-of-the-art performance on Olympiad-level physics problems, particularly those presented in major international competitions such as the International Physics Olympiad (IPhO). The framework emphasizes multi-agent orchestration, multimodal perception, structured solution refinement, and dual-stage verification, achieving open-source gold-medal-level performance on tasks that demand complex symbolic reasoning, multimodal understanding, and iterative problem solving (Yu et al., 29 Sep 2025, Chen et al., 17 Nov 2025).
1. System Architecture and Components
PhysicsMinions consists of three interconnected "studios," each responsible for a specialized facet of solving Olympiad physics problems:
- Visual Studio: Parses and structurally models all diagrammatic and data-driven content. Its internal pipeline consists of:
  - Inspector: Classifies figures (plots, schematics, free-body diagrams, etc.) and translates their contents to structured JSON, e.g., axis metadata, component lists.
  - Introspector (Image): Audits the JSON, enforces consistency (e.g., unit adherence), fills minor gaps with confidence annotations, and ensures self-contained representation.
  - Verifier (Image): Reconciles the finalized JSON with the source image, generating bug reports if mismatches arise.
- Logic Studio: Manages symbolic and numerical solution processes:
  - Solver: Consumes the problem statement and validated diagram JSON, producing a two-part response: (1) a summary verdict and boxed answer, and (2) a detailed derivation in LaTeX.
  - Introspector (Self-Improve): Tightens the derivation, standardizes notation, corrects computational and logical flaws, focusing special attention on issues highlighted by the Verifiers' bug reports.
- Review Studio: Implements a dual-stage verification cascade:
  - Physics-Verifier: Checks unit consistency, correct constant usage, and contextual appropriateness (e.g., ensuring quantities align with domain expectations such as force or energy).
  - General-Verifier: Performs deep, stepwise logical auditing, ensuring completeness, subpart matching, error-free algebraic manipulations, and correct inference chains.
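As an illustration of what the Visual Studio hand-off might look like, the sketch below builds a small figure record and runs an Introspector-style audit over it. The field names (`figure_type`, `axes`, `peaks`, `confidence`), the sample values, and the `audit_figure` helper are all hypothetical; the source does not specify the actual JSON schema.

```python
import json

# Hypothetical Inspector output for a plot. The schema and the values are
# illustrative assumptions, not the system's actual format.
figure = {
    "figure_type": "plot",
    "axes": {
        "x": {"label": "frequency", "unit": "MHz"},
        "y": {"label": "absorption", "unit": None},  # unit not recovered
    },
    "peaks": [
        {"x": 10.0, "confidence": 0.95},
        {"x": 20.0, "confidence": 0.90},
        {"x": 30.0, "confidence": 0.55},
    ],
}


def audit_figure(fig, min_confidence=0.6):
    """Return a list of bug-report strings; an empty list means no issue found."""
    issues = []
    for name, axis in fig.get("axes", {}).items():
        if not axis.get("unit"):
            issues.append(f"axis '{name}' has no unit")
    for i, peak in enumerate(fig.get("peaks", [])):
        conf = peak.get("confidence", 0.0)
        if conf < min_confidence:
            issues.append(f"peak {i} has low confidence ({conf:.2f})")
    return issues


report = audit_figure(figure)
print(json.dumps({"passed": not report, "issues": report}, indent=2))
```

A non-empty issue list plays the role of a bug report here: it would be handed back for correction rather than passed on to the Solver.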
The studios operate within a coevolutionary feedback loop, iteratively exchanging candidate solutions and bug reports. Convergence is enforced via a "consecutive verification" (CV) criterion: a draft must pass both Review Studio verifiers CV consecutive times before acceptance. The iterative structure enables self-correction and grounds the solution in both domain and general logical validity.
For text-only problems (as in some deployments), Visual Studio is disabled and the workflow proceeds via text prompt engineering and reviewer logic using specialized models such as P1 (Chen et al., 17 Nov 2025).
2. Iterative Refinement and Mathematical Formalization
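PhysicsMinions' optimization scheme is cast as a feedback-driven search over the space of textual solutions, targeting satisfaction of both domain-specific and general logical constraints. Let $S$ denote a candidate solution, and define indicator functions:
- $V_{\text{phy}}(S) = 1$ iff $S$ passes Physics-Verifier,
- $V_{\text{gen}}(S) = 1$ iff $S$ passes General-Verifier.
The system seeks $S^{*}$ with $V_{\text{phy}}(S^{*}) = V_{\text{gen}}(S^{*}) = 1$. At refinement step $t$, $S_t$ is updated using a correction operator $\mathcal{C}$, parameterized by the introspector and informed by the latest bug report $B_t$:
$$S_{t+1} = \mathcal{C}(S_t, B_t)$$
The global objective is defined as:
$$\mathcal{L}(S) = \big(1 - V_{\text{phy}}(S)\big) + \big(1 - V_{\text{gen}}(S)\big)$$
with the process iteratively reducing $\mathcal{L}$, terminating when $\mathcal{L}(S_t) = 0$ is observed over CV consecutive $t$.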
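The stopping rule can be made concrete with a small sketch. It assumes each refinement step yields a pair of pass/fail flags, one per verifier, and that a step counts as clean only when both are true. The function name `first_accept_step` and the sample trace are hypothetical, and the sketch simplifies the real cascade, in which the General-Verifier is only consulted after the Physics-Verifier passes.

```python
def first_accept_step(trace, cv):
    """Return the 0-based step at which a draft is accepted, or None.

    trace: iterable of (passed_physics, passed_general) booleans per step.
    cv:    number of consecutive clean steps required for acceptance.
    """
    streak = 0
    for step, (passed_physics, passed_general) in enumerate(trace):
        if passed_physics and passed_general:
            streak += 1
            if streak >= cv:
                return step
        else:
            streak = 0  # any failure resets the consecutive-pass counter
    return None


trace = [(False, True), (True, True), (True, False), (True, True), (True, True)]
print(first_accept_step(trace, cv=1))  # 1
print(first_accept_step(trace, cv=2))  # 4
print(first_accept_step(trace, cv=3))  # None
```

On this trace a larger CV delays acceptance or prevents it within the budget, which is the trade-off between under-verification and over-iteration.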
Pseudocode for the studio loop:
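```python
def studio_loop(Q, image, CV, max_iter):
    # VisualExtract, Solver, IntrospectorImprove, PhysicsVerify and GeneralVerify
    # are placeholders for the agent calls; S and the bug reports are text.
    I = VisualExtract(image)
    S = Solver(Q, I)                     # initial solution
    best_S_found = S                     # fallback (assumed): latest draft that passed both verifiers
    c, f = 0, 0                          # consecutive passes, consecutive failures
    for _ in range(max_iter):
        S = IntrospectorImprove(S)
        pass_phy, report_phy = PhysicsVerify(S)
        if not pass_phy:
            S = IntrospectorImprove(S + report_phy)
            f += 1
            c = 0
            if f >= CV:                  # repeated failures: restart from a fresh solution
                S = Solver(Q, I)
                f = c = 0
            continue
        pass_gen, report_gen = GeneralVerify(S)
        if not pass_gen:
            S = IntrospectorImprove(S + report_gen)
            f += 1
            c = 0
            if f >= CV:
                S = Solver(Q, I)
                f = c = 0
            continue
        best_S_found = S
        c += 1
        f = 0
        if c >= CV:                      # passed both verifiers CV times in a row
            return S
    return best_S_found
```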
3. Concrete Workflow Examples
Two representative Olympiad problem traces illustrate the pipeline:
A. IPhO Q1-C4: Visual Table Extraction
- Problem: Identify the coordinates of the three peaks in a frequency-absorption plot.
- Visual Studio emits structured JSON documenting peak coordinates with confidence scores.
- Logic Studio reads JSON and directly reports the three values as boxed answers.
- Review Studio passes the solution without need for further derivation or correction.
B. IPhO Q3-A6: Physics Derivation Failure and Correction
- Problem: Symbolic derivation of a characteristic time from the given quantities and constants.
- The single-model baseline produces a result with incorrect units.
- Physics-Verifier flags the unit mismatch in the result; the Introspector corrects the variable substitution, restoring the correct formula and a verified unit structure.
- General-Verifier then identifies additional derivation gaps, prompting further Introspector-led re-derivation. Acceptance follows once both verifiers pass.
In both examples, the iterative review and correction loop systematically improves the solution, in contrast to direct one-shot inference.
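The kind of unit check that catches the failure in example B can be sketched with plain exponent bookkeeping. This is not the Physics-Verifier's actual implementation: the pendulum-style formula stands in for the unspecified IPhO expression, and the helper names (`mul`, `div`, `power`, `check_is_time`) are hypothetical.

```python
from fractions import Fraction

# A dimension is a tuple of exponents over (mass, length, time).
MASS = (Fraction(1), Fraction(0), Fraction(0))
LENGTH = (Fraction(0), Fraction(1), Fraction(0))
TIME = (Fraction(0), Fraction(0), Fraction(1))


def mul(a, b):
    return tuple(x + y for x, y in zip(a, b))


def div(a, b):
    return tuple(x - y for x, y in zip(a, b))


def power(a, p):
    return tuple(x * Fraction(p) for x in a)


ACCELERATION = div(LENGTH, power(TIME, 2))  # length / time^2

# Two candidate expressions for a characteristic time (stand-in example).
good = power(div(LENGTH, ACCELERATION), Fraction(1, 2))  # sqrt(l / g)
bad = power(div(ACCELERATION, LENGTH), Fraction(1, 2))   # sqrt(g / l)


def check_is_time(dim, name):
    """Return (passed, bug_report); bug_report is None when the check passes."""
    if dim == TIME:
        return True, None
    exponents = [str(e) for e in dim]
    return False, f"{name}: expected time, got (M, L, T) exponents {exponents}"


print(check_is_time(good, "sqrt(l/g)"))
print(check_is_time(bad, "sqrt(g/l)"))
```

The second call fails with a report naming the offending exponents, which is the sort of targeted message an Introspector can act on.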
4. Evaluation Metrics and Empirical Results
PhysicsMinions demonstrates significant performance gains on the HiPhO benchmark, which spans recent editions of global Olympiads:
| Model | IPhO | APhO | EuPhO | NBPhO | PanPhO | PanMech | F=MA | Golds (out of 7) |
|---|---|---|---|---|---|---|---|---|
| Gemini-2.5-FT (single) | 20.2 | 27.4 | 13.2 | 29.0 | 44.6 | 60.5 | 17.8 | 6 |
| + PhysicsMinions | 21.5 | 28.0 | 16.5 | 33.3 | 57.8 | 72.0 | 24.0 | 7 |
| Intern-S1 (single) | 15.9 | 21.7 | 9.0 | 23.0 | 41.1 | 60.4 | 18.4 | 2 |
| + PhysicsMinions | 20.8 | 25.2 | 10.1 | 28.9 | 46.8 | 68.7 | 22.7 | 6 |
| Qwen2.5VL-32B (single) | 9.9 | 16.5 | 6.9 | 15.3 | 22.5 | 28.1 | 7.6 | 0 |
| + PhysicsMinions | 12.4 | 17.7 | 9.0 | 21.0 | 29.5 | 36.0 | 12.0 | 2 |
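The per-benchmark gains implied by the table can be recomputed directly. The snippet below only copies the table's numbers and subtracts them; the `gains` helper is a hypothetical name introduced for illustration.

```python
benchmarks = ["IPhO", "APhO", "EuPhO", "NBPhO", "PanPhO", "PanMech", "F=MA"]

# (single-model scores, scores with PhysicsMinions), copied from the table.
scores = {
    "Gemini-2.5-FT": ([20.2, 27.4, 13.2, 29.0, 44.6, 60.5, 17.8],
                      [21.5, 28.0, 16.5, 33.3, 57.8, 72.0, 24.0]),
    "Intern-S1": ([15.9, 21.7, 9.0, 23.0, 41.1, 60.4, 18.4],
                  [20.8, 25.2, 10.1, 28.9, 46.8, 68.7, 22.7]),
    "Qwen2.5VL-32B": ([9.9, 16.5, 6.9, 15.3, 22.5, 28.1, 7.6],
                      [12.4, 17.7, 9.0, 21.0, 29.5, 36.0, 12.0]),
}


def gains(single, boosted):
    """Absolute score gain per benchmark, rounded to one decimal place."""
    return [round(b - s, 1) for s, b in zip(single, boosted)]


for model, (single, boosted) in scores.items():
    delta = dict(zip(benchmarks, gains(single, boosted)))
    print(model, delta)
```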
On the 2025 IPhO, the open-source Intern-S1 with PhysicsMinions achieves a Pass@32 score of 26.8/30 (4th of 406 contestants), outperforming the single-model best of 22.7/30 (22nd place). PhysicsMinions is the first open-source system to win a gold medal under the IPhO average-score metric.
Agentic ablation:
- Pass@k scaling (Intern-S1, IPhO): the single model scores Pass@1 = 15.9 and Pass@32 = 22.7; with PhysicsMinions, Pass@32 rises to 26.8.
- Comparative agentic methods: PhysicsMinions > Self-Refine (×3) > Best-of-3 > Self-MoA.
5. Integration with Specialized Physics Reasoning Models (P1 Family)
The P1 series of models, notably P1-235B-A22B, are trained via reinforcement learning using group sequence policy optimization (GSPO) with truncated importance sampling. PhysicsMinions integrates P1 as both the Solver Minion and Introspector, with the Physics-Verifier leveraging symbolic checkers (SymPy) for rigorous algebraic and dimensionality validation (Chen et al., 17 Nov 2025).
P1’s RL setup:
- State: full context of the problem and generated tokens.
- Action: next token.
- Reward: aggregate binary correctness per boxed sub-answer.
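- Objective: maximize the expected reward $J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[R(x, y)\right]$ over problems $x$ and sampled responses $y$.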
- Loss: GSPO surrogate with length-normalized importance ratio and clipped weights.
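The loss described in the list can be sketched in plain Python. This follows the general shape of a GSPO-style surrogate (group-standardized advantages, a length-normalized sequence-level importance ratio, and clipping) and is not P1's training code. The function names, the averaging of sub-answer correctness into a single reward, and the `clip_eps` value are assumptions, and truncated importance sampling is not modeled.

```python
import math


def sequence_reward(sub_answer_correct):
    """Fraction of boxed sub-answers judged correct (averaging is an assumption)."""
    return sum(sub_answer_correct) / len(sub_answer_correct)


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same problem."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def gspo_surrogate(logp_new, logp_old, rewards, clip_eps=0.2):
    """Clipped sequence-level surrogate, averaged over the group.

    logp_new, logp_old: per-response lists of token log-probabilities under
    the current policy and the sampling policy, respectively.
    """
    advantages = group_advantages(rewards)
    total = 0.0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        # Length-normalized sequence-level importance ratio.
        ratio = math.exp((sum(new) - sum(old)) / len(new))
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(rewards)


rewards = [sequence_reward([1, 1]), sequence_reward([0, 0])]
logp_old = [[-1.0, -1.0], [-1.0, -1.0]]
logp_new = [[-0.5, -0.5], [-1.0, -1.0]]  # first response became more likely
print(gspo_surrogate(logp_new, logp_old, rewards))
```

In the usage example the first response's ratio exceeds `1 + clip_eps`, so its contribution is capped. Maximizing this surrogate would be done with automatic differentiation in practice; the sketch only evaluates it.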
Deployment in PhysicsMinions yields non-trivial inference-time gains:
- P1-235B-A22B alone achieves average 35.9/57.2 (12 gold + 1 silver, HiPhO).
- Combined with PhysicsMinions: 38.4/57.2, 12 gold + 1 silver, #1 overall (surpassing Gemini-2.5-Pro at 37.7).
- At IPhO 2025: P1-235B-A22B+PhysicsMinions achieves 23.2/30 (top single-model result).
6. Generalization, Scaling, and Ablation Studies
PhysicsMinions’ improvements scale with base model capability: large models (Gemini-2.5-FT, Intern-S1) see absolute score gains of up to +6 points, while mid-sized models (Qwen2.5VL-32B) gain +2–3 points. Visual Studio is indispensable for multimodal problems: omitting it drops performance by up to 4.9 points. Review Studio ablation removes up to 2.4 points (depending on which verifier is withheld).
Varying the stopping criterion (CV) shows that CV=2 maximizes accuracy (e.g., 20.8 on IPhO for Intern-S1), with lower CV yielding under-verification and higher CV leading to over-iteration and diminishing returns.
7. Key Insights, Limitations, and Prospective Extensions
Insights:
- Coevolutionary feedback, combining solution synthesis with iterative, bug-report–driven correction, consistently breaks the performance ceiling of single-pass models.
- Structured extraction of diagrammatic information is critical to robust multimodal physics reasoning.
- Dual-stage verification (Physics-Verifier and General-Verifier) offers substantially higher error coverage than monolithic systems.
Limitations:
- Visual Studio may misinterpret fine-grained chart details without further calibration.
- Computational cost is higher than for direct inference, since each problem passes through repeated solver, introspector, and verifier calls.
- Simple problems may incur unnecessary correction cycles due to heuristic stopping.
Future Directions:
- Integration of advanced chart-digitization and computer vision modules for <1% diagram extraction error.
- Coupling with external symbolic and numerical solvers (e.g., SageMath, SciPy) for further algebraic robustness.
- Adapting the coevolutionary multi-agent paradigm to other Olympiad domains, including mathematical proof and complex biological/engineering diagrams.
PhysicsMinions exemplifies a systematic, robust approach to high-level scientific problem solving, setting a new standard for open-source performance in complex symbolic and multimodal benchmarks and providing a platform for generalized, agentic problem solving in STEM contexts (Yu et al., 29 Sep 2025, Chen et al., 17 Nov 2025).