PhyX-AF Benchmark: Multimodal Reasoning
- The paper presents PhyX-AF as a benchmark that integrates visual data with textual prompts to assess physical and mathematical reasoning.
- It comprises 3,000 expert-validated problems across diverse physics domains, requiring physical model grounding and symbolic chaining in real-world contexts.
- The evaluation protocol measures model accuracy, compile accuracy, and semantic accuracy, highlighting gaps between AI and human performance in multimodal inference.
PhyX-AF is a multimodal benchmark for evaluating models on physical and mathematical reasoning in visually grounded, real-world contexts. Unlike prior benchmarks focusing solely on text or symbolic input, PhyX-AF targets the integration of domain-specific knowledge, symbolic chaining, and perceptual grounding—explicitly quantifying a model’s ability to bridge vision and formal reasoning across both physics and mathematics. The benchmark includes complex schematic images and requires alignment between generated symbolic expressions or formal answer schemas and the information grounded in diagrams and text. PhyX-AF is used both as a stand-alone challenge and as part of larger multimodal autoformalization systems, such as MMFormalizer, to assess the state-of-the-art in unified mathematical and physical inference (Shen et al., 21 May 2025, Xiong et al., 6 Jan 2026).
1. Benchmark Foundations and Objectives
PhyX-AF originated to rigorously measure AI models’ capacity for integrated physical reasoning in scenarios demanding more than factual recall or abstract computation. The primary benchmark objectives are:
- To enforce genuine interplay between explicit visual information, domain knowledge, and symbolic operations within implicit and explicit physical constraints;
- To filter out problems solvable from text alone, ensuring the necessity of multimodal (visual and textual) understanding.
The benchmark measures:
- Physical Model Grounding: Correct mapping of scenes and diagrams to physical principles;
- Symbolic and Multi-Formula Reasoning: Chaining of distinct equations or laws for solution derivation;
- Visual Quantification/Interpretation: Extraction of both quantitative and qualitative cues from images (e.g., discrete readings, surface property identification);
- Implicit and Numerical Reasoning: Handling of unstated assumptions and higher-level mathematics (e.g., system evolution, calculus);
- Predictive Reasoning: Anticipating the evolution or outcome of dynamic systems based on initial conditions in the visual/textual input (Shen et al., 21 May 2025, Xiong et al., 6 Jan 2026).
2. Dataset Composition and Coverage
The dataset consists of 3,000 expert-validated, high-fidelity physics problems—each comprising a multimodal pairing of a schematic image and a textual description.
Major characteristics:
- Split: 1,500 multiple-choice (MC), 1,500 open-ended (OE);
- Physics Domains (question count per domain):
- Mechanics (550)
- Electromagnetism (550)
- Thermodynamics (500)
- Wave/Acoustics (500)
- Optics (500)
- Modern Physics (400)
- Twenty-five sub-domains (e.g., Kinematics, Electrostatics, Quantum Phenomena);
- Six physical reasoning types: Model grounding, spatial reasoning, multi-formula chaining, implicit conditions, numerical, and predictive reasoning.
- Distribution: each domain × reasoning-type cell is populated so that coverage is approximately uniform over the 2D grid of domains × reasoning types (an illustrative per-question record layout is sketched immediately below).
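For concreteness, a single item in this layout could be represented as follows. This is an illustrative sketch only: the class name `PhyXItem` and its field names are assumptions for exposition, not the benchmark's released data schema.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Hypothetical record layout for one PhyX-AF item; field names are
# illustrative assumptions, not the benchmark's actual schema.
@dataclass
class PhyXItem:
    question_id: str
    domain: Literal["Mechanics", "Electromagnetism", "Thermodynamics",
                    "Wave/Acoustics", "Optics", "Modern Physics"]
    sub_domain: str                      # e.g. "Kinematics", "Electrostatics"
    reasoning_type: Literal["model_grounding", "spatial", "multi_formula",
                            "implicit_conditions", "numerical", "predictive"]
    format: Literal["MC", "OE"]          # 1,500 items of each format
    question_text: str                   # textual description of the scenario
    image_path: str                      # schematic diagram paired with the text
    options: Optional[list]              # option strings, present only for MC items
    answer: str                          # ground-truth option letter or value
```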
A representative MC example in optics: a schematic of a convex lens with an associated question requiring application of the lens equation $1/f = 1/u + 1/v$, combining model grounding with numerical reasoning (Shen et al., 21 May 2025). A worked instance with illustrative numbers is shown below.
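Assuming the "real-is-positive" sign convention implied by the form $1/f = 1/u + 1/v$, and using illustrative values not drawn from the benchmark ($f = 10\,\text{cm}$, $u = 15\,\text{cm}$):

$$\frac{1}{v} = \frac{1}{f} - \frac{1}{u} = \frac{1}{10} - \frac{1}{15} = \frac{3 - 2}{30} = \frac{1}{30} \quad\Rightarrow\quad v = 30\,\text{cm}.$$

Solving such an item additionally requires reading $f$ and $u$ off the schematic itself, which is where the model-grounding and visual-quantification requirements enter.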
PhyX-AF is also present as a 25-sample subset within the broader MMFormalizer benchmark, where it covers mechanics, electromagnetism, thermodynamics, and modern physics—including tasks derived from real schematic scenes (Xiong et al., 6 Jan 2026).
3. Evaluation Protocol and Metrics
PhyX-AF evaluation relies on reproducible, one-click pipelines (integrated with VLMEvalKit), structured as follows (a minimal sketch of the extraction and judgment steps follows this list):
- Prediction Generation: Chain-of-thought (CoT) prompting over multimodal input;
- Answer Extraction: Rule-based parsing to isolate answers;
- Judgment:
- MC: Direct option match, with LLM fallback for ambiguity.
- OE: LLM judge (e.g., DeepSeek-V3) ensures >99% agreement with ground-truth answers.
- Variants: Models are assessed under different input redundancy regimes (Full-Text, Text-DeRedundancy, Text-Minimal) to probe over-reliance on textual cues.
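The extraction and judgment stages can be sketched roughly as below. This is a minimal illustration under stated assumptions: the regular expression, the `llm_judge` callable, and the fallback logic are hypothetical stand-ins for exposition, not VLMEvalKit's actual API.

```python
import re
from typing import Callable, Optional

def extract_mc_option(prediction: str) -> Optional[str]:
    """Rule-based parsing: pull the last standalone option letter A-E from the CoT output."""
    matches = re.findall(r"\b([A-E])\b", prediction.upper())
    return matches[-1] if matches else None

def judge_mc(prediction: str, gold: str,
             llm_judge: Callable[[str, str], bool]) -> bool:
    """Direct option match, falling back to an LLM judge when the answer is ambiguous."""
    option = extract_mc_option(prediction)
    if option is not None:
        return option == gold.upper()
    return llm_judge(prediction, gold)   # no clean option found: defer to the judge

def judge_oe(prediction: str, gold: str,
             llm_judge: Callable[[str, str], bool]) -> bool:
    """Open-ended answers are scored by the LLM judge (e.g., a DeepSeek-V3 wrapper) directly."""
    return llm_judge(prediction, gold)
```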
Metric Table
| Metric | Definition |
|---|---|
| Accuracy | Fraction of problems whose extracted final answer matches the ground truth (MC and OE). |
| Compile Accuracy | Fraction of generated formalizations that compile (parse/type-check) in the target formal system. |
| Semantic Accuracy | Fraction of formalizations judged semantically equivalent to the reference formalization (LLM-judged). |
Compile and semantic accuracy are used specifically for formal autoformalization settings, as in MMFormalizer (Xiong et al., 6 Jan 2026).
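In ratio form, and assuming semantic accuracy is computed over all $N$ items (whether it is additionally conditioned on successful compilation is not specified here), the three metrics can be written as:

$$\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\hat{a}_i = a_i\right], \qquad \text{Compile Acc.} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\operatorname{compiles}(\hat{F}_i)\right], \qquad \text{Semantic Acc.} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\hat{F}_i \equiv F_i\right],$$

where $\hat{a}_i$ is the extracted final answer, $a_i$ the ground truth, $\hat{F}_i$ the generated formalization, $F_i$ the reference, and $\equiv$ denotes LLM-judged semantic equivalence.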
4. Experimental Results and Model Performance
Frontier models and open-source LLMs are evaluated under various conditions:
- General PhyX-AF physics results (Text-DeRedundancy, OE):
- GPT-4o: 32.5%
- Claude 3.7-Sonnet: 42.2%
- GPT-4o-mini: 45.8%
- Human MC baseline (worst/medium/best): 75.6% / 77.8% / 78.9%
- MC top MLLM: GPT-4o-mini achieves 86.9%, still 9.0 points below best human (95.9%)
- Performance gap: even the worst human MC baseline (75.6%) exceeds GPT-4o-mini's open-ended score (45.8%) by roughly 30 points.
In the context of multimodal autoformalization (MMFormalizer):
- GPT-5: Excels in physical reasoning (compile and semantic accuracy up to ~71% on PhyX Modern Physics when images are provided).
- Gemini-3-Pro: Highest compile accuracy (100%) and up to 80% semantic accuracy for MathVerse plane geometry.
- Open-source models (Qwen3-VL-235B, Qwen2.5): Largely underperform (<30% in key domains).
Domain Difficulty
- Synthetic and analytic geometry remain the most challenging domains in the formalization setting (semantic accuracy <50%), while physics domains are relatively less challenging but still nontrivial (Xiong et al., 6 Jan 2026).
5. Error Taxonomy and Model Limitations
Analysis of 96 mispredicted samples yields the following distribution (approximate raw counts are back-calculated after the table):
| Error Type | Proportion |
|---|---|
| Visual Reasoning Errors | 39.6% |
| Lack of Domain Knowledge | 38.5% |
| Textual Reasoning Errors | 13.5% |
| Calculation Mistakes | 8.3% |
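The stated proportions are consistent with the sample size of 96; the quick back-calculation below recovers the approximate raw counts (the rounded counts are an inference from the percentages, not figures reported in the source).

```python
# Back-calculate approximate raw counts from the reported proportions
# of 96 mispredicted samples; rounded counts are an inference.
proportions = {
    "Visual Reasoning Errors": 0.396,
    "Lack of Domain Knowledge": 0.385,
    "Textual Reasoning Errors": 0.135,
    "Calculation Mistakes": 0.083,
}
counts = {k: round(v * 96) for k, v in proportions.items()}
print(counts)                # {'Visual Reasoning Errors': 38, ...: 37, ...: 13, ...: 8}
print(sum(counts.values()))  # 96, consistent with the analyzed sample size
```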
Key findings:
- Models often repeat textbook statements without visual grounding.
- Visual pattern matching errors (e.g., misreading graphs) are prevalent.
- Overemphasis on closed-form mathematical shortcuts, with insufficient attention to implicit physical conditions.
Case studies include misapplication of friction in “smooth surface” problems and neglect of medium effects in optics depth estimation (Shen et al., 21 May 2025).
Ablation studies in the MMFormalizer setting show improvements in compile accuracy for synthetic geometry when code retrieval is disabled, and underscore the critical role of including diagrams for semantic grounding (+20–30% for certain domains) (Xiong et al., 6 Jan 2026).
6. Extensions, Limitations, and Future Directions
PhyX-AF is designed as a living benchmark; code and data are currently available under the MIT license, with reproducibility ensured via standardized evaluation scripts.
Planned and Active Extensions:
- “AF” (Answer-Format) variant—aligns output with formal answer schemas;
- Real-scene photographic imagery and temporal dynamics;
- Multilingual prompts and video-based physical reasoning challenges.
Current limitations include:
- Benchmark size: the MMFormalizer evaluation set contains only 115 samples (of which 25 form the PhyX-AF subset), leaving potential coverage gaps for diverse physical/mathematical phenomena;
- Open-source LLMs remain noncompetitive on physically grounded scene reasoning;
- Semantic equivalence checking by LLMs can differ from human annotators, motivating future work in robust automated validation.
A plausible implication is that while progress in multimodal and autoformalization capabilities is significant, genuine model understanding—especially in abstract geometric contexts—lags behind human experts.
7. Significance in Multimodal AI Research
PhyX-AF represents the first comprehensive multimodal autoformalization benchmark to systematically span mathematics and physics, bridging the gap between perception and formal inference. Its adoption has exposed critical bottlenecks in current large multimodal models and motivated advances in model architectures, evaluation protocols, and task definitions. The benchmark continues to set the agenda for unified reasoning evaluation and is likely to underpin further developments in both automated theorem proving and real-world scientific AI (Shen et al., 21 May 2025, Xiong et al., 6 Jan 2026).