MMFormalizer: Multimodal Autoformalization
- MMFormalizer is a unified system that autoformalizes math and physics problems by combining natural language and diagram analysis with Lean proof synthesis.
- It employs recursive grounding and adaptive termination to build complex formal proofs from perceptual subgraphs and foundational axioms.
- Benchmarks across mechanics, quantum theory, and geometry show its robust compile and semantic accuracy in generating verified Lean code.
MMFormalizer is a unified multimodal autoformalization system that translates mathematics and physics problems containing both natural language and perceptual content (such as diagrams or scenes) into formal Lean proofs. It addresses longstanding obstacles in autoformalization by extending beyond the textual domain: it integrates perceptually grounded, scene-based reasoning with semantic alignment to formal mathematics and physics, and it employs recursive abstraction with principled termination grounded in both logic and dimensional analysis. It is the first system in the literature equipped to formalize classical mechanics (including Hamiltonian systems), relativity, quantum mechanics, and thermodynamics from combined text and visual inputs (Xiong et al., 6 Jan 2026).
1. Motivation and Problem Setting
Classical autoformalization pipelines, focused solely on text-to-proof translation, cannot resolve the multimodal dependencies intrinsic to scientific reasoning. Many physical problems embed critical information within diagrams: quantities such as mass, energy, and geometric constraints that are essential for formulating correct formal statements. Moreover, formalizing higher-level concepts in, say, classical mechanics or quantum theory requires recursively assembling abstractions from more fundamental axioms or dimensional primitives.
MMFormalizer systematically addresses:
- Multimodal grounding: Recovering hidden variables and relations that are only inferable from images and scenes (e.g., identifying a “mass” from a labeled particle in a diagram; reconstructing geometric configurations from points and lines).
- Recursive abstraction: Building complex formal systems from grounded primitives, with mechanisms to adaptively determine when to halt this recursive assembly—using both empirical evidence and foundational axioms/dimensions.
2. System Architecture and Workflow
MMFormalizer's architecture consists of three principal interconnected stages—Recursive Grounding, Adaptive Termination, and Axiom Composition—all within a Lean 4, mathlib4, and PhysLean environment augmented by LeanSearch for semantic retrieval.
2.1 Perceptual Parsing and Representation
Input images are parsed into discrete, grounded scene graphs

$$G_0 = (V_0, E_0),$$

where $V_0$ is the set of visual primitives (points, segments, labeled bodies) and $E_0$ the set of spatial and semantic relations among them. Each primitive $v \in V_0$ is assigned an informal label $\ell(v)$ (e.g., “particle of mass $m$”), which forms the base layer for subsequent logical abstraction.
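The scene-graph layer can be sketched as Lean structures. This is a minimal illustration with hypothetical names (`PrimKind`, `Primitive`, `SceneRel`), not the paper's actual encoding:

```lean
-- Hypothetical sketch of a grounded scene graph G₀ = (V₀, E₀).
inductive PrimKind
  | point | segment | body | labelMark

structure Primitive where
  kind  : PrimKind
  label : String        -- informal label, e.g. "particle of mass m"

structure SceneRel where
  src  : Nat            -- indices into the primitive list
  dst  : Nat
  name : String         -- e.g. "rests_on", "attached_to"

structure SceneGraph where
  prims : List Primitive  -- V₀
  rels  : List SceneRel   -- E₀
```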
2.2 Logical Layers and PropChains
Formal statements are managed in chains of dependent lemmas,

$$\mathrm{PropChain} = [(P_1, \pi_1), \ldots, (P_n, \pi_n)],$$

where each $P_i$ is a formal proposition and $\pi_i$ its proof term. Lemmas at each graph abstraction level $t$ are organized via

$$L_t = \mathrm{Grounding}(G_t, P_t),$$

with a lifting operator ensuring that visual structures are transformed into formal logical dependencies.
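In Lean 4, a proposition paired with its proof term can be expressed with the dependent-pair type `PSigma` (the `×'` notation); a PropChain is then a list of such pairs. This is a sketch of the idea, not the system's internal representation:

```lean
-- A lemma packages a proposition with a proof of that proposition.
abbrev Lemma := (P : Prop) ×' P

-- A PropChain is an ordered list of dependent lemmas.
abbrev PropChain := List Lemma

-- Example: a two-element chain of trivially provable facts.
def exampleChain : PropChain :=
  [⟨True, trivial⟩, ⟨1 + 1 = 2, rfl⟩]
```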
2.3 Recursive Grounding
For every node $v \in V_t$, MMFormalizer:
- Focuses on a visual subgraph $G_{t+1} \subseteq G_t$ spanning the relevant primitives.
- Produces candidate informal propositions $p(v, \ell \mid G_t)$.
- Employs semantic search against mathlib/PhysLean to retrieve and align each candidate with corresponding formal definitions.
- Extends recursively, generating $L_{t+1} = \mathrm{Grounding}(G_{t+1}, P_{t+1})$.
2.4 Adaptive Recursive Termination
A branch’s recursion halts if its informal predicate $p$ is recognized as:
- A dimensional primitive: $p \in D_t$, e.g., Mass, Force, Energy.
- A fundamental axiom: $p \in A_t$, e.g., Newton’s laws, Maxwell’s equations.
This adaptivity ensures branches terminate precisely when formal or empirical foundations are reached.
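The termination test itself is a simple disjunction over the two foundation sets. A hypothetical Lean sketch, with `Dim` and `Axm` standing in for $D_t$ and $A_t$:

```lean
-- Hypothetical sketch of the adaptive termination test.
inductive Dim | mass | force | energy | time | length

inductive Axm | newtonSecond | maxwellEqs | schrodingerEq

inductive Predicate
  | dim (d : Dim)              -- a dimensional primitive
  | ax  (a : Axm)              -- a fundamental axiom
  | composite (name : String)  -- still reducible: recurse further

/-- A branch halts exactly when its predicate is a dimensional
primitive or a fundamental axiom. -/
def terminated : Predicate → Bool
  | .dim _       => true
  | .ax _        => true
  | .composite _ => false
```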
2.5 Axiom Composition and Synthesis
Terminal branches contribute to an axiom chain,

$$\mathrm{AxiomChain} = \mathrm{Compose}(L_0, \ldots, L_T),$$

with $\mathrm{Compose}$ recursively merging lemmas upward. The final Lean code is type-checked and semantically verified.
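Bottom-up merging of the per-level lemma lists can be sketched as a fold; `LemmaRef` is a hypothetical stand-in for a (proposition, proof term) pair:

```lean
-- Hypothetical sketch: merge per-level lemma lists bottom-up
-- into a single axiom chain.
abbrev LemmaRef := String

def composeChain (levels : List (List LemmaRef)) : List LemmaRef :=
  levels.foldr (· ++ ·) []
```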
2.6 High-Level Pseudocode
```text
Input: image I, text T
 1. G₀ ← parse(I)                        # SceneGraph
 2. t ← 0; initialize P₀ from perceptual labels in G₀
 3. L₀ ← Grounding(G₀, P₀)               # list of Lemmas
 4. while True:
 5.   if ∃ p ∈ Pₜ s.t. p ∈ Dₜ ∪ Aₜ:      # dimensional or axiom hit
 6.     mark branch as terminated; break
 7.   Gₜ₊₁ ← select_subgraph(Gₜ)         # visual decomposition
 8.   Pₜ₊₁ ← {p(v, ℓ | Gₜ) ∣ v ∈ Vₜ₊₁}
 9.   Lₜ₊₁ ← Grounding(Gₜ₊₁, Pₜ₊₁)
10.   t ← t + 1
11. end while
12. Compose all Lₜ bottom-up into final AxiomChain
13. Verify Lean compilation and semantic checking
Output: Lean code + proof terms
```
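The main loop admits a compact functional rendering in Lean. All names here (`LoopState`, `hitsFoundation`, `formalizeLoop`) are illustrative stand-ins for the system's LLM-driven components:

```lean
-- Hypothetical functional rendering of the main loop.
structure LoopState where
  graph  : List String          -- current subgraph, as informal labels
  props  : List String          -- candidate informal propositions Pₜ
  lemmas : List (List String)   -- accumulated lemma lists L₀ … Lₜ
  deriving Inhabited

-- Stand-ins for the dimensional primitives Dₜ and axioms Aₜ.
def dims : List String := ["Mass", "Force", "Energy"]
def axms : List String := ["NewtonSecond", "MaxwellEqs"]

def hitsFoundation (s : LoopState) : Bool :=
  s.props.any fun p => dims.contains p || axms.contains p

-- Iterate one decomposition step until a foundation is reached.
partial def formalizeLoop (step : LoopState → LoopState)
    (s : LoopState) : LoopState :=
  if hitsFoundation s then s else formalizeLoop step (step s)
```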
3. Benchmark: PhyX-AF
Evaluation is carried out on the PhyX-AF benchmark, constructed from 115 curated multimodal samples:
- MathVerse: Plane and solid geometry, function graphs
- PhyX: Mechanics, electromagnetism, thermodynamics, modern physics (relativity, quantum theory)
- Synthetic Geometry: Out-of-distribution Euclidean constructions
- Analytic Geometry: Cartesian and coordinate-based problems
Each sample consists of an image and a text statement; text-only instances are excluded. Three key metrics are reported:
- Compile accuracy: Percentage of Lean code that type-checks
- Semantic accuracy: Percentage of code semantically capturing the intended result
- Human verification: Manual check of final proofs
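In symbols, writing $S$ for the sample set, the two automatic metrics are (our notation, not the paper's):

$$\mathrm{CompileAcc} = \frac{\lvert\{s \in S : s \text{ type-checks}\}\rvert}{\lvert S\rvert}, \qquad \mathrm{SemAcc} = \frac{\lvert\{s \in S : s \text{ is semantically faithful}\}\rvert}{\lvert S\rvert}.$$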
4. Empirical Results
Frontier LLMs outperform open-source models on PhyX-AF, especially on physics problems in modern domains such as relativity and quantum theory. Key findings include:
| Model | Compile Accuracy (img/text) | Semantic Accuracy (img/text) |
|---|---|---|
| GPT-5 | 71.4 % / 16.7 % | 71.4 % / 0.0 % |
| Gemini-3-Pro | 28.6 % / 42.9 % | 28.6 % / 28.6 % |
| Qwen3-VL-235B | 0.0 % / 0.0 % | 0.0 % / 0.0 % |
- Physics (PhyX): GPT-5 achieves the highest compile and semantic accuracy, particularly in quantum and relativity tasks.
- Geometry (Synthetic and Analytic): Remains the most challenging subdomain. Deviations in angle/length reasoning and out-of-distribution (OOD) constructions limit model performance.
- Model Supervision: Gemini-2.5-Pro reliably agrees with human semantic labels, implying that weaker models can supervise and audit stronger ones for consistency.
A plausible implication is that supervised semantic metrics from lightweight models might serve as effective proxies for human evaluation in future autoformalization pipelines (Xiong et al., 6 Jan 2026).
5. Case Studies Demonstrating Multimodal Autoformalization
MMFormalizer is demonstrated on representative problem types:
- Regular Hexagonal Prism (3D geometry, image input):
```lean
structure HexagonalPrism (V : Type*) [MetricSpace V] :=
  (base₁ : RegularPolygon V 6)
  (base₂ : RegularPolygon V 6)
  (laterals : List (Segment V))
  (perp_faces : ∀ e ∈ laterals, ...)
```
- From Hamiltonian to Newton’s Laws: MMFormalizer retrieves Hamilton’s equations, $\dot q = \partial H/\partial p$ and $\dot p = -\partial H/\partial q$, and through recursive grounding terminates at Newton’s second law, $F = ma$.
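For the standard one-dimensional Hamiltonian $H(q, p) = p^2/(2m) + V(q)$, the reduction can be checked by hand:

$$\dot q = \frac{\partial H}{\partial p} = \frac{p}{m}, \qquad \dot p = -\frac{\partial H}{\partial q} = -V'(q) = F(q),$$

so $m\ddot q = \dot p = F$, i.e., Newton's second law.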
- Quantum Tunneling: From the boundary conditions and the Schrödinger equation, MMFormalizer builds the barrier transmission relation, terminating at the time-independent Schrödinger equation as a fundamental axiom.
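As a concrete illustration (the standard rectangular-barrier result, not necessarily the paper's exact statement): for a barrier of height $V_0 > E$ and width $L$, the transmission coefficient satisfies

$$T \approx e^{-2\kappa L}, \qquad \kappa = \frac{\sqrt{2m(V_0 - E)}}{\hbar}.$$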
- Relativistic Velocity Addition:

```lean
def rel_vel (u v : Velocity) : Velocity := (u + v) / (1 + (u * v) / c^2)
```

For $u = v = c$, this evaluates to $2c/2 = c$, recovering the invariance of the speed of light.
6. System Limitations and Prospects
MMFormalizer remains subject to several limitations:
- Geometry: Substantial computational and representational difficulty arises in OOD geometric configurations, especially those involving strict angle/length interpretation from novel diagrams.
- Recursive Over-Decomposition: Without robust termination heuristics, recursion can generate excessive subproblems, leading to combinatorial search blow-up.
- Retrieval Dependence: When formal proof elements are absent from the underlying libraries, the system may have to generate new Lean types, a process susceptible to error.
Planned future directions include:
- Enhanced perceptual-symbolic methods, such as graph neural networks for richer scene graph representations.
- Learned heuristics or value functions for adaptive recursion pruning.
- Extension to physics domains beyond those currently supported, e.g., fluid mechanics or general relativity.
- Closer integration with automated provers, possibly via reinforcement learning-guided proof search.
MMFormalizer’s capacity to unify scene-based reasoning with deep formal proof synthesis positions it as a pioneering framework for multimodal mathematical and physical autoformalization in open-world settings (Xiong et al., 6 Jan 2026).