MMFormalizer: Multimodal Autoformalization
- MMFormalizer is a unified system that autoformalizes math and physics problems by combining natural language and diagram analysis with Lean proof synthesis.
- It employs recursive grounding and adaptive termination to build complex formal proofs from perceptual subgraphs and foundational axioms.
- Benchmarks across mechanics, quantum theory, and geometry show its robust compile and semantic accuracy in generating verified Lean code.
MMFormalizer is a unified multimodal autoformalization system that translates mathematics and physics problems containing both natural language and perceptual content (such as diagrams or scenes) into formal Lean proofs. It addresses longstanding obstacles in autoformalization by extending beyond the textual domain: it integrates perceptually grounded, scene-based reasoning with semantic alignment to formal mathematics and physics, and it employs recursive abstraction with principled termination grounded in both logic and dimensional analysis. It is the first system in the literature equipped to formalize classical mechanics (including Hamiltonian systems), relativity, quantum mechanics, and thermodynamics from combined text and visual inputs (Xiong et al., 6 Jan 2026).
1. Motivation and Problem Setting
Classical autoformalization pipelines, focused solely on text-to-proof translation, cannot resolve the multimodal dependencies intrinsic to scientific reasoning. Many physical problems embed critical information within diagrams: quantities such as mass, energy, and geometric constraints that are essential for formulating correct formal statements. Moreover, formalizing higher-level concepts in, say, classical mechanics or quantum theory requires recursively assembling abstractions from more fundamental axioms or dimensional primitives.
MMFormalizer systematically addresses:
- Multimodal grounding: Recovering hidden variables and relations that are only inferable from images and scenes (e.g., identifying a “mass” from a labeled particle in a diagram; reconstructing geometric configurations from points and lines).
- Recursive abstraction: Building complex formal systems from grounded primitives, with mechanisms to adaptively determine when to halt this recursive assembly—using both empirical evidence and foundational axioms/dimensions.
2. System Architecture and Workflow
MMFormalizer's architecture consists of three principal interconnected stages—Recursive Grounding, Adaptive Termination, and Axiom Composition—all within a Lean 4, mathlib4, and PhysLean environment augmented by LeanSearch for semantic retrieval.
2.1 Perceptual Parsing and Representation
Input images are parsed into discrete, grounded scene graphs

$$G_0 = (V_0, E_0),$$

where $V_0$ is the set of visual primitives (points, segments, labeled bodies) and $E_0$ the set of spatial and semantic relations among them. Each primitive $v \in V_0$ is assigned an informal label $\ell(v)$ (e.g., “particle of mass $m$”), which forms the base layer for subsequent logical abstraction.
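The scene-graph layer can be sketched as Lean structures. This is a minimal illustration with hypothetical names (`PrimKind`, `Primitive`, `SceneRel`), not the paper's actual encoding:

```lean
-- Hypothetical sketch of a grounded scene graph G₀ = (V₀, E₀).
inductive PrimKind
  | point | segment | body | labelMark

structure Primitive where
  kind  : PrimKind
  label : String        -- informal label, e.g. "particle of mass m"

structure SceneRel where
  src  : Nat            -- indices into the primitive list
  dst  : Nat
  name : String         -- e.g. "rests_on", "attached_to"

structure SceneGraph where
  prims : List Primitive  -- V₀
  rels  : List SceneRel   -- E₀
```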
2.2 Logical Layers and PropChains
Formal statements are managed in chains of dependent lemmas,

$$\mathrm{PropChain} = [(P_1, \pi_1), \ldots, (P_n, \pi_n)],$$

where each $P_i$ is a formal proposition and $\pi_i$ its proof term. Lemmas at each graph abstraction level $t$ are organized via

$$L_t = \mathrm{Grounding}(G_t, P_t),$$

with a lifting operator ensuring that visual structures are transformed into formal logical dependencies.
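In Lean 4, a proposition paired with its proof term can be expressed with the dependent-pair type `PSigma` (the `×'` notation); a PropChain is then a list of such pairs. This is a sketch of the idea, not the system's internal representation:

```lean
-- A lemma packages a proposition with a proof of that proposition.
abbrev Lemma := (P : Prop) ×' P

-- A PropChain is an ordered list of dependent lemmas.
abbrev PropChain := List Lemma

-- Example: a two-element chain of trivially provable facts.
def exampleChain : PropChain :=
  [⟨True, trivial⟩, ⟨1 + 1 = 2, rfl⟩]
```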
2.3 Recursive Grounding
For every node $v \in V_t$, MMFormalizer:
- Focuses on a visual subgraph $G_{t+1} \subseteq G_t$ spanning the relevant primitives.
- Produces candidate informal propositions $p(v, \ell \mid G_t)$.
- Employs semantic search against mathlib/PhysLean to retrieve and align each candidate with corresponding formal definitions.
- Extends recursively, generating $L_{t+1} = \mathrm{Grounding}(G_{t+1}, P_{t+1})$.
2.4 Adaptive Recursive Termination
A branch’s recursion halts if its informal predicate $p$ is recognized as:
- A dimensional primitive: $p \in D_t$, e.g., Mass, Force, Energy.
- A fundamental axiom: $p \in A_t$, e.g., Newton’s laws, Maxwell’s equations.
This adaptivity ensures branches terminate precisely when formal or empirical foundations are reached.
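The termination test itself is a simple disjunction over the two foundation sets. A hypothetical Lean sketch, with `Dim` and `Axm` standing in for $D_t$ and $A_t$:

```lean
-- Hypothetical sketch of the adaptive termination test.
inductive Dim | mass | force | energy | time | length

inductive Axm | newtonSecond | maxwellEqs | schrodingerEq

inductive Predicate
  | dim (d : Dim)              -- a dimensional primitive
  | ax  (a : Axm)              -- a fundamental axiom
  | composite (name : String)  -- still reducible: recurse further

/-- A branch halts exactly when its predicate is a dimensional
primitive or a fundamental axiom. -/
def terminated : Predicate → Bool
  | .dim _       => true
  | .ax _        => true
  | .composite _ => false
```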
2.5 Axiom Composition and Synthesis
Terminal branches contribute to an axiom chain,

$$\mathrm{AxiomChain} = \mathrm{Compose}(L_0, \ldots, L_T),$$

with $\mathrm{Compose}$ recursively merging lemmas upward. The final Lean code is type-checked and semantically verified.
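Bottom-up merging of the per-level lemma lists can be sketched as a fold; `LemmaRef` is a hypothetical stand-in for a (proposition, proof term) pair:

```lean
-- Hypothetical sketch: merge per-level lemma lists bottom-up
-- into a single axiom chain.
abbrev LemmaRef := String

def composeChain (levels : List (List LemmaRef)) : List LemmaRef :=
  levels.foldr (· ++ ·) []
```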
2.6 High-Level Pseudocode
```text
Input: image I, text T
 1. G₀ ← parse(I)                        # SceneGraph
 2. t ← 0; initialize P₀ from perceptual labels in G₀
 3. L₀ ← Grounding(G₀, P₀)               # list of Lemmas
 4. while True:
 5.   if ∃ p ∈ Pₜ s.t. p ∈ Dₜ ∪ Aₜ:      # dimensional or axiom hit
 6.     mark branch as terminated; break
 7.   Gₜ₊₁ ← select_subgraph(Gₜ)         # visual decomposition
 8.   Pₜ₊₁ ← {p(v, ℓ | Gₜ) ∣ v ∈ Vₜ₊₁}
 9.   Lₜ₊₁ ← Grounding(Gₜ₊₁, Pₜ₊₁)
10.   t ← t + 1
11. end while
12. Compose all Lₜ bottom-up into final AxiomChain
13. Verify Lean compilation and semantic checking
Output: Lean code + proof terms
```
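The main loop admits a compact functional rendering in Lean. All names here (`LoopState`, `hitsFoundation`, `formalizeLoop`) are illustrative stand-ins for the system's LLM-driven components:

```lean
-- Hypothetical functional rendering of the main loop.
structure LoopState where
  graph  : List String          -- current subgraph, as informal labels
  props  : List String          -- candidate informal propositions Pₜ
  lemmas : List (List String)   -- accumulated lemma lists L₀ … Lₜ
  deriving Inhabited

-- Stand-ins for the dimensional primitives Dₜ and axioms Aₜ.
def dims : List String := ["Mass", "Force", "Energy"]
def axms : List String := ["NewtonSecond", "MaxwellEqs"]

def hitsFoundation (s : LoopState) : Bool :=
  s.props.any fun p => dims.contains p || axms.contains p

-- Iterate one decomposition step until a foundation is reached.
partial def formalizeLoop (step : LoopState → LoopState)
    (s : LoopState) : LoopState :=
  if hitsFoundation s then s else formalizeLoop step (step s)
```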
3. Benchmark: PhyX-AF
Evaluation is carried out on the PhyX-AF benchmark, constructed from 115 curated multimodal samples:
- MathVerse: Plane and solid geometry, function graphs
- PhyX: Mechanics, electromagnetism, thermodynamics, modern physics (relativity, quantum theory)
- Synthetic Geometry: Out-of-distribution Euclidean constructions
- Analytic Geometry: Cartesian and coordinate-based problems
Each sample consists of an image and a text statement; text-only instances are excluded. Three key metrics are reported:
- Compile accuracy: Percentage of Lean code that type-checks
- Semantic accuracy: Percentage of code semantically capturing the intended result
- Human verification: Manual check of final proofs
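In symbols, writing $S$ for the sample set, the two automatic metrics are (our notation, not the paper's):

$$\mathrm{CompileAcc} = \frac{\lvert\{s \in S : s \text{ type-checks}\}\rvert}{\lvert S\rvert}, \qquad \mathrm{SemAcc} = \frac{\lvert\{s \in S : s \text{ is semantically faithful}\}\rvert}{\lvert S\rvert}.$$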
4. Empirical Results
Frontier LLMs outperform open-source models on PhyX-AF, especially on physics problems in modern domains such as relativity and quantum theory. Key findings include:
| Model | Compile Accuracy (img/text) | Semantic Accuracy (img/text) |
|---|---|---|
| GPT-5 | 71.4 % / 16.7 % | 71.4 % / 0.0 % |
| Gemini-3-Pro | 28.6 % / 42.9 % | 28.6 % / 28.6 % |
| Qwen3-VL-235B | 0.0 % / 0.0 % | 0.0 % / 0.0 % |
- Physics (PhyX): GPT-5 achieves the highest compile and semantic accuracy, particularly in quantum and relativity tasks.
- Geometry (Synthetic and Analytic): Remains the most challenging subdomain. Deviations in angle/length reasoning and out-of-distribution (OOD) constructions limit model performance.
- Model Supervision: Gemini-2.5-Pro reliably agrees with human semantic labels, implying that weaker models can supervise and audit stronger ones for consistency.
A plausible implication is that supervised semantic metrics from lightweight models might serve as effective proxies for human evaluation in future autoformalization pipelines (Xiong et al., 6 Jan 2026).
5. Case Studies Demonstrating Multimodal Autoformalization
MMFormalizer is demonstrated on representative problem types:
- Regular Hexagonal Prism (3D geometry, image input):
```lean
structure HexagonalPrism (V : Type*) [MetricSpace V] :=
  (base₁ : RegularPolygon V 6)
  (base₂ : RegularPolygon V 6)
  (laterals : List (Segment V))
  (perp_faces : ∀ e ∈ laterals, ...)
```
- From Hamiltonian to Newton’s Laws: MMFormalizer retrieves Hamilton’s equations, $\dot q = \partial H/\partial p$ and $\dot p = -\partial H/\partial q$, and through recursive grounding terminates at Newton’s second law, $F = ma$.
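For the standard one-dimensional Hamiltonian $H(q, p) = p^2/(2m) + V(q)$, the reduction can be checked by hand:

$$\dot q = \frac{\partial H}{\partial p} = \frac{p}{m}, \qquad \dot p = -\frac{\partial H}{\partial q} = -V'(q) = F(q),$$

so $m\ddot q = \dot p = F$, i.e., Newton's second law.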
- Quantum Tunneling: From the boundary conditions and the Schrödinger equation, MMFormalizer builds the barrier transmission relation, terminating at the time-independent Schrödinger equation as a fundamental axiom.
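As a concrete illustration (the standard rectangular-barrier result, not necessarily the paper's exact statement): for a barrier of height $V_0 > E$ and width $L$, the transmission coefficient satisfies

$$T \approx e^{-2\kappa L}, \qquad \kappa = \frac{\sqrt{2m(V_0 - E)}}{\hbar}.$$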
- Relativistic Velocity Addition:

```lean
def rel_vel (u v : Velocity) : Velocity := (u + v) / (1 + (u * v) / c^2)
```

For $u = v = c$, this evaluates to $2c/2 = c$, recovering the invariance of the speed of light.
6. System Limitations and Prospects
MMFormalizer remains subject to several limitations:
- Geometry: Substantial computational and representational difficulty arises in OOD geometric configurations, especially those involving strict angle/length interpretation from novel diagrams.
- Recursive Over-Decomposition: Without robust termination heuristics, recursion can generate excessive subproblems, leading to combinatorial search blow-up.
- Retrieval Dependence: When formal proof elements are absent from the underlying libraries, the system may have to generate new Lean types, a process susceptible to error.
Planned future directions include:
- Enhanced perceptual-symbolic methods, such as graph neural networks for richer scene graph representations.
- Learned heuristics or value functions for adaptive recursion pruning.
- Extension to physics domains beyond those currently supported, e.g., fluid mechanics or general relativity.
- Closer integration with automated provers, possibly via reinforcement learning-guided proof search.
MMFormalizer’s capacity to unify scene-based reasoning with deep formal proof synthesis positions it as a pioneering framework for multimodal mathematical and physical autoformalization in open-world settings (Xiong et al., 6 Jan 2026).