
MMFormalizer: Multimodal Autoformalization

Updated 12 January 2026
  • MMFormalizer is a unified system that autoformalizes math and physics problems by combining natural language and diagram analysis with Lean proof synthesis.
  • It employs recursive grounding and adaptive termination to build complex formal proofs from perceptual subgraphs and foundational axioms.
  • Benchmarks across mechanics, quantum theory, and geometry show its robust compile and semantic accuracy in generating verified Lean code.

MMFormalizer is a unified multimodal autoformalization system that translates mathematics and physics problems containing both natural language and perceptual content (such as diagrams or scenes) into formal Lean proofs. It addresses longstanding obstacles in autoformalization by extending beyond the textual domain: it integrates perceptually grounded, scene-based reasoning with semantic alignment to formal mathematics and physics, and employs recursive abstraction with principled termination grounded in both logic and dimensional analysis. It is the first system in the literature equipped to formalize classical mechanics (including Hamiltonian systems), relativity, quantum mechanics, and thermodynamics from both text and visual inputs (Xiong et al., 6 Jan 2026).

1. Motivation and Problem Setting

Classical autoformalization pipelines, focused solely on text-to-proof translation, cannot resolve multimodal dependencies intrinsic to scientific reasoning. Many physical problems embed critical information within diagrams—quantities such as mass, energy, and geometric constraints essential for formulating correct formal statements. Additionally, formulating higher-level concepts in, for example, classical mechanics or quantum theory, necessitates recursively assembling abstractions from more fundamental axioms or dimensional primitives.

MMFormalizer systematically addresses:

  • Multimodal grounding: Recovering hidden variables and relations that are only inferable from images and scenes (e.g., identifying a “mass” from a labeled particle in a diagram; reconstructing geometric configurations from points and lines).
  • Recursive abstraction: Building complex formal systems from grounded primitives, with mechanisms to adaptively determine when to halt this recursive assembly—using both empirical evidence and foundational axioms/dimensions.

2. System Architecture and Workflow

MMFormalizer's architecture consists of three principal interconnected stages—Recursive Grounding, Adaptive Termination, and Axiom Composition—all within a Lean 4, mathlib4, and PhysLean environment augmented by LeanSearch for semantic retrieval.

2.1 Perceptual Parsing and Representation

Input images $I : \mathsf{Image}$ are parsed by $\mathsf{parse} : I \to \mathsf{SceneGraph}$ into discrete, grounded scene graphs:

\mathsf{SceneGraph} := \Sigma(V_t : \mathsf{List\;Primitive}),\; \mathsf{Rel}(V_t)

where

\mathsf{Primitive} ::= \texttt{point} \mid \texttt{line} \mid \texttt{region}

and

\mathsf{Rel}(V_t) : \Pi(v_i, v_j \in V_t),\; \mathsf{SpatialRel}(v_i, v_j)

Each primitive $v_i$ is assigned an informal label $l_t$ (e.g., “particle of mass $m$”), which forms the base layer $L_0$ for subsequent logical abstraction.
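To make the scene-graph representation concrete, here is a minimal Python sketch; the actual system encodes these structures as Lean dependent types, and all names below are illustrative:

```python
from dataclasses import dataclass, field
from enum import Enum

class PrimitiveKind(Enum):
    POINT = "point"
    LINE = "line"
    REGION = "region"

@dataclass(frozen=True)
class Primitive:
    kind: PrimitiveKind
    label: str  # informal label l_t, e.g. "particle of mass m"

@dataclass
class SceneGraph:
    vertices: list                                  # V_t : List Primitive
    relations: dict = field(default_factory=dict)   # (i, j) -> spatial relation

# Example scene: a point mass resting on an inclined line
mass = Primitive(PrimitiveKind.POINT, "particle of mass m")
incline = Primitive(PrimitiveKind.LINE, "frictionless incline")
g = SceneGraph(vertices=[mass, incline], relations={(0, 1): "on"})
```

The relation map corresponds to the dependent $\mathsf{Rel}(V_t)$ component, keyed by vertex indices.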

2.2 Logical Layers and PropChains

Formal statements are managed in chains of dependent lemmas: $\mathsf{Lemma} := \Sigma(P : \mathsf{PropChain}),\; (p : P)$ where $P$ is a formal proposition and $p : P$ its proof term. Lemmas at each graph abstraction level are organized via

\mathsf{PropChain} := \Sigma(L_t : \mathsf{List\;Lemma})

with a lifting operator $\mathsf{lift} : \mathsf{SceneGraph} \to \mathsf{PropChain}$ ensuring that visual structures are transformed into formal logical dependencies.
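A minimal sketch of the lift operator, assuming a toy representation in which labels and spatial relations become informal proposition strings (hypothetical names; the real operator produces Lean propositions for grounding):

```python
def lift(vertices, relations):
    """Toy lift: SceneGraph -> PropChain.
    Each labeled primitive and each spatial relation yields one
    candidate informal proposition awaiting formal grounding."""
    props = [f"there exists an entity such that {v}" for v in vertices]
    props += [f"{vertices[i]} is {rel} {vertices[j]}"
              for (i, j), rel in relations.items()]
    return props

chain = lift(["particle of mass m", "frictionless incline"], {(0, 1): "on"})
# chain has 2 entity propositions plus 1 relational proposition
```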

2.3 Recursive Grounding

For every node tt, MMFormalizer:

  • Focuses on a visual subgraph $G_t$ spanning the relevant primitives.
  • Produces candidate informal propositions $P_t := \{p(v_i, l_t \mid G_{t-1}) \mid v_i \in V_t\}$.
  • Employs semantic search against mathlib/PhysLean to retrieve and align $p \in P_t$ with corresponding formal definitions: $\mathsf{Grounding} : G_{t-1} \to P_t \to \mathsf{Lemma}$
  • Extends recursively, generating

L_{t+1} := \mathsf{Grounding}(G_t, P_{t+1})
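One grounding step can be sketched as retrieval against a toy lookup table standing in for LeanSearch over mathlib4/PhysLean (all library names hypothetical):

```python
# Toy stand-in for semantic retrieval over mathlib4/PhysLean.
LIBRARY = {
    "mass": "PhysLean.Mass",
    "force": "PhysLean.Force",
    "incline": "EuclideanGeometry.Line",
}

def grounding(propositions):
    """Align each informal proposition with a formal definition, if any.
    A None result marks a proposition that must be decomposed further."""
    lemmas = []
    for p in propositions:
        hits = [name for key, name in LIBRARY.items() if key in p]
        lemmas.append((p, hits[0] if hits else None))
    return lemmas

lemmas = grounding(["particle of mass m", "normal force on the incline"])
```

Propositions with no retrieval hit are exactly the ones the recursion expands into finer subgraphs.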

2.4 Adaptive Recursive Termination

A branch’s recursion halts if its informal predicate $p \in P_t$ is recognized as:

  • A dimensional primitive: $\mathsf{Termination}(P_t) = \mathsf{dim}(p)$, e.g., $[M]$, $[L]$, $[T]$, Force, Energy.
  • A fundamental axiom: $\mathsf{Termination}(P_t) = \mathsf{axiom}(p)$, e.g., Newton’s laws, Maxwell’s equations.

This adaptivity ensures branches terminate precisely when formal or empirical foundations are reached.
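A minimal sketch of this termination test, assuming explicit sets of dimensional primitives and axioms (illustrative contents only):

```python
# Illustrative termination sets; the real system recognizes these
# via dimensional analysis and axiom matching in Lean.
DIMENSIONS = {"[M]", "[L]", "[T]", "Force", "Energy"}
AXIOMS = {"Newton's second law", "Maxwell's equations",
          "Schrodinger equation"}

def should_terminate(predicate: str) -> bool:
    """A branch halts when its predicate is a dimensional
    primitive or a fundamental axiom."""
    return predicate in DIMENSIONS or predicate in AXIOMS

terminated = should_terminate("Force")                   # dimensional hit
expanded = should_terminate("net torque about the pivot")  # keep decomposing
```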

2.5 Axiom Composition and Synthesis

Terminal branches contribute to an axiom chain: $\mathsf{AxiomChain} := \Sigma(A_t : \mathsf{List\;Axiom},\, D_t : \mathsf{List\;Dim})$ with $\mathsf{Compose}(\{L_{t+1}^k\}, G_t, P_t) \to L_t$ recursively merging lemmas upward. Final Lean code is type-checked and semantically verified.
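Bottom-up composition can be sketched as merging the (axiom, dimension) contributions of terminated child branches while deduplicating on the way up (hypothetical shapes; the real Compose merges Lean lemmas):

```python
def compose(children):
    """Merge child (axioms, dims) pairs into one AxiomChain,
    preserving first-seen order and dropping duplicates."""
    axioms, dims = [], []
    for ax, dm in children:
        axioms += [a for a in ax if a not in axioms]
        dims += [d for d in dm if d not in dims]
    return axioms, dims

chain = compose([(["Newton's second law"], ["[M]", "[T]"]),
                 (["Newton's second law"], ["[L]"])])
# chain == (["Newton's second law"], ["[M]", "[T]", "[L]"])
```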

2.6 High-Level Pseudocode

Input: image I, text T
1. G ← parse(I)                           # SceneGraph
2. t ← 0; initialize P from perceptual labels in G
3. L ← Grounding(G, P)                    # list of Lemmas
4. while True:
5.   if ∃ p ∈ Pₜ s.t. p ∈ Dₜ ∪ Aₜ:        # dimensional or axiom hit
6.     mark branch as terminated; break
7.   Gₜ ← select_subgraph(Gₜ₋₁)           # visual decomposition
8.   Pₜ ← {p(v, l | Gₜ) : v ∈ Vₜ}
9.   Lₜ ← Grounding(Gₜ, Pₜ)
10.  t ← t + 1
11. end while
12. Compose all Lₜ bottom-up into the final AxiomChain
13. Verify Lean compilation and semantic checking
Output: Lean code + proof terms
(Xiong et al., 6 Jan 2026)

3. Benchmark: PhyX-AF

Evaluation is carried out on the PhyX-AF benchmark, constructed from 115 curated multimodal samples:

  • MathVerse: Plane and solid geometry, function graphs
  • PhyX: Mechanics, electromagnetism, thermodynamics, modern physics (relativity, quantum theory)
  • Synthetic Geometry: Out-of-distribution Euclidean constructions
  • Analytic Geometry: Cartesian and coordinate-based problems

Each sample consists of an image and a text statement; text-only instances are excluded. Three key metrics are reported:

  • Compile accuracy: Percentage of Lean code that type-checks
  • Semantic accuracy: Percentage of code semantically capturing the intended result
  • Human verification: Manual check of final proofs
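The two automatic metrics reduce to simple ratios over per-sample outcomes; a sketch over hypothetical results:

```python
def accuracy(results, key):
    """Percentage of samples for which the given boolean outcome holds."""
    return 100.0 * sum(r[key] for r in results) / len(results)

# Hypothetical per-sample outcomes: compile = Lean code type-checks;
# semantic = the formalization also captures the intended statement.
results = [
    {"compiles": True,  "semantic": True},
    {"compiles": True,  "semantic": False},
    {"compiles": False, "semantic": False},
    {"compiles": True,  "semantic": True},
]
compile_acc = accuracy(results, "compiles")   # 75.0
semantic_acc = accuracy(results, "semantic")  # 50.0
```

Note that semantic accuracy is bounded above by compile accuracy, since code that fails to type-check cannot be semantically correct.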

4. Empirical Results

Frontier LLMs outperform open-source models on PhyX-AF, especially for problems in physics and modern domains. Key findings include:

Model          | Compile Accuracy (img / text) | Semantic Accuracy (img / text)
GPT-5          | 71.4% / 16.7%                 | 71.4% / 0.0%
Gemini-3-Pro   | 28.6% / 42.9%                 | 28.6% / 28.6%
Qwen3-VL-235B  | 0.0% / 0.0%                   | 0.0% / 0.0%
  • Physics (PhyX): GPT-5 achieves the highest compile and semantic accuracy, particularly in quantum and relativity tasks.
  • Geometry (Synthetic and Analytic): Remains the most challenging subdomain. Deviations in angle/length reasoning and out-of-distribution (OOD) constructions limit model performance.
  • Model Supervision: Gemini-2.5-Pro reliably agrees with human semantic labels, implying that weaker models can supervise and audit stronger ones for consistency.

A plausible implication is that supervised semantic metrics from lightweight models might serve as effective proxies for human evaluation in future autoformalization pipelines (Xiong et al., 6 Jan 2026).

5. Case Studies Demonstrating Multimodal Autoformalization

MMFormalizer is demonstrated on representative problem types:

  • Regular Hexagonal Prism (3D geometry, image input):
    structure HexagonalPrism (V : Type*) [MetricSpace V] :=
      (base₁ : RegularPolygon V 6)
      (base₂ : RegularPolygon V 6)
      (laterals : List (Segment V))
      (perp_faces : ∀ e ∈ laterals, ...)
  • From Hamiltonian to Newton’s Laws:

H(\mathbf{p}, \mathbf{q}) = \sum_i \frac{\|\mathbf{p}_i\|^2}{2m_i} + V(\mathbf{q})

retrieves Hamilton’s equations, $\dot q_i = \partial H / \partial p_i$ and $\dot p_i = -\partial H / \partial q_i$, and, through recursive grounding, terminates at Newton’s Second Law:

\forall i,\; F_i = \frac{dp_i}{dt}

  • Quantum Tunneling:

From boundary conditions and the Schrödinger equation, MMFormalizer builds

\psi''(x) = \kappa^2 \psi(x), \quad \kappa = \sqrt{2m(U_0 - E)}/\hbar

terminating with

\psi(x) = A e^{-\kappa(x - L)}

  • Relativistic Velocity Addition:

definition rel_vel (u v : Velocity) : Velocity := (u + v) / (1 + (u * v) / c^2)

For $u = 0.8c$, $v = 0.6c$, this computes $v_{\mathrm{rel}} \approx 0.946\,c$.
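The velocity-addition result can be checked numerically in units where $c = 1$ (a Python sketch, not the system's Lean code):

```python
def rel_vel(u: float, v: float) -> float:
    """Relativistic composition of collinear velocities, with c = 1."""
    return (u + v) / (1 + u * v)

w = rel_vel(0.8, 0.6)
print(round(w, 3))  # 0.946
```

The denominator $1 + uv/c^2$ guarantees the composed speed never exceeds $c$, unlike the Galilean sum $u + v$.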

6. System Limitations and Prospects

MMFormalizer remains subject to several limitations:

  • Geometry: Substantial computational and representational difficulty arises in OOD geometric configurations, especially those involving strict angle/length interpretation from novel diagrams.
  • Recursive Over-Decomposition: Without robust termination heuristics, recursion can generate excessive subproblems, leading to combinatorial search blow-up.
  • Retrieval Dependence: When formal proof elements are absent from the underlying libraries, the system may have to generate new Lean types, a process susceptible to error.

Planned future directions include:

  • Enhanced perceptual-symbolic methods, such as graph neural networks for richer scene graph representations.
  • Learned heuristics or value functions for adaptive recursion pruning.
  • Extension to physics domains beyond those currently supported, e.g., fluid mechanics or general relativity.
  • Closer integration with automated provers, possibly via reinforcement learning-guided proof search.

MMFormalizer’s capacity to unify scene-based reasoning with deep formal proof synthesis positions it as a pioneering framework for multimodal mathematical and physical autoformalization in open-world settings (Xiong et al., 6 Jan 2026).
