StepFun-Formalizer-32B: Autoformalization LLM
- StepFun-Formalizer-32B is a large language model designed to autoformalize informal mathematical statements into rigorous Lean 4 code.
- It employs a hybrid training pipeline combining knowledge distillation, template-guided reasoning, and reinforcement learning with verifiable rewards.
- The model achieves state-of-the-art single-shot formalization accuracies of 40.5% on FormalMATH-Lite and 26.7% on ProverBench benchmarks.
StepFun-Formalizer-32B is an LLM explicitly designed for the autoformalization of natural-language mathematical statements, with a particular focus on translating complex informal mathematics into rigorous, verifiable formal code in languages such as Lean 4. The model’s development addresses two central challenges of autoformalization: comprehensive mastery of formal-language domain knowledge and the reasoning capabilities required to align informal problem statements with the formal semantics and syntax needed by proof assistants. StepFun-Formalizer-32B distinguishes itself both by its hybrid training pipeline—combining knowledge distillation, template-guided informal-to-formal reasoning, and reinforcement learning with verifiable rewards—and by achieving state-of-the-art single-shot formalization accuracy on important benchmarks.
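As an illustration of the task (a generic example, not one taken from the paper), an informal statement such as "the sum of two even natural numbers is even" could be autoformalized into Lean 4 (with Mathlib) as:

```lean
import Mathlib

-- Informal statement: "The sum of two even natural numbers is even."
theorem even_add_even (a b : ℕ) (ha : Even a) (hb : Even b) :
    Even (a + b) :=
  Even.add ha hb
```

The formal version must make explicit everything the informal statement leaves implicit: the domain of the variables, the hypotheses, and the precise meaning of "even" as defined in the library.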
1. Model Architecture and Design
StepFun-Formalizer-32B builds on a general-purpose LLM architecture (transformer-based) with enhancements aimed at autoformalization. Unlike standard LLMs, which typically underperform either due to insufficient command of formal language details or inadequate informal reasoning, StepFun-Formalizer-32B fuses both capacities. It is trained on additional data: large-scale distilled formal examples for domain mastery, and synthesized informal-to-formal reasoning trajectories for improved alignment, each reflecting Lean 4’s syntax and semantics. The model is trained in two stages: supervised fine-tuning (SFT) followed by reinforcement learning with a verifiable reward (RLVR), encouraging not only correct syntax but also semantic equivalence to human formalizations.
This strategy contrasts with models such as Kimina-Autoformalizer, which prioritize formal language knowledge accumulation, and general-purpose LLMs (e.g., o3-pro, Claude4-thinking), which struggle to meet the rigorous requirements of proof assistants due to imprecise mapping between informal statements and formal code. StepFun-Formalizer-32B’s dual focus on formal knowledge and informal-to-formal reasoning expands its applicability and effectiveness across a wider array of mathematical domains.
2. ThinkingF Data Synthesis and Training Pipeline
The core of StepFun-Formalizer-32B’s approach is the ThinkingF pipeline, which structures the training regime across four progressive stages to ensure the resulting model can both “understand” and “formalize” mathematics:
- Knowledge Distillation With Selection: A large pool of informal math problems is autoformalized via a capable base model (e.g., Kimina-Autoformalizer). Outputs undergo a three-layer selection: (i) Lean 4 REPL checks for syntax; (ii) majority voting using a BEq verification procedure, which clusters outputs into equivalence classes and selects a representative; and (iii) LLM-based validity filtering (e.g., DeepSeek-V3) to discard trivial or contradictory samples.
- Informal-to-Formal Reasoning Data Synthesis: Reasoning templates are applied to break down the mapping process into intermediate substeps. Using instruction-following LLMs (such as Claude 3.7 Sonnet), approximately 5.8K trajectories are synthesized, capturing rephrased problem statements, key concept extraction, and mapping to Lean 4 constructs.
- Two-Stage Supervised Fine-Tuning (SFT): The model first learns from distilled informal-formal pairs, then from synthesized reasoning-trajectory data, incorporating special tokens to enforce clear format distinctions.
- Reinforcement Learning With a Verifiable Reward (RLVR): Reinforcement learning leverages the BEq function—a bidirectional extended definitional equivalence metric—as a reward proxy, guiding the model toward provably correct outputs relative to ground-truth formalizations. The reward is binary: R(y) = 1 if BEq(y, y*) holds between the model output y and the ground-truth formalization y*, and R(y) = 0 otherwise. Policy optimization relies on GRPO with DAPO enhancements to optimize BEq-verified accuracy further.
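The two BEq-based steps in the pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `beq`, `majority_vote`, and `rlvr_reward` are hypothetical, and a trivial whitespace-normalized string comparison stands in for the real Lean 4 equivalence prover.

```python
def beq(a: str, b: str) -> bool:
    """Placeholder BEq check. The real check asks a Lean 4 prover to show
    bidirectional extended definitional equivalence; here, whitespace-
    normalized string equality is a trivial stand-in."""
    norm = lambda s: " ".join(s.split())
    return norm(a) == norm(b)

def majority_vote(candidates: list[str]) -> str:
    """Cluster candidate formalizations into BEq equivalence classes and
    return a representative of the largest class (selection stage)."""
    classes: list[list[str]] = []  # lists of mutually BEq-equivalent outputs
    for c in candidates:
        for cls in classes:
            if beq(c, cls[0]):
                cls.append(c)
                break
        else:
            classes.append([c])
    return max(classes, key=len)[0]

def rlvr_reward(candidate: str, reference: str) -> float:
    """Binary verifiable reward used during RLVR: 1 if the candidate is
    BEq-equivalent to the ground-truth formalization, else 0."""
    return 1.0 if beq(candidate, reference) else 0.0
```

Because the reward is verifiable rather than learned, the RL signal cannot be gamed by outputs that merely look plausible: only candidates the equivalence check accepts receive credit.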
3. Performance and Metrics
Evaluation of StepFun-Formalizer-32B centers on the BEq@k metric—especially BEq@1 (the proportion of single-attempt outputs provably equivalent to the human reference) and BEq@16 (performance when given up to 16 attempts). On FormalMATH-Lite, StepFun-Formalizer-32B attains a BEq@1 of 40.5%. On ProverBench, it achieves 26.7%. These results set new state-of-the-art standards, with significant improvements over prior general-purpose and specialized models. The robustness of these scores indicates not only increased single-shot success but also enhanced alignment between informal math and formal syntax/semantics.
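The BEq@k metric can be sketched as a simple pass@k-style estimator. This is a hypothetical helper, not code from the paper; exact string equality stands in for the real Lean 4 equivalence check via the `beq` parameter.

```python
def beq_at_k(samples_per_problem, references, k, beq=lambda a, b: a == b):
    """BEq@k: fraction of problems for which at least one of the first k
    sampled formalizations is BEq-equivalent to the human reference.
    `beq` defaults to exact equality as a stand-in for the real check."""
    hits = sum(
        1
        for samples, ref in zip(samples_per_problem, references)
        if any(beq(s, ref) for s in samples[:k])
    )
    return hits / len(references)
```

BEq@1 thus measures single-shot reliability, while BEq@16 measures how much a sampling budget of 16 attempts recovers.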
| Model | FormalMATH-Lite BEq@1 | ProverBench BEq@1 |
|---|---|---|
| StepFun-Formalizer-32B | 40.5% | 26.7% |
| Best prior specialized | ≤ 35.1% | lower |
| Best general-purpose LLM | lower | lower |
The table shows that StepFun-Formalizer-32B outperforms both prior specialized and general-purpose models.
4. Core Innovations and Methodological Advances
StepFun-Formalizer-32B introduces several technical innovations:
- Dual-Capability Fusion: Simultaneous mastery of domain-specific formal knowledge (e.g., Lean 4 syntax/theory) and detailed informal-to-formal reasoning, unlike approaches that prioritize only one.
- ThinkingF Pipeline: The multi-stage pipeline, particularly its use of reasoning templates and BEq-based majority voting, strengthens both the diversity and correctness of training data and the alignment between informal input and formal output.
- Template-Guided Trajectories: Explicit instruction to “think step by step,” with intermediate reasoning checks before formal code generation, actively reduces errors deriving from misunderstanding or misalignment.
- BEq-verifiable RL Optimization: The use of bidirectional extended definitional equivalence not only as an evaluation but as the direct objective during RL, ensuring reward signals are closely tied to the requirements of actual formal code acceptance by proof systems.
Ablation experiments indicate that omission of the reasoning trajectory training notably impairs BEq@1 scores, emphasizing the necessity of this module.
5. Applications and Broader Implications
StepFun-Formalizer-32B’s improvements have direct and broad-ranging implications:
- Automated Theorem Proving and Formal Verification: The model reliably translates informal descriptions into formally checkable Lean 4 code, reducing manual effort and enabling more rapid formalization—vital for mathematical research and machine-checked proofs.
- Reliable Code Generation: Techniques developed for mathematical autoformalization are applicable to programming languages, supporting the generation of code that not only matches informal intent but is also certifiably correct under strict type and logic systems.
- Educational Tools and Research Support: Integration with educational platforms enables automated verification and feedback on student-submitted mathematics, potentially accelerating the shift toward rigorous mathematical reasoning in early education.
- Cross-Domain Formalization: The ThinkingF pipeline provides a methodological blueprint for domains demanding the translation of informal text into precise, formal systems, such as legal or clinical guidelines.
A plausible implication is increased automation in domains where formal verification is essential but expertise is scarce.
6. Comparative Analysis
When compared with both specialized and general-purpose models, StepFun-Formalizer-32B demonstrates decisive strengths:
- Against General-Purpose LLMs: Models such as o3-pro, Claude4-thinking, and Gemini-2.5-thinking often stumble on Lean 4-specific constructs or type discipline. StepFun-Formalizer-32B’s explicit domain conditioning and BEq-based feedback yield outputs that are more frequently accepted by formal proof environments.
- Versus Specialized Autoformalization Models: Whereas specialized models may excel at syntax, they often underperform on complex real-world problems due to poor informal-to-formal mapping. StepFun-Formalizer-32B’s results on both in-domain benchmarks (such as FormalMATH-Lite) and more challenging out-of-distribution tasks (ProverBench) demonstrate greater robustness and adaptability.
| Model Type | Formal Command | Informal-Formal Reasoning | BEq@1 (FormalMATH-Lite) |
|---|---|---|---|
| StepFun-Formalizer-32B | Strong | Strong | 40.5% |
| Specialized | Strong | Weak/Moderate | ≤ 35.1% |
| General-purpose | Moderate | Moderate | lower |
This comparative table summarizes the relative strengths reported in the source data.
7. Conclusion
StepFun-Formalizer-32B advances the state of LLM-driven autoformalization by unifying comprehensive formal domain knowledge and template-guided informal-to-formal reasoning within a single model. Through the ThinkingF pipeline—a synthesis of knowledge distillation, rigorous data selection, structured trajectory generation, and reinforcement learning using a verifiable reward—the model surpasses both specialized and generalist counterparts on formalization accuracy benchmarks. This signifies a substantive move toward automated systems capable of trustworthy formal translation of mathematical text, suggesting broad future impact in theorem proving, formal verification, education, and beyond (Wu et al., 6 Aug 2025).