ThinkingF: Data Synthesis for Autoformalization
- ThinkingF is a data synthesis and training pipeline designed to fuse formal mathematical knowledge with explicit step-by-step reasoning for autoformalization.
- The approach employs a dual-dataset strategy that integrates high-fidelity informal–formal pairs with detailed reasoning trajectories, ensuring robust alignment and syntax mastery.
- Reinforcement learning with BEq verification refines output quality, yielding state-of-the-art performance on in-domain and out-of-distribution formalization benchmarks.
ThinkingF is a data synthesis and training pipeline designed to improve the autoformalization capabilities of LLMs by explicitly fusing comprehensive formal-language domain knowledge with step-by-step informal-to-formal reasoning. In the context of mathematical autoformalization, this scheme addresses key limitations of existing models, notably their insufficient mastery of formal constructs and incomplete alignment between informal problem descriptions and precise formal representations. ThinkingF orchestrates knowledge distillation, template-based reasoning trajectory generation, supervised fine-tuning, and reinforcement learning to couple both abilities, resulting in models that excel on state-of-the-art formalization benchmarks and generalize to out-of-distribution settings (Wu et al., 6 Aug 2025).
1. Autoformalization Process
Autoformalization is defined as translating natural-language mathematical statements into their equivalent formal expressions suitable for interactive theorem provers (e.g., Lean, Coq, Isabelle). Standard LLM-based approaches exhibit two primary bottlenecks: lack of deep formal-language knowledge, leading to incorrect identification of formal objects and constructions; and deficient informal–formal alignment, resulting in failures to map real-world or colloquial mathematical intent into rigorous, verifiable syntax.
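For illustration, here is a simple informal statement and one possible Lean 4 formalization (the theorem name and phrasing are illustrative, and the proof is out of scope for the statement-formalization task, so it is left as `sorry`; assumes Mathlib):

```lean
import Mathlib

-- Informal: "The sum of two even natural numbers is even."
theorem sum_of_even_is_even (a b : ℕ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  sorry -- autoformalization targets the statement; proving it is a separate task
```

The bottlenecks above show up precisely here: a model must know the formal vocabulary (`Even`, `ℕ`) and also align the informal quantification ("two even natural numbers") with explicit hypotheses.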
ThinkingF addresses both gaps through a dual-fusion methodology:
- First, comprehensive domain expertise is distilled from specialized autoformalizers, curating a large corpus of high-quality formal statements.
- Second, explicit stepwise reasoning—mapping informal language objects to formal entities—is generated with expert-crafted templates, ensuring the model can “explain” its translation process and catch subtle misalignments.
This pipeline transforms the model from a direct translator to a reasoning agent that actively reconstructs the logic bridging informal and formal mathematics.
2. Datasets and Training Pipeline
The pipeline comprises two distinct datasets, each targeting a complementary ability:
- Knowledge Dataset:
- Sourced by collecting informal mathematical problems (e.g., from NuminaMath-1.5) and using a specialized autoformalization system (Kimina-Autoformalizer) to propose candidate formalizations.
- Quality assurance is applied in three tiers: syntax checking (Lean 4 REPL), majority voting on equivalence classes (using BEq verification), and semantic vetting (via LLM evaluation, e.g., DeepSeek-V3) to exclude oversimplified or erroneous samples. The resulting dataset contains approximately 183K high-fidelity informal–formal pairs.
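A minimal sketch of this three-tier filter, with hypothetical stand-ins (`lean_syntax_ok`, `beq_class`, `semantically_faithful`) for the real Lean 4 REPL check, BEq equivalence prover, and LLM judge:

```python
from collections import Counter
from typing import Optional


def lean_syntax_ok(stmt: str) -> bool:
    # Stand-in for compiling the statement in the Lean 4 REPL.
    return stmt.startswith("theorem")


def beq_class(stmt: str) -> str:
    # Real BEq groups candidates by bidirectional provable equivalence;
    # this toy version groups by whitespace-normalized text.
    return stmt.replace(" ", "")


def semantically_faithful(stmt: str) -> bool:
    # Stand-in for an LLM judge rejecting oversimplified formalizations.
    return "trivial" not in stmt


def filter_candidates(candidates: list[str]) -> Optional[str]:
    """Three-tier filter: syntax check, majority vote over BEq
    equivalence classes, then semantic vetting of the winner."""
    parsed = [c for c in candidates if lean_syntax_ok(c)]
    if not parsed:
        return None
    classes = Counter(beq_class(c) for c in parsed)
    winner_class, _ = classes.most_common(1)[0]
    winner = next(c for c in parsed if beq_class(c) == winner_class)
    return winner if semantically_faithful(winner) else None
```

Majority voting over equivalence classes (rather than raw strings) is what lets superficially different but provably equal candidates reinforce each other.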
- Reasoning Dataset:
- Constructed by prompting advanced instruction-following LLMs (such as Claude 3.7 Sonnet) with expert templates anchored by existing human-annotated problem-formal pairs.
- These prompts elicit step-by-step reasoning chains, explicitly documenting how natural-language premises, definitions, and context are reconciled with formal logic. Around 5.8K detailed trajectories were produced, each exposing potential reasoning pitfalls and formal translation challenges.
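The trajectory-elicitation step can be sketched as a prompt template anchored by one annotated pair; the wording below is illustrative, not the paper's actual template:

```python
# Illustrative template for eliciting a step-by-step informal-to-formal
# reasoning trace, anchored by a human-annotated (problem, formal) pair.
TRAJECTORY_TEMPLATE = """\
You are formalizing mathematics in Lean 4.

Informal problem:
{informal}

Target formal statement:
{formal}

Explain, step by step, how each informal object, hypothesis, and goal
maps into the formal statement. Flag any ambiguity or potential mismatch
(implicit domains, edge cases, quantifier scope) before giving the final
aligned reading.
"""


def build_trajectory_prompt(informal: str, formal: str) -> str:
    """Fill the template with one anchored informal-formal pair."""
    return TRAJECTORY_TEMPLATE.format(informal=informal, formal=formal)
```

The explicit "flag any ambiguity" instruction is what surfaces the reasoning pitfalls and translation challenges that the trajectories are meant to expose.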
- Training Workflow:
- A general-purpose LLM (DeepSeek-R1-Distill-Qwen) is first fine-tuned via supervised learning (SFT) separately on the knowledge and reasoning datasets, with `<think>` tokens guiding internal state handling.
- Subsequently, reinforcement learning with verifiable reward (RLVR) is applied, using the BEq@1 equivalence check as the reward signal:

$$r(\hat{y}) = \begin{cases} 1 & \text{if } \mathrm{BEq}(\hat{y}, y^{*}) \text{ holds} \\ 0 & \text{otherwise} \end{cases}$$

where $\hat{y}$ is the model output, $y^{*}$ is the ground truth, and equivalence is certified by bidirectional definitional equality in Lean.

This multi-stage process ensures the model acquires both surface-level formal knowledge and the multi-step reasoning required for robust informal–formal alignment.

3. Model Performance Evaluation

ThinkingF yields two model sizes (7B and 32B parameters), both evaluated on formalization benchmarks:

- FormalMATH-Lite (in-domain): StepFun-Formalizer-32B achieves 40.5% BEq@1, surpassing all prior models, with the 7B variant close behind at 38.3%.
- ProverBench (out-of-distribution): The 32B model attains 26.7% BEq@1, demonstrating preserved generalization when reasoning about unfamiliar structures.

BEq@1 denotes the strictest possible verification: a single generated formal statement, when submitted to the proof assistant, passes bidirectional definitional equivalence against the ground truth. Achieving SOTA on these metrics empirically validates ThinkingF's hypothesis that explicit fusion of knowledge and reasoning is critical for high-fidelity autoformalization.

Ablation and case studies further show that reasoning trajectory data drives upper-bound chain-of-thought quality (especially with extended rollouts), while formal domain knowledge augments baseline accuracy and coverage.

4. Knowledge–Reasoning Fusion Mechanisms

The distinguishing feature of ThinkingF is its explicit fusion of two abilities via both data and training design:

- Separation and Integration: The knowledge dataset directly imparts formal syntax, definitions, and constructs, while the reasoning dataset teaches the model to perform explicit chain-of-thought alignment from informal prompt to formal output.
- Stagewise SFT: Initial supervised fine-tuning on knowledge (with internal tokens for notation alignment), followed by fine-tuning on reasoning trajectories to embed the stepwise mapping process.
- Reward-Driven RL: RLVR, using BEq verification, ensures the produced formal statements are not just syntactically sound but verifiably equivalent; the process is formalized by the reward function above.

The resulting models demonstrate both broad formal syntax mastery and the capability to articulate detailed, context-sensitive mappings from informal language to formal specification, mitigating typical misalignment errors encountered in prior work.

5. Implications and Research Directions

ThinkingF demonstrates that the deliberate separation and subsequent fusion of formal-language knowledge and reasoning alignment produce substantial improvements in autoformalization accuracy and robustness. The methodology suggests several forward-looking implications:

- Operational Impact: Enables more reliable LLM-based automated theorem proving, formal verification in mathematics, and, potentially, verifiable code synthesis from natural descriptions.
- Pipeline Generalizability: The fusion scheme of distilling knowledge and synthesizing reasoning could be adapted to other domains requiring informal-to-formal translation, such as software verification or complex scientific modeling.
- RL Enhancements: Future work may extend RL with richer verifiability metrics or tighter prover-in-the-loop setups, iteratively refining the alignment between informal descriptions and their formal semantics.
- Template and Data Design: Research into optimizing reasoning trajectory templates or synthesizing more diverse symbolic reasoning data may push boundaries further.

A plausible implication is that the ThinkingF pipeline, by tightly integrating knowledge and reasoning, sets a new foundation for robust, generalizable autoformalization and opens avenues for structured, multi-stage training paradigms across domains that demand both deep knowledge and explicit reasoning alignment.

6. Summary Table

| Aspect | Dataset/Methodology | Key Metric/Outcome |
|-------------------------------|--------------------------------------------|--------------------------------------|
| Formal Knowledge | 183K high-quality informal–formal pairs | Strong mastery of syntax/constructs |
| Reasoning Trajectories | 5.8K stepwise informal–formal mappings | Increased generalization, robustness |
| RL with Verifier | BEq-based reward signal | Verifiable semantic alignment |
| SOTA Performance (32B model) | FormalMATH-Lite: 40.5% BEq@1 | Outperforms specialized baselines |
| Out-of-domain Generalization | ProverBench: 26.7% BEq@1 | Maintains accuracy on OOD problems |

7. Conclusion

ThinkingF operationalizes the fusion of large-scale formal knowledge and granular informal-to-formal reasoning through carefully designed data, staged supervised training, and reward-driven refinement. The resulting models achieve benchmark-leading performance and an improved mapping from colloquial mathematics to formal code. By isolating and then integrating these sources of competence, ThinkingF sets a rigorous standard for future research in automated mathematical reasoning and beyond (Wu et al., 6 Aug 2025).