ThinkingF: Data Synthesis for Autoformalization
- ThinkingF is a data synthesis and training pipeline designed to fuse formal mathematical knowledge with explicit step-by-step reasoning for autoformalization.
- The approach employs a dual-dataset strategy that integrates high-fidelity informal–formal pairs with detailed reasoning trajectories, ensuring robust alignment and syntax mastery.
- Reinforcement learning with BEq verification refines output quality, yielding state-of-the-art performance on in-domain and out-of-distribution formalization benchmarks.
ThinkingF is a data synthesis and training pipeline designed to improve the autoformalization capabilities of LLMs by explicitly fusing comprehensive formal-language domain knowledge with step-by-step informal-to-formal reasoning. In the context of mathematical autoformalization, this scheme addresses key limitations of existing models, notably their insufficient mastery of formal constructs and incomplete alignment between informal problem descriptions and precise formal representations. ThinkingF orchestrates knowledge distillation, template-based reasoning trajectory generation, supervised fine-tuning, and reinforcement learning to couple both abilities, resulting in models that excel on state-of-the-art formalization benchmarks and generalize to out-of-distribution settings (Wu et al., 6 Aug 2025).
1. Autoformalization Process
Autoformalization is defined as translating natural-language mathematical statements into their equivalent formal expressions suitable for interactive theorem provers (e.g., Lean, Coq, Isabelle). Standard LLM-based approaches exhibit two primary bottlenecks: lack of deep formal-language knowledge, leading to incorrect identification of formal objects and constructions; and deficient informal–formal alignment, resulting in failures to map real-world or colloquial mathematical intent into rigorous, verifiable syntax.
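For illustration, here is a simple informal statement and one possible Lean 4 formalization (the theorem name and phrasing are illustrative, and the proof is out of scope for the statement-formalization task, so it is left as `sorry`; assumes Mathlib):

```lean
import Mathlib

-- Informal: "The sum of two even natural numbers is even."
theorem sum_of_even_is_even (a b : ℕ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  sorry -- autoformalization targets the statement; proving it is a separate task
```

The bottlenecks above show up precisely here: a model must know the formal vocabulary (`Even`, `ℕ`) and also align the informal quantification ("two even natural numbers") with explicit hypotheses.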
ThinkingF addresses both gaps through a dual-fusion methodology:
- First, comprehensive domain expertise is distilled from specialized autoformalizers, curating a large corpus of high-quality formal statements.
- Second, explicit stepwise reasoning—mapping informal language objects to formal entities—is generated with expert-crafted templates, ensuring the model can “explain” its translation process and catch subtle misalignments.
This pipeline transforms the model from a direct translator to a reasoning agent that actively reconstructs the logic bridging informal and formal mathematics.
2. Datasets and Training Pipeline
The pipeline comprises two distinct datasets, each targeting a complementary ability:
- Knowledge Dataset:
- Sourced by collecting informal mathematical problems (e.g., from NuminaMath-1.5) and using a specialized autoformalization system (Kimina-Autoformalizer) to propose candidate formalizations.
- Quality assurance is applied in three tiers: syntax checking (Lean 4 REPL), majority voting on equivalence classes (using BEq verification), and semantic vetting (via LLM evaluation, e.g., DeepSeek-V3) to exclude oversimplified or erroneous samples. The resulting dataset contains approximately 183K high-fidelity informal–formal pairs.
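A minimal sketch of this three-tier filter, with hypothetical stand-ins (`lean_syntax_ok`, `beq_class`, `semantically_faithful`) for the real Lean 4 REPL check, BEq equivalence prover, and LLM judge:

```python
from collections import Counter
from typing import Optional


def lean_syntax_ok(stmt: str) -> bool:
    # Stand-in for compiling the statement in the Lean 4 REPL.
    return stmt.startswith("theorem")


def beq_class(stmt: str) -> str:
    # Real BEq groups candidates by bidirectional provable equivalence;
    # this toy version groups by whitespace-normalized text.
    return stmt.replace(" ", "")


def semantically_faithful(stmt: str) -> bool:
    # Stand-in for an LLM judge rejecting oversimplified formalizations.
    return "trivial" not in stmt


def filter_candidates(candidates: list[str]) -> Optional[str]:
    """Three-tier filter: syntax check, majority vote over BEq
    equivalence classes, then semantic vetting of the winner."""
    parsed = [c for c in candidates if lean_syntax_ok(c)]
    if not parsed:
        return None
    classes = Counter(beq_class(c) for c in parsed)
    winner_class, _ = classes.most_common(1)[0]
    winner = next(c for c in parsed if beq_class(c) == winner_class)
    return winner if semantically_faithful(winner) else None
```

Majority voting over equivalence classes (rather than raw strings) is what lets superficially different but provably equal candidates reinforce each other.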
- Reasoning Dataset:
- Constructed by prompting advanced instruction-following LLMs (such as Claude 3.7 Sonnet) with expert templates anchored by existing human-annotated problem-formal pairs.
- These prompts elicit step-by-step reasoning chains, explicitly documenting how natural-language premises, definitions, and context are reconciled with formal logic. Around 5.8K detailed trajectories were produced, each exposing potential reasoning pitfalls and formal translation challenges.
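The trajectory-elicitation step can be sketched as a prompt template anchored by one annotated pair; the wording below is illustrative, not the paper's actual template:

```python
# Illustrative template for eliciting a step-by-step informal-to-formal
# reasoning trace, anchored by a human-annotated (problem, formal) pair.
TRAJECTORY_TEMPLATE = """\
You are formalizing mathematics in Lean 4.

Informal problem:
{informal}

Target formal statement:
{formal}

Explain, step by step, how each informal object, hypothesis, and goal
maps into the formal statement. Flag any ambiguity or potential mismatch
(implicit domains, edge cases, quantifier scope) before giving the final
aligned reading.
"""


def build_trajectory_prompt(informal: str, formal: str) -> str:
    """Fill the template with one anchored informal-formal pair."""
    return TRAJECTORY_TEMPLATE.format(informal=informal, formal=formal)
```

The explicit "flag any ambiguity" instruction is what surfaces the reasoning pitfalls and translation challenges that the trajectories are meant to expose.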
- Training Workflow:
- A general-purpose LLM (DeepSeek-R1-Distill-Qwen) is first fine-tuned via supervised learning (SFT) separately on the knowledge and reasoning datasets, with `<think>` tokens guiding internal state handling.
- Subsequently, reinforcement learning with verifiable reward (RLVR) is applied, using the BEq@1 equivalence check as the reward signal:

$$r(\hat{y}) = \begin{cases} 1 & \text{if } \mathrm{BEq}(\hat{y}, y^{*}) \text{ holds} \\ 0 & \text{otherwise} \end{cases}$$

where $\hat{y}$ is the model output, $y^{*}$ is the ground truth, and equivalence is certified by bidirectional definitional equality in Lean.

This multi-stage process ensures the model acquires both surface-level formal knowledge and the multi-step reasoning required for robust informal–formal alignment.

3. Model Performance Evaluation

ThinkingF yields two model sizes (7B and 32B parameters), both evaluated on formalization benchmarks:

- FormalMATH-Lite (in-domain): StepFun-Formalizer-32B achieves 40.5% BEq@1, surpassing all prior models, with the 7B variant close behind at 38.3%.
- ProverBench (out-of-distribution): The 32B model attains 26.7% BEq@1, demonstrating preserved generalization when reasoning about unfamiliar structures.

BEq@1 denotes the strictest possible verification: a single generated formal statement, when submitted to the proof assistant, passes bidirectional definitional equivalence against the ground truth. Achieving SOTA on these metrics empirically validates ThinkingF's hypothesis that explicit fusion of knowledge and reasoning is critical for high-fidelity autoformalization.

Ablation and case studies further show that reasoning trajectory data drives upper-bound chain-of-thought quality (especially with extended rollouts), while formal domain knowledge augments baseline accuracy and coverage.

4. Knowledge–Reasoning Fusion Mechanisms

The distinguishing feature of ThinkingF is its explicit fusion of two abilities via both data and training design:

- Separation and Integration: The knowledge dataset directly imparts formal syntax, definitions, and constructs, while the reasoning dataset teaches the model to perform explicit chain-of-thought alignment from informal prompt to formal output.
- Stagewise SFT: Initial supervised fine-tuning on knowledge (with internal tokens for notation alignment), followed by fine-tuning on reasoning trajectories to embed the stepwise mapping process.
- Reward-Driven RL: RLVR, using BEq verification, ensures the produced formal statements are not just syntactically sound but verifiably equivalent; the process is formalized by the reward function above.

The resulting models demonstrate both broad formal syntax mastery and the capability to articulate detailed, context-sensitive mappings from informal language to formal specification, mitigating typical misalignment errors encountered in prior work.

5. Implications and Research Directions

ThinkingF demonstrates that the deliberate separation and subsequent fusion of formal-language knowledge and reasoning alignment produce substantial improvements in autoformalization accuracy and robustness. The methodology suggests several forward-looking implications:

- Operational Impact: Enables more reliable LLM-based automated theorem proving, formal verification in mathematics, and, potentially, verifiable code synthesis from natural descriptions.
- Pipeline Generalizability: The fusion scheme of distilling knowledge and synthesizing reasoning could be adapted to other domains requiring informal-to-formal translation, such as software verification or complex scientific modeling.
- RL Enhancements: Future work may extend RL with richer verifiability metrics or tighter prover-in-the-loop setups, iteratively refining the alignment between informal descriptions and their formal semantics.
- Template and Data Design: Research into optimizing reasoning trajectory templates or synthesizing more diverse symbolic reasoning data may push boundaries further.

A plausible implication is that the ThinkingF pipeline, by tightly integrating knowledge and reasoning, sets a new foundation for robust, generalizable autoformalization and opens avenues for structured, multi-stage training paradigms across domains that demand both deep knowledge and explicit reasoning alignment.

6. Summary Table

| Aspect | Dataset/Methodology | Key Metric/Outcome |
|-------------------------------|--------------------------------------------|--------------------------------------|
| Formal Knowledge | 183K high-quality informal–formal pairs | Strong mastery of syntax/constructs |
| Reasoning Trajectories | 5.8K stepwise informal–formal mappings | Increased generalization, robustness |
| RL with Verifier | BEq-based reward signal | Verifiable semantic alignment |
| SOTA Performance (32B model) | FormalMATH-Lite: 40.5% BEq@1 | Outperforms specialized baselines |
| Out-of-domain Generalization | ProverBench: 26.7% BEq@1 | Maintains accuracy on OOD problems |

7. Conclusion

ThinkingF operationalizes the fusion of large-scale formal knowledge and granular informal-to-formal reasoning through carefully designed data, staged supervised training, and reward-driven refinement. The resulting models achieve benchmark-leading performance and an improved mapping from colloquial mathematics to formal code. By isolating and then integrating these sources of competence, ThinkingF sets a rigorous standard for future research in automated mathematical reasoning and beyond (Wu et al., 6 Aug 2025).