TheoremForge: Automated Theorem Synthesis
- TheoremForge is a modular framework for synthesizing, verifying, and augmenting mathematical theorems using agentic pipelines, symbolic exploration, and RL/LLM-guided proof planning.
- It leverages structured sub-tasks and deductive synthesis to generate high-quality formal data and scalable benchmarks for domains such as algebra and combinatorics.
- The system optimizes data yield and cost efficiency through modular task decomposition, self-improving cycles, and parameterized theorem generation.
TheoremForge refers to a broad suite of methodologies and concrete systems for the automated synthesis, verification, and augmentation of mathematical theorems for formal theorem proving, dataset generation, and program synthesis. The term encompasses agentic pipelines for large-scale formal data creation, symbolic theory exploration using deductive synthesis, neural- and RL-guided proof planning, parametric theorem generators for formal benchmarks, and theory-derived construction tools. These systems address both the bottleneck of high-quality training data for automated reasoning and the broader challenge of automating theorem discovery and proof generation across domains such as algebra, combinatorics, and theoretical computer science.
1. Modular Agentic Pipelines for Data Synthesis
Recent advances have focused on decomposing the formalization process into modular, locally verified subtasks. In "TheoremForge: Scaling up Formal Data Synthesis with Low-Budget Agentic Workflow" (Tao et al., 24 Jan 2026), the formalization trajectory is structured as five sub-tasks: statement formalization, proof generation, premise selection, proof correction, and proof sketching. Each sub-task is defined by a specific data format and local validation condition:
- Statement Formalization: Converts informal statements to valid, semantically aligned formal Lean 4 statements, using LLMs for synthesis and verification.
- Proof Generation: Produces Lean formal proofs via expert model sampling and repair; subgoal decomposition with sketch guidance is used for difficult statements.
- Premise Selection: Identifies the minimal premise set for proofs with both positive and negative training signals.
- Proof Correction: Repairs failed proof attempts by conditioning on proof and error pairs.
- Proof Sketching: Generates and validates intermediate proof sketches from informal outlines.
A "Decoupled Extraction Strategy" enables sub-task data to be harvested even from globally failed proof attempts, thereby maximizing the data yield per compute budget. This architecture significantly increases verified data yield (1.6× over standard complete-proof filtering) and reduces cost per verified proof (e.g., \$0.481/proof for Gemini-3-Flash). Error analysis highlights proof sketching and subgoal solving as dominant bottlenecks, and the approach supports a self-improving data flywheel for sub-task expert training (Tao et al., 24 Jan 2026).
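The decoupled extraction idea can be sketched in a few lines. The names here (`SubgoalAttempt`, `decoupled_extract`, the field layout) are illustrative, not the paper's API; the point is only that a globally failed trajectory still yields proof-correction pairs from its failed subgoals and proof-generation pairs from its verified ones.

```python
from dataclasses import dataclass

@dataclass
class SubgoalAttempt:
    statement: str      # formal Lean statement of the subgoal
    proof: str          # candidate proof script
    verified: bool      # result of local validation (e.g., a Lean check)
    error: str = ""     # compiler error message when verification fails

@dataclass
class ProofTrajectory:
    theorem: str
    attempts: list      # list[SubgoalAttempt]

def decoupled_extract(trajectory):
    """Harvest sub-task training data even from globally failed trajectories.

    Verified subgoals become proof-generation examples; failed subgoals,
    paired with their error messages, become proof-correction examples.
    """
    generation_data, correction_data = [], []
    for a in trajectory.attempts:
        if a.verified:
            generation_data.append((a.statement, a.proof))
        else:
            correction_data.append((a.statement, a.proof, a.error))
    return generation_data, correction_data

# A trajectory that fails overall still yields one example of each kind.
traj = ProofTrajectory(
    theorem="n + 0 = n",
    attempts=[
        SubgoalAttempt("n + 0 = n", "simp", verified=True),
        SubgoalAttempt("0 + n = n", "rfl", verified=False,
                       error="type mismatch"),
    ],
)
gen, corr = decoupled_extract(traj)
print(len(gen), len(corr))  # 1 1
```

This is the mechanism behind the 1.6× yield figure: the complete-proof filter would discard this trajectory entirely, while decoupled extraction keeps both sub-task examples.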
2. Symbolic Theory Exploration and Deductive Lemma Discovery
TheoremForge also refers to bottom-up symbolic lemma discovery from inductive data types and recursive function definitions. In "Theory Exploration Powered By Deductive Synthesis" (Singher et al., 2020), the system (also called TheSy) implements an iterative-deepening symbolic framework:
- Term Generation: Enumerate well-typed placeholder-terms to a depth bound, merged by e-graph congruence closure.
- Conjecture Inference: Use symbolic observational equivalence (SOE) over abstract examples to conjecture universally quantified equations.
- Conjecture Screening: Remove trivial or redundant candidates referencing already provable equations.
- Induction Prover: Attempt first-argument structural induction with congruence closure and case-splitting.
Each newly proven lemma enriches the rewrite closure, facilitating bootstrapped discovery of more complex results (lemma seeding). Symbolic abstraction offers scaling and coverage benefits over concrete testing (as in Hipster and IsaCoSy), since infinitely many ground cases are subsumed under abstract symbolic inputs. Empirical evaluation demonstrates superior coverage and speed compared to test-based approaches (Singher et al., 2020).
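A minimal sketch of the enumerate-then-conjecture loop, with one simplification flagged loudly: TheSy groups terms by symbolic observational equivalence over abstract examples in an e-graph, whereas this sketch substitutes concrete observational equivalence on a handful of test inputs, and omits the screening and induction-proving stages.

```python
def rev(xs): return list(reversed(xs))
def double(xs): return xs + xs

OPS = {"rev": rev, "double": double}
# Concrete test inputs stand in for TheSy's abstract symbolic examples.
EXAMPLES = [[], [1], [1, 2], [1, 2, 3]]

def grow(terms):
    """One iterative-deepening layer: apply each unary op to every term."""
    return list(dict.fromkeys(terms + [(op, t) for op in OPS for t in terms]))

def evaluate(term, xs):
    if term == "xs":
        return xs
    op, sub = term
    return OPS[op](evaluate(sub, xs))

def conjecture(terms):
    """Group terms by their observations; a shared fingerprint across all
    examples yields a conjectured universally quantified equation."""
    classes = {}
    for t in terms:
        fp = tuple(tuple(evaluate(t, xs)) for xs in EXAMPLES)
        classes.setdefault(fp, []).append(t)
    return [c for c in classes.values() if len(c) > 1]

terms = ["xs"]
for _ in range(2):          # depth bound of 2
    terms = grow(terms)
conjs = conjecture(terms)
# Conjectured: rev (rev xs) = xs, and rev (double xs) = double (rev xs)
```

Both conjectures found here happen to be true lemmas; in the full system they would next pass the screening filter and the first-argument structural-induction prover before seeding further rounds.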
3. Parameterized Theorem Generation for Benchmarking and Model Evaluation
TheoremForge pipelines support scalable synthesis of parametrized, diverse, and contamination-controlled theorem-proving benchmarks. In "Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs" (Zhang et al., 21 Aug 2025), the system generates problems in TCS domains (Busy Beaver, Mixed Boolean Arithmetic) by:
- Defining modular problem families with tunable difficulty parameters (e.g., Turing machine states, variables, nonlinearity).
- Computing ground-truth via fast decision procedures (step simulation, normalization to Weighted-2-DNF).
- Rendering perfectly aligned Lean4 (formal) and Markdown (informal) statements.
- Automatically verifying generated instances using Lean4 servers.
Empirical results reveal high step-level (lemma) accuracy but low end-to-end proof success for frontier models, with strong contamination resistance due to exponential family sampling. The design is extensible to further domains (SAT/CNF, complexity reductions) and supports infinite family scaling (Zhang et al., 21 Aug 2025).
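The generation recipe above can be illustrated on a toy MBA family. The function name, the brute-force ground-truth check, and the Lean rendering (including the `bv_decide` tactic) are assumptions for illustration, not the paper's exact decision procedure or output format.

```python
from itertools import product

def mba_instance(bits=4):
    """One Mixed Boolean-Arithmetic instance with a tunable bit width:
    does x + y equal (x ^ y) + 2*(x & y) over `bits`-bit integers?
    Ground truth is computed here by exhaustive evaluation; the paper's
    generator instead normalizes to a canonical form."""
    mask = (1 << bits) - 1
    lhs = lambda x, y: (x + y) & mask
    rhs = lambda x, y: ((x ^ y) + 2 * (x & y)) & mask
    truth = all(lhs(x, y) == rhs(x, y)
                for x, y in product(range(mask + 1), repeat=2))
    # Aligned formal (Lean 4) and informal (Markdown) renderings.
    formal = (f"theorem mba (x y : BitVec {bits}) : "
              f"x + y = (x ^^^ y) + 2 * (x &&& y) := by bv_decide")
    informal = (f"For all {bits}-bit integers x and y, "
                "x + y = (x XOR y) + 2*(x AND y).")
    return truth, formal, informal

truth, formal, informal = mba_instance(bits=4)
print(truth)  # True: the identity holds for every 4-bit pair
```

Because the family is indexed by parameters (bit width, variable count, choice of identity), sampling fresh instances gives the contamination resistance the paper reports: exponentially many instances exist per family.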
4. Domain-Specific Automated Theorem Generation and Proof Search
Multiple TheoremForge-inspired systems focus on domain-targeted theorem and proof generation:
- Rectangular Standard Contradiction: In "An Automated Theorem Generator with Theoretical Foundation Based on Rectangular Standard Contradiction" (Xu et al., 6 Nov 2025), theorem generation is founded on the properties of rectangular standard contradictions—clause-sets that are minimally unsatisfiable and exhibit non-redundancy. Partitioning such sets yields logically equivalent, non-trivial theorems via unsatisfiability and premise-exclusion arguments. A template-based O(n·2ⁿ)-time generator supports batch synthesis of millions of theorems for small n with correctness and non-vacuity guaranteed (Xu et al., 6 Nov 2025).
- Combinatorial Identities: "A Combinatorial Identities Benchmark for Theorem Proving via Automated Theorem Generation" (Xiong et al., 25 Feb 2025) describes the self-augmenting ATG4CI system: a self-improving LLM proposes tactics for proof search, while reinforcement-learning tree search (PUCT) constructs new theorem-proof pairs in Lean by exploring partial proof path frontiers. Iterative cycles of proposal, proof, validation, and dataset augmentation yield large, high-quality formal datasets. Performance benchmarks show +6–14% pass@1 lift after two self-augmentation rounds on combinatorial identities, together with out-of-domain transfer gains (Xiong et al., 25 Feb 2025).
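The rectangular-contradiction construction can be made concrete with the canonical example of a minimally unsatisfiable clause-set: the full set of all 2ⁿ clauses in which every variable occurs exactly once. This instantiation is an illustration chosen for its simplicity; the paper's rectangular standard contradictions are defined more generally, but this family exhibits the same minimal-unsatisfiability and non-redundancy properties and matches the O(n·2ⁿ) output size.

```python
from itertools import product

def full_clause_set(n):
    """All 2^n clauses over variables 1..n, one literal per variable.
    Literal +i is variable i, -i its negation. Each truth assignment
    falsifies exactly one clause, so the set is minimally unsatisfiable."""
    return [tuple(i + 1 if s else -(i + 1) for i, s in enumerate(signs))
            for signs in product([True, False], repeat=n)]

def is_unsatisfiable(clauses, n):
    """Brute-force SAT check over all 2^n assignments."""
    for a in product([True, False], repeat=n):
        if all(any((lit > 0) == a[abs(lit) - 1] for lit in clause)
               for clause in clauses):
            return False        # satisfying assignment found
    return True

n = 3
clauses = full_clause_set(n)
assert is_unsatisfiable(clauses, n)
# Minimality: dropping any single clause restores satisfiability.
assert all(not is_unsatisfiable(clauses[:k] + clauses[k + 1:], n)
           for k in range(len(clauses)))
# Partitioning then yields a theorem: the first part entails the
# negation of the conjunction of the second part.
premises, negated_goal = clauses[:4], clauses[4:]
```

The final partition step mirrors the premise-exclusion argument: since the union is unsatisfiable, any split of the clause-set produces a valid (and, by non-redundancy, non-vacuous) entailment.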
5. Programmatic Generation of Algebraic Constructions and Universal Properties
TheoremForge also encompasses frameworks for deriving algebraic constructions from theory presentations, as in "Leveraging the Information Contained in Theory Presentations" (Carette et al., 2020). A pipeline parses user-presented algebraic theories (record types in a core type theory), extracts an internal representation (e.g., EqTheory), and applies derivation operators for:
- Signature extraction (axiom removal).
- Product algebra construction (componentwise operations).
- Term algebras (free initial structures).
- Homomorphism types (structure-preserving maps).
- Quotients, subobjects, and further derived constructions.
The approach eliminates boilerplate (e.g., ∼10,700 LOC saved for homomorphisms in a 227-theory library), and supports rapid expansion of formal algebraic libraries, with type-checker-level correctness ensured (Carette et al., 2020).
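For concreteness, here is what one such derivation looks like, rendered in Lean 4 syntax rather than the paper's own host formalism: given a monoid presentation as a record of carrier, operations, and axioms, the homomorphism operator mechanically emits one preservation field per operation symbol.

```lean
-- A theory presentation: a monoid as a record over a carrier type.
structure Monoid (M : Type) where
  unit  : M
  op    : M → M → M
  assoc : ∀ a b c, op (op a b) c = op a (op b c)
  unitL : ∀ a, op unit a = a
  unitR : ∀ a, op a unit = a

-- Derived construction: the homomorphism type, generated from the
-- signature alone (axioms are dropped; each operation symbol yields
-- one preservation condition).
structure MonoidHom {M N : Type} (A : Monoid M) (B : Monoid N) where
  map       : M → N
  pres_unit : map A.unit = B.unit
  pres_op   : ∀ a b, map (A.op a b) = B.op (map a) (map b)
```

Writing `MonoidHom` by hand is cheap once, but repeating it for every theory in a 227-theory library is exactly the boilerplate the derivation operators eliminate.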
6. RL- and LLM-Guided Proof Synthesis and Data Augmentation
Several implementations integrate reinforcement learning and LLMs to automate proof search and tactic selection:
- RL-guided proof search in connection-style tableaux (e.g., "Reinforcement Learning of Theorem Proving" (Kaliszyk et al., 2018)) leverages MCTS with learned policy and value estimates to improve theorem coverage over hand-tuned heuristic baselines. On large-scale Mizar benchmarks, the RL-based system rlCoP demonstrates substantial improvements (+42%) in proof discovery rate over the baseline. Models are trained from proof traces represented as feature-action-reward triples, using iterative self-improvement with XGBoost-based policy and value models (Kaliszyk et al., 2018).
- LLMs fine-tuned on partial proof traces, augmented by self-improvement loops (as in ATG4CI (Xiong et al., 25 Feb 2025)), enable data flywheels that scale both datasets and proof/statement generator capability.
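Both rlCoP's MCTS and ATG4CI's PUCT search steer expansion with a selection rule of the same general shape. The sketch below is the standard AlphaZero-style PUCT formula, not either paper's exact implementation: constants, feature models, and the source of the priors (XGBoost in rlCoP, an LLM in ATG4CI) differ per system.

```python
import math

def puct_select(children, c_puct=1.5):
    """Pick the action maximizing Q(s,a) + c * P(a) * sqrt(N(s)) / (1 + N(s,a)).

    `children` maps action -> dict with keys:
      prior  -- learned policy probability P(a)
      visits -- visit count N(s,a)
      value  -- total backed-up value W(s,a)
    """
    total = sum(c["visits"] for c in children.values())
    def score(c):
        q = c["value"] / c["visits"] if c["visits"] else 0.0   # exploitation
        u = c_puct * c["prior"] * math.sqrt(total + 1) / (1 + c["visits"])
        return q + u                                           # + exploration
    return max(children, key=lambda a: score(children[a]))

# An unvisited tactic with a strong prior wins over a well-explored
# mediocre one, which is how learned priors focus the proof search.
children = {
    "apply_lemma": {"prior": 0.7, "visits": 0, "value": 0.0},
    "case_split":  {"prior": 0.3, "visits": 10, "value": 3.0},
}
print(puct_select(children))  # apply_lemma
```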
7. Formal Guarantees, Scalability, and Limitations
TheoremForge systems generally guarantee:
- Soundness: Only statements and proofs that pass formal verification (symbolic induction, congruence closure, or kernel-level type checking) are admitted (Singher et al., 2020, Xu et al., 6 Nov 2025).
- Non-triviality: Use of non-redundancy and minimal unsatisfiability (e.g., in rectangular standard contradictions) ensures strictly non-vacuous theorem content (Xu et al., 6 Nov 2025).
- Termination per layer: Resource-bounded enumeration and symbolic evaluation avoid non-terminating search in finite layers (Singher et al., 2020).
Scalability is achieved by modular parameterization (infinite families, variable arities), decoupled sub-task data extraction, and self-improving data loops. However, key limitations include the underperformance of general-purpose LLMs on proof sketching and correction sub-tasks, error propagation from semantic misalignment, and the difficulty of global proof planning in RL-directed models (Tao et al., 24 Jan 2026, Zhang et al., 21 Aug 2025).
Future work is directed toward integrating local expert models, improving sub-task learners, benchmarking against alternative agentic workflows, and generalizing frameworks for new mathematical domains and higher-order theories.
Relevant Works:
- "TheoremForge: Scaling up Formal Data Synthesis with Low-Budget Agentic Workflow" (Tao et al., 24 Jan 2026)
- "Theory Exploration Powered By Deductive Synthesis" (Singher et al., 2020)
- "Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs" (Zhang et al., 21 Aug 2025)
- "An Automated Theorem Generator with Theoretical Foundation Based on Rectangular Standard Contradiction" (Xu et al., 6 Nov 2025)
- "A Combinatorial Identities Benchmark for Theorem Proving via Automated Theorem Generation" (Xiong et al., 25 Feb 2025)
- "Leveraging the Information Contained in Theory Presentations" (Carette et al., 2020)
- "Reinforcement Learning of Theorem Proving" (Kaliszyk et al., 2018)