AxiomProver Systems: Modular Proof Automation
- AxiomProver Systems are automated reasoning architectures that combine algorithmic axiom generation with formal proof-state management to support diverse logical frameworks.
- They integrate LLM-guided agents, proof assistants, and constraint/rewrite mechanisms to iteratively refine proofs while ensuring soundness and completeness.
- Benchmark evaluations demonstrate competitive performance in formal mathematics and scientific domains, validating their modular design and domain generality.
AxiomProver Systems are automated reasoning architectures that synthesize and manipulate axiomatically-defined proof systems in a modular, domain-general fashion. They integrate algorithmic axiom generation, formal proof-state management, constraint handling, and agentic or interactive proof construction. Recent implementations tightly couple LLMs, proof assistants (notably Lean), and protocol-based tool interfaces, enabling applications in both classical logic and formal mathematics that span pure mathematics, scientific domains such as quantum theory, and multi-valued or non-classical logics (Tredici et al., 14 Oct 2025, Greati et al., 2024, Liu et al., 20 Jan 2026). The following sections detail the architecture, methodologies, tool integration, benchmark evaluations, core algorithmic and proof-theoretic innovations, and representative case studies.
1. Agentic and Modular Architectures
Modern AxiomProver systems adopt a multi-agent paradigm with distinct roles such as Orchestrator, Prover, and Verifier, each typically instantiated as a specialized prompt over a common LLM core. The agents communicate via the Model Context Protocol (MCP) (Tredici et al., 14 Oct 2025, Liu et al., 20 Jan 2026). The Orchestrator manages proof workflow, schedules subproblems, and handles feedback. The Prover synthesizes proofs in the target formalism (e.g., Lean), constructs subgoal hierarchies, retrieves library lemmas, and proposes tactics. The Verifier enforces formal correctness by performing final proof checks and emitting detailed diagnostics.
The architecture is organized around the MCP, which abstracts all interactions with the underlying proof assistant, permitting independence from the implementation specifics of language servers or library versions. This model facilitates robust tool-chaining, transparent error-handling, and seamless integration of advanced search and feedback mechanisms. Tool calls include file editing, proof state querying, library searching, tactic execution, and error reporting.
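The tool-call abstraction described above can be illustrated with a small dispatcher. This is a minimal sketch, not the MCP wire format or any real SDK: `ToolServer`, `ToolResult`, `edit_file`, and `proof_state` are hypothetical names, and the "proof assistant" is an in-memory dictionary in which every line containing `sorry` is assumed to be one open goal.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToolResult:
    ok: bool
    payload: Any = None
    error: str = ""


@dataclass
class ToolServer:
    """Hypothetical stand-in for a protocol server wrapping a proof assistant."""
    tools: Dict[str, Callable[..., Any]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self.tools[name] = fn

    def call(self, name: str, **arguments: Any) -> ToolResult:
        # Unknown tools and tool failures are returned as data rather than
        # raised, so the calling agent can inspect and react to them.
        if name not in self.tools:
            return ToolResult(ok=False, error=f"unknown tool: {name}")
        try:
            return ToolResult(ok=True, payload=self.tools[name](**arguments))
        except Exception as exc:
            return ToolResult(ok=False, error=f"{type(exc).__name__}: {exc}")


# Toy backend (assumption): files live in a dict instead of on disk.
files: Dict[str, str] = {}


def edit_file(path: str, content: str) -> int:
    files[path] = content
    return len(content)


def proof_state(path: str) -> List[int]:
    # Assumption: each line containing `sorry` counts as one open goal.
    return [i + 1 for i, line in enumerate(files[path].splitlines()) if "sorry" in line]


server = ToolServer()
server.register("edit_file", edit_file)
server.register("proof_state", proof_state)

r1 = server.call("edit_file", path="Demo.lean", content="theorem t : True := by\n  sorry\n")
r2 = server.call("proof_state", path="Demo.lean")     # open goal on line 2
r3 = server.call("proof_state", path="Missing.lean")  # tool failure, reported as data
r4 = server.call("run_tactic", tactic="simp")         # tool not registered
```

Because agents only see `call(name, **arguments)`, the backend could be swapped (a different language server, a different library version) without touching agent code, which is the independence property the paragraph above attributes to the protocol layer.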
2. Formal Proof-State Models and Search Procedures
AxiomProver systems formalize proof search as transitions over explicit proof states:
$$S_t = (F_t, G_t)$$
where $F_t$ represents the content of the formal proof artifact (e.g., Lean file) at step $t$, and $G_t$ is the list of active subgoals (e.g., unsolved sorry positions in Lean) (Tredici et al., 14 Oct 2025). The agentic loop implements a schema:
- For each subgoal $g \in G_t$, invoke library search to retrieve supporting lemmas.
- Synthesize a candidate tactic script for $g$ incorporating retrieved information.
- Apply the tactic, update the formal object, and re-extract the current subgoal set.
- Fetch diagnostics; if errors are present, feed error context back into the LLM for revision.
- Iterate until all subgoals are discharged and no errors remain.
Formally, agent transitions are classified as search, tactic, or check actions, with tool calls orchestrated in a closed feedback loop. Recent systems implement this search loop explicitly, so that only steps the proof assistant accepts as syntactically and semantically valid persist in the proof state (Tredici et al., 14 Oct 2025).
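The search, tactic, and check cycle can be sketched as follows. This is an illustrative toy, not the loop of any cited system: the proof state is a pair of file text and open goals, `LIBRARY` is a hypothetical two-lemma library, `propose` stands in for the LLM (its first attempt is deliberately wrong so that the feedback path is exercised), and `check` stands in for the proof assistant.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ProofState:
    file_text: str           # content of the proof artifact
    goals: Tuple[str, ...]   # open subgoals


# Toy lemma library (assumption): goal statement -> lemma name.
LIBRARY: Dict[str, str] = {"n + 0 = n": "add_zero", "0 + n = n": "zero_add"}


def search(goal: str) -> Optional[str]:
    """Search action: look up a supporting lemma."""
    return LIBRARY.get(goal)


def propose(goal: str, lemma: Optional[str], error: Optional[str]) -> str:
    """Tactic action (LLM stand-in): naive first try, lemma-based revision."""
    if error is None:
        return "rfl"
    return f"exact {lemma}" if lemma else "sorry"


def check(goal: str, tactic: str) -> Optional[str]:
    """Check action: None on success, otherwise a diagnostic string."""
    if goal in LIBRARY and tactic == f"exact {LIBRARY[goal]}":
        return None
    return f"tactic '{tactic}' failed on goal '{goal}'"


def run(state: ProofState, max_rounds: int = 3) -> ProofState:
    while state.goals:
        goal, rest = state.goals[0], state.goals[1:]
        lemma, error = search(goal), None
        for _ in range(max_rounds):
            tactic = propose(goal, lemma, error)
            error = check(goal, tactic)  # diagnostics feed the next proposal
            if error is None:
                break
        else:
            raise RuntimeError(f"could not close goal: {goal}")
        # Only a step that passed the check is committed to the state.
        state = ProofState(state.file_text + f"-- {goal}\n{tactic}\n", rest)
    return state


final = run(ProofState("", ("n + 0 = n", "0 + n = n")))
```

Rejected tactics never reach `file_text`; only the revised, checked tactic for each goal is recorded, which mirrors the requirement that invalid steps do not persist.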
3. Automated Generation of Axiom Systems
Automated axiom generation, as realized in classical and non-classical logic domains, proceeds through direct translation of semantic structures (e.g., logical matrices) into inference rules or axioms, followed by systematic streamlining (Greati et al., 2024). Core phases include:
- Matrix-to-rule translation: Each connective’s truth table is exhaustively converted into a family of inference rules. For $n$-ary connectives, rules are produced for every truth assignment in the matrix.
- 3-labelled and Set–Set calculi: Rule schemas are instantiated in labelled sequent calculi (Baaz–Fermüller–Zach) or multiple-conclusion Hilbert style calculi (Shoesmith–Smiley).
- Semantic separator computation: For logic classes admitting monadicity, separating formulas are synthesized using SAT/SMT techniques to uniquely characterize semantic values.
- Rule minimization: Axioms and inference rules undergo streamlined pruning via subsumption relations, redundancy elimination (Tarskian closures), and cut-elimination checks. Soundness and completeness are preserved at each step.
This process leads directly from algebraic or tabular semantics to finite, analytic, and cut-free axiom sets for a wide variety of logics, including well-known three-valued logics such as Łukasiewicz’s (Greati et al., 2024).
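The first and third phases above can be made concrete for Łukasiewicz's three-valued logic. The sketch below is a simplified illustration under stated assumptions, not the calculi of Greati et al.: rules are emitted as plain labelled strings, one per truth-table row, and a brute-force check over the fixed candidate set {p, ¬p} replaces the SAT/SMT-based separator search. Truth values 0, 1/2, 1 are encoded as indices 0, 1, 2, with 1 as the only designated value.

```python
from itertools import product
from typing import Callable, List, Optional, Tuple

VALUES = ("0", "1/2", "1")  # display names for the indices 0, 1, 2
DESIGNATED = {2}            # only the value 1 is designated


def luk_impl(x: int, y: int) -> int:
    """Łukasiewicz implication min(1, 1 - x + y), on scaled indices."""
    return min(2, 2 - x + y)


def luk_neg(x: int) -> int:
    """Łukasiewicz negation 1 - x, on scaled indices."""
    return 2 - x


Rule = Tuple[Tuple[str, ...], str]  # (labelled premises, labelled conclusion)


def matrix_to_rules(symbol: str, arity: int, table: Callable[..., int]) -> List[Rule]:
    """Emit one labelled rule per truth-table row (3**arity rules)."""
    args = ["A", "B", "C"][:arity]
    rules = []
    for row in product(range(len(VALUES)), repeat=arity):
        premises = tuple(f"{VALUES[v]}: {a}" for v, a in zip(row, args))
        conclusion = f"{VALUES[table(*row)]}: {symbol}({', '.join(args)})"
        rules.append((premises, conclusion))
    return rules


# Candidate unary formulas in one variable p (assumption: this set suffices).
CANDIDATES = {"p": lambda x: x, "¬p": luk_neg}


def separating_formula(u: int, v: int) -> Optional[str]:
    """Return a candidate that is designated on exactly one of u and v."""
    for name, f in CANDIDATES.items():
        if (f(u) in DESIGNATED) != (f(v) in DESIGNATED):
            return name
    return None


impl_rules = matrix_to_rules("→", 2, luk_impl)  # 9 rules
neg_rules = matrix_to_rules("¬", 1, luk_neg)    # 3 rules
separators = {(u, v): separating_formula(u, v) for u in range(3) for v in range(u + 1, 3)}
```

Every pair of distinct values receives a separator from {p, ¬p}, which is what monadicity asks for in this small case. The raw rule family (nine rules for implication alone) is the input that the minimization phase would then prune.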
4. Constraint-Based and Rewrite-Oriented Extensions
AxiomProver frameworks generalize unification-based proof search to constraint-based methods, supporting arbitrary theories through abstract constraint structures.
Each proof search state maintains domain-specific constraints, propagated and combined along proof-tree edges. The kernel interacts only with the theory via projection, lifting, meet, and consistency checks, supporting robust theory plug-in mechanisms. Soundness and completeness are enforced via semantic axioms and sequential refinement properties (Rouhling et al., 2014).
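A toy theory plug-in makes the four kernel-facing operations concrete. The sketch below is an assumption-laden illustration rather than the structure defined by Rouhling et al.: `DomainConstraint` is a hypothetical finite-domain theory in which each meta-variable carries a set of allowed values, and `combine` plays the role of a kernel step that merges the constraints of sibling branches.

```python
from typing import Dict, FrozenSet, Iterable, Optional

UNIVERSE: FrozenSet[int] = frozenset(range(4))  # toy value domain (assumption)


class DomainConstraint:
    """Toy theory: each meta-variable is restricted to a set of allowed values."""

    def __init__(self, domains: Dict[str, FrozenSet[int]]):
        self.domains = dict(domains)

    def meet(self, other: "DomainConstraint") -> "DomainConstraint":
        # Conjunction: intersect domains variable by variable.
        merged = dict(self.domains)
        for var, dom in other.domains.items():
            merged[var] = merged.get(var, UNIVERSE) & dom
        return DomainConstraint(merged)

    def consistent(self) -> bool:
        # Consistent iff no variable has an empty domain.
        return all(len(dom) > 0 for dom in self.domains.values())

    def project(self, var: str) -> "DomainConstraint":
        # Forget `var`; an inconsistent constraint must stay inconsistent.
        if not self.consistent():
            return self
        return DomainConstraint({v: d for v, d in self.domains.items() if v != var})

    def lift(self, var: str) -> "DomainConstraint":
        # Bring a fresh variable into scope without restricting it.
        return self.meet(DomainConstraint({var: UNIVERSE}))


def combine(branches: Iterable[DomainConstraint]) -> Optional[DomainConstraint]:
    """Kernel-side step: uses only meet and consistent, never theory internals."""
    acc = DomainConstraint({})
    for constraint in branches:
        acc = acc.meet(constraint)
        if not acc.consistent():
            return None  # the branches cannot be closed simultaneously
    return acc


left = DomainConstraint({"?X": frozenset({1, 2})})
right = DomainConstraint({"?X": frozenset({2, 3})})
clash = DomainConstraint({"?X": frozenset({0})})

agreed = combine([left, right])         # ?X is narrowed to {2}
failed = combine([left, right, clash])  # no common value, so None
```

Since `combine` never inspects `domains` directly, replacing `DomainConstraint` with another class offering the same four methods would leave the kernel-side code unchanged.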
Rewrite-oriented variants integrate oriented axioms as rewrite rules at the term and atomic proposition level, exploiting confluence and (where possible) termination. Proof search operates modulo the rewrite system $\mathcal{R}$, with sequent-calculus-modulo and resolution-modulo paradigms providing modular cut-elimination and completeness guarantees (Dowek, 2023).
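As a small illustration of reasoning modulo a rewrite system, the sketch below orients the two defining axioms of Peano addition as rules (x + 0 → x and x + S(y) → S(x + y)) and decides equality by comparing normal forms. The term encoding as nested tuples and the helper names are assumptions made for the example; this particular system is terminating and confluent, so normal forms are unique.

```python
from typing import Optional, Tuple

Term = Tuple  # ("0",), ("S", t), ("+", a, b), or a variable such as ("x",)

ZERO: Term = ("0",)


def S(t: Term) -> Term:
    return ("S", t)


def plus(a: Term, b: Term) -> Term:
    return ("+", a, b)


def step(t: Term) -> Optional[Term]:
    """Apply one rule at the root if possible, otherwise return None."""
    if t[0] == "+":
        a, b = t[1], t[2]
        if b == ZERO:            # x + 0 -> x
            return a
        if b[0] == "S":          # x + S(y) -> S(x + y)
            return S(plus(a, b[1]))
    return None


def normalize(t: Term) -> Term:
    # Innermost strategy: normalize the arguments, then rewrite at the root.
    t = (t[0],) + tuple(normalize(arg) for arg in t[1:])
    reduced = step(t)
    return normalize(reduced) if reduced is not None else t


def equal_modulo(s: Term, t: Term) -> bool:
    """Decide s = t modulo the rewrite rules by comparing normal forms."""
    return normalize(s) == normalize(t)


x: Term = ("x",)
two = S(S(ZERO))
four = normalize(plus(two, two))
```

Here `equal_modulo(plus(x, S(ZERO)), S(x))` holds by computation alone, whereas `equal_modulo(plus(ZERO, x), x)` does not: the term 0 + x is already in normal form, so that equation needs a genuine proof step (induction) rather than rewriting.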
| Architectural Style | Core Mechanisms | Completeness Criteria |
|---|---|---|
| Constraint-Generic | Abstract constraints, witness builders | Soundness, completeness, cut-elimination per (Rouhling et al., 2014) |
| Rewrite-Oriented | Term/proposition rewriting, narrowing | Confluence, modular cut-elimination per (Dowek, 2023) |
| Matrix-Driven (3-valued) | Matrix-to-rule, separation, streamlining | Monadicity, semantic equivalence per (Greati et al., 2024) |
5. LLM Integration and Human-in-the-Loop Capabilities
Integration with LLMs has enabled scalable agentic systems that combine creative search, robust feedback, and domain extensibility. The Prover agent synthesizes natural-language proof sketches, decomposes them into formal subgoals, retrieves contextually-relevant lemmas, and adapts tactics based on fine-grained diagnostics (Tredici et al., 14 Oct 2025, Liu et al., 20 Jan 2026). The modular MCP interface allows direct orchestration of code execution, proof-state introspection, multi-theorem retrieval, and even informal discussion or brainstorming among distinct LLMs.
Human-AI collaboration is facilitated by “hint injection” and editable blueprints, with the agentic system dynamically replanning proofs in response to expert direction. Case studies such as the formalization of the Brascamp–Lieb inequalities demonstrate iterative workflows with mechanisms for detecting and correcting misstatements, refining subgoal hierarchies, and leveraging blueprint DAGs to organize proof dependencies (Liu et al., 20 Jan 2026).
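A blueprint DAG and a replanning step after a hint can be sketched with Python's standard `graphlib` module. The node names, the `plan` helper, and the shape of the hint are hypothetical; the example only shows how a dependency-respecting proof order is recomputed once an expert inserts an auxiliary lemma.

```python
from graphlib import TopologicalSorter
from typing import Dict, FrozenSet, List, Set

# Blueprint (hypothetical names): each node maps to the nodes it depends on.
blueprint: Dict[str, Set[str]] = {
    "main_theorem": {"lemma_bound", "lemma_scaling"},
    "lemma_bound": {"def_datum"},
    "lemma_scaling": {"def_datum"},
    "def_datum": set(),
}


def plan(bp: Dict[str, Set[str]], proved: FrozenSet[str] = frozenset()) -> List[str]:
    """Return the unproved nodes in an order that respects dependencies."""
    order = TopologicalSorter(bp).static_order()
    return [node for node in order if node not in proved]


initial = plan(blueprint)

# Hint injection (assumption): an expert suggests proving an auxiliary lemma
# first, so a node is added and one dependency set is edited.
blueprint["lemma_aux"] = {"def_datum"}
blueprint["lemma_bound"] = {"def_datum", "lemma_aux"}

replanned = plan(blueprint, proved=frozenset({"def_datum"}))
```

`TopologicalSorter` raises `graphlib.CycleError` if an edit introduces a circular dependency, which gives a cheap sanity check on edited blueprints before any proof search is launched.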
Proof-of-concept systems for axiomatic geometry (Elfe) provide interfaces where high-level “cornerstone” reasoning is separated from mechanical inferences, leveraging external ATPs for routine FOL steps while affording transparency and custom notation (Doré et al., 2019).
6. Benchmarks and Comparative Performance
AxiomProver systems have demonstrated competitive and often state-of-the-art performance across a range of formal mathematics and scientific benchmarks:
- On the NuminaMath-LEAN and AbstractAlgebra benchmarks, Ax-Prover achieves 51% and 64% solution rates, exceeding specialized provers by significant margins (Tredici et al., 14 Oct 2025).
- For quantum-theory theorems, Ax-Prover achieves 96% versus 61%/57% for DeepSeek-Prover-V2/Kimina-Prover.
- On the Putnam 2025 exam, both AxiomProver and Numina-Lean-Agent attain perfect 12/12 scores, with Lean proof script lengths remaining within the compact range of 21–45 lines for representative solutions (Liu et al., 20 Jan 2026).
- Case studies validate the extensibility of the approach to complex, long-horizon formalizations, with human-expert guidance yielding iterative improvement and automatic error correction.
Performance is summarized in the following selection from (Tredici et al., 14 Oct 2025) and (Liu et al., 20 Jan 2026):
| Benchmark | AxiomProver | DeepSeek-Prover-V2 | Kimina-Prover | Standalone LLM | Numina-Lean-Agent |
|---|---|---|---|---|---|
| NuminaMath-LEAN (300) | 51% | 28% | 31% | 5% | – |
| AbstractAlgebra (100) | 64% | 24% | 13% | 8% | – |
| QuantumTheorems (134) | 96% | 61% | 57% | 40% | – |
| Putnam 2025 (12) | 100% (12/12) | – | – | – | 100% (12/12) |
7. Theoretical Guarantees, Extensions, and Design Properties
AxiomProver frameworks emphasize formal guarantees:
- Soundness and completeness are maintained through semantic alignment between generated rules/axioms and intended models, constraint propagation, and rewrite system properties (Rouhling et al., 2014, Greati et al., 2024, Dowek, 2023).
- Cut-elimination is guaranteed for analytic and streamlined axiom sets and for sequent-calculus-modulo settings under confluence and left-linearity conditions.
- Monadic separators and SAT-based separator computation provide criteria for analytic finitely axiomatizable logics (Greati et al., 2024).
- Self-justifying axiom systems can be handled via specialized analytic-tableaux frameworks ensuring internal consistency and kernelization, even in the presence of reflection principles (within carefully restricted language fragments) (Willard, 2013).
A strong emphasis is placed on modularity, extensibility, and theory-agnosticism. New theory plug-ins, base-model LLMs, retrieval methods, or proof assistants can be incorporated with minimal adaptation, in contrast to pipeline-specialized or monolithic provers (Liu et al., 20 Jan 2026).
AxiomProver Systems, as realized in current research and deployment, represent a point of convergence for automated axiom synthesis, agent-based proof search, constraint and rewrite-theoretic proof engines, and LLM-guided tool integration. They achieve demonstrable domain generality, strong theoretical guarantees, extensibility, and leading performance in both established and novel mathematical and scientific formalization tasks (Tredici et al., 14 Oct 2025, Greati et al., 2024, Liu et al., 20 Jan 2026, Rouhling et al., 2014, Dowek, 2023, Doré et al., 2019, Willard, 2013).