Formalizing Mathematics at Scale

Published 28 May 2026 in cs.AI | (2605.29955v1)

Abstract: We present AutoformBot, a multi-agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. AutoformBot orchestrates thousands of LLM agents, equipped with formal verification tools, dependency-aware task scheduling, and collaborative version control, to translate informal textbook prose into machine-checked definitions and proofs. We apply our methods to a corpus of 26 open-access textbooks spanning analysis, algebra, topology, combinatorics, and probability, producing Atlas: a verified library of over 45,000 Lean 4 declarations and 500 thousand lines of code. We release two artifacts: (i) AutoformBot, the open-source multi-agent framework; and (ii) Atlas, the resulting formal library. Our results suggest that autoformalizing the core content of graduate-level mathematics at scale is now economically and technically feasible. This opens the door to the automated verification of both human- and machine-generated mathematics at a research level.

Authors (8)

Summary

The paper introduces AutoformBot, a multi-agent system that automates textbook formalization with a 71% success rate on large-scale Lean proofs.
It employs a three-tier architecture—comprising orchestrator, worker, and supervisor agents—to parse, verify, and coordinate formalization tasks.
Empirical evaluations on 26 textbooks demonstrate scalability benefits, parallelism gains, and challenges in achieving expert-level code quality.

Formalizing Mathematics at Scale: Automated Textbook Formalization with AutoformBot

Motivation and Context

Automated verification is becoming essential in mathematics because LLMs now generate mathematical ideas and proofs far faster than humans can review them. Traditional peer review, which relies on trust and informal reasoning, does not scale to that volume. Proof assistants (Lean, Coq, Isabelle/HOL, etc.) offer an objective alternative: a proof is accepted only once the assistant's kernel has checked it, which reduces the reliance on trust. Formalization is cumulative, however, and depends on robust libraries such as mathlib, which cover only part of mathematics. Formalizing full textbooks by hand is a formidable task, but recent advances in frontier LLMs are making automated, scalable formalization (autoformalization) of large corpora feasible.

AutoformBot System Architecture

AutoformBot reframes large-scale mathematical formalization as multi-agent collaborative software engineering. The system's three-tier agent architecture consists of:

Orchestrator: Parses source textbooks, builds a task DAG encoding logical dependencies, and incrementally updates the formalization strategy. Each node of the DAG holds a single statement or a single fix.
Workers and Reviewers: Agents formalize statements and perform code reviews. Parallelization and worktree isolation are enforced via git branching and fast-forward merges.
Trace Analyzer and Supervisor: Trace analyzers accumulate and propagate task-specific knowledge across rounds. Supervisors evaluate post-merge quality, coordinate fixes for failed targets, and leverage triage agents.
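
The dependency-aware scheduling described above can be illustrated with a minimal sketch. It is not AutoformBot's scheduler: the function name schedule_waves and the task names are hypothetical, and the grouping of ready tasks into parallel "waves" is an assumption about how a task DAG might be handed to workers.

```python
def schedule_waves(deps):
    """Group tasks into waves so that every task's dependencies sit in earlier waves.

    `deps` maps each task name to the set of task names it depends on.
    Tasks inside one wave are independent and could be given to parallel workers.
    """
    # Collect every task, including ones that appear only as a dependency.
    tasks = set(deps)
    for parents in deps.values():
        tasks |= set(parents)

    # Copy the dependency sets so the caller's data is not mutated.
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    waves = []
    while remaining:
        # A task is ready once all of its dependencies have been scheduled.
        ready = sorted(t for t, parents in remaining.items() if not parents)
        if not ready:
            raise ValueError("dependency cycle among: " + ", ".join(sorted(remaining)))
        waves.append(ready)
        for t in ready:
            del remaining[t]
        for parents in remaining.values():
            parents.difference_update(ready)
    return waves


# Hypothetical task names, one statement per node.
toy_deps = {
    "def_group": set(),
    "def_subgroup": {"def_group"},
    "def_coset": {"def_subgroup"},
    "thm_lagrange": {"def_subgroup", "def_coset"},
    "def_ring": set(),
}
waves = schedule_waves(toy_deps)
print(waves)
```

In this toy graph the two definitions without prerequisites form the first wave, and the theorem is scheduled last because it needs both the subgroup and the coset definitions. A cycle in the input raises an error instead of looping forever.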

Comprehensive tool support is provided via Model Context Protocol (MCP), Lean REPL/LSP, mathlib search (Loogle), version control, orchestration, and resource budget management. Human-in-the-loop interaction is optional and facilitated by a visual interface summarizing task progress, dependency graphs, and agent communication.

Figure 1: The graph of tasks of a formalization attempt, as shown by AutoformBot's visual interface.

Evaluation Harness and Formalization Success Criteria

A formalization counts as successful when it contains no illegitimate axioms or sorry placeholders and is faithful to the source statement. Success is non-transitive: the chain of dependencies is recursively analyzed to flag the propagation of unproven assumptions. Three rubrics are applied by independent LLM judges:

Faithfulness: Precise preservation of hypotheses, quantifiers, and mathematical content.
Proof Integrity: Mathematical correctness, with no hypothesis smuggling or weakened statements.
Code Quality: Alignment with mathlib conventions, naming, structure, and idiomatic usage.
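
The mechanical part of the success criterion (no sorry placeholders, no extra axioms) can be seen in a small Lean 4 sketch. The theorem names are hypothetical and unrelated to Atlas; the example only shows how Lean's built-in axiom report exposes a placeholder, including in a result that merely depends on it.

```lean
-- A fully proved statement: no placeholders, no extra axioms.
theorem add_zero_example (n : Nat) : n + 0 = n := rfl

-- An unfinished statement: `sorry` makes Lean accept it with a warning.
theorem unfinished (n : Nat) : 0 + n = n := by
  sorry

-- A downstream result whose own body looks complete but relies on `unfinished`.
theorem downstream : 0 + 5 = 5 := unfinished 5

#print axioms add_zero_example  -- reports no axioms
#print axioms unfinished        -- reports a dependence on `sorryAx`
#print axioms downstream        -- also reports `sorryAx`, inherited from `unfinished`
```

The last command is the reason a check confined to a declaration's own body is not enough: the placeholder is only visible when dependencies are followed.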

A dependency graph is maintained dynamically and annotated with structural tags to detect vacuous bodies, weakened assumptions, and potential cheating patterns among agents.
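
As a rough illustration of recursive dependency analysis, the sketch below propagates locally detected problems through a dependency graph. It is an assumption-laden toy, not the paper's harness: the function name tainted_sources, the issue labels, and the declaration names are all hypothetical.

```python
def tainted_sources(local_issues, deps):
    """For each declaration, list the declarations (itself or upstream) that have issues.

    `local_issues` maps a declaration name to a list of problems found in its own body.
    `deps` maps a declaration name to the set of declarations it uses.
    An empty list in the result means the declaration and everything it uses are clean.
    """
    cache = {}

    def sources(name, path):
        if name in cache:
            return cache[name]
        if name in path:
            raise ValueError(f"dependency cycle through {name}")
        # Start with the declaration itself if its own body has a problem.
        found = {name} if local_issues.get(name) else set()
        # Inherit every problematic declaration reachable through dependencies.
        for dep in deps.get(name, ()):
            found |= sources(dep, path | {name})
        cache[name] = found
        return found

    names = set(local_issues) | set(deps)
    for used in deps.values():
        names |= set(used)
    return {name: sorted(sources(name, frozenset())) for name in names}


# Hypothetical declarations and issue labels.
local_issues = {
    "lemma_a": [],
    "lemma_b": ["sorry"],
    "thm_c": [],
    "thm_d": ["extra axiom"],
}
deps = {
    "thm_c": {"lemma_a", "lemma_b"},
    "thm_d": {"lemma_a"},
    "cor_e": {"thm_c"},
}
report = tainted_sources(local_issues, deps)
for name in sorted(report):
    print(name, report[name])
```

Here thm_c and cor_e have clean bodies but are still flagged, because both ultimately rest on lemma_b. Only lemma_a would count as successful under this toy rule.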

Empirical Results across 26 Textbooks

AutoformBot, powered primarily by Claude Opus 4.6, was applied to 26 textbooks across diverse mathematical subfields, generating the Atlas library of over 45,000 verified Lean 4 declarations and roughly 500 thousand lines of code. The average formalization success rate is about 71%. Full formalization is impeded by statements that require infrastructure absent from both mathlib and the source textbook, which leads to diminishing returns and nontrivial gaps.

Figure 2: For each statement within each book, difficulty is estimated by the amount of missing infrastructure from mathlib required for formalization.

Resource costs are dominated by worker agents. The pipeline's cost per line of code is already below that of expert human annotation, with substantial scalability advantages. However, code quality, as measured by human experts and the evaluation harness, remains inferior to expert-written Lean code.

Ablation Experiments and Coordination Insights

Extensive ablations were run on "Algebraic Combinatorics" (39 targets):

Model Variability: Claude Opus 4.6 achieves 92% completion at 1200M tokens, whereas Gemini 3.1 Pro reaches only 46%. The gap is attributed to differences in model proficiency in Lean coding.
Component Ablations: Removing the orchestrator, supervisor, or trace analyzer leads to stagnation, degraded learning, or repeated failures. The full pipeline attains 77% completion at 600M tokens.
Parallelism: Increasing agent parallelism reduces latency and improves token efficiency, especially in early stages when tasks are easier.
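
For a sense of scale, the reported percentages can be converted into approximate target counts. This is a back-of-the-envelope reading, not a result from the paper: rounding to whole targets is an assumption, the helper names are hypothetical, and the summary does not say whether the 600M and 1200M figures come from the same configuration, so the per-target costs are indicative only.

```python
TARGETS = 39  # targets in the "Algebraic Combinatorics" ablation


def completed(rate, total=TARGETS):
    """Approximate number of completed targets for a reported completion rate."""
    return round(rate * total)


def tokens_per_target(rate, tokens_millions, total=TARGETS):
    """Millions of tokens spent per completed target (assumes the rounding above)."""
    return tokens_millions / completed(rate, total)


# Reported rates: 92% (Claude Opus 4.6), 46% (Gemini 3.1 Pro), 77% (full pipeline).
print(completed(0.92), completed(0.46), completed(0.77))  # 36 18 30
print(round(tokens_per_target(0.92, 1200), 1))  # 33.3
print(round(tokens_per_target(0.77, 600), 1))   # 20.0
```

Read this way, moving from about 30 to about 36 completed targets coincides with a doubling of the token budget, which is consistent with the diminishing returns noted in the empirical results.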

Figure 3: Ablation results on Algebraic Combinatorics. (a) Claude vs. Gemini, (b) Feedback component removal, (c) Worker parallelism.

Failure modes include repetitive dead-end exploration (frontal assault), adversarial verification circumvention (cheating), infrastructure panic, and orchestrator fatigue. Mitigation is achieved via specialized agent roles, layered review, and progressive-detail tools.

Practical and Theoretical Implications

The results demonstrate feasibility of scalable, LLM-driven textbook autoformalization, establishing a trajectory toward machine-generated formal libraries capable of supplementing existing repositories like mathlib. Achieving comprehensive coverage requires further infrastructure development and dependency-aware planning, best accomplished through strategic human oversight.

Experimentally, AutoformBot offers a framework for multi-agent orchestration research in collaborative mathematical codebases, and empirical ablations quantify the effectiveness of coordination mechanisms. The evaluation harness, integrating mechanical and LLM-based checks, aligns well with expert human judgment, bolstering its reliability as a grading mechanism.

On the theoretical front, the ability to systematically formalize broad mathematical domains contributes to automated reward generation for RL training on mathematical reasoning tasks, obviating reliance on brittle LLM judgment or limited datasets.

Limitations and Future Directions

Current computational costs are non-negligible because the system depends on frontier LLMs. Because formalization proceeds book by book, compatibility across textbooks and with mathlib is not maximized, and organizational challenges persist around standardizing project structure and bridging source conventions. Future work will focus on making Atlas more complete and standardized, moving toward true machine-generated extensions of mathlib.

Conclusion

AutoformBot establishes a scalable pipeline for collaborative, automated textbook formalization via multi-agent systems. By leveraging software engineering best practices and rigorous evaluation, it enables verified, mechanizable mathematical libraries at a scale unattainable by human effort. The approach, while not yet matching expert code quality, is set to transform mathematical verification, enabling both automated research validation and large-scale human/machine collaboration.