- The paper presents multi-agent AI orchestrating the end-to-end formalization of a graduate algebraic combinatorics textbook into Lean.
- It leverages a robust software-engineering pipeline with roles like provers and maintainers to ensure independent review and high code quality.
- Empirical results show competitive cost and efficiency compared to human teams, while exposing orchestration challenges in large-scale formal verification.
Multi-Agent Automatic Formalization of a Graduate Mathematics Textbook
Introduction and Motivation
This paper presents an empirical assessment of large-scale, high-fidelity mathematical textbook formalization using multi-agent orchestrated AI, specifically a fleet of Claude 4.5 Opus coding agents, targeting the translation of a full 500-page graduate-level algebraic combinatorics textbook into the Lean theorem prover. The work directly addresses the two bottlenecks of manual formalization: limited coverage of foundational content and the high cost in time and expertise, both of which have traditionally slowed the adoption of formal methods in mainstream research mathematics.
Unlike previous efforts focused on isolated problems, synthetic competitions, or partial formalizations, this study achieves end-to-end formalization of a previously unformalized area, spanning 130K lines and 5900 declarations, with all 340 designated theorems and definitions formalized. The experiment operates at a new scale for both automated theorem proving and collaborative multi-agent software engineering, providing critical data on the economics, orchestration, and coherence challenges that arise in the transition from experimental to production-grade formal mathematics.
System Architecture and Agentic Orchestration
The formalization pipeline leverages a simple yet robust orchestration layer embodying canonical practices from human software engineering: task decomposition, short-lived feature branches with git-based trunk development, independent code review, merge queues, and a universal issue tracker for communication and escalation. Agents are organized into distinct roles with parametrized tasks—sketchers (statement scaffolding), provers (proof construction), maintainers (issue resolution/refactoring), triage agents (issue clean-up), scan/progress agents (code base analysis), and reviewers (mathematical and engineering standards enforcement).
This structure targets the algorithmic coordination problem endemic to multi-agent coding systems, leveraging version control and file-based communication to minimize conflicts, enforce code quality guidelines, and allow aggressive parallelization with minimal bespoke centralization logic. The system demonstrates that, with appropriate abstraction and role definitions, multi-agent proof engineering is not only possible but highly effective for large-scale, high-stakes mathematical texts.
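The role and branch structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class names, the branch naming scheme, and the `thm_example` target are hypothetical, and the merge queue is reduced to a first-in, first-out list that only accepts work that passed review.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Role(Enum):
    # Roles as listed in the text; the string values are illustrative.
    SKETCHER = "sketcher"      # statement scaffolding
    PROVER = "prover"          # proof construction
    MAINTAINER = "maintainer"  # issue resolution / refactoring
    TRIAGE = "triage"          # issue clean-up
    SCAN = "scan"              # code base analysis
    REVIEWER = "reviewer"      # standards enforcement


@dataclass
class Task:
    role: Role
    target: str  # hypothetical identifier for a theorem or definition

    @property
    def branch(self) -> str:
        # Assumed naming scheme for a short-lived feature branch.
        return f"{self.role.value}/{self.target}"


@dataclass
class MergeQueue:
    pending: List[Task] = field(default_factory=list)
    trunk: List[str] = field(default_factory=list)

    def submit(self, task: Task, review_passed: bool) -> bool:
        # Only independently reviewed work may enter the queue.
        if not review_passed:
            return False
        self.pending.append(task)
        return True

    def drain(self) -> List[str]:
        # Merge queued branches into trunk one at a time, in order.
        while self.pending:
            self.trunk.append(self.pending.pop(0).branch)
        return self.trunk


queue = MergeQueue()
queue.submit(Task(Role.SKETCHER, "thm_example"), review_passed=True)
queue.submit(Task(Role.PROVER, "thm_example"), review_passed=False)
merged = queue.drain()
```

In this toy run only the sketcher's branch reaches trunk, because the prover's submission did not pass review. A real system would also rebase and rebuild each branch before merging, which this sketch omits.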

Figure 1: Progress of the automatic formalization system, showcasing distinct phases of code expansion and subsequent cleanup, culminating in the completion of all 340 formalization targets.
Empirical Results and System Dynamics
The experiment formalized all 340 targeted theorems and definitions within one week, using 30,000 agent runs distributed over eight machines, collectively producing 130K lines of Lean code. Notably, statement formalization and proof construction tasks were rigorously separated and subject to independent mathematical and engineering review, with the review process pushing coverage and code quality to human-like levels. Task granularity enforced small, atomic pull requests (typically ≈100 lines), facilitating efficient code review and minimizing coordination overhead.
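To put these totals in proportion, the short calculation below derives rough per-target and per-machine averages from the figures quoted above. The inputs are the rounded numbers from the text, so the results are order-of-magnitude estimates; the assumption that every pull request is exactly 100 lines is a simplification.

```python
# Rounded figures quoted in the text.
TARGETS = 340
AGENT_RUNS = 30_000
MACHINES = 8
LEAN_LINES = 130_000
TYPICAL_PR_LINES = 100  # simplification of "typically about 100 lines"

runs_per_target = AGENT_RUNS / TARGETS    # about 88 agent runs per target
lines_per_target = LEAN_LINES / TARGETS   # about 382 lines of Lean per target
runs_per_machine = AGENT_RUNS / MACHINES  # 3750 runs per machine

# Rough lower bound on the number of PR-sized chunks in the final code,
# ignoring lines that were later deleted or rewritten.
pr_sized_chunks = LEAN_LINES / TYPICAL_PR_LINES

print(f"{runs_per_target:.0f} runs and {lines_per_target:.0f} lines per target")
print(f"{runs_per_machine:.0f} runs per machine, {pr_sized_chunks:.0f} PR-sized chunks")
```

The averages hide a lot of variation: many runs belong to reviewers, maintainers, and triage agents rather than to work on a single target.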


Figure 2: Pull request line count histograms, demonstrating the granularity and atomicity enforced in the multi-agent setting.
Token usage analysis indicates the formalization consumed approximately 83 billion input tokens and 561 million output tokens, corresponding to an estimated $100K compute cost, a price that is competitive with or below the compensation for a comparably sized expert human team. Instrumentation exposed system inefficiencies in early phases (e.g., network file system contention, excessive agent aborts, merge queue bottlenecks), most of which were substantially mitigated during the run. The post-hoc analysis suggests a 3–10x efficiency gain is possible by eliminating duplicate/blocked work and introducing better dependency tracking, even without further model improvements.
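The cost figures can be broken down into unit costs with simple arithmetic. The sketch below uses only the rounded numbers reported in this summary. The projected range is an assumption: it supposes that a 3–10x efficiency gain would reduce compute cost by the same factor, which the source does not state explicitly.

```python
# Rounded figures quoted in the text.
INPUT_TOKENS = 83e9
OUTPUT_TOKENS = 561e6
COMPUTE_COST_USD = 100_000
TARGETS = 340
LEAN_LINES = 130_000
AGENT_RUNS = 30_000

cost_per_target = COMPUTE_COST_USD / TARGETS        # roughly $294 per target
cost_per_line = COMPUTE_COST_USD / LEAN_LINES       # roughly $0.77 per line
input_output_ratio = INPUT_TOKENS / OUTPUT_TOKENS   # roughly 148 input tokens per output token
input_tokens_per_run = INPUT_TOKENS / AGENT_RUNS    # roughly 2.8 million per run

# Assumption: efficiency gains translate one-to-one into lower cost.
projected_cost_low = COMPUTE_COST_USD / 10
projected_cost_high = COMPUTE_COST_USD / 3

print(f"${cost_per_target:.0f} per target, ${cost_per_line:.2f} per line")
print(f"{input_output_ratio:.0f}:1 input-to-output token ratio")
print(f"projected cost: ${projected_cost_low:,.0f} to ${projected_cost_high:,.0f}")
```

The high input-to-output ratio is consistent with agents repeatedly re-reading a large codebase and long contexts while emitting comparatively little new text.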
Figure 3: Evolution of agent outcomes, showing improved throughput and success rates over the run as systems matured.
Agent Role Interactions and Failure Modes
A detailed comparison of prover and maintainer agent outcomes reveals contrasting operational profiles: provers were primarily localized to single-file operations, while maintainers executed larger refactorings and global codebase repairs (occasionally modifying up to 14 files per PR). This division corresponds to the needs of large-scale mathematical engineering but also introduces new coherence failure modes. Multiple independent formalizations of mathematically equivalent structures (e.g., N-partitions, Bender-Knuth involutions) were produced, necessitating subsequent refactorings to unify APIs and enforce global invariants. Agents also frequently pursued non-goal “rabbit holes,” consuming resources on cited theorems or exercises, though later interventions successfully curbed this behavior.

Figure 4: Comparative statistics on file modifications: maintainers’ broader scope versus provers’ single-file focus.
This study delivers several claims with broad implications:
- The cost of high-quality, verifiable formalization of advanced mathematical texts can already approach or beat traditional human-centric approaches, even when accounting for orchestration overhead.
- The principal barriers in scaling autoformalization are no longer the logical reasoning capabilities of the models but the orchestration of distributed, heterogeneous labor at the codebase scale.
- Aggressive agentic parallelization is viable and robust given suitable coordination protocols, but project-wide coherence enforcement (definitions, conventions, API boundaries) remains a nontrivial open problem.
- The methodology is immediately transferable to larger, cascade-structured formalization efforts and is not reliant on further advances in base LLMs.
The results also provide empirical grounding for discussions of formalization as reusable infrastructure and as a generator of diversified, high-quality formal datasets for future ML training. Widespread, routine textbook formalization would support robust research verification and fuel advances in RL-based theorem proving, algorithmic mathematics discovery, and formalized mathematics as a discipline.
Future Directions
The authors articulate the necessity of a concerted focus on textbook formalization rather than specialized competition or ad hoc research results. They estimate the foundational core for modern mathematics formalization at 1,000–10,000 textbooks, which, if systematized, would drastically reduce the amortized cost of formalizing novel research. Challenges at scale include dependency ordering, convention harmonization, and community-level project governance—issues akin to but larger than those encountered here.
Automated coherence enforcement, advanced dependency analysis, and adaptive agentic orchestration (e.g., simulated annealing of project “freedom” versus “tightness”) are identified as pivotal for future scalability. The continued downward trajectory of LLM inference costs, coupled with the demonstrated efficacy of agentic architectures, implies that compute and capital—not human labor—may rapidly become the limiting factor for formalization at the scale of mathematical corpora.
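One simple form of the dependency analysis mentioned above is to dispatch a target only after everything it depends on has been merged, so that agents are not started on blocked work. The sketch below uses Python's standard `graphlib.TopologicalSorter`. The dependency graph is hypothetical and not taken from the paper: two names echo structures mentioned in this summary, and the rest are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each target maps to the set of targets it depends on.
deps = {
    "young_tableau_def": set(),
    "n_partition_def": set(),
    "bender_knuth_involution": {"young_tableau_def"},
    "schur_symmetry": {"bender_knuth_involution"},
}


def dispatch_in_waves(graph):
    """Group targets into waves; each wave can be worked on in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # targets with all dependencies done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave as merged, unblocking dependents
    return waves


waves = dispatch_in_waves(deps)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Here the two definitions form the first wave, and each later wave becomes available only once its prerequisites are complete. A production scheduler would mark targets as done individually as their pull requests merge rather than wave by wave.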
Conclusion
This paper establishes that multi-agent AI systems, appropriately orchestrated, can formally verify complex, graduate-level mathematics at a scale, quality, and cost that directly challenge human-only efforts. The principal hurdles are now organizational rather than cognitive, with future progress hinging on advances in orchestration, integration of formalized infrastructure, and further automation of project-level coherence checking.
These findings catalyze the transition to a new regime of “agent-engineered” mathematics, where formalization is infrastructural and machine-checked correctness is routine. This paradigm has immediate impact on mathematical verification, ML reasoning research, and scalable collective knowledge engineering.
Reference: “Automatic Textbook Formalization” (2604.03071).