Atlas: Autoformalized Textbook Library at Scale

Updated 4 July 2026

The work demonstrates a layered framework that converts textbook mathematics into verified Lean libraries using data-centric synthesis, augmentation, and semantic alignment.
It employs verifier-in-the-loop compilation and multi-agent orchestration to manage dependencies, proof repair, and human oversight for project-scale autoformalization.
Atlas achieves strong empirical performance with thousands of proofs and over 45,000 declarations, scaling textbook formalization into a coherent, buildable library.

Autoformalized Textbook Library At Scale (Atlas) denotes a line of research centered on converting textbook mathematics into machine-checked formal libraries at project scale. In recent work, the term is used both for concrete verified corpora and for the methodological stack needed to produce them: large-scale natural-language-to-formal-language data generation, verifier-in-the-loop compilation and proof repair, multi-agent repository orchestration, and human-guided semantic review (Rammal et al., 28 May 2026, Wang et al., 19 Feb 2026, Liu et al., 8 Feb 2025, Yanahama et al., 16 Mar 2026). The resulting program moves autoformalization from isolated theorems and short snippets toward end-to-end textbook ingestion, with explicit concern for dependency management, buildability, proof completion, provenance, and semantic fidelity.

1. Atlas as artifact, framework, and review layer

The literature does not present Atlas as a single monolithic system. Rather, separate papers instantiate different layers of the same objective. "ATLAS" in "Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data" is a data-generation framework for constructing large NL–FL corpora in Lean4. "Atlas" in "Formalizing Mathematics at Scale" is the resulting collection of verified Lean 4 libraries produced by AutoformBot. In "M2F: Automated Formalization of Mathematical Literature at Scale," Atlas is described as something that can be concretely instantiated by adopting M2F’s project-scale, verifier-in-the-loop methodology. "Lean Atlas" is an integrated proof environment for semantic verification of large formalization projects (Liu et al., 8 Feb 2025, Rammal et al., 28 May 2026, Wang et al., 19 Feb 2026, Yanahama et al., 16 Mar 2026).

Work	Primary role	Reported artifact
ATLAS	Data generation for NL–FL theorem statements	10-iteration Lean4 pipeline with concept lifting, synthesis, and augmentation
M2F	End-to-end project-scale autoformalization in Lean	241 files, 4,116 declarations, 153,853 lines from 479 pages
AutoformBot / Atlas	Multi-agent production of textbook libraries	26 textbooks, 45,000+ declarations, 483,918 lines
Lean Atlas	Human-in-the-loop semantic review environment	Dependency-graph extraction, Lean Compass, interactive web viewer

This suggests that Atlas is best understood as a layered program of work rather than a single algorithm. The shared target is a large, reusable, formally verified library derived from textbook mathematics, but the enabling techniques differ substantially: some papers optimize training data, some optimize project compilation and proof repair, some optimize software-engineering throughput, and some optimize semantic auditing.

2. Why textbook-scale autoformalization is difficult

Textbook-scale autoformalization is harder than theorem-level translation because project-level constraints dominate. M2F characterizes the main failure sources as cross-file import choices, name resolution, namespace governance, typing cascades, and implicit dependencies, often before proof search is even meaningful. At this scale, a library must compile end-to-end; otherwise the prover cannot even load goals (Wang et al., 19 Feb 2026).

ATLAS frames the problem from a data perspective. Formal languages such as Lean, Isabelle, Coq, and HOL Light are unforgiving: a single missing type annotation or mis-specified quantifier breaks compilation, and even syntactically valid output can miss the intended semantics. Natural-language mathematics is also highly contextual. If a statement invokes a concept not present in the formal library, the missing definition must be formalized first, which the paper describes as nontrivial even for experts. This makes large, high-quality NL–FL corpora a central bottleneck (Liu et al., 8 Feb 2025).

A second difficulty is that mechanical correctness is not semantic correctness. Lean Atlas defines semantic hallucination as the case where a formalization passes the type checker and may even have a completed proof, but is not semantically equivalent to the intended mathematical content. Its example is the sentence “3/2 = 1.5”: if a type annotation is omitted, Lean may default to Nat, producing 3 / 2 = 1, which type-checks and can be proved, but is semantically wrong for the intended real-number statement. The paper identifies recurring patterns such as definition mismatch, missing or extra assumptions, goal substitution, quantifier/scope errors, and type-default semantics shifts (Yanahama et al., 16 Mar 2026).
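The type-default shift in the “3/2 = 1.5” example can be reproduced in a few lines of Lean 4. This is a minimal sketch, assuming Mathlib is available for the real-number case; the two `example` declarations are illustrative and are not taken from the Lean Atlas paper.

```lean
import Mathlib

-- With no type annotation the numerals elaborate as `Nat`, where `/` truncates.
-- This statement type-checks and is provable, but it is not the intended claim.
example : 3 / 2 = 1 := by decide

-- Annotating one numeral as a real number recovers the intended statement.
example : (3 : ℝ) / 2 = 1.5 := by norm_num
```

Both declarations are accepted by the kernel, which is why a type checker alone cannot tell the intended statement from the shifted one.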

Lean-GAP shows the same gap in a textbook exercise setting. Its pipeline finds that preprocessing and first-pass autoformalization can be largely automated, but verification remains “the most subtle and labor-intensive component,” requiring careful human oversight. The project therefore separates syntactic success from semantic fidelity and treats human sign-off, rather than elaboration alone, as the certification step (Lee et al., 20 May 2026).

3. Data-centric Atlas: lifting, synthesis, augmentation, and NL–FL corpora

"ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data" addresses the corpus bottleneck by starting from a controlled concept repository lifted from Mathlib rather than from noisy web text or paraphrases of existing theorems. Its stated scope is undergraduate-level mathematics in Lean4 with Mathlib, organized into 13 domains, 55 topics, and 350 concepts curated from Mathlib’s undergraduate syllabus. The synthesis loop samples two concepts at random, asks a teacher model to generate a concise NL theorem integrating both, translates the theorem into Lean, parses the code into theorem_name, theorem_variables, theorem_hypotheses, and theorem_conclusion, compiles it with Lean’s REPL, repairs failures with a teacher model, and retains only semantically aligned pairs rated good or average by a separate alignment model. Verified pairs are then enlarged by two Lean-structure-aware augmentations: proof-step snapshots scraped from Infoview during tactic execution, and contraposition derived from contrapose! on individual hypotheses (Liu et al., 8 Feb 2025).

The training loop runs for 10 iterations. Each iteration samples 10,000 concept pairs and produces 10,000 NL statements. The student model is then fine-tuned for 3 epochs with cosine decay and learning rate 1e-5, using student sampling parameters top-p = 0.9 and temperature 0.6. The abstract reports that, with 10 iterations, the framework constructs an undergraduate-level dataset comprising 117k theorem statements; the detailed system description presents the ATLAS dataset as 300k undergraduate-level NL–FL theorem statements. In both descriptions, the core claim is that a large verified corpus can be synthesized under compiler and teacher-model control rather than collected directly from naturally occurring aligned data (Liu et al., 8 Feb 2025).
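The contraposition augmentation used to enlarge the verified pairs can be illustrated on a single statement. The sketch below assumes Mathlib and uses hypothetical theorem names; proofs are left as `sorry` because the corpus consists of theorem statements. The second statement is what the goal state reads as after `contrapose! hx` is run on the first.

```lean
import Mathlib

-- Original statement (hypothetical example, proof omitted).
theorem demo_sq_pos (x : ℝ) (hx : x ≠ 0) : 0 < x ^ 2 := by
  sorry

-- Running `contrapose! hx` on the goal above leaves
--   hx : x ^ 2 ≤ 0 ⊢ x = 0,
-- which can be packaged as a new statement over the same variables.
theorem demo_sq_pos_contra (x : ℝ) (hx : x ^ 2 ≤ 0) : x = 0 := by
  sorry
```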

The empirical outcome is a Lean4 translator with strong benchmark performance. On ProofNet, ATLAS Translator reaches 56.87% pass@1, 80.59% pass@8, and 92.99% pass@128; on miniF2F it reaches 70.08%, 91.60%, and 96.93%; on the 530-sample MathQual test subset it reaches 38.87%, 65.47%, and 84.72%. The abstract reports statistically significant improvements over both the HERALD Translator and the Kimina-Autoformalizer across all benchmarks, and the detailed results show that synthetic-only training is already strong, while contraposition helps more than proof augmentation because proof correctness is harder to guarantee. A broader implication is that Atlas can be bootstrapped not only by direct textbook translation, but also by continually expanding verified statement corpora aligned with formal libraries.
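The summary above does not say how the pass@k figures were estimated. As an assumption for illustration only, the sketch below shows one widely used unbiased estimator, which computes pass@k for a single problem from n sampled translations of which c were judged correct; the function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples with c successes.

    Uses 1 - C(n - c, k) / C(n, k): the probability that a random size-k
    subset of the n samples contains at least one success.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset has a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score would then be the mean of this quantity over all problems.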

4. Project-scale Atlas: verifier-in-the-loop compilation and proof repair

M2F reframes textbook autoformalization as project-scale knowledge compilation under a pinned environment $E$. The environment consists of the Lean toolchain plus a pinned dependency snapshot, and verification is exposed through file-level and project-level oracles:

$\mathrm{VerifyProj}_E(\mathcal{P}) \to (ok,\Delta), \qquad \mathrm{VerifyFile}_E(\mathcal{P}, f) \to (ok,\Delta_f).$

When a file verifies, goal states can be queried at a hole $h$ via

$\mathrm{GoalState}_E(\mathcal{P}, f, h) \in \{(g,\Gamma), \bot\},$

where $g$ is the goal type and $\Gamma = (x_1:T_1,\ldots,x_m:T_m)$ is the local context. Proof repair is measured by proof success rate

$s = |H_{\mathrm{closed}}| / |H|,$

with $H$ the matched-statement proof holes and $H_{\mathrm{closed}}$ the subset solved under $E$ (Wang et al., 19 Feb 2026).

The architecture is two-stage and is built around a single refinement primitive, VeriRefine. In Stage 1, statement compilation, long-form sources are normalized into JSON atomic blocks with provenance, a dependency DAG is inferred, and blocks are scheduled in a topological order. For each block, an LLM proposes a Lean declaration skeleton such as theorem ... := by sorry, lemma ... := by sorry, def ... := sorry, or instance ... := by sorry. The file is compiled, and localized repairs are accepted only if Lean’s toolchain certifies strict improvement under a lexicographic objective that first reduces global error count and then localized error count. In Stage 2, proof repair, the project starts from compilable statements with frozen signatures; edits target only proofs and optional local helpers that do not change existing signatures. Candidate proof patches are goal-conditioned local edits, and acceptance is again lexicographic, this time minimizing file errors first and then the count of sorry holes outside comments and strings (Wang et al., 19 Feb 2026).
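The lexicographic acceptance rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not M2F's implementation: `refine`, `apply_patch`, and `verify` are hypothetical names, and a score is modelled as a pair such as (file errors, sorry holes) where lower is better.

```python
from typing import Callable, Iterable, Tuple, TypeVar

State = TypeVar("State")
Patch = TypeVar("Patch")
Score = Tuple[int, int]  # (primary count, secondary count); lower is better


def strictly_improves(candidate: Score, current: Score) -> bool:
    """Lexicographic test: Python compares tuples element by element."""
    return candidate < current


def refine(
    state: State,
    score: Score,
    patches: Iterable[Patch],
    apply_patch: Callable[[State, Patch], State],
    verify: Callable[[State], Score],
) -> Tuple[State, Score]:
    """Commit a patch only when the verifier reports strict progress."""
    for patch in patches:
        candidate = apply_patch(state, patch)
        candidate_score = verify(candidate)  # one verifier call per attempt
        if strictly_improves(candidate_score, score):
            state, score = candidate, candidate_score
    return state, score
```

Because rejected patches leave the state untouched, the score never gets worse. A patch that removes a file error is accepted even if it introduces extra sorry holes, since the first component dominates.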

The practical significance is that M2F enforces monotone improvement. Each attempted patch causes exactly one invocation of $\mathrm{VerifyFile}_E$, and patches are committed only when the verifier confirms strict progress. The framework permits typed stubs to break cycles, stabilizes imports and namespaces per file, and can split oversized files into sectionYY_partK.lean fragments connected by a linear import chain. The article describing Atlas explicitly presents this methodology as a direct foundation for ingesting multiple textbooks into a coherent Lean library while preserving buildability and provenance (Wang et al., 19 Feb 2026).
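Dependency-ordered scheduling with cycle breaking can be sketched with Python's standard `graphlib` module. The function below is a hypothetical illustration, not M2F's scheduler: when a cycle is detected, it drops one dependency edge in the cycle and records the dependency as needing a typed stub, then retries.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, Hashable, List, Set, Tuple


def schedule_with_stubs(
    deps: Dict[Hashable, Set[Hashable]],
) -> Tuple[List[Hashable], List[Hashable]]:
    """Return (order, stubbed) for a map from block id to its dependencies."""
    # Work on a copy so the caller's dependency map is not mutated.
    graph = {node: set(preds) for node, preds in deps.items()}
    stubbed: List[Hashable] = []
    while True:
        try:
            return list(TopologicalSorter(graph).static_order()), stubbed
        except CycleError as err:
            cycle = err.args[1]  # node list whose first and last entries match
            earlier, later = cycle[-2], cycle[-1]
            # Find which direction the edge runs between the two nodes.
            if earlier in graph.get(later, set()):
                dependent, dependency = later, earlier
            else:
                dependent, dependency = earlier, later
            # Assume the dependency gets a typed stub, so the edge can go.
            graph[dependent].discard(dependency)
            stubbed.append(dependency)
```

Every block still appears exactly once in the returned order; the `stubbed` list names the declarations whose real bodies would have to be filled in after their dependents compile.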

The reported scale is textbook-level. In approximately three weeks, M2F converts 479 pages into a project-scale Lean library that builds end-to-end under the pinned environment: 241 files, 4,116 declarations, and 153,853 lines of Lean. The breakdown is 312 pages of Real Analysis into 49 files, 1,195 declarations, and 34,327 lines; 140 pages of Convex Analysis into 164 files, 2,620 declarations, and 105,682 lines; and a 27-page paper corpus into 28 files, 301 declarations, and 13,844 lines. Stage 1 achieves statement compile coverage SCC = 100% on all corpora, with average repair rounds 0.42/0.08/0.20, and project buildability PB holds across corpora. Stage 2 closes 875/875 audited proof holes on the long-form corpora, for a proof success rate (PSR) of 100%, and on FATE-H it achieves 96% PSR fully automatic, compared with 80% for Seed-Prover 1.5; with 31 declaration-level hints, it reaches 97% PSR (Wang et al., 19 Feb 2026).
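The per-corpus breakdown can be checked against the reported totals, and the proof success rate follows directly from $s = |H_{\mathrm{closed}}| / |H|$. The short script below only restates figures quoted above; the variable and function names are illustrative.

```python
# Per-corpus figures as quoted in the text above.
corpora = {
    "Real Analysis": {"pages": 312, "files": 49, "declarations": 1195, "lines": 34327},
    "Convex Analysis": {"pages": 140, "files": 164, "declarations": 2620, "lines": 105682},
    "Paper corpus": {"pages": 27, "files": 28, "declarations": 301, "lines": 13844},
}

# Sum each column across the three corpora.
totals = {
    key: sum(stats[key] for stats in corpora.values())
    for key in ("pages", "files", "declarations", "lines")
}


def proof_success_rate(closed: int, total: int) -> float:
    """s = |H_closed| / |H| for the matched-statement proof holes."""
    if total <= 0:
        raise ValueError("total must be positive")
    return closed / total


print(totals)
print(proof_success_rate(875, 875))
```

The sums agree with the stated totals of 479 pages, 241 files, 4,116 declarations, and 153,853 lines.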

5. Library-scale production: multi-agent orchestration and cross-assistant realizations

The strongest claim that Atlas can function as a genuine library-building enterprise appears in "Formalizing Mathematics at Scale." There, AutoformBot treats textbook formalization as collaborative software engineering. A long-lived orchestrator builds a task DAG from the textbook, workers formalize ready targets in short-lived isolated git worktrees, reviewers inspect diffs and source context, a trace analyzer learns from failures and writes task-specific skill guides, and a supervisor runs a target-level evaluation harness after each merge. The infrastructure includes resource budgeting, pooled stateful sessions, batched merge queues with bisection, and an escalation protocol for true infrastructure failures (Rammal et al., 28 May 2026).
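The source names batched merge queues with bisection without giving details. One plausible reading, sketched below under that assumption, is that a batch of worker branches is built together, and a failing batch is split recursively to isolate the offending changes while the rest are merged. `merge_with_bisection` and `builds_ok` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple


def merge_with_bisection(
    batch: Sequence[str],
    builds_ok: Callable[[List[str]], bool],
    base: Sequence[str] = (),
) -> Tuple[List[str], List[str]]:
    """Return (merged, rejected) for a batch of change ids.

    builds_ok receives the full list of changes applied on top of main,
    in order, and reports whether the project still builds.
    """
    batch, base = list(batch), list(base)
    if not batch:
        return [], []
    if builds_ok(base + batch):
        return batch, []  # the whole batch merges with a single build
    if len(batch) == 1:
        return [], batch  # a lone failing change is rejected
    mid = len(batch) // 2
    left_ok, left_bad = merge_with_bisection(batch[:mid], builds_ok, base)
    # The right half is tested on top of whatever survived from the left.
    right_ok, right_bad = merge_with_bisection(
        batch[mid:], builds_ok, base + left_ok
    )
    return left_ok + right_ok, left_bad + right_bad
```

When most changes are sound, a clean batch costs one build instead of one build per change, and bisection is only paid for on failure.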

The artifact produced by this system is Atlas as a collection of verified Lean 4 libraries from 26 open-access textbooks spanning analysis, algebra, topology and geometry, number theory, combinatorics, probability and statistics, PDEs, and theoretical computer science. The paper reports 2,855 successfully formalized targets out of 4,007, corresponding to 71.3%, together with 483,918 lines of Lean code and over 45,000 declarations. Compute accounting is also explicit: 183,157 million tokens, with workers consuming 76.35% ± 5.71 of the compute distribution and smaller shares assigned to reviewers, the supervisor, orchestrator, full evaluation, readers, and analyzers. The evaluation harness grades matched targets by faithfulness, proof integrity, and code quality, each with threshold ≥ 3/5, and augments build checks with structural tags such as vacuous_body, returns_assumption, field_projection_body, custom_hypothesis_in_type, trivial_constructor, and orphan_class (Rammal et al., 28 May 2026).
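The grading rule can be written down compactly. The sketch below assumes integer scores on a 1 to 5 scale and treats structural tags as flags reported alongside the grade rather than as automatic failures, which the source does not specify; the class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set

THRESHOLD = 3  # each axis must score at least 3 out of 5

STRUCTURAL_TAGS = {
    "vacuous_body",
    "returns_assumption",
    "field_projection_body",
    "custom_hypothesis_in_type",
    "trivial_constructor",
    "orphan_class",
}


@dataclass
class TargetGrade:
    faithfulness: int
    proof_integrity: int
    code_quality: int
    tags: Set[str] = field(default_factory=set)


def passes(grade: TargetGrade) -> bool:
    """A target passes only if every axis meets the threshold."""
    return min(grade.faithfulness, grade.proof_integrity, grade.code_quality) >= THRESHOLD


def structural_flags(grade: TargetGrade) -> List[str]:
    """Return the recognised structural tags attached to a target, sorted."""
    return sorted(grade.tags & STRUCTURAL_TAGS)


def success_rate(formalized: int, total: int) -> float:
    """Fraction of targets successfully formalized."""
    return formalized / total
```

With the reported counts, `success_rate(2855, 4007)` is about 0.7125, which rounds to the stated 71.3%.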

A closely related Lean case study, "Automatic Textbook Formalization," demonstrates the same paradigm on a single 500+ page graduate textbook in algebraic combinatorics. The reported output is approximately 130,000 lines in 52 files and approximately 5,900 declarations, with all 340 selected target theorems and definitions proved in one week. The run used 30,046 total agent runs across roles: 85 sketchers, 8,704 provers, 6,467 maintainers, 6,797 math reviewers, 6,805 engineering reviewers, 550 triage agents, 307 scan agents, and 331 progress agents. The workflow is trunk-based, with short-lived feature branches, two independent reviews per PR, and a single merge queue that serializes integration to keep main building (Gloeckle et al., 3 Apr 2026).

Atlas-like production is not confined to Lean. "Munkres’ General Topology Autoformalized in Isabelle/HOL" reports 85,472 lines of Isabelle/HOL across four chapter files, with 199 definitions, 806 lemmas/theorems/corollaries, and zero sorrys, covering all 39 sections of Munkres’ Topology in 24 active days. Its methodology is a "sorry-first" declarative proof workflow coupled with bulk use of sledgehammer, process_theories, explicit unfolding, and aggressive proof profiling via eval_at -t. A separate Megalodon experiment reports 160k lines of formalized topology by January 4, 2026, with about 130k lines produced in two weeks for an LLM subscription cost of about $100, including more than 1.5k lemmas/theorems and long proofs of Urysohn’s lemma, Urysohn’s metrization theorem, and the Tietze extension theorem. Together these results support a plausible inference: Atlas is not tied to a single proof assistant, even though current large-scale instantiations are strongest in Lean (Bryant et al., 8 Apr 2026, Urban, 6 Jan 2026).

6. Semantic alignment, governance, and persistent limitations

A central misconception in large-scale autoformalization is that kernel verification alone certifies a textbook library. The recent literature rejects this. Lean Atlas introduces aligned Lean code as code whose propositions and definitions have undergone human semantic verification in addition to type checking. Its Lean Compass algorithm operates on a project-specific dependency graph whose edges are classified by source kind, dependency site, and target kind. Compass prunes theorem-value edges, computes the affecting set of each target, and measures the resulting per-target reduction.

The empirical reductions are substantial for proof-heavy projects: PrimeNumberTheoremAnd achieves 99.1–99.7% per-target reductions, Carleson averages 96.2%, and Brownian Motion averages 94.4%; the FLT milestone subset averages 59.8%, PhysLib 69.0%, and the definition-heavy XMSS project 27.3% (Yanahama et al., 16 Mar 2026).
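The pruning idea can be sketched as a reachability computation. Everything below is an assumption made for illustration, since the exact Lean Compass definitions are not reproduced here: edges carry a single `kind` label, the affecting set is taken to be what remains reachable from a target once `theorem_value` edges are dropped, and reduction is taken to be the fraction of the unpruned dependency set that no longer needs review.

```python
from collections import deque
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source, destination, kind)


def reachable(
    edges: Iterable[Edge], start: str, skip_kinds: FrozenSet[str] = frozenset()
) -> Set[str]:
    """Declarations reachable from start, ignoring edges of skipped kinds."""
    adjacency: Dict[str, List[str]] = {}
    for src, dst, kind in edges:
        if kind not in skip_kinds:
            adjacency.setdefault(src, []).append(dst)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # the target itself is not its own dependency
    return seen


def reduction(edges: List[Edge], target: str) -> float:
    """Assumed metric: share of dependencies removed by pruning proof edges."""
    full = reachable(edges, target)
    affecting = reachable(edges, target, frozenset({"theorem_value"}))
    return 0.0 if not full else 1.0 - len(affecting) / len(full)
```

Under these assumptions a proof-heavy target loses most of its dependency set, because lemmas used only inside proofs drop out. A definition-heavy target keeps its chains of definitions, which is consistent with the lower reductions reported for such projects.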

Lean-GAP operationalizes a complementary governance stack for textbook exercises. It releases 430 PhD-reviewed formalized problems from 1,966 exercises in Dummit and Foote, uses standardized naming keys DF_{sec}_{subsec}_{exercise_num}, and requires a two-stage human review in which a contributor prepares an informal rewrite plus Lean declaration and an independent maintainer performs semantic review. Automated filters remain triage tools only: CI elaboration, suspicious-pattern filters C1–C11, and an LLM judge scoring five axes S1 through S5 for objects, hypotheses, conclusion, structure, and specificity. The compiler-grounded agent loop reaches 95.5% any-pass elaboration on all 1,966 exercises, but semantic evaluation remains materially weaker, with mean LLM-judge scores of 3.56 for the Codex loop and 3.45 for GPT-5, and recurrent problems including unknownIdentifier, synthInstanceFailed, vacuous existentials, trivial True conclusions, and missing hypotheses such as the omitted condition 2 ≤ n in Exercise 2.3.21 (Lee et al., 20 May 2026).

The limitations reported across Atlas-related work are consistent. M2F notes that complex, highly parameterized proofs may require better global planning than local diagnostic repair, and that typed stubs can defer semantic misalignments into Stage 2. ATLAS remains bounded by Mathlib’s implemented definitions and by teacher-model priors in NL generation and semantic alignment. AutoformBot identifies missing foundations in Lie theory and sophisticated geometry as a major source of partial coverage and diminishing returns. Lean Atlas underperforms in definition-heavy domains because value-level semantics reside in chains of definitions that cannot be pruned aggressively. Lean-GAP emphasizes library drift and coverage gaps, especially in advanced algebra. The common conclusion is not that Atlas is infeasible, but that buildability, proof closure, and semantic faithfulness are distinct objectives that require different mechanisms and different acceptance criteria (Wang et al., 19 Feb 2026, Liu et al., 8 Feb 2025, Rammal et al., 28 May 2026, Yanahama et al., 16 Mar 2026, Lee et al., 20 May 2026).

Taken together, the current Atlas literature describes an emerging stack for large-scale mathematical formalization. Data-centric systems enlarge the supply of verified NL–FL statements; verifier-in-the-loop systems make textbook-length sources compile and then repair proofs under fixed signatures; multi-agent orchestrators turn shared repositories into high-throughput formalization environments; and semantic-review tools constrain the gap between formal correctness and intended mathematics. The mature form of Atlas, as these papers collectively imply, is a continuously growing, dependency-governed, semantically audited library of textbook mathematics rather than merely a collection of elaborating files.