LeanMarathon: Scalable Autoformalization

Updated 7 June 2026

LeanMarathon is a multi-agent framework that autoformalizes research-level mathematics by maintaining an evolving blueprint of proofs and annotations.
It employs four specialized agents—blueprinter, target-reviewer, worker, and refiner—to manage dependencies, audit fidelity, and repair errors in a coordinated CI environment.
Experiments on benchmarks drawn from Erdős problems show hundreds of lemmas and theorems proved autonomously, with no residual sorry in the final files.

LeanMarathon is a multi-agent framework designed to achieve reliable, long-horizon autoformalization of research-level mathematics in Lean. Unlike previous LLM-based agents, which succeed only at the scale of isolated proof goals, LeanMarathon structures and coordinates formalization at the scale of entire research papers, addressing brittleness, error propagation, and context management through a combination of architectural abstractions and procedural safeguards. It is centered on a single evolving Lean file, referred to as the “blueprint”, and employs four specialized contract-scoped agents, all orchestrated via a two-stage protocol enforced in continuous integration (CI). LeanMarathon has demonstrated fully autonomous formalization of research mathematics spanning hundreds of definitions, lemmas, and theorems, most notably on benchmarks drawn from Erdős problems (Zhang et al., 3 Jun 2026).

1. Motivation and Challenges in Long-Horizon Lean Autoformalization

In research mathematics, a typical paper introduces numerous new definitions and a dense directed acyclic proof graph comprising dozens or hundreds of interdependent statements. Existing LLM-based agents can successfully discharge single Lean 4 proof goals but fail to construct or maintain large-scale formalizations. The LeanMarathon framework targets four characteristic failure modes observed in long-running, monolithic attempts:

Statement drift: Intermediate lemmas can deviate from their intended meaning while remaining Lean-provable, silently derailing subsequent results.
Tangled dependencies: Local modifications, such as hypothesis changes, can invalidate remote proofs, and without global awareness, repairs become difficult to isolate.
Context decay: As the proof graph evolves, obsolete or incorrect decisions may be buried, reducing agent effectiveness in maintaining global correctness.
Irreversible errors: Errant repairs can propagate errors across many nodes, with no robust rollback or containment, often requiring extensive manual correction.

These challenges shift the formalization task from isolated goal-proving to robust management of a large, interdependent mathematical software artifact, necessitating new abstractions for target fidelity, error containment, and incremental progress.

2. Core Framework: The Evolving Blueprint as System of Record

The central abstraction of LeanMarathon is the evolving blueprint: a single Lean file that is simultaneously (1) a formal proof skeleton, (2) a natural-language proof graph, and (3) the shared, authoritative system of record for all agents. Every definition, lemma, and theorem is annotated with a @[blueprint ...] attribute, which records:

LaTeX statement text (statement)
Natural-language proof sketch with directed references (proof)
One-line LaTeX title (title)
Statement environment (latexEnv, e.g., "lemma" or "theorem")

Example of a blueprint annotation:

@[blueprint "lem:weighted-tail-bound"
  (statement := /-- LaTeX statement text -/)
  (proof     := /-- LaTeX proof sketch with \cref citations -/)
  (title     := /-- one-line title -/)
  (latexEnv  := "lemma")]
lemma weighted_tail_bound (…parameters…) : …goal type… := by
  sorry_using [aux_lemma_one, aux_lemma_two]

Throughout the formalization, the blueprint file is never frozen: nodes may remain incomplete (by sorry), be refined, repaired, or expanded locally. At completion, all nodes are fully proved, and the file type-checks without any residual sorry.
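
To make the annotation fields concrete, here is a minimal Python sketch of one blueprint node as a plain record, together with a label-normalization rule inferred from the example above (the label lem:weighted-tail-bound paired with the Lean name weighted_tail_bound). The class BlueprintNode, the functions normalize_label and metadata_complete, and the rule itself (drop the prefix before the colon, turn hyphens into underscores) are assumptions made for illustration and are not taken from LeanMarathon's code.

```python
from dataclasses import dataclass, field


@dataclass
class BlueprintNode:
    """Hypothetical in-memory view of one @[blueprint ...] declaration."""
    label: str        # e.g. "lem:weighted-tail-bound"
    lean_name: str    # e.g. "weighted_tail_bound"
    statement: str    # LaTeX statement text
    proof: str        # natural-language proof sketch
    title: str        # one-line title
    latex_env: str    # "lemma", "theorem", ...
    uses: list[str] = field(default_factory=list)  # names listed in sorry_using
    proved: bool = False  # False while the body is still a sorry placeholder


def normalize_label(label: str) -> str:
    """Assumed rule: drop the 'lem:'-style prefix and map '-' to '_'."""
    _, _, stem = label.rpartition(":")
    return stem.replace("-", "_")


def metadata_complete(node: BlueprintNode) -> bool:
    """True when statement, proof, and title are all non-empty."""
    return all(s.strip() for s in (node.statement, node.proof, node.title))


node = BlueprintNode(
    label="lem:weighted-tail-bound",
    lean_name="weighted_tail_bound",
    statement="LaTeX statement text",
    proof="LaTeX proof sketch",
    title="one-line title",
    latex_env="lemma",
    uses=["aux_lemma_one", "aux_lemma_two"],
)
print(normalize_label(node.label) == node.lean_name)  # True
print(metadata_complete(node))                        # True
```

Keeping proved as an explicit flag mirrors the point that a node can sit in the file with complete metadata long before its proof is filled in.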

3. Agent Architecture: Four Contract-Scoped Roles

LeanMarathon segregates duties across four contract-scoped agents, each operating in an isolated Git worktree and communicating strictly via pull requests (PRs) or issues, subject to enforcement by a central CI verifier:

Agent            Primary Role            Editable Scope
Blueprinter      Construct blueprint     Entire file
Target-Reviewer  Audit target-fidelity   Read-only
Worker           Prove node(s)           Assigned node, local region
Refiner          Repair/Refactor         Arbitrary sub-DAG
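
One way to read the Editable Scope column is as a predicate on the set of nodes a PR touches. The Python sketch below is a hypothetical rendering of that idea: the names Role and allowed_to_edit, and the representation of a PR as a set of touched node names, are assumptions, and the real CI enforcement may work differently.

```python
from enum import Enum


class Role(Enum):
    BLUEPRINTER = "blueprinter"
    TARGET_REVIEWER = "target-reviewer"
    WORKER = "worker"
    REFINER = "refiner"


def allowed_to_edit(role: Role, touched: set[str], scope: set[str]) -> bool:
    """Return True if a PR touching `touched` stays within the role's scope.

    `scope` is the assigned node plus its local region for a Worker, or the
    sub-DAG named in the issue for a Refiner; it is ignored for other roles.
    """
    if role is Role.BLUEPRINTER:
        return True              # entire file is editable
    if role is Role.TARGET_REVIEWER:
        return not touched       # read-only: any edit is out of scope
    return touched <= scope      # Worker / Refiner: must stay inside scope


# A Worker assigned lem_a may add a local helper but not touch an upstream node.
print(allowed_to_edit(Role.WORKER, {"lem_a", "lem_a_helper"}, {"lem_a", "lem_a_helper"}))  # True
print(allowed_to_edit(Role.WORKER, {"lem_a", "lem_upstream"}, {"lem_a", "lem_a_helper"}))  # False
```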

A. Blueprinter (construct agent): Ingests paper.tex and canonical target statements, generating an initial blueprint with all declarations, each annotated and stubbed with the placeholder proof by sorry or by sorry_using. Edit scope is global, but only decomposition is enforced at this stage.

B. Target-Reviewer (audit agent): Audits only the canonical target theorems for strict agreement between Lean, LaTeX, and original statement, raising grouped issues if mismatches are detected. Does not edit the file.

C. Worker (prove agent): For each dynamic leaf (a node whose dependencies are all proved), the Worker replaces placeholders with full Lean proofs, can refine local metadata, and insert immediate helper lemmas, but cannot alter upstream types. A misformalization triggers an issue rather than a silent failure.

D. Refiner (repair agent): Repairs global blueprint defects cited by open issues, operating over minimal affected sub-DAGs. When a repair affects proved nodes, their proofs are forcibly downgraded to sorry, ensuring well-scoped recovery in subsequent rounds.
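
The Refiner's downgrade step can be sketched as a small graph computation. The Python below is an illustration under a stated assumption, not the framework's actual rule: it treats a proved node as affected when its own statement changed or when it directly uses a node whose statement changed, since a Lean proof depends only on the statements of the declarations it cites. The function name nodes_to_downgrade and the dictionary encoding of the DAG are hypothetical.

```python
def nodes_to_downgrade(
    deps: dict[str, set[str]],  # node -> nodes it uses directly
    proved: set[str],           # nodes that currently have full proofs
    changed: set[str],          # nodes whose statements the repair edits
) -> set[str]:
    """Proved nodes whose proofs should be reset to sorry (assumed rule)."""
    affected = set(changed)
    for node, uses in deps.items():
        if uses & changed:      # cites a statement that is about to change
            affected.add(node)
    return affected & proved


deps = {
    "lem_a": set(),
    "lem_b": {"lem_a"},
    "lem_c": {"lem_a"},
    "lem_d": {"lem_b"},
    "target": {"lem_c", "lem_d"},
}
proved = {"lem_a", "lem_b", "lem_c", "lem_d"}

# Repairing the statement of lem_a resets lem_a and its direct users.
print(sorted(nodes_to_downgrade(deps, proved, {"lem_a"})))
# ['lem_a', 'lem_b', 'lem_c']
```

In this example lem_d keeps its proof because the statement of lem_b is unchanged; it would be reset in a later round only if re-proving lem_b forced a change to that statement.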

4. Orchestration: Two-Stage Process and Fault Isolation

The orchestration sequence is implemented via a driver script and CI logic:

Stage 1 — Cold Start and Target Review: Blueprinter submits the full blueprint PR to main, and the Target-Reviewer iterates until all top-level targets are validated, with the Refiner addressing mismatches as needed (the “Ralph-Wiggum loop”).
Stage 2 — Parallel DAG Discharge: In each round, the orchestrator (1) extracts the proof DAG, (2) identifies all dynamic leaves, (3) launches one Worker per leaf in parallel (each in its own sandbox), (4) merges only those PRs passing seven CI checks (see below), and (5) calls the Refiner on all new issues, merging exactly one repair PR per round.

This arrangement ensures that concurrent PRs operate on disjoint, frozen regions, which prevents merge conflicts and makes the result independent of merge order.
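
The leaf-selection step of Stage 2 can be sketched in a few lines of Python. This is a simplified illustration: the names dynamic_leaves and run_rounds and the dictionary encoding of the proof DAG are hypothetical, and the simulation assumes every Worker PR passes CI and that no issues are raised, so the Refiner never runs.

```python
def dynamic_leaves(deps: dict[str, set[str]], proved: set[str]) -> list[str]:
    """Unproved nodes whose dependencies are all proved."""
    return sorted(
        node for node, uses in deps.items()
        if node not in proved and uses <= proved
    )


def run_rounds(deps: dict[str, set[str]]) -> list[list[str]]:
    """Simulate Stage 2, assuming every Worker PR passes CI and merges."""
    proved: set[str] = set()
    rounds: list[list[str]] = []
    while len(proved) < len(deps):
        leaves = dynamic_leaves(deps, proved)
        if not leaves:
            raise ValueError("no dynamic leaf: dependency cycle or missing node")
        rounds.append(leaves)   # one Worker per leaf, run in parallel
        proved.update(leaves)   # merge all passing PRs for this round
    return rounds


deps = {
    "lem_a": set(),
    "lem_c": set(),
    "lem_b": {"lem_a"},
    "target": {"lem_b", "lem_c"},
}
print(run_rounds(deps))
# [['lem_a', 'lem_c'], ['lem_b'], ['target']]
```

Because each round only hands out nodes whose dependencies are already proved, the Workers in a round never need each other's output, which is what allows them to run in separate sandboxes.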

CI Verifier: Seven Structural Checks

Lean 4 compiles with no errors or residual sorry.
All @[blueprint ...] attributes have filled statement, proof, and title fields.
Lean keywords (e.g., lemma, theorem) match latexEnv.
Labels normalize to Lean names.
Labels are unique.
Every \cref{...} in the proof sketch matches actual dependency edges, and vice versa.
Every non-terminal lemma must feed forward to a target theorem (no orphan lemmas).
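
Three of these checks (unique labels, agreement between \cref citations and dependency edges, and the absence of orphan lemmas) are purely graph-theoretic and can be sketched in Python. The sketch below is an assumption-laden illustration rather than the project's verifier: the function names, the dictionary encoding of sketches and dependencies, and the choice to key everything by label are all hypothetical.

```python
import re
from collections import Counter

CREF = re.compile(r"\\cref\{([^}]*)\}")


def cited_labels(proof_sketch: str) -> set[str]:
    """Labels cited via \\cref{...}, including comma-separated lists."""
    labels: set[str] = set()
    for group in CREF.findall(proof_sketch):
        labels.update(part.strip() for part in group.split(","))
    return labels


def duplicate_labels(labels: list[str]) -> list[str]:
    """Labels that occur more than once."""
    return sorted(l for l, count in Counter(labels).items() if count > 1)


def cref_mismatches(sketches: dict[str, str],
                    deps: dict[str, set[str]]) -> dict[str, set[str]]:
    """Per node: labels cited but not used, or used but not cited."""
    out: dict[str, set[str]] = {}
    for node, sketch in sketches.items():
        diff = cited_labels(sketch) ^ deps.get(node, set())
        if diff:
            out[node] = diff
    return out


def orphans(deps: dict[str, set[str]], targets: set[str]) -> list[str]:
    """Nodes that no target theorem depends on, directly or transitively."""
    reachable: set[str] = set()
    stack = list(targets)
    while stack:
        node = stack.pop()
        if node not in reachable:
            reachable.add(node)
            stack.extend(deps.get(node, ()))
    return sorted(set(deps) - reachable)


deps = {"thm:main": {"lem:a", "lem:b"}, "lem:a": set(),
        "lem:b": set(), "lem:stray": set()}
sketches = {
    "thm:main": r"Combine \cref{lem:a} with \cref{lem:b}.",
    "lem:a": "Direct computation.",
    "lem:b": r"Follows from \cref{lem:a}.",  # cites an edge that is not in deps
    "lem:stray": "Direct computation.",
}
print(duplicate_labels(["lem:a", "lem:a", "thm:main"]))  # ['lem:a']
print(cref_mismatches(sketches, deps))                   # {'lem:b': {'lem:a'}}
print(orphans(deps, {"thm:main"}))                       # ['lem:stray']
```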

5. Experimental Evaluation and Quantitative Results

LeanMarathon was evaluated using two 2026 research papers from Terence Tao’s group, spanning four Erdős problems (#1051, #1196, #164, #1217):

Benchmarks: Formalization of original and generalized Erdős–Graham irrationality results (#1051), Erdős–Sárközy–Szemerédi primitive-set bound (#1196), primitive-set conjecture (#164), and infinite divisibility chains (#1217).
Run configuration: Autonomous operation with all agents as GPT-5.5-xhigh, with no human intervention.
Key results:
- All seven target theorems were fully formalized with zero residual sorry.
- Aggregate output: 258 proved lemmas and theorems across runs.
- Blueprint size: 8,513 lines for #1051, 3,988 for #1196, 14,592 for #164 + #1217.
- No merge conflicts among 135 Worker PRs; all PRs merged cleanly due to strict region isolation.
- Cost: $257 (#1051), $189 (#1196), $624 (#164 + #1217) in GPT tokens.
- Critical-path completion: 11 h 38 m (#1051), 11 h 32 m (#1196), 40 h 43 m (#164 + #1217).
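
For a rough sense of scale, the reported per-run figures can be aggregated with a few lines of Python. The totals and per-statement averages below are derived here from the numbers listed above and are not reported in the source; the summed time only corresponds to wall-clock time if the three runs are executed back to back.

```python
# Per-run figures as listed in the key results.
runs = {
    "#1051":        {"lines": 8513,  "cost_usd": 257, "minutes": 11 * 60 + 38},
    "#1196":        {"lines": 3988,  "cost_usd": 189, "minutes": 11 * 60 + 32},
    "#164 + #1217": {"lines": 14592, "cost_usd": 624, "minutes": 40 * 60 + 43},
}
PROVED = 258  # aggregate proved lemmas and theorems across runs

total_lines = sum(r["lines"] for r in runs.values())
total_cost = sum(r["cost_usd"] for r in runs.values())
hours, minutes = divmod(sum(r["minutes"] for r in runs.values()), 60)

print(total_lines)                       # 27093
print(total_cost)                        # 1070
print(f"{hours} h {minutes} m")          # 63 h 53 m
print(f"{total_cost / PROVED:.2f}")      # 4.15 (USD per proved statement)
print(f"{total_lines / PROVED:.1f}")     # 105.0 (blueprint lines per proved statement)
```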

Baseline comparison: The Aristotle agent (closed-source IMO-level LLM) failed to discharge the two deepest results in #1051 and stalled on #1196, exhibiting inferior coverage and reliability.

6. Insights, Limitations, and Reproducibility

Insights

LeanMarathon identifies agent durability—the preservation of target fidelity over long, incremental developments—as the central bottleneck in autoformalization, rather than raw stepwise proof power. The combination of contract-scoped agents and a persistent CI gate transforms a brittle multi-day operation into a series of short, isolated, recoverable transactions. The framework also surfaced nontrivial errors or gaps in the source proofs, including missing hypotheses and analytic estimates, by enforcing Lean's semantic totality and type correctness.

Limitations

The framework's efficacy is fundamentally contingent on the maturity of background mathematical libraries: if prerequisites are missing in Mathlib, the Blueprinter may be unable to generate honest initial definitions, causing the harness to stall or create placeholder artifacts. Broader coverage thus requires continual expansion of Mathlib.

Future Directions

Anticipated extensions include integration of generative LLM-based proof agents for subtask dispatch, development of human-in-the-loop interfaces for ambiguous cases, and incremental growth of the background math library for improved domain generality.

Reproducibility

All code and orchestration scripts are publicly available.

Each repository includes the blueprint file, agent configurations, CI pipelines, knowledge stores, and phase specifications. Reproducing experiments requires only Lean 4, Mathlib, and execution of the orchestrator script; all agent prompts and logs are preserved for verification.

7. Significance and Outlook

LeanMarathon demonstrates that robust, research-scale autoformalization is achievable via explicit architectural separation of construction, audit, proof, and repair, mediated by a single system of record and transactionally gated continuous integration. This approach extends the practical horizon of AI co-mathematicians, enabling fully autonomous formalization of dense mathematical research at scales far beyond prior single-agent or competition-style paradigms (Zhang et al., 3 Jun 2026).

References

LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization (2026)
