LeanMarathon: Multi-Agent Autoformalization Harness

Updated 7 June 2026

The paper presents LeanMarathon, a multi-agent framework that maintains target fidelity through a dynamically evolving Lean blueprint.
It details a contract-scoped orchestration protocol that minimizes failure modes such as statement drift and tangled dependencies in extensive proof graphs.
Experimental evaluations on Erdős problems demonstrate complete formalization of all seven target theorems with zero residual sorry, whereas the closed-source Aristotle baseline left residual sorry on the same inputs.

LeanMarathon is a multi-agent harness for research-level autoformalization in the Lean theorem prover, designed to improve reliability and target fidelity during long-horizon formal mathematical developments. It addresses core failure modes encountered when scaling large autoformalization tasks, using a contract-scoped, orchestrated agent protocol around a single evolving Lean blueprint file. LeanMarathon demonstrates fully autonomous, research-scale formalization of modern mathematical results, with a robust parallelization and fault isolation methodology (Zhang et al., 3 Jun 2026).

1. Motivation and Long-Horizon Autoformalization Challenges

In research mathematics, formalization efforts span extensive interdependent proof graphs containing definitions, theorems, and supporting lemmas. Existing LLM-based prover agents, while successful at isolated or short-horizon Lean 4 tasks, exhibit brittleness in long, multi-hour runs—especially as the dependency graph deepens and proofs are repaired or extended. Four characteristic failure modes have been identified:

Statement drift: An intermediate lemma may become misformalized, remaining syntactically provable but diverging from the intended mathematical content, causing silent derailments downstream.
Tangled dependencies: Local changes can invalidate distant proofs; without global coordination, repairs are error-prone and often incomplete.
Context decay: Critical context is lost as the development grows, causing incorrect definitions or missing lemmas to be buried and hard to trace or fix.
Irreversible errors: Erroneous repairs can invalidate substantial downstream work, and monolithic agents lack mechanisms for rollback or for isolating mistakes.

These limitations make the task analogous to software engineering in large codebases, necessitating rigorous protocols to preserve "target fidelity" (every Lean statement must faithfully represent the paper’s claim), localize faults, and enable continuous progress (Zhang et al., 3 Jun 2026).

2. Core Abstraction: The Evolving Blueprint

At the center of LeanMarathon is a dynamically evolving "blueprint"—a single Lean file annotated with structured metadata:

@[blueprint "lem:weighted-tail-bound"
  (statement := /-- LaTeX statement text -/)
  (proof     := /-- LaTeX proof sketch with \cref citations -/)
  (title     := /-- one-line title -/)
  (latexEnv  := "lemma")]
lemma weighted_tail_bound (…parameters…) : …goal type… :=
by
  sorry_using [aux_lemma_one, aux_lemma_two]

Each declaration is annotated to include the LaTeX statement, a natural-language proof sketch with explicit dependency citations, a short title, and an environment type. The blueprint simultaneously serves three functions:

Formal proof skeleton: The Lean code type-checks at all times, even in partial completion.
Natural-language proof graph: Each \cref{…} in the proof metadata registers a graph edge in the dependency DAG.
Shared system of record: The blueprint is the single authoritative file for all agent transactions, under CI enforcement.

Nodes start as stubs (using by sorry or by sorry_using [...]) and are incrementally refined. The evolving proof graph is never frozen; nodes can be inserted, split, or repaired in place, with the eventual goal of a type-checking Lean file that contains no residual sorry (Zhang et al., 3 Jun 2026).
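As an illustration of how \cref citations in the proof metadata could be turned into graph edges, here is a minimal Python sketch. The regular expression, the function names, and the labels lem:aux-one and lem:aux-two are assumptions made for this example; the source does not describe the harness's actual parser.

```python
import re

# Illustrative assumption: citations appear as \cref{label} or \cref{a,b}.
CREF_PATTERN = re.compile(r"\\cref\{([^}]*)\}")


def cited_labels(proof_sketch: str) -> list[str]:
    """Return the labels cited in a proof sketch, in order, without duplicates."""
    labels: list[str] = []
    for group in CREF_PATTERN.findall(proof_sketch):
        # A single citation command may list several comma-separated labels.
        for label in group.split(","):
            label = label.strip()
            if label and label not in labels:
                labels.append(label)
    return labels


def dependency_edges(proof_sketches: dict[str, str]) -> set[tuple[str, str]]:
    """Map {node label: proof sketch} to a set of (node, dependency) edges."""
    return {
        (node, dep)
        for node, sketch in proof_sketches.items()
        for dep in cited_labels(sketch)
    }


# Hypothetical blueprint metadata, keyed by node label.
sketches = {
    "lem:weighted-tail-bound": r"Combine \cref{lem:aux-one} with \cref{lem:aux-two}.",
    "lem:aux-one": "Direct computation.",
    "lem:aux-two": r"Apply \cref{lem:aux-one} twice.",
}
edges = dependency_edges(sketches)
```

Here edges contains three pairs: the main lemma points at both auxiliary lemmas, and lem:aux-two points at lem:aux-one.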

3. Contract-Scoped Agents and Protocols

Work is distributed across four distinct agents, each operating in its own sandboxed Git worktree, with all communication routed through GitHub PRs and issues. A global CI verifier enforces each agent's edit contract.

Blueprinter ("construct"): Ingests paper LaTeX and canonical statements to populate the blueprint, initially with stubbed bodies and full metadata. It applies a decomposition rubric to optimize for minimal "repair radius."
Target-Reviewer ("audit"): Receives only canonical target statements and verifies, read-only, that every top-level theorem in the blueprint exactly matches its intended LaTeX form and Lean type (up to renaming/quantifier reordering). Any mismatch triggers a grouped issue before proving begins.
Worker ("prove"): For each dynamic leaf (unproved node with satisfied dependencies), replaces its sorry with a Lean proof or opens an issue if obstacles are encountered. Workers may insert local helper lemmas and edit statement and proof metadata, but cannot alter globally frozen Lean types. They also perform a misformalization audit, numeric stress-testing, and normalization of statement metadata.
Refiner ("repair"): Responds to issues by classifying each defect (blueprint drift or source gap) and repairing the minimal connected sub-DAG of the blueprint that it affects. Where full proofs cannot be provided, affected nodes are downgraded to sorry, to be picked up in later Worker rounds. All changes are atomically merged through CI.

The clear separation of edit scopes and contracts for each agent ensures that failure modes are locally contained and recoverable (Zhang et al., 3 Jun 2026).
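The idea of an edit contract can be sketched as a permission table that a CI step consults before merging. Everything below is a hypothetical model: the Edit categories and the CONTRACTS table are assumptions made for illustration. Only the Worker and Target-Reviewer rows follow the description above directly; the Blueprinter and Refiner rows are guesses.

```python
from enum import Enum, auto


class Edit(Enum):
    """Hypothetical categories of change a PR can make to the blueprint."""
    ADD_NODE = auto()
    EDIT_METADATA = auto()
    REPLACE_SORRY = auto()
    ADD_HELPER_LEMMA = auto()
    DOWNGRADE_TO_SORRY = auto()
    CHANGE_FROZEN_TYPE = auto()


# Assumed permission table, not the harness's real contract definitions.
CONTRACTS: dict[str, frozenset[Edit]] = {
    "Blueprinter": frozenset({Edit.ADD_NODE, Edit.EDIT_METADATA}),  # assumed
    "Target-Reviewer": frozenset(),  # read-only audit
    "Worker": frozenset(
        {Edit.REPLACE_SORRY, Edit.ADD_HELPER_LEMMA, Edit.EDIT_METADATA}
    ),
    "Refiner": frozenset(  # assumed
        {Edit.ADD_NODE, Edit.EDIT_METADATA, Edit.REPLACE_SORRY,
         Edit.ADD_HELPER_LEMMA, Edit.DOWNGRADE_TO_SORRY}
    ),
}


def violations(agent: str, edits: list[Edit]) -> list[Edit]:
    """Return the edits in a PR that fall outside the agent's contract."""
    allowed = CONTRACTS[agent]
    return [edit for edit in edits if edit not in allowed]


# A Worker PR that also tries to change a frozen Lean type is rejected.
worker_pr = [Edit.REPLACE_SORRY, Edit.ADD_HELPER_LEMMA, Edit.CHANGE_FROZEN_TYPE]
rejected = violations("Worker", worker_pr)
```

In this model a PR is mergeable only when violations returns an empty list, so a faulty agent cannot damage parts of the blueprint outside its scope.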

4. Orchestration, Continuous Integration, and Parallelization

A two-stage orchestrator coordinates the agents. Stage 1 ("Cold Start and Target Review") iterates between the Blueprinter, Target-Reviewer, and Refiner until all top-level statements are verified. Stage 2 ("Parallel DAG Discharge") extracts the current proof graph, launches Workers in parallel on all dynamic leaves, merges each PR that passes the seven CI checks, and then invokes the Refiner on accumulated issues. This cycle repeats until all nodes are proved.
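The Stage 2 loop can be sketched in Python under a strong simplifying assumption: every Worker succeeds, so no issues are opened and the Refiner is never needed. The function names and the example graph are hypothetical.

```python
def dynamic_leaves(deps: dict[str, set[str]], proved: set[str]) -> set[str]:
    """Unproved nodes whose dependencies are all already proved."""
    return {
        node
        for node, node_deps in deps.items()
        if node not in proved and node_deps <= proved
    }


def discharge_rounds(deps: dict[str, set[str]]) -> list[list[str]]:
    """Simulate parallel discharge; each inner list is one round of Workers."""
    proved: set[str] = set()
    rounds: list[list[str]] = []
    while len(proved) < len(deps):
        leaves = dynamic_leaves(deps, proved)
        if not leaves:
            # No progress is possible: a cycle or a dependency on an unknown node.
            raise ValueError("proof graph is not a well-formed DAG")
        rounds.append(sorted(leaves))
        proved |= leaves  # idealization: every Worker PR passes CI and merges
    return rounds


# Hypothetical proof graph: node -> set of nodes it depends on.
graph = {
    "thm:main": {"lem:b", "lem:c"},
    "lem:a": set(),
    "lem:b": {"lem:a"},
    "lem:c": set(),
}
rounds = discharge_rounds(graph)
# rounds == [["lem:a", "lem:c"], ["lem:b"], ["thm:main"]]
```

In a real run, a failed Worker would leave its node unproved and open an issue, and the graph itself could change between rounds as the Refiner inserts or splits nodes.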

The central CI system enforces the following checks:

Type-checking: No Lean 4 errors or unresolved sorry (outside Stage 1).
Nonempty blueprint fields
Lean/LaTeX environment agreement
Label normalization
Label uniqueness
Dependency-parity: Every prose citation matches elaborator dependencies (and vice versa).
Lemma closeness: Non-terminal nodes must lead, transitively, to a target.
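Two of the graph-level checks above, dependency-parity and lemma closeness, can be sketched as follows. This is an illustrative model rather than the harness's CI code: it assumes the prose citations and the elaborator dependencies have already been extracted into plain dictionaries, and the function names are hypothetical.

```python
def parity_violations(
    cited: dict[str, set[str]], elaborated: dict[str, set[str]]
) -> dict[str, dict[str, set[str]]]:
    """Report nodes whose prose citations differ from their Lean dependencies."""
    report: dict[str, dict[str, set[str]]] = {}
    for node in cited.keys() | elaborated.keys():
        prose = cited.get(node, set())
        lean = elaborated.get(node, set())
        if prose != lean:
            report[node] = {"cited_only": prose - lean, "lean_only": lean - prose}
    return report


def orphan_nodes(deps: dict[str, set[str]], targets: set[str]) -> set[str]:
    """Nodes that no target depends on, directly or transitively."""
    reachable: set[str] = set()
    stack = list(targets)
    while stack:
        node = stack.pop()
        if node in reachable:
            continue
        reachable.add(node)
        stack.extend(deps.get(node, ()))
    return set(deps) - reachable


# Hypothetical data: the prose cites lem:b, but the Lean proof never uses it.
cited = {
    "thm:main": {"lem:a", "lem:b"},
    "lem:a": set(),
    "lem:b": {"lem:a"},
    "lem:unused": set(),
}
elaborated = {
    "thm:main": {"lem:a"},
    "lem:a": set(),
    "lem:b": {"lem:a"},
    "lem:unused": set(),
}
parity_report = parity_violations(cited, elaborated)
orphans = orphan_nodes(cited, {"thm:main"})
```

Here parity_report flags thm:main with lem:b as cited only in prose, and orphans is the single node lem:unused, which would fail the lemma closeness check.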

Thanks to carefully managed editable regions and frozen spans, parallel PRs from Workers are conflict-free and merge order is immaterial (Zhang et al., 3 Jun 2026).
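Why disjoint editable regions make merge order immaterial can be shown with a toy model. The sketch assumes each PR rewrites only the bodies of the nodes it owns, with the blueprint modeled as a dictionary from node label to body text; the names and bodies are hypothetical.

```python
from itertools import permutations


def regions_disjoint(prs: list[dict[str, str]]) -> bool:
    """True when no two PRs edit the same node."""
    seen: set[str] = set()
    for pr in prs:
        if seen.intersection(pr):
            return False
        seen.update(pr)
    return True


def merge_all(blueprint: dict[str, str], prs) -> dict[str, str]:
    """Apply PRs in the given order; each PR overwrites the bodies it names."""
    merged = dict(blueprint)
    for pr in prs:
        merged.update(pr)
    return merged


# Hypothetical blueprint with three stubbed nodes and three Worker PRs.
blueprint = {"lem:a": "sorry", "lem:b": "sorry", "lem:c": "sorry"}
prs = [
    {"lem:a": "proof of a"},
    {"lem:b": "proof of b"},
    {"lem:c": "proof of c"},
]

# With disjoint regions, every merge order yields the same final blueprint.
results = [merge_all(blueprint, order) for order in permutations(prs)]
order_independent = all(result == results[0] for result in results)
```

If two PRs touched the same node, the last merge would win and the result would depend on order, which is the situation that frozen spans are meant to rule out.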

5. Evaluation: Benchmarks, Metrics, and Comparative Results

LeanMarathon was evaluated on two 2026 papers from Tao’s group, spanning four Erdős problems:

Problem #1051: Erdős–Graham irrationality and generalizations
Problem #1196: Erdős–Sárközy–Szemerédi primitive-set bound
Problem #164: Erdős’s primitive-set conjecture
Problem #1217: Infinite divisibility chains

Three fully autonomous runs were conducted with GPT-5.5-xhigh agents:

Run	Targets	Proved (lemmas + theorems)	Lean lines	Tokens ($)	Critical-path time
ErdosGraham	4	111	8,513	257	11 h 38 m
Erdos1196	1	44	3,988	189	11 h 32 m
Prim	2	103	14,592	624	40 h 43 m

All seven target theorems were formalized with zero residual sorry in the final blueprint.
No merge conflicts occurred in 135 Worker PRs.
Baseline comparison: The closed-source Aristotle agent produced incomplete results with residual sorry on the same inputs, failing to prove the most complex results (Zhang et al., 3 Jun 2026).
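The aggregate figures implied by the three runs can be checked with a short calculation. The numbers are copied from the results reported above; reading the token column as a cost in US dollars is an assumption, as are the dictionary key names.

```python
# Per-run figures as reported; "cost_usd" assumes the token column is a dollar cost.
runs = {
    "ErdosGraham": {"targets": 4, "proved": 111, "lean_lines": 8513, "cost_usd": 257},
    "Erdos1196": {"targets": 1, "proved": 44, "lean_lines": 3988, "cost_usd": 189},
    "Prim": {"targets": 2, "proved": 103, "lean_lines": 14592, "cost_usd": 624},
}

# Column totals across the three runs.
totals = {
    key: sum(run[key] for run in runs.values())
    for key in ("targets", "proved", "lean_lines", "cost_usd")
}

# Average number of Lean lines per proved node, per run.
lines_per_node = {
    name: round(run["lean_lines"] / run["proved"], 1) for name, run in runs.items()
}

print(totals)
# {'targets': 7, 'proved': 258, 'lean_lines': 27093, 'cost_usd': 1070}
```

The target total of 7 matches the seven target theorems stated above. The per-node averages suggest that the Prim run produced noticeably longer proofs per node than the other two.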

6. Insights, Limitations, and Directions for Future Work

The results highlight that the core bottleneck in research-level autoformalization is not only proof strength but also agent durability: the ability to maintain target fidelity and localize faults over developments that run for many hours or days. The contract-scoped, passive CI-gated approach turns fragile monolithic runs into a large ensemble of recoverable, isolated transactions.

Machine-checked Lean semantics surfaced substantive gaps and ambiguities in the source proofs, including a missing summability hypothesis, real-to-ENNReal conventions, and a dropped Mertens estimate. The formalization therefore has error-finding value beyond syntactic translation.

Limitations include Lean library coverage: if essential background mathematics is missing from Mathlib (for example, an entire branch of the subject), the Blueprinter may be forced to "fake" definitions and the run may stall.

Anticipated future research includes broader domain coverage via formalized background math, tighter integration with generative prover LLMs, and human-in-the-loop interfaces for refining ambiguous or incomplete proofs.

Reproducibility is ensured by fully open codebases for each run, with captured blueprints, agent contracts, and orchestration scripts; setup involves cloning the repository, installing Lean 4 and Mathlib, and launching the orchestrator (Zhang et al., 3 Jun 2026).

LeanMarathon establishes new methodologies for durable, large-scale autoformalization, interfacing formal mathematics, LLM-driven proof search, and robust software engineering protocols. Its multi-agent, blueprint-driven harness supports fault-tolerant, reliable mathematical pipeline construction at research scale.

References (1)

Zhang et al. (2026). LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization.