Numina-Lean-Agent: Automated Formal Reasoning
- Numina-Lean-Agent is a flexible, model-agnostic formal reasoning system that integrates an LLM coding agent with a lightweight MCP framework for formal mathematics.
- Its MCP layer coordinates interactions with the Lean proof assistant, semantic search tools, and informal verifiers for dynamic proof refinement.
- It solved all 12 Putnam 2025 problems and has supported advanced collaborative formalizations, including the Brascamp–Lieb theorem.
Numina-Lean-Agent is an open and general agentic reasoning system designed for formal mathematics, notable for its integration of a general-purpose LLM with a lightweight Model Context Protocol (MCP) tool orchestration framework. Diverging from monolithic, task-specific neural provers, Numina-Lean-Agent positions a model-agnostic coding agent (Claude Code) at the core of its architecture, enabling autonomous interaction with the Lean proof assistant and facilitating formalization workflows ranging from theorem proving to library extension. Numina-Lean-Agent achieved a perfect score on Putnam 2025 formalization tasks and has demonstrated collaborative capability on advanced mathematical formalizations, such as the Brascamp–Lieb theorem (Liu et al., 20 Jan 2026).
1. System Paradigm and Motivations
Numina-Lean-Agent adopts the general coding agent paradigm, positioning an off-the-shelf code-generation LLM as the principal formal mathematics reasoner. Unlike classical neural provers trained end-to-end for fixed tactic languages, this approach enables the agent to directly generate and inspect Lean code, eschewing single-step tactic prediction schemes. There is no requirement for model-specific finetuning; improvements in agent capability accrue directly from swapping in stronger base models. The system leverages a compact orchestration layer (Numina-Lean-MCP), which provides transparent access to specialized tools, including Lean interaction, retrieval indices, informal verifiers, and auxiliary reasoning partners. The principal motivations for this paradigm are:
- Flexibility: General coding agents manage a diverse set of formal reasoning tasks, including definition scripting, proof construction, code refactoring, and extending formal libraries.
- Model-agnostic enhancement: Replacing the underlying LLM (e.g., running Claude Code on Claude Opus 4.5 instead of an earlier model) yields immediate gains without additional retraining.
- MCP-based orchestration: A concise collection of tool APIs exposed through a unified protocol supports seamless extension, enabling the agent to autonomously invoke Lean, semantic search, informal checking, and collaborative tools without dedicated pipeline design (Liu et al., 20 Jan 2026).
2. Architecture and Components
At a high level, Numina-Lean-Agent consists of Claude Code as the inference engine, mediated by the Numina-Lean-MCP server, which orchestrates a set of external tools:
| Agentic Reasoner | MCP Orchestrator | Tool Suite (Examples) |
|---|---|---|
| Claude Code | Numina-Lean-MCP | Lean-LSP-MCP, LeanDex, Informal Prover, Discussion Partner |
The MCP orchestrator is responsible for tool selection at each interaction step, optimizing over the available tools given the query q and the Lean proof state s:
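Pseudocode:
```python
def select_tool(q, s):
    # Score every available tool by the LLM log-probability of invoking it
    # on query q in proof state s, then return the best-scoring tool.
    score = {}
    for t in available_tools:
        score[t] = LLM.logprob(format_invoke(t, q, s))
    return max(score, key=score.get)
```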
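The selection rule can be exercised offline with a stand-in scorer. The following is a minimal sketch, assuming a hypothetical `logprob` callable in place of the LLM and a hypothetical `format_invoke` template; neither is taken from the Numina-Lean-MCP code, and the toy scorer only mimics a preference for search tools on "find" queries.

```python
from typing import Callable, Sequence


def format_invoke(tool: str, query: str, state: str) -> str:
    """Hypothetical prompt template describing a tool call."""
    return f"invoke {tool} | query: {query} | state: {state}"


def select_tool(
    query: str,
    state: str,
    tools: Sequence[str],
    logprob: Callable[[str], float],
) -> str:
    """Return the tool whose formatted invocation scores highest."""
    if not tools:
        raise ValueError("no tools available")
    scores = {t: logprob(format_invoke(t, query, state)) for t in tools}
    return max(scores, key=scores.get)


def toy_logprob(prompt: str) -> float:
    """Stand-in scorer (assumption): prefers LeanDex exactly when the query starts with 'find'."""
    wants_search = "query: find" in prompt
    is_search = prompt.startswith("invoke LeanDex")
    return 0.0 if wants_search == is_search else -5.0


tools = ["Lean-LSP-MCP", "LeanDex"]
choice = select_tool("find a lemma about sums of squares", "goal", tools, toy_logprob)
```

Passing the scorer as an argument keeps the rule independent of any particular model, which matches the model-agnostic framing of the system.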
Core MCP Tools
- Lean-LSP-MCP: Exposes Lean functionality via Language Server Protocol, including project outline inspection, goal querying, code checking via `lean_run_code`, and simultaneous tactic attempts through `lean_multi_attempt`. This supports a rapid trial→feedback loop for agent-driven proof refinement.
- LeanDex: Serves as a universal semantic theorem-search index over mathlib and related packages. It enables natural-language or symbolic retrieval of relevant declarations, ensuring only valid theorems are cited.
- Informal Prover: Implements an iterative generator+verifier architecture inspired by Gemini IMO agents, permitting proof sketching over a bounded number of refinement rounds.
- Discussion Partner: Allows the agent to initiate lateral conversations with other LLMs (e.g., Gemini) when encountering proof bottlenecks or gaps in strategy, subsequently incorporating new ideas into ongoing formalization (Liu et al., 20 Jan 2026).
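The generator+verifier loop of the Informal Prover can be pictured roughly as follows. This is a minimal sketch under stated assumptions: `generate`, `verify`, and `max_rounds` are hypothetical names, and the stubs stand in for LLM calls; the actual prompts and round limit are not specified here.

```python
from typing import Callable, Optional, Tuple

# (accepted?, feedback) as returned by a verifier; hypothetical shape.
Verdict = Tuple[bool, str]


def informal_prove(
    problem: str,
    generate: Callable[[str, Optional[str], str], str],
    verify: Callable[[str, str], Verdict],
    max_rounds: int,
) -> Optional[str]:
    """Alternate generation and verification until a sketch is accepted.

    Returns the accepted proof sketch, or None if max_rounds is exhausted.
    """
    draft: Optional[str] = None
    feedback = ""
    for _ in range(max_rounds):
        draft = generate(problem, draft, feedback)
        accepted, feedback = verify(problem, draft)
        if accepted:
            return draft
    return None


# Stub generator and verifier (assumptions) so the loop runs offline.
def stub_generate(problem: str, previous: Optional[str], feedback: str) -> str:
    if previous is None:
        return "Induct on n."
    return previous + " Base case: n = 0."


def stub_verify(problem: str, draft: str) -> Verdict:
    ok = "Base case" in draft
    return ok, "" if ok else "missing base case"


sketch = informal_prove("sum of squares", stub_generate, stub_verify, max_rounds=3)
```

With these stubs the first draft is rejected and the second is accepted, so the loop stops after two rounds.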
3. Lean Interaction Workflow
The principal interaction loop involves the agent issuing Lean syntax steps, which are checked and diagnosed by the Lean toolchain, facilitating iterative refinement:
- Agent proposes a Lean proof step.
- MCP invokes `lean_run_code` or `lean_multi_attempt`.
- Lean compiles the snippet, returning new goals, or diagnostics upon failure.
- Agent processes diagnostics with `lean_diagnostic` and updates its proof-state, selecting the next tool or action.
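The propose, check, and refine steps above can be sketched as a small driver loop. This is an illustrative sketch only: `CheckResult`, `propose`, `check`, and `max_steps` are hypothetical names, the stubs replace the LLM and the Lean tools, and the diagnostic string is made up rather than real Lean output.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CheckResult:
    """Hypothetical result of compiling a Lean snippet."""
    ok: bool
    diagnostics: List[str]


def refine_until_checked(
    propose: Callable[[List[str]], str],
    check: Callable[[str], CheckResult],
    max_steps: int = 5,
) -> Optional[str]:
    """Propose Lean code, check it, and feed diagnostics back to the proposer."""
    diagnostics: List[str] = []
    for _ in range(max_steps):
        code = propose(diagnostics)
        result = check(code)
        if result.ok:
            return code
        diagnostics = result.diagnostics
    return None


# Stubs (assumptions): a real system would call the LLM and the Lean tools.
def stub_propose(diagnostics: List[str]) -> str:
    if any("ring failed" in d for d in diagnostics):
        return "field_simp\nring"
    return "ring"


def stub_check(code: str) -> CheckResult:
    if "field_simp" in code:
        return CheckResult(ok=True, diagnostics=[])
    # Made-up diagnostic text, not actual Lean output.
    return CheckResult(ok=False, diagnostics=["ring failed to close the goal"])


proof = refine_until_checked(stub_propose, stub_check)
```

The step budget is a design choice of this sketch; it simply prevents an unproductive loop from running forever.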
Example: Inductive proof for the sum of squares:
```lean
import Mathlib

/- The sum of the first n squares -/
theorem sum_of_squares (n : ℕ) :
    ∑ i ∈ Finset.range (n+1), (i : ℚ)^2
      = n*(n+1)*(2*n+1)/6 := by
  induction n with
  | zero => simp
  | succ n ih =>
    rw [Finset.sum_range_succ, ih]
    push_cast  -- normalize ℕ → ℚ casts such as ↑(n + 1)
    field_simp
    ring
```
If `ring` fails (e.g., due to denominators), the agent takes corrective action, such as inserting `field_simp`, informed by Lean diagnostics.
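The inductive step reduces to an identity between rational expressions, which is why clearing denominators before the final algebraic normalization helps. As a quick sanity check outside Lean, the sketch below verifies both the closed form and the step identity with exact rational arithmetic; the function names are illustrative and not part of the system.

```python
from fractions import Fraction


def closed_form(n: int) -> Fraction:
    """n(n+1)(2n+1)/6 as an exact rational."""
    return Fraction(n * (n + 1) * (2 * n + 1), 6)


def direct_sum(n: int) -> Fraction:
    """Sum of i^2 for i = 0..n, matching Finset.range (n+1)."""
    return Fraction(sum(i * i for i in range(n + 1)))


def step_holds(n: int) -> bool:
    """Inductive step: closed_form(n) + (n+1)^2 == closed_form(n+1)."""
    return closed_form(n) + (n + 1) ** 2 == closed_form(n + 1)


all_match = all(direct_sum(n) == closed_form(n) for n in range(50))
all_steps = all(step_holds(n) for n in range(50))
```

A finite check like this is not a proof; it only guards against a mis-stated formula before time is spent on formalization.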
4. Benchmark Performance and Model Ablation
On the Putnam 2025 suite (A1–A6, B1–B6), Numina-Lean-Agent using Claude Opus 4.5 solved all 12 problems, equaling the performance of the best closed-source system (AXIOMProver). Comparative results:
| System | Problems Solved (Putnam 2025) |
|---|---|
| ARISTOTLE | 10/12 |
| SEED-PROVER 1.5 | 11/12 |
| AXIOMProver | 12/12 |
| Numina-Lean-Agent | 12/12 |
Notably, Numina-Lean-Agent generated proofs significantly shorter than AXIOMProver on several problems and, despite sequential execution (i.e., no parallelism), was often faster. Model ablation studies indicate that using an earlier, smaller Claude Code model yielded 9/12 problems solved but with increased runtimes, substantiating that base-model quality translates directly to empirical performance. No model-specific finetuning was performed at any stage (Liu et al., 20 Jan 2026).
5. Collaborative Case Study: Brascamp–Lieb Theorem Formalization
Numina-Lean-Agent supported the 8,000-line Lean formalization of the Effective Brascamp–Lieb inequalities (Bénard & He 2025). The process consisted of several key steps:
Blueprint Generation
For complex theorems, the agent begins by constructing a proof "blueprint"—a DAG whose nodes correspond to formal definitions, intermediate lemmas, and dependencies. Nodes are annotated with uses {…} sets, specifying the prerequisite structure and suggesting a valid proving order. Failures in typeclass search or type mismatches automatically trigger blueprint refinements, such as splitting lemmas or adjusting assumptions.
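A proving order consistent with the dependency annotations is a topological order of the blueprint DAG. The sketch below illustrates this with Python's standard `graphlib`; the node names and the dictionary layout are hypothetical and do not reflect the agent's actual blueprint format.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def proving_order(uses: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every node comes after the nodes it uses.

    Raises graphlib.CycleError if the blueprint is not a DAG.
    """
    return list(TopologicalSorter(uses).static_order())


# Hypothetical blueprint: node -> set of prerequisite nodes.
blueprint = {
    "def_datum": set(),
    "lemma_empty_case": {"def_datum"},
    "lemma_nondegenerate_case": {"def_datum"},
    "thm_upper_bound": {"lemma_empty_case", "lemma_nondegenerate_case"},
}

order = proving_order(blueprint)
```

Splitting a lemma during refinement corresponds to adding nodes and edges and recomputing the order; a cycle introduced by mistake surfaces as a `CycleError`.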
Human–AI Collaboration
Human mathematicians may edit the blueprint, provide high-level proofs, or correct type coercions (e.g., Real → NNReal). The agent revisits the plan→formalize→refine loop with each update. In demonstrated cases, the agent autonomously detected and corrected false intermediate statements, evidencing self-correcting capabilities in extended formalization processes.
Excerpt (simplified):
```lean
theorem upperBound
    {J : Type*} [Fintype J]
    {E : Type*} [NormedAddCommGroup E] [InnerProductSpace ℝ E] [FiniteDimensional ℝ E]
    (D : locRegDatum E (λ j => F j)) … :
    M_max.toContinuousLinearMap ‖_‖_+^… := by
  -- blueprint: handle empty case, nondegenerate case, glue inequalities, etc.
  …
```
6. Implementation and Reproducibility
Numina-Lean-Agent is accessible at https://github.com/project-numina/numina-lean-agent and includes:
- A Python MCP server (FastAPI) managing tool registration, schema specification, and handler execution.
- The `lean-lsp-mcp` adapter (Dressler 2025) for Lean v4.26.0.
- A LeanDex theorem index (based on LeanExplore) covering mathlib and FLT.
- Gemini-based informal-prover functionality.
- CLI and notebook interfaces for benchmark execution and interactive use.
Reproducibility involves standard open-source installation and configuration steps. Adding new MCP tools is protocol-driven (subclass a handler and register it in mcp/tools.py), reflecting the modularity of the architecture (Liu et al., 20 Jan 2026).
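To illustrate what a protocol-driven extension point of this kind might look like, the sketch below shows a handler base class with a registry and a dispatcher. All names here (`ToolHandler`, `register`, `REGISTRY`, `dispatch`, `EchoTool`) are hypothetical and are not taken from the repository's mcp/tools.py; the real interface may differ.

```python
from typing import Any, Dict, Type


class ToolHandler:
    """Hypothetical base class: a tool has a name, a schema, and a handler."""
    name: str = ""
    schema: Dict[str, str] = {}

    def handle(self, **kwargs: Any) -> Any:
        raise NotImplementedError


REGISTRY: Dict[str, Type[ToolHandler]] = {}


def register(cls: Type[ToolHandler]) -> Type[ToolHandler]:
    """Class decorator that adds a handler to the registry under its name."""
    if not cls.name:
        raise ValueError("handler must define a non-empty name")
    if cls.name in REGISTRY:
        raise ValueError(f"duplicate tool name: {cls.name}")
    REGISTRY[cls.name] = cls
    return cls


@register
class EchoTool(ToolHandler):
    """Toy tool used only to exercise the registry."""
    name = "echo"
    schema = {"text": "string"}

    def handle(self, **kwargs: Any) -> Any:
        return {"echo": kwargs["text"]}


def dispatch(tool_name: str, **kwargs: Any) -> Any:
    """Check argument names against the schema, then run the handler."""
    handler_cls = REGISTRY[tool_name]
    unknown = set(kwargs) - set(handler_cls.schema)
    if unknown:
        raise TypeError(f"unexpected arguments: {sorted(unknown)}")
    return handler_cls().handle(**kwargs)


result = dispatch("echo", text="hello")
```

Keeping registration separate from dispatch means a new tool only needs a subclass and a decorator, without changes to the calling code.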
7. Limitations and Prospects
Numina-Lean-Agent exhibits several constraints:
- Readability: Generated proofs, while functional, may lack high-level structuring and remain overly tactic-driven compared to hand-written mathematical code.
- Type-level brittleness: Certain type conversions (e.g., Real to NNReal) may challenge the agent, necessitating occasional human intervention or pre-processing.
- Scalability: Very large formalization efforts can trigger context limitations, though the subagent mechanism partially mitigates this in practice.
Planned enhancements target high-level proof planning (to induce idiomatic Lean constructs such as @[simp] lemmas), improved type inference guidance, deeper neural subgoal decomposition for scalability, and more user-friendly interfaces for blueprint-guided workflows and MCP decision inspection. The open, protocol-driven design supports future integration of specialized tools and evolving LLM architectures (Liu et al., 20 Jan 2026).
In conclusion, Numina-Lean-Agent exemplifies a generalizable, model-agnostic approach to formal mathematical reasoning, leveraging lightweight orchestration and general code agents to solve benchmark tasks and enable collaborative, large-scale formalization. Its architecture supports extensibility and adaptability for advanced research in automated formal mathematics.