- The paper presents a novel multi-agent system using specialized LLMs to recursively decompose and verify Lean 4 proofs.
- It utilizes retrieval-augmented generation and AST-level analysis to generate, validate, and manage subgoals in formal proofs.
- Empirical results demonstrate significant performance boosts and modular scalability in automated theorem proving.
Gödel's Poetry: Recursive and Multi-Agent LLM Theorem Proving for Lean 4
Overview
"Gödel's Poetry" (2512.14252) introduces a multi-agent, LLM-driven system for automated theorem proving with Lean 4. The core innovation is the orchestration of specialized agents—proof generation, autoformalization, recursive decomposition, verification, and subgoal synthesis—each leveraging the strengths of contemporary LLMs, coordinated through a modular architecture built atop LangGraph and LangChain. The system's defining feature is its capability to recursively decompose difficult theorems into simpler subproblems, guided by retrieval-augmented generation (RAG), AST-level analysis, and example-driven subgoal extraction. The technical contributions include both the design of the multi-stage agent pipeline and an extension to the Kimina Lean Server, enabling AST-based proof decomposition with programmatic subgoal extraction.
Technical Contributions
Multi-Agent, Multi-Stage Architecture
The approach instantiates dedicated LLM-based agents for the sequence of tasks that comprise end-to-end theorem proving from informal statement to verified Lean 4 proof:
- Formalization and Semantic Validation: Informal statements are autoformalized by a specialized LLM agent (default: Goedel-Formalizer-V2), then checked for syntax (Kimina Lean Server) and validated semantically by a distinct LLM (default: Qwen3 30B) to enforce correspondence between the informal and formal statements.
- Proof Generation and Correction: The main prover agent (default: Goedel-Prover-V2) attempts direct proof synthesis, with verifier-guided self-correction via Lean feedback. This harnesses the extensively validated Goedel-Prover-V2 models.
- Recursive Decomposition and RAG: For theorems that resist direct proof, the system triggers decomposition: a query-generation agent identifies relevant lemma types and tactics, a vector-database agent (LeanExplore) retrieves applicable Mathlib theorems, and a sketching agent (a frontier LLM, e.g., GPT-5) drafts a proof plan with have-statements and explicit sorries.
- AST-Based Reasoning: Kimina's AST export is extended to extract not just theorem headers but the full proof structure, in particular the explicit `have ... := by sorry` subgoals created by recursive decomposition. These are reified as subtheorems, and the process recurses until all leaves are formally proved.
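As a minimal illustration (the theorem and lemma decomposition here are invented for exposition, not taken from the paper), a sketch produced by the sketching agent might contain explicit `have ... := by sorry` subgoals that the AST export then reifies as independent subtheorems:

```lean
-- Hypothetical decomposition sketch: each `sorry` becomes a subtheorem
-- that the pipeline attempts to prove (and recurses on) independently.
theorem sum_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h1 : 0 ≤ a ^ 2 := by sorry
  have h2 : 0 ≤ b ^ 2 := by sorry
  exact add_nonneg h1 h2
```

Once both subgoals are discharged, the verified subproofs are spliced back in place of the sorries to yield a complete proof.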
Programmatic Proof Tree Management
A tree-structured explicit proof state records the full derivational structure: nodes are either leaves (atomic, verified proofs) or decomposition nodes (proof sketches, decomposition histories, and subgoals). State is managed to maintain complete provenance, enable proof reconstruction (by recursively splicing subproofs in for sorries), and support advanced search strategies—breadth-first subgoal processing, automatic backtracking on failed decompositions, and parameterized recursion depth.
Extensible and Open Infrastructure
Gödel's Poetry is implemented modularly in Python. LLM roles, endpoints, and parameters are configured via INI files and environment variables, supporting ablation studies and experimentation with alternative LLMs for each role. The Kimina AST and LeanExplore servers are integrated for local or remote use, enabling scalable local proving and semantic search. The framework is released under Apache 2.0, and the project encourages custom agent development.
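A configuration layer along these lines can be sketched with the standard library; the section and key names below are hypothetical placeholders, not the project's actual schema:

```python
import configparser
import os

# Hypothetical INI schema; the real Gödel's Poetry config keys may differ.
DEFAULT_INI = """
[agents]
prover = Goedel-Prover-V2
formalizer = Goedel-Formalizer-V2
sketcher = gpt-5

[limits]
max_recursion_depth = 3
pass_attempts = 32
"""

config = configparser.ConfigParser()
config.read_string(DEFAULT_INI)

# Environment variables override INI values, so swapping a model for an
# ablation run requires no code change.
prover = os.environ.get("GP_PROVER", config["agents"]["prover"])
depth = int(os.environ.get("GP_MAX_DEPTH", config["limits"]["max_recursion_depth"]))
print(prover, depth)
```

This pattern is what makes per-role model interchangeability cheap: each agent reads its model name and endpoint from the same resolved configuration.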
Empirical Results
- Baseline Performance: Using Goedel-Prover-V2 alone with verifier-guided correction, the system achieves 90.4% pass@32 on miniF2F, matching state-of-the-art results for open-source Lean 4 provers of this size.
- Impact of Decomposition: The recursive decomposition agent (leveraging RAG and AST-guided subgoal creation) yields a "significant" performance boost over direct proof only, though full benchmarking is ongoing. Analogous systems (e.g., Hilbert) demonstrate that such architectures can approach or exceed 99% on miniF2F and 70%+ on PutnamBench.
- Workflow Efficiency: Subgoal-level parallelism is natively supported—once decomposed, sibling subgoals are independently dispatched for proof generation—allowing scale-up with compute resources.
- Resource Constraints: The main computational bottleneck is LLM inference, especially with multiple recursion levels and, for frontier LLMs, increased latency and cost.
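The subgoal-level parallelism noted above can be sketched as follows; `prove` is a hypothetical stand-in for a prover-agent call plus Lean verification, not the system's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def prove(subgoal: str) -> tuple[str, bool]:
    """Stand-in for dispatching one subgoal to a prover agent and
    verifying the result with the Lean server (hypothetical)."""
    # Here we simply simulate a successful proof attempt.
    return subgoal, True

# Sibling subgoals from one decomposition are independent by construction,
# so they can be dispatched concurrently; throughput scales with the
# available inference capacity.
subgoals = ["0 ≤ a^2", "0 ≤ b^2", "a^2 + b^2 = (a + b)^2 - 2*a*b"]

with ThreadPoolExecutor(max_workers=len(subgoals)) as pool:
    results = list(pool.map(prove, subgoals))

print(results)
```

Since LLM inference dominates wall-clock time, this kind of fan-out is where additional compute translates most directly into faster proof search.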
Architectural Comparison and Positioning
- Against POETRY: Unlike POETRY (Isabelle-centric, single-model, implicit subgoal extraction), Gödel's Poetry applies explicit AST-level subgoal extraction in Lean, with modular LLM specialization for each agent role and coverage for informal-to-formal translation.
- Relative to Hilbert: Both systems implement recursive proof search with RAG retrieval and model orchestration, but Gödel's Poetry emphasizes public, extensible code and interchangeability of underlying models.
- Relation to Kimina/DeepSeek: Leverages state-of-the-art Lean-specialized LLMs but adds enhanced self-correction, context-sensitive proof decomposition, and semantic theorem search.
Limitations
- Model Weaknesses: For domains not well represented in Mathlib or outside the core model's training distribution, both direct proof and lemma generation may be suboptimal—even with RAG, decomposition quality can plateau.
- Resource Utilization: Deep decompositions multiply LLM calls exponentially with recursion depth; cost management, semantic caching, or distilled specialist models are needed for practical use at scale.
- Search Completeness: Only one decomposition path is explored per failure and backtrack, and the default search is not beam-based; future work includes prioritized or multi-path search.
- Specialist Domains: Without domain-adaptive LLMs or lemma synthesis, performance may degrade for advanced or non-elementary mathematics.
Theoretical and Practical Implications
Gödel's Poetry substantiates a scalable blueprint for combining LLMs with formal proof environments, demonstrating that recursive, modular, agent-based architectures overcome size limitations evident in current direct proof models. The explicit AST-based proof management not only clarifies provenance for interpretability, but also opens the door for advanced meta-level learning, such as reinforcement learning for decomposition strategies, proof repair, or mixed human-AI collaboration.
Practically, the system is community-ready for integration, benchmarking, or adaptation to other proof assistants, contingent only on development of compatible AST and semantic search APIs. The modularity and open configuration enable ablation studies, specialist model drops, and extension to complex workflows (collaborative proving, mixture-of-experts, curriculum learning).
Future Directions
Key next steps include multi-path search (beam/best-first/heuristic search), reinforcement-learned or curriculum-adapted decomposition policies, dynamic proof repair, and extending beyond canonical domains (algebra, geometry, combinatorics) to broader mathematics or adjacent formal logics. Integration with human-in-the-loop interfaces, as well as meta-learning of agent-to-agent communication strategies, may yield further performance and usability benefits.
Conclusion
Gödel's Poetry demonstrates that recursive decomposition, modular agent supervision, and AST-level orchestration of LLMs form a powerful pipeline for Lean 4 theorem proving. The architecture is empirically competitive with leading academic and industry systems, and its extensibility positions it as a testbed for ongoing research into scaling, domain adaptation, and collaborative formal mathematics. The publicly released codebase ensures reproducibility, extensibility, and community engagement for the advancement of AI-driven formal reasoning.