Aristotle: IMO-level Automated Theorem Proving (2510.01346v1)
Abstract: We introduce Aristotle, an AI system that combines formal verification with informal reasoning, achieving gold-medal-equivalent performance on the 2025 International Mathematical Olympiad problems. Aristotle integrates three main components: a Lean proof search system, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver. Our system demonstrates state-of-the-art performance with favorable scaling properties for automated theorem proving.
Explain it Like I'm 14
Overview
This paper introduces Aristotle, an AI system that solves tough math problems and writes fully checked, mistake-free proofs in a formal language called Lean 4. Aristotle performed at a “gold medal” level on the 2025 International Mathematical Olympiad (IMO) by correctly solving 5 out of 6 problems with formal proofs, which is a very high standard.
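To get a sense of what "formally checked" means, here is a tiny Lean 4 proof (an illustrative example, not from the paper). Lean only accepts the theorem because every step is machine-verified:

```lean
-- A toy Lean 4 proof. `Nat.add_comm` is a theorem from Lean's standard
-- library; `exact` tells Lean precisely which fact closes the goal.
theorem swap_add (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```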
What questions does the paper try to answer?
The paper explores simple, big-picture questions:
- Can an AI solve hard math problems using both human-style thinking and strict, computer-checked proofs?
- How do we make AI’s math solutions reliable, not just convincing-sounding?
- What kind of system design helps AI scale up to very challenging problems like the IMO?
How does Aristotle work? (Methods explained simply)
Think of solving a hard math problem like exploring a maze:
- You try different paths (ideas) step-by-step.
- You break the big challenge into smaller tasks.
- You check each step carefully so you don’t get lost.
Aristotle combines three parts that work together:
1) Guided proof search in Lean (the “maze explorer”)
- Lean 4 is a super strict math language. If a proof works in Lean, it’s guaranteed correct.
- Aristotle uses a smart “search” to build proofs one move at a time, like a chess player exploring future moves.
- A large AI model suggests good next steps (“tactics”), and another part estimates which paths look promising.
- If a step splits the problem into smaller subproblems, the system tries to solve them too, until nothing is left.
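The "which paths look promising" step can be made concrete: the paper says the search uses a variant of the PUCT formula, where an exploration bonus is weighted by a prior policy (see the Glossary). Below is a minimal, generic sketch of that selection rule in Python; the tactic names, constants, and toy numbers are made up, and this is the textbook formula rather than Aristotle's actual implementation:

```python
import math

def select_action(Q: dict, N: dict, P: dict, c_puct: float = 1.5) -> str:
    """Pick the next tactic by PUCT: value estimate + prior-weighted exploration.

    Q: estimated value per action, N: visit counts, P: prior policy probabilities.
    """
    total_visits = sum(N.values())

    def score(action: str) -> float:
        # exploitation term + exploration bonus weighted by the prior policy
        return Q[action] + c_puct * P[action] * math.sqrt(total_visits) / (1 + N[action])

    return max(Q, key=score)

# Toy usage: two candidate tactics suggested by the policy model.
Q = {"nlinarith": 0.4, "induction n": 0.3}
N = {"nlinarith": 10, "induction n": 2}
P = {"nlinarith": 0.6, "induction n": 0.4}
print(select_action(Q, N, P))  # the rarely tried tactic wins on exploration
```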
2) Lemma-based informal reasoning (the “planner”)
- Before diving into the formal proof, Aristotle sketches an informal plan: short helper statements, called lemmas, that it thinks will be useful.
- It then translates each lemma into Lean.
- If Lean complains (for example, the lemma is stated incorrectly), Aristotle revises and tries again.
- This loop repeats: plan → formalize → get feedback → fix errors → prove.
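As a rough illustration, the loop in the last bullet might look like the following Python sketch. This is hypothetical glue code, not the paper's implementation: the four callables stand in for the informal reasoning model and a Lean REPL wrapper.

```python
from typing import Callable

def prove_with_lemmas(
    problem: str,
    suggest_lemmas: Callable[[str], list[str]],     # informal planner (hypothetical)
    formalize: Callable[[str], str],                # lemma text -> Lean statement
    revise: Callable[[str, str], str],              # fix a statement given errors
    lean_check: Callable[[str], tuple[bool, str]],  # Lean REPL: (ok, error messages)
    max_rounds: int = 3,
) -> list[str]:
    """Plan -> formalize -> get feedback -> fix errors -> prove."""
    proved: list[str] = []
    for lemma in suggest_lemmas(problem):
        statement = formalize(lemma)
        for _ in range(max_rounds):
            ok, errors = lean_check(statement)
            if ok:                      # Lean accepted the statement
                proved.append(statement)
                break
            statement = revise(statement, errors)  # try again with feedback
    return proved
```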
3) Fast geometry solver (the “geometry expert”)
- Geometry problems are special. Aristotle includes a dedicated tool called Yuclid that solves plane geometry with clever rule-checking and algebra.
- It’s highly optimized (up to 500× faster than some earlier systems) and solves many geometry tasks very quickly.
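The "clever rule-checking" half of a DD/AR engine is essentially forward chaining: apply geometry rules to the known facts until nothing new can be deduced. Here is a toy Python version with one made-up rule; Yuclid itself is a heavily optimized C++ engine, so this only conveys the idea:

```python
def deductive_closure(facts: set, rules) -> set:
    """Repeatedly apply each rule until no new facts are produced."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for fact in rule(known):
                if fact not in known:
                    known.add(fact)
                    changed = True
    return known

def parallel_transitivity(known):
    """Toy rule: par(a, b) and par(b, c) imply par(a, c)."""
    pars = [f for f in known if f[0] == "par"]
    return {("par", a, c)
            for (_, a, b) in pars
            for (_, b2, c) in pars
            if b == b2 and a != c}

facts = {("par", "l1", "l2"), ("par", "l2", "l3")}
print(deductive_closure(facts, [parallel_transitivity]))  # derives par(l1, l3)
```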
Training and learning along the way
- Aristotle is trained using reinforcement learning: it practices proving thousands of statements, learns from successes and failures, and improves its strategy.
- It also does “test-time training”: while working on a new problem, it learns from its own attempts and gets better in real time—like learning during an exam but only from its own scratch work.
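In pseudocode, that test-time loop could look like the sketch below. The `ProverModel` interface is hypothetical (the paper does not expose an API); the point is the alternation between searching and briefly fine-tuning on the search's own traces:

```python
from typing import Optional, Protocol

class ProverModel(Protocol):
    """Hypothetical interface for a policy/value prover."""
    def search(self, problem: str) -> tuple[Optional[str], list[str]]: ...
    def finetune(self, traces: list[str]) -> None: ...

def solve_with_ttt(model: ProverModel, problem: str, rounds: int = 3) -> Optional[str]:
    """Alternate proof search with training on the search's own scratch work."""
    for _ in range(rounds):
        proof, traces = model.search(problem)  # attempt the problem
        if proof is not None:
            return proof                       # formally verified proof found
        model.finetune(traces)                 # learn from partial progress
    return None
```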
What did Aristotle find, and why is it important?
Main results:
- Gold-level IMO performance: Aristotle produced fully checked Lean proofs for 5 out of 6 IMO 2025 problems. This is a major milestone because the IMO is very hard, and formal proofs are stricter than normal written solutions.
- Strong scaling: Using a very large model (over 200 billion parameters), running many parallel solution attempts, and iteratively fixing errors boosted performance.
- Beyond contests: During training, Aristotle:
- Added missing theorems to Lean’s math library (Mathlib), such as Niven’s theorem and the Gauss–Lucas theorem.
- Helped spot subtle mistakes in a math textbook and provided counterexamples.
- Handled advanced areas (e.g., category theory, homological algebra), showing it’s not limited to high school math.
Why it matters:
- Formal proofs aren’t just “convincing”—they are verified by a computer, so they’re trustworthy.
- Combining natural reasoning with strict checking is a powerful recipe for reliability.
- This approach could help mathematicians by suggesting ideas and ensuring no steps are missing or wrong.
What’s the bigger impact?
- Reliable math assistants: Aristotle shows that AI can not only come up with ideas but also verify them in a trusted way. This could help researchers avoid mistakes and move faster.
- Better tools for learning and discovery: It can break problems into bite-sized lemmas, explain its logic, and correct itself based on feedback—useful for students, teachers, and scientists.
- A path forward: Mixing large models, guided search, and formal verification looks like a strong strategy for future AI systems that reason carefully, not just fluently.
In short, Aristotle is a promising step toward AI that can think like a mathematician and prove like a computer—creative and careful at the same time.
Knowledge Gaps
Below is a concrete, actionable list of what remains missing, uncertain, or unexplored in the paper.
- Quantitative evaluation is sparse: no aggregate success rates, ablations, or statistical comparisons on standard formal benchmarks (e.g., MiniF2F, ProofNet, PutnamBench) despite citing them.
- “Favorable scaling” is asserted but not characterized: no scaling laws, search-budget/performance curves, or cost–quality trade-offs are reported.
- Compute and resource footprint are unspecified: model size (>200B) is mentioned without training/inference FLOPs, GPU hours, memory, wall-clock times, or energy/carbon metrics.
- Reproducibility is limited: the core model, search code, training data, and prompts are not released (only the geometry engine Yuclid), making it hard to replicate results.
- Test-time training (TTT) during evaluation raises comparability questions: no protocol is given for how many TTT iterations were allowed, when to stop, or how to ensure fair comparisons with systems that do not adapt at test time.
- Potential overfitting via TTT is unaddressed: no analysis of how much the model specialized to individual IMO problems, nor the impact on generalization to new tasks post-TTT.
- Data governance and leakage control are unclear: no overlap audit between training corpora (open-source/in-house) and evaluation sets (including structurally similar IMO/olympiad problems); no contamination safeguards described.
- Autoformalization fidelity is only lightly treated: the “judge” for formal–informal alignment is mentioned, but its accuracy, calibration, error modes, and acceptance thresholds are not quantified.
- Autoformalization coverage/quality metrics are missing: percent of lemma statements autoformalized correctly, kinds of formalization errors, and the downstream impact on proof success are not reported.
- Formal–informal mismatch risks remain: no procedure is given to ensure that autoformalized lemmas preserve the intended semantics of the informal plan beyond heuristic judging and REPL signals.
- Whole-proof vs step-wise trade-offs are not studied: there is no head-to-head comparison with whole-proof systems (e.g., Seed-Prover) under matched compute/search budgets.
- Search hypergraph equivalence is heuristic: actions/states deemed “equivalent” can differ due to tactic global state (e.g., aesop), but no principled guarantee or quantitative failure analysis is provided.
- Negation augmentation in search is under-specified: how negating goals interacts with Lean’s classical vs constructive logic, proof-by-contradiction workflows, or search correctness is not analyzed.
- Prompting with action history is ad hoc: no ablation quantifying its effect on loop avoidance, branching factor, or proof length; no study of context-window constraints and truncation effects.
- Value/policy sharing in a single model lacks analysis: no comparison to decoupled value/policy networks or to alternative value targets (e.g., heuristic distance-to-proof, bootstrapped critics).
- Reward shaping and nontriviality filtering are opaque: criteria for “nontrivial” traces, reward assignment for partial progress, and their effect on stability and exploration are not given.
- Postprocessing effects are unmeasured: linter-driven proof compression and tactic skipping are not quantified for reliability, speed-ups, or proof robustness across Lean/Mathlib versions.
- Failure analysis on IMO Problem 6 is absent: no diagnosis of where the system breaks (lemma generation, search depth, missing tactics, library gaps), nor targeted remedies.
- Generalization beyond contest math is anecdotal: claims of advanced-topic capability rely on a few examples; no systematic evaluation across graduate-level domains or research-grade benchmarks.
- Geometry pipeline lacks formal proof certificates: Yuclid solves problems “outside Lean,” but there is no description of certificate formats, proof reconstruction in Lean, or fully verifiable checking pipelines.
- Geometry coverage vs. AG-2 features is partial: integration of AG-2 extensions is “future work”; no plan or evaluation for auxiliary-construction search, non-Euclidean or 3D geometry, or dynamic rule selection.
- Yuclid trade-offs are underexplored: enabling the law of sines slows inference by ~10×; no adaptive strategy or learned policy for toggling heavy rules per problem instance is provided.
- Integration latency between solvers is unknown: no timing/profile of how geometry outputs feed into Lean or the global pipeline; end-to-end latency for geometry problems is not reported.
- Lemma pipeline effectiveness is not quantified: no stats on how many generated lemmas are useful, the fraction proved per round, or the incremental gains from iterative revision.
- Strategy diversification is unmanaged: there is no mechanism to explicitly maintain or re-seed diverse high-level proof plans across iterations to avoid premature convergence.
- Novel auxiliary definitions are not evaluated: Aristotle invents sets/functions (e.g., S, f(k)), but there is no measure of minimality, reusability, or refactorability across problems.
- Library/API adaptation via TTT lacks safeguards: no analysis of catastrophic forgetting, persistence of learned micro-APIs, or transfer across libraries with different abstractions.
- Robustness to alternative foundations is untested: while the system handled Tao’s textbook types, there is no study of portability to Coq/Isabelle/HoTT or Lean kernels with different flags.
- Safety and isolation of code execution are unaddressed: informal reasoning leverages “code execution,” but sandboxing, resource limits, and side-effect controls are not discussed.
- Search-budget allocation policy is unspecified: how parallel runs, lemma proving, and main-goal attempts compete for budget is not formalized; no scheduling or prioritization heuristics are given.
- Memory and state deduplication limits are unknown: hypergraph growth, memory pressure, and policies for eviction, hashing collisions, or cycle detection are not analyzed.
- Tactic non-determinism is untreated: reproducibility under tactics with randomness/global caches (e.g., aesop) is not studied; no seeding or determinization practices are described.
- Human-in-the-loop boundaries are unclear: main problems were hand-formalized; there is no end-to-end autoformalization of full contest statements nor assessment of that bottleneck.
- Evaluation governance is not defined: since TTT changes the model during evaluation, it is unclear what “the” evaluated checkpoint is; logging/versioning required for external auditing is not described.
- Ethical and competition implications are unexamined: no discussion of fairness in using massive compute and TTT against human contestants, or of appropriate use guidelines in academic contests.
- Long-context limitations are unmeasured: no results on context length vs success, truncation effects on lemma chains, or retrieval/condensation methods for long proof histories.
- Theoretical grounding of MCGS is limited: there is no analysis of convergence/optimality in AND–OR hypergraphs under progressive widening and noisy generative priors.
These gaps suggest concrete next steps: publish ablations and scaling laws; release reproducibility artifacts; formalize geometry-proof certificates; quantify autoformalization fidelity; define fair TTT protocols; conduct systematic failure analyses; and broaden evaluations across benchmarks, libraries, and foundational systems.
Practical Applications
Immediate Applications
The following applications can be deployed now, leveraging Aristotle’s existing Lean-4-integrated proof search, lemma-based informal-to-formal pipeline, test-time training, and the Yuclid geometry solver.
Each entry below gives the application name and sector(s) with a brief description, followed by tools/workflows that might emerge and the assumptions/dependencies involved.
- Aristotle “Proof Copilot” for mathematicians — academia, software
- An interactive assistant that turns proof sketches and informal notes into machine-checked Lean proofs, suggests lemmas, fills in missing steps, and lints/optimizes proof code.
- Tools/workflows: VS Code/Lean plugin; PDE/Group Theory lemma suggester; proof linter/optimizer; search-on-demand from a sketch.
- Assumptions/dependencies: Mathlib coverage; access to a Lean REPL; computational budget (inference can be heavy for >200B models); trust and familiarity with Lean by users.
- Autoformalization aid for papers and preprints — academia, publishing
- Drafts Lean statement formalizations from LaTeX/markdown theorems, generates candidate lemmas, and returns compile feedback to authors.
- Tools/workflows: “ArXiv/overlay” bot; journal CI hook that runs Lean checks and provides a “formally stated” badge.
- Assumptions/dependencies: Robust statement autoformalization; author consent; editorial acceptance; manual review for ambiguous translations.
- Automated textbook and content checking — education, publishing, policy
- Detects false or underspecified exercises and redundant hypotheses (as demonstrated on Tao’s analysis text), with counterexamples and Lean-backed corrections.
- Tools/workflows: “Curriculum Checker” for exercise banks; LMS plugin that flags problematic problems; counterexample generator for discrete cases.
- Assumptions/dependencies: Recall of relevant background theorems in Mathlib; consistent encoding of the textbook’s foundations (may differ from Mathlib); human-in-the-loop acceptance.
- Contest training, grading, and hinting — education, daily life
- Provides formally verified solutions, graded rubrics, and stepwise hints for Olympiad/Putnam-style problems; safeguards against hallucinated reasoning.
- Tools/workflows: “Olympiad Tutor” app; problem-to-lemma breakdown with adaptive hinting; auto-grader tied to Lean.
- Assumptions/dependencies: Access to problem repositories; policies for preventing leakage on live contests; compute cost/latency constraints.
- Yuclid integration for geometry tooling — education, engineering, software
- High-speed, Apache-2.0-licensed DD/AR geometry solver embedded in dynamic geometry systems (e.g., GeoGebra), CAD constraint solvers, and game engines.
- Tools/workflows: Yuclid API; “diagram-to-proof” adapters; CAD plug-ins for constraint validation; QA for geometric asset pipelines.
- Assumptions/dependencies: Numerical diagram builders; interface for rule sets; performance/latency targets; optional Law-of-Sines table (trades speed for capability).
- Formal PR assistant for math libraries — academia, open source
- Suggests PRs to Mathlib and domain repositories, adds missing theorems (e.g., Niven’s theorem, Gauss–Lucas) and micro-lemmas, with CI that verifies proofs.
- Tools/workflows: “Lean PR bot”; repository CI pipelines; proof minimizer/cleaner; reviewer dashboards.
- Assumptions/dependencies: Maintainer review bandwidth; repository licensing; steady REPL compatibility across versions.
- Lean proof search-as-a-service/API — software
- A cloud/on-prem API that completes Lean proof sketches, with problem-specific test-time training (TTT) that personalizes to a team’s style/library.
- Tools/workflows: Batch proof completion; TTT personalization pipelines; usage analytics; privacy-safe on-prem deployments.
- Assumptions/dependencies: Resource scaling; governance for on-prem vs. cloud; stable Lean/Mathlib versions aligned with client stacks.
- Crypto/math lemma verification in CI — finance, cybersecurity
- Verifies number-theoretic and algebraic properties used in cryptographic specs or R&D prototypes (e.g., modular arithmetic invariants) at commit time; see the toy Lean example after this list.
- Tools/workflows: “Crypto-lemma pack” for Lean; CI guards on specs; mapping of spec DSLs to Lean models.
- Assumptions/dependencies: Limited to theorems formalizable in current Mathlib; careful abstraction of implementation details; human review for spec-code gap.
- Educational courseware with formal-checking feedback — education
- Interactive proof labs where students write proofs that must compile; Aristotle supplies granularity-controlled hints and detects common misconception patterns.
- Tools/workflows: Course modules; analytics on proof attempts; adaptive hint budgets; LMS integration.
- Assumptions/dependencies: Teacher training; ease-of-use for beginners; rate-limiting compute for classrooms.
- Proof critique for human-written solutions — academia, publishing
- Reads a human’s natural-language proof, decomposes into lemmas, attempts formalization, and flags logical gaps or ambiguous steps with concrete Lean counterexamples.
- Tools/workflows: “Proof Critique” panel in editorial systems; structured discrepancy reports; side-by-side Lean proof and comments.
- Assumptions/dependencies: Accuracy of natural-language-to-lemma mapping; customizable strictness; reviewer oversight.
- Geometric constraint validation in content pipelines — graphics, AR/VR, manufacturing
- Uses Yuclid to certify geometric relations in assets (e.g., orthogonality, parallelism, tangency) before deployment to production or hardware.
- Tools/workflows: Offline QA batch validation; asset CI checks; repair suggestions using auxiliary construction rules.
- Assumptions/dependencies: Robust conversion from asset formats to Yuclid primitives; speed at scale; custom rule packs per domain.
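As one concrete illustration of the crypto/math lemma verification item above, a CI gate could check small algebraic invariants stated over `ZMod n` (the integers modulo n) from Mathlib. This is a toy example, not a claim about any particular protocol:

```lean
import Mathlib

-- Toy CI check: a modular-arithmetic identity in the commutative ring ZMod n.
-- The `ring` tactic closes it automatically.
example (n : ℕ) (a b : ZMod n) : (a + b) ^ 2 = a ^ 2 + 2 * a * b + b ^ 2 := by
  ring
```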
Long-Term Applications
These applications require further research, scaling, and/or cross-domain formal libraries and tooling beyond Lean’s current math ecosystem.
- End-to-end formal verification for safety-critical systems — robotics, aerospace, automotive, healthcare
- Verified control invariants, perception-to-actuation safety cases, and redundancy proofs using lemma-based search over hybrid-system models and differential equations.
- Tools/workflows: Libraries for ODEs, hybrid automata, and control barriers; plant/environment formal models; certification reports.
- Assumptions/dependencies: Mature formal libraries for continuous dynamics; model-code correspondence; regulator acceptance.
- Verified cryptographic protocols and smart contracts — finance, blockchain, cybersecurity
- Automated security reductions and contract invariants with machine-checked proofs integrated into deployment pipelines.
- Tools/workflows: Formal semantics for VMs/DSLs; standard cryptographic proof libraries; end-to-end CI gates.
- Assumptions/dependencies: Alignment between formal spec and code; performance of autoformalization on complex protocols; legal/commercial incentives.
- AI systems with provable safety properties — software, AI safety
- Proving properties of RL/control policies, bounded behaviors, and alignment invariants with formal witnesses; proof-carrying agents.
- Tools/workflows: Formal models of learning dynamics; proof obligations tied to training logs; proof-carrying artifacts for deployment.
- Assumptions/dependencies: Adequate formal semantics for ML; scalable proof search over stochastic systems; verification-friendly training practices.
- Formal scientific publication pipeline — academia, policy
- Journals and funders encourage or require machine-checkable proofs and reproducible formal artifacts for math-heavy papers.
- Tools/workflows: Submission portals with autoformalization, reviewers’ proof dashboards, “formally verified” badges.
- Assumptions/dependencies: Community standards; funding for formalization; improved autoformalization coverage and ergonomics.
- Industrial verification of numerical and optimization software — software, energy, logistics
- Proof-backed correctness for solvers (convex optimization, root-finding), sensitivity bounds, and termination guarantees.
- Tools/workflows: Floating-point and interval arithmetic libraries; spec-to-code refinement proofs; verification-aware solver APIs.
- Assumptions/dependencies: Mature analysis libraries; floating-point semantics; performance-safe proof instrumentation.
- CAD/CAE with proof-backed design validation — manufacturing, civil engineering, electronics
- Automated proofs for geometric, tolerance, and load-bearing constraints; auxiliary constructions to repair invalid designs.
- Tools/workflows: Yuclid-like engines extended to 3D; coupling with finite element assumptions; “design proof reports.”
- Assumptions/dependencies: 3D geometry/algebraic reasoning tables; scalable constraint encodings; interoperability with CAD standards.
- Autonomous theorem discovery and research collaboration — academia
- Model proposes novel conjectures, decomposes into lemmas, and iterates formal search with test-time specialization across topics (e.g., algebraic geometry, PDE).
- Tools/workflows: Conjecture generators with falsification loops; research notebooks that track formal progress; community curation.
- Assumptions/dependencies: Larger formal corpora; strong autoformalization of new definitions; credit and authorship norms.
- Regulatory “provable compliance” frameworks — policy, regtech
- Formalized regulatory logics encoded as proof obligations (e.g., capital adequacy, safety rules) with organization-specific proof attestations.
- Tools/workflows: Formal ontologies for regulations; compliance proof compilers; auditor dashboards with machine-checkable evidence.
- Assumptions/dependencies: Legally recognized formal encodings; versioning of laws; acceptable abstraction gaps.
- Verified data transformations and pipelines — healthcare, finance, public sector
- Machine-checked lineage and invariants in ETL/analytics pipelines (e.g., privacy guarantees, monotonicity, conservation constraints).
- Tools/workflows: DSLs with Lean backends; proof-carrying data artifacts; CI for schema/invariant changes.
- Assumptions/dependencies: Formal semantics for data systems; mapping to real-world pipelines; adoption costs.
- Large-scale educational transformation via proof assistants — education
- Widespread curricula where students build proofs that must compile, with adaptive lemma scaffolding and mastery tracking.
- Tools/workflows: Low-friction UIs; device-friendly Lean kernels; teacher-facing analytics.
- Assumptions/dependencies: Significant UX simplification; professional development; equitable access.
- Natural-language “proof critique” at journal scale — publishing
- Automated peer-review assistance that traces claims to formal lemmas, proposes repairs, or produces counterexamples across thousands of submissions.
- Tools/workflows: High-recall NL-to-Lean mappers; claim-level provenance; triage queues for editors.
- Assumptions/dependencies: Very high precision to avoid reviewer overload; robust handling of domain-specific notation.
- Cross-assistant formal ecosystem interoperability — software, academia
- Seamless transfer of proofs between Lean, Coq, Isabelle, and domain DSLs; shared search and lemma engines.
- Tools/workflows: Interchange formats; proof translation layers; unified search over multi-proof-assistant graphs.
- Assumptions/dependencies: Community consensus on standards; nontrivial translation correctness; maintenance burden.
Notes on cross-cutting dependencies
- Compute and latency: High-parameter models and Monte Carlo Graph Search can be resource-intensive; budget and batching strategies are needed.
- Library coverage: Success depends on the breadth/depth of formal libraries (e.g., Mathlib, analysis, hybrid systems, crypto).
- Autoformalization fidelity: Mapping natural language to precise formal statements remains a bottleneck; human oversight mitigates risks.
- Tooling stability: Lean/Mathlib versioning and REPL compatibility affect reproducibility and CI.
- Governance and trust: Adoption in policy/industry needs standards, auditability, provenance, and clear failure modes.
- Data and IP: Use of textbooks/problem sets/papers requires licensing and data-governance practices.
- Human-in-the-loop: For high-stakes settings, workflows should include expert review, especially where domain abstractions are novel.
Glossary
- aesop tactic: A Lean automation tactic that applies heuristic rule-based reasoning and uses global state internally. "a common counterexample is the aesop tactic, which uses global state internally"
- AlphaGeometry: A prior geometry automated theorem proving approach used as a basis for the paper’s geometry solver. "A geometry solver which solves plane geometry problems outside of Lean using an approach based on AlphaGeometry."
- AlphaZero: A deep reinforcement learning algorithm combining MCTS with policy/value networks for search and planning. "in the spirit of Expert Iteration and AlphaZero."
- autoformalization: Automatically translating informal mathematical statements or proofs into a formal language like Lean. "we developed a statement autoformalization system, which consists of initial autoformalization, judging using signals from the Lean REPL, and correction."
- DD/AR (deductive database and algebraic reasoning): A hybrid geometry-solving framework that combines rule-based inference with algebraic equation reasoning. "a very fast C++ DD/AR (deductive database and algebraic reasoning) engine."
- derangements: Permutations with no fixed points, often used in combinatorial arguments. "studied certain derangements (permutations without fixed points) which implicitly appear in the problem."
- echelon form: A row-reduced matrix form used to efficiently solve and update linear systems. "we store the current echelon form of the linear system, updating it every time we establish a new statement."
- Eisenstein series: Special modular forms central to number theory and complex analysis. "two instances of it proving basic statements involving homological algebra and Eisenstein series, respectively."
- Expert Iteration: An iterative training paradigm that alternates between search-generated expert trajectories and policy/value learning. "in the spirit of Expert Iteration and AlphaZero."
- filter_upwards: A Lean tactic for manipulating mathematical filters (structures underpinning limits) to derive convergence properties. "Another instance was its surprising use of the filter_upwards tactic."
- Gaussian elimination: A standard algorithm for solving systems of linear equations via row operations. "The algebraic reasoning is implemented using Gaussian elimination, with a few optimizations allowing us to avoid repeated work."
- Gauss-Lucas theorem: A theorem in complex analysis describing where the zeros of a polynomial’s derivative lie relative to the zeros of the polynomial. "including Niven’s theorem, the Gauss-Lucas theorem, the fact that eigenvalues are the roots of the characteristic polynomial, and other technical lemmas."
- homological algebra: An advanced area of algebra studying chain complexes and derived functors. "two instances of it proving basic statements involving homological algebra and Eisenstein series, respectively."
- hypergraph: A generalization of graphs where edges can connect multiple nodes; here, a shared-state search representation. "turn the search hypertree into a search hypergraph."
- hypertree: A search structure where actions can produce multiple successor states (hyper-edges), generalizing a tree. "Executing an action may result in multiple states, which results in a 'hypertree'."
- law of sines: A trigonometric relation between sides and angles in a triangle used in geometric reasoning. "Optionally, Yuclid can add sines of the angles to the ratios AR table, together with the law of sines."
- Lean 4: The version of the Lean interactive theorem prover used for formal verification throughout the paper. "a proof written in a machine-verifiable language like Lean 4"
- Lean REPL: The interactive Read–Eval–Print Loop for Lean, used to check and give feedback on formalizations. "After sending the formalizations from (3) to the Lean REPL, we communicate any error messages back and ask for corrections."
- Mathlib: The main mathematical library for Lean, containing formalized definitions and theorems used in proofs. "using the Lean 4 proof language and its mathematical library Mathlib, without gaps or unsound axioms like sorryAx."
- MCGS (Monte Carlo Graph Search): A graph-based variant of Monte Carlo search that reuses equivalent states and actions. "The search algorithm is a highly parallel Monte Carlo Graph Search (MCGS) using a large transformer as its policy and value function."
- MCTS (Monte Carlo Tree Search): A simulation-based planning algorithm that balances exploration and exploitation in a tree. "This builds on Monte Carlo Tree Search (MCTS) with a learned value function, in the spirit of Expert Iteration and AlphaZero."
- metavariables: Unresolved placeholders in Lean goals/terms that defer instantiation during elaboration. "States are split by goals up to metavariables"
- minimax: An optimization principle for AND/OR search where success requires all subgoals of actions but any proving action suffices for a state. "This AND/OR structure makes finding a proof equivalent to a minimax problem."
- nlinarith: A Lean tactic for nonlinear arithmetic reasoning, often reducing goals to linear forms. "Later, however, it replaced this calculus-based proof with a more direct proof using the nlinarith tactic."
- PUCT (Predictor Upper Confidence bound applied to Trees): A tree policy that adds an exploration bonus weighted by a prior policy. "Our search algorithm uses a variant of the PUCT (Predictor Upper Confidence bound applied to Trees) formula, where the exploration bonus is weighted by a prior policy."
- Pythagorean Theorem: The relation between the squares of side lengths in a right triangle, used within an AR table of squared lengths. "This table allows us to efficiently state the Pythagorean Theorem, as well as its more general form"
- sorryAx: An unsafe Lean axiom that allows gaps in proofs; disallowed for verified solutions. "without gaps or unsound axioms like sorryAx."
- test-time training (TTT): Updating the model during inference using newly collected search traces to specialize performance. "we also use a form of test-time training (TTT)"
- value function: A learned estimator that guides search by predicting the difficulty or probability of success from a state. "This builds on Monte Carlo Tree Search (MCTS) with a learned value function"