Lean Benchmarks in Systems & Theorem Proving

Updated 18 October 2025
  • Lean Benchmarks are systematic standards that measure the efficiency, scalability, and reliability of lightweight ("lean") learned models in database systems and of tooling in Lean-based formal theorem-proving environments.
  • They utilize concrete metrics such as regret for learned optimizers and pass rates for proof tactics to drive methodical improvements and comparative evaluations.
  • Benchmarking frameworks integrate advanced techniques including multi-armed bandit strategies, neural networks, and SMT solver integrations to enhance validation and system performance.

Lean Benchmarks are systematic approaches for measuring, comparing, and advancing the performance, scalability, and reliability of systems and algorithms within Lean-related environments. The term appears both in database systems, where "lean" denotes efficient, lightweight learned index or optimizer models, and in the Lean theorem-proving community, where benchmarking encompasses datasets, model integrations, proof-search frameworks, translation pipelines, and infrastructure metrics. These benchmarks are central to guiding research and development and to ensuring robust, reproducible evaluation of formal tools, LLMs, and system components.

1. Efficiency in Learned Indexes and Optimizers

Lean benchmarks for learned index structures focus on quantifying performance improvements in data systems using machine learning models as index replacements. The central metric is regret, defined as the difference in execution time between the plan chosen by a learned optimizer and the best available simple algorithm. For example, in database systems, minimization of regret ($\delta = T_{\text{chosen}} - T_{\text{optimal}}$) provides a direct measure of how closely a learned model approximates optimal plan selection. Such benchmarks test the model over a diversity of scenarios, including variable query selectivity, real-world queries (e.g., the JOB dataset), and system robustness to schema changes, data shifts, caching, and pipelining effects.
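As an illustration of how this metric can be tallied over a workload, the following is a minimal sketch; the `learned_optimizer.choose` interface, the callable plans, and the workload objects are hypothetical stand-ins for whatever a concrete benchmark harness provides.

```python
import statistics
import time

def run_plan(plan, query):
    """Execute `query` with `plan` and return elapsed wall-clock seconds.
    (Placeholder: a real harness would call into the database engine.)"""
    start = time.perf_counter()
    plan(query)
    return time.perf_counter() - start

def regret_for_query(learned_optimizer, candidate_plans, query):
    """Regret delta = T_chosen - T_optimal for a single query."""
    chosen = learned_optimizer.choose(query, candidate_plans)  # hypothetical interface
    t_chosen = run_plan(chosen, query)
    t_optimal = min(run_plan(p, query) for p in candidate_plans)
    return t_chosen - t_optimal

def summarize_regret(learned_optimizer, candidate_plans, workload):
    """Aggregate regret over a workload; medians and tails feed the boxplots."""
    deltas = sorted(regret_for_query(learned_optimizer, candidate_plans, q)
                    for q in workload)
    return {
        "mean": statistics.mean(deltas),
        "median": statistics.median(deltas),
        "p95": deltas[int(0.95 * (len(deltas) - 1))],
    }
```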

Performance profiles are visualized using boxplots to expose both average and tail behaviors. Empirical evidence shows that even with modest data (∼100 queries), learned index models can outperform classical optimizers like those in PostgreSQL, provided that underlying simple optimizers are well-engineered to avoid catastrophic failures. The use of contextual multi-armed bandit formulations and tree convolutional neural networks (TCNN) supports modularity and fast adaptation, yielding robust, "lean" optimization with low training and tuning overhead.
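To make the contextual multi-armed bandit framing concrete (plans as arms, query features as context, negative execution time as reward), here is a deliberately simplified epsilon-greedy selector; the coarse hashable context buckets and the fixed epsilon are assumptions for illustration, not the TCNN-based approach described above.

```python
import random
from collections import defaultdict

class EpsilonGreedyPlanSelector:
    """Treat each candidate plan as an arm; reward is negative execution time."""

    def __init__(self, plans, epsilon=0.1):
        self.plans = plans                # plan names or callables (must be hashable)
        self.epsilon = epsilon
        self.totals = defaultdict(float)  # (context, plan) -> summed reward
        self.counts = defaultdict(int)    # (context, plan) -> number of pulls

    def choose(self, context):
        """context: a hashable feature bucket, e.g., a coarse selectivity bin."""
        if random.random() < self.epsilon:
            return random.choice(self.plans)          # explore
        def avg(plan):                                # exploit: best observed average reward
            key = (context, plan)
            return self.totals[key] / self.counts[key] if self.counts[key] else 0.0
        return max(self.plans, key=avg)

    def update(self, context, plan, exec_time):
        key = (context, plan)
        self.totals[key] += -exec_time                # lower time -> higher reward
        self.counts[key] += 1
```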

2. Benchmarks for Formalization in Lean Theorem Proving

Benchmarks in Lean theorem proving are constructed around datasets that facilitate autoformalization, proof synthesis, model evaluation, and translation between natural and formal languages.

  • Lean Workbook (Ying et al., 6 Jun 2024): An iterative synthetic data pipeline produces ∼57K formal–informal question pairs, including contest-level problems, IMO questions, and formal proofs. Compilation and natural language inference correctness (CPN, NPN) serve as key evaluation metrics, with substantial performance boosts observed after several autoformalization cycles.
  • LEAN-GitHub (Wu et al., 24 Jul 2024): Aggregated from 147 repositories, containing 28,597 theorems and >218,000 tactic invocations, this dataset enables benchmarking of theorem provers on diverse problem fields. Models fine-tuned on LEAN-GitHub achieve state-of-the-art pass rates on miniF2F (54.5% Pass@64), ProofNet, and Putnam benchmarks.
  • Herald (Gao et al., 9 Oct 2024): A parallel natural language–formal language corpus (∼580K pairs) built via hierarchical, dependency-aware translation from Mathlib4; it excels in both routine and graduate-level applications (93.2% Pass@128 on miniF2F).

These datasets not only benchmark accuracy (via problem pass rates, compilation success, and translation fidelity), but also serve as resources for model training and cross-system comparison.
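For reference, Pass@k figures of the kind cited above are commonly computed with the unbiased estimator used in code-generation benchmarks (sample n attempts per problem, count the c that verify); whether each cited paper uses this exact estimator is not specified here, and the per-theorem counts in the example are made-up values.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k attempts
    drawn without replacement from n sampled attempts is correct, given that
    c of the n attempts verified."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative usage: 64 sampled proofs per theorem, averaged over the benchmark.
per_theorem = [(64, 3), (64, 0), (64, 12)]  # (attempts, verified proofs), made-up values
score = sum(pass_at_k(n, c, 64) for n, c in per_theorem) / len(per_theorem)
```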

3. Evaluation of Proof Search and Tactic Suggestion Models

Recent benchmarks assess the integration and performance of LLMs and proof search agents within Lean.

  • LLMSTEP (Welleck et al., 2023) provides real-time Lean tactic suggestions via a Lean 4 tactic that calls a backend server hosting an LLM. Proof-step suggestions are benchmarked on the mathlib4 and miniF2F datasets, with success rates such as 47.6–50.1% (mathlib4-test) and 27.9% (miniF2F-test) indicating competitive baseline performance. The modular, server-based architecture supports swapping in different models and rapid experimentation across compute modes (CPU, GPU, vLLM).
  • Lean-STaR (Lin et al., 14 Jul 2024) introduces joint generation of informal thoughts and formal proof tactics. It achieves state-of-the-art benchmark performance (e.g., 46.3% Pass@64 on miniF2F) by generating a natural-language rationale before each tactic prediction and leveraging expert iteration for continual improvement. Benchmarking reveals that thought augmentation and larger sampling budgets directly improve theorem-proving robustness and efficiency.
  • InternLM2.5-StepProver (Wu et al., 21 Oct 2024) employs large-scale expert iteration on datasets like Lean-workbook-plus (>20,000 CPU days), incorporating critic models to guide search. Notable benchmark gains include 65.9% pass rate on miniF2F and significant improvements in proof rates on other formal datasets. The process uncovers log-linear trends between proof length, CPU usage, and solved problem count, which provide further quantitative insight into the scaling properties of Lean-based automated theorem proving.
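As a highly simplified sketch of the critic-guided proof search that such systems benchmark, the loop below expands the most promising open proof state first: a policy proposes tactic candidates and a critic scores successor states. The `policy`, `critic`, `apply_tactic`, and `is_solved` interfaces are hypothetical placeholders, not the actual InternLM2.5-StepProver or LLMSTEP APIs.

```python
import heapq
import itertools

def best_first_proof_search(root_state, policy, critic, apply_tactic,
                            is_solved, budget=1000, samples_per_state=8):
    """Critic-guided best-first search; returns a tactic list or None."""
    counter = itertools.count()  # tie-breaker so the heap never compares states
    frontier = [(-critic(root_state), next(counter), root_state, [])]
    while frontier and budget > 0:
        _, _, state, proof = heapq.heappop(frontier)
        budget -= 1
        for tactic in policy(state, n=samples_per_state):  # sampled tactic candidates
            new_state = apply_tactic(state, tactic)
            if new_state is None:                          # tactic failed to apply
                continue
            if is_solved(new_state):
                return proof + [tactic]
            heapq.heappush(frontier,
                           (-critic(new_state), next(counter), new_state, proof + [tactic]))
    return None  # search budget exhausted without closing the goal
```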

4. Infrastructure for Verification, Batch Processing, and Search

Lean system benchmarks also cover infrastructure components for verification, batch processing, and search:

  • Lean4Lean (Carneiro, 21 Mar 2024) documents a verified Lean typechecker implemented in Lean itself, benchmarking raw speed against the native C++ kernel (Lean4Lean runs 20–50% slower but scales to verify all of mathlib). Formal verification bolsters trust in kernel soundness and facilitates future evolution of Lean's type theory.
  • Kimina Lean Server (Santos et al., 29 Apr 2025) provides batch verification of Lean proof scripts via a REST API, scaling across 8–60 CPUs. Measurements of CPU scaling and LRU caching show near-linear gains in iteration rate and a 41% reduction in per-iteration time due to caching. The server supports mass extraction and infotree-based proof analysis, serving as a benchmark for verification throughput and interaction latency in reinforcement learning pipelines.
  • LeanExplore (Asher, 4 Jun 2025) benchmarks semantic retrieval, dependency tracing, and hybrid ranking for Lean 4 declarations, supporting AI-driven workflows. The system’s scoring algorithm combines embedding-based semantic similarity, BM25+ keyword relevance, and PageRank connectivity, enabling benchmarking of retrieval accuracy, ranking performance, and LLM context integration.
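A minimal sketch of hybrid ranking in the spirit of this combination is shown below: each signal is min-max normalized and mixed with fixed weights. The weights and the normalization scheme are assumptions for illustration, not LeanExplore's published scoring algorithm.

```python
def minmax(values):
    """Scale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def hybrid_rank(candidates, w_sem=0.5, w_bm25=0.3, w_pr=0.2):
    """candidates: dicts with 'name', 'semantic', 'bm25', and 'pagerank' scores."""
    sem = minmax([c["semantic"] for c in candidates])
    bm25 = minmax([c["bm25"] for c in candidates])
    pr = minmax([c["pagerank"] for c in candidates])
    scored = [(w_sem * s + w_bm25 * b + w_pr * p, c["name"])
              for s, b, p, c in zip(sem, bm25, pr, candidates)]
    return [name for _, name in sorted(scored, reverse=True)]

# Illustrative usage with made-up scores for three Lean declarations.
docs = [
    {"name": "Nat.add_comm", "semantic": 0.91, "bm25": 7.2, "pagerank": 0.004},
    {"name": "Nat.mul_comm", "semantic": 0.84, "bm25": 6.1, "pagerank": 0.003},
    {"name": "List.length_append", "semantic": 0.40, "bm25": 2.0, "pagerank": 0.001},
]
print(hybrid_rank(docs))
```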

5. SMT Solver Integration and Automated Proof Checking

Automation benchmarks in Lean are advanced through the integration of proof-producing SMT solvers:

  • Lean-SMT (Mohamed et al., 21 May 2025) provides a Lean tactic that translates Lean goals into SMT problems (via preprocessing and translation to SMT-LIB), invokes proof-producing SMT solvers (e.g., cvc5), and reconstructs the resulting CPC proofs into native proofs checked by the Lean kernel. Benchmark evaluations on Sledgehammer (∼5,000 problems) and SMT-LIB (∼24,000 benchmarks) report sub-second replay times in 98% of cases, competitive standalone proof checking, and a minimal expansion of the trusted core. Performance metrics (proof-check times, coverage rates) and cactus plots detail comparative scalability and trade-offs.
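To make the SMT-LIB side of such a pipeline concrete, the sketch below emits a toy goal as an SMT-LIB script with proof production enabled and hands it to an external solver. The solver binary name and its presence on PATH are assumptions, and the actual Lean-SMT tactic performs the translation and the subsequent proof reconstruction inside Lean rather than through a wrapper like this.

```python
import subprocess
import tempfile

# Toy goal: x + 1 > x over the integers, asserted negated so that "unsat" means "proved".
SMT_SCRIPT = """\
(set-option :produce-proofs true)
(set-logic QF_LIA)
(declare-const x Int)
(assert (not (> (+ x 1) x)))
(check-sat)
(get-proof)
"""

def check_with_solver(script: str, solver: str = "cvc5") -> str:
    """Write the SMT-LIB script to a temporary file and run a proof-producing solver."""
    with tempfile.NamedTemporaryFile("w", suffix=".smt2", delete=False) as f:
        f.write(script)
        path = f.name
    result = subprocess.run([solver, path], capture_output=True, text=True, timeout=60)
    return result.stdout  # expected: "unsat" followed by the solver's proof object

if __name__ == "__main__":
    print(check_with_solver(SMT_SCRIPT))
```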

6. Practical Applications and Industry Benchmarks

Lean benchmarks extend into industry–academia collaborations through Lean R&D frameworks (Kalinowski et al., 20 Jan 2025); here "Lean" refers to lean development methodology rather than the theorem prover. Phased, structured development processes are evaluated in projects with Petrobras and Americanas, using questionnaires and ROI metrics to assess effectiveness. Benchmarks include the speed and value of MVP delivery (energy savings, patent awards, operational improvements) and structured phase engagement rates as evidence of the Lean methodology's business impact.

7. Implications and Directions

Aggregate benchmarking in Lean—spanning learned indexes, formal datasets, proof tactics, infrastructure, SMT automation, and industry practices—clarifies scaling limits, enables reproducible evaluation, informs methodological refinement, and accelerates progress in both academic and enterprise settings. Trends such as decreasing performance gaps (via lean learned models, hybrid LLM augmentations, and verified infrastructure) suggest ongoing improvement toward efficient, trustworthy, and universally applicable Lean systems. Open-source releases of datasets, models, and tools underpin continual benchmarking for the community, supporting rapid iterations in research and deployment.

In sum, Lean Benchmarks provide essential reference points for evaluating and advancing efficient, reliable computational reasoning in both database and formal mathematical contexts.
