Insights on APE-Bench I: Emergence of Automated Proof Engineering
The paper "APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries" proposes an innovative shift in the evaluation of automated theorem proving by introducing the concept of Automated Proof Engineering (APE). This ambitious framework aims to broaden existing models from mere theorem-solving tasks to encompassing comprehensive proof engineering challenges typical of real-world formal mathematics libraries like Lean's Mathlib4. Not confined to static benchmarks, APE emphasizes emulating iterative, lifelike workflows characteristic of software engineering disciplines.
Core Contributions
The authors present APE-Bench I, the first benchmark designed to assess this broader scope of proof engineering. It is constructed from real Mathlib4 commit histories to capture a diverse range of proof tasks, offering a more nuanced view of LLM capabilities. Tasks are posed at the file level and go beyond proving individual theorems to include feature addition, refactoring, and bug fixing, as the sketch below illustrates.
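To make the task format concrete, here is a minimal, hypothetical sketch of the kind of file-level edit a model might be asked to produce: given a natural-language instruction and the current file contents, emit a patch that adds a new lemma while keeping the rest of the file compiling. This example is illustrative only and is not drawn from the benchmark itself; the names and instruction are invented.

```lean
-- Hypothetical task sketch (not an actual APE-Bench I instance).
-- Instruction: "Add a lemma showing that appending the empty list on the
-- right is the identity, proved by induction on the list."

namespace Demo

/-- New lemma requested by the task; the surrounding file must still compile. -/
theorem append_nil (l : List α) : l ++ [] = l := by
  induction l with
  | nil => rfl
  | cons x xs ih => rw [List.cons_append, ih]

end Demo
```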
Supporting infrastructure is central to this effort, most notably Eleanstic, a parallel verification system that efficiently checks proofs against the specific Mathlib version each task was drawn from. Verification proceeds in two stages: Lean's compiler confirms that a submitted patch builds, and an LLM-based semantic check assesses whether it actually satisfies the task instruction, keeping evaluation close to real development cycles.
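The two-stage flow can be sketched at a high level. The snippet below is an illustrative approximation only, not Eleanstic's actual interface: it assumes a Lake project pinned to the Mathlib commit the task was drawn from, uses `lake env lean` for the syntactic pass, and stands in a placeholder `llm_semantic_judge` for the LLM-based semantic check.

```python
import subprocess
from pathlib import Path


def syntactic_check(project_dir: Path, lean_file: Path) -> bool:
    """Stage 1: the patched file must compile against the pinned Mathlib commit."""
    result = subprocess.run(
        ["lake", "env", "lean", str(lean_file)],
        cwd=project_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def llm_semantic_judge(instruction: str, patch: str) -> bool:
    """Stage 2 (placeholder): an LLM judge decides whether the patch fulfils
    the natural-language instruction, not merely whether it compiles."""
    raise NotImplementedError("Call your preferred LLM judge here.")


def verify_task(project_dir: Path, lean_file: Path, instruction: str, patch: str) -> bool:
    # A patch counts as solved only if it passes both stages.
    return syntactic_check(project_dir, lean_file) and llm_semantic_judge(instruction, patch)
```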
Empirical Evaluation and Findings
The evaluation tests leading LLMs on APE-Bench I for their ability to handle structurally complex proof tasks. Among them, OpenAI's o3-mini performs notably well, especially on larger edits, though a marked rate of semantically incorrect outputs points to room for improvement in instruction fidelity. Claude 3.7 Sonnet (thinking mode), in contrast, excels in semantic precision but succeeds on a narrower range of tasks.
A critical observation is that performance degrades substantially as task complexity increases, reflecting a general difficulty in scaling beyond localized modifications. While models can often produce syntactically correct output, they frequently fall short of the multi-step reasoning required for comprehensive proof engineering.
Implications and Future Directions
This work lays the groundwork for an approach to theorem proving that bridges the gap between isolated mathematical reasoning and the maintenance of whole formal libraries. By encouraging agentic workflows, it aligns LLM capabilities with real-world needs for scalable, automated upkeep of formal proofs.
Looking ahead, the APE framework points toward progressively autonomous systems capable of planning and executing edits across large formal libraries. Future benchmarks are expected to incorporate multi-file and project-scale tasks, pushing LLMs toward a more integral role in the evolution of formal libraries.
This realignment of LLM evaluation toward realistic proof engineering invites a renewed focus in model development on integrated systems that combine symbolic reasoning with the structural refinements expected in professional proof work. In sum, APE-Bench I marks a pivotal step in applying machine learning to the broader demands of formal mathematics engineering.