FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models (2505.02735v1)

Published 5 May 2025 in cs.AI and cs.LG

Abstract: Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized LLMs for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in formal reasoning settings. We believe that FormalMATH provides a robust benchmark for evaluating formal mathematical reasoning.

Summary

An Overview of FormalMATH: Benchmarking Formal Mathematical Reasoning of LLMs

The paper presents FormalMATH, a comprehensive benchmark designed to evaluate the formal mathematical reasoning capabilities of LLMs. Constructed in Lean4, the benchmark contains 5,560 formally verified problems spanning mathematical domains from high-school Olympiad challenges to undergraduate-level theorems. By addressing the scope and scale limitations of existing benchmarks, FormalMATH provides a substantial challenge for evaluating LLM performance in formal reasoning.
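
To make the benchmark format concrete, the following is a minimal Lean4 sketch of the kind of entry such a benchmark contains; the theorem name, statement, and proof are illustrative assumptions, not an item taken from FormalMATH.

```lean
-- Illustrative only: a Lean 4 statement in the general style of an
-- autoformalized Olympiad-level inequality. The name, statement, and
-- proof are hypothetical and not drawn from the FormalMATH dataset.
import Mathlib

theorem sum_sq_ge_two (a b : ℝ) (h : a + b = 2) :
    a ^ 2 + b ^ 2 ≥ 2 := by
  nlinarith [sq_nonneg (a - b), sq_nonneg (a + b - 2)]
```

In a benchmark of this kind, only the statement is fixed; the prover must supply everything after `by`, and the Lean4 compiler acts as the ground-truth verifier.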

The authors introduce a novel human-in-the-loop autoformalization pipeline aimed at reducing the manual effort required to formalize mathematical statements while ensuring high fidelity to the original natural-language problems. The pipeline first uses specialized LLMs to autoformalize statements, then applies multi-LLM semantic verification and a negation-based disproof filter that relies on off-the-shelf LLM-based provers. These automatic filters retain 72.09% of statements before manual verification, significantly lowering expert annotation costs.
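
A minimal sketch of this filtering logic is shown below, assuming three hypothetical LLM-backed helpers: a formalizer with an `autoformalize` method, verifier models with a `semantic_check` method, and a prover with a `prove_negation` method. The names and interfaces are assumptions for illustration, not the authors' code.

```python
def filter_candidates(problems, formalizer, verifiers, prover):
    """Hypothetical skeleton of the autoformalization filtering stages."""
    retained = []
    for problem in problems:
        # Statement autoformalization by a specialized LLM.
        statement = formalizer.autoformalize(problem.natural_language)

        # Multi-LLM semantic verification: keep only statements that every
        # verifier judges faithful to the original natural-language problem.
        if not all(v.semantic_check(problem.natural_language, statement)
                   for v in verifiers):
            continue

        # Negation-based disproof filtering: if a prover can prove the
        # negation, the formalized statement is false and is discarded.
        if prover.prove_negation(statement):
            continue

        retained.append(statement)  # passes to human expert verification
    return retained
```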

Experimental evaluation of state-of-the-art LLM-based theorem provers on the FormalMATH benchmark reveals considerable limitations of current systems. Even the strongest model achieves a success rate of only 16.46% under practical sampling budgets, and the provers exhibit pronounced domain bias (e.g., strong on algebra, weak on calculus) and over-reliance on simplified automation tactics. Furthermore, the paper identifies a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought settings, suggesting that human-written informal reasoning often introduces noise rather than clarity in formal reasoning.
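
The "success rate under a sampling budget" setup can be sketched as a best-of-N loop in which a statement counts as solved if any sampled proof is accepted by the Lean4 checker. The sketch below assumes a hypothetical `prover.sample_proof(statement)` call and a `lean_check(statement, proof)` wrapper around the compiler; it mirrors the evaluation protocol described above rather than the authors' exact harness.

```python
def success_rate(statements, prover, lean_check, budget=32):
    """Fraction of statements solved within a fixed per-problem budget."""
    solved = 0
    for statement in statements:
        # A statement is solved if any of `budget` sampled proofs compiles.
        if any(lean_check(statement, prover.sample_proof(statement))
               for _ in range(budget)):
            solved += 1
    return solved / len(statements)
```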

The research underscores the need to improve cross-domain generalization and deductive rigor in LLM-based mathematical reasoning systems. Future work might explore more sophisticated reinforcement learning approaches, better proof-search strategies, and architectural advances that bridge the gap between human-style informal reasoning and machine-oriented formal reasoning. Ultimately, FormalMATH is positioned as a robust benchmark for pushing the boundaries of formal reasoning in artificial intelligence research.