- The paper introduces CombiBench, a benchmark of 100 formal combinatorial problems in Lean 4, and Fine-Eval, a new evaluation framework for assessing LLM capabilities on fill-in-the-blank math problems.
- Experimental results using CombiBench and Fine-Eval reveal that current LLMs solve only a small fraction of complex combinatorial problems formalized in Lean 4, highlighting significant domain challenges.
- The difficulty LLMs face is attributed to the scarcity of formal combinatorial content in theorem libraries, prompting the authors to contribute formalized definitions to Mathlib to support future research.
This paper introduces CombiBench (arXiv:2505.03171), a new benchmark designed to evaluate the ability of LLMs to solve combinatorial mathematics problems within a formal theorem-proving environment. The research highlights that while neurosymbolic approaches have shown promise in areas such as algebra and geometry, combinatorics remains a challenging domain for LLMs, partly due to the lack of suitable benchmarks and formal theorem libraries.
CombiBench is a comprehensive benchmark of 100 combinatorial problems. Each problem includes both its informal natural-language statement and a corresponding formalization in Lean 4. The problems range in difficulty from middle-school exercises to university-level challenges and problems from prestigious competitions such as the International Mathematical Olympiad (IMO). The sources are diverse, including Hackmath, exercises from Brualdi's textbook "Introductory Combinatorics", IMO problems (excluding those requiring images), and other competitions such as APMO, Baltic Way, and USAMO. The problems span more than ten distinct combinatorial topics, mirroring the structure of Brualdi's textbook to ensure broad coverage of the field.
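To give a sense of what such an informal/formal pairing looks like, here is a toy example in Lean 4 with Mathlib; it is illustrative only and is not taken from CombiBench:

```lean
import Mathlib

/- Illustrative only (not a CombiBench problem): the informal statement
   "A set with 5 elements has exactly 2^5 subsets," paired with one possible
   Lean 4 formalization. -/
example : (Finset.range 5).powerset.card = 2 ^ 5 := by
  simp [Finset.card_powerset, Finset.card_range]
```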
A significant practical aspect of CombiBench is its formalization in Lean 4, a modern interactive theorem prover. The authors note the considerable challenge of formalizing combinatorial problems, even for experienced Lean users. This difficulty is attributed to the complexity of combinatorial concepts and, importantly, the limited availability of relevant theorems and definitions in Lean's mathematical library, Mathlib. Formalizing IMO-level problems often takes several hours, sometimes exceeding 8 hours for a single problem, highlighting the current gap in formalizing this domain. The formalizations in CombiBench tend to be longer than those in previous benchmarks like miniF2F and FIMO, further illustrating this challenge.
CombiBench also addresses the issue of evaluating LLMs on "fill-in-the-blank" style problems, which require generating a solution rather than just proving a given proposition. Unlike traditional benchmarks focused solely on proof completion, CombiBench includes problems where the solution needs to be determined first. To handle this, the paper proposes a new evaluation framework called Fine-Eval (Fill-in-the-blank in Lean Evaluation).
Fine-Eval operates in a standardized manner by interacting with an LLM and a Lean server. For both proof-based and fill-in-the-blank problems (the latter formalized in a style similar to PutnamBench, where the solution is left as a placeholder such as `sorry`), the LLM is tasked with generating the complete Lean 4 code, including the solution and its proof. The generated code must meet strict criteria: it cannot contain `sorry`, cannot define new axioms, must compile without errors, and must match the input formal statement exactly except for the filled-in blanks.
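As a rough illustration of this input format, the following hypothetical Lean 4 snippet (not an actual CombiBench entry; the names and the problem are invented) shows a fill-in-the-blank statement in the PutnamBench-like style described above, where the model must replace both `sorry` placeholders:

```lean
import Mathlib

-- Hypothetical fill-in-the-blank problem: the answer and the proof are both
-- left as `sorry`, and the model must fill in each of them.
abbrev example_answer : ℕ := sorry

theorem example_problem : Nat.choose 5 2 = example_answer := by
  sorry

/- A completed submission might read:
   abbrev example_answer : ℕ := 10
   theorem example_problem : Nat.choose 5 2 = example_answer := by decide -/
```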
For fill-in-the-blank problems, if the generated solution matches the ground truth exactly and the proof compiles, the problem is considered solved in the first stage. If the solution differs but the proof compiles, a second stage is initiated. In this stage, the system constructs a formal statement asserting the equivalence of the LLM's predicted solution and the ground truth (`LLM_solution = ground_truth`) and asks the LLM to prove this equivalence. If this second proof succeeds (and meets length constraints that rule out trivial outputs), the problem is also considered solved. A simplified one-stage evaluation using `rfl` for definitional equivalence is also proposed for practical applications.
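For concreteness, a second-stage check might look like the following Lean snippet (the answer, names, and statements are illustrative, not drawn from the benchmark):

```lean
import Mathlib

-- Hypothetical second-stage check: suppose the ground-truth answer is `10` and
-- the model instead submitted `Nat.choose 5 2` with a compiling proof.
-- Fine-Eval would then ask the model to prove that the two expressions agree.
theorem predicted_eq_ground_truth : Nat.choose 5 2 = 10 := by decide

-- The simplified one-stage variant accepts only definitional equality,
-- checked with `rfl`.
example : Nat.choose 5 2 = 10 := by rfl
```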
Experimental results using Fine-Eval on a range of LLMs (both general reasoning models and specialized theorem provers) demonstrate that CombiBench poses a significant challenge. None of the tested models was specifically trained on combinatorial formalization or fill-in-the-blank tasks, and all solved only a small fraction of the problems. Kimina-Prover Preview, a 72B-parameter model fine-tuned for theorem proving, achieved the best results, solving 7 of the 100 problems under both the "with solution" setting (where the ground-truth answer is provided) and the "without solution" setting. Smaller models and general reasoning models performed worse, highlighting the difficulty of the benchmark and the current limitations of LLMs in this domain. The limited success is attributed primarily to the scarcity of formal combinatorial content in existing theorem libraries and to the substantial gap between informal problem statements and their formal counterparts.
Practical deployment of Fine-Eval requires a Lean 4 server backend (the paper uses the Kimina Lean Server) and a system to manage interaction with the LLM, enforce the code constraints, and handle the two-stage verification for fill-in-the-blank problems. To address the underlying challenges, the authors plan to contribute formalized definitions from CombiBench to Mathlib and to develop a dedicated combinatorics theorem library that can better support future LLM training and development in this area. The benchmark dataset and evaluation code are open-sourced to facilitate further research.
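As a minimal sketch of the acceptance checks described earlier, the logic could look roughly like the following Python function. The `compiles` callback stands in for whatever Lean 4 backend is used (for example a Lean server); its interface here, the function name, and the crude line-based statement check are all assumptions, not the paper's implementation.

```python
import re
from typing import Callable

def accepts(submission: str, original_statement: str,
            compiles: Callable[[str], bool]) -> bool:
    """Sketch of Fine-Eval-style acceptance checks for one model submission."""
    # 1. The completed code may not leave `sorry` placeholders or add new axioms.
    if re.search(r"\bsorry\b", submission):
        return False
    if re.search(r"^\s*axiom\b", submission, re.MULTILINE):
        return False
    # 2. It must compile without errors on the Lean 4 backend.
    if not compiles(submission):
        return False
    # 3. The original formal statement must be preserved verbatim, except for
    #    the blank(s) the model was asked to fill (a crude line-based check).
    kept_lines = [line for line in original_statement.splitlines()
                  if "sorry" not in line]
    return all(line in submission for line in kept_lines)
```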