Benchmarks for ATPs using ETP data

Develop well-calibrated benchmarking suites for automated theorem provers (ATPs) that leverage the Equational Theories Project (ETP) dataset; specify evaluation protocols, metrics, and datasets that meet community standards for assessing ATP performance on equational reasoning at scale.

Background

A secondary aim of the ETP was to use its large, formally verified corpus of implications and non-implications to create benchmarks for evaluating automated theorem provers (ATPs).

Although the project made extensive use of ATPs, the authors did not develop such benchmarks to community standards. They explicitly identify this as an open problem and provide an informal field report to guide future work.
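
To make "evaluation protocols and metrics" concrete, here is a minimal sketch of how per-problem prover results might be summarized. Nothing in it comes from the ETP paper: the `Result` record, the `"implies"`/`"refutes"`/`"unknown"` labels, the `summarize` function, and the choice of a PAR-2 style score (a penalized average runtime of the kind used in solver competitions) are all illustrative assumptions, and a community-standard benchmark would need considerably more care (for example, hardware normalization and difficulty calibration).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One prover run on one problem (hypothetical record layout)."""
    problem: str    # identifier of the implication being tested
    expected: str   # ground truth: "implies" or "refutes"
    answer: str     # prover output: "implies", "refutes", or "unknown"
    seconds: float  # wall-clock time used by the prover


def summarize(results, time_limit):
    """Return solve rate, wrong-answer count, and a PAR-2 style score.

    A run counts as solved only if the answer matches the ground truth
    and it finished within the time limit. Every unsolved run is charged
    twice the time limit (an assumed penalty, not an ETP convention).
    """
    if not results:
        raise ValueError("results must be non-empty")
    solved = 0
    wrong = 0
    penalized_time = 0.0
    for r in results:
        if r.answer == r.expected and r.seconds <= time_limit:
            solved += 1
            penalized_time += r.seconds
        else:
            if r.answer not in ("unknown", r.expected):
                wrong += 1  # definite answer that contradicts ground truth
            penalized_time += 2 * time_limit
    return {
        "solve_rate": solved / len(results),
        "wrong_answers": wrong,
        "par2": penalized_time / len(results),
    }


if __name__ == "__main__":
    demo = [
        Result("a", "implies", "implies", 1.0),
        Result("b", "refutes", "unknown", 10.0),
        Result("c", "implies", "refutes", 2.0),
    ]
    print(summarize(demo, time_limit=10.0))
```

Reporting wrong answers separately from timeouts is a deliberate choice in this sketch: because the ETP corpus is formally verified, a prover answer that contradicts it signals a soundness problem rather than a mere performance gap.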

References

The objective of using the data from the ETP to establish well-calibrated benchmarks to evaluate ATPs remains an interesting open problem; the participants of this project did not have the required expertise to develop and test such benchmarks to the standards expected in the area.

The Equational Theories Project: Advancing Collaborative Mathematical Research at Scale (2512.07087 - Bolan et al., 8 Dec 2025), Introduction, Subsection Outcomes