- The paper introduces miniF2F, a dataset designed as a unified benchmark for assessing neural theorem proving across multiple formal systems.
- It reports baseline results using GPT-f in Lean and Metamath, showing a large performance gap that the authors attribute to the availability of high-level tactics in Lean.
- The study advocates extending the benchmark to additional areas of mathematics and encouraging community-driven formalization across more systems to advance AI-driven proving.
An Analysis of the miniF2F Benchmark for Neural Theorem Proving
The paper introduces \textsf{miniF2F}, a dataset designed as a cross-system benchmark for neural theorem proving on Olympiad-level mathematics. The benchmark targets multiple formal systems, including Metamath, Lean, Isabelle, and HOL Light, and positions itself as a unified evaluation framework for investigating the mathematical reasoning capabilities of neural models.
Benchmark Overview
\textsf{miniF2F} comprises 488 problem statements, split evenly into validation and test sets, drawn from well-known competitions such as the AIME, AMC, and the International Mathematical Olympiad, along with material from high-school and undergraduate mathematics courses. The statements are manually formalized in each of the supported systems, enabling both cross-system comparison and evaluation across a range of difficulty.
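To make the format concrete, below is a minimal, hypothetical Lean (mathlib) statement in the style of miniF2F's formalizations: hypotheses appear as named arguments and the goal is a closed mathematical claim. The theorem name and the problem itself are illustrative and not taken from the dataset.

```lean
import data.real.basic
import tactic

-- Hypothetical miniF2F-style statement (not from the dataset): hypotheses as
-- named arguments over ℝ, goal as an equation, proof closed by a tactic.
theorem algebra_style_example (x : ℝ) (h : 2 * x + 3 = 11) : x = 4 :=
by linarith
```

In the benchmark itself only the statements are fixed; producing a proof (here, the `by linarith` script) is the task left to the prover.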
Baseline Results
The paper provides baseline evaluations using GPT-f, a prover built on a GPT-3-style decoder-only Transformer language model adapted for formal mathematics. In Lean, GPT-f achieved a Pass@1 of 24.6% and a Pass@8 of 29.2%, whereas the Metamath baseline scored substantially lower, a gap the authors attribute largely to Metamath's lack of high-level tactics.
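The Pass@k figures can be read as the fraction of statements for which at least one of k proof-search attempts succeeds. The short Python sketch below illustrates that counting under this reading; the `attempts` mapping and the function name are hypothetical, and the paper's actual protocol additionally fixes per-attempt search budgets, which this sketch does not model.

```python
from typing import Dict, List

def pass_at_k(attempts: Dict[str, List[bool]], k: int) -> float:
    """Fraction of statements solved by at least one of the first k attempts."""
    solved = sum(any(results[:k]) for results in attempts.values())
    return solved / len(attempts)

# Toy example: two statements, 8 proof-search attempts each.
attempts = {
    "stmt_a": [False, False, True, False, False, False, False, False],
    "stmt_b": [False] * 8,
}
print(pass_at_k(attempts, 1))  # 0.0 -> Pass@1
print(pass_at_k(attempts, 8))  # 0.5 -> Pass@8
```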
Implications and Discussion
The results demonstrate the distinct advantage of formal systems like Lean that provide advanced tactics, allowing neural models to steer powerful automation within proofs. A key insight is the importance of high-level tactics for tackling more complex mathematical statements, which are much harder to handle in a low-level system like Metamath, where every inference step must be written out explicitly; the sketch below illustrates the contrast.
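As a rough illustration of what high-level tactics buy the prover, the Lean snippet below closes a small nonlinear inequality with a single `nlinarith` call (given a squared-term hint); in a low-level system such as Metamath, the same fact would have to be derived through many explicit inference steps. The example is illustrative and not drawn from miniF2F.

```lean
import data.real.basic
import tactic

-- Illustrative, not from the dataset: one high-level tactic call discharges a
-- nonlinear goal that would require many explicit steps in a low-level system.
example (a b : ℝ) : a ^ 2 + b ^ 2 ≥ 2 * a * b :=
by nlinarith [sq_nonneg (a - b)]
```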
By joining the broader ecosystem of benchmarks, \textsf{miniF2F} aims to standardize evaluation metrics and foster collaboration across formal systems. It enables systematic comparison of neural theorem-proving strategies and represents a step toward the IMO Grand Challenge.
Future Directions
Moving forward, expanding \textsf{miniF2F} to cover additional areas of mathematics, such as geometry and combinatorics, would enhance its comprehensiveness and applicability. The paper also calls for community efforts to contribute formalizations in further systems such as Coq, promoting an ecosystem of shared resources.
In conclusion, \textsf{miniF2F} marks a significant advance in benchmarking neural theorem proving, offering a robust platform for developing and comparing AI-driven mathematical reasoning in formal environments. The initiative has the potential to drive further progress toward automating the solution of difficult mathematical problems.