First Proof: Testing AI on Real Mathematical Research
This presentation examines a groundbreaking initiative to evaluate AI performance on authentic, unpublished research-level mathematics problems. Unlike traditional benchmarks that rely on competition-style questions or publicly available data, First Proof introduces ten carefully curated problems spanning diverse mathematical domains, from algebraic combinatorics to symplectic geometry, that have never appeared in any public forum. The work addresses a fundamental question: can contemporary AI systems independently solve genuine research problems without human oversight? Preliminary results reveal that even the most advanced language models consistently fail to solve these problems in single attempts, highlighting both the challenge of mathematical reasoning and the necessity of contamination-free benchmarks for measuring real progress in AI capabilities.
Script
Can artificial intelligence truly solve mathematical problems that no one has seen before? That question sits at the heart of a challenge facing the entire field, because almost every benchmark we use might already be contaminated by training data.
To understand why this matters, we need to examine what existing benchmarks actually measure.
Building on that concern, existing benchmarks suffer from four critical limitations. They rely on competition problems that bear little resemblance to actual mathematical research, they draw from public sources that models may have already seen during training, they use artificially constructed questions that miss the complexity of real problems, and they emphasize symbolic answers over the rigorous proof-building that defines mathematics.
The authors designed a fundamentally different methodology to address these gaps.
The solution is elegant: ten genuinely new mathematical problems, each drawn from the authors' own unpublished research work. These questions have never appeared online, they emerged as real technical lemmas during active research, their proofs are compact enough for current model context windows, and they span seven diverse domains including algebraic combinatorics, spectral graph theory, and symplectic geometry.
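To make that design concrete, a single benchmark entry could be represented along the following lines. This is purely an illustrative sketch: the class name, field names, and layout are assumptions for exposition, not the authors' actual data format.

```python
from dataclasses import dataclass

@dataclass
class FirstProofProblem:
    """Hypothetical layout for one benchmark entry (illustrative only;
    the field names are assumptions, not taken from the paper)."""
    statement: str               # the precise mathematical claim to be proved
    domain: str                  # e.g. "algebraic combinatorics", "symplectic geometry"
    source: str                  # the unpublished research project the lemma arose from
    reference_proof_tokens: int  # compact enough to fit a current model context window
```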
This comparison makes the distinction clear: where traditional benchmarks use competition problems with public exposure and automatic grading of answers, First Proof demands complete proof construction for never-before-seen research questions, with correctness assessed by human mathematical experts.
So what happened when the authors tested state-of-the-art models on these problems?
The preliminary findings are sobering. When tested with strict one-shot prompting, without any human guidance or iterative refinement, the most advanced publicly available models consistently failed to solve these research-level problems. To preserve experimental integrity, the reference solutions remain encrypted, with their public release scheduled for a later date.
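As a rough sketch of what such a protocol can look like in practice, a strict one-shot attempt issues exactly one prompt per problem, and the reference solutions can be committed to in advance, for instance by publishing only a digest until the scheduled release. The function names and the `query_model` callable below are hypothetical, this is not the authors' harness, and a hash commitment stands in here for the encryption scheme the paper describes.

```python
import hashlib

def commit_to_solution(reference_proof: str) -> str:
    # Publish only this digest now; releasing the plaintext later shows the
    # reference solution was fixed before any model attempts were graded.
    # (A stand-in for the paper's encrypted, scheduled-release answers.)
    return hashlib.sha256(reference_proof.encode("utf-8")).hexdigest()

def one_shot_attempt(query_model, problem_statement: str) -> str:
    # Single prompt, single response: no follow-up hints, no retries,
    # and no human-in-the-loop refinement of the model's output.
    prompt = (
        "Prove or disprove the following statement, giving a complete, "
        "rigorous proof:\n\n" + problem_statement
    )
    return query_model(prompt)  # query_model is a hypothetical single-call client
```

Whether a given transcript counts as a solution is then judged by human mathematical experts, not by matching the output against the reference text.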
These results illuminate a crucial boundary in AI capabilities. The consistent failures show that current systems cannot yet autonomously tackle novel research problems, and because no model could have seen these questions before, any future success on them would represent authentic problem-solving rather than retrieval, making First Proof a legitimate benchmark for measuring progress in mathematical reasoning.
Looking ahead, the authors commit to releasing a second set of questions and welcome community participation by sharing transcripts of experiments on the problems. They also acknowledge that as AI capabilities mature, future iterations could expand beyond proof construction to encompass the creative dimensions of mathematical research, including problem formulation and theory innovation.
First Proof establishes that measuring real AI progress in mathematics requires genuinely new problems that no system could have seen before. Visit EmergentMind.com to explore more cutting-edge research and stay at the frontier of AI development.