DeepMath-Creative: A Benchmark for Evaluating Mathematical Creativity of Large Language Models (2505.08744v1)

Published 13 May 2025 in cs.AI

Abstract: To advance the mathematical proficiency of LLMs, the DeepMath team has launched an open-source initiative aimed at developing an open mathematical LLM and systematically evaluating its mathematical creativity. This paper represents the initial contribution of this initiative. While recent developments in mathematical LLMs have predominantly emphasized reasoning skills, as evidenced by benchmarks on elementary to undergraduate-level mathematical tasks, the creative capabilities of these models have received comparatively little attention, and evaluation datasets remain scarce. To address this gap, we propose evaluation criteria for mathematical creativity and introduce DeepMath-Creative, a novel, high-quality benchmark comprising constructive problems across algebra, geometry, analysis, and other domains. We conduct a systematic evaluation of mainstream LLMs' creative problem-solving abilities using this dataset. Experimental results show that even under lenient scoring criteria -- emphasizing core solution components and disregarding minor inaccuracies, such as small logical gaps, incomplete justifications, or redundant explanations -- the best-performing model, O3 Mini, achieves merely 70% accuracy, primarily on basic undergraduate-level constructive tasks. Performance declines sharply on more complex problems, with models failing to provide substantive strategies for open problems. These findings suggest that, although current LLMs display a degree of constructive proficiency on familiar and lower-difficulty problems, such performance is likely attributable to the recombination of memorized patterns rather than authentic creative insight or novel synthesis.

Summary

Evaluating Mathematical Creativity in LLMs: The DeepMath-Creative Benchmark

The advent of LLMs has brought significant strides in computational capability, particularly in mathematical problem-solving. Nevertheless, the evaluation of these models' mathematical creativity has remained an understudied dimension. Addressing this gap, the paper "DeepMath-Creative: A Benchmark for Evaluating Mathematical Creativity of LLMs" proposes a framework to systematically assess and quantify the creative abilities of LLMs in mathematical domains.

Mathematical Creativity and Benchmark Design

In delineating mathematical creativity, the authors identify three key dimensions: (1) the generation of novel concepts, (2) the invention of novel methodologies, and (3) the construction of novel examples. These dimensions are pivotal, representing substantive leaps in mathematical understanding and reasoning. To embody these dimensions in a practical evaluation tool, the authors constructed the DeepMath-Creative benchmark, which includes innovative and high-quality problems across several branches of mathematics such as algebra, geometry, and analysis. This benchmark comprises two types of inquiry-oriented tasks: proving formal statements and constructing counterexamples to refute propositions.
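
To make the benchmark's organization concrete, the sketch below shows one way a single DeepMath-Creative item could be represented, combining the three creativity dimensions with the two task types described above. The schema, field names, and example contents are illustrative assumptions, not the authors' released data format.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    ALGEBRA = "algebra"
    GEOMETRY = "geometry"
    ANALYSIS = "analysis"
    OTHER = "other"


class CreativityDimension(Enum):
    NOVEL_CONCEPT = "novel_concept"      # generating a novel mathematical concept
    NOVEL_METHOD = "novel_method"        # inventing a novel methodology or proof technique
    NOVEL_EXAMPLE = "novel_example"      # constructing a novel example or object


class TaskType(Enum):
    PROVE_STATEMENT = "prove"            # prove a formal statement constructively
    REFUTE_BY_COUNTEREXAMPLE = "refute"  # construct a counterexample to refute a proposition


@dataclass
class BenchmarkItem:
    """A hypothetical DeepMath-Creative problem record (illustrative only)."""
    item_id: str
    domain: Domain
    dimension: CreativityDimension
    task_type: TaskType
    statement: str           # the proposition to prove or refute
    reference_solution: str  # a model construction consulted during grading


# Invented example item, shown only to illustrate the schema.
example_item = BenchmarkItem(
    item_id="analysis-017",
    domain=Domain.ANALYSIS,
    dimension=CreativityDimension.NOVEL_EXAMPLE,
    task_type=TaskType.REFUTE_BY_COUNTEREXAMPLE,
    statement="Every pointwise limit of continuous functions on [0, 1] is continuous.",
    reference_solution="f_n(x) = x**n converges pointwise to a function discontinuous at x = 1.",
)
```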

Empirical Evaluation

The paper reports a systematic evaluation of mainstream LLMs on the DeepMath-Creative benchmark. Notably, even under lenient scoring conditions, the best-performing model, O3 Mini, reached only 70% accuracy, and chiefly on basic undergraduate-level constructive tasks, which raises concerns about the models' genuine creative capacity. Models exhibited pronounced difficulties on more complex, open-ended questions, failing to devise substantive solutions or novel strategies.
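
The lenient rubric described in the paper credits the core components of a solution while recording but ignoring minor flaws such as small logical gaps, incomplete justifications, or redundant explanations. The sketch below shows one plausible implementation of such a rubric; the component annotations, weighting, and pass threshold are assumptions made for illustration, not the authors' actual grading protocol.

```python
from dataclasses import dataclass, field


@dataclass
class GradedSolution:
    """A model's solution after judging, in a hypothetical annotation format."""
    core_components_present: list[bool]  # e.g. [construction stated, key property verified]
    minor_issues: list[str] = field(default_factory=list)  # small gaps, redundancy, etc.


def lenient_pass(graded: GradedSolution, threshold: float = 1.0) -> bool:
    """Lenient scoring: count only the core solution components.

    Minor inaccuracies (small logical gaps, incomplete justifications,
    redundant explanations) are recorded but deliberately not penalized.
    """
    if not graded.core_components_present:
        return False
    core_fraction = sum(graded.core_components_present) / len(graded.core_components_present)
    return core_fraction >= threshold


# Example: both core components present, two minor issues -> still counted as correct.
solution = GradedSolution(
    core_components_present=[True, True],
    minor_issues=["skips a routine limit computation", "restates the claim redundantly"],
)
print(lenient_pass(solution))  # True
```

Accuracy on the benchmark would then be the fraction of items whose solutions pass such a check.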

Analysis of Novelty and Construction

The authors maintain that the proficiency observed in these models is largely a byproduct of recombining known patterns rather than authentic creative synthesis. In the evaluations, models frequently defaulted to error-prone constructions, inadequate reasoning, or verbose derivations that never converged on a correct solution. These findings underscore a significant shortcoming: existing models have yet to exhibit the mathematical creativity needed to tackle open problems effectively or to generate genuinely novel mathematical insights.

Implications and Future Directions

The implications of this paper are twofold. Practically, it highlights the need for enhanced methodologies and training paradigms to foster deeper creative problem-solving capabilities in LLMs. Theoretically, it invites further exploration into modeling human-like creativity within computational frameworks, encouraging the development of more sophisticated learning algorithms that can simulate innovative human thought processes. The authors propose leveraging reinforcement learning and other advanced techniques to refine and train the DeepMath-Creative Model, potentially paving the way for more robust applications in advanced mathematical research.

In conclusion, while current LLMs have made notable progress in structured problem-solving, they have yet to achieve the nuanced creativity aspired to in professional mathematical research. The proposed benchmark serves as a foundational tool, setting the stage for future work aimed at bridging this divide. Gradual improvement of LLM performance on challenging, creative tasks could significantly extend their applicability in the field, opening new avenues for mathematical discovery and innovation.
