Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
(2504.01995v2)
Published 1 Apr 2025 in cs.AI and cs.LG
Abstract: Recent advances in LLMs have shown impressive progress in mathematical reasoning tasks. However, current evaluation benchmarks predominantly focus on the accuracy of final answers, often overlooking the crucial logical rigor for mathematical problem solving. The claim that state-of-the-art LLMs can solve Math Olympiad-level problems requires closer examination. To explore this, we conducted both qualitative and quantitative human evaluations of proofs generated by LLMs, and developed a schema for automatically assessing their reasoning capabilities. Our study reveals that current LLMs fall significantly short of solving challenging Olympiad-level problems and frequently fail to distinguish correct mathematical reasoning from clearly flawed solutions. Our analyses demonstrate that the occasional correct final answers provided by LLMs often result from pattern recognition or heuristic shortcuts rather than genuine mathematical reasoning. These findings underscore the substantial gap between LLM performance and human expertise in advanced mathematical reasoning and highlight the importance of developing benchmarks that prioritize the soundness of the reasoning used to arrive at an answer rather than the mere correctness of the final answers.
This paper investigates the proficiency of contemporary LLMs on complex mathematical reasoning tasks, specifically problems sourced from the International Mathematics Olympiad (IMO) Shortlist (Mahdavi et al., 1 Apr 2025). The core argument posits that prevailing evaluation benchmarks, which often prioritize the accuracy of the final numerical answer, are inadequate for assessing genuine mathematical understanding, particularly the logical rigor required for constructing valid proofs. The paper employs both qualitative human expert evaluation and quantitative analysis, introducing a novel schema for categorizing logical fallacies in LLM-generated proofs, to demonstrate a significant deficit in the reasoning capabilities of current state-of-the-art models on these challenging problems.
Evaluation Methodology
The evaluation framework was designed to move beyond simple answer verification and scrutinize the logical structure and correctness of the mathematical arguments produced by LLMs.
Dataset and Models
The problem set comprised 455 problems from the IMO Shortlist covering the years 2009-2023, spanning Algebra (108), Combinatorics (117), Geometry (116), and Number Theory (114). IMO Shortlist problems were selected for their originality, their requirement for intricate multi-step reasoning built on relatively elementary mathematics, and their resistance to simple pattern matching or knowledge-retrieval strategies. The evaluated LLMs included frontier models from several providers: OpenAI's o1, o1-mini, and o3-mini; DeepSeek R1; and Gemini 2.0 (in Flash Thinking mode).
Human Evaluation and Fallacy Schema
A team of seven human evaluators, each either a former national Olympiad medalist or the holder of an advanced degree (PhD level) in mathematics or computer science, conducted the qualitative assessment. Initial analysis of LLM outputs revealed recurring patterns of flawed reasoning. This led to the development of a systematic Fallacy Classification Schema to categorize these errors:
Proof by Example: Illegitimately generalizing from one or a few specific instances.
Proposal Without Verification: Asserting a strategy, definition, or intermediate step without justifying its validity or applicability.
Inventing Wrong Facts: Fabricating or incorrectly stating mathematical theorems, definitions, or properties.
Begging the Question (Circular Reasoning): Assuming the truth of the conclusion, or a statement equivalent to it, within the argument.
Solution by Trial-and-Error: Relying primarily on testing specific values or cases without a comprehensive deductive argument.
Calculation Mistakes: Significant arithmetic or algebraic errors that invalidate the reasoning chain.
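For concreteness, the schema lends itself to a machine-readable encoding. The sketch below is a minimal Python rendering; the class and member names are our own rather than the paper's.

```python
from enum import Enum

class Fallacy(Enum):
    """Fallacy Classification Schema for flawed reasoning in LLM-generated proofs."""
    PROOF_BY_EXAMPLE = "proof_by_example"                              # generalizing from a few instances
    PROPOSAL_WITHOUT_VERIFICATION = "proposal_without_verification"    # unjustified strategy or step
    INVENTING_WRONG_FACTS = "inventing_wrong_facts"                    # fabricated or misstated theorems/properties
    BEGGING_THE_QUESTION = "begging_the_question"                      # circular reasoning
    SOLUTION_BY_TRIAL_AND_ERROR = "solution_by_trial_and_error"        # case testing without a deductive argument
    CALCULATION_MISTAKE = "calculation_mistake"                        # arithmetic/algebraic errors that break the chain
```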
Annotation and Quantitative Analysis
Evaluators annotated each LLM-generated solution using a structured checklist. Solutions were classified as Correct, Partially Correct (containing some valuable steps but flawed or incomplete), or Incorrect (lacking substantive progress). For incorrect solutions, the specific fallacies present were identified using the schema. Consistency was maintained through oversight and discussion of ambiguous cases.
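A per-solution annotation record along these lines could capture the checklist. The sketch below is hypothetical (field names and types are ours) and simply mirrors the verdicts and fallacy labels described above.

```python
from dataclasses import dataclass, field
from enum import Enum

class Verdict(Enum):
    CORRECT = "correct"                      # complete, rigorous proof
    PARTIALLY_CORRECT = "partially_correct"  # some valuable steps, but flawed or incomplete
    INCORRECT = "incorrect"                  # no substantive progress

@dataclass
class SolutionAnnotation:
    problem_id: str                          # e.g., an IMO Shortlist identifier
    model: str                               # LLM that produced the solution
    verdict: Verdict
    fallacies: list[str] = field(default_factory=list)   # schema labels, recorded for flawed solutions
    final_answer_correct: bool | None = None              # None for pure proof problems with no final answer
    notes: str = ""                          # evaluator comments on ambiguous cases
```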
Quantitative analyses included:
Overall Proof Correctness: Calculating the distribution of Correct, Partially Correct, and Incorrect ratings per model.
Final Answer Accuracy vs. Proof Correctness: For problems admitting a concrete final answer, the paper compared the percentage of correctly identified final answers ("Final Answer Accuracy") with the conditional probability that the underlying proof was correct given a correct final answer ("Correct | Correct Final Answer"). This directly tested whether correct answers masked flawed reasoning.
Fallacy Frequency: Analyzing the relative prevalence of different fallacy types across models and stratified by problem characteristics (e.g., presence of a final answer, mathematical topic area).
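To make the fallacy-frequency analysis concrete, a simple tabulation over annotation records might look like the following sketch; the dictionary shape is an assumption of ours, not the paper's tooling.

```python
from collections import Counter

def fallacy_frequencies(annotations: list[dict]) -> dict[str, Counter]:
    """Tabulate fallacy counts per model from annotation dicts of the form
    {"model": str, "topic": str, "fallacies": list[str]}."""
    per_model: dict[str, Counter] = {}
    for ann in annotations:
        per_model.setdefault(ann["model"], Counter()).update(ann["fallacies"])
    return per_model

# Toy example with two annotated solutions from one model.
toy = [
    {"model": "model_a", "topic": "Algebra", "fallacies": ["inventing_wrong_facts"]},
    {"model": "model_a", "topic": "Geometry",
     "fallacies": ["proposal_without_verification", "inventing_wrong_facts"]},
]
print(fallacy_frequencies(toy))
# model_a: inventing_wrong_facts x2, proposal_without_verification x1
```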
Automated Verification Assessment
To assess the LLMs' capability to verify mathematical correctness, two experiments were conducted using the LLMs themselves as judges:
Individual Judgment: LLMs were presented with either a known correct solution (sourced from the Art of Problem Solving community) or a known incorrect, fallacious solution generated by an LLM during the main evaluation phase. They were prompted to classify the solution as 'correct' or 'wrong'.
Pairwise Comparison: LLMs were given pairs of solutions for the same problem – one correct (AoPS) and one incorrect (LLM-generated) – and tasked with identifying the correct one.
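A minimal harness for these two experiments might look like the sketch below. The `query_model` helper is a hypothetical placeholder for whatever API client is used, and the prompts are illustrative rather than the paper's exact wording.

```python
import random

def query_model(judge: str, prompt: str) -> str:
    """Hypothetical helper: send `prompt` to the judge model and return its text reply."""
    raise NotImplementedError("plug in an API client of your choice")

def individual_judgment(judge: str, problem: str, solution: str) -> bool:
    """Experiment 1: ask the judge to label a single solution as correct or wrong."""
    prompt = (
        f"Problem:\n{problem}\n\nProposed solution:\n{solution}\n\n"
        "Is this solution mathematically correct? Answer with one word: correct or wrong."
    )
    return query_model(judge, prompt).strip().lower().startswith("correct")

def pairwise_comparison(judge: str, problem: str, correct_sol: str, wrong_sol: str) -> bool:
    """Experiment 2: show both solutions in random order and ask which one is correct.
    Returns True if the judge identifies the genuinely correct solution."""
    correct_label = random.choice(["A", "B"])
    sol_a, sol_b = (correct_sol, wrong_sol) if correct_label == "A" else (wrong_sol, correct_sol)
    prompt = (
        f"Problem:\n{problem}\n\nSolution A:\n{sol_a}\n\nSolution B:\n{sol_b}\n\n"
        "Exactly one of these solutions is correct. Answer with one letter: A or B."
    )
    return query_model(judge, prompt).strip().upper().startswith(correct_label)
```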
Key Findings
The paper revealed significant limitations in the mathematical reasoning capabilities of the evaluated LLMs when confronted with Olympiad-level problems.
Low Proof Generation Accuracy
State-of-the-art LLMs exhibited extremely low success rates in generating correct proofs. The percentage of fully Correct solutions was minimal across all models, ranging from 0% to 3.8%. The vast majority of generated outputs were classified as Incorrect, indicating a fundamental inability to construct rigorous arguments for these problems.
| Model | Correct | Partially Correct | Incorrect |
|-------------|---------|-------------------|-----------|
| o1 | 3.8% | 6.3% | 89.9% |
| o1-mini | 1.0% | 2.0% | 97.0% |
| o3-mini | 0.0% | 2.0% | 98.0% |
| DeepSeek R1 | 0.4% | 4.2% | 95.4% |
| Gemini 2.0 | 0.0% | 0.0% | 100.0% |
(Adapted from Table 1 in the paper)
Disconnect Between Final Answer and Proof Validity
A stark discrepancy was observed between the ability to produce a correct final answer and the ability to provide a valid supporting proof. While models sometimes generated correct numerical answers, the underlying reasoning was frequently flawed. The conditional probability P(Correct Solution | Correct Final Answer) was notably low, even reaching 0% for several models, indicating that correct final answers often resulted from mechanisms other than sound deduction, such as pattern recognition or heuristics.
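As a worked illustration of this metric, the conditional rate can be computed directly from per-solution flags. The helper below is our own sketch, assuming each record is a (final_answer_correct, proof_correct) pair for a problem that admits a concrete final answer.

```python
def answer_vs_proof_accuracy(records: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Return (final answer accuracy, P(correct proof | correct final answer)).

    Each record is (final_answer_correct, proof_correct) for one LLM solution."""
    proofs_given_correct_answer = [proof_ok for ans_ok, proof_ok in records if ans_ok]
    answer_accuracy = len(proofs_given_correct_answer) / len(records) if records else 0.0
    conditional = (
        sum(proofs_given_correct_answer) / len(proofs_given_correct_answer)
        if proofs_given_correct_answer else 0.0
    )
    return answer_accuracy, conditional

# Toy example: 10 answer-type problems, 4 correct final answers, only 1 backed by a valid proof.
toy = [(True, True)] + [(True, False)] * 3 + [(False, False)] * 6
print(answer_vs_proof_accuracy(toy))  # (0.4, 0.25)
```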
"Inventing Wrong Facts" and "Proposal Without Verification" emerged as the most frequent fallacies committed by most LLMs. The authors hypothesize that the former might stem from training objectives overly focused on final answers, while the latter reflects difficulties in constructing and justifying multi-step arguments. Fallacy distribution varied: "Proof by Example" and "Solution by Trial and Error" were more common in problems requiring a specific numerical answer, while "Inventing Wrong Facts" and "Proposal Without Verification" dominated in pure proof tasks. Variations were also observed across mathematical domains (Algebra, Combinatorics, Geometry, Number Theory).
Poor Verification Capabilities
The LLMs demonstrated very limited ability to act as reliable verifiers of mathematical proofs. In the individual judgment task, models frequently misclassified incorrect, fallacious solutions as correct. In the pairwise comparison task, where models had to choose the correct proof from a correct/incorrect pair, performance was often near random chance (around 50% accuracy), indicating a lack of robust capability to discern logical validity even when presented with contrasting examples.
| Model | Accuracy (Pairwise Comparison) |
|-------------|--------------------------------|
| o1 | 59.8% |
| o1-mini | 47.0% |
| o3-mini | 54.7% |
| DeepSeek R1 | 54.5% |
| Gemini 2.0 | 52.1% |
(Adapted from Table 4 in the paper)
Implications
The findings carry significant implications for the assessment and development of mathematical reasoning in LLMs.
Capability Gap: There exists a substantial gap between the performance of current LLMs and the requirements of advanced mathematical reasoning, particularly the construction of rigorous proofs characteristic of human expert problem-solving at the Olympiad level.
Benchmark Limitations: The paper critically undermines the validity of evaluation benchmarks focusing solely on final answer accuracy for complex mathematical tasks. Such metrics fail to capture the essential component of logical soundness and can provide a misleading picture of model capabilities.
Need for Rigor-Focused Evaluation: The results emphasize the necessity of developing and adopting evaluation methodologies and benchmarks that prioritize the assessment of logical coherence, step-by-step validity, and the overall rigor of generated mathematical arguments. The proposed fallacy schema offers one potential component for such evaluations.
Training and Verification Challenges: The prevalence of logical fallacies and the poor performance in verification tasks suggest that current training paradigms, possibly including reinforcement learning from human feedback (RLHF) or outcome-based reward models, may be insufficient for instilling deep mathematical reasoning. Furthermore, using LLMs themselves as judges for complex mathematical correctness appears unreliable.
Conclusion
This work provides a critical evaluation of LLM proficiency in Olympiad-level mathematics, demonstrating through rigorous human assessment and a novel fallacy classification schema that current models struggle significantly with constructing valid proofs (Mahdavi et al., 1 Apr 2025). The findings highlight the inadequacy of final-answer-centric benchmarks and underscore the considerable gap that remains between LLM performance and genuine mathematical reasoning. The paper calls for a shift toward evaluation methods that prioritize logical rigor in order to accurately gauge and foster progress in this domain.