
Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad (2503.21934v4)

Published 27 Mar 2025 in cs.CL

Abstract: Recent math benchmarks for LLMs such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, Gemini-2.5-Pro, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely based on final numerical answers, neglecting rigorous reasoning and proof generation which are essential for real-world mathematical tasks. To address this, we introduce the first comprehensive evaluation of full-solution reasoning for challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems from the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly: only Gemini-2.5-Pro achieves a non-trivial score of 25%, while all other models achieve less than 5%. Through detailed analysis of reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training. Overall, our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.

Authors (8)
  1. Ivo Petrov (5 papers)
  2. Jasper Dekoninck (8 papers)
  3. Lyuben Baltadzhiev (1 paper)
  4. Maria Drencheva (2 papers)
  5. Kristian Minchev (3 papers)
  6. Mislav Balunović (22 papers)
  7. Nikola Jovanović (21 papers)
  8. Martin Vechev (103 papers)

Summary

This research evaluates the capabilities of state-of-the-art LLMs on rigorous mathematical proof generation, moving beyond benchmarks that focus solely on final numerical answers. It utilizes the 2025 USA Mathematical Olympiad (USAMO) problems to assess the full-solution reasoning of several prominent models.

Methodology for Evaluating Full-Solution Reasoning

The paper aimed to evaluate LLMs on tasks requiring detailed, natural language mathematical proofs, characteristic of high-level competitions like the USAMO and IMO. The methodology involved several key steps:

  1. Problem Corpus: The six problems from the 2025 USAMO were selected immediately following their release to ensure novelty and prevent data contamination. These problems demand comprehensive proofs and cover various mathematical domains (algebra, combinatorics, geometry, number theory).
  2. Model Selection: Six contemporary LLMs known for strong reasoning performance were evaluated: R1 (DeepSeek R1), Flash-Thinking (Gemini-2.0-Flash-Thinking-Exp), Claude 3.7 (Claude-3.7-Sonnet-Thinking), QwQ (QwQ-32B), o1-pro (high), and o3-mini (high).
  3. Solution Generation: Each model was prompted to generate four independent, detailed solutions in LaTeX format for every USAMO problem, capturing variability and providing sufficient data for analysis. The full reasoning traces (thought processes) were collected but excluded from the grading material to avoid biasing the judges. (A minimal sketch of this generation-and-grading pipeline appears after this list.)
  4. Expert Human Annotation: A panel of four expert human graders, all former national IMO team members or finalists, was employed. Solutions were anonymized and rendered as PDFs.
  5. Grading Protocol: For each problem, a rigorous 0-7-point grading scheme, with partial credit in line with official IMO grading practice, was constructed from community solutions (e.g., AoPS forums) and validated by the judges. Each solution was independently graded by two judges, and inter-grader discrepancies were resolved through discussion to ensure consistency. Judges documented the points awarded and identified the first instance of flawed reasoning, classifying it into the predefined failure-mode categories below.
  6. Failure Mode Taxonomy: Errors were systematically categorized into:
    • Logic: Errors involving logical fallacies, unjustified claims, or gaps in reasoning.
    • Assumption: Introducing unproven statements or incorrect assumptions as facts.
    • Creativity: Failure stemming from an inability to devise a viable solution strategy or identify key insights.
    • Algebra/Arithmetic: Critical errors in calculation or algebraic manipulation.
  7. Automated Grading Experiment: An auxiliary experiment explored the feasibility of using LLMs (o3-mini, Claude 3.7) as automated graders, providing them with the problem, rubric, a known correct solution, and an example grading.
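
The paper does not publish its evaluation harness, so the following Python sketch is only an illustration of the setup described above. The query_model helper, and names such as FailureMode and GradedSolution, are hypothetical stand-ins introduced here; they are not the authors' code.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional, Tuple

# Failure-mode taxonomy from step 6 above.
class FailureMode(Enum):
    LOGIC = auto()        # logical fallacies, unjustified claims, gaps in reasoning
    ASSUMPTION = auto()   # unproven statements or incorrect assumptions treated as facts
    CREATIVITY = auto()   # no viable strategy or key insight found
    ALGEBRA = auto()      # critical calculation or algebraic-manipulation errors

MODELS = ["R1", "Flash-Thinking", "Claude 3.7", "QwQ", "o1-pro", "o3-mini"]
PROBLEMS = range(1, 7)        # the six USAMO 2025 problems
ATTEMPTS_PER_PROBLEM = 4      # four independent generations per model and problem

@dataclass
class GradedSolution:
    model: str
    problem: int
    attempt: int
    latex_solution: str                  # rendered to PDF and shown to the judges
    reasoning_trace: str                 # collected, but withheld from the judges
    judge_scores: Tuple[int, int]        # two independent 0-7 scores
    first_error: Optional[FailureMode]   # first flawed step, if any

    def agreed_score(self) -> int:
        """Scores must match; real discrepancies were resolved by discussion."""
        a, b = self.judge_scores
        if a != b:
            raise ValueError("judges disagree; resolve by discussion before recording")
        return a

def collect_solutions(query_model: Callable[[str, str], Tuple[str, str]]):
    """Prompt every model four times per problem for a detailed LaTeX solution."""
    prompt = ("Provide a complete, rigorous solution to the following USAMO 2025 "
              "problem. Write the full solution in LaTeX.\n\n{statement}")
    runs = []
    for model in MODELS:
        for problem in PROBLEMS:
            for attempt in range(1, ATTEMPTS_PER_PROBLEM + 1):
                solution, trace = query_model(
                    model, prompt.format(statement=f"<problem {problem}>"))
                runs.append((model, problem, attempt, solution, trace))
    return runs  # 6 models x 6 problems x 4 attempts = 144 generations
```

The loop yields 6 models × 6 problems × 4 attempts = 144 generations, matching the paper's setup of four independent solutions per model per problem.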

Performance Shortcomings and Reasoning Analysis

The evaluation revealed significant deficiencies in the mathematical reasoning capabilities of all tested LLMs when subjected to the rigors of Olympiad-level proof generation.

  • Overall Scores: Performance was extremely poor across the board. The average score across models was less than 5% of the 42 available points (6 problems × 7 points each), i.e., under about 2 points. The highest-scoring model, R1, managed an average of only 2.0 points (roughly 4.8%). This stands in stark contrast to the high scores reported for models like o3-mini on benchmarks such as MathArena, which primarily evaluate final numerical answers.
  • Absence of Correct Solutions: Critically, not a single solution generated by any model for any problem received a perfect score of 7 points. This underscores the difficulty these models face in constructing complete and flawless proofs.
  • Failure Mode Distribution:
    • Flawed Logic was the predominant failure mode. Models frequently made unwarranted logical leaps, misinterpreted intermediate results, or dismissed non-trivial steps without justification. The tendency of o3-mini to label crucial steps as "trivial" was particularly noted.
    • Lack of Creativity was another major obstacle. Models struggled to formulate correct high-level strategies and often perseverated on the same incorrect approach across multiple generation attempts. Flash-Thinking did try several strategies within a single generation, but explored none of them deeply enough to succeed.
    • A consistent finding was the models' propensity to "bluff"—confidently asserting claims of having solved the problem even when the provided reasoning was fundamentally flawed. This contrasts sharply with human expert behavior and raises concerns about the trustworthiness of LLM outputs in mathematical contexts.
    • Algebra/Arithmetic errors were less common but still present, with R1 showing a relatively higher frequency compared to other models.
  • Solution Quality and Structure: Variability was observed in solution clarity. o3-mini and o1-pro generally produced more structured and readable LaTeX outputs, potentially reflecting specific fine-tuning for coherence. In contrast, Flash-Thinking and QwQ often generated more disorganized and chaotic reasoning steps.
  • Automated Grading Infeasibility: The experiment using o3-mini and Claude 3.7 as graders showed that they cannot accurately assess solution quality. These LLM graders consistently overestimated scores, often by large margins (up to 20x inflation), failing to identify critical flaws in reasoning and awarding points incorrectly. Their judgments also seemed sensitive to superficial characteristics, such as the number of attempts within a solution (penalizing Flash-Thinking) or apparent simplicity (favoring QwQ). (A sketch of how the grading context can be assembled into a prompt follows this list.)
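
The paper states what the LLM graders were given (the problem, the rubric, a known correct solution, and an example grading) but not the exact prompt wording. The helper below is a hypothetical sketch of how that context might be assembled; build_grader_prompt and its wording are assumptions, not the authors' prompt.

```python
def build_grader_prompt(problem: str, rubric: str, reference_solution: str,
                        example_grading: str, candidate_solution: str) -> str:
    """Assemble the grading context described in the paper into a single prompt.
    The wording is illustrative; the paper does not publish its exact prompt."""
    return (
        "You are grading a mathematical olympiad solution on a 0-7 scale.\n\n"
        f"Problem:\n{problem}\n\n"
        f"Grading rubric:\n{rubric}\n\n"
        f"Known correct solution:\n{reference_solution}\n\n"
        f"Example of a graded solution:\n{example_grading}\n\n"
        f"Solution to grade:\n{candidate_solution}\n\n"
        "Return an integer score from 0 to 7 and justify every deduction."
    )
```

Even with all of this context, the paper reports that such graders overestimated scores by large margins, so this setup is no substitute for expert human grading.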

Issues Arising from Optimization Strategies

The analysis identified specific failure patterns potentially linked to the optimization techniques used during LLM training, particularly those focused on maximizing reward based on final answer extraction.

  • Answer Boxing Artifact: Models trained with reward-based optimization, whether RLHF or reinforcement learning against automatically extracted final answers (as in GRPO-style training), often learn to enclose a final numerical or symbolic answer within a \boxed{} command so that reward can be computed by simple extraction. This behavior persisted inappropriately on proof problems that lack a single definitive answer to box. For USAMO 2025 Problem 5, which asks for a characterization of the integers satisfying a property (all even integers), QwQ correctly identified the set of solutions but then incorrectly boxed only the single integer 2, apparently overriding its correct broader conclusion because of the ingrained pattern of producing a boxed answer. This suggests that such optimization pressure can actively hinder correct reasoning on problems requiring more nuanced outputs. (A minimal illustration of an extraction-based reward appears after this list.)
  • Heuristic Overgeneralization: Models frequently relied on pattern spotting from small numerical examples, incorrectly extrapolating these patterns into general claims without rigorous proof. While potentially useful as a heuristic during exploration, this tendency, possibly reinforced during training, is insufficient and often misleading when formal proof is required.
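
To make the mechanism concrete, here is a minimal, self-contained sketch of the kind of extraction-based final-answer reward discussed above. It is not any lab's actual reward function; it only illustrates why a model optimized against such a signal is pushed to always emit a \boxed{} answer, even on proof problems where nothing meaningful can be boxed.

```python
import re

def extract_boxed_answer(completion: str):
    """Return the contents of the last \\boxed{...} in a completion, if any.
    (Simple regex; nested braces are not handled in this sketch.)"""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1] if matches else None

def final_answer_reward(completion: str, reference_answer: str) -> float:
    """Reward 1.0 only for an exactly matching boxed answer, otherwise 0.0.
    A fully correct proof with no boxed answer earns nothing, so training
    against this signal rewards always producing *some* boxed answer."""
    answer = extract_boxed_answer(completion)
    return 1.0 if answer is not None and answer.strip() == reference_answer else 0.0
```

Under this signal, a generation that correctly concludes "all even integers" but boxes nothing scores the same as an entirely wrong answer, which is consistent with the boxing behavior observed on Problem 5.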

Implications for LLMs in Mathematical Reasoning

The paper's findings indicate that current state-of-the-art LLMs, despite successes on benchmarks evaluating numerical answers, remain significantly inadequate for tasks demanding rigorous, step-by-step mathematical proof generation at the level of difficulty represented by the USAMO.

The results highlight the need for substantial advancements in core reasoning and proof-generation capabilities. Evaluation methodologies must evolve beyond simple answer checking to rigorously assess the logical soundness of the entire reasoning process. Furthermore, the observed negative artifacts suggest that current optimization paradigms, particularly those narrowly focused on answer extraction, may be insufficient or even counterproductive for developing deep, flexible reasoning. New training approaches might be necessary to instill more robust logical deduction and creative problem-solving skills. The pervasive "bluffing" behavior necessitates continued reliance on expert human verification for any critical mathematical tasks involving LLMs.

Conclusion

This evaluation of LLMs on the 2025 USAMO problems provides compelling evidence that contemporary models struggle profoundly with generating rigorous mathematical proofs. Achieving less than 5% on average and failing to produce any perfect solutions, the models exhibited frequent logical flaws, lack of creativity, and detrimental artifacts potentially stemming from training optimizations. These results underscore a significant gap between current LLM capabilities and the requirements of advanced mathematical reasoning, indicating critical areas for future research and development.
