Gemini 2.5 Pro Capable of Winning Gold at IMO 2025 (2507.15855v3)
Abstract: The International Mathematical Olympiad (IMO) poses uniquely challenging problems requiring deep insight, creativity, and formal reasoning. While LLMs perform well on mathematical benchmarks like AIME, they struggle with Olympiad-level tasks. We use Google's Gemini 2.5 Pro on the newly released IMO 2025 problems, avoiding data contamination. Using a self-verification pipeline with careful prompt design, 5 (out of 6) problems are solved correctly. This result underscores the importance of developing optimal strategies to harness the full potential of powerful LLMs for complex reasoning tasks.
Summary
- The paper introduces an iterative verifier-guided pipeline that generates complete, mathematically rigorous proofs for IMO-level problems.
- The methodology employs multi-stage self-improvement and verification to overcome context window limitations and minimize logical errors.
- Empirical results show that Gemini 2.5 Pro solved 5 of 6 IMO 2025 problems, highlighting both its potential and current challenges in automated reasoning.
Gemini 2.5 Pro and the IMO 2025: A Structured Pipeline for Automated Mathematical Reasoning
Introduction and Motivation
The International Mathematical Olympiad (IMO) is a rigorous benchmark for evaluating the advanced reasoning capabilities of AI systems, particularly LLMs. Unlike standard mathematical benchmarks such as GSM8K or MATH, IMO problems require deep abstraction, multi-step logical deduction, and creative proof construction. Prior work has shown that even state-of-the-art LLMs struggle to produce rigorous, error-free proofs for Olympiad-level problems, often succumbing to logical fallacies or superficial heuristics. This paper investigates whether the latent capabilities of a leading LLM—Google’s Gemini 2.5 Pro—can be harnessed to solve the newly released IMO 2025 problems, using a carefully engineered, iterative self-verification pipeline.
Pipeline Architecture and Methodology
The core contribution is a multi-stage, verifier-guided pipeline designed to elicit rigorous, complete solutions from a general-purpose LLM. The pipeline consists of the following steps:
- Initial Solution Generation: The model is prompted to produce a detailed, step-by-step proof, with explicit instructions to prioritize rigor and to avoid unjustified leaps or guesses. The prompt enforces a strict output format, including a summary, method sketch, and a fully detailed solution in TeX.
- Self-Improvement: The model is instructed to review and refine its own output, leveraging an additional thinking budget. This step is critical because Gemini 2.5 Pro's per-call thinking budget (32,768 tokens) is often insufficient to produce a complete IMO proof in a single pass.
- Verifier Loop: A separate verifier prompt is used to scrutinize the solution, classifying issues as either critical errors (logical/factual mistakes) or justification gaps (insufficiently justified steps). The verifier produces a bug report, which is then used to guide further refinement.
- Iterative Correction: The model iteratively corrects its solution based on the bug report, with optional human review of the bug report to filter out spurious findings. This loop continues until the solution passes five consecutive verification runs or is rejected after persistent major errors.
- Acceptance Criteria: A solution is accepted only if it passes the verifier five times consecutively, ensuring robustness against both model and verifier errors.
This architecture is designed to overcome the limitations of single-pass LLM outputs, particularly the context window constraint and the tendency to overlook subtle logical errors.
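Below is a minimal sketch of this loop, not the authors' code: the helpers `generate_solution`, `self_improve`, and `verify` are placeholders for Gemini 2.5 Pro calls with the paper's solver, self-improvement, and verifier prompts, and the cap of ten correction rounds is an illustrative assumption.

```python
# Minimal sketch of the verifier-guided loop described above (not the authors' code).
# The three model-calling helpers are placeholders for Gemini 2.5 Pro calls with the
# paper's solver / self-improvement / verifier prompts.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BugReport:
    critical_errors: list[str] = field(default_factory=list)     # logical/factual mistakes
    justification_gaps: list[str] = field(default_factory=list)  # insufficiently justified steps

    def is_clean(self) -> bool:
        return not self.critical_errors and not self.justification_gaps


def generate_solution(problem: str) -> str:
    """Step 1: prompt for a rigorous, step-by-step proof in TeX (placeholder)."""
    raise NotImplementedError


def self_improve(problem: str, solution: str, report: BugReport | None = None) -> str:
    """Steps 2/4: refine the solution, optionally guided by a bug report (placeholder)."""
    raise NotImplementedError


def verify(problem: str, solution: str) -> BugReport:
    """Step 3: run the verifier prompt and parse its bug report (placeholder)."""
    raise NotImplementedError


def solve(problem: str, max_rounds: int = 10, required_passes: int = 5) -> str | None:
    """Run the pipeline; accept only after `required_passes` consecutive clean verifications."""
    solution = self_improve(problem, generate_solution(problem))   # steps 1-2
    passes = 0
    for _ in range(max_rounds):
        report = verify(problem, solution)                         # step 3
        if report.is_clean():
            passes += 1
            if passes >= required_passes:                          # acceptance criterion
                return solution
        else:
            passes = 0                                             # reset on any reported issue
            solution = self_improve(problem, solution, report)     # step 4: iterative correction
    return None                                                    # rejected after persistent errors
```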
Experimental Setup
- Model: Gemini 2.5 Pro, with a maximum thinking budget of 32,768 tokens per call.
- Temperature: Set to 0.1 to minimize stochasticity and random errors.
- No External Tools: No web search, code execution, or external verification tools are used.
- Prompt Engineering: Prompts are meticulously crafted to enforce mathematical rigor and explicit self-correction.
To avoid data contamination, only the newly released IMO 2025 problems are used, ensuring that the evaluation is on genuinely unseen data.
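The reported settings can be summarized as a compact configuration; the sketch below is illustrative, with generic parameter names that are not tied to a specific SDK or to the authors' code.

```python
# Illustrative restatement of the reported setup; keys are generic conventions,
# not a specific API's parameter names.
GENERATION_SETTINGS = {
    "model": "gemini-2.5-pro",      # general-purpose model, no fine-tuning
    "temperature": 0.1,             # low temperature to reduce random errors
    "max_thinking_tokens": 32_768,  # per-call reasoning budget reported in the paper
    "tools": [],                    # no web search, code execution, or external verifiers
}
```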
Results and Empirical Findings
The pipeline successfully generated complete, mathematically rigorous solutions for 5 out of the 6 IMO 2025 problems. Notably:
- Problems 3, 4, 5: Solved without any problem-specific hints.
- Problems 1, 2: Solved both with and without general strategic hints (e.g., "try induction" or "try analytic geometry"). Hint-free runs required more sampling, suggesting that hints mainly improve sampling efficiency rather than unlock reasoning the model cannot otherwise perform.
- Problem 6: The pipeline failed to produce a correct solution. The model’s proof contained a critical error in the combinatorial argument, specifically an incorrect assumption about the partitioning of tiles, which was not repairable through iterative refinement.
Key empirical claim: The iterative, verifier-guided pipeline is essential for converting the latent mathematical reasoning capabilities of LLMs into rigorous, trustworthy proofs. Direct, single-pass generation—even with a strong model—remains insufficient for IMO-level tasks.
Analysis of the Verifier
The verifier is highly reliable in detecting critical errors, with rare misses that are mitigated by repeated runs. Justification gaps are sometimes over-reported, but the system is robust to such false positives due to the review and correction steps. The pipeline’s acceptance criterion (five consecutive passes) provides strong empirical assurance of solution validity.
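To make the intuition behind the five-pass criterion explicit: under the idealizing assumption (not stated formally in the paper) that verification runs miss a critical error independently with per-run probability p, the chance of accepting a flawed solution decays geometrically with the number of required passes.

```latex
% Idealized error analysis of the five-pass acceptance rule; independence and the
% value of p are assumptions for illustration, not claims from the paper.
\[
  \Pr[\text{flawed solution accepted}] \;\le\; p^{5},
  \qquad \text{e.g. } p = 0.2 \;\Rightarrow\; p^{5} = 3.2\times 10^{-4},
\]
% where $p$ is the per-run probability that the verifier misses a critical error.
```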
Implementation Considerations
- Token Budget: The per-call thinking budget is a hard constraint; multi-stage refinement is necessary for proofs that exceed the model's single-pass capacity.
- Parallelization: Multiple solution candidates can be generated and refined in parallel, increasing the probability of success.
- Verifier Robustness: The iterative loop and optional human review mitigate both false positives and false negatives from the verifier.
- Prompt Design: Explicit, structured prompts are critical for both the solver and verifier roles, enforcing output discipline and reducing hallucinations.
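As an illustration of the parallelization point, independent pipeline runs can be launched concurrently and the first accepted proof returned. The sketch below assumes a single-candidate pipeline function like the hypothetical `solve` above (passed in as `solve_fn`); the candidate count is illustrative, and a thread pool suffices because the calls are I/O-bound.

```python
# Illustrative parallel sampling over independent pipeline runs (not the authors'
# code). `solve_fn` stands in for the hypothetical single-candidate pipeline
# sketched earlier.
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional


def solve_with_candidates(
    problem: str,
    solve_fn: Callable[[str], Optional[str]],
    num_candidates: int = 8,
) -> Optional[str]:
    """Run several independent pipeline instances; return the first accepted proof."""
    with ThreadPoolExecutor(max_workers=num_candidates) as pool:
        futures = [pool.submit(solve_fn, problem) for _ in range(num_candidates)]
        for future in as_completed(futures):
            solution = future.result()
            if solution is not None:   # first candidate to pass five consecutive checks
                return solution        # remaining runs finish when the executor shuts down
    return None                        # no candidate was accepted
```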
Theoretical and Practical Implications
Theoretical: The results demonstrate that, with appropriate system-level scaffolding, current LLMs can achieve performance on par with IMO gold medalists on genuinely unseen problems. This narrows the gap between numerical answer generation and the construction of logically sound, human-verifiable proofs.
Practical: The pipeline architecture is directly applicable to other domains requiring formal reasoning, such as scientific discovery, formal verification, and theorem proving. The modularity of the approach allows for integration with multi-agent systems, ensemble methods, or hybrid human-in-the-loop workflows.
Limitations: The approach is computationally intensive, requiring multiple iterations and substantial token usage. The failure on Problem 6 highlights that certain combinatorial or construction-based problems remain challenging, especially when the model’s initial abstraction is fundamentally flawed.
Future Directions
- Model Diversity: Incorporating multiple LLMs (e.g., Grok 4, OpenAI's o-series) in a multi-agent system could further improve coverage and solution quality.
- Solution Aggregation: Combining partial solutions or leveraging ensemble voting may address cases where no single model instance produces a complete proof.
- Verifier Enhancement: Training specialized verifiers or integrating formal proof assistants could further increase reliability.
- Scaling: Efficient batching, parallelization, and adaptive sampling strategies are necessary for practical deployment at scale.
Conclusion
This work demonstrates that a structured, iterative, verifier-guided pipeline enables a general-purpose LLM (Gemini 2.5 Pro) to solve 5 out of 6 IMO 2025 problems with mathematically rigorous proofs, on genuinely unseen data. The results underscore the necessity of system-level scaffolding—beyond raw model capability—for complex mathematical reasoning. The approach provides a blueprint for deploying LLMs in high-stakes, formal reasoning domains, while also highlighting current limitations and avenues for further research.
Follow-up Questions
- How does the verifier-guided pipeline enhance the reliability of generated mathematical proofs?
- What specific challenges arise when scaling the iterative self-improvement strategy for complex reasoning tasks?
- In what ways does Gemini 2.5 Pro outperform or differ from other LLMs in solving IMO-level problems?
- How can the pipeline be modified to better address issues such as flawed combinatorial reasoning in complex problems?