
Gemini 2.5 Pro Capable of Winning Gold at IMO 2025 (2507.15855v3)

Published 21 Jul 2025 in cs.AI

Abstract: The International Mathematical Olympiad (IMO) poses uniquely challenging problems requiring deep insight, creativity, and formal reasoning. While LLMs perform well on mathematical benchmarks like AIME, they struggle with Olympiad-level tasks. We use Google's Gemini 2.5 Pro on the newly released IMO 2025 problems, avoiding data contamination. Using a self-verification pipeline with careful prompt design, 5 (out of 6) problems are solved correctly. This result underscores the importance of developing optimal strategies to harness the full potential of powerful LLMs for complex reasoning tasks.

Summary

  • The paper introduces an iterative verifier-guided pipeline that generates complete, mathematically rigorous proofs for IMO-level problems.
  • The methodology employs multi-stage self-improvement and verification to work within the model's per-response token budget and minimize logical errors.
  • Empirical results show that Gemini 2.5 Pro solved 5 of 6 IMO 2025 problems, highlighting both its potential and current challenges in automated reasoning.

Gemini 2.5 Pro and the IMO 2025: A Structured Pipeline for Automated Mathematical Reasoning

Introduction and Motivation

The International Mathematical Olympiad (IMO) is a rigorous benchmark for evaluating the advanced reasoning capabilities of AI systems, particularly LLMs. Unlike standard mathematical benchmarks such as GSM8K or MATH, IMO problems require deep abstraction, multi-step logical deduction, and creative proof construction. Prior work has shown that even state-of-the-art LLMs struggle to produce rigorous, error-free proofs for Olympiad-level problems, often succumbing to logical fallacies or superficial heuristics. This paper investigates whether the latent capabilities of a leading LLM—Google’s Gemini 2.5 Pro—can be harnessed to solve the newly released IMO 2025 problems, using a carefully engineered, iterative self-verification pipeline.

Pipeline Architecture and Methodology

The core contribution is a multi-stage, verifier-guided pipeline designed to elicit rigorous, complete solutions from a general-purpose LLM. The pipeline consists of the following steps:

  1. Initial Solution Generation: The model is prompted to produce a detailed, step-by-step proof, with explicit instructions to prioritize rigor and to avoid unjustified leaps or guesses. The prompt enforces a strict output format, including a summary, method sketch, and a fully detailed solution in TeX.
  2. Self-Improvement: The model is instructed to review and refine its own output, leveraging an additional token budget. This step is critical because Gemini 2.5 Pro's per-response output (thinking) budget of 32,768 tokens is often insufficient for a complete IMO proof in a single pass.
  3. Verifier Loop: A separate verifier prompt is used to scrutinize the solution, classifying issues as either critical errors (logical/factual mistakes) or justification gaps (insufficiently justified steps). The verifier produces a bug report, which is then used to guide further refinement.
  4. Iterative Correction: The model iteratively corrects its solution based on the bug report, with optional human review of the bug report to filter out spurious findings. This loop continues until the solution passes five consecutive verification runs or is rejected after persistent major errors.
  5. Acceptance Criteria: A solution is accepted only if it passes the verifier five times consecutively, ensuring robustness against both model and verifier errors.

This architecture is designed to overcome the limitations of single-pass LLM outputs, particularly the context window constraint and the tendency to overlook subtle logical errors.
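
The loop structure described above can be made concrete with a short sketch. This is an illustrative reconstruction, not the authors' code: the prompt strings, the generate/verify wrappers, and the iteration cap are assumptions.

```python
# Illustrative sketch of the verifier-guided pipeline; prompts, wrappers,
# and MAX_ITERATIONS are assumptions, not the paper's exact implementation.

REQUIRED_CONSECUTIVE_PASSES = 5  # acceptance criterion from the paper
MAX_ITERATIONS = 30              # hypothetical cap on correction rounds

def solve_with_verification(problem, generate, verify):
    """generate(prompt) -> str and verify(problem, solution) -> (ok, bug_report)
    stand in for calls to the solver and verifier prompts."""
    # Step 1: initial solution generation with an explicit rigor instruction.
    solution = generate(f"Solve rigorously, step by step, in TeX:\n{problem}")
    # Step 2: self-improvement pass using an additional token budget.
    solution = generate(f"Review and refine this proof:\n{solution}")

    consecutive_passes = 0
    for _ in range(MAX_ITERATIONS):
        # Steps 3-4: verifier loop producing a bug report, then correction.
        ok, bug_report = verify(problem, solution)
        if ok:
            consecutive_passes += 1
            if consecutive_passes == REQUIRED_CONSECUTIVE_PASSES:
                return solution  # Step 5: accepted after five straight passes
            continue
        consecutive_passes = 0  # any failed check resets the streak
        solution = generate(
            f"Fix these issues:\n{bug_report}\n\nProof:\n{solution}"
        )
    return None  # rejected after persistent major errors
```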

Experimental Setup

  • Model: Gemini 2.5 Pro, with a maximum output (thinking) budget of 32,768 tokens per generation.
  • Temperature: Set to 0.1 to minimize stochasticity and random errors.
  • No External Tools: No web search, code execution, or external verification tools are used.
  • Prompt Engineering: Prompts are meticulously crafted to enforce mathematical rigor and explicit self-correction.

To avoid data contamination, only the newly released IMO 2025 problems are used, ensuring that the evaluation is on genuinely unseen data.
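
For concreteness, a minimal sketch of this decoding setup using the google-generativeai Python SDK follows; the model identifier and the use of the 32,768-token figure as the output limit are assumptions layered on what the paper reports.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential

# Model identifier is an assumption; temperature 0.1 and the 32,768-token
# output budget follow the setup described above. No tools are attached.
model = genai.GenerativeModel("gemini-2.5-pro")
response = model.generate_content(
    "Provide a rigorous, step-by-step proof of the following problem: ...",
    generation_config=genai.GenerationConfig(
        temperature=0.1,
        max_output_tokens=32768,
    ),
)
print(response.text)
```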

Results and Empirical Findings

The pipeline successfully generated complete, mathematically rigorous solutions for 5 out of the 6 IMO 2025 problems. Notably:

  • Problems 3, 4, 5: Solved without any problem-specific hints.
  • Problems 1, 2: Solved both with and without general strategic hints (e.g., "try induction" or "try analytic geometry"). Hint-free runs required more sampling, supporting the claim that hints improve sampling efficiency rather than unlock new reasoning abilities.
  • Problem 6: The pipeline failed to produce a correct solution. The model’s proof contained a critical error in the combinatorial argument, specifically an incorrect assumption about the partitioning of tiles, which was not repairable through iterative refinement.

Key empirical claim: The iterative, verifier-guided pipeline is essential for converting the latent mathematical reasoning capabilities of LLMs into rigorous, trustworthy proofs. Direct, single-pass generation—even with a strong model—remains insufficient for IMO-level tasks.

Analysis of the Verifier

The verifier is highly reliable in detecting critical errors, with rare misses that are mitigated by repeated runs. Justification gaps are sometimes over-reported, but the system is robust to such false positives due to the review and correction steps. The pipeline’s acceptance criterion (five consecutive passes) provides strong empirical assurance of solution validity.
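
A rough, illustrative calculation of why the streak requirement helps (the per-run miss probability below is an assumed value, not one reported in the paper): if a flawed solution slips past a single verifier run with probability $p$ and runs are treated as independent, it survives five consecutive checks with probability

$$p^{5}, \qquad \text{e.g. } p = 0.2 \;\Rightarrow\; p^{5} = 3.2 \times 10^{-4},$$

so even a verifier with a noticeable single-run miss rate yields a very low false-accept rate under the five-pass criterion.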

Implementation Considerations

  • Token Budget: The per-pass token budget is a hard constraint; multi-stage refinement is necessary for proofs that exceed the model's single-pass capacity.
  • Parallelization: Multiple solution candidates can be generated and refined in parallel, increasing the probability of success; a sketch follows this list.
  • Verifier Robustness: The iterative loop and optional human review mitigate both false positives and false negatives from the verifier.
  • Prompt Design: Explicit, structured prompts are critical for both the solver and verifier roles, enforcing output discipline and reducing hallucinations.
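
As referenced in the Parallelization item, here is a sketch of running independent candidates concurrently; the candidate count and the run_pipeline stub are hypothetical stand-ins for one full solve-and-verify run.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

NUM_CANDIDATES = 8  # hypothetical; more candidates raise the success odds

def run_pipeline(problem, seed):
    """Stand-in for one full generate/refine/verify run (see earlier sketch)."""

def solve_in_parallel(problem):
    # Launch independent pipeline instances; return the first accepted proof.
    with ThreadPoolExecutor(max_workers=NUM_CANDIDATES) as pool:
        futures = [pool.submit(run_pipeline, problem, seed)
                   for seed in range(NUM_CANDIDATES)]
        for fut in as_completed(futures):
            solution = fut.result()
            if solution is not None:
                return solution  # first candidate to pass five straight checks
    return None
```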

Theoretical and Practical Implications

Theoretical: The results demonstrate that, with appropriate system-level scaffolding, current LLMs can achieve performance on par with IMO gold medalists on genuinely unseen problems. This narrows the gap between numerical answer generation and the construction of logically sound, human-verifiable proofs.

Practical: The pipeline architecture is directly applicable to other domains requiring formal reasoning, such as scientific discovery, formal verification, and theorem proving. The modularity of the approach allows for integration with multi-agent systems, ensemble methods, or hybrid human-in-the-loop workflows.

Limitations: The approach is computationally intensive, requiring multiple iterations and substantial token usage. The failure on Problem 6 highlights that certain combinatorial or construction-based problems remain challenging, especially when the model’s initial abstraction is fundamentally flawed.

Future Directions

  • Model Diversity: Incorporating multiple LLMs (e.g., Grok 4, OpenAI-o series) in a multi-agent system could further improve coverage and solution quality.
  • Solution Aggregation: Combining partial solutions or leveraging ensemble voting may address cases where no single model instance produces a complete proof.
  • Verifier Enhancement: Training specialized verifiers or integrating formal proof assistants could further increase reliability.
  • Scaling: Efficient batching, parallelization, and adaptive sampling strategies are necessary for practical deployment at scale.

Conclusion

This work demonstrates that a structured, iterative, verifier-guided pipeline enables a general-purpose LLM (Gemini 2.5 Pro) to solve 5 out of 6 IMO 2025 problems with mathematically rigorous proofs, on genuinely unseen data. The results underscore the necessity of system-level scaffolding—beyond raw model capability—for complex mathematical reasoning. The approach provides a blueprint for deploying LLMs in high-stakes, formal reasoning domains, while also highlighting current limitations and avenues for further research.
