- The paper introduces AlphaGeometry2, a neuro-symbolic system that surpasses average gold medalist performance on IMO geometry problems.
- It expands the domain-specific language by incorporating new predicates and automated diagram generation using a two-stage optimization approach.
- Innovations include the Shared Knowledge Ensemble of Search Trees (SKEST) search algorithm, an enhanced DDAR symbolic engine, and a sparse mixture-of-experts Transformer language model; together these solve 42 out of 50 IMO geometry problems.
The paper introduces AlphaGeometry2 (AG2), an enhanced neuro-symbolic system for solving Olympiad geometry problems. AG2 builds upon its predecessor, AlphaGeometry (AG1), and achieves a substantial performance boost, surpassing the average gold medalist in solving International Mathematical Olympiad (IMO) geometry problems from 2000-2024.
A key improvement in AG2 is an expanded domain-specific language. The original AG1 language had limitations in addressing problems involving movements of objects, linear equations of angles/ratios/distances, and "find the angle..." type questions. To address these limitations, AG2 introduces new predicates:
{acompute a b c d}
: Find the angle between AB and CD.
- a: Point A.
- b: Point B.
- c: Point C.
- d: Point D.
{rcompute a b c d}
: Find the ratio AB/CD.
- a: Point A.
- b: Point B.
- c: Point C.
- d: Point D.
{distmeq a1 b1 a2 b2 ... an bn t1 t2 ... tn y}
: t1·log(A1B1) + t2·log(A2B2) + ... + tn·log(AnBn) + y = 0.
- ai: Point Ai.
- bi: Point Bi.
- ti: Coefficient for the log-distance between Ai and Bi.
- y: Constant term in the equation.
{distseq a1 b1 a2 b2 ... an bn t1 t2 ... tn}
: t1·A1B1 + t2·A2B2 + ... + tn·AnBn = 0.
- ai: Point Ai.
- bi: Point Bi.
- ti: Coefficient for the distance between Ai and Bi.
{angeq a1 b1 a2 b2 ... an bn t1 t2 ... tn y}
: t1·d(A1B1) + t2·d(A2B2) + ... + tn·d(AnBn) + y = 0, where d(AB) is the angle between the undirected line AB and the horizontal line.
- ai: Point Ai.
- bi: Point Bi.
- ti: Coefficient for the directed angle of line AiBi.
- y: Constant term in the equation.
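As a concrete illustration (the numbers and encoding here are our own, not taken from the paper), a ratio constraint such as AB/CD = 2 fits the distmeq form with coefficients t1 = 1, t2 = -1 and constant y = -log 2:

```python
import math

# Hypothetical example: encode AB/CD = 2 as a distmeq row, i.e.
#   1*log(AB) + (-1)*log(CD) + y = 0   with   y = -log(2).
AB, CD = 6.0, 3.0                       # sample lengths with AB/CD = 2
t1, t2, y = 1.0, -1.0, -math.log(2.0)   # distmeq coefficients
residual = t1 * math.log(AB) + t2 * math.log(CD) + y

# residual is 0 (up to floating point) exactly when AB/CD = 2 holds
```

Working in log-distances is what makes multiplicative ratio facts linear, so they can be combined by the same linear-algebra machinery as the other equation predicates.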
AG2 introduces syntax to handle locus problems, which involve the movement of geometric objects. The syntax utilizes a placeholder token {*}
to denote the fixed-point, allowing AG2 to reason about geometric constraints that hold true as points, lines, or circles move according to certain rules. AG2 also incorporates predicates for diagram checks, ensuring topological consistency and non-degeneracy:
{sameclock a b c d e f}
: Triangles A→B→C and D→E→F have the same orientation (both clockwise or both counterclockwise).
- a: Point A.
- b: Point B.
- c: Point C.
- d: Point D.
- e: Point E.
- f: Point F.
{noverlap a b}
: Points A and B are distinct.
- a: Point A.
- b: Point B.
{lessthan a b c d}
: AB<CD.
- a: Point A.
- b: Point B.
- c: Point C.
- d: Point D.
{overlap a b}
: Points A and B coincide (have the same coordinates).
- a: Point A.
- b: Point B.
{cyclic_with_center a1 a2 ... an x}
: Points a1, a2, ..., ax all coincide at the center of the circle passing through ax+1, ..., an.
- ai: Point Ai.
- x: Index up to which points Ai are the same (center of the circle).
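The diagram checks above are numeric rather than symbolic. For instance, a plausible implementation of the sameclock predicate (this sketch is our own, not from the paper) compares the signs of two cross products:

```python
# Orientation of A -> B -> C: sign of the cross product (B - A) x (C - A).
# Positive means counterclockwise, negative means clockwise.
def orientation(a, b, c):
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

# sameclock holds when both point triples have the same orientation.
def sameclock(a, b, c, d, e, f):
    return orientation(a, b, c) * orientation(d, e, f) > 0

# Both triangles below are counterclockwise, so the check passes.
sameclock((0, 0), (1, 0), (0, 1), (2, 2), (3, 2), (2, 3))
```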
Furthermore, AG2 relaxes an AG1 constraint that required each point to be defined by at most two predicates. In AG2, points can be defined by three or more predicates, enabling the system to handle non-constructive problems where diagram construction is not straightforward. Together, these changes improve the domain language's coverage of IMO geometry problems from 2000 to 2024 from 66% to 88%.
The paper also discusses automating problem formalization and diagram generation. AG2 employs Gemini to translate problems from natural language into the AlphaGeometry language using a few-shot prompting approach. For automated diagram generation, the approach builds upon existing methodologies, using a two-stage optimization method involving ADAM gradient descent and the Gauss-Newton-Levenberg method to find numerical solutions for the system of non-linear equations that define the geometric constraints. The system was able to generate diagrams for 41 out of 44 IMO problems formalized in the AG language.
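The two-stage idea can be sketched on a toy constraint system. Everything below is an illustrative stand-in, not AG2's solver: the problem, the plain gradient-descent first stage (in place of ADAM), and the undamped Gauss-Newton second stage are our own simplifications.

```python
# Toy diagram problem: place point P = (x, y) at unit distance from
# both A = (0, 0) and B = (1, 0), i.e. solve a small nonlinear system.
def residuals(x, y):
    r1 = x * x + y * y - 1.0            # |PA| = 1
    r2 = (x - 1.0) ** 2 + y * y - 1.0   # |PB| = 1
    return r1, r2

def jacobian(x, y):
    return [[2 * x, 2 * y], [2 * (x - 1.0), 2 * y]]

# Stage 1: gradient descent on the sum of squared residuals
# (standing in for the ADAM-based first stage).
x, y = 0.3, 0.4
for _ in range(200):
    r1, r2 = residuals(x, y)
    (j11, j12), (j21, j22) = jacobian(x, y)
    x -= 0.05 * 2 * (r1 * j11 + r2 * j21)
    y -= 0.05 * 2 * (r1 * j12 + r2 * j22)

# Stage 2: Gauss-Newton steps (the Levenberg damping is dropped here
# for brevity) polish the solution to high precision.
for _ in range(10):
    r1, r2 = residuals(x, y)
    (a, b), (c, d) = jacobian(x, y)
    det = a * d - b * c                 # solve J * delta = r exactly
    x -= (d * r1 - b * r2) / det
    y -= (a * r2 - c * r1) / det

# converges to P = (1/2, sqrt(3)/2)
```

The division of labor mirrors the description above: a cheap first-order stage finds the basin of a valid configuration, and a second-order stage drives the constraint residuals to (near) zero.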
The Deductive Database Arithmetic Reasoning (DDAR) symbolic engine in AG2 has been enhanced with three main improvements: the ability to handle double points, a faster algorithm, and a faster implementation.
- Handling double points: AG2 can handle scenarios where two points with different names have the same coordinates. This is achieved by constructing a new point, proving its equivalence to an existing point, and then using this equivalence to deduce new facts.
- Faster algorithm: Essential rules are hard-coded to reduce query times for the Arithmetic Reasoning (AR) sub-engine to at most cubic complexity. Explicit rules for angles and distances are discarded, with all deductions handled automatically in the AR engine. An improved algorithm (DDAR2) is designed for identifying similar triangles and cyclic quadrilaterals.
- Faster implementation: The core computation (Gaussian Elimination) is implemented in C++, resulting in a significant speedup compared to AG1.
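The AR sub-engine's core operation can be illustrated with a small sketch (our own, not AG2's actual C++ implementation): angle facts become exact linear equations over per-line direction variables, and Gaussian elimination derives new facts.

```python
from fractions import Fraction

# Columns: d_AB, d_CD, d_EF (line directions as fractions of pi), rhs.
# Known facts (hypothetical): d_AB - d_CD = 1/3 (a 60-degree angle)
# and d_CD - d_EF = 1/6 (a 30-degree angle).
rows = [
    [Fraction(1), Fraction(-1), Fraction(0), Fraction(1, 3)],
    [Fraction(0), Fraction(1), Fraction(-1), Fraction(1, 6)],
]

def rref(m):
    """Reduced row echelon form with exact rational arithmetic."""
    m = [row[:] for row in m]
    pivot = 0
    for col in range(len(m[0]) - 1):
        for r in range(pivot, len(m)):
            if m[r][col] != 0:
                m[pivot], m[r] = m[r], m[pivot]
                break
        else:
            continue          # no pivot in this column
        inv = m[pivot][col]
        m[pivot] = [v / inv for v in m[pivot]]
        for r in range(len(m)):
            if r != pivot and m[r][col] != 0:
                f = m[r][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[pivot])]
        pivot += 1
        if pivot == len(m):
            break
    return m

m = rref(rows)
# The first row now reads d_AB - d_EF = 1/2: the angle between lines
# AB and EF is pi/2, i.e. AB is perpendicular to EF -- a derived fact.
```

Exact rational arithmetic avoids the false equalities that floating-point elimination could introduce, which matters when derived rows are read back as proof facts.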
AG2 employs synthetic data generation. Compared to AG1, AG2 explores random diagrams twice as large, produces theorems up to 2x more complex and proofs up to 10x more complex, and yields a more balanced data distribution both across question types and between problems with and without auxiliary points. The data generation algorithm also produces problems of the "locus" type. To achieve this, the movement dependency for each point is recorded during random diagram generation through a function P(A), defined as the set of points that control the movements of A.
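A schematic sketch of recording P(A) (the bookkeeping details here are assumptions): a free point is controlled only by itself, and a constructed point inherits the controllers of its defining parents.

```python
# deps[name] holds P(name): the set of free points controlling `name`.
deps = {}

def add_point(name, parents):
    if not parents:
        deps[name] = {name}       # free point: controls itself
    else:
        p = set()
        for q in parents:
            p |= deps[q]          # union of parents' controllers
        deps[name] = p

add_point("A", [])                # free point
add_point("B", [])                # free point
add_point("M", ["A", "B"])        # e.g. midpoint of A and B
add_point("X", ["M", "B"])        # constructed from M and B
# P(X) = {A, B}: moving either free point moves X, which is exactly
# the information a locus-type problem statement needs.
```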
AG2 features a novel search algorithm called Shared Knowledge Ensemble of Search Trees (SKEST). SKEST involves running several differently configured beam searches in parallel, allowing them to share knowledge through a shared facts database. The following types of search trees are used to make sure different parts of the search space are explored effectively:
- "Classic" search tree: The same beam search used in AG1, where an LLM produces one auxiliary point at each node.
- Tree predicting multiple auxiliary points at each node: An LLM is allowed to produce as many auxiliary points as it wants at each tree node.
- Tree predicting different types of aux points uniformly.
- Deep but narrow tree (e.g. beam size 64 and depth 10).
- Shallow but wide tree (e.g. beam size 512 and depth 4).
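A much-simplified, purely schematic picture of SKEST (the tree configurations come from the list above; everything else is an invented stand-in): differently configured trees expand in turn, and every proved fact lands in a database visible to all trees.

```python
# One database of proved facts, shared by all search trees.
shared_facts = set()

class SearchTree:
    def __init__(self, name, beam_size, depth):
        self.name, self.beam_size, self.depth = name, beam_size, depth

    def expand(self, step):
        # Stand-in for "try an LLM-proposed construction and run DDAR";
        # a real node would add the facts DDAR proves from it.
        shared_facts.add(f"{self.name}-fact-{step}")
        # Facts proved by the *other* trees are immediately visible here,
        # so each tree searches with the ensemble's combined knowledge.
        return len(shared_facts)

trees = [
    SearchTree("classic", beam_size=128, depth=4),
    SearchTree("deep-narrow", beam_size=64, depth=10),
    SearchTree("shallow-wide", beam_size=512, depth=4),
]
for step in range(2):
    for tree in trees:
        tree.expand(step)
```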
The final key improvement in AG2 is a new LLM built on Gemini: a sparse mixture-of-experts Transformer-based model. The LLMs are trained either with a custom tokenizer directly in the domain-specific language, by fine-tuning pre-trained math-specialized Gemini models in natural language, or by multimodal training with an additional image input. During inference, AG2 enriches the neuro-symbolic interface by letting the LLM know which deductions DDAR has already made before it proposes auxiliary constructions. The LLM receives the original problem statement together with an "analysis string" containing three sets of facts: those deducible by DDAR from the problem premises alone, those deducible when the goal predicate is additionally assumed true, and those that hold numerically.
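A minimal sketch of assembling such a prompt (the field labels, separators, and predicate strings below are invented for illustration; the paper does not specify the serialization format):

```python
# Hypothetical serialization of the neuro-symbolic interface: the LLM
# sees the problem plus three labeled groups of DDAR-derived facts.
def build_analysis_prompt(problem, from_premises, assuming_goal, numeric):
    sections = [
        ("problem", [problem]),
        ("deductions_from_premises", from_premises),
        ("deductions_assuming_goal", assuming_goal),
        ("numerically_true", numeric),
    ]
    return " ! ".join(
        name + ": " + " ; ".join(items) for name, items in sections
    )

prompt = build_analysis_prompt(
    "prove coll(A, B, C)",
    ["eqangle(A, B, A, C, D, E, D, F)"],   # provable from premises
    ["cyclic(A, B, C, D)"],                # provable if goal assumed
    ["cong(A, B, C, D)"],                  # holds in the numeric diagram
)
```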
The paper presents results demonstrating AG2's performance on IMO geometry. AG2 solves 42 of the 50 IMO geometry problems from 2000-2024. Ablation studies on inference settings reveal that the optimal configuration for a single search tree is a beam size of 128, a beam depth of 4, and 32 samples. Additional evaluations on 30 of the hardest IMO shortlist problems formalizable in the AG2 language show that AG2 solves 20 of the 30.
The paper concludes by discussing future work: expanding the domain language to cover a variable number of points, non-linear equations, and problems involving inequalities; breaking problems into subproblems and applying reinforcement learning; and improving auto-formalization.