InternGeometry: Agent-Based Olympiad Solver

Updated 4 July 2026

InternGeometry is a system that uses an LLM agent integrated with a symbolic engine and dynamic memory to address olympiad-level geometry challenges.
It iteratively proposes auxiliary constructions and propositions, using long-horizon feedback to refine geometric proofs and overcome weak local cues.
Trained with curriculum-based reinforcement learning, it reports state-of-the-art results on IMO geometry benchmarks with comparatively little training data, and ablation studies attribute the gains to its individual components.

InternGeometry is an LLM agent for olympiad geometry that is designed around iterative interaction with a symbolic engine, long-horizon memory, and reinforcement learning over synthesized problems of increasing difficulty. It is presented as the first medalist-level LLM agent for geometry, with the stated aim of addressing a domain in which progress had remained dominated by expert systems because auxiliary constructions are hard to discover and useful local heuristics are weak (Zhao et al., 11 Dec 2025).

1. Definition and scope

InternGeometry is introduced as an agentic system for International Mathematical Olympiad geometry problems, built on InternThinker-32B and coupled to a symbolic geometry engine called InternGeometry-DDAR (Zhao et al., 11 Dec 2025). Its target problem class is olympiad geometry proof generation in settings where the main obstacle is not merely chaining known theorems, but identifying useful auxiliary constructions such as extra points, lines, circles, or symmetric configurations.

The paper frames geometry as unusually difficult for general LLM agents because auxiliary constructions often have weak local cues, the search space is highly underdetermined, and complete solutions may require long exploratory trajectories. This motivates a design in which the model does not attempt to solve a problem in a single pass. Instead, it repeatedly proposes intermediate propositions and auxiliary constructions, submits them to a symbolic engine for verification, and reflects on the engine’s feedback before choosing the next action (Zhao et al., 11 Dec 2025).

A central practical claim is that this interaction process is genuinely long-horizon. The system is described as supporting more than two hundred interactions with the symbolic engine for a single problem through a dynamic memory mechanism that compresses prior history while preserving key actions and recent feedback (Zhao et al., 11 Dec 2025).

2. Core architecture

InternGeometry is organized around three components: the agent $\mathbb{G}$, the symbolic engine $\mathfrak{E}$, and a dynamic memory module $\mathfrak{W}$ (Zhao et al., 11 Dec 2025). For a geometry problem $X$, at step $t$ the agent takes the problem and the compressed history representation $\mathfrak{W}(H_{t-1})$, produces natural-language reasoning $P_t$ and a formal action $A_t$, and sends that action to the symbolic engine. The engine executes the action, returns feedback $O_t$, and updates the environment state. The interaction history is then extended by $[P_t, A_t, O_t]$ (Zhao et al., 11 Dec 2025).
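The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToyEngine`, `run_episode`, the `"solved"` feedback string, and the callable signatures for the agent and the memory function are hypothetical stand-ins, not the paper's interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str, str]  # (reasoning P_t, action A_t, feedback O_t)


@dataclass
class ToyEngine:
    """Hypothetical stand-in for the symbolic engine (not InternGeometry-DDAR)."""
    goal: str
    executed: List[str] = field(default_factory=list)

    def step(self, action: str) -> str:
        # Update the environment state, then report feedback O_t.
        self.executed.append(action)
        return "solved" if action == self.goal else "ok"


def run_episode(problem: str,
                agent: Callable[[str, object], Tuple[str, str]],
                engine: ToyEngine,
                compress: Callable[[List[Turn]], object],
                max_steps: int = 200) -> List[Turn]:
    """Alternate agent proposals and engine feedback until solved or out of steps."""
    history: List[Turn] = []
    for _ in range(max_steps):
        # The agent sees the problem X and the compressed history W(H_{t-1}).
        reasoning, action = agent(problem, compress(history))
        feedback = engine.step(action)
        # Extend the history by [P_t, A_t, O_t].
        history.append((reasoning, action, feedback))
        if feedback == "solved":
            break
    return history
```

The key structural point is that the agent never receives the raw history, only whatever the memory function returns for it.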

The formal protocol uses a domain-specific language for geometry. The appendix describes three interaction types with InternGeometry-DDAR: obtaining the initial state through a <build> action, adding auxiliary constructions through an <add> action, and proposing proof steps or propositions through a <propose> action (Zhao et al., 11 Dec 2025). This separation is important because the paper treats successful geometry solving as an alternation between discovering consequences of the current configuration and enriching the configuration itself.
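As a rough illustration of how such tagged actions might be pulled out of a model response, the following sketch assumes that each action is wrapped in a matching open and close tag (for example `<propose>...</propose>`). The closing-tag convention, the `Action` type, and `parse_action` are assumptions made for this example; the paper's DSL syntax may differ.

```python
import re
from typing import NamedTuple, Optional


class Action(NamedTuple):
    kind: str  # "build", "add", or "propose"
    body: str  # DSL text between the tags


# Assumed wire format: <kind>body</kind>; the real protocol may differ.
_ACTION_RE = re.compile(r"<(build|add|propose)>(.*?)</\1>", re.DOTALL)


def parse_action(model_output: str) -> Optional[Action]:
    """Return the last well-formed tagged action, or None if there is none."""
    matches = _ACTION_RE.findall(model_output)
    if not matches:
        return None
    kind, body = matches[-1]
    return Action(kind, body.strip())
```

For instance, `parse_action("check coincidence <propose>idc x y</propose>")` yields an action of kind `propose` with body `idc x y`, while a response with no tagged action yields `None`.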

The symbolic component, InternGeometry-DDAR, is built on the open-source Newclid DDAR system and combines a deductive database with algebraic reasoning implemented through Gaussian elimination (Zhao et al., 11 Dec 2025). The engine is described as maintaining the evolving geometric configuration, the auxiliary objects added so far, and the propositions already proved. The paper reports several extensions beyond the base open-source engine, including dynamic diagram adjustment by gradient descent, support for double points via the predicate idc x y, and an expanded theorem library that includes Power of a Point and Menelaus’ theorem (Zhao et al., 11 Dec 2025).
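To make the algebraic-reasoning idea concrete, here is a generic sketch of deciding whether a linear relation follows from known linear relations by Gaussian elimination over exact rationals. It is not Newclid's or InternGeometry-DDAR's implementation; the row encoding (each row holds the coefficients of a relation that equals zero) and the function names `rank` and `implied` are assumptions chosen for illustration.

```python
from fractions import Fraction
from typing import List, Sequence


def rank(rows: Sequence[Sequence[int]]) -> int:
    """Rank of a coefficient matrix via Gauss-Jordan elimination (exact arithmetic)."""
    m: List[List[Fraction]] = [[Fraction(v) for v in row] for row in rows]
    ncols = len(m[0]) if m else 0
    r = 0  # number of pivots found so far
    for c in range(ncols):
        if r == len(m):
            break
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                factor = m[i][c] / m[r][c]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def implied(known: Sequence[Sequence[int]], target: Sequence[int]) -> bool:
    """True if `target` is a linear combination of the `known` relations."""
    return rank(known) == rank(list(known) + [list(target)])
```

With columns standing for three quantities a, b, c, the known relations `a - b = 0` and `b - c = 0` are the rows `[1, -1, 0]` and `[0, 1, -1]`; `implied` then confirms `a - c = 0` (row `[1, 0, -1]`) and rejects `a + c = 0` (row `[1, 0, 1]`).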

A notable design choice is that the agent is not limited to auxiliary-construction search. Proposition proposal is treated as equally central. This gives the agent a way to probe what is already derivable, expose hidden structure, and use successful subproofs as signals for future construction choices. The paper’s ablations identify this as a major contributor to performance (Zhao et al., 11 Dec 2025).

3. Interaction loop, memory, and search control

InternGeometry’s inference process is explicitly feedback-driven. After each formal action, the symbolic engine indicates whether a proposition is provable or whether a proposed construction is valid, and the agent uses that information to revise its search. The paper characterizes this as a geometry analogue of human trial-and-error exploration: ideas are tested, rejected, refined, and reused as new facts accumulate (Zhao et al., 11 Dec 2025).

The dynamic memory mechanism is introduced to make long interaction histories tractable. Rather than feeding the full history back into the model, $\mathfrak{W}$ summarizes earlier exchanges, preserves core actions and key environment feedback, and keeps the most recent turn intact so that the current symbolic state remains explicit (Zhao et al., 11 Dec 2025). This memory design is presented as what enables more than two hundred agent–engine interactions on a single problem.
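One simple way to realize this kind of compression is sketched below. The paper describes the memory operationally rather than as a fixed algorithm, so everything here is an assumption: the function name `compress_history`, the choice to drop older reasoning text, and the truncation length are all hypothetical.

```python
from typing import List, Tuple

Turn = Tuple[str, str, str]  # (reasoning, action, feedback)


def compress_history(history: List[Turn],
                     keep_recent: int = 1,
                     max_feedback_chars: int = 80) -> str:
    """Summarize older turns as 'action -> feedback' and keep recent turns in full."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    older, recent = history[:-keep_recent], history[-keep_recent:]
    lines = []
    for _reasoning, action, feedback in older:
        # Older turns: drop the free-form reasoning, keep the action and key feedback.
        lines.append(f"{action} -> {feedback[:max_feedback_chars]}")
    for reasoning, action, feedback in recent:
        # Most recent turn(s): keep everything so the current state stays explicit.
        lines.append(f"[recent] reasoning: {reasoning}\naction: {action}\nfeedback: {feedback}")
    return "\n".join(lines)
```

Because older reasoning is discarded, the prompt grows by roughly one short line per past turn instead of by a full reasoning trace.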

To prevent repetitive failure modes, the system also uses prior-guided rejection sampling. A candidate pair $(P_t, A_t)$ is accepted only if it passes rule-based checks; otherwise it is resampled (Zhao et al., 11 Dec 2025). The paper states that these checks exclude repeated actions relative to history, excessively long thinking without termination, malformed or missing actions, and repeated use of the same action type over too many consecutive turns. This mechanism is intended to prevent action collapse, a phenomenon in which long-horizon agents degenerate into repetitive or unproductive behavior (Zhao et al., 11 Dec 2025).
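The four rule-based checks listed above can be sketched as a filter wrapped around the sampler. The thresholds, the character-count proxy for overly long thinking, and the names `passes_checks` and `sample_with_rejection` are assumptions for illustration; the paper does not give these values.

```python
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, str]        # (kind, body), e.g. ("add", "...")
Turn = Tuple[str, Action, str]  # (reasoning, action, feedback)


def passes_checks(reasoning: str,
                  action: Optional[Action],
                  history: List[Turn],
                  max_reasoning_chars: int = 8000,
                  max_same_kind_run: int = 5) -> bool:
    """Apply the rule-based checks to one candidate (reasoning, action) pair."""
    if action is None:                         # malformed or missing action
        return False
    if len(reasoning) > max_reasoning_chars:   # thinking that runs too long
        return False
    past = [a for _, a, _ in history]
    if action in past:                         # exact repeat of an earlier action
        return False
    tail = past[-max_same_kind_run:]
    if len(tail) == max_same_kind_run and all(a[0] == action[0] for a in tail):
        return False                           # same action type too many turns in a row
    return True


def sample_with_rejection(propose: Callable[[], Tuple[str, Optional[Action]]],
                          history: List[Turn],
                          max_tries: int = 8) -> Optional[Tuple[str, Action]]:
    """Resample until a candidate passes the checks or the retry budget runs out."""
    for _ in range(max_tries):
        reasoning, action = propose()
        if passes_checks(reasoning, action, history):
            return reasoning, action
    return None
```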

The paper further argues that interaction length is itself a scaling dimension. Increasing the number of allowed steps improves success rates, and, under a fixed total inference budget, lengthening trajectories is reported to be more effective than increasing the number of samples alone (Zhao et al., 11 Dec 2025). This claim aligns with the system’s geometry-specific premise that good heuristics may emerge only after extended interaction.

4. Complexity-Boosting Reinforcement Learning

The training pipeline consists of a supervised cold-start phase followed by Complexity-Boosting Reinforcement Learning, or CBRL (Zhao et al., 11 Dec 2025). In the supervised phase, the model is trained on examples $(X, \mathfrak{W}(H_{t-1}), [P_t, A_t])$, where $X$ is a geometry problem, $\mathfrak{W}(H_{t-1})$ is compressed history, and $[P_t, A_t]$ is the combined natural-language reasoning and formal action sequence. The loss is the standard autoregressive negative log-likelihood over the output tokens (Zhao et al., 11 Dec 2025).
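Written out with the notation of Section 2, a standard autoregressive negative log-likelihood takes the following form. The policy symbol $\pi_\theta$, the output sequence $Y = [P_t, A_t]$ with tokens $y_i$, and the subscript on the loss are notational assumptions introduced here, not symbols confirmed by the paper.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\sum_{i=1}^{|Y|} \log \pi_\theta\!\left(y_i \,\middle|\, X,\ \mathfrak{W}(H_{t-1}),\ y_{<i}\right),
  \qquad Y = [P_t, A_t]
```

Only the output tokens contribute to the sum; the problem and the compressed history act as conditioning context.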

The cold-start dataset is produced by first fine-tuning InternThinker-32B into an InternGeometry-Formalizer and then using it to convert large-scale natural-language geometry material into formal problem-and-solution trajectories. The paper reports 7K examples for cold start (Zhao et al., 11 Dec 2025).

CBRL is the system’s main training innovation. The paper defines problem complexity as the DDAR proof step count and argues that geometry learning is inefficient if tasks are either too easy or too difficult. It therefore synthesizes tasks at controllable complexity and adapts the complexity level during RL so that the policy is trained near its current capability frontier (Zhao et al., 11 Dec 2025).

The curriculum objective is to choose the complexity level that maximizes the expected absolute advantage under the current policy, while updating the model parameters $\theta$ to maximize the RL objective on tasks sampled at that complexity level (Zhao et al., 11 Dec 2025). For binary rewards, the appendix derives

$$\mathbb{E}\big[\,|A|\,\big] = 2p(1-p),$$

where $p$ is the success probability, and shows that this is maximized at $p = \tfrac{1}{2}$ (Zhao et al., 11 Dec 2025). This yields the paper’s central curriculum principle: the most useful training problems are those of moderate difficulty, not those that are almost always solved or almost never solved.
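The moderate-difficulty principle can be checked numerically. The sketch below assumes a binary reward and an advantage defined as the reward minus the mean reward (an unnormalized baseline), which gives an expected absolute advantage of p(1 - p) + (1 - p)p. The function names and the idea of selecting a level from a table of estimated success rates are hypothetical simplifications of the scheduling described above.

```python
from typing import Dict


def expected_abs_advantage(p: float) -> float:
    """E|r - p| for a Bernoulli(p) reward r: success and failure branches summed."""
    return p * abs(1.0 - p) + (1.0 - p) * abs(0.0 - p)


def pick_complexity(success_rate_by_level: Dict[int, float]) -> int:
    """Return the complexity level whose estimated success rate maximizes E|A|."""
    return max(success_rate_by_level,
               key=lambda level: expected_abs_advantage(success_rate_by_level[level]))
```

Given estimated success rates of 0.95, 0.55, and 0.05 at proof-step complexities 10, 20, and 40, `pick_complexity` selects level 20, the one closest to a 50% success rate.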

The reward is intentionally sparse and rule-computable. It is built from two binary terms: an outcome term that equals 1 if the full proof is complete and 0 otherwise, and a step term that equals 1 if the current step is effective, meaning that a proposition must actually be proved, and an auxiliary construction must both be successfully added and be used in the final proof (Zhao et al., 11 Dec 2025). This design ties credit directly to symbolic verification rather than to a learned reward model.
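The two rule-based signals can be written as small predicates. This sketch covers only the individual terms; how they are combined into a single reward is not reproduced here. The function names, the boolean inputs, and the choice to give a zero step reward to any action kind other than `propose` or `add` are assumptions.

```python
def outcome_reward(proof_complete: bool) -> int:
    """1 if the full proof is complete, 0 otherwise."""
    return 1 if proof_complete else 0


def step_reward(kind: str, succeeded: bool, used_in_final_proof: bool = False) -> int:
    """1 if the step is effective under the rules described above, else 0."""
    if kind == "propose":
        # A proposition counts only if the engine actually proved it.
        return int(succeeded)
    if kind == "add":
        # A construction must be added successfully AND appear in the final proof.
        return int(succeeded and used_in_final_proof)
    return 0  # assumption: other action kinds earn no step reward
```

Note that the `add` rule can only be evaluated after the trajectory ends, since it depends on the content of the final proof.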

The paper reports approximately 13K total training examples, comprising 7K supervised examples and 6K synthesized RL problems, with roughly $\mathfrak{W}$ 7 training tokens (Zhao et al., 11 Dec 2025). It contrasts this with the much larger token budgets reported for expert systems, arguing that InternGeometry demonstrates substantial data efficiency.

5. Empirical performance

InternGeometry is evaluated primarily on the IMO-50 benchmark, consisting of geometry problems from the International Mathematical Olympiad from 2000 through 2024 (Zhao et al., 11 Dec 2025). Under pass@256 evaluation, the paper reports that InternGeometry solves 44 of 50 problems, compared with 42 for AlphaGeometry 2 and 43 for SeedGeometry (Zhao et al., 11 Dec 2025). The paper further states that this exceeds the average gold medalist score of 40.9 on the benchmark (Zhao et al., 11 Dec 2025).
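For readers unfamiliar with the metric, the sketch below shows the usual reading of pass@k on a benchmark: a problem counts as solved if at least one of its first k sampled trajectories yields a complete, verified proof. It assumes this standard reading applies to the paper's pass@256 numbers; the function names and the data layout are hypothetical.

```python
from typing import Dict, List


def solved_at_k(outcomes: List[bool], k: int) -> bool:
    """True if any of the first k sampled trajectories produced a verified proof."""
    return any(outcomes[:k])


def benchmark_score(results: Dict[str, List[bool]], k: int) -> int:
    """Number of problems solved under pass@k."""
    return sum(solved_at_k(outcomes, k) for outcomes in results.values())
```

Under this reading, a score of 44 out of 50 means that 44 problems had at least one successful trajectory among their 256 samples.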

The same section emphasizes the training-data contrast: InternGeometry uses 13K training examples, which the paper states is only $0.004\%$ of the data used by AlphaGeometry 2 (Zhao et al., 11 Dec 2025). The resulting interpretation is that agentic interaction, symbolic verification, and complexity-controlled RL can partly substitute for extremely large-scale data synthesis.

The ablation studies are central to the paper’s argument. Removing proposition proposal and leaving only auxiliary construction reduces performance from 44/50 to 35/50. Removing slow thinking reduces it to 23/50, removing context compression to 20/50, and removing rejection sampling to 38/50 (Zhao et al., 11 Dec 2025). These results are used to support the claim that InternGeometry’s gains arise from the combination of long-horizon memory, verified subgoal discovery, and action-quality control rather than from model scale alone.

The CBRL ablation is similarly strong. The paper reports 22/50 after supervised cold start alone, 29/50 when trained only on easy data, 24/50 when trained only on challenging data, 38/50 when trained on the same synthesized data without complexity scheduling, and 44/50 with full CBRL (Zhao et al., 11 Dec 2025). This is presented as evidence that the curriculum, not merely the data generator, is essential.

Qualitative case studies are also emphasized. The paper states that InternGeometry can discover auxiliary constructions that do not appear in human solutions, citing IMO 2018 Problem 6 as an example in which the system constructs two auxiliary points, identifies an isogonal-conjugate structure in a quadrilateral of the configuration, and proceeds via a synthetic route unlike inversion- or trigonometry-based human solutions (Zhao et al., 11 Dec 2025). This is used to argue that the system is not merely retrieving standard templates.

6. Significance, limitations, and relation to prior geometry systems

InternGeometry belongs to a line of symbolic geometry solvers exemplified by AlphaGeometry-style systems, but it differs in emphasis. Whereas earlier expert systems are described as relying heavily on large-scale synthetic data and specialized search, InternGeometry is presented as an LLM-centered agent that acquires geometry capability through iterative proposition testing, auxiliary construction, and verified reflection (Zhao et al., 11 Dec 2025). This suggests a shift from expert construction heuristics toward agentic long-horizon interaction as a primary source of performance.

The system’s significance lies in three linked claims. First, olympiad geometry can be treated as an interactive problem-solving process rather than a single-shot prediction task. Second, long-horizon reasoning and symbolic feedback form a distinct scaling axis in geometry. Third, a curriculum based on proof-complexity can make reinforcement learning viable in a domain with sparse rewards and weak local heuristics (Zhao et al., 11 Dec 2025).

The paper also leaves clear limitations. Performance depends on the expressive scope of InternGeometry-DDAR, and the remaining unsolved problems are described as involving numerical, computational, or non-pure-geometric aspects that are not well captured by the current DDAR formalism (Zhao et al., 11 Dec 2025). Inference is also expensive: the reported setup uses a 32B base model, up to 200 steps per run, pass@256 sampling, and an average of 89.6K output tokens per trajectory on IMO-50 (Zhao et al., 11 Dec 2025). In addition, some subsystems, especially dynamic memory, are described operationally rather than as fully formalized algorithms.

Even with those caveats, InternGeometry is positioned as a substantial development in AI for geometry. It combines a formal DSL, an interactive symbolic engine, long-context agent control, and curriculum RL into a unified system that is reported to achieve olympiad-level results with comparatively modest data scale (Zhao et al., 11 Dec 2025). A plausible implication is that future progress in automated geometry may depend less on ever-larger construction-specific heuristics and more on verified, agentic exploration over symbolic states.

References

Zhao et al. (11 Dec 2025). Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning.