Diagram-grounded Geometry Problem Solving

Updated 25 December 2025

Diagram-grounded geometry problem solving is a multimodal approach that integrates diagram parsing and textual formalization to derive valid geometric solutions, including proofs.
It uses neural-symbolic frameworks to align visual geometric primitives with language-based constraints, which improves inference accuracy.
Recent methods such as Pi-GPS improve accuracy by resolving referential ambiguities against the diagram and by producing stepwise, interpretable reasoning.

Diagram-grounded geometry problem solving is the study and automation of mathematical reasoning tasks that require joint processing of both a geometry diagram—typically a rendered image containing points, lines, circles, angle marks, and textual or symbolic annotations—and an associated textual description, with the goal of producing a mathematically valid solution (numerical, symbolic, or proof-based) (Zhao et al., 3 Jun 2025). Unlike algebraic or text-only mathematical problem solving, this task demands multimodal alignment, diagram parsing, formal language translation, and rigorous deductive or programmatic inference over geometric constructs. Recent advances in diagram-grounded frameworks have fundamentally shifted the capacities of AI-based geometry solvers by explicitly modeling and leveraging diagrammatic information at every stage of the problem-solving pipeline.

1. Formal Definition and Core Challenges

Diagram-grounded geometry problem solving (GPS) is defined as the class of reasoning tasks that require the extraction and integration of both low-level geometric primitives (points, lines, circles, intersection points, parallel/perpendicular indicators) from images and textual constraints (incidence, measures, target quantities) from natural language, followed by deductive or programmatic inference to obtain the answer (Zhao et al., 3 Jun 2025, Lu et al., 2021). Formally, the system must:

Parse a diagram $D$ to recover a symbolic structure encompassing labeled points, adjacency, metric relations, and geometric markings.
Parse a problem text $T$ to yield a set of formalized propositions, including unspecified or ambiguous objects (e.g., "the shaded region", "point $P$") and the explicit target.
Fuse $D$ and $T$ into a coherent formal language $L$ (e.g., a first-order predicate logic with geometric function symbols (Lu et al., 2021), or a context-free grammar-based logic (Ping et al., 29 May 2025)) which is sufficiently precise for symbolic reasoning.
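The three steps above can be sketched as a small data model. This is only an illustration under stated assumptions: the names `DiagramParse`, `TextParse`, and `fuse` are hypothetical and do not come from any cited system, and the clause strings simply imitate the predicate style used later in this article.

```python
from dataclasses import dataclass, field


@dataclass
class DiagramParse:
    """Symbolic structure recovered from the diagram D (hypothetical schema)."""
    points: set
    segments: set                                   # frozensets of two point labels
    relations: list = field(default_factory=list)   # clause strings read off the diagram


@dataclass
class TextParse:
    """Formalized propositions recovered from the problem text T (hypothetical schema)."""
    propositions: list
    target: str


def fuse(diagram, text):
    """Merge both parses into one clause list L: deduplicated, with the target last."""
    clauses = []
    for clause in diagram.relations + text.propositions:
        if clause not in clauses:        # the same fact may appear in D and in T
            clauses.append(clause)
    clauses.append(f"Find({text.target})")
    return clauses


# Illustrative right triangle; all values are made up for the example.
d = DiagramParse(
    points={"A", "B", "C"},
    segments={frozenset("AB"), frozenset("BC"), frozenset("AC")},
    relations=["Perpendicular(Line(A,B),Line(B,C))"],
)
t = TextParse(
    propositions=[
        "Equals(LengthOf(Line(A,B)),3)",
        "Equals(LengthOf(Line(B,C)),4)",
        "Perpendicular(Line(A,B),Line(B,C))",
    ],
    target="LengthOf(Line(A,C))",
)
formal = fuse(d, t)
```

Real systems do considerably more at the fusion step (entity alignment, conflict resolution); the sketch only shows that the output is a single clause set with an explicit target.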

Core challenges include the disambiguation of textual referents via diagrammatic grounding, robust extraction of geometric relations from possibly noisy or complex diagrams, alignment between linguistic and visual entities, and the construction of an inference chain that corresponds to a valid geometric proof (Zhao et al., 7 Mar 2025, Ping et al., 29 May 2025).

2. Diagram Parsing, Grounding, and Disambiguation

Accurate diagram parsing is indispensable for robust geometry reasoning. Classical methods (Hough transforms, edge detection) yielded limited symbolic structures from images, but modern systems employ neural diagram parsers such as PGDPNet (Zhao et al., 7 Mar 2025), which performs instance segmentation and relation extraction, or SigLIP-based encoders that produce semantically aligned representations (Zhang et al., 2024).

A seminal advancement is diagram-grounded text disambiguation. In Pi-GPS (Zhao et al., 7 Mar 2025), a "micro module" consisting of a rectifier (an MLLM disambiguating textual placeholders like "$") and a verifier (a geometric rule-checker) injects diagrammatic clarity into the formal language prior to reasoning. The rectifier fills in missing or ambiguous object names by contextually analyzing point lists and connectivity from the diagram; the verifier ensures existence, shape closure, and geometric consistency, feeding back to the rectifier as needed.
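A minimal sketch of the rectifier and verifier interplay described above follows. It is not the Pi-GPS implementation: the MLLM rectifier is replaced by a fixed list of candidate namings, the function names `verify_polygon` and `rectify` are hypothetical, and the verifier checks only point existence and shape closure.

```python
def verify_polygon(name_points, diagram_points, diagram_segments):
    """Return a list of problems with a proposed polygon; an empty list means accepted."""
    problems = []
    # Existence: every vertex must be a labeled point in the diagram.
    for p in name_points:
        if p not in diagram_points:
            problems.append(f"unknown point {p}")
    # Shape closure: consecutive vertices (wrapping around) must be joined by a segment.
    n = len(name_points)
    for i in range(n):
        a, b = name_points[i], name_points[(i + 1) % n]
        if frozenset((a, b)) not in diagram_segments:
            problems.append(f"no segment {a}{b}")
    return problems


def rectify(clause, candidates, diagram_points, diagram_segments, placeholder="$"):
    """Try candidate namings in order; return the first filled clause the verifier accepts."""
    for cand in candidates:
        if not verify_polygon(cand, diagram_points, diagram_segments):
            return clause.replace(placeholder, ",".join(cand))
    return None  # nothing passed; a real system would feed the problems back to the rectifier


# Illustrative quadrilateral ABCD; the candidate list stands in for MLLM proposals.
points = {"A", "B", "C", "D"}
segments = {frozenset(s) for s in ("AB", "BC", "CD", "DA")}
candidates = [("A", "C", "B", "D"), ("A", "B", "C", "D")]
result = rectify("Find(AreaOf(Quadrilateral($)))", candidates, points, segments)
```

The first candidate is rejected because the diagram contains no segments AC or BD, so the placeholder is filled with the second candidate.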

Explicit grounding not only mitigates model hallucination but also resolves complex referential ambiguity, such as matching unnamed polygons to labeled diagram vertices or determining which region is the area to be computed (Zhao et al., 7 Mar 2025). This approach distinguishes Pi-GPS from prior neural-symbolic and end-to-end models (e.g., Inter-GPS, LANS), which often underperform on examples with latent references or topology-dependent meaning.

3. Reasoning and Theorem Application Mechanisms

Once diagram and text parses are fused and disambiguated, contemporary solvers employ a range of reasoning engines:

Symbolic engines (e.g., Inter-GPS (Lu et al., 2021), AutoGPS (Ping et al., 29 May 2025)) operate via forward/backward chaining over a formal theorem base. A neural or LLM-based theorem predictor guides which axioms to apply, pruning the combinatorial search space. Reasoning proceeds in a stepwise manner, often with explicit human-interpretable proof traces.
Neural program generators (e.g., PGPSNet-v2 (Zhang et al., 2024), LANS (Li et al., 2023)) generate a sequence of geometric operators and operands—a "solution program"—by decoding over a multimodal fusion of diagram and text features. These are executed in an external or internal program interpreter. Decoders are typically constrained by problem context and diagram structure, sometimes employing verifier modules to eliminate physically or mathematically invalid programs.
Neuro-symbolic collaborative frameworks (e.g., AutoGPS (Ping et al., 29 May 2025)) combine a multimodal problem formalizer with a symbolic hypergraph expansion reasoner, yielding stepwise, minimal subgraph proofs.
Chain-of-theorem LLM predictors (e.g., Pi-GPS (Zhao et al., 7 Mar 2025)) use LLMs to plan a sequence of theorem applications based on the cleaned, disambiguated formal language.
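To make the symbolic side of this list concrete, the following is a toy forward-chaining loop that records which premises and theorem produced each fact and then extracts only the steps needed for the goal. It is a sketch under stated assumptions: the two rules, the tuple encoding of facts, and all function names are illustrative, and no neural theorem predictor is included to prune the search.

```python
def parallel_transitivity(facts):
    """If x || y and y || z (x != z), then x || z."""
    for (r1, x, y) in facts:
        for (r2, y2, z) in facts:
            if r1 == r2 == "parallel" and y == y2 and x != z:
                yield ("parallel", x, z), [(r1, x, y), (r2, y2, z)], "parallel transitivity"


def perpendicular_transfer(facts):
    """If x || y and y is perpendicular to z, then x is perpendicular to z."""
    for (r1, x, y) in facts:
        for (r2, y2, z) in facts:
            if r1 == "parallel" and r2 == "perpendicular" and y == y2:
                yield ("perpendicular", x, z), [(r1, x, y), (r2, y2, z)], "perpendicular transfer"


def forward_chain(given, rules, goal, max_rounds=10):
    """Apply every rule until the goal appears or nothing new is derived."""
    derived = {f: None for f in given}   # fact -> (premises, theorem); None marks a given
    for _ in range(max_rounds):
        new = {}
        for rule in rules:
            for fact, premises, name in rule(list(derived)):
                if fact not in derived and fact not in new:
                    new[fact] = (premises, name)
        if not new:
            break
        derived.update(new)
        if goal in derived:
            break
    return derived


def proof_trace(derived, goal):
    """Walk back from the goal and list only the derivation steps it depends on."""
    steps, seen = [], set()

    def visit(fact):
        if fact in seen or derived[fact] is None:
            return
        seen.add(fact)
        premises, name = derived[fact]
        for p in premises:
            visit(p)
        steps.append((name, fact))

    visit(goal)
    return steps


given = [("parallel", "AB", "CD"), ("parallel", "CD", "EF"), ("perpendicular", "EF", "GH")]
goal = ("perpendicular", "AB", "GH")
derived = forward_chain(given, [parallel_transitivity, perpendicular_transfer], goal)
trace = proof_trace(derived, goal)
```

In this run the fact that AB is parallel to EF is derived but does not appear in `trace`, which illustrates why stepwise solvers can report a proof smaller than the full set of derived facts.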

The disambiguated formal language typically uses first-order predicates such as $\mathrm{Parallel}(\mathrm{Line}(A,B), \mathrm{Line}(C,D))$, $\mathrm{OnLine}(C,A,B)$, or $\mathrm{Find}(\mathrm{AreaOf}(...))$, enabling precise invocation and verification of geometric theorems such as angle-chasing rules, similarity criteria, and the Pythagorean theorem (Lu et al., 2021).
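Predicates written in this nested style are straightforward to turn into a tree that a reasoner can pattern-match against. The parser below is a minimal sketch, not the parser of any cited system; the function name `parse` and the tuple-based output format are assumptions made for illustration.

```python
def parse(expr):
    """Parse 'Name(arg, ...)' into (name, [args]); bare symbols stay plain strings."""
    expr = expr.strip()
    pos = 0

    def skip_spaces():
        nonlocal pos
        while pos < len(expr) and expr[pos].isspace():
            pos += 1

    def term():
        nonlocal pos
        start = pos
        while pos < len(expr) and expr[pos] not in "(),":
            pos += 1
        name = expr[start:pos].strip()
        if pos < len(expr) and expr[pos] == "(":
            pos += 1                      # consume "("
            args = []
            while True:
                args.append(term())
                skip_spaces()
                if pos >= len(expr):
                    raise ValueError("unbalanced parentheses")
                ch = expr[pos]
                pos += 1                  # consume "," or ")"
                if ch == ")":
                    break
                if ch != ",":
                    raise ValueError(f"unexpected character {ch!r}")
            return (name, args)
        return name

    tree = term()
    if pos != len(expr):
        raise ValueError("trailing input")
    return tree


parallel = parse("Parallel(Line(A,B), Line(C,D))")
on_line = parse("OnLine(C,A,B)")
```

Here `parallel` becomes `("Parallel", [("Line", ["A", "B"]), ("Line", ["C", "D"])])`, so a theorem's premises can be checked by comparing predicate names and argument lists.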

4. Empirical Benchmarks and Performance Metrics

The field has established a suite of benchmarks with varying complexity and annotation detail:

Dataset	Size (problems)	Modalities	Annotation	Notable Features
GeoS	186	Diagram, text	SAT style	Early, limited
Geometry3K	3,002	Diagram, text	Formal language	Choice/Completion mode
PGPS9K	9,022	Diagram, text	Clauses/programs	Fine-grained programs
GeoQA, GeoQA+	4,998 (GeoQA); 7,528 (GeoQA+)	Diagram, text	Key points, steps	Chinese/English, proof steps
UniGeo	14,541	Diagram, text	Calculation, proof	Inference chains
FormalGeo7K	~7,000	Diagram, text	Full formalization	IMO-level subset

Metrics include end-to-end solution accuracy in multiple-choice (choice) and numeric/free-form (completion) modes, equation/program matching, parser detection/recall, relation F1, and stepwise logical trace coherence (Zhao et al., 7 Mar 2025, Ping et al., 29 May 2025, Zhang et al., 2024). Step-level evaluation protocols, such as that in AutoGPS, test for geometric comprehension, validity of theorem application, and algebraic correctness (Ping et al., 29 May 2025).
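Three of these metrics can be written down in a few lines. The sketch below is illustrative only: the function names are hypothetical, and the 1% relative tolerance used for completion mode is an assumption, not a value taken from any of the cited benchmarks.

```python
import math


def choice_accuracy(predicted, gold):
    """Fraction of problems where the selected option matches the gold option."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def completion_accuracy(predicted, gold, rel_tol=1e-2):
    """Fraction of numeric answers within a relative tolerance; None counts as a miss."""
    hits = sum(
        p is not None and math.isclose(p, g, rel_tol=rel_tol)
        for p, g in zip(predicted, gold)
    )
    return hits / len(gold)


def relation_f1(predicted, gold):
    """F1 between predicted and gold sets of parsed diagram relations."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up predictions, used only to exercise the functions.
acc_choice = choice_accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"])
acc_completion = completion_accuracy([5.0, 12.01, None], [5.0, 12.0, 7.5])
f1 = relation_f1({"r1", "r2", "r3"}, {"r2", "r3", "r4", "r5"})
```

Choice mode gives a solver credit whenever it picks the correct option, whereas completion mode requires the numeric answer itself, so the two figures are not directly comparable.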

Pi-GPS achieves 77.8% choice accuracy on Geometry3K (+10 pp over prior neural-symbolic models), with especially large gains on area questions (27% to 59%) (Zhao et al., 7 Mar 2025). Ablations attribute most of the improvement to the diagram-grounded text disambiguation (+8 pp), with verified theorem prediction contributing a further +2 pp.

5. Advanced Architectures and Representations

Recent research has produced specialized architectures specifically designed for diagram-grounded reasoning:

Layout-aware architectures (LANS (Li et al., 2023)): Exploit point-to-patch alignment and enforce cross-modal spatial correspondence via layout-aware attention masks. Structural-semantic and point-match pre-training boosts accuracy by focusing attention along geometric adjacency and alignment.
Hologram representations (HGR (Huang et al., 2024)): Construct a unified heterogeneous graph encoding all primitives and relations derived from both text and image, enabling graph-model-driven theorem application and property instantiation, under RL-based model selection.
Unified formalization pipelines (GeoX (Xia et al., 2024), DFE-GPS (Zhang et al., 2024)): Employ unimodal and cross-modal pre-training, geometry-language alignment, and instruction tuning to generate not only accurate solutions but also verifiable, interpretable step traces.
Neuro-symbolic collaborative engines (AutoGPS (Ping et al., 29 May 2025)): Integrate a multimodal formalizer that produces a self-consistent, diagram-integrated formal representation with a symbolic hypergraph expansion reasoner which ensures minimality and stepwise logical reliability.
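As a rough illustration of the point-to-patch idea in the first item, the sketch below builds a boolean mask that lets each point token attend only to image patches near the point's location. The exact masking rule used by LANS is not given here, so the grid layout, the `radius` parameter, and the function names are all assumptions.

```python
def patch_index(x, y, grid):
    """Map normalized image coordinates in [0, 1] to a row-major patch index."""
    col = min(int(x * grid), grid - 1)
    row = min(int(y * grid), grid - 1)
    return row * grid + col


def layout_mask(points, grid, radius=0):
    """For each point label, mark the patches within `radius` cells of its home patch."""
    mask = {}
    for label, (x, y) in points.items():
        r0, c0 = divmod(patch_index(x, y, grid), grid)
        allowed = [False] * (grid * grid)
        for r in range(max(0, r0 - radius), min(grid, r0 + radius + 1)):
            for c in range(max(0, c0 - radius), min(grid, c0 + radius + 1)):
                allowed[r * grid + c] = True
        mask[label] = allowed
    return mask


# Two illustrative points on a 4x4 patch grid.
pts = {"A": (0.1, 0.1), "B": (0.9, 0.5)}
strict = layout_mask(pts, grid=4)            # home patch only
loose = layout_mask(pts, grid=4, radius=1)   # home patch plus its neighbours
```

In an attention layer, positions marked `False` would typically be set to a large negative value before the softmax, so that a point token ignores patches far from where the point is drawn.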

A critical pattern is the explicit use of diagram formalization steps—transitioning between visual features, formal geometric languages (e.g., ConsCDL, ImgCDL), and intermediate symbolic clauses. These formalizations address modality gaps, enable human-verifiable reasoning chains, and support rigorous logical integrity checks.

6. Theoretical and Practical Significance; Future Directions

Diagram-grounded approaches have demonstrated that integrating precise, diagram-extracted context into the geometric reasoning process is both necessary for, and effective at, eliminating failure modes associated with unresolved textual references, visual ambiguity, or over-generalizing LLMs (Zhao et al., 7 Mar 2025, Ping et al., 29 May 2025). Modular designs, including micro-modules (rectifier + verifier), agentic multi-stage pipelines (Interpreter-Solver), and graph-model expansion frameworks, provide interpretability as well as mathematical soundness.

Limitations persist in handling crowded or highly stylized diagrams, scalability to solid or non-Euclidean geometry, and the full integration of hybrid symbolic-numeric solvers (Zhao et al., 3 Jun 2025). Scaling diagram parsers, enriching the theorem base, and advancing self-verification modules represent promising future directions. Automated benchmark generation (control over geometry domains), adaptive agentic frameworks (dynamic selection of single vs. multi-agent pipelines), and learning-based verifier refinement are proposed next steps in the trajectory toward human-level geometric reasoning (Zhao et al., 7 Mar 2025, Sobhani et al., 18 Dec 2025, Zhao et al., 3 Jun 2025).

The empirical success and architectural diversity of recent diagram-grounded GPS research mark a decisive shift in multimodal mathematical reasoning, with impacts extending to intelligent education, computer-aided design, and explainable AI (Zhao et al., 7 Mar 2025, Ping et al., 29 May 2025, Zhao et al., 3 Jun 2025).