GGBench: Unified Geometric Reasoning Benchmark

Updated 17 November 2025
  • GGBench is a comprehensive benchmark that evaluates geometric generative reasoning in unified multimodal models using integrated language, code, and image outputs.
  • It employs a rigorous tri-modal protocol that aligns natural language prompts, executable GeoGebra code, and rendered diagrams to ensure precise geometric construction.
  • Designed to address limitations in previous benchmarks, GGBench offers detailed evaluations of logical, procedural, and spatial accuracy in AI-generated geometric reasoning.

GGBench is a comprehensive benchmark explicitly designed to evaluate geometric generative reasoning in unified multimodal models (UMMs). Unlike prior datasets, which predominantly measure discriminative understanding or unconstrained image generation, GGBench targets the integrated fusion of language comprehension, formal reasoning, and precise visual generation. It establishes a rigorous protocol for diagnosing a model’s ability not merely to understand but also to actively construct geometric solutions, with a focus on verifiability through synchronized text, code, and image modalities (Wei et al., 14 Nov 2025).

1. Motivation and Distinctive Scope

GGBench was developed in response to limitations of existing math-vision and multimodal benchmarks. Traditional datasets such as MathVista, MATH-V, and GeoEval emphasize answer selection rather than explicit construction. Other recent multimodal benchmarks (e.g., MM-MATH, MathVerse) have initiated process evaluation but lack explicit code-level supervision, which impairs result verifiability. No prior resource achieves complete alignment among detailed natural language reasoning, executable construction code, and reference diagrams.

GGBench addresses this gap by providing:

  • Tasks that require deep parsing of abstract natural language constraints.
  • Multi-step planning via formal geometric operations (e.g., "draw the perpendicular bisector of $\overline{AB}$").
  • Precise and constrained generation of visual evidence/diagrams.
  • Full triplet representations: descriptive reasoning, executable code (GeoGebra), and rendered images.

This tri-modal verifiability sets a new standard for benchmarks aimed at the evaluation of next-generation UMMs.

2. Dataset Composition and Annotation Pipeline

GGBench comprises 1,411 problems, systematically distributed by construction type and difficulty:

Category                     Count
Straightedge-and-compass       798
Geometric transformations      426
Analytic constructions         187

By difficulty, the problems split into 298 easy, 816 medium, and 297 hard.

Each problem is annotated with 3–7 diagrams (mean 5.08; total 7,165 images) capturing construction steps or key configuration states. Problems are further tagged by skill domains (total tags = 3,097), including basic constructions (1,063 tags), circle properties (931), geometric transformations (376), triangle constructions (280), theorem applications (218), polygons, measurement/ratios, and locus constructions.

The data collection pipeline involves multistage filtering and synthesis:

  • Manual web pooling for public geometric construction problems.
  • Filtering and annotation via GPT-5 to ensure unambiguous, actionable prompts.
  • Prompt adaptation into rigorously "construction-oriented" statements specifying explicit givens and targets, assigned with difficulty/type/skill metadata.
  • Synchronized solution triplets generated via LLMs:

    1. Stepwise natural language construction plan.
    2. Executable GeoGebra code.
    3. Corresponding rendered diagram.
  • Automated screening for code executability and for logical and diagrammatic consistency, followed by expert mathematical review to ensure soundness and strict tri-modal alignment.
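Part of the executability screen can be approximated even without invoking GeoGebra. The following is a minimal Python sketch that flags objects referenced before definition, one of the code-level failure modes discussed in Section 5. It assumes scripts consist of simple `Name = Command(args)` assignments, and the built-in whitelist is illustrative rather than complete; a production screen would execute the script in GeoGebra itself:

```python
import re

# Illustrative (incomplete) whitelist of GeoGebra built-in commands.
BUILTINS = {
    "Point", "Segment", "Line", "Ray", "Circle", "Midpoint",
    "PerpendicularBisector", "PerpendicularLine", "Intersect",
    "Rotate", "Reflect", "Translate", "Polygon",
}

ASSIGN = re.compile(r"^\s*([A-Za-z]\w*)\s*=")   # left-hand-side definitions
IDENT = re.compile(r"\b[A-Za-z]\w*\b")          # identifiers on the right

def undefined_objects(script: str) -> list[str]:
    """Return identifiers referenced before they are defined in a
    GeoGebra-style script of `Name = Command(args)` assignments."""
    defined: set[str] = set()
    flagged: list[str] = []
    for line in script.splitlines():
        if not line.strip():
            continue
        lhs = ASSIGN.match(line)
        rhs = line.split("=", 1)[1] if "=" in line else line
        for name in IDENT.findall(rhs):
            if name not in defined and name not in BUILTINS:
                flagged.append(name)
        if lhs:
            defined.add(lhs.group(1))
    return flagged

# Example: `C` is used before being defined, so it is flagged.
script = "A = (0, 0)\nB = (4, 0)\nm = Midpoint(A, C)"
assert undefined_objects(script) == ["C"]
```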

3. Formal Task Definition and Modalities

Each GGBench problem is formally characterized by:

  • Text prompt $P$ with explicit syntactic and semantic constraints.
  • Optional reference shapes (for initial configurations).
  • Required output: a set of geometric objects $\{O_1, \ldots, O_k\}$ that strictly satisfies a relational constraint set $R$.

Three principal task families are explicitly defined:

  1. Straightedge-and-Compass Construction: Given initial geometric objects, construct auxiliary lines/circles to satisfy specific relationships (e.g., bisectors, tangents). Example: “Construct the perpendicular bisector of segment $\overline{AB}$.”
  2. Geometric Transformation Construction: Execute rigid or similarity transformations (rotations, reflections) on specified objects, followed by further geometric queries or constructions (e.g., intersection with transformed structures).
  3. Analytic Construction: Place objects using algebraic or quantitative constraints, sometimes involving coordinate geometry, exemplified by construction of a circle through specified points with given radius.

Problem modalities are strictly synchronized across:

  • Text: natural language, formalized prompt.
  • Code: explicit GeoGebra script.
  • Image: rendered output corresponding to code and plan.
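As an illustration of such a synchronized triplet, the sketch below encodes the perpendicular-bisector example from the task families above in Python. The record schema (field names, image path) is hypothetical rather than GGBench's actual format; the embedded `Segment` and `PerpendicularBisector` commands are standard GeoGebra syntax:

```python
from dataclasses import dataclass

@dataclass
class GGBenchTriplet:
    """One synchronized text/code/image record (illustrative schema only)."""
    prompt: str         # formalized natural-language task
    plan: list[str]     # stepwise construction plan
    geogebra_code: str  # executable GeoGebra script
    image_path: str     # rendered reference diagram

example = GGBenchTriplet(
    prompt="Construct the perpendicular bisector of segment AB.",
    plan=[
        "Mark the midpoint M of segment AB.",
        "Draw the line through M perpendicular to AB.",
    ],
    geogebra_code=(
        "A = (0, 0)\n"
        "B = (4, 0)\n"
        "s = Segment(A, B)\n"
        "bis = PerpendicularBisector(A, B)"
    ),
    image_path="diagrams/perp_bisector_01.png",  # hypothetical path
)
```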

4. Evaluation Protocol and Metrics

GGBench employs a multi-stage, granular evaluation protocol:

  • Planning (VLM-T): Natural language planning is scored for logical coherence, step completeness, and geometric correctness, each criterion rated 1–5 (normalized to [0,100]).
  • Intermediate Process (VLM-I-Mid):
    • Step Accuracy: Each intermediate construction step is aligned with an image and graded.
    • Process Consistency: Assesses the coherence of the sequence.
  • Final Result (VLM-I-Res):
    • Geometric consistency: Rating of final diagram’s compliance with task constraints.
    • Perceptual metrics: LPIPS (distance, lower is better), PSNR, SSIM (higher is better).
  • Overall Score (VLM-I): Mean of VLM-I-Mid and VLM-I-Res, with strong human-VLM metric correlation ($r = 0.9295$).
  • Code-Based Track: Additional metrics for GeoGebra script generation:
    • Pass@1: Execution success rate.
    • BLEU, ROUGE-L, chrF: String similarity.
    • Edit Distance: To reference scripts.

This rigorous scoring framework enables both process- and outcome-level assessment of generative geometric reasoning in UMMs.
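A minimal sketch of how these scores might be aggregated follows. The linear mapping of 1–5 ratings onto [0,100], and the use of `difflib` as a stand-in for the benchmark's edit-distance metric, are assumptions rather than the paper's exact implementation:

```python
import difflib

def normalize(rating: float) -> float:
    """Map a 1-5 rubric rating onto [0, 100] (assumed linear)."""
    return (rating - 1.0) / 4.0 * 100.0

def vlm_i(mid_score: float, res_score: float) -> float:
    """Overall VLM-I: mean of the VLM-I-Mid and VLM-I-Res tracks."""
    return (mid_score + res_score) / 2.0

def pass_at_1(executions_ok: list[bool]) -> float:
    """Pass@1 as the percentage of generated scripts that execute."""
    return 100.0 * sum(executions_ok) / len(executions_ok)

def edit_similarity(generated: str, reference: str) -> float:
    """Character-level similarity to the reference script; a difflib
    stand-in for the benchmark's edit-distance metric."""
    return difflib.SequenceMatcher(None, generated, reference).ratio()

# Example: a rating of 4/5 on both tracks yields an overall VLM-I of 75.
assert vlm_i(normalize(4), normalize(4)) == 75.0
```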

5. Baseline Model Performance

GGBench baseline experiments covered two model tracks:

  • Track A (End-to-End UMMs): Models include Qwen-Image, Seedream 4.0, Janus, BAGEL, Nano Banana.
  • Track B (Planning → Code → Render): Models include GPT-4o, GLM-4.5V, Qwen3-14B, GPT-4, Gemini 2.5 Pro, DeepSeek R1/V3.1, Qwen3-VL, Claude Sonnet 4.5, GPT-5.

Summary of core results:

Model           VLM-I    Pass@1 (Code)
Qwen-Image      22.75    —
Seedream 4.0    24.45    —
Janus           20.73    —
BAGEL           20.91    —
Nano Banana     33.82    —
GPT-4o          14.43    7.87%
GLM-4.5V        —        14.25%
GPT-5           57.08    79.02%

(— = not reported for that track)

End-to-end UMMs demonstrate superior perceptual image synthesis but systematically fail to enforce precise geometric constraints (VLM-I < 45). In contrast, code-grounded LLMs (GPT-5, Claude 4.5, DeepSeek V3.1) achieve far higher accuracy on logical and geometric correctness (VLM-I up to 76), with GPT-5 achieving high code executability (Pass@1 = 79.02%) and greater structural similarity to references.

Frequent failure modes include:

  • Geometric logic errors (misapplied theorems).
  • Structural/contextual reversals (e.g., inscribe vs. circumscribe).
  • Confusion between numeric values and construction steps.
  • Code-level syntax errors (e.g., undefined objects).

6. Analysis, Limitations, and Cross-Benchmark Comparison

Key findings from GGBench studies:

  • Perceptual metrics (LPIPS, PSNR, SSIM) exhibit low correlation with geometric correctness, indicating that visual similarity is insufficient to assess task success.
  • UMMs handle basic and procedural constructions relatively well but remain challenged by tasks demanding application of geometric theorems or quantitative measurement.
  • The strict alignment and full verifiability across text, code, and diagram modalities expose brittleness in current model reasoning and symbolic-to-spatial translation.
  • Existing alternative benchmarks such as GeoGramBench (Luo et al., 23 May 2025) share a focus on geometric reasoning but emphasize program-to-geometry mapping, with a three-level taxonomy (primitive, compositional, global abstraction). GeoGramBench's results—substantially lower model accuracy at more abstract levels—underscore the persistent challenge in integrating spatial and symbolic reasoning, a point reinforced by GGBench’s findings.

A plausible implication is that while state-of-the-art models can handle perceptual synthesis, they are not yet capable of the full geometric generative reasoning required for mathematically rigorous construction, especially for complex, abstract, or theorem-driven tasks.

7. Future Directions

Identified directions for research informed by GGBench include:

  • Integrating explicit symbolic reasoning components with visual generation modules to combine strengths of both model architectures.
  • Expanding code-based verifiability frameworks beyond geometry to fields such as CAD, electronic circuit layout, and architectural planning.
  • Developing closed-loop feedback systems where intermediate diagrams inform subsequent reasoning or correction steps.
  • Extending benchmarks to include higher-dimensional, multi-modal, and hybrid symbolic-spatial challenges.

GGBench thus functions as a reference point for the evolving landscape of geometric reasoning evaluation in AI, emphasizing the need for benchmarks that test not only understanding but also active symbolic construction and integrated multi-modal output verification (Wei et al., 14 Nov 2025).
