Geometry3K Multimodal Benchmark
- The Geometry3K Benchmark is a multimodal evaluation suite designed to assess geometric reasoning with high school to competition-level problems paired with precise diagrammatic representations.
- It employs a strict zero-shot protocol with a two-stage answer extraction process and rephrased, diagram-enhanced items to address pretraining data leakage and improve robustness.
- Performance insights reveal significant gaps between specialized and generalist models, highlighting challenges in backward reasoning, out-of-distribution tasks, and multi-step chain-of-thought evaluations.
The Geometry3K Multimodal Reasoning Benchmark refers both to a lineage of geometry-focused multimodal datasets and to a specific set of evaluation tasks used to probe LLMs and multimodal LLMs (MLLMs) for their capabilities in geometric problem-solving, spatial reasoning, and the integration of textual and diagrammatic information. Serving as a foundational element in recent benchmarking efforts, Geometry3K and its derivatives are central to empirical assessments of multimodal mathematical reasoning. The Geometry3K family has also been subsumed, reused, or referenced in subsequent comprehensive benchmarks—such as GeoEval, GPSM4K, GeoSense, SOLIDGEO, and others—which together define the de facto standard for systematic evaluation of multimodal geometric reasoning.
1. Benchmark Motivation and Construction
Geometry3K was originally constructed to assess the geometric problem-solving capabilities of LLMs and MLLMs by compiling a diverse suite of high school to competition-level geometry problems. The main subset consists of approximately 3,000 items (hence the "3K" in the name), each paired with precise diagrammatic representations and textual statements. Problems were aggregated from multiple established datasets including PGPS9K, UniGeo, GeoQA+, GeometryQA, MATH, and MathQA (Zhang et al., 15 Feb 2024).
Subsequent benchmarks such as GeoEval (Zhang et al., 15 Feb 2024) draw heavily from Geometry3K and similar corpora, producing four principal subsets:
- GeoEval-2000: 2,000 problems, broad coverage, text + diagram inputs.
- GeoEval-backward: 750 transformed problems requiring backward (goal-driven) reasoning.
- GeoEval-aug: 2,000 rephrased items to mitigate pretraining data leakage and investigate robustness.
- GeoEval-hard: 300 hand-curated solid/analytic geometry problems focusing on template novelty and OOD (out-of-distribution) reasoning.
Problems encompass a broad spectrum: plane geometry, solid geometry, and analytic geometry. Inputs may be textual, diagrammatic, or both; in advanced variants, explicit diagram descriptions (via captioning) are appended to the prompt to support models lacking direct visual input channels.
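To make the input format concrete, a caption-augmented prompt for a text-only model might be assembled roughly as follows; the field names (`text`, `choices`, `diagram_caption`) and the instruction wording are illustrative assumptions, not the benchmark's exact schema.

```python
def build_prompt(problem: dict) -> str:
    """Assemble a zero-shot prompt; append a diagram caption only for models
    without a visual input channel. Field names here are hypothetical."""
    parts = [problem["text"]]
    if problem.get("diagram_caption"):
        parts.append("Diagram description: " + problem["diagram_caption"])
    if problem.get("choices"):
        parts.append("Choices: " + "; ".join(problem["choices"]))
    parts.append("Give only the final answer (a number or a choice letter).")
    return "\n".join(parts)
```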
2. Evaluation Protocols and Metrics
Evaluations are conducted in a strict zero-shot regime: models receive only a single prompt per item, with no fine-tuning on the benchmark itself (Zhang et al., 15 Feb 2024).
Answer extraction follows a two-stage post-processing protocol:
- An answer-extraction step (often delegated to an advanced LLM such as GPT-4 via an extraction prompt) parses the model's raw output for a final answer (numeric or multiple-choice).
- Regular expressions are used to handle residual ambiguities.
A response is considered correct if the extracted answer matches the gold standard solution (numerical value or answer option). Additional variants separately evaluate “Text-Only” (T) performance by excluding diagrammatic information.
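A minimal sketch of this two-stage pipeline is given below, assuming a caller-supplied `llm_extract` callable that wraps the extraction prompt (stage one) and a regex fallback for residual ambiguity (stage two); the pattern and matching rules are illustrative rather than the benchmark's exact post-processing.

```python
import re
from typing import Optional


def extract_answer(raw_output: str, llm_extract=None) -> Optional[str]:
    # Stage 1: an extraction prompt, executed by an advanced LLM, parses the raw output.
    if llm_extract is not None:
        parsed = llm_extract(raw_output)
        if parsed:
            return parsed.strip()
    # Stage 2: regex fallback for a trailing choice letter or numeric value.
    match = re.search(r"answer\s*(?:is|:)?\s*\(?([A-D]|-?\d+(?:\.\d+)?)\)?",
                      raw_output, flags=re.IGNORECASE)
    return match.group(1) if match else None


def is_correct(prediction: Optional[str], gold: str) -> bool:
    # A response counts as correct when the extracted answer matches the gold solution.
    return prediction is not None and prediction.strip().lower() == gold.strip().lower()
```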
Accuracy is reported as the number of correct responses divided by the total number of problems. For some datasets (e.g., GPSM4K (Anand et al., 1 Dec 2024)), further detailed scoring is performed:
- Step-by-step scoring: Each intermediate reasoning step is analyzed for correctness using a chain-of-thought (CoT) protocol.
- Automated evaluation by LLMs: Both final answers and critical reasoning steps may be extracted and judged automatically.
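For illustration, both levels of scoring can be aggregated as in the sketch below, assuming each record already carries a judged `answer_correct` flag and a list of boolean `step_verdicts` produced by an automated LLM judge (both field names are hypothetical).

```python
def aggregate_scores(records: list) -> dict:
    # Final-answer accuracy: correct responses divided by total problems.
    answer_acc = sum(r["answer_correct"] for r in records) / len(records)
    # Step-level accuracy: fraction of judged intermediate CoT steps marked correct.
    step_verdicts = [v for r in records for v in r["step_verdicts"]]
    step_acc = sum(step_verdicts) / len(step_verdicts) if step_verdicts else 0.0
    return {"answer_accuracy": answer_acc, "step_accuracy": step_acc}
```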
In GeoSense (Xu et al., 17 Apr 2025), two dedicated metrics assess specific reasoning capacities:
- Geometry Principle Identification (GPI): $\mathrm{GPI} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}(p_i)$, where $\mathbb{1}(p_i) = 1$ if the $i$-th required principle is correctly identified and $0$ otherwise.
- Geometry Principle Application (GPA): $\mathrm{GPA} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}(a_i)$, where $\mathbb{1}(a_i) = 1$ if the $i$-th principle is properly and contextually mapped onto the diagrammatic elements.
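Under the per-principle averaging above, the two metrics could be computed along the following lines; the annotation fields (`required_principles`, `identified_principles`, `correctly_applied`) are hypothetical stand-ins for GeoSense's judged outputs.

```python
def gpi_gpa(items: list) -> tuple:
    """Average per-principle indicators over all required principles
    (a sketch, not GeoSense's reference implementation)."""
    identified = applied = total = 0
    for item in items:
        for principle in item["required_principles"]:
            total += 1
            identified += principle in item["identified_principles"]  # counted toward GPI
            applied += principle in item["correctly_applied"]         # counted toward GPA
    return identified / total, applied / total
```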
3. Model Performance and Analysis
Models evaluated on Geometry3K (and benchmark supersets such as GeoEval and SOLIDGEO) reveal several consistent patterns:
- Specialized mathematical LLMs (e.g., WizardMath-70B) substantially outperform generalist models (e.g., GPT-3.5/4), attaining 55.67% accuracy on the main GeoEval-2000 subset but dropping to 6% accuracy on the hard subset.
- GPT-series models show large performance increases on rephrased (“augmented”) problem variants, with GPT-3.5 accuracy improving from 24.71% to 41% post-rephrasing (Zhang et al., 15 Feb 2024). This suggests that linguistic reformulation can clarify reasoning pathways and reduce pattern-matching artifacts.
- Diagram description (via captioning) markedly boosts performance for models lacking direct visual input pathways; WizardMath-7B accuracy increases by ~18.7% with an explicit diagram description appended.
- Across all tested systems, performance on backward or OOD (out-of-distribution) problems is systematically degraded, highlighting limited generalization and insufficient deep reasoning.
A representative performance spectrum is summarized below:
| Problem Subset | Top Model | Top-Model Accuracy | Comparison Baseline | Hard-Subset Accuracy (SOTA) |
|---|---|---|---|---|
| GeoEval-2000 | WizardMath-70B | 55.67% | <30% (GPT-3.5/4) | 6.00% (WizardMath) |
| GeoEval-aug | GPT-3.5 (rephrased) | >41% | 24.71% (original) | – |
| GPSM4K | LLaVA (fine-tuned) | >20–25% | single-digit % | – |
4. Methodological Advances and Variants
The Geometry3K ecosystem has catalyzed several innovations in benchmark construction and evaluation, including:
- Backward reasoning tasks: Masking solution numbers to assess multi-step, retroductive capabilities (Zhang et al., 15 Feb 2024).
- Chain-of-Thought (CoT) step evaluation: Scoring intermediate steps, not just outcomes, using both templates and LLM-based comparison (Anand et al., 1 Dec 2024).
- Retrieval-Augmented Generation (RAG): Leveraging a vector database of solved Q&A pairs—including diagrams and textual explanations—to inform in-context solution construction in GPSM4K (Anand et al., 1 Dec 2024); see the retrieval sketch after this list.
- Augmentation with image captioning: Incorporating high-fidelity diagram descriptions from tools like Gemini Pro or GIT (Anand et al., 1 Dec 2024) to supply models with detailed semantic representations of visuals.
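The retrieval step of such a RAG setup can be sketched as follows, assuming a generic `embed()` function and an in-memory store of solved Q&A pairs with precomputed embeddings; the names and storage format are illustrative, and GPSM4K's actual pipeline may differ.

```python
import numpy as np


def retrieve_examples(question: str, store: list, embed, k: int = 3) -> list:
    # Rank solved Q&A pairs by cosine similarity to the new question.
    q = embed(question)
    q = q / np.linalg.norm(q)
    scored = []
    for entry in store:
        e = entry["embedding"] / np.linalg.norm(entry["embedding"])
        scored.append((float(np.dot(q, e)), entry))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [entry for _, entry in scored[:k]]


def build_rag_prompt(question: str, examples: list) -> str:
    # Prepend retrieved worked solutions as in-context examples.
    blocks = [f"Q: {ex['question']}\nSolution: {ex['solution']}" for ex in examples]
    return "\n\n".join(blocks + [f"Q: {question}\nSolution:"])
```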
Derived and related datasets (GPSM4K, SOLIDGEO, GeoSense) extend Geometry3K’s core with:
- Multi-step, theorem-oriented, and diverse multimodal items.
- Systematic difficulty annotation and fine-grained category labels, especially for solid geometry and spatial vector reasoning (Wang et al., 27 May 2025).
5. Insights, Limitations, and Bottlenecks
Evaluation results collectively demonstrate several salient limitations:
- Insufficient OOD Generalization: All models—regardless of pre-training corpus size—exhibit performance collapse on hard, OOD, and solid/analytic geometry problems (6–10% accuracy even for SOTA models) (Zhang et al., 15 Feb 2024, Wang et al., 27 May 2025).
- Reliance on Pattern Matching: Models are sensitive to superficial changes in tokenization or phrasing, as evidenced by accuracy gains with rephrased questions (Zhang et al., 15 Feb 2024).
- Multistep Reasoning Deficits: In the backward and multi-step subsets, performance drops markedly, reflecting persistent challenges in tracking complex solution chains (Zhang et al., 15 Feb 2024, Anand et al., 1 Dec 2024). Even chain-of-thought prompting is only partially remedial.
- Weakness in Principle Identification and Application: GeoSense documents that models often fail either at selecting the relevant geometric principles (low GPI scores) or at mapping those principles into the diagram context (low GPA scores) (Xu et al., 17 Apr 2025).
- Balance between Computation and Abstraction: Formal reasoning steps (formulas, simple calculations) are handled better than conceptual abstraction (definitional/theorem-based reasoning) (Xu et al., 17 Apr 2025).
6. Future Directions and Benchmark Extensions
Ongoing research and dataset design are directly influenced by the findings from Geometry3K and its derivatives, suggesting several pathways:
- Enhanced backward and multistep reasoning assessment: Models require improved architectures or methods—possibly explicit intermediate supervision or CoT scoring—to resolve their deficits in multi-layered inference.
- Integrated multimodal reasoning: Further fusion of visual and textual modalities is critical, as high-quality diagram description has been shown to systematically improve performance (Zhang et al., 15 Feb 2024, Anand et al., 1 Dec 2024).
- Expanding OOD coverage and complexity: Growing the hard subsets, especially in solid and analytic geometry domains, will allow more granular diagnosis of reasoning bottlenecks (Wang et al., 27 May 2025).
- Evaluation of intermediate reasoning fidelity: New scoring pipelines that weigh not just the accuracy of the answer but also the logical soundness, interpretability, and principled nature of intermediate reasoning steps (as in GeoSense and GPSM4K).
- RAG and context-enhanced generation: Retrieval-based methodologies improve model resilience and generalization (Anand et al., 1 Dec 2024).
- Improved handling of mathematical constants and symbols: Ensuring robust symbolic processing for expressions involving π, roots, and vector notation in model outputs (see the matching sketch after this list).
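One way to make such comparisons robust is to normalize both answers symbolically before matching; the sketch below uses SymPy and illustrates the idea rather than prescribing an evaluator.

```python
import sympy


def answers_match(pred: str, gold: str, tol: float = 1e-6) -> bool:
    """Compare answers symbolically (handles pi, roots, fractions), falling back
    to a numeric tolerance and finally to literal string comparison."""
    try:
        diff = sympy.simplify(sympy.sympify(pred) - sympy.sympify(gold))
        if diff == 0:                   # exact symbolic equality, e.g. "sqrt(3)/2" vs "sqrt(3)/2"
            return True
        return abs(float(diff)) < tol   # numeric closeness for decimal approximations
    except (sympy.SympifyError, TypeError, ValueError):
        return pred.strip() == gold.strip()
```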
7. Broader Impact on Multimodal Reasoning Research
Geometry3K and its successors constitute a pivotal axis for advancing and diagnosing multimodal mathematical reasoning in both open- and closed-source MLLMs. The diversity of problem types, inclusion of both textual and diagram-based modalities, rigorous evaluation strategies, and publicly available resources together provide an empirical foundation for current and future research at the intersection of mathematical reasoning, spatial understanding, and multimodal system design.
Applications extend beyond pure mathematics into robotics, autonomous systems, and educational diagnostics, wherever the integration of symbolic, linguistic, and visual reasoning is required. Geometry3K’s strengths in driving standardized, reproducible evaluations, and its demonstrable influence on subsequent benchmark designs such as GeoEval, GPSM4K, GeoSense, and SOLIDGEO, ensure its continued relevance for at-scale, multimodal intelligence assessment in the academic community.