CapGeo-Bench: Geometric Captioning Benchmark
- CapGeo-Bench is a benchmark that assesses models' geometric captioning abilities using keypoint-based metrics for enhanced diagram comprehension.
- It comprises 4,641 bilingual figure–caption pairs spanning varied geometrical topics and difficulty levels to standardize spatial, relational, and numerical evaluations.
- Empirical findings show dramatic gains in reasoning accuracy when high-quality captions are used, underscoring the impact of caption-assisted geometric reasoning.
CapGeo-Bench is a specialized benchmark and dataset designed to rigorously evaluate the capacity of models—particularly Multimodal LLMs (MLLMs)—to generate high-fidelity geometric captions for mathematical diagrams, and to quantify the downstream impact of this captioning on geometric reasoning performance. Emerging from the CapGeo framework, which introduces caption-assisted geometric reasoning, CapGeo-Bench provides a curated corpus of annotated figure-caption pairs, keypoint-based evaluation metrics, and detailed assessment protocols, enabling systematic comparison of geometric captioning approaches and their contributions to mathematical problem-solving.
1. Motivation and Conceptual Framework
The CapGeo approach stems from the observation that current MLLMs, including advanced closed-source systems (e.g., o3, Gemini-2.5-Pro), attain strong results on textual mathematics yet underperform on geometric reasoning tasks primarily due to difficulties in diagram understanding rather than deficiencies in symbolic reasoning. CapGeo recasts the geometric reasoning pipeline by introducing a caption generation stage: the model first maps the visual diagram to a precise, concise textual caption that summarizes spatial elements, relationships, and quantitative constraints, before passing this structured representation (together with the original problem statement) to the downstream LLM for reasoning.
Formally, geometric problem-solving is structured as a two-stage composition:

$C = f_{\text{cap}}(F), \qquad A = f_{\text{reason}}(Q, C),$

where $Q$ is the question, $F$ is the geometric figure, $C$ is the generated caption, and $A$ is the answer. The insertion of $C$ enables the model's symbolic reasoning strengths to be directly leveraged once the visual information is made explicit in language.
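The two-stage decomposition above can be sketched as plain function composition. This is a minimal illustration, not the CapGeo implementation: `captioner` and `reasoner` are hypothetical stand-ins for any MLLM captioner and LLM reasoner, and the prompt wording is invented.

```python
def caption_figure(figure: str, captioner) -> str:
    """Stage 1: map the diagram to a structured textual caption (C = f_cap(F))."""
    prompt = (
        "Describe this geometric figure precisely: list every point, line, "
        "and circle; all spatial relations; and all numeric constraints."
    )
    return captioner(prompt, image=figure)

def solve_with_caption(question: str, caption: str, reasoner) -> str:
    """Stage 2: reason over question + caption only (A = f_reason(Q, C))."""
    prompt = f"Problem: {question}\nFigure description: {caption}\nSolve step by step."
    return reasoner(prompt)

def capgeo_pipeline(question: str, figure: str, captioner, reasoner) -> str:
    caption = caption_figure(figure, captioner)             # visual -> language
    return solve_with_caption(question, caption, reasoner)  # language -> answer
```

The key design point is that the reasoner never sees the image: all visual information must survive the caption bottleneck, which is exactly what the keypoint metrics in Section 3 measure.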
2. Dataset Construction and Composition
CapGeo-Bench consists of 4,641 rigorously curated figure–caption pairs. Annotation encompasses:
- Coverage: Plane Geometry, Analytic Geometry, and Solid Geometry, sampled from a spectrum of K–12 problems up to competition-grade (e.g., Olympiad-level).
- Difficulty Gradient: Four discrete levels of complexity, from elementary to highly advanced geometrical configurations.
- Bilingual Coverage: All captions are provided in both English and Chinese, produced under rigorous translation and annotation standards.
- Structure: Captions are created using strict, instruction-based templates to standardize the extraction of geometric elements, relations, and numeric data.
Each figure–caption instance links an image file containing the geometry problem to a dense, information-rich caption, enabling direct input to LLMs and facilitating controlled ablation studies.
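One plausible shape for such a figure–caption record is sketched below. The field names and the example values are illustrative only, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class CapGeoItem:
    image_path: str   # file containing the geometry diagram
    caption_en: str   # English reference caption
    caption_zh: str   # Chinese reference caption
    topic: str        # "plane" | "analytic" | "solid"
    difficulty: int   # 1-4, elementary to competition-grade

# Hypothetical example record
item = CapGeoItem(
    image_path="figures/0001.png",
    caption_en="Triangle ABC with AB = AC and angle BAC = 40 degrees ...",
    caption_zh="(Chinese translation of the caption)",
    topic="plane",
    difficulty=2,
)
```

A record in this shape can be fed to an LLM directly (caption plus question) or withheld (image only), which is what makes the controlled ablations mentioned above straightforward.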
3. Keypoint-Based Evaluation and Metric Design
To systematically assess the quality of figure captions and to establish their correlation with downstream reasoning performance, CapGeo-Bench introduces a fine-grained keypoint-based metric. For each caption, three orthogonal sets of keypoints are automatically extracted:
- Element Keypoints ($K_E$): Entities such as points ($A$, $B$, ...), lines, circles, polygons, each formally identified.
- Spatial Relation Keypoints ($K_R$): Geometric relations—collinearity, parallelism, perpendicularity, containment, incidence.
- Numerical Relation Keypoints ($K_N$): Numerical constraints—angles, lengths, ratios, and proportional measures.
Given a model-generated caption $\hat{C}$ and the annotated reference $C^{*}$, extraction is performed by prompting an LLM to parse captions into structured lists. For each keypoint dimension $d \in \{E, R, N\}$, recall is computed as

$\text{Recall}_d = \frac{|K_d^{\text{TP}}|}{|K_d^{*}|},$

where $K_d^{\text{TP}}$ denotes the set of "true positive" keypoints (matched semantically against the reference) and $|\cdot|$ denotes set cardinality. This process assesses not mere surface-form similarity but grounded consistency of geometric information.
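Once keypoints are extracted, the recall computation reduces to set matching. The sketch below uses normalized exact-string matching as a stand-in for the benchmark's LLM-judged semantic matching, so it is an approximation of the metric, not the official scorer.

```python
def keypoint_recall(predicted: set[str], reference: set[str]) -> float:
    """Recall = |true-positive keypoints| / |reference keypoints|.

    Two keypoints "match" here if their whitespace/case-normalized strings
    are equal; CapGeo-Bench itself uses LLM-based semantic matching.
    """
    if not reference:
        return 1.0  # nothing to recover
    norm = lambda s: " ".join(s.lower().split())
    ref = {norm(k) for k in reference}
    tp = {norm(k) for k in predicted} & ref  # true positives
    return len(tp) / len(ref)

# Illustrative numerical-relation keypoints for one figure
ref_N = {"angle BAC = 40", "AB = AC"}
pred_N = {"AB = AC", "angle ABC = 70"}
print(keypoint_recall(pred_N, ref_N))  # 0.5: one of two reference facts recovered
```

Precision could be defined symmetrically (dividing by the predicted set), but the benchmark's emphasis on recall reflects that an omitted constraint is fatal for downstream reasoning, whereas a redundant one is usually harmless.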
Empirical analysis shows that these keypoint recall scores—particularly for relations and numerical values—strongly correlate with final reasoning accuracy on the CapGeo pipeline. Enhanced caption scores reliably yield improved solution rates in downstream MLLMs.
4. Caption-Assisted Reasoning Performance
CapGeo-Bench’s main experimental findings are:
- Dramatic Gains with Captioning: Vision-only geometric reasoning yields low baseline accuracy (e.g., 8.6% for Qwen2.5-VL-72B). When high-quality captions are provided, these scores rise to 59.0% for Qwen2.5-VL-72B and from 44.8% to 73.0% for Claude-Opus-4, clearly demonstrating that diagram understanding—not symbolic inference—forms the performance bottleneck.
- Model Class Bridging: Caption guidance enables mid-tier, open-source models to approach or match performance of leading closed-source MLLMs. This finding highlights the criticality of structured visual-to-text conversion for geometric intelligence.
- Benchmark Reliability: The keypoint metric, when used to filter captioning models, identifies those with highest downstream reasoning impact—confirming the utility of CapGeo-Bench as both diagnostic and selection tool.
Table: Caption Assistance Impact (selected results)

| Model          | Vision-Only (%) | + Caption (%) |
|----------------|-----------------|---------------|
| Qwen2.5-VL-72B | 8.6             | 59.0          |
| Claude-Opus-4  | 44.8            | 73.0          |
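The absolute gains implied by the table can be read off directly (numbers are those reported above):

```python
# Accuracy (%) before and after caption assistance, from the table above
results = {
    "Qwen2.5-VL-72B": (8.6, 59.0),
    "Claude-Opus-4": (44.8, 73.0),
}
gains = {m: round(after - before, 1) for m, (before, after) in results.items()}
print(gains)  # {'Qwen2.5-VL-72B': 50.4, 'Claude-Opus-4': 28.2}
```

The weaker vision-only model gains the most, consistent with the claim that diagram understanding, not symbolic inference, is the bottleneck.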
5. Challenges in Geometric Visual Understanding
State-of-the-art MLLMs experience difficulty in:
- Visual Redundancy and Noise: Diagrams often introduce non-informative tokens, occlusion, or misleading visual cues.
- Relation Extraction: Capturing spatial and numerical relations from pure visual input is non-trivial due to subtle diagrammatic cues and conventions.
- Alignment of Visual and Linguistic Modalities: Standard visual encoders are not optimized for spatial structure extraction required for geometry, in contrast to their success on object or scene recognition.
The CapGeo pipeline's explicit conversion of visual information into formal textual descriptions circumvents these bottlenecks by leveraging LLM proficiency on well-structured inputs.
6. Future Research Directions
CapGeo-Bench points to several open research problems:
- Numerical Relation Extraction: Recall for numerical relation keypoints ($K_N$) remains lower than for element ($K_E$) or spatial relation ($K_R$) keypoints, indicating the need for more precise algorithms or prompt engineering for quantitative captioning.
- Instruction Template Refinement: More expressive, less ambiguous captioning templates may further support accurate extraction of intricate geometric configurations.
- Integrated Training Paradigms: Joint training or reinforcement learning approaches could reward improvements in captioning quality directly by their impact on downstream reasoning outcomes.
- Domain-Specific Captioners: Development of geometric diagram-specialized captioning submodels may exploit domain knowledge and diagram conventions more effectively.
- Extension Beyond Geometry: The general pipeline—structured caption extraction followed by language-based reasoning—may be transferable to other diagram-driven domains (e.g., scientific figures, engineering schematics).
7. Significance and Broader Impact
CapGeo-Bench establishes a robust, interpretable, and scalable evaluation paradigm for geometric captioning, providing the first rigorous standard to assess figure-to-text conversion and its direct effect on automated geometric reasoning. This framework:
- Shapes Evaluation Practices: Moves beyond end-to-end black-box metrics to expose which aspects of visual understanding limit reasoning.
- Guides Captioning Model Development: Keypoint metrics direct model selection and architecture innovation toward maximally beneficial information extraction.
- Enables Better Geometric Reasoning: By closing the gap in diagram understanding, even conventional LLMs attain high accuracy in competitive math domains, democratizing access to automated problem-solving in educational and research contexts.
CapGeo-Bench thus charts a new course for multimodal reasoning research, demonstrating that advances in structured caption extraction can be transformational for visual–symbolic intelligence in mathematical domains (Li et al., 10 Oct 2025).