CapGeo: Caption-Assisted Geometric Reasoning
- CapGeo is a framework that converts geometric diagrams into structured textual captions, enhancing the clarity of spatial and numerical relationships.
- The two-stage pipeline, captioning followed by reasoning, improves model performance, with notable accuracy gains for models such as Qwen2.5-VL-72B and Claude-Opus-4.
- The CapGeo-Bench dataset and keypoint-based metrics offer a granular evaluation of caption fidelity, driving continual refinements in multimodal geometric reasoning.
CapGeo is a caption-assisted geometric reasoning framework for Multimodal LLMs (MLLMs), designed to address persistent limitations in visual geometric understanding and reasoning. It transforms geometric diagrams into structured textual captions and adds these captions to the model context, which is reported to substantially improve the geometric reasoning of both open- and closed-source MLLMs. The CapGeo-Bench dataset and keypoint-based evaluation methodology serve as core assets for benchmarking and refining figure captioning within this context (Li et al., 10 Oct 2025).
1. Framework Architecture and Methodology
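CapGeo processes geometric problems using a two-stage pipeline: captioning followed by reasoning. In the captioning stage, the geometric figure is described textually, guided by an instruction-heavy template that formats the output into formal mathematical language. The reasoning stage fuses this caption (denoted $C$) with the problem statement $Q$ and optionally the original image $I$ to produce the model answer $A$. The schema is formalized as:
$$C = \mathcal{M}_{\text{cap}}(I), \qquad A = \mathcal{M}_{\text{reason}}(Q, C, [I]),$$
where $\mathcal{M}_{\text{cap}}$ and $\mathcal{M}_{\text{reason}}$ are the captioning and reasoning models and the brackets mark the image as an optional input.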
Captions are constructed to explicitly list geometric entities (points, lines, circles, polygons), spatial relationships (congruence, collinearity, perpendicularity, tangency), and precise numerical attributes (coordinates, angle measures, segment lengths). This textual formalization serves to eliminate ambiguous or redundant visual tokens that complicate direct diagram-based reasoning.
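The two-stage flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Model` callable type, the instruction text, the prompt layout, and the stub models are all assumptions introduced here, and any real MLLM client would need to be wrapped to fit this interface.

```python
from typing import Callable, Optional

# Hypothetical interface: any function mapping (prompt, image bytes or None) to text.
Model = Callable[[str, Optional[bytes]], str]

# Illustrative instruction only; the actual CapGeo template is not reproduced here.
CAPTION_INSTRUCTION = (
    "Describe the geometric figure in formal mathematical language. "
    "List the elements, the spatial relations, and the numerical attributes."
)


def build_reasoning_prompt(question: str, caption: str) -> str:
    """Fuse the caption with the problem statement into a single text prompt."""
    return (
        "Figure description:\n" + caption.strip() + "\n\n"
        "Problem:\n" + question.strip() + "\n\n"
        "Solve the problem using the description above."
    )


def solve_with_caption(
    question: str,
    image: bytes,
    captioner: Model,
    reasoner: Model,
    pass_image: bool = False,
) -> str:
    """Stage 1: caption the figure. Stage 2: reason over caption + question."""
    caption = captioner(CAPTION_INSTRUCTION, image)
    prompt = build_reasoning_prompt(question, caption)
    # The original image is optional in the reasoning stage.
    return reasoner(prompt, image if pass_image else None)


# Stub models standing in for real MLLM calls, for demonstration only.
def stub_captioner(prompt: str, image: Optional[bytes]) -> str:
    return "Triangle ABC; AB is perpendicular to BC; AB = 3; BC = 4."


def stub_reasoner(prompt: str, image: Optional[bytes]) -> str:
    return "AC = 5" if "AB = 3" in prompt else "unknown"


answer = solve_with_caption("Find AC.", b"", stub_captioner, stub_reasoner)
print(answer)
```

Keeping the captioner and reasoner as separate callables makes it easy to pair a strong captioning model with a different reasoning model, or to toggle `pass_image` to compare caption-only against caption-plus-image settings.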
In CapGeo-Bench evaluation, captions are decomposed into keypoints representing elements ($E$), spatial relations ($S$), and numerical relations ($N$):
$$K = K_E \cup K_S \cup K_N$$
Generated caption keypoints are matched against the ground truth, and a recall score is computed for each dimension:
$$\mathrm{Recall}_d = \frac{|TP_d|}{|GT_d|}, \qquad d \in \{E, S, N\},$$
where $TP_E$ indicates the true positive elements, $GT_E$ the ground truth elements, and likewise for $S$ and $N$. This provides a granular and interpretable measure of caption fidelity relevant to downstream reasoning.
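A per-dimension recall of this kind can be sketched in a few lines. The benchmark uses LLM-based extraction and semantic matching; the sketch below swaps in a plain normalized string comparison as a stand-in, and the function names, the one-to-one greedy matching, the empty-ground-truth convention, and the example keypoints are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def exact_match(pred: str, gold: str) -> bool:
    """Stand-in matcher; a semantic (e.g. LLM-based) matcher could be plugged in."""
    return _normalize(pred) == _normalize(gold)


def keypoint_recall(
    predicted: List[str],
    ground_truth: List[str],
    match: Callable[[str, str], bool] = exact_match,
) -> float:
    """Recall = matched ground-truth keypoints / all ground-truth keypoints."""
    if not ground_truth:
        return 0.0  # convention chosen for this sketch when nothing is expected
    unused = list(predicted)  # each prediction may match at most one keypoint
    true_positives = 0
    for gold in ground_truth:
        for i, pred in enumerate(unused):
            if match(pred, gold):
                true_positives += 1
                del unused[i]
                break
    return true_positives / len(ground_truth)


# Made-up keypoints for one figure, grouped by dimension.
gold = {
    "elements": ["point A", "point B", "point C", "line AB"],
    "spatial": ["AB is perpendicular to BC", "A, B, D are collinear"],
    "numerical": ["AB = 3", "BC = 4", "angle ABC = 90 degrees"],
}
pred = {
    "elements": ["Point A", "Point B", "Point C"],
    "spatial": ["AB is perpendicular to BC"],
    "numerical": ["AB = 3"],
}

scores: Dict[str, float] = {
    dim: keypoint_recall(pred[dim], gold[dim]) for dim in gold
}
print(scores)
```

In this made-up example the caption recovers three of four elements, one of two spatial relations, and one of three numerical relations, so the numerical dimension scores lowest.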
2. Empirical Improvements over Vision-Only Models
Systematic experiments demonstrate that supplementing problems with captions sharply increases geometric problem-solving accuracy for a range of MLLMs. Notable metrics include:
- Qwen2.5-VL-72B improved from 8.6% to 59.0% accuracy with caption assistance.
- Claude-Opus-4 improved from 44.8% to 73.0%.
- On MathVista, Qwen2.5-VL-72B-Instruct increased from 75.9% to 92.6%.
These results underscore that the primary bottleneck in geometric reasoning is diagram interpretation rather than logical reasoning per se. The use of captions enables models to bypass complex and unreliable visual parsing, aligning model context more closely with the structured relationships required for inference. The performance gains persist across both “vision only” and “vision intensive” problem settings, demonstrating robustness of the caption-assisted approach.
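To put the reported figures on a common footing, the short script below computes the absolute gain in percentage points and the share of the remaining error that caption assistance closes. Only the before and after accuracies come from the text above; the error-reduction measure is an added convenience for comparison, not a metric reported by the source.

```python
# (accuracy without captions, accuracy with captions), in percent, as quoted above.
reported = {
    "Qwen2.5-VL-72B": (8.6, 59.0),
    "Claude-Opus-4": (44.8, 73.0),
    "Qwen2.5-VL-72B-Instruct (MathVista)": (75.9, 92.6),
}


def absolute_gain(before: float, after: float) -> float:
    """Gain in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


def error_reduction(before: float, after: float) -> float:
    """Percentage of the original error (100 - before) removed by captions."""
    return round((after - before) / (100.0 - before) * 100.0, 1)


summary = {
    name: (absolute_gain(b, a), error_reduction(b, a))
    for name, (b, a) in reported.items()
}

for name, (gain, reduction) in summary.items():
    print(f"{name}: +{gain} points, {reduction}% of remaining error closed")
```

The absolute gain is largest where the vision-only baseline is weakest, while the error-reduction view shows that a model with a high baseline can still close a large share of its remaining errors.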
3. CapGeo-Bench Dataset Design and Evaluation Metrics
CapGeo-Bench contains 4,641 figure–caption pairs, spanning three geometric domains (plane, analytic, solid geometry) and four problem difficulty levels. Captions in both Chinese and English systematically convey geometric configurations.
The keypoint-based metric decomposes captions into elements, spatial relations, and numerical relations. Captioning models are scored on recall for each dimension, using LLM-based extraction and semantic matching functions. The metric is shown to correlate strongly with CapGeo’s downstream geometric problem-solving performance; models scoring well on CapGeo-Bench keypoint recall metrics reliably achieve higher accuracy when reasoning over captioned figures.
Table: Example CapGeo-Bench Caption Keypoint Dimensions
Dimension | Description | Example Keypoints |
---|---|---|
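Elements ($E$) | Geometric components | Points A, B, C; lines; circles; polygons |
Spatial ($S$) | Relations/constraints | Collinearity; perpendicularity; tangency |
Numerical ($N$) | Quantitative attributes | Coordinates; angle measures; segment lengths |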
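One way to hold the three keypoint dimensions from the table in code is a small container that can also render them back into a sectioned caption. The class name, field names, and output layout below are hypothetical choices for illustration and do not reflect the actual CapGeo-Bench file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CaptionKeypoints:
    """Keypoints of one figure caption, grouped by dimension."""

    elements: List[str] = field(default_factory=list)   # points, lines, circles
    spatial: List[str] = field(default_factory=list)    # relations, constraints
    numerical: List[str] = field(default_factory=list)  # lengths, angles, coords

    def counts(self) -> Dict[str, int]:
        """Number of keypoints in each dimension."""
        return {
            "elements": len(self.elements),
            "spatial": len(self.spatial),
            "numerical": len(self.numerical),
        }

    def to_caption(self) -> str:
        """Render the keypoints as a three-line structured caption."""
        sections = [
            ("Elements", self.elements),
            ("Spatial relations", self.spatial),
            ("Numerical relations", self.numerical),
        ]
        lines = []
        for title, items in sections:
            body = "; ".join(items) if items else "none"
            lines.append(f"{title}: {body}")
        return "\n".join(lines)


keypoints = CaptionKeypoints(
    elements=["point A", "point B", "circle O"],
    spatial=["AB is tangent to circle O"],
    numerical=["AB = 6", "radius of circle O = 2"],
)
print(keypoints.to_caption())
print(keypoints.counts())
```

Grouping keypoints by dimension up front keeps per-dimension scoring straightforward, since each list can be compared against its ground-truth counterpart independently.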
4. Implications for Multimodal Geometric Reasoning
The CapGeo approach shifts the burden of geometric analysis from image processing to formal textual reasoning, leveraging the strengths of LLMs in symbolic logic, algebraic calculation, and structured deduction. This modality bridging is shown to be particularly effective for geometric tasks, where visual ambiguity (e.g., occlusion, lack of diagram clarity, or complex spatial arrangements) commonly impedes neural vision models.
A plausible implication is that similar caption-assisted schemes may be broadly applicable to other multimodal domains where visual ambiguity constrains model performance and where faithful textual abstraction is feasible (e.g., chemistry diagrams, engineering schematics).
5. Future Directions and Open Challenges
While CapGeo represents a significant advance, keypoint analysis in the numerical dimension remains a challenge—top models reach only 26% accuracy in capturing numerical relations (Li et al., 10 Oct 2025). This suggests more research is needed in precise extraction and formalization of quantitative diagram features.
Potential future research avenues include:
- Refinement of captioning models and templates to improve completeness, especially for quantities and relations not immediately visible.
- Feedback-loop optimization, where reasoning performance further informs caption fidelity.
- Expansion of CapGeo-Bench with a wider variety of diagram structures and problem types, as well as additional languages.
- Application of caption-based reasoning to educational software, automated grading, and tutoring systems in STEM.
CapGeo-Bench is positioned as a standard resource for the community and invites continued benchmarking and improvement of both captioning and reasoning models.
6. Summary and Significance
CapGeo is a caption-assisted framework for geometric reasoning with MLLMs that systematically converts geometric diagrams into structured textual representations, improving reasoning accuracy (for example, from 8.6% to 59.0% for Qwen2.5-VL-72B). Grounded in the CapGeo-Bench dataset and keypoint-based evaluation metrics, this methodology indicates that geometric understanding bottlenecks in multimodal models are largely attributable to visual ambiguity, not logical reasoning capacity. By formalizing geometric figures as captions and integrating these with problem statements, CapGeo provides a practical pathway for advancing multimodal geometric problem solving, and CapGeo-Bench provides a benchmark for evaluating caption quality in the domain (Li et al., 10 Oct 2025).