Caption-Assisted Geometric Reasoning

Updated 25 December 2025
  • The paper introduces a novel approach that injects explicit geometric captions into vision-language systems, yielding significant gains in spatial localization and reasoning tasks.
  • It details a comparative analysis of pre-captioning, self-captioning, and interleaved captioning methods, demonstrating absolute improvements of up to 55 percentage points in localization accuracy.
  • The work dissects cognitive bottlenecks in perception, attention, and memory, providing actionable insights to overcome limitations in current vision-based embedding models.

Caption-Assisted Geometric Reasoning refers to a paradigm in vision-language modeling whereby explicit natural-language captions describing geometric entities, spatial relations, or visual configurations are leveraged to improve reasoning over geometric or spatially complex tasks. Rather than relying solely on direct pixel-based vision or dense multi-modal embeddings, these systems inject or extract structured textual descriptions (“captions”) that bridge the representation gap between visual scenes and linguistic inference models. Recent work has demonstrated that such approaches yield substantial gains, particularly in tasks demanding precise localization, relational logic, or multi-step spatial reasoning, isolating deficiencies in current model architectures and advancing state-of-the-art results on both synthetic and real-world evaluation suites (Weng et al., 24 May 2025).

1. Task Taxonomy and Evaluation Protocols

Caption-assisted geometric reasoning frameworks address a comprehensive suite of spatial, geometric, and composite reasoning tasks specifically designed to probe model abilities along cognitive axes:

  • Perception Tasks: Categorization (Cat) and Localization (Loc) in which the model identifies classes, counts objects, and accurately reports or compares locations within static frames.
  • Attention Tasks: Selective reporting or comparison where the model must focus attention in the presence of distractors, utilizing cues based on features (category) or explicit spatial information (location).
  • Memory Tasks: Maintaining object identity or spatial location across blanked or distractor intermediates, requiring short- or mid-term memory for geometric properties.
  • Composite Visual Reasoning (CVR): Procedurally generated tasks that combine perception, attention, and memory in logical multi-step sequences, varying in complexity (Low, Medium, High) and encompassing both categorical and locational dependencies.

Each task category is isolated to diagnose specific failure modes and to test whether textualization via captions alleviates the relevant bottleneck (Weng et al., 24 May 2025). Quantitative assessment is performed via accuracy measurements on each subtask, with absolute improvements ($\Delta$) defined as the accuracy increase after introducing self-generated or ground-truth captions.
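
As a concrete illustration of this protocol, the following minimal Python sketch scores a subtask under the three input settings and reports $\Delta$; the record layout and setting labels are illustrative assumptions, not the paper's actual evaluation harness.

```python
# Hypothetical per-trial records: (subtask, input_setting, correct) triples.
# Setting labels mirror the paper's conditions: "base" = image only,
# "sc" = self-captioned, "pc" = ground-truth pre-captioned.
records = [
    ("perception_loc", "base", False), ("perception_loc", "base", True),
    ("perception_loc", "sc", True),    ("perception_loc", "sc", True),
    ("perception_loc", "pc", True),    ("perception_loc", "pc", True),
]

def accuracy(records, subtask, setting):
    """Fraction of correct answers for one subtask under one input setting."""
    hits = [ok for t, s, ok in records if t == subtask and s == setting]
    return sum(hits) / len(hits) if hits else float("nan")

def deltas(records, subtask):
    """Absolute accuracy improvement of each caption setting over the baseline."""
    base = accuracy(records, subtask, "base")
    return {s: accuracy(records, subtask, s) - base for s in ("sc", "pc")}

print(deltas(records, "perception_loc"))  # {'sc': 0.5, 'pc': 0.5}
```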

2. Vision–Text Decoupling and Caption Insertion Strategies

The core methodology involves a decoupling of raw vision from downstream linguistic reasoning by replacing or augmenting image input with textual captions. Three primary settings are employed:

  • Pre-captioned (PC): Direct provision of ground-truth captions describing object category and spatial information; these supersede the raw image as model input.
  • Self-captioning (SC): The model generates its own concise caption (prompted to include category and location) for each image, which is then circulated back into the input stream for downstream reasoning.
  • Self-captioning-Interleaved (SC-I): Alternating the image and its self-generated caption so that both modalities are present in a structured, sequence-aligned manner.

In each scenario, subsequent reasoning is prompted purely over textual representations, and success is measured by comparing performance to the base visual-only setting. This procedure isolates deficiencies in vision encoding (perceptual bottleneck) from failures in pure chain-of-thought logic (reasoning bottleneck), revealing that for many tasks, localization and complex geometric inference are the primary points of failure in vision-based embeddings, not in the language reasoning core (Weng et al., 24 May 2025).
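
A minimal sketch of how the three settings above might be assembled as model input is given below; the message schema, field names, and the `captioner` callable are assumptions for illustration rather than the paper's implementation.

```python
def build_inputs(frames, setting, captioner=None):
    """Assemble model input for one trial.

    frames:    list of (image, ground_truth_caption) pairs
    setting:   "PC", "SC", or "SC-I"
    captioner: callable image -> caption, used by the self-captioning settings
    """
    if setting == "PC":
        # Ground-truth captions supersede the raw images entirely.
        return [{"type": "text", "text": cap} for _, cap in frames]
    if setting == "SC":
        # The model captions each image; only the captions are fed back.
        return [{"type": "text", "text": captioner(img)} for img, _ in frames]
    if setting == "SC-I":
        # Image and its self-generated caption alternate in sequence.
        inputs = []
        for img, _ in frames:
            inputs.append({"type": "image", "image": img})
            inputs.append({"type": "text", "text": captioner(img)})
        return inputs
    raise ValueError(f"unknown setting: {setting}")
```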

3. Chain-of-Thought Guidance and Reasoning Prompts

Chain-of-thought (CoT) prompting is systematically incorporated by appending explicit instructions to evaluation prompts, such as:

“What is the correct answer to this task? (…possible answers…). Think step-by-step, analyze each frame and provide your answer here:”

Models are thus encouraged, not through majority voting but through explicit articulation of inference, to enumerate their logical steps (e.g., “Frame 1: …, Frame 2: …, therefore …”) while reasoning over text-based geometric descriptions. This protocol is critical for disentangling weaknesses in the integration of perceptual (visual) input with symbolic geometric logic; it ensures that failures in geometric reasoning cannot simply be attributed to the absence of multi-hop inference in the core LLM (Weng et al., 24 May 2025). CoT scaffolding is a direct probe into the model’s ability to utilize explicit geometry once all relevant facts are available in natural language.
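
For concreteness, a prompt builder following the quoted template might look like the sketch below; everything beyond the quoted wording (the function name, the comma-joined answer options) is an assumption.

```python
COT_SUFFIX = ("Think step-by-step, analyze each frame "
              "and provide your answer here:")

def cot_prompt(question, choices):
    """Append the explicit chain-of-thought instruction to a task prompt."""
    return f"{question} ({', '.join(choices)}). {COT_SUFFIX}"

print(cot_prompt("What is the correct answer to this task?",
                 ["top-left", "top-right", "bottom-left", "bottom-right"]))
```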

4. Quantitative Impact and Measurement Formulations

Absolute performance gains due to caption assistance are measured as follows. Let $A_{\mathrm{Base}}$ be the accuracy using direct image input, $A_{\mathrm{SC}}$ with self-captioned text, and $A_{\mathrm{PC}}$ with ground-truth pre-captioning:

$$\Delta_{\mathrm{SC}} = A_{\mathrm{SC}} - A_{\mathrm{Base}}, \qquad \Delta_{\mathrm{PC}} = A_{\mathrm{PC}} - A_{\mathrm{Base}}$$

On spatial localization tasks for Qwen2.5-VL-7B [Table 5 in (Weng et al., 24 May 2025)]:

  • Perception (Loc): $A_{\mathrm{Base}} = 44.33\%$, $A_{\mathrm{SC}} = 72.67\%$ ($\Delta_{\mathrm{SC}} = +28.34\%$); $A_{\mathrm{PC}} = 99.33\%$ ($\Delta_{\mathrm{PC}} = +55.00\%$)
  • Memory (Loc): $A_{\mathrm{Base}} = 42.22\%$, $A_{\mathrm{SC}} = 69.11\%$ ($\Delta_{\mathrm{SC}} = +26.89\%$); $A_{\mathrm{PC}} = 89.61\%$ ($\Delta_{\mathrm{PC}} = +47.39\%$)

Composite tasks show similar trends, with $\Delta_{\mathrm{SC}}$ typically in the $+8$ to $+18$ pp range and $\Delta_{\mathrm{PC}}$ up to $+30$ pp for high-complexity localizations. This pattern is robust across multiple evaluated state-of-the-art models.
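
These deltas follow directly from the definitions above and can be checked mechanically:

```python
# Perception (Loc), Qwen2.5-VL-7B, Table 5 of Weng et al. (2025)
assert round(72.67 - 44.33, 2) == 28.34   # Delta_SC
assert round(99.33 - 44.33, 2) == 55.00   # Delta_PC
# Memory (Loc)
assert round(69.11 - 42.22, 2) == 26.89   # Delta_SC
assert round(89.61 - 42.22, 2) == 47.39   # Delta_PC
```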

5. Cognitive Bottleneck Analysis and Mechanistic Explanation

Detailed analysis along the axes of perception, attention, and memory identifies the loci of geometric reasoning failure:

  • Perception bottleneck: While models attain high accuracy on category recognition, spatial localization lags significantly.
  • Attention bottleneck: Selective focus on cued locations or features falters in the presence of distractors, especially for purely spatial cues.
  • Memory bottleneck: Persistent deficits are observed in maintaining precise locational information across delays, even as categorical memory is preserved.

Caption assistance, by explicitly introducing natural-language spatial descriptors, bypasses the integration stage between raw vision tokens and symbolic reasoning. With location and relations written out, the LLM’s reasoning module is able to execute multi-step inference chains, as demonstrated by the dramatic post-captioning gains. The primary geometric limitation is thus attributed to the misalignment or insufficient binding of fine-grained spatial tokens in the vision encoder, rather than any symbolic reasoning incapacity in the language core (Weng et al., 24 May 2025).
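
The mechanistic claim can be made concrete with a toy example (entirely illustrative, not from the paper): once a caption has written positions out as text, a spatial relation reduces to a trivial symbolic comparison that any LLM-style reasoner can perform.

```python
# Caption-style facts: positions made explicit as text, e.g. a caption like
# "a red circle at (1, 3); a blue square at (4, 3); a green star at (2, 1)".
scene = {"red circle": (1, 3), "blue square": (4, 3), "green star": (2, 1)}

def left_of(a, b, scene):
    """Is object a strictly left of object b? Trivial once positions are explicit."""
    return scene[a][0] < scene[b][0]

print(left_of("red circle", "blue square", scene))  # True
```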

In summary, caption-assisted geometric reasoning reframes visual tasks with spatial and geometric complexity into a text domain where transformer LLMs excel, providing an immediately effective and generalizable solution to perception–reasoning bottlenecks in contemporary vision-language systems. The framework enables controlled, interpretable assessment of model abilities, isolates vision-specific weaknesses, and provides a simple but powerful workaround by making geometric structure explicit in text.

References

  1. Weng et al., 24 May 2025.
