Spatial Chain-of-Thought (CoT) in Vision Models
- Spatial Chain-of-Thought (CoT) is a method that externalizes intermediate reasoning as explicit spatial references, improving the clarity and structure of the reasoning process.
- It links symbolic tokens with precise 2D/3D geometric evidence, enabling robust performance across maze navigation and 3D object analysis tasks.
- Training protocols using Supervised Fine-Tuning and Reinforcement Learning yield faster convergence and higher test accuracy for spatial reasoning challenges.
Spatial Chain-of-Thought (CoT), also known as Grounding CoT, is a paradigm for structuring intermediate reasoning within vision-language models (VLMs) and large reasoning models (LRMs) by externalizing each reasoning step as an explicit spatial reference. This approach enhances multi-step visual and spatial reasoning, improving semantic grounding and cross-task generalization in both 2D and 3D vision-centric tasks by linking symbolic inferences to precise geometric evidence. Spatial CoT is formalized and validated through benchmarks in maze navigation and 3D object analysis, and is a critical tool for constructing more robust vision-centric datasets and training pipelines (Du et al., 27 Nov 2025, Chen et al., 8 Mar 2025).
1. Formalism of Spatial (Grounding) Chain-of-Thought
In Grounding CoT, the model outputs both a symbolic reasoning token $r_t$ and a spatial grounding $g_t$ at each reasoning step $t$. The simplest form expresses each step as an image coordinate:

$$s_t = (r_t, g_t), \qquad g_t = (x_t, y_t) \in \mathbb{R}^2,$$

with the full reasoning trajectory:

$$C = (s_1, s_2, \ldots, s_T) = \big((r_1, g_1), \ldots, (r_T, g_T)\big).$$

For broader applications, the spatial grounding can incorporate points, lines, or regions, defined by:

$$g_t \in \mathcal{P} \cup \mathcal{L} \cup \mathcal{R},$$

where $\mathcal{P}$, $\mathcal{L}$, and $\mathcal{R}$ denote point, line, and region primitives. The reasoning chain is then:

$$C = \big((r_1, g_1), \ldots, (r_T, g_T)\big), \qquad g_t \in \mathcal{P} \cup \mathcal{L} \cup \mathcal{R},$$

where each symbolic segment is tightly bound to a spatial reference. This explicit localization links semantic progression to visual space, supporting generalization across different geometries and task domains (Du et al., 27 Nov 2025).
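The step/trajectory structure above can be sketched as a minimal data type. The class names and serialization format below are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GroundedStep:
    """One Grounding-CoT step: a symbolic token paired with a 2D grounding."""
    token: str                  # symbolic reasoning fragment, e.g. "move right"
    point: Tuple[int, int]      # (x, y) image coordinate grounding the step

def serialize_chain(steps: List[GroundedStep]) -> str:
    """Render a trajectory as token/coordinate pairs; dropping the tokens
    would yield the minimal (G-CoT-least) coordinate-only form."""
    return " -> ".join(f"{s.token}@({s.point[0]},{s.point[1]})" for s in steps)

chain = [GroundedStep("start", (10, 10)), GroundedStep("right", (42, 10))]
print(serialize_chain(chain))  # start@(10,10) -> right@(42,10)
```

Extending `point` to a list of coordinates would cover the line and region groundings described above.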
2. Training Protocols: SFT and Reinforcement Learning
Spatial CoT learning employs a two-stage protocol:
a. Supervised Fine-Tuning (SFT):
- Datasets are synthesized for multiple CoT formats (Language CoT, Grounding CoT, Visual CoT) using rule-based path extraction and large model prompting.
- Each example is encapsulated within dedicated CoT tags, with minimal Grounding CoT (G-CoT-least) omitting all but the essential coordinate sequence.
- The VLM vision encoder is frozen; only the LLM head is fine-tuned (batch size 64, three epochs, fixed learning rate) with cross-entropy loss over text tokens (Du et al., 27 Nov 2025).
b. Reinforcement Learning (RL):
- Post-SFT, the model undergoes policy optimization via Group Relative Policy Optimization (GRPO) on expanded synthetic datasets.
- The reward function combines path correctness and output formatting:

$$R = R_{\text{correct}} + \alpha\, R_{\text{format}},$$

with $R_{\text{correct}} = 1$ only if the inferred path obeys all grid constraints, and $R_{\text{correct}} = 0$ otherwise.
- RL proceeds for up to 1000 steps with rollout batch sizes of 128, driving models to convergence on both in-distribution and out-of-distribution spatial tasks (Du et al., 27 Nov 2025).
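A reward in this spirit can be sketched as follows; the weight `alpha`, the wall encoding, and the formatting check are assumptions, not the paper's exact definition:

```python
def path_reward(path, walls, goal, alpha=0.1):
    """Correctness term plus a small formatting bonus.
    `path` is a list of (row, col) cells; `walls` is a set of
    frozenset cell pairs that block movement between adjacent cells."""
    def adjacent(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1

    correct = 1.0
    for a, b in zip(path, path[1:]):
        # A step is invalid if the cells are not adjacent or a wall blocks them.
        if not adjacent(a, b) or frozenset((a, b)) in walls:
            correct = 0.0
            break
    if not path or path[-1] != goal:
        correct = 0.0

    # Stand-in formatting check: every step must be a 2-element cell.
    well_formed = 1.0 if all(len(c) == 2 for c in path) else 0.0
    return correct + alpha * well_formed

walls = {frozenset({(0, 0), (0, 1)})}
print(path_reward([(0, 0), (1, 0), (1, 1)], walls, (1, 1)))  # 1.1
```

The binary correctness term mirrors the all-or-nothing grid-constraint condition above; the formatting term gives the policy a gradient toward well-structured chains even when the path is wrong.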
3. Benchmarks: Maze-Solving and 3D-CoT
3.1 Maze-Solving (2D Spatial Reasoning)
- Grid: an $n \times n$ lattice of cells.
- Walls: a set $W$ of edges between adjacent cells; an edge in $W$ indicates a blocked move.
- Paths: sequences of adjacent cells, valid only if no consecutive pair is separated by a wall.
- Data: All intermediate CoT steps are auto-generated by rule functions, mapping path cells to image coordinates. Both correct and synthetic “mistake plus correction” step chains are included for robustness (Du et al., 27 Nov 2025).
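The rule-based mapping from path cells to image coordinates can be sketched as below; the cell size and origin are assumed values, not the benchmark's rendering parameters:

```python
def cells_to_coords(path, cell_px=32, origin=(0, 0)):
    """Map (row, col) grid cells to (x, y) pixel centers, the kind of
    rule function used to auto-generate Grounding-CoT coordinates."""
    ox, oy = origin
    return [(ox + c * cell_px + cell_px // 2,
             oy + r * cell_px + cell_px // 2) for r, c in path]

print(cells_to_coords([(0, 0), (0, 1)]))  # [(16, 16), (48, 16)]
```

Because the mapping is deterministic, every valid path yields an exact coordinate chain with no manual annotation.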
3.2 3D-CoT Benchmark (3D Vision-Language Reasoning)
- Sources: Objaverse, ShapeNet, ABO, 3D-FUTURE datasets.
- CoT-CAP3D: 1.51M point-cloud–text pairs, several annotation styles (No-CoT, Unmarked CoT, Tagged CoT with reasoning markers).
- Hierarchy: Each chain covers object recognition, functional inference, and causal reasoning.
- Annotation: LLM-generated, with 20% manual checks for logical fidelity (Chen et al., 8 Mar 2025).
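The contrast between Tagged and Unmarked CoT annotation styles can be illustrated with a small formatter; the tag names are hypothetical, not CoT-CAP3D's actual markers:

```python
def annotate(recognition, function, cause, tagged=True):
    """Join the three hierarchy levels (recognition, functional inference,
    causal reasoning) either with explicit reasoning markers (Tagged CoT)
    or as plain running text (Unmarked CoT)."""
    steps = [("recognition", recognition), ("function", function), ("cause", cause)]
    if tagged:
        return " ".join(f"<{name}>{text}</{name}>" for name, text in steps)
    return " ".join(text for _, text in steps)

print(annotate("a mug", "holds liquid", "the handle enables grip", tagged=False))
```

The reported architecture interaction suggests both variants are worth emitting from the same underlying chain.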
4. Empirical Results and the "Short is Long" Effect
Spatial CoT, especially in minimal forms, has demonstrated superior convergence and generalization properties:
| Method | RL Steps to 90% Train Accuracy | Final Test Accuracy (7×7) |
|---|---|---|
| L-CoT | ~600 | ~78% |
| G-CoT | ~400 | ~82% |
| V-CoT | ~300 | ~83% |
| G-CoT-least | ~200 | ~86% |
- All formats reach near-perfect training accuracy, but spatially grounded and minimal chains (G-CoT-least) require substantially fewer RL steps.
- Test accuracy on larger, out-of-distribution mazes is consistently higher for G-CoT-least, highlighting enhanced generalization (Du et al., 27 Nov 2025).
- In 3D object reasoning, CoT-structured annotation improves multi-step reasoning: LRMs benefit from unmarked CoTs, while LLMs perform better with explicit tags—the form of annotation interacts with downstream architecture design. Metrics such as object recognition (OBJ), functional inference (FUNC), and interaction (INTER) scores all improve with CoT annotation (Chen et al., 8 Mar 2025).
5. Structural and Practical Considerations in Dataset Design
- Minimal spatial grounding: Favor short coordinate sequences over verbose, wordy reasoning traces.
- Synthetic labels: Rule-based path extraction ensures accurate, scalable annotation, obviating the need for manual labeling.
- Error injection: Limited inclusion of mistake-correction sequences, when desired, keeps reasoning chains compact but robust.
- Scale coverage: Training sets should span a range of grid or object sizes but always supervise with minimal spatial grounding for scale invariance.
- Formatting reward: During RL, maintain regularity in chain structure while allowing the model to compress and optimize its spatial policy (Du et al., 27 Nov 2025).
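The rule-based path extraction listed above can be sketched as a breadth-first search over the walled grid; the wall encoding as frozenset cell pairs is an assumption:

```python
from collections import deque

def extract_path(n, walls, start, goal):
    """Shortest wall-respecting path on an n x n grid, or None.
    `walls` is a set of frozenset cell pairs blocking adjacent moves."""
    prev = {start: None}
    q = deque([start])
    while q:
        cell = q.popleft()
        if cell == goal:
            # Walk predecessor links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nxt = (nr, nc)
            if (0 <= nr < n and 0 <= nc < n and nxt not in prev
                    and frozenset((cell, nxt)) not in walls):
                prev[nxt] = cell
                q.append(nxt)
    return None
```

Feeding the extracted path through a cell-to-coordinate rule then yields the minimal spatial supervision favored above, at any grid scale.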
For 3D tasks, hierarchical annotation (recognition, affordance, cause) yields strong compositional reasoning, with the dual-layer evaluation framework covering both intermediate reasoning and final answer quality. Annotation style further mediates architecture-specific learning benefits (Chen et al., 8 Mar 2025).
6. Interpretative Insights and Broader Impact
- Compact inductive bias: Minimal spatial chains focus learning on navigation or transformation rules, supporting efficient and generalizable policies (Du et al., 27 Nov 2025).
- Explicit structure: Tagged chains increase interpretability, especially in black-box LLMs, though they may disrupt the natural rhythm of models optimized for fluid inference (Chen et al., 8 Mar 2025).
- Trade-offs: Annotation form and chain conciseness directly impact model learning behavior and cross-task transfer.
- Extension: The CoT approach, particularly spatial grounding, extends to robotics planning, medical imaging, and embodied AI—where explicit geometric reasoning underlies performance (Chen et al., 8 Mar 2025).
A plausible implication is that automated construction of minimal spatial CoT chains could enable scalable, generalizable reasoning in large-scale multimodal settings, mitigating the annotation burden and opening new avenues for efficient curriculum learning.
7. Limitations and Future Directions
- Manual annotation overhead: In 3D domains, cost remains prohibitive; future directions include automated or synthetic chain construction (Chen et al., 8 Mar 2025).
- Inference cost: LRMs leveraging CoT may incur higher computational demands, indicating a need for more efficient architectures and training schedules.
- Generalization: While spatial CoT has proven robust across scales and task modalities, further studies are required to assess transferability to highly disparate geometric settings and real-world vision-language tasks.
The prevailing evidence underscores that the spatialization of CoT—especially in minimal, grounding-oriented forms—substantially enhances the visual reasoning capabilities and generalization of advanced multimodal models (Du et al., 27 Nov 2025, Chen et al., 8 Mar 2025).