SiT-Bench: Text-Only Spatial Reasoning
- SiT-Bench is a large-scale, expert-curated benchmark that evaluates spatial intelligence using high-fidelity, text-based scene descriptions without image inputs.
- It comprises 3,892 items across five spatial cognition categories, isolating true symbolic reasoning from visual perception.
- Evaluation reveals substantial gaps in global layout consistency among LLMs, highlighting the benefit of explicit symbolic reasoning such as Chain-of-Thought.
SiT-Bench is a large-scale, expert-curated benchmark designed to rigorously evaluate the spatial intelligence (SI) of LLMs exclusively via textual, coordinate-rich scene descriptions, with all image inputs removed. Developed in response to foundational questions about the provenance of spatial understanding in multi-modal AI—specifically, whether spatial reasoning is rooted in visual encoders or the model’s symbolic “reasoning backbone”—SiT-Bench comprises 3,892 items distributed over five primary spatial cognition categories and 17 subtasks. The benchmark systematically probes the ability of LLMs to construct and reason over coherent “world models” using only high-fidelity, verbal scene narratives featuring precise distances, angles, and reference frames. Evaluation results reveal substantial gaps between the best-performing models and humans, especially on tasks that demand persistent, globally consistent spatial representations (Guo et al., 7 Jan 2026).
1. Motivation and Foundational Objectives
SiT-Bench was created to address a critical confounding factor in prior evaluations of Vision-LLMs (VLMs): the conflation of spatial reasoning with visual perception. While VLMs have shown remarkable dexterity in navigation, manipulation, and geometric inference, there has been no principled separation of pattern recognition from true symbolic spatial reasoning. By stripping away all pixel-level data and representing single/multi-view scenes solely as coordinate-aware text, SiT-Bench enables a direct interrogation of the LLM's capacity for world-model construction, susceptibility to “spatial gaps,” and the rescuing effects of explicit symbolic reasoning protocols such as Chain-of-Thought (CoT).
The benchmark aims to answer:
- Whether LLMs can synthesize coherent spatial maps and relations given only symbolic, textual input.
- Which dimensions of SI—local judgments versus global layout consistency—deteriorate most rapidly in the absence of visual grounding.
- The relative efficacy of explicit reasoning (e.g., CoT) in closing observed performance gaps.
2. Dataset Composition and Task Taxonomy
SiT-Bench consists of 3,892 expert-annotated spatial reasoning items partitioned into five superordinate categories, each comprising several core subtasks:
| Category | Number of Items | % of Total |
|---|---|---|
| Navigation & Planning | 900 | 23.1% |
| Embodied Fine-grained Perception | 1,105 | 28.4% |
| Multi-View Geometric Reasoning | 836 | 21.5% |
| Global Perception & Mapping | 601 | 15.4% |
| Logic & Anomaly Detection | 450 | 11.6% |
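The per-category counts and percentages above are internally consistent with the reported total of 3,892 items; a quick sanity check using the figures from the table:

```python
# Item counts per category, copied from the table above.
counts = {
    "Navigation & Planning": 900,
    "Embodied Fine-grained Perception": 1105,
    "Multi-View Geometric Reasoning": 836,
    "Global Perception & Mapping": 601,
    "Logic & Anomaly Detection": 450,
}

total = sum(counts.values())
assert total == 3892  # matches the reported benchmark size

# Recompute the percentage shares to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
# e.g., Embodied Fine-grained Perception -> 28.4, matching the table
```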
Each category is designed to isolate particular facets of SI as follows:
- Global Perception & Mapping: Cognitive map synthesis from concatenated egocentric scene descriptions; tasks include Panoramic Counting, Scene Layout Reasoning, and JSON-format 2D mapping.
- Navigation & Planning: Prediction of agent or object pose changes under maneuvers; textual path planning; vector inference for egocentric/allocentric motion.
- Multi-View Geometric Reasoning: 3D mental rotation, perspective shifts, multi-view consistency, and spatial puzzles.
- Embodied Fine-grained Perception: Physical interaction inference, relative distances/depths, state trajectory tracking, and action success prediction in simulated robotic scenarios.
- Logic & Anomaly Detection: Internal consistency checks, object permanence, and directional judgment (e.g., adhering to cardinal frame requirements).
3. Scene Representation and Curation Pipeline
To ensure that grounding is exclusively symbolic, SiT-Bench employs a dual-path curation strategy:
- Path A: Generates original QA pairs from diverse imagery (robotic, urban, simulated). GPT-4o filters for spatial complexity; a SOTA VLM composes high-resolution, coordinate-rich textual descriptions, including absolute distances (meters), angular offsets (degrees), reference frames (e.g., cardinal "North"), and structured predicates. Human experts, supported by DeepSeek-R1 automated audits and CoT-derived justifications, finalize gold-standard items.
- Path B: Adapts legacy vision benchmarks focused on spatial tasks by converting image stimuli into dense textual narratives, preserving explicit coordinate structure and multi-view integration.
Each item is vetted for deducibility via text alone, ambiguity minimization, and multi-perspective coherence. Scene encodings utilize predicates such as “Object A is 3 m to the left of Object B,” direct coordinate placements “Building X stands at (x=2.5, y=–1.0),” and structured outputs (e.g., JSON for 2D cognitive mapping).
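The encoding style can be illustrated with a minimal sketch; the field names below are hypothetical and not the benchmark's actual schema, but they show how predicate facts, coordinate placements, and the required structured JSON output might sit together in one item:

```python
import json

# Hypothetical item in the spirit of SiT-Bench's coordinate-rich encodings.
# Field names ("frame", "facts", "expected_output") are illustrative only.
item = {
    "frame": "egocentric, facing North",
    "facts": [
        "Object A is 3 m to the left of Object B",
        "Building X stands at (x=2.5, y=-1.0)",
    ],
    "question": "Produce a 2D map of the named landmarks.",
    # Mapping subtasks require the model to emit exact JSON structures.
    "expected_output": {
        "map_2d": [
            {"name": "Building X", "x": 2.5, "y": -1.0},
        ]
    },
}

serialized = json.dumps(item, indent=2)
```

Because the expected output is structured JSON rather than a letter choice, mapping subtasks can be scored by exact structural comparison instead of string matching.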
4. Evaluation Protocols and Metrics
SiT-Bench employs both zero-shot and few-shot evaluation regimes, with optional Chain-of-Thought prompting for "reasoning-enabled" assessment. The majority of tasks use a multiple-choice format; mapping subtasks require models to output exact JSON structures. Evaluated systems include:
- Proprietary LLMs: GPT-4o, Gemini-3-Flash, DeepSeek-V3.2.
- Open-source VLM backbones: Qwen2.5/3-VL, InternVL3, LLaVA-1.5, Llama3.1.
- Specialized spatial models: Space-Qwen, SpaceThinker, SpaceR, Cosmos-Reason2, Robobrain2.0.
Two principal metrics are reported:
- Accuracy: Proportion of items answered correctly (option match for multiple-choice tasks; exact structural match for JSON mapping outputs).
- Consistency Score: Proportion of items for which answers remain stable under logical reorderings or viewpoint permutations.
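The Consistency Score can be sketched as follows, assuming each item is presented several times under reorderings or viewpoint permutations and an item counts as stable only if every variant receives the same answer (an illustrative formulation, not necessarily the authors' exact implementation):

```python
def consistency_score(answers_by_item):
    """Fraction of items whose answers are identical across all
    permuted presentations (reordering or viewpoint variants).

    answers_by_item: dict mapping item id -> list of answers, one per variant.
    """
    stable = sum(1 for ans in answers_by_item.values() if len(set(ans)) == 1)
    return stable / len(answers_by_item)

# Two of three items are answered consistently -> score 2/3.
score = consistency_score({
    "q1": ["B", "B", "B"],     # stable under reordering
    "q2": ["A", "C", "A"],     # flips under viewpoint permutation
    "q3": ["left", "left"],
})
```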
5. Performance Analysis across Models and Task Types
Evaluation reveals stratified performance:
| Model | Avg Acc | Global Perception | Navigation | Multi-View | Embodied | Logic |
|---|---|---|---|---|---|---|
| Human Level | 74.42% | 67.85% | 80.00% | 78.22% | 71.23% | 72.13% |
| Random Baseline | 27.30% | — | 25.00% | 34.72% | 25.00% | 25.00% |
| Gemini-3-Flash | 59.46% | 35.66% | 77.11% | 68.54% | 72.65% | 59.11% |
| Qwen3-VL-32B (thinking) | 51.06% | 16.34% | 68.67% | 59.45% | 59.54% | 50.00% |
| GPT-4o | 45.70% | 17.74% | 53.78% | 54.55% | 51.28% | 45.33% |
- SOTA models exhibit strong local semantic accuracy (e.g., neighbor relations, everyday spatial QA), but global consistency tasks (e.g., cognitive mapping, panoramic merging) remain a pronounced weakness: even the best "thinking" model falls far below the human level on Global Perception & Mapping.
- Systematic performance gaps persist even for spatially specialized architectures, highlighting the difficulty of maintaining consistent world-models without visual priors.
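The stratification is easy to quantify from the table above: computing per-category human-minus-model gaps for the strongest model shows Global Perception dominating the deficit (category labels abbreviated here for brevity):

```python
# Per-category accuracies (%) copied from the results table above.
human = {"Global": 67.85, "Navigation": 80.00, "Multi-View": 78.22,
         "Embodied": 71.23, "Logic": 72.13}
gemini = {"Global": 35.66, "Navigation": 77.11, "Multi-View": 68.54,
          "Embodied": 72.65, "Logic": 59.11}

# Positive gap = human advantage; negative = model exceeds human level.
gaps = {k: round(human[k] - gemini[k], 2) for k in human}
widest = max(gaps, key=gaps.get)
# Global Perception shows by far the widest gap (32.19 points), while the
# Embodied gap is slightly negative (the model edges past human level).
```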
6. Effect of Explicit Symbolic Reasoning
Enabling Chain-of-Thought ("thinking") mode in evaluated LLMs produces significant absolute accuracy gains, exceeding those obtained by scaling model size; models such as Qwen3-8B, Qwen3-VL-32B, and DeepSeek-V3.2 all improve markedly when reasoning is enabled.
Reasoning-enabled traces demonstrate improved entity overlap detection across views and precise arithmetic in count-merging. Typical error patterns for non-CoT models include "spatial hallucinations"—incorrect summation of overlapping entities and fragmented layout synthesis. This suggests that explicit symbolic reasoning mechanisms substantially boost latent world-modeling capabilities even absent vision-grounded priors.
The trade-off is notable: reasoning-enabled inference can incur substantially higher latency than direct prediction, motivating further research into selectively triggered, efficient hybrid reasoning pipelines.
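One way such a selectively triggered pipeline could look is sketched below; the keyword gate and the `fast_answer`/`cot_answer` callables are hypothetical stand-ins for a direct-prediction path and a reasoning-enabled path, not part of SiT-Bench itself:

```python
# Illustrative sketch of a selectively triggered hybrid reasoning pipeline.
# The trigger heuristic and model interfaces are hypothetical.

GLOBAL_KEYWORDS = ("cognitive map", "panoramic", "merge", "layout")

def needs_deep_reasoning(item_text: str) -> bool:
    """Heuristic gate: route globally-scoped items (mapping, panoramic
    merging) to the CoT path; keep local judgments on the fast path to
    avoid the latency penalty of reasoning-enabled inference."""
    text = item_text.lower()
    return any(k in text for k in GLOBAL_KEYWORDS)

def answer(item_text, fast_answer, cot_answer):
    if needs_deep_reasoning(item_text):
        return cot_answer(item_text)   # slow, reasoning-enabled path
    return fast_answer(item_text)      # direct prediction

routed = answer("Merge the panoramic counts across views.",
                fast_answer=lambda t: "fast", cot_answer=lambda t: "cot")
# routed == "cot": the global-scope item takes the reasoning path
```

In practice the gate would be learned rather than keyword-based, but the structure captures the motivation: pay the CoT latency only where global consistency is at stake.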
7. Implications and Prospective Research Trajectories
SiT-Bench constitutes a rigorous, reproducible framework for assessing the language-model backbone of spatial intelligence independent of pixel cues. Key implications include:
- Vision-centric architectures contribute helpful priors, but do not on their own close the “spatial gap” observed on text-only reasoning tasks.
- Integration of explicit intermediate world-models or symbolic reasoning layers (Chain-of-Thought, structured output protocols) is essential for strong global consistency.
- Mere expansion or specialization of the spatial module (e.g., SpaceR) is insufficient unless the LLM's core world-modeling capacity is simultaneously enhanced.
- Considerable latency penalties of deep reasoning underline the practical necessity for adaptable hybrid solutions.
- SiT-Bench establishes a development target for embodied agents whose SI is fundamentally symbolic. By tracking progress on coordinate-aware, text-only benchmarks, the community can ensure authentic advances in “reasoning backbone” spatial representation and manipulation.
A plausible implication is that future embodied agents and VLMs should focus on hybrid protocols that combine rapid pattern recognition with dynamic invocation of deep symbolic reasoning, targeting not just increased accuracy but also global layout coherence and stability under viewpoint permutations (Guo et al., 7 Jan 2026).