
SiT-Bench: Text-Only Spatial Reasoning

Updated 14 January 2026
  • SiT-Bench is a large-scale, expert-curated benchmark that evaluates spatial intelligence using high-fidelity, text-based scene descriptions without image inputs.
  • It comprises 3,892 items across five spatial cognition categories, isolating true symbolic reasoning from visual perception.
  • Evaluation reveals substantial gaps in global layout consistency among LLMs, highlighting the benefit of explicit symbolic reasoning such as Chain-of-Thought.

SiT-Bench is a large-scale, expert-curated benchmark designed to rigorously evaluate the spatial intelligence (SI) of LLMs exclusively via textual, coordinate-rich scene descriptions, with all image inputs removed. Developed in response to foundational questions about the provenance of spatial understanding in multi-modal AI—specifically, whether spatial reasoning is rooted in visual encoders or the model’s symbolic “reasoning backbone”—SiT-Bench comprises 3,892 items distributed over five primary spatial cognition categories and 17 subtasks. The benchmark systematically probes the ability of LLMs to construct and reason over coherent “world models” using only high-fidelity, verbal scene narratives featuring precise distances, angles, and reference frames. Evaluation results reveal substantial gaps between the best-performing models and humans, especially on tasks that demand persistent, globally consistent spatial representations (Guo et al., 7 Jan 2026).

1. Motivation and Foundational Objectives

SiT-Bench was created to address a critical confounding factor in prior evaluations of Vision-LLMs (VLMs): the conflation of spatial reasoning with visual perception. While VLMs have shown remarkable dexterity in navigation, manipulation, and geometric inference, there has been no principled separation of pattern recognition from true symbolic spatial reasoning. By stripping away all pixel-level data and representing single/multi-view scenes solely as coordinate-aware text, SiT-Bench enables a direct interrogation of the LLM's capacity for world-model construction, susceptibility to “spatial gaps,” and the rescuing effects of explicit symbolic reasoning protocols such as Chain-of-Thought (CoT).

The benchmark aims to answer:

  • Whether LLMs can synthesize coherent spatial maps and relations given only symbolic, textual input.
  • Which dimensions of SI—local judgments versus global layout consistency—deteriorate most rapidly in the absence of visual grounding.
  • The relative efficacy of explicit reasoning (e.g., CoT) in closing observed performance gaps.

2. Dataset Composition and Task Taxonomy

SiT-Bench consists of 3,892 expert-annotated spatial reasoning items partitioned into five superordinate categories, each comprising several core subtasks:

| Category | Number of Items | % of Total |
|---|---|---|
| Navigation & Planning | 900 | 23.1% |
| Embodied Fine-grained Perception | 1,105 | 28.4% |
| Multi-View Geometric Reasoning | 836 | 21.5% |
| Global Perception & Mapping | 601 | 15.4% |
| Logic & Anomaly Detection | 450 | 11.6% |

Each category is designed to isolate particular facets of SI as follows:

  • Global Perception & Mapping: Cognitive map synthesis from concatenated egocentric scene descriptions; tasks include Panoramic Counting, Scene Layout Reasoning, and JSON-format 2D mapping.
  • Navigation & Planning: Prediction of agent or object pose changes under maneuvers; textual path planning; vector inference for egocentric/allocentric motion.
  • Multi-View Geometric Reasoning: 3D mental rotation, perspective shifts, multi-view consistency, and spatial puzzles.
  • Embodied Fine-grained Perception: Physical interaction inference, relative distances/depths, state trajectory tracking, and action success prediction in simulated robotic scenarios.
  • Logic & Anomaly Detection: Internal consistency checks, object permanence, and directional judgment (e.g., adhering to cardinal frame requirements).
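As a concrete illustration of the formats listed above, here is a hypothetical item in the style of the Navigation & Planning category. The field names and schema are assumptions for illustration; the actual item format is not given in this summary.

```python
import json

# Hypothetical SiT-Bench-style item (schema is an assumption): a purely
# textual scene, a multiple-choice question, and a gold answer.
item = {
    "category": "Navigation & Planning",
    "subtask": "egocentric pose update",
    "scene": (
        "You face North. A fountain is 4 m directly ahead of you; "
        "a bench is 2 m to your left."
    ),
    "question": (
        "After you turn 90 degrees clockwise, where is the bench "
        "relative to you?"
    ),
    "choices": {"A": "ahead", "B": "behind", "C": "left", "D": "right"},
    "answer": "B",  # facing North, left = West; after turning to East, West is behind
}
print(json.dumps(item, indent=2))
```

Answering correctly requires the model to maintain the bench's allocentric position (West) while updating the egocentric frame, which is exactly the kind of symbolic bookkeeping the benchmark targets.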

3. Scene Representation and Curation Pipeline

To ensure that grounding is exclusively symbolic, SiT-Bench employs a dual-path curation strategy:

  • Path A: Generates original QA pairs from diverse imagery (robotic, urban, simulated). GPT-4o filters for spatial complexity; a SOTA VLM composes high-resolution, coordinate-rich textual descriptions, including absolute distances (meters), angular offsets (degrees), reference frames (e.g., cardinal “North”), and structured predicates. Human experts, supported by automated audits from DeepSeek-R1 and CoT-derived justifications, finalize gold-standard items.
  • Path B: Adapts legacy vision benchmarks focused on spatial tasks by converting image stimuli into dense textual narratives, preserving explicit coordinate structure and multi-view integration.

Each item is vetted for deducibility via text alone, ambiguity minimization, and multi-perspective coherence. Scene encodings utilize predicates such as “Object A is 3 m to the left of Object B,” direct coordinate placements “Building X stands at (x=2.5, y=–1.0),” and structured outputs (e.g., JSON for 2D cognitive mapping).
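The encodings described above can be sketched as follows. The JSON field names are illustrative assumptions, since the summary specifies only the general style (relational predicates, coordinate placements, and structured JSON output for 2D cognitive mapping).

```python
import json

# Verbal scene narrative combining a relational predicate and an
# absolute coordinate placement, as in the examples above.
scene_text = (
    "Building X stands at (x=2.5, y=-1.0). "
    "Object A is 3 m to the left of Object B."
)

# A structured 2D cognitive map of the kind the mapping subtasks
# require as exact JSON output (field names are assumptions).
cognitive_map = {
    "frame": "allocentric",
    "units": "meters",
    "objects": [
        {"name": "Building X", "x": 2.5, "y": -1.0},
        {"name": "Object B", "x": 0.0, "y": 0.0},
        {"name": "Object A", "x": -3.0, "y": 0.0},  # 3 m to B's left
    ],
}
print(json.dumps(cognitive_map, sort_keys=True))
```

Because mapping subtasks demand exact JSON, an answer is scored on structural fidelity as well as coordinate correctness, not just free-text agreement.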

4. Evaluation Protocols and Metrics

SiT-Bench employs both zero-shot and few-shot evaluation regimes, with optional Chain-of-Thought prompting for “reasoning-enabled” assessment. The majority of tasks use a multiple-choice format; mapping subtasks require models to output exact JSON structures. Evaluated systems include:

  • Proprietary LLMs: GPT-4o, Gemini-3-Flash, DeepSeek-V3.2.
  • Open-source VLM backbones: Qwen2.5/3-VL and InternVL3, along with LLaVA-1.5 and Llama3.1.
  • Specialized spatial models: Space-Qwen, SpaceThinker, SpaceR, Cosmos-Reason2, Robobrain2.0.

Two principal metrics are reported:

  • Accuracy: $\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total items}}$
  • Consistency Score: Proportion of items for which answers remain stable under logical reorderings or viewpoint permutations.
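Both metrics are straightforward to compute. A minimal sketch, assuming one prediction per item for accuracy and, for the consistency score, a list of answers per item (one answer per logical reordering or viewpoint permutation of that item):

```python
def accuracy(predictions, gold):
    """Accuracy = number of correct predictions / total items."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

def consistency(answer_sets):
    """Proportion of items whose answer is identical across all
    logically equivalent variants (reorderings / viewpoint permutations)."""
    stable = sum(len(set(answers)) == 1 for answers in answer_sets)
    return stable / len(answer_sets)

preds = ["A", "B", "C", "D"]
gold = ["A", "B", "B", "D"]
print(accuracy(preds, gold))                   # 0.75
print(consistency([["A", "A"], ["B", "C"]]))   # 0.5: second item flips
```

Note that consistency is independent of correctness: a model can be stably wrong, so the two metrics capture complementary failure modes.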

5. Performance Analysis across Models and Task Types

Evaluation reveals stratified performance:

| Model | Avg Acc | Global Perception | Navigation | Multi-View | Embodied | Logic |
|---|---|---|---|---|---|---|
| Human Level | 74.42% | 67.85% | 80.00% | 78.22% | 71.23% | 72.13% |
| Random Baseline | 27.30% | 25.00% | 34.72% | 25.00% | 25.00% | — |
| Gemini-3-Flash | 59.46% | 35.66% | 77.11% | 68.54% | 72.65% | 59.11% |
| Qwen3-VL-32B (thinking) | 51.06% | 16.34% | 68.67% | 59.45% | 59.54% | 50.00% |
| GPT-4o | 45.70% | 17.74% | 53.78% | 54.55% | 51.28% | 45.33% |
  • SOTA models exhibit strong local semantic accuracy (e.g., neighbor relations, real-world QA often >80%), but global consistency tasks (e.g., cognitive mapping, panoramic merging) remain a pronounced weakness (the best “thinking” model achieves 8.34% vs. human 26.77%).
  • Systematic performance gaps persist even for spatially specialized architectures, highlighting the difficulty of maintaining consistent world-models without visual priors.

6. Effect of Explicit Symbolic Reasoning

Enabling Chain-of-Thought (“thinking”) mode in evaluated LLMs produces significant absolute accuracy gains, exceeding those obtained by scaling model size. For instance:

  • Qwen3-8B: 37.91% → 45.04% (+7.13%)
  • Qwen3-VL-32B: 45.90% → 51.06% (+5.16%)
  • DeepSeek-V3.2: 37.06% → 43.74% (+6.68%)

Reasoning-enabled traces demonstrate improved entity overlap detection across views and precise arithmetic in count-merging. Typical error patterns for non-CoT models include "spatial hallucinations"—incorrect summation of overlapping entities and fragmented layout synthesis. This suggests that explicit symbolic reasoning mechanisms substantially boost latent world-modeling capabilities even absent vision-grounded priors.

The trade-off is notable: reasoning-enabled inference can incur up to 70× the latency of direct prediction, motivating further research into selectively triggered, efficient hybrid reasoning pipelines.
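One selectively triggered design would route only the categories where direct prediction fails worst through the slow Chain-of-Thought path. The sketch below is an assumption-laden illustration of that idea, not a pipeline from the paper: the routing rule, category choice, and model interfaces are all hypothetical.

```python
# Categories where the benchmark reports the largest gaps for direct
# prediction; sending only these through CoT amortizes its ~70x latency.
# (The specific routing set is an illustrative assumption.)
SLOW_CATEGORIES = {
    "Global Perception & Mapping",
    "Multi-View Geometric Reasoning",
}

def answer(item, fast_model, reasoning_model):
    """Hybrid inference: direct prediction by default, Chain-of-Thought
    reasoning only for globally consistent spatial tasks.

    `fast_model` and `reasoning_model` are any callables taking
    (scene_text, question) and returning an answer string.
    """
    if item["category"] in SLOW_CATEGORIES:
        return reasoning_model(item["scene"], item["question"])  # slow CoT path
    return fast_model(item["scene"], item["question"])           # fast direct path
```

A practical refinement would be a learned trigger (e.g., a confidence threshold on the fast model) rather than a static category list, so the slow path is invoked only when direct prediction is actually uncertain.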

7. Implications and Prospective Research Trajectories

SiT-Bench constitutes a rigorous, reproducible framework for assessing the language-model backbone of spatial intelligence independent of pixel cues. Key implications include:

  • Vision-centric architectures contribute helpful priors, but do not on their own close the “spatial gap” observed on text-only reasoning tasks.
  • Integration of explicit intermediate world-models or symbolic reasoning layers (Chain-of-Thought, structured output protocols) is essential for strong global consistency.
  • Mere expansion or specialization of the spatial module (e.g., SpaceR) is insufficient unless the LLM’s core world-modeling capacity is simultaneously enhanced.
  • Considerable latency penalties of deep reasoning underline the practical necessity for adaptable hybrid solutions.
  • SiT-Bench establishes a development target for embodied agents whose SI is fundamentally symbolic. By tracking progress on coordinate-aware, text-only benchmarks, the community can ensure authentic advances in “reasoning backbone” spatial representation and manipulation.

A plausible implication is that future embodied agents and VLMs should focus on hybrid protocols that combine rapid pattern recognition with dynamic invocation of deep symbolic reasoning, targeting not just increased accuracy but also global layout coherence and stability under viewpoint permutations (Guo et al., 7 Jan 2026).
