SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Published 11 Jun 2026 in cs.CV and cs.AI | (2606.13673v1)

Abstract: Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.


Authors (11)

Summary

The paper introduces a code-as-action interface using a persistent Python kernel for iterative spatial reasoning.
It reports 59.9% average accuracy on 20 diverse benchmarks, a gain of 11.2 points over the strongest recent spatial agent baseline.
The method leverages compositional code execution and iterative feedback to enable adaptive, multi-step geometric analysis.

SpatialClaw: Code as the Action Interface for Agentic Spatial Reasoning

Motivation and Problem Definition

Spatial reasoning (locating, relating, and tracking objects in 3D and 4D environments) remains a persistent challenge for VLMs, because existing approaches poorly support compositional geometric analysis and iterative refinement. Prior tool-augmented agents employ either single-pass code execution or structured tool-call interfaces, both of which constrain flexible multi-step reasoning. The paper "SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning" (2606.13673) investigates how action interface design affects spatial agents and proposes that code, written and executed step by step in an agentic loop, should serve as the central interface for spatial analysis.

Action Interface Characterization

Spatial agents traditionally operate through one of two modalities: (1) single-pass code, requiring the agent to commit to an entire computational strategy upfront, limiting observation-dependent adaptation; (2) structured tool-calls, exposing perception operations via APIs (e.g., JSON/XML) that restrict composition and specialized computation. Both modalities fail to enable open-ended 3D/4D reasoning, especially when task-specific geometric aggregation or iterative inspection of tool output is critical. SpatialClaw introduces persistent kernel-based code execution, transforming code into a medium for iterative spatial composition, diagnosis, and revision.
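The difference between a structured tool call and code as action can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `estimate_depth` is a hypothetical stub that returns random values, and `dispatch` is a toy dispatcher, not the paper's tool set or API.

```python
import numpy as np


def estimate_depth(frame_id):
    """Hypothetical perception tool: returns a fake 4x4 depth map in meters."""
    rng = np.random.default_rng(frame_id)
    return rng.uniform(0.5, 5.0, size=(4, 4))


TOOLS = {"estimate_depth": estimate_depth}


def dispatch(call):
    """Structured tool call: the agent may only name a tool and pass arguments."""
    return TOOLS[call["tool"]](**call["arguments"])


# Structured interface: one fixed operation per call, result returned as-is.
depth = dispatch({"tool": "estimate_depth", "arguments": {"frame_id": 0}})

# Code interface: the same tool is an ordinary function, so the agent can
# compose it with task-specific computation (here, a median over a region
# of interest, compared across three frames).
roi = (slice(1, 3), slice(1, 3))
median_per_frame = [float(np.median(estimate_depth(i)[roi])) for i in range(3)]
closest_frame = int(np.argmin(median_per_frame))
```

In the structured case, any aggregation beyond what the tool schema exposes would need a new tool; in the code case it is a single line of NumPy.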

Figure 1: Three action interfaces—single-pass code, structured tool-call, and SpatialClaw's persistent Python—with increasing flexibility for spatial reasoning.

Framework Architecture

SpatialClaw consists of a persistent Python kernel loaded with input frames, perception tools (e.g., depth, segmentation), and scientific libraries (NumPy, SciPy, Matplotlib). Each agentic step emits a Python code cell executed in context, with all intermediate variables retained for later compositional access. Feedback—including printed values, variable summaries, and rendered visual evidence—is appended to the context, enabling subsequent steps to adapt to present observations. Stages in the agentic loop comprise planning (via a separate LLM session), iterative code generation and execution, feedback assembly, and answer commitment.
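A toy version of the persistent-kernel idea is sketched below. It is not the paper's implementation: the class name `StatefulKernel`, the `run_cell` method, and the use of Python's built-in `exec` in place of a real kernel are all assumptions. The sketch only shows how variables survive between cells and how printed output and errors become text feedback.

```python
import contextlib
import io
import traceback


class StatefulKernel:
    """Toy stand-in for a persistent Python kernel (hypothetical)."""

    def __init__(self, preloaded=None):
        # One namespace shared by every cell, so variables persist across steps.
        self.namespace = dict(preloaded or {})

    def run_cell(self, code):
        """Execute one cell and return the text feedback the agent would see."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception:
            # Errors are returned as feedback so a later cell can correct them.
            buffer.write(traceback.format_exc(limit=1))
        return buffer.getvalue()


kernel = StatefulKernel(preloaded={"frames": ["frame_0", "frame_1"]})
out1 = kernel.run_cell("n = len(frames)\nprint('frames loaded:', n)")
out2 = kernel.run_cell("print('twice as many:', n * 2)")  # reuses n from step 1
out3 = kernel.run_cell("print(undefined_name)")  # error surfaces as feedback
```

The second cell succeeds only because `n` was retained from the first, which is the property that single-pass execution lacks.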

Figure 2: The SpatialClaw agentic loop coordinates iterative planning, code execution, and feedback integration in a persistent kernel.

System prompts encode spatial reasoning principles favoring metric computation, cross-validation, and intermediate inspection, eschewing task-specific customization. The planner, isolated from visual input, generates stepwise analysis plans grounded in available tools; the main agent generates actions conditioned on plan, context, and feedback.
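The control flow described above (a text-only planner, then repeated code generation, execution, and feedback until an answer is committed) might look roughly like the following. The functions `planner`, `scripted_agent`, and `solve`, the `distance_m` tool, and the 2.0 m threshold are hypothetical; a scripted policy stands in for the VLM so the example runs without a model.

```python
import contextlib
import io


def run_cell(code, namespace):
    """Execute one cell in the shared namespace and capture printed output."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)
    return buffer.getvalue()


def planner(question, tool_names):
    """Hypothetical planner: sees only the question text and tool names, no images."""
    return f"1. Call {', '.join(tool_names)}. 2. Compute the metric. 3. Answer."


def scripted_agent(plan, history):
    """Stand-in for the VLM: emits a code cell, or an answer once evidence exists."""
    if not history:
        return {"code": "d = distance_m(); print(d)"}
    if len(history) == 1:
        return {"code": "print('far' if d > 2.0 else 'near')"}
    return {"answer": history[-1]["feedback"].strip()}


def solve(question, namespace, max_steps=5):
    tool_names = [name for name, value in namespace.items() if callable(value)]
    plan = planner(question, tool_names)
    history = []
    for _ in range(max_steps):
        action = scripted_agent(plan, history)
        if "answer" in action:  # answer commitment ends the loop
            return action["answer"], history
        feedback = run_cell(action["code"], namespace)
        history.append({"code": action["code"], "feedback": feedback})
    return None, history  # step budget exhausted without an answer


answer, history = solve("Is the chair near or far?", {"distance_m": lambda: 3.2})
```

Each action is conditioned on the plan and the accumulated history, so the second cell can branch on a value the first cell measured.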

Benchmarking and Results

SpatialClaw is evaluated on 20 diverse spatial reasoning benchmarks (single-image, multi-view, video-based 4D, general spatial, and general video), employing six open-source VLM backbones from Qwen and Gemma4 families, spanning 26B–397B parameters. Average accuracy across all tasks is 59.9%, a +11.2 point improvement over the strongest recent spatial agent baseline (SpaceTools), with consistent superiority across backbone variants and task categories. Gains are most pronounced in dynamic 4D video and multi-view settings, validating the advantage of the code action interface for multi-step geometric computation.

Figure 3: Pairwise win/loss margin: SpatialClaw outperforms prior interfaces in 11/13 meta-categories, concentrated in compositional geometric tasks.

Interface comparison demonstrates that neither single-pass code nor structured tool-call achieves comparable generalization or performance, even when the tool set and prompt are held fixed. Ablation studies further show that the persistent kernel's compositional flexibility persists absent predefined utility functions, confirming that the core improvements derive from code-based iterative reasoning with scientific primitives.

Analysis: Task-Adaptive Composition and Failure Modes

Analysis of kernel action traces reveals spontaneous adaptation: agents select primitives (e.g., KD-tree for proximity, dot product for direction) aligned with question semantics, without category-specific prompt engineering. LLM-judge attribution indicates that over 50% of wins against baseline methods stem from code composition, with another 19.5% attributable to iterative control flow; interface-neutral gains are limited to perceptual tasks.
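As an illustration of the two primitives mentioned above, the sketch below answers a proximity question with a KD-tree and a direction question with a dot product. The object names, the centroid coordinates, and the camera-frame convention (x right, y down, z forward) are made up for the example and do not come from the paper.

```python
import numpy as np
from scipy.spatial import KDTree

# Hypothetical 3D object centroids in meters (camera frame: x right, y down, z forward).
names = ["chair", "table", "lamp", "sofa"]
centroids = np.array([
    [0.0, 0.0, 2.0],
    [0.5, 0.0, 2.2],
    [3.0, -1.0, 4.0],
    [-2.0, 0.0, 5.0],
])

# Proximity: which object is closest to the chair?
tree = KDTree(centroids)
# k=2 because the nearest hit is the query point itself (distance 0).
dists, idxs = tree.query(centroids[0], k=2)
nearest_name = names[int(idxs[1])]
nearest_dist = float(dists[1])

# Direction: is the lamp to the right of the chair?
# Project the chair-to-lamp offset onto the camera's right axis.
right_axis = np.array([1.0, 0.0, 0.0])
offset = centroids[2] - centroids[0]
lamp_is_right = bool(np.dot(offset, right_axis) > 0.0)
```

With these values the nearest object to the chair is the table, and the lamp lies to the chair's right because the projected offset is positive.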

Figure 4: SpatialClaw's largest gains are in multi-step spatial reasoning tasks demanding compositional computation.

Failure mode categorization identifies geometric reasoning errors as the dominant bottleneck, with perception limitations (e.g., hallucinations, tool failures) as the second most common cause. The remaining cases originate from misinterpretations, recovery deficiencies, or ambiguous annotations.

Practical and Theoretical Implications

SpatialClaw shows that the action interface is a critical design choice for spatial reasoning agents: code-as-action with persistent kernel state enables task-adaptive, multi-step composition that structured APIs support poorly. Because the framework is training-free, it can be applied to existing VLMs out of the box, improving spatial reasoning without fine-tuning or benchmark- or model-specific adaptation.

Theoretically, the results suggest that compositional expressiveness and iterative feedback integration are necessary conditions for agentic spatial reasoning. The framework provides a foundation for reinforcement learning and automatic error recovery in future work.

Conclusion

SpatialClaw establishes code as the central action interface for spatial reasoning, outperforming all structured and single-pass alternatives across a comprehensive benchmark suite and backbone variants. The expressiveness and adaptability of persistent kernel code, rather than toolset augmentation or backbone tuning, drive substantive performance improvements. Further research should prioritize compositional interface design and systematic error recovery to advance spatial agent intelligence.
