ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

Published 11 May 2026 in cs.CV and cs.AI | (2605.10106v1)

Abstract: Recent advances in Multi-modal LLMs (MLLMs) target 3D spatial intelligence, yet the progress has been largely driven by post-training on curated benchmarks, leaving the inference-time approach relatively underexplored. In this paper, we take a training-free perspective and introduce ViSRA, a human-aligned Video-based Spatial Reasoning Agent, as a framework to probe the spatial reasoning mechanism of MLLMs. ViSRA elicits spatial reasoning in a modular and extensible manner by leveraging explicit spatial information from expert models, enabling a plug-and-play flexible paradigm. ViSRA offers two key advantages: (1) human-aligned and transferable 3D understanding rather than task-specific overfitting; and (2) no post-training computational cost along with heavy manual curation of spatial reasoning datasets. Experimental results demonstrate consistent improvement across a set of MLLMs on both existing benchmarks and unseen 3D spatial reasoning tasks, with ViSRA outperforming baselines by up to a 15.6% and 28.9% absolute margin respectively.

Abstract PDF Upgrade to Chat

Authors (8)

Summary

The paper introduces ViSRA, a novel training-free, inference-time agent that leverages modular spatial tools to enhance 3D video reasoning in MLLMs.
The paper demonstrates substantial accuracy improvements up to 15.6% on benchmarks and up to 28.9% on out-of-distribution tasks.
The paper’s framework provides interpretable, evidence-grounded spatial reasoning through a multi-role agent architecture and modular tool orchestration.

ViSRA: An Inference-Time Framework for 3D Video-Based Spatial Reasoning in MLLMs

Motivation and Background

Despite significant progress in Multi-modal LLMs (MLLMs) for image and video understanding, the extension of genuine 3D spatial reasoning – such as comprehension of relative distances, directions, and object permanence in dynamic scenes – remains elusive. Most prior improvements in MLLMs' spatial intelligence have come from supervised post-training with large-scale, curated datasets, which introduces expensive computational overhead and fosters benchmark-specific overfitting, reducing transferability to out-of-distribution (OOD) situations or complex, real-world tasks.

The paper "ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal LLMs" (2605.10106) presents ViSRA, a training-free, inference-time agentic architecture designed to augment spatial reasoning capabilities of MLLMs by orchestrating modular, domain-expert perception tools in a human-aligned manner. ViSRA obviates the need for further fine-tuning or dataset curation while enabling generalizable, flexible, and interpretable 3D spatial reasoning grounded in explicit intermediate outputs.

Figure 1: Comparison of three paradigms for 3D spatial reasoning. The left panel diagrams the paradigms, the middle panel provides qualitative examples showing ViSRA's advantage on both in-benchmark and novel questions, and the right panel presents ViSRA's superior accuracy across established and unseen tasks.

Limitations of Post-Training Approaches and the Need for Tool-Driven Inference

Current post-training strategies – including spatially grounded supervised fine-tuning, architectural adaptations, and spatial instruction tuning – have demonstrated accuracy gains on curated benchmarks. However, these improvements are largely non-transferable, as evidenced by sharp accuracy drops on OOD question types or novel task formulations. Notably, even with access to precise, ground-truth cognitive maps, state-of-the-art MLLMs cannot reliably infer straightforward spatial relations, indicating fundamental misalignment with human reasoning about space.

The core empirical findings leading to ViSRA are:

Cognitive maps as auxiliary inputs do not confer reliable spatial reasoning capacity to MLLMs for a variety of spatial tasks.
Post-trained models overfit to benchmark distributions and show negligible gains – or even performance degradation – on OOD spatial queries.

The ViSRA Framework: Modular Tools and Multi-Role Agentic Design

ViSRA consists of two principal components:

A suite of interoperable, explicit spatial tools that encapsulate strong domain expert perception models for extracting 2D/3D object presence, tracking, scene geometry, and semantic priors.
A multi-role agentic runtime which instantiates four specialized agent roles – Planner, Reflector, Executor, Summarizer – coordinating tool invocations to decompose and solve spatial reasoning tasks.
Figure 2: Overview of ViSRA, showing the spatial tools (left) and the multi-role agent architecture with planning, execution, reflection, and summarization (right).

Spatial Tools

2D Object Detection: Per-frame detection of query-relevant objects using advanced detectors, informed by MLLM-driven frame filtering.
Cross-Frame Object Tracking: Propagation and association of detected objects across time for instance-level recognition.
3D Object Detection: Lifting of 2D detections to 3D via foundational geometric models (e.g., VGGT), with constrained clustering to prevent instance-mixing.
Scene Modeling: Estimation of canonical scene geometry (e.g., ground plane) for metric spatial tasks and the construction of real-world-aligned coordinates.
Knowledge Retrieval: Contextual access to a structured knowledge base of object/room priors for semantically ambiguous tasks.
Video/Image Query: Direct delegation of questions to the MLLM for unstructured or hard-to-specify tasks.
Utility Functions: Downstream operations (e.g., distance, direction, height, obstruction computation) leveraging extracted representations.

Multi-Role Agent

Planner: Parses the question and tool schemas to synthesize an explicit tool-use plan.
Reflector: Iteratively verifies sufficiency of collected evidence and plans additional tool calls.
Executor: Executes individual tool invocations and produces interpretable, step-wise outputs.
Summarizer: Integrates all intermediate results to generate the final answer, explicitly grounded in accumulated evidence rather than latent model knowledge.
Figure 3: ViSRA answering a relative-distance question by orchestrating the four agent roles and multiple spatial tools on a concrete video-query instance.

Experimental Evaluation

ViSRA is evaluated as a drop-in agentic enhancement across multiple open-source MLLMs (e.g., Qwen2.5-VL, InternVL3, LLaVA-OneVision) of varying sizes and compared with state-of-the-art proprietary and post-trained models. Evaluations span:

VSI-Bench: Established benchmark for spatial intelligence in egocentric videos.
VSI-Bench-Extra: OOD benchmarks with new spatial question types to assess generalization.
Other suites: Cross-viewpoint, online spatio-temporal, and comprehensive video-based spatial intelligence benchmarks (ViewSpatial-Bench, OST-Bench, MMSI-Video-Bench).

Quantitative Results and Highlights

On VSI-Bench, ViSRA-augmented models realize absolute accuracy improvements up to 15.6% over their base counterparts, yielding performance competitive with much larger, post-trained or proprietary systems.
On VSI-Bench-Extra, ViSRA outperforms post-trained baselines by up to 28.9% on OOD spatial queries, demonstrating superior generalization beyond training distributions.
Substantial accuracy gains are observed in spatial categories highly dependent on explicit 3D reasoning (relative direction, planning, appearance order).
Ablation studies show tool choice and execution protocol (frame sampling, detector/tracker selection, clustering algorithm) significantly affect final accuracy, with transferability to improved perception models.
Figure 4: Correct object-counting results generated by ViSRA, invoking both detection and tracking tools.

Figure 5: Accurate appearance-order reasoning achieved by ViSRA leveraging tool-driven evidence aggregation.

Figure 6: Successful relative-direction reasoning by ViSRA, using spatial tools to extract human-aligned spatial cues.

Figure 7: A failure case from ViSRA due to upstream perception errors (e.g., category confusion in detection).

Interpretability and Modularity

A critical strength of ViSRA is its modular architecture and explicit evidence trail. Each decision step, intermediate output, and final answer is interpretable, facilitating fault diagnosis and extension. Improvements in vision models (e.g., better detectors/trackers or geometric solvers) are inherited without retraining, making ViSRA both upgradable and generalizable.

Ambiguities inherent to annotation (referent ambiguity, visual occlusion) are visualized and accounted for in the qualitative analysis, illustrating both progress and residual source of error in spatial QA.

Figure 8: Cognitive map visualizations utilized for evaluation, drawn from 3D ground truth annotations.

Figure 9: Examples from VSI-Bench indicating ambiguity in human reference expressions and limitations of benchmark answer quality.

Implications and Directions for Future Research

ViSRA demonstrates that domain-structured, inference-time orchestration of expert models within an agentic MLLM framework is a robust alternative to dataset-centric post-training approaches. The approach is plug-and-play, supporting seamless integration of future advances in low-level visual perception or spatial modeling tools.

Practically, ViSRA's tool-based methodology encourages robust, evidence-grounded spatial reasoning applicable to mobile, real-world deployments, robotics, or low-resource devices. Theoretically, it offers a path towards more human-like 3D spatial intelligence in multimodal AI, shifting the research focus from optimizing for narrow benchmarks to optimizing agentic reasoning strategies and tool augmentation.

Future investigations may focus on expanding the toolset (including more sophisticated geometry, affordance, or physics reasoning), optimizing agentic decision latency, and formalizing agent self-improvement protocols as new tools become available.

Conclusion

ViSRA advances the study of 3D video-based spatial reasoning in MLLMs by proposing a modular, extensible agentic framework that operates entirely at inference time, leveraging explicit spatial tools without additional training. Empirical evaluation demonstrates that ViSRA systematically enhances both established and OOD spatial reasoning tasks, offering a clear path to generalizable and upgradeable MLLM spatial intelligence, and establishing a strong incentive to further explore training-free, tool-augmented paradigms in multimodal video reasoning.

Markdown Report Issue