
VSI-Super-Recall: Spatial Memory in Video AI

Updated 24 November 2025
  • VSI-Super-Recall (VSR) is a benchmark evaluating long-horizon spatial recall by testing models on encoding, sequencing, and retrieving object events in extended videos.
  • The methodology uses concatenated room-tour clips with inserted surprise objects to challenge models with photorealistic event insertion and multi-hop retrieval.
  • Baseline results show that simple semantic retrieval achieves near-perfect accuracy, highlighting the benchmark's limitations in assessing genuine spatial cognition and world modeling.

VSI-Super-Recall (VSR) is a benchmark task introduced to evaluate long-horizon spatial recall capabilities in video world models. Developed as part of the VSI-SUPER suite, VSR aims to test not just basic semantic perception, but the ability of models to selectively encode, organize, and retrieve complex visual-spatial experiences over arbitrarily long video streams. Despite these goals, recent analyses indicate that VSR, as currently formulated, can be near-perfectly solved through simple semantic retrieval rather than genuine spatial cognition or predictive world modeling (Yang et al., 6 Nov 2025, Udandarao et al., 20 Nov 2025).

1. Task Definition and Evaluation Protocol

VSR assesses a model’s ability to recover the precise temporal order in which a single object appears at four distinct spatial locations within a long, continuous video. Each video $\mathcal{V} = \{x_t\}_{t=1}^{T}$, downsampled to 1 FPS, is constructed by concatenating indoor "room-tour" clips with four “surprise” object placements at unknown times and locations. Human annotators perform in-frame editing to insert visually incongruent objects (e.g., Teddy Bear) into four unique frames. The model is given the query “Which of the following correctly represents the order in which the <object> appeared in the video?” along with four multiple-choice options encoding different location sequences.

Formally, for each video instance $i$, the sequence of ground-truth spatial coordinates is $S^{(i)} = (s_1^{(i)}, s_2^{(i)}, s_3^{(i)}, s_4^{(i)})$. The model outputs a predicted sequence $\hat{S}^{(i)}$, and accuracy is defined as:

$$\mathrm{Acc}_{\mathrm{VSR}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[\hat{S}^{(i)} = S^{(i)}\right]$$

with $N$ test instances. In practice, this reduces to reporting the percentage of correctly chosen multiple-choice options (Yang et al., 6 Nov 2025, Udandarao et al., 20 Nov 2025).
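
For concreteness, a minimal sketch of this scoring rule in Python; the data layout (sequences of four location labels per instance) is illustrative and not the released evaluation harness:

```python
from typing import Sequence

def vsr_accuracy(predicted: Sequence[Sequence[str]],
                 ground_truth: Sequence[Sequence[str]]) -> float:
    """Fraction of instances whose full predicted location order matches exactly.

    Each element is a sequence of four location labels; an instance scores 1
    only if all four positions agree, and partial orderings score 0.
    """
    assert len(predicted) == len(ground_truth)
    correct = sum(
        1 for pred, gold in zip(predicted, ground_truth)
        if tuple(pred) == tuple(gold)
    )
    return correct / len(ground_truth)

# Example: one exact match out of two instances -> 0.5 accuracy.
gold = [("A", "B", "C", "D"), ("B", "A", "D", "C")]
pred = [("A", "B", "C", "D"), ("B", "A", "C", "D")]
print(vsr_accuracy(pred, gold))  # 0.5
```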

2. Dataset Construction and Structure

The VSR dataset comprises five splits by temporal duration: 10, 30, 60, 120, and 240 minutes, each with 60 videos formed by concatenating room-tour clips from sources such as ScanNet, ADT, and ARKitScenes. For each video, annotators insert the same object into four distinct, photorealistic frames, documenting spatial context and presentation order. These long-form streams are created by temporally concatenating edited and unedited clips and downsampling to 1 FPS while keeping all inserted events visible. The design yields 300 test instances covering a diversity of unique indoor scenes and spatial contexts, although detailed diversity counts are not tabulated in the baseline release (Yang et al., 6 Nov 2025, Udandarao et al., 20 Nov 2025).
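
The assembly logic can be summarized schematically; the helper below is a hypothetical sketch, not the actual (largely manual) annotation pipeline:

```python
import random

SPLITS_MIN = [10, 30, 60, 120, 240]   # temporal-duration splits
VIDEOS_PER_SPLIT = 60                 # 5 x 60 = 300 test instances
EVENTS_PER_VIDEO = 4                  # surprise-object insertions per video
FPS = 1                               # frames sampled per second

def build_instance(clips, surprise_object, target_minutes):
    """Concatenate room-tour clips (already downsampled to 1 FPS) until the
    target duration is reached, then mark four frames for object insertion."""
    frames = []
    while len(frames) < target_minutes * 60 * FPS:
        frames.extend(random.choice(clips))          # each clip: list of frames
    frames = frames[: target_minutes * 60 * FPS]

    # Choose four distinct frames; annotators would edit the object into these.
    event_indices = sorted(random.sample(range(len(frames)), EVENTS_PER_VIDEO))
    return {
        "frames": frames,
        "object": surprise_object,
        "event_frame_indices": event_indices,  # ground-truth temporal order
    }
```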

3. Core Methodological Principles

VSR is intended to move beyond context-window-based sequence modeling through several factors:

  • Unbounded horizon: Videos up to 4 hours in length preclude brute-force attention, necessitating selective memory formation.
  • Photorealistic event insertion: Spatial edits maintain perceptual realism, requiring semantic perception to locate and recognize inserted objects.
  • Multi-hop retrieval: Recall demands ordering four separate placements, each reliant on memory of all prior events.
  • Uniform generalization: The task and question format remain invariant across durations, placing demands on scalability without retraining (Yang et al., 6 Nov 2025).

The benchmark is positioned to resist the “needle-in-a-haystack” shortcuts familiar from language benchmarks by embedding events seamlessly into visual data. However, as detailed in subsequent analyses, these preventive measures prove insufficient in the current instantiation (Udandarao et al., 20 Nov 2025).

4. Baselines, Approaches, and Quantitative Results

Early results from Yang et al. revealed that state-of-the-art long-context models such as Gemini-2.5-Flash (1,048,576-token context) achieve up to 41.5% accuracy on the 60-minute VSR split but cannot process videos beyond roughly one hour. The Cambrian-S 7B model, finetuned on VSI-590K, shows rapid performance decay with increasing video length when using naive memory: 38.3% (10 min) → 6.0% (60 min) → 0.0% (>60 min) (Yang et al., 6 Nov 2025). By introducing a surprise-driven memory system based on latent frame prediction (LFP) error, Cambrian-S maintains stable performance (approximately 40%) up to 4 hours, outperforming proprietary baselines and ablations that use adjacent-frame similarity for memory consolidation.
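
To illustrate the surprise-driven memory idea, the sketch below consolidates only frames whose latent-frame-prediction error is high; the predictor interface and threshold are placeholders rather than the Cambrian-S implementation:

```python
import numpy as np

def consolidate_by_surprise(frame_embeddings, predictor, threshold=0.5):
    """Keep only frames whose latent-frame-prediction error is high ("surprising").

    frame_embeddings: (T, D) array of per-frame features at 1 FPS.
    predictor: callable mapping the previous embedding to a predicted next one.
    Returns indices of frames written to long-term memory.
    """
    memory = []
    for t in range(1, len(frame_embeddings)):
        predicted = predictor(frame_embeddings[t - 1])
        error = np.linalg.norm(frame_embeddings[t] - predicted)
        if error > threshold:          # high surprise -> consolidate this frame
            memory.append(t)
    return memory

# Toy predictor: assume the scene changes slowly, so predict "no change".
identity_predictor = lambda prev: prev
```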

A critical development is the introduction of the NoSense baseline (Udandarao et al., 20 Nov 2025). NoSense is a purely streaming, atemporal system built on a contrastive vision-language model (SigLIP2). The pipeline, sketched in code after this list, proceeds as follows:

  • Processes each frame independently at 1 FPS;
  • Stores only the top-4 frames with highest cosine similarity to the object prompt embedding;
  • Constructs a 4×4 similarity matrix between these frames and auxiliary text embeddings;
  • Scores each permutation and returns the order with the highest aggregate similarity.
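
A minimal sketch of this pipeline using a generic contrastive image-text encoder is shown below; the function and variable names are illustrative, and the released NoSense code may be organized differently:

```python
import heapq
import itertools
import numpy as np

def nosense_order(frame_embs, object_emb, location_embs):
    """frame_embs: iterable of (frame_index, unit-norm image embedding) at 1 FPS.
    object_emb: unit-norm text embedding of the queried object.
    location_embs: dict mapping location label -> unit-norm text embedding
                   (one per candidate location, four in VSR)."""
    # 1) Stream frames, keeping only the top-4 most object-like ones.
    top4 = heapq.nlargest(4, frame_embs, key=lambda fe: float(fe[1] @ object_emb))
    top4.sort(key=lambda fe: fe[0])               # restore temporal order

    # 2) 4x4 similarity matrix between kept frames and location texts.
    labels = list(location_embs)
    sim = np.array([[float(emb @ location_embs[l]) for l in labels]
                    for _, emb in top4])

    # 3) Score every assignment of locations to the four frames; return the best.
    best = max(itertools.permutations(range(len(labels)), 4),
               key=lambda perm: sum(sim[i, j] for i, j in enumerate(perm)))
    return [labels[j] for j in best]
```

Because every video contains exactly four inserted events and the candidate locations are visually distinctive, this purely per-frame semantic matching suffices to recover the correct order.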

NoSense solves VSR across all durations with near-perfect accuracy:

Split      Cambrian-S (SoTA)    NoSense (best)
10 min     45.0%                98.3%
30 min     41.7%                98.3%
60 min     40.0%                96.7%
120 min    36.5%                95.0%
240 min    34.2%                94.8%

This performance is robust under prompt-ensembling ablations and exceeds Cambrian-S by more than 50 points on every split.

5. Shortcut Exploitation and Benchmark Limitations

NoSense demonstrates that VSR, as currently formulated, is highly susceptible to shortcut exploitation and does not evaluate the stated core of spatial supersensing. Key weaknesses are:

  • Event sparsity: Exactly four “needle-in-haystack” object placements per video enable trivial top-4 retrieval by similarity.
  • Fixed insertion count: With no distractor objects and no variation in the number of events, top-4 retrieval is guaranteed to cover every target frame.
  • Semantic rather than temporal matching: Visually distinctive auxiliary contexts allow the order to be recovered by similarity matching alone, removing the need for temporal or spatial reasoning.
  • Shared structure with intended memory pipelines: Both Cambrian-S and NoSense employ frame-level encoding and compact memory over salient frames.

These characteristics imply that spatial supersensing or world modeling is never required: NoSense does not track objects, perform temporal smoothing, or infer scene structure, yet saturates the benchmark (Udandarao et al., 20 Nov 2025). The task, therefore, effectively becomes a semantic retrieval test rather than an assessment of long-horizon spatial cognition.

6. Recommendations for Robust Spatial Supersensing Evaluation

To enforce the intended challenge, researchers recommend the incorporation of several invariance and perturbation checks into VSR and related benchmarks:

  • Variable event counts: Randomizing the number of object insertions between 2 and 6 disrupts fixed-size retrieval.
  • Event repetition: Revisiting the same spatial context within a video challenges object identity tracking over time.
  • Segment shuffling: Permuting video segments prevents inference of order from global temporal statistics.
  • Playback speed changes: Altering playback rates obfuscates correlations between frame index and event timing.
  • Natural long-form video: Moving beyond stitched clips to continuous egocentric or robotic streams (with natural revisits and loops) increases ecological validity and requires scene-level integration (Udandarao et al., 20 Nov 2025).

Adopting these perturbations would disallow shortcut-based solutions and compel models to construct persistent object–scene representations, memory traces, and true predictive world models.
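
As a concrete illustration, several of these checks can be expressed as parameters applied when regenerating test instances; the configuration and helper below are hypothetical and not part of any released tooling:

```python
import random
from dataclasses import dataclass

@dataclass
class PerturbationConfig:
    min_events: int = 2          # variable event counts (2-6) break fixed top-4 retrieval
    max_events: int = 6
    allow_repeats: bool = True   # revisit the same spatial context within a video
    shuffle_segments: bool = True
    speed_factors: tuple = (0.5, 1.0, 2.0)   # vary playback rate

def perturb_instance(segments, locations, cfg: PerturbationConfig, rng=random):
    """Re-sample events and segment order for one test instance (a sketch).

    Assumes enough distinct locations are available when repeats are disabled."""
    n_events = rng.randint(cfg.min_events, cfg.max_events)
    picker = rng.choices if cfg.allow_repeats else rng.sample
    event_locations = picker(locations, k=n_events)
    if cfg.shuffle_segments:
        segments = rng.sample(segments, k=len(segments))  # permute segment order
    speed = rng.choice(cfg.speed_factors)
    return {"segments": segments, "events": event_locations, "speed": speed}
```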

7. Context in Spatial Supersensing Research

VSR and the broader VSI-SUPER benchmark arise in the context of seeking comprehensive metrics for spatial cognition, event organization, and predictive modeling in video-based artificial intelligence systems (Yang et al., 6 Nov 2025). While the current VSR embodiment is instructive in revealing both the limitations of brute-force context expansion and the susceptibility of naive designs to semantic shortcutting, critical analysis underscores the ongoing need for robust, invariance-aware benchmarks that can drive progress in predictive sensing and world modeling. Future iterations will need to address these challenges to genuinely advance multimodal intelligence and spatial supersensing capabilities.
