VSI-Super-Recall (VSR): Spatial Recall Benchmark
- VSI-Super-Recall (VSR) is a visual benchmark designed to evaluate models' spatial recall by testing their ability to recover the order in which a target object appears at different locations across long video streams.
- The benchmark uses a multiple-choice recall task with four predefined insertion events per video, emphasizing sustained memory, spatial perception, and sequential reasoning.
- Studies show that retrieval-based baselines can nearly solve VSR through shortcut exploitation, highlighting the need for more robust designs that enforce genuine world-modeling.
VSI-Super-Recall (VSR) is a long-horizon visual benchmark designed to assess models' capacities for spatial recall by evaluating their ability to recover the temporal order of specific objects appearing in distinct locations across extended video streams. VSR forms part of the broader VSI-SUPER benchmark suite, which targets the advancement of spatial supersensing—challenging video models far beyond brute-force context expansion and short-range event recognition by emphasizing sustained memory, spatial perception, and world-modeling across arbitrarily long video inputs (Yang et al., 6 Nov 2025, Udandarao et al., 20 Nov 2025). However, subsequent critical analyses have revealed significant limitations in the current VSR benchmark, demonstrating that it can be nearly perfectly solved by approaches with no explicit spatial cognition or world-modeling (Udandarao et al., 20 Nov 2025). The following sections provide an in-depth synthesis of the foundational design, evaluation methodology, performance characteristics, shortcut exploitation, and recommendations for future VSR iterations.
1. Formal Task Definition and Evaluation
VSR is structured as a multiple-choice recall task over long, downsampled video streams. Each video consists of concatenated "room-tour" clips (e.g., from ScanNet, ADT, ARKitScenes), with four “surprise” objects—such as a Teddy Bear or Hello Kitty—inserted via in-frame human editing at four unknown times and spatial locations.
For a video instance $V$ of $T$ frames (sampled at 1 FPS), the object-of-interest $o$ is placed at four locations identified by distinct auxiliary context labels $c_1, \dots, c_4$ (e.g., “stove,” “bathtub,” “counter,” “trash bin”). At test time, the model is supplied with the full video, the relevant text query, and four candidate permutations $\{\pi_1, \dots, \pi_4\}$ describing possible orders in which the object appeared at those locations.
The model outputs a predicted permutation $\hat{\pi}$ according to a scoring function, and is considered correct if $\hat{\pi} = \pi^{*}$, the ground-truth order:

$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[\hat{\pi}_i = \pi^{*}_i\right],$$

where $N$ is the number of test videos.
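A minimal sketch of this exact-match scoring, with illustrative function and variable names rather than the released evaluation code:

```python
from typing import List, Tuple

Permutation = Tuple[str, str, str, str]  # four auxiliary contexts in appearance order

def vsr_accuracy(predicted: List[Permutation], ground_truth: List[Permutation]) -> float:
    """Exact-match accuracy over N test videos: a prediction scores 1
    only if it matches the ground-truth ordering exactly."""
    assert len(predicted) == len(ground_truth)
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)

# Example: 3 test videos, 2 exactly correct -> accuracy 0.667
preds = [("stove", "counter", "bathtub", "trash bin"),
         ("bathtub", "stove", "counter", "trash bin"),
         ("stove", "bathtub", "counter", "trash bin")]
gts   = [("stove", "counter", "bathtub", "trash bin"),
         ("stove", "bathtub", "counter", "trash bin"),
         ("stove", "bathtub", "counter", "trash bin")]
print(vsr_accuracy(preds, gts))  # 0.666...
```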
2. Dataset Construction and Statistics
The VSR dataset comprises five splits, corresponding to video durations of approximately 10, 30, 60, 120, and 240 minutes. Each video contains precisely four event frames in which the same object-of-interest appears in four distinct contexts, yielding one object and four unique auxiliary objects or scenes per clip. All videos are constructed from diverse indoor source material, with high scene and layout variability, but each consistently follows the four-event structure.
Frames are sampled at a uniform rate (1 FPS), ensuring that none of the inserted events are skipped during evaluation. The test set contains 60 videos per duration, for a total of 300 VSR test instances. Detailed counts of unique room layouts and object classes were not made public in the original release (Yang et al., 6 Nov 2025, Udandarao et al., 20 Nov 2025).
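For concreteness, a hypothetical record layout for one test instance is sketched below; the actual release uses in-frame human editing and its own metadata format, and all names here are assumptions:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SurpriseEvent:
    frame_idx: int   # at 1 FPS, the frame index equals seconds into the stream
    context: str     # auxiliary context label, e.g. "stove" or "bathtub"

@dataclass
class VSRInstance:
    clip_sources: List[str]       # concatenated room-tour clips (ScanNet, ADT, ARKitScenes)
    duration_minutes: int         # split: 10, 30, 60, 120, or 240
    target_object: str            # e.g. "teddy bear"
    events: List[SurpriseEvent]   # exactly four insertions, in temporal order
```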
3. Evaluation Protocol and Performance Metrics
Each VSR evaluation sample provides the video $V$, the query $q$ together with the auxiliary context labels, and four candidate orderings $\{\pi_1, \dots, \pi_4\}$. Ground-truth correctness requires the predicted permutation $\hat{\pi}$ to exactly match the true order $\pi^{*}$.
Performance is measured as the proportion of test videos for which the correct sequence is selected. The four-way multiple-choice format implies a chance performance of 25%. Table 1 summarizes canonical performance of leading VSR approaches:
| Split | Cambrian-S SOTA (%) | NoSense (best) (%) |
|---|---|---|
| 10 min | 45.0 | 98.3 |
| 30 min | 41.7 | 98.3 |
| 60 min | 40.0 | 96.7 |
| 120 min | 36.5 | 95.0 |
| 240 min | 34.2 | 94.8 |
Cambrian-S achieves modest accuracy (max. 45.0% at 10 minutes, degrading to 34.2% at 240 minutes), while the NoSense baseline—using only frame-level semantic matching—achieves near-perfect recall across all durations (Udandarao et al., 20 Nov 2025).
4. Model Architectures and Inference Paradigms
The Cambrian-S baseline employs a streaming LLM (S-7B) architecture, combined with a predictive memory system based on latent frame prediction (LFP) and surprise-driven memory compression. The LFP memory module detects prediction error (“surprise”) between subsequent frames and leverages this signal to drive compression and event segmentation, aiming to retain salient event information over arbitrarily long videos. This method sustains VSR accuracy (≈40%) even as video length increases to four hours, whereas standard LLMs and non-memory variants degrade to chance (Yang et al., 6 Nov 2025).
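A minimal sketch of surprise-gated memory in this spirit, assuming per-frame feature vectors and a learned next-frame predictor; the names, threshold, and mean-pooling compression are illustrative assumptions, not the released Cambrian-S implementation:

```python
import torch

def surprise_driven_memory(frame_feats, predictor, surprise_threshold=0.5, max_memory=1024):
    """Keep a frame's features only when the prediction error ("surprise")
    relative to the predicted next frame is high; compress low-surprise
    stretches into their mean so memory stays bounded on long videos."""
    memory, buffer, prev = [], [], None
    for feat in frame_feats:                      # feat: [D] feature tensor
        if prev is None:
            surprise = float("inf")               # always store the first frame
        else:
            pred = predictor(prev)                # predicted latent for the current frame
            surprise = torch.nn.functional.mse_loss(pred, feat).item()
        if surprise > surprise_threshold:
            if buffer:                            # compress the preceding low-surprise run
                memory.append(torch.stack(buffer).mean(dim=0))
                buffer = []
            memory.append(feat)                   # event boundary: store verbatim
        else:
            buffer.append(feat)
        prev = feat
        if len(memory) > max_memory:              # crude cap for arbitrarily long streams
            memory = memory[-max_memory:]
    if buffer:
        memory.append(torch.stack(buffer).mean(dim=0))
    return memory
```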
NoSense, developed as a critical baseline, adopts a fundamentally different inference strategy: it is streaming and atemporal, discards all explicit temporal or spatial reasoning, and utilizes a contrastive vision-language model (SigLIP2). Key operations include processing each frame independently, tracking only the top-4 frames with highest cosine similarity to the object-of-interest, and matching these four frames to auxiliary contexts via prompt-based similarity. This recipe uses no world-modeling, spatial tracking, or temporal smoothing, and performs exhaustive retrieval rather than reasoning (Udandarao et al., 20 Nov 2025).
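A minimal sketch of this retrieval recipe, assuming precomputed, unit-normalized SigLIP2 image embeddings for each frame and text embeddings for the object and context prompts; the greedy context assignment and all names are illustrative assumptions, not the released NoSense code:

```python
import numpy as np

def nosense_recall(frame_embeds, object_embed, context_embeds, context_labels):
    """Streaming, atemporal recall: (1) keep the 4 frames most similar to the
    object-of-interest, (2) assign each kept frame to its most similar
    auxiliary context, (3) read the predicted order off the frames' indices.

    frame_embeds:   [T, D] per-frame image embeddings (unit-normalized)
    object_embed:   [D] text embedding of the object-of-interest prompt
    context_embeds: [4, D] text embeddings of the auxiliary-context prompts
    """
    # Step 1: top-4 object-salient frames by cosine similarity
    obj_sims = frame_embeds @ object_embed            # [T]
    top4 = np.sort(np.argsort(obj_sims)[-4:])         # 4 best frames, kept in temporal order
    # Step 2: frame-vs-context similarity matrix; greedy best context per frame
    # (a one-to-one assignment, e.g. Hungarian matching, could be used instead)
    ctx_sims = frame_embeds[top4] @ context_embeds.T  # [4, 4]
    assignment = ctx_sims.argmax(axis=1)
    # Step 3: the predicted permutation is the contexts in the frames' temporal order
    return [context_labels[j] for j in assignment]
```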
5. Shortcut Exploitation and Benchmark Limitations
Analysis demonstrates that NoSense saturates VSR by exploiting structural shortcuts inherent in the benchmark:
- Event sparsity: Exactly four event frames per video permit retrieval-based methods to select the “top-4” most object-salient frames using cosine similarity, with high probability of matching all target events.
- Fixed insertion count: The predictable count of target events and their isolation from distractors means that selection heuristics never require temporal integration or spatial memory.
- Semantic over temporal discrimination: Distinctiveness of auxiliary contexts allows a matrix of image-text similarities to resolve the temporal ordering, obviating the need for tracking object persistence.
- No need for 3D or world modeling: Absent revisit or ambiguity, models never face conditions demanding scene integration or spatial reasoning.
NoSense thus demonstrates that current VSR design predominantly evaluates semantic retrieval rather than long-horizon spatial recall, fundamentally undermining its intended purpose as a supersensing benchmark (Udandarao et al., 20 Nov 2025).
6. Recommendations for Robust Spatial Supersensing Evaluation
To address these vulnerabilities and enforce genuine spatial supersensing, researchers propose several modifications:
- Variable event counts: Randomizing the number of object insertions (2–6 per video) prevents memorization of fixed retrieval heuristics.
- Event repetition: Introducing repeated events or revisitation enforces the need for identity tracking and long-term object permanence.
- Shuffled segments and playback perturbation: Permuting video segments and altering playback speeds breaks spurious statistical correlations between frame index and event timing.
- Natural continuous data: Moving beyond stitched room tours to continuous egocentric or robotics data (with loops and genuine scene revisits) introduces real-world complexity demanding true world-modeling.
Integrating these measures into future iterations of VSR would compel models to maintain persistent scene representations and support long-horizon integration, rather than exploiting surface-level retrieval cues (Udandarao et al., 20 Nov 2025).
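An illustrative sketch of how such perturbations might be generated when assembling a harder VSR-style instance; the structure, helper names, and parameters are assumptions, not an official benchmark release:

```python
import random

def perturb_vsr_instance(segments, contexts, min_events=2, max_events=6):
    """Apply the proposed hardening measures to one instance: variable event
    counts, repeated contexts (revisits), and segment shuffling that breaks
    frame-index/event-time correlations.

    segments: list of room-tour video segments composing one long stream
    contexts: pool of auxiliary context labels, e.g. ["stove", "bathtub", ...]
    """
    # 1. Variable event count, with possible repetition of the same context
    n_events = random.randint(min_events, max_events)
    event_contexts = [random.choice(contexts) for _ in range(n_events)]
    # 2. Pick distinct segments to host the insertions
    host_segments = random.sample(range(len(segments)), n_events)
    # 3. Shuffle the segments so event timing cannot be inferred from source order
    order = list(range(len(segments)))
    random.shuffle(order)
    shuffled = [segments[i] for i in order]
    # 4. Ground-truth order must be recomputed from post-shuffle positions
    events = sorted(zip(host_segments, event_contexts), key=lambda e: order.index(e[0]))
    ground_truth = [ctx for _, ctx in events]
    return shuffled, ground_truth
```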
7. Broader Impact and Future Directions
VSR has catalyzed renewed focus on the problem of spatial supersensing and motivated new architectural paradigms emphasizing predictive sensing, latent memory, and event-driven segmentation. The Cambrian-S and predictive-memory approaches achieved state-of-the-art performance using surprise-guided storage, but also illuminated the ease with which current benchmarks can be solved at the semantic, rather than spatial, level.
A plausible implication is that progress in spatial supersensing will depend less on scaling dataset or model size, and more on rigorous benchmark construction—with structural invariances and perturbations—that can robustly differentiate semantic retrieval from genuine world-modeling. Ongoing research seeks to develop stress tests ensuring that future VSR-style tasks require models not only to "see" but to anticipate, select, and dynamically integrate spatiotemporal experience (Yang et al., 6 Nov 2025, Udandarao et al., 20 Nov 2025).