Can Vision-Language Models Solve the Shell Game?

Published 9 Mar 2026 in cs.CV and cs.CL | (2603.08436v1)

Abstract: Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at https://vetbench.github.io.


Authors (2)

Summary

The paper demonstrates that VLMs perform near random chance on entity tracking tasks when visual shortcuts are removed via VET-Bench.
It shows that tracking k ≥ 5 indistinguishable objects is NC¹-complete, a theoretical barrier for fixed-depth transformers that impedes robust spatiotemporal reasoning.
The proposed SGCoT paradigm, which uses explicit chain-of-thought reasoning, achieves over 90% accuracy, highlighting its potential for future VLMs.

Spatiotemporal Bottlenecks in Vision-Language Models: An Analysis via the Shell Game

Introduction and Motivation

Human cognition is marked by robust visual entity tracking: the ability to maintain correspondence between indistinguishable objects across occlusion and motion. In contrast, contemporary Vision-Language Models (VLMs), despite impressive advances in general video understanding, display fundamental deficits in this fine-grained spatiotemporal perception. The paper "Can Vision-Language Models Solve the Shell Game?" (2603.08436) systematically dissects these deficits using a synthetic diagnostic benchmark, theoretical analysis, and a chain-of-thought mitigation strategy.

The authors argue that, in current video benchmarks, the apparent competence of VLMs at tracking tasks is confounded by visual shortcuts—static frame-level cues or distinctive object appearances—which enable models to bypass genuine reasoning over temporal dynamics. When such shortcuts are obviated, VLMs' entity-tracking abilities are revealed as severely limited.

The VET-Bench Diagnostic: Eliminating Frame-Level Shortcuts

To rigorously evaluate visual entity tracking, the authors introduce VET-Bench, a synthetic benchmark explicitly designed to remove appearance-based cues. Objects are rendered physically and visually identical; task success demands tracking based solely on spatiotemporal continuity through video frames.

Figure 1: Overview of VET-Bench.

Key aspects of VET-Bench's design include:

Fully synthetic data generation (similar in spirit to CLEVR/CATER pipelines), enabling precise control over object count, swap count, textures, and camera viewpoints, eliminating dataset bias and enabling unlimited task variation.
Two canonical shell-game settings: "cups game" (track a hidden object under identical cups) and "cards game" (track a face-down card through shuffles and flips).
Strict continuity constraints, ensuring that per-frame object displacement is small enough to guarantee unambiguous tracking, free of occlusion aliasing.
No single frame—nor any swap operation annotation—provides disambiguating information; entities are indistinguishable except via their temporal trajectories.
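
The design constraints listed above can be illustrated with a minimal episode generator. This is a sketch under stated assumptions, not the authors' pipeline: `make_episode`, `frames_per_swap`, and their parameters are hypothetical names, and only the swap sequence is modeled, whereas the actual benchmark renders video.

```python
import math
import random


def make_episode(num_objects=3, num_swaps=5, seed=0):
    """Sample a shell-game episode and compute its ground-truth answer.

    Hypothetical helper: only the swap sequence is modeled, not rendering.
    """
    rng = random.Random(seed)
    start = rng.randrange(num_objects)  # slot hiding the target at t=0
    # Each swap exchanges the contents of two distinct slots.
    swaps = [tuple(rng.sample(range(num_objects), 2)) for _ in range(num_swaps)]
    position = start
    for a, b in swaps:
        if position == a:
            position = b
        elif position == b:
            position = a
    return {"start": start, "swaps": swaps, "answer": position}


def frames_per_swap(path_length, max_step):
    """Minimum number of per-frame steps so no step exceeds max_step."""
    return math.ceil(path_length / max_step)


episode = make_episode(num_objects=3, num_swaps=5, seed=0)
print(episode)
print(frames_per_swap(path_length=100, max_step=8))  # 13
```

Because the objects are identical, the `answer` field cannot be read off any single frame; it is determined only by replaying the full swap sequence.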

Empirical Evaluation: VLMs Collapse to Random-Guessing

A comprehensive suite of SOTA proprietary and open VLMs (e.g., Gemini-3-Pro, Qwen3-VL, GLM-4.6V-Flash, PerceptionLM, Molmo2) was evaluated on VET-Bench. The results are consistent: all models perform at or near random chance on tasks requiring actual entity tracking, regardless of model scale, video sampling rate, or reasoning prompt configuration.

Figure 2: All evaluated VLMs perform near random chance on VET-Bench, with only Molmo2-SGCoT (a fine-tuned variant using spatiotemporal grounded chain-of-thought) exceeding 90%.

Qualitative error analyses reveal three principal failure modes:

Direct Answer Guessing: Models bypass reasoning or chain-of-thought, submitting a final answer consistent with chance.
Coarse Semantic Description: Some models provide plausible-sounding event summaries but lack fine-grained entity correspondence, rendering predictions random.
Hallucinated Reasoning: Advanced VLMs, when prompted for reasoning, generate linguistically and logically coherent swap sequences that are visually incorrect, as evidenced by misidentified or phantom swaps.

Furthermore, performance drops sharply after just a single swap and does not recover as task complexity (swap or object count) increases.

Figure 3: Performance degrades as swap count increases, converging to random chance rapidly.
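
One way to build intuition for this convergence is a toy error model, which is an assumption introduced here and not taken from the paper: a tracker follows the target perfectly except that it independently fails to register each swap with probability `p_miss`. Under that assumption, accuracy after T uniformly random swaps among k objects decays geometrically toward the chance level 1/k. The sketch below compares the closed form for this toy model with a Monte Carlo simulation; all function names are hypothetical.

```python
import random


def expected_accuracy(k, p_miss, num_swaps):
    """Closed form for the toy model: 1/k + (1 - 1/k) * (1 - 2p/(k-1))^T."""
    return 1 / k + (1 - 1 / k) * (1 - 2 * p_miss / (k - 1)) ** num_swaps


def simulate_accuracy(k, p_miss, num_swaps, trials=20000, seed=0):
    """Monte Carlo estimate of accuracy for a tracker that misses swaps."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        true_pos = believed_pos = 0
        for _ in range(num_swaps):
            a, b = rng.sample(range(k), 2)
            swap = {a: b, b: a}
            true_pos = swap.get(true_pos, true_pos)
            if rng.random() >= p_miss:  # the tracker registered this swap
                believed_pos = swap.get(believed_pos, believed_pos)
        correct += true_pos == believed_pos
    return correct / trials


for num_swaps in (0, 1, 2, 5, 10):
    closed = expected_accuracy(3, 0.3, num_swaps)
    simulated = simulate_accuracy(3, 0.3, num_swaps)
    print(f"swaps={num_swaps:2d}  closed={closed:.3f}  simulated={simulated:.3f}")
```

The toy model only explains why per-swap perception errors compound; it makes no claim about the actual error rates of the evaluated VLMs.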

Benchmark Comparison: Revealing Shortcut Reliance

Analysis of popular "shell game" benchmark datasets (e.g., Perception Test, VideoReasonBench) demonstrates that prior evaluations substantially overestimate VLMs' entity-tracking capabilities due to shortcut exploits:

Appearance Cues: In Perception Test, transparent or distinctive cups/cards allow trivial frame-based solution strategies.
Swap Annotations: VideoReasonBench overlays explicit swap cues (e.g., arrows), collapsing tracking into a static token-matching task rather than dynamic reasoning.

Applying rigorous filtering to Perception Test (keeping only videos with identical, opaque cups and excluding non-shuffling cases) drops VLM accuracy markedly, to the chance baseline, mirroring the VET-Bench findings.

Figure 4: Example frames from videos involving distinct cups in the Perception Test. Visual shortcut cues allow appearance-based problem solving, bypassing entity tracking.

Figure 5: Example frames from videos involving transparent cups, similarly undermining the requirement for temporal reasoning.

Figure 6: Visual shortcuts in real-world data—frames explicitly reveal the answer without requiring temporal tracking.

Figure 7: VideoReasonBench provides frame-level swap cues (arrows, left), absent in VET-Bench (right), thus failing to isolate genuine spatiotemporal reasoning.

Theoretical Hardness: NC¹-Completeness and Transformer Limitations

A key theoretical contribution is the proof that visual entity tracking for $k \geq 5$ indistinguishable objects on a grid is NC¹-complete. This places the problem in a circuit complexity class that fixed-depth transformer architectures cannot solve over sequences of arbitrary length without intermediate computation, assuming the widely believed separation TC⁰ ≠ NC¹.

Key points of the analysis:

Tracking the permutation of $k$ objects after $T$ arbitrary transpositions (swaps) is equivalent to the word problem for the symmetric group $S_k$ (every permutation is a product of transpositions), which is known to be NC¹-complete for $k \geq 5$.
Bounded-depth transformers fall within TC⁰, the class of problems solvable by constant-depth, polynomial-size threshold (majority) circuits. Assuming TC⁰ ≠ NC¹, they are strictly less expressive and therefore cannot represent or learn such state-tracking tasks unless augmented by explicit intermediate-state supervision or chain-of-thought (CoT) reasoning [merrill2024illusion, li2024chain, huang2025transformers].
Empirical training of transformers (Qwen2.5-VL-3B-Instruct) with direct-answer supervision does not overcome this limitation—the loss quickly plateaus at random-guessing levels despite sufficient training epochs, paralleling classic failures to learn parity.
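
The connection to the word problem can be made concrete with a short sketch that treats each swap as a permutation and composes them. It is an illustration under assumptions, not code from the paper, and the function names are hypothetical. Sequential tracking uses one composition per swap, while a balanced tree exploits associativity to finish in roughly log2(T) levels. That logarithmic depth is the kind of depth an NC¹ computation is allowed, and it still grows with T, unlike the depth of a fixed-depth network.

```python
from functools import reduce


def transposition(k, a, b):
    """Permutation on k slots that exchanges slots a and b."""
    perm = list(range(k))
    perm[a], perm[b] = perm[b], perm[a]
    return tuple(perm)


def compose(p, q):
    """Apply p first, then q; perm[i] is where the object from slot i ends up."""
    return tuple(q[p[i]] for i in range(len(p)))


def track_sequential(k, swaps):
    """Fold the swaps left to right: T composition steps for T swaps."""
    identity = tuple(range(k))
    return reduce(compose, (transposition(k, a, b) for a, b in swaps), identity)


def track_tree(k, swaps):
    """Compose adjacent pairs level by level; returns (permutation, depth)."""
    layer = [transposition(k, a, b) for a, b in swaps] or [tuple(range(k))]
    depth = 0
    while len(layer) > 1:
        merged = [compose(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:
            merged.append(layer[-1])  # odd element carries over unchanged
        layer = merged
        depth += 1
    return layer[0], depth


swaps = [(0, 1), (1, 2), (0, 2)]
print(track_sequential(3, swaps))  # (0, 2, 1)
print(track_tree(3, swaps))        # ((0, 2, 1), 2)
```

In the example, the object starting in slot 1 moves to slot 0 and then to slot 2, so entry 1 of the result is 2.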

Spatiotemporal Grounded Chain-of-Thought (SGCoT): Overcoming Perceptual Bottlenecks

The authors propose and validate Spatiotemporal Grounded Chain-of-Thought (SGCoT), which requires the model to explicitly generate entity trajectories, as an ordered sequence of spatial coordinates indexed by timestamp, before giving the final answer.

SGCoT implementation involves:

Utilizing the native object-tracking capabilities of Molmo2, which outputs per-frame coordinates for object identities.
Fine-tuning Molmo2 (Molmo2-SGCoT) on synthetic text-only samples that scaffold object tracking as an explicit reasoning step, but supervising only the final answer via lightweight QLoRA parameter updates.
Prompting in the QA phase to elicit explicit spatial chain-of-thought, leveraging the generated trajectory to support or directly produce the answer.

This approach achieves state-of-the-art accuracy exceeding 90% on VET-Bench with rapid, data-efficient training (one epoch, 300 samples, single GPU, under 3 minutes), substantially outperforming all tested baselines.
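
To make the idea of a trajectory-as-reasoning trace concrete, the sketch below renders a list of timestamped coordinates as text and derives the final answer from the last point. The trace format, the slot coordinates, and all function names are assumptions chosen for illustration; they are not Molmo2's actual output format or the authors' training data format.

```python
def format_trace(trajectory):
    """Render (time_s, x, y) points as one text line per timestamp."""
    return "\n".join(f"t={t:.1f}s: ({x:.0f}, {y:.0f})" for t, x, y in trajectory)


def nearest_slot(point, slot_centers):
    """Index of the slot center closest to the given (x, y) point."""
    x, y = point
    return min(
        range(len(slot_centers)),
        key=lambda i: (slot_centers[i][0] - x) ** 2 + (slot_centers[i][1] - y) ** 2,
    )


def build_response(trajectory, slot_centers, slot_names):
    """Hypothetical SGCoT-style response: trajectory first, answer last."""
    _, x, y = trajectory[-1]
    answer = slot_names[nearest_slot((x, y), slot_centers)]
    return f"{format_trace(trajectory)}\nAnswer: {answer}"


# Illustrative pixel coordinates for three cup positions (assumed values).
slot_centers = [(100, 300), (320, 300), (540, 300)]
slot_names = ["left", "middle", "right"]
trajectory = [
    (0.0, 100, 300),
    (0.5, 210, 240),
    (1.0, 320, 300),
    (1.5, 430, 360),
    (2.0, 540, 300),
]
print(build_response(trajectory, slot_centers, slot_names))
```

Placing the trajectory before the answer means the answer is conditioned on explicit intermediate states instead of being produced in a single step.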

Analysis of the SGCoT Paradigm

SGCoT's success is linked to:

Fine-grained, temporally aligned state representation: Discrete trajectories capture all event transitions at sufficient resolution, avoiding the temporal misalignment and underspecification endemic to coarse or purely linguistic CoTs.
Robustness to sequence length and complexity: Explicit tracking scales with task difficulty, as the intermediate state at each timestamp provides a direct path to the answer.
Interpretability and Error Localization: Mistakes in SGCoT outputs manifest as detectable spatial "jumps" or inconsistencies, improving model transparency and failure diagnosis.
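
The error-localization point can be sketched as a simple consistency check over a generated trajectory: any step whose implied speed exceeds a plausible bound is flagged. The function name, the `(time_s, x, y)` tuple layout, and the speed threshold are assumptions made for this illustration, not part of the paper's method.

```python
import math


def find_jumps(trajectory, max_speed):
    """Return (t_start, t_end) intervals whose implied speed exceeds max_speed.

    trajectory: list of (time_s, x, y) tuples in chronological order.
    Non-increasing timestamps are flagged as inconsistencies too.
    """
    jumps = []
    for (t0, x0, y0), (t1, x1, y1) in zip(trajectory, trajectory[1:]):
        dt = t1 - t0
        distance = math.hypot(x1 - x0, y1 - y0)
        if dt <= 0 or distance / dt > max_speed:
            jumps.append((t0, t1))
    return jumps


trajectory = [(0, 0, 0), (1, 10, 0), (2, 200, 0), (3, 210, 0)]
print(find_jumps(trajectory, max_speed=50))  # [(1, 2)]
```

A flagged interval points to the moment where the predicted track likely switched to a different object, which narrows down where to inspect the video.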

Remaining failure cases for SGCoT are localized primarily to perception-stage errors (identity confusion among visually identical entities), especially under conditions of fast swaps or near-overlap, but are dramatically reduced compared to non-tracking approaches.

Practical and Theoretical Implications

Benchmark Recommendations: Diagnostic benchmarks must excise static frame-level shortcuts to truly evaluate VLMs' spatiotemporal grounding and reasoning.
Architecture/Training Design: Without explicit intermediate computation, transformer-based VLMs are fundamentally ill-equipped to solve rich spatiotemporal state-tracking tasks. Addressing this limitation requires architectural modifications (e.g., chained reasoning, explicit persistent memory) or richer supervision regimes (e.g., annotated CoT).
Future Directions: Extending SGCoT to real-world videos with occlusion, ambiguous referring expressions, localization noise, or physical interactions (e.g., agentic manipulation) is an important future direction; integration with world-modeling priors and causal reasoning remains a critical open problem.

Conclusion

The investigation presented in this paper provides a rigorous, multifaceted diagnosis of visual entity tracking bottlenecks in current VLMs. By developing VET-Bench and removing visual shortcuts, the authors expose the stark limitations of SOTA models, theoretically anchoring these deficits in expressivity gaps of transformer architectures. The proposed SGCoT paradigm effectively bridges this gap, demonstrating that with appropriate intermediate state reasoning, VLMs can approach human-level entity tracking in constrained synthetic domains. These findings carry significant implications for the design and evaluation of next-generation video-language systems, especially for applications in embodied AI and interactive agents requiring robust, compositional state tracking over high-dimensional perceptual streams.
