VET-Bench: Tracking in Vision-Language Models
- VET-Bench is a benchmark that evaluates vision-language models’ ability to track visually identical objects through strict spatiotemporal continuity in synthetic video tasks.
- It employs two shell-game tasks—Cups and Cards—to rigorously enforce motion constraints that prevent reliance on static, appearance-based cues.
- The introduction of Spatiotemporal Grounded Chain-of-Thought (SGCoT) demonstrates that explicit intermediate state tracking can substantially mitigate the limitations of fixed-depth transformer architectures.
VET-Bench is a term that refers to two distinct, high-impact benchmarks in the vision-language and multimodal modeling literature. Both are notable for probing integrated or temporal capabilities in models, but they diverge strongly in conceptual focus and evaluation protocol. The first VET-Bench, derived from “MM-Vet v2,” is a public benchmark for holistic vision-language evaluation with structured integration of multiple capabilities including image-text sequence reasoning. The second, “Visual Entity Tracking Benchmark” (VET-Bench) from “Can Vision-LLMs Solve the Shell Game?”, is a synthetic diagnostic tailored to expose the spatiotemporal object-tracking deficiencies of video-capable vision-LLMs. The following article focuses on the latter, as this benchmark is canonical in current literature for the abbreviation VET-Bench (Liu et al., 9 Mar 2026).
1. Motivation and Problem Definition
VET-Bench is designed to rigorously assess a vision-LLM’s (VLM’s) ability to perform visual entity tracking, specifically in scenarios where objects are visually indistinguishable and tracking is possible only through spatiotemporal continuity. Unlike prior video QA and perception benchmarks, which often admit visual shortcuts or static re-identification, VET-Bench enforces conditions that preclude any utility from frame-wise or appearance-based cues. The core objective is to diagnose and quantify the persistence gap—the inability of present-day large VLMs to maintain coherent object identity representations across time—despite human-level performance on analogous shell-game tasks (Liu et al., 9 Mar 2026).
2. Dataset Design and Construction
VET-Bench comprises two canonical “shell-game” tasks: the Cups Game (the classic ball-under-cups shuffle) and the Cards Game (Three-Card Monte), both instantiated as fully synthetic video episodes. Each episode is a sequence of frames in which visually identical objects undergo a series of deterministic swap operations that together implement an unknown permutation of the slots. At the start of the episode, the target object is explicitly highlighted (e.g., “the ball under cup 1”). The model must predict the slot that contains the target in the final frame, using only visual evidence.
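The episode structure described above can be illustrated with a minimal sketch. It assumes that each swap exchanges the contents of two slots and that slots are indexed from 0. The function names `simulate_episode` and `final_slot` and the uniform sampling of swaps are hypothetical choices for illustration, not the benchmark's actual generator.

```python
import random


def final_slot(start_slot, swaps):
    """Follow one target object through a sequence of slot swaps."""
    slot = start_slot
    for a, b in swaps:
        if slot == a:
            slot = b
        elif slot == b:
            slot = a
    return slot


def simulate_episode(num_objects, num_swaps, seed=None):
    """Sample a hypothetical episode: a start slot, a swap list, and the answer."""
    rng = random.Random(seed)
    start_slot = rng.randrange(num_objects)
    # Each swap picks two distinct slots (an assumption of this sketch).
    swaps = [tuple(rng.sample(range(num_objects), 2)) for _ in range(num_swaps)]
    return start_slot, swaps, final_slot(start_slot, swaps)


if __name__ == "__main__":
    start, swaps, answer = simulate_episode(num_objects=3, num_swaps=5, seed=0)
    print(f"start={start} swaps={swaps} answer={answer}")
```

The ground-truth label depends on every swap in order, which is why a model that skips or misreads a single swap can end on the wrong slot.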
The construction enforces a strict spatiotemporal continuity condition. Let $c_i^t$ be the center of object $i$ in frame $t$. A constraint on the per-frame displacement of each center, paired with a companion condition on its parameters, ensures that identity assignment is both unique and consistent under temporal smoothness, barring any moment where two objects "cross over."
Dataset statistics:
- 100 videos (50 Cups, 50 Cards) in the evaluation split;
- Objects per task: […];
- Shuffle count per task: […];
- Frame rate at minimum 2 frames/swap to ensure continuity cues;
- Full photorealistic variability in appearance, lighting, texture, and camera view via three.js rendering;
- Unlimited synthetic generation to prevent memorization.
This regime guarantees that correct tracking requires chaining correspondences between object coordinates across all consecutive frames, and it eliminates appearance-based discrimination as a usable cue (Liu et al., 9 Mar 2026).
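Chaining correspondences across consecutive frames can be sketched as follows. This is an illustrative greedy nearest-neighbor matcher, not the paper's method. It assumes that each frame is given as an unordered list of object centers and that per-frame motion is small relative to the distance between objects. The name `chain_identities` is hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def chain_identities(frames):
    """Order each frame's centers by the identity indices fixed in frame 0.

    frames: list of frames, each a list of (x, y) centers in arbitrary order.
    Returns a list of frames where position i always holds object i.
    """
    tracks = [list(frames[0])]
    for detections in frames[1:]:
        remaining = list(detections)
        current = []
        for previous_center in tracks[-1]:
            # Greedy match: the closest unclaimed center continues this track.
            nearest = min(remaining, key=lambda c: math.dist(previous_center, c))
            remaining.remove(nearest)
            current.append(nearest)
        tracks.append(current)
    return tracks


if __name__ == "__main__":
    # Two objects exchange slots along separate arcs; detections are unordered.
    frames = [
        [(0.0, 0.0), (4.0, 0.0)],
        [(1.0, 1.5), (3.0, -1.5)],
        [(2.0, -2.0), (2.0, 2.0)],
        [(1.0, -1.5), (3.0, 1.5)],
        [(0.0, 0.0), (4.0, 0.0)],
    ]
    print(chain_identities(frames)[-1])
```

The first and last frames are identical as sets of positions, so only the intermediate frames reveal that the two objects exchanged slots.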
3. Theoretical Foundations and Expressivity Analysis
The architectural limitations of fixed-depth transformer-based VLMs for the visual entity tracking problem are rigorously formalized. The paper introduces the decision problem TRACK$_k$:
- Definition: Given a video of $k$ visually identical objects under the VET-Bench continuity regime, decide whether the induced permutation $\pi$ is the identity.
- Key Result: For any fixed $k \geq 5$, TRACK$_k$ is $\mathsf{NC}^1$-complete.
The proof proceeds by showing that the task admits a logarithmic-depth (in sequence length) circuit solution (membership) but is as hard as the word problem for $S_5$ under group-theoretic reductions (hardness). Because fixed-depth transformers are contained in $\mathsf{TC}^0$, a class widely believed to be strictly weaker, this establishes a fundamental “barrier”: under that assumption, fixed-depth VLMs cannot solve general visual tracking without access to explicit intermediate computation or external memory. This separates entity tracking from tasks solvable by transformers in a highly formal sense (Liu et al., 9 Mar 2026).
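To make the decision problem concrete, the sketch below composes a sequence of swaps into a permutation and tests whether the result is the identity. It is a plain sequential scan, shown only to illustrate the word-problem view of tracking, and it is not the construction used in the paper's proof. The names `compose_swaps` and `is_identity` are hypothetical.

```python
def compose_swaps(num_objects, swaps):
    """Return perm where perm[slot] is the object that ends up in that slot."""
    perm = list(range(num_objects))
    for a, b in swaps:
        # Exchanging two slots multiplies the running permutation by a transposition.
        perm[a], perm[b] = perm[b], perm[a]
    return perm


def is_identity(num_objects, swaps):
    """Decide whether the swap sequence returns every object to its start slot."""
    return compose_swaps(num_objects, swaps) == list(range(num_objects))


if __name__ == "__main__":
    print(compose_swaps(3, [(0, 1), (1, 2)]))  # [1, 2, 0]
    print(is_identity(3, [(0, 1), (0, 1)]))    # True
```

The scan keeps one running state and updates it once per swap, so the number of sequential steps grows with the length of the video. This is the kind of explicit intermediate computation that the expressivity argument says a fixed-depth model must be given room to perform.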
4. Baseline Evaluation and Model Analysis
State-of-the-art video-capable VLMs—including Gemini-3-Pro, Gemini-2.5, Qwen3-VL, GLM-4.6V-Flash, Ernie-4.5, Doubao-Seed, Kimi-K2.5, PerceptionLM, and Molmo2—are evaluated using standard MCQA prompts (e.g., “Which cup contains the ball at the end?”), with or without simple chain-of-thought (CoT).
Results indicate that models perform at chance level across the evaluated configurations. Analysis of response patterns reveals:
- Direct-answer completions: random guessing;
- Coarse description completions: generic recounting of “they shuffle” without actual tracking;
- CoT completions: can hallucinate plausible but incorrect entity transformations, leading to error propagation.
Performance degrades rapidly with increasing swap count or object count. Even parity-style tracking remains at chance unless explicit intermediate state representations are used. Prior tests such as the Perception Test's "cups-game", in which appearance cues are present, show a dramatic collapse in accuracy once those cues are removed, corroborating the stringency of the VET-Bench regime (Liu et al., 9 Mar 2026).
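A simple way to check whether a measured accuracy is distinguishable from guessing is sketched below. It assumes a multiple-choice setup with equally likely options and uses a normal approximation to the binomial. The three-option setting in the example and the function names are assumptions made for illustration; the 100-video count matches the evaluation split described earlier.

```python
import math


def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and aligned")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def is_consistent_with_chance(acc, num_options, num_items, z=1.96):
    """True if acc lies within z standard errors of uniform guessing."""
    p = 1.0 / num_options
    stderr = math.sqrt(p * (1.0 - p) / num_items)
    return abs(acc - p) <= z * stderr


if __name__ == "__main__":
    # Hypothetical scores on a 100-video split with three answer options.
    print(is_consistent_with_chance(0.35, num_options=3, num_items=100))
    print(is_consistent_with_chance(0.90, num_options=3, num_items=100))
```

With only 100 items the interval around chance is fairly wide, so small deviations from the uniform-guess rate should not be read as evidence of tracking ability.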
5. Spatiotemporal Grounded Chain-of-Thought (SGCoT) Methodology
To break the expressivity barrier, the benchmark authors introduce Spatiotemporal Grounded Chain-of-Thought (SGCoT), a method compelling the model to explicitly generate object trajectories as intermediate (not just implicit) states.
Pipeline:
- The input prompt is prefixed to demand explicit tracking: "Track the [object] and answer where it is at the end of the video."
- Molmo2 produces a <tracks> block encoding the object's $(x, y)$ coordinates at 0.5-second intervals.
- The final answer ("Answer: left/middle/right") is placed after the trajectory trace.
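The output structure in the pipeline above can be consumed with a small parser. The exact serialization of the <tracks> block is not specified in this article, so the sketch assumes a hypothetical format in which each point is written as `(time, x, y)`. The regular expressions and the name `parse_sgcot` are illustrative only and may not match Molmo2's real output.

```python
import re

# Hypothetical point format: "(time_seconds, x, y)" with non-negative decimals.
TRACK_POINT = re.compile(r"\(\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\)")
TRACK_BLOCK = re.compile(r"<tracks>(.*?)</tracks>", re.DOTALL)
ANSWER = re.compile(r"Answer:\s*(left|middle|right)")


def parse_sgcot(text):
    """Split a completion into its trajectory points and its final answer."""
    points = []
    block = TRACK_BLOCK.search(text)
    if block:
        for match in TRACK_POINT.finditer(block.group(1)):
            points.append(tuple(float(value) for value in match.groups()))
    answer_match = ANSWER.search(text)
    answer = answer_match.group(1) if answer_match else None
    return points, answer


if __name__ == "__main__":
    completion = (
        "<tracks>(0.0, 10.0, 50.0) (0.5, 30.0, 48.0) (1.0, 52.5, 50.0)</tracks>\n"
        "Answer: middle"
    )
    print(parse_sgcot(completion))
```

Keeping the trajectory and the answer separable makes it possible to check whether a wrong answer stems from a wrong trajectory or from a correct trajectory that was misread.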
Training uses QLoRA fine-tuning solely on synthetic text data: coordinate strings plus a final label. The loss is masked everywhere except the final answer token. This aligns the VLM's output structure so that accurate, token-level trajectory accounting is linked to correct final localization.
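The loss masking described above can be sketched with plain lists. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss and is commonly used for masked label positions. The token IDs, the position of the answer, and the name `mask_labels` are hypothetical, and this is not the authors' training code.

```python
IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss


def mask_labels(token_ids, answer_start):
    """Keep labels only from answer_start onward; mask everything before it."""
    if not 0 <= answer_start <= len(token_ids):
        raise ValueError("answer_start must lie within the sequence")
    return [IGNORE_INDEX] * answer_start + list(token_ids[answer_start:])


if __name__ == "__main__":
    # Hypothetical sequence: four trajectory tokens followed by one answer token.
    token_ids = [11, 12, 13, 14, 15]
    print(mask_labels(token_ids, answer_start=4))  # [-100, -100, -100, -100, 15]
```

The trajectory tokens still appear in the input, so the answer is conditioned on them even though only the answer position is supervised.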
Technical parameters: 300 synthetic text samples, 1 epoch, fixed vision encoder, batch size 64, single A100 GPU, run time under 3 minutes (Liu et al., 9 Mar 2026).
6. Empirical Results and Insights
After SGCoT alignment, Molmo2's accuracy on VET-Bench rises sharply from near-chance levels in the authors' empirical evaluation. Error analysis attributes residual failures solely to errors in the dense, half-second tracking sequence, confirming that trajectory-level intermediate state generation is the critical enabler.
Comparison with legacy and contemporary video QA and tracking tests indicates that VET-Bench is distinctive in demanding persistent, token-wise visual memory. Competing datasets (e.g., VideoReasonBench) are less stringent because they include motion arrows, which shortcut the raw tracking requirement; even so, models reach only limited accuracy on them.
The released codebase includes data generation in three.js, alignment and evaluation scripts, and auxiliary resources for model training and assessment (https://vetbench.github.io).
7. Significance and Broader Implications
VET-Bench serves as a definitive measure of spatiotemporal memory and object permanence in vision-language agents. The benchmark rigorously confirms that:
- Fixed-depth transformer VLMs, as currently constructed, cannot solve even simple video shell games when static appearance cues are absent;
- Eliciting explicit intermediate state reasoning (SGCoT) is algorithmically sufficient for success;
- Minimal fine-tuning on text-only data can enable previously inaccessible capabilities, provided the underlying visual encoder is trained for tracking.
A plausible implication is that genuine temporal state modeling—explicit in step-by-step token output or architectural memory structure—will be a precondition for reliably deploying VLMs in real-world, temporally extended video reasoning settings. VET-Bench thus acts as both a diagnostic and developmental benchmark for multimodal research targeting integrated perception, temporal reasoning, and causal inference in synthetic and real environments (Liu et al., 9 Mar 2026).