
VOST-TAS Dataset Benchmark

Updated 10 November 2025
  • VOST-TAS is a benchmark dataset that rigorously annotates object transformation events, enabling systematic evaluation of video object segmentation systems.
  • It employs a unique annotation schema marking transformation boundaries using free-text verbs and paired segmentation masks.
  • Evaluation metrics like temporal precision, recall, and semantic accuracy highlight both methodological advances and current challenges in tracking state changes.

VOST-TAS is a benchmark dataset specifically designed to quantify and evaluate the capabilities of video object segmentation (VOS) systems in tracking objects undergoing state transformations—such as fragmentation, emergence, or geometric and appearance changes—where conventional trackers typically fail. Developed in the context of the “Track Any State” (TAS) task (Sun et al., 6 Nov 2025), VOST-TAS provides detailed human annotations of transformation boundaries and outcomes, forming the first standard for systematic evaluation of complex object transformations in video.

1. Motivation, Scope, and Construction Principles

The VOST-TAS dataset addresses fundamental limitations of standard VOS trackers, which assume connected component continuity and static appearance. Real-world videos frequently depict scenarios where objects fragment, merge, or undergo irreversible appearance changes (e.g., slicing a fruit, metamorphosis). Existing methods such as SAM2 and XMem predominantly produce false negatives at transformation points due to reliance on pixelwise similarity and global propagation.

VOST-TAS was constructed via expert relabeling of videos from the VOST validation split. Out of 70 candidate videos, 13 were filtered out for annotation ambiguity or quality issues (§A.1), yielding a curated set of 57 video instances. Each video is annotated only at transformation boundaries rather than at every frame, capturing a total of 108 transformation events and 293 resultant object masks.

2. Annotation Schema and Format

Annotations in VOST-TAS rigorously encode both when and how objects transition between states. For a video $V = \{I_t\}_{t=0}^T$, the annotation tuple is

$$A = (t_{\mathrm{start}},\, t_{\mathrm{end}},\, \Gamma)$$

where $t_{\mathrm{start}}$ and $t_{\mathrm{end}}$ delimit the annotated segment and $\Gamma$ is an ordered list of transformations. Each transformation is an explicit 4-tuple:

$$\tau_i = (t_i^{\mathrm{s}},\, t_i^{\mathrm{e}},\, v_i,\, \mathcal{O}_i)$$

with $t_i^{\mathrm{s}}, t_i^{\mathrm{e}}$ denoting the transformation boundaries, $v_i$ a free-text verb (e.g., "emerge," "cut," "detach"), and $\mathcal{O}_i$ the set $\{(M_{i,j}, d_{i,j})\}_{j=1}^{K_i}$ pairing binary segmentation masks at $t_i^{\mathrm{e}}$ with textual instance labels.
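
For concreteness, the annotation tuple maps naturally onto simple data structures. The following Python sketch only mirrors the formal definition above; the class and field names are illustrative and are not taken from the released JSON schema.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class ResultObject:
    mask: np.ndarray      # binary segmentation mask M_{i,j} at frame t_i^e
    description: str      # textual instance label d_{i,j}, e.g. "apple slice"


@dataclass
class Transformation:
    t_start: int                   # transformation onset t_i^s (frame index)
    t_end: int                     # transformation end t_i^e (frame index)
    verb: str                      # free-text verb v_i, e.g. "cut", "emerge", "detach"
    outcomes: List[ResultObject]   # O_i = {(M_{i,j}, d_{i,j})}_{j=1..K_i}


@dataclass
class VideoAnnotation:
    t_start: int                           # annotated segment start t_start
    t_end: int                             # annotated segment end t_end
    transformations: List[Transformation]  # ordered list Gamma
```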

For evaluation, predicted transformations are formalized as edges in a state-change graph:

$$(t,\, \mathcal{T}_{\mathrm{pre}},\, \mathcal{T}_{\mathrm{post}},\, D)$$

where $t$ is the predicted timestamp, $\mathcal{T}_{\mathrm{pre}}$ and $\mathcal{T}_{\mathrm{post}}$ the sets of track IDs before and after the transformation, and $D$ a natural-language event description. This facilitates semantic comparison of system output to ground truth.
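
A predicted edge can likewise be held in a small record; the names below are hypothetical and only mirror the tuple $(t, \mathcal{T}_{\mathrm{pre}}, \mathcal{T}_{\mathrm{post}}, D)$.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PredictedChange:
    """One edge of a predicted state-change graph (illustrative field names)."""
    timestamp: int                                       # predicted frame index t
    tracks_pre: Set[int] = field(default_factory=set)    # track IDs before the change
    tracks_post: Set[int] = field(default_factory=set)   # track IDs after the change
    description: str = ""                                # natural-language description D


# e.g. track 3 (an apple) being cut into tracks 7 and 8 (two halves)
edge = PredictedChange(412, {3}, {7, 8}, "the apple is cut into two halves")
```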

3. Dataset Structure, Splits, and Accessibility

VOST-TAS repurposes only the validation split of VOST; no separate training or test partition is defined for transformation-aware evaluation. Its construction protocol ensures high annotation fidelity at change boundaries rather than requiring dense per-frame labeling. Average video duration is approximately 22.3 s at 60 fps, consistent with the source VOST split.

The dataset, together with code and auxiliary materials, is publicly available via https://tubelet-graph.github.io. Masks are provided as PNG images; transformation and metadata follow a JSON schema. No license is stated, but repository access is unrestricted.
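
A minimal loading sketch, assuming one JSON file and a masks/ folder per video; the file names and JSON keys below are assumptions, not the documented schema.

```python
import json
from pathlib import Path

import numpy as np
from PIL import Image


def load_video_annotation(video_dir: Path) -> dict:
    """Read one video's transformation annotations and decode its result masks."""
    # "annotations.json", "masks/", and the key names are hypothetical.
    ann = json.loads((video_dir / "annotations.json").read_text())
    for tf in ann.get("transformations", []):
        for obj in tf.get("objects", []):
            png = Image.open(video_dir / "masks" / obj["mask_file"])
            obj["mask"] = np.array(png) > 0   # threshold the PNG to a binary mask
    return ann
```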

For model training, external datasets such as VOST train, VSCOS, M³-VOS, and DAVIS17 are used; VOST-TAS itself serves as the de facto evaluation standard for Track Any State due to its unique annotation strategy focused on transformation events rather than steady tracking.

4. Evaluation Protocol and Metrics

Performance in VOST-TAS is quantified via precise spatiotemporal and semantic criteria. For temporal localization, the protocol matches predicted change timestamps $\{t_j^{\mathrm{pred}}\}$ to ground-truth intervals $\{[t_i^{\mathrm{s}}, t_i^{\mathrm{e}}]\}$ via a cost matrix:

$$C_{ij} = \begin{cases} 0 & \text{if } t_j^{\mathrm{pred}} \in [t_i^{\mathrm{s}}, t_i^{\mathrm{e}}] \\ 1 & \text{otherwise} \end{cases}$$

Assignments (Hungarian algorithm) yield precision and recall:

$$P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$
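
A sketch of this temporal matching using SciPy's Hungarian solver; treating assigned pairs with cost 1 as non-matches is an assumption, since the exact TP/FP bookkeeping is not spelled out here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def temporal_precision_recall(pred_times, gt_intervals):
    """Match predicted change timestamps to ground-truth intervals and
    compute temporal precision and recall."""
    if len(pred_times) == 0:
        return 0.0, 0.0
    # Cost is 0 when the prediction falls inside the ground-truth interval, 1 otherwise.
    cost = np.ones((len(gt_intervals), len(pred_times)))
    for i, (t_s, t_e) in enumerate(gt_intervals):
        for j, t in enumerate(pred_times):
            if t_s <= t <= t_e:
                cost[i, j] = 0.0
    rows, cols = linear_sum_assignment(cost)   # Hungarian assignment
    tp = int(sum(cost[r, c] == 0 for r, c in zip(rows, cols)))
    fp = len(pred_times) - tp                  # predictions without a zero-cost match
    fn = len(gt_intervals) - tp                # ground-truth events left undetected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```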

Semantic accuracy is decomposed into:

  • Action-verb accuracy $V$, computed via GPT-4.1 prompt-similarity scoring over true-positive matches.
  • Resulting-object accuracy $O$, computed via mask IoU matching (IoU $> 0.5$) combined with text similarity, also evaluated with GPT-4.1 (see the IoU sketch after this list).
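
A minimal IoU check for the spatial half of the object-accuracy criterion (the GPT-4.1 text-similarity step is not reproduced):

```python
import numpy as np


def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection-over-union between two binary masks of identical H x W shape."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


# Example: a predicted result mask is spatially matched to a ground-truth
# mask only when their IoU exceeds 0.5.
pred_mask = np.zeros((480, 640), dtype=bool); pred_mask[100:200, 100:200] = True
gt_mask = np.zeros((480, 640), dtype=bool); gt_mask[110:210, 110:210] = True
print(mask_iou(pred_mask, gt_mask) > 0.5)   # True (IoU ~ 0.68)
```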

Combined recall measures:

  • Spatiotemporal recall $R_{\mathrm{ST}}$: requires a correct timestamp and matches for all ground-truth result masks.
  • Overall recall $\mathcal{R}$: additionally requires that all actions and objects are semantically matched.

Metric                                      Value (TubeletGraph)
Temporal Precision ($P$)                    43.1%
Temporal Recall ($R$)                       20.4%
Action Verb Accuracy ($V$)                  81.8%
Object Description Accuracy ($O$)           72.3%
Spatiotemporal Recall ($R_{\mathrm{ST}}$)   12.0%
Overall Recall ($\mathcal{R}$)              6.5%

5. Transformation Types and Annotated Events

VOST-TAS encapsulates a diverse set of transformation events representative of common real-world scenarios requiring explicit reasoning about object state. Each transformation, described by a free-text verb, includes actions such as cutting/slicing, opening, emergence, detachment, or peeling. The granularity of annotation enables systems to benchmark both the detection and semantic interpretation of transition points and outcomes, moving beyond mere foreground tracking.

A plausible implication is that this schema generalizes to arbitrary transformation descriptions without constraining annotation to limited action categories. However, the dataset does not report a categorical breakdown beyond representative examples.

6. Baseline Results and Observed Limitations

TubeletGraph is reported as the first system to address the Track-Any-State benchmark, with performance on VOST-TAS indicating several key phenomena:

  • Moderate precision (43.1%): the system triggers transformation detection only in clear cases, predominantly those accompanied by obvious pixelwise changes or false-negative object tracks.
  • Low recall (20.4%): many transformations do not produce emergent tracklets (e.g., changes that preserve pixel appearance or involve gradual transitions); as a result, the majority of annotated events are not detected.
  • High semantic accuracy in detected cases ($V = 81.8\%$, $O = 72.3\%$): when a transformation is triggered, describing the nature of the change and the resulting objects is handled with reasonable reliability.
  • Overall recall ($\mathcal{R} = 6.5\%$): full compliance with the spatiotemporal and semantic requirements remains challenging.

This suggests fundamental limitations in current state-aware tracking pipelines when addressing transformations not accompanied by clear segmentation discontinuities.

7. Significance and Future Directions

VOST-TAS establishes a principled reference for evaluating video understanding systems that must reason about state transitions, object fragmentation, and semantic event description. By shifting the benchmark focus from tracking continuity to explicit change detection and outcome description, it exposes failure modes not captured by conventional VOS or video retrieval metrics. The annotation protocol and evaluation design support rigorous, reproducible analysis and set the groundwork for future advances in object-centric temporal reasoning.

A plausible implication is that further progress will require systems capable of integrating appearance, geometry, and semantic priors—potentially leveraging large-scale pretraining on transformation-rich corpora or incorporating multimodal (e.g., textual) event understanding in addition to mask propagation and track assignment.
