
TubeletGraph: Tracking Object Transformations

Updated 10 November 2025
  • TubeletGraph is a zero-shot vision system that tracks objects through drastic visual transformations by recovering missing tubelets and constructing a semantic state graph.
  • It utilizes a four-stage pipeline combining spatio-temporal segmentation, candidate tubelet reasoning with spatial and semantic priors, and LLM-powered event labeling.
  • The approach achieves state-of-the-art performance on benchmarks like VOST-TAS, significantly improving tracking and accurate state-change detection compared to conventional methods.

TubeletGraph is a zero-shot vision system for tracking user-prompted objects through state transformations characterized by significant appearance changes, such as those caused by cuts, breaks, or metamorphoses. Unlike conventional tracking models that lose the target after radical visual change, TubeletGraph recovers missing objects after transformations and constructs a semantic state graph that represents the sequence of state changes across time. The system relies on spatial and semantic reasoning, as well as natural language understanding, to detect and describe transformation events, and it achieves state-of-the-art results on dedicated benchmarks.

1. Formal Definition and Task Overview

TubeletGraph addresses the "Track Any State" (TAS) task, which requires not only maintaining identity tracking of objects across radical transformations but also detecting, temporally localizing, and naming each transformation event. An input consists of a video $V = \{I_t\}_{t=1}^T$ and an initial mask $M_1$ at frame 1 indicating the target object. The system outputs a set of object tracks (tubelets), change points (frames where state transitions occur), and a state graph $S$ whose nodes are tubelets and whose directed, verb-labeled edges represent the transformations. The challenge arises when the object splits, recomposes, or changes so significantly that standard tracking cannot reliably associate the before- and after-states.
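To make the task interface concrete, here is a minimal Python sketch of the TAS inputs and outputs described above; the class and field names are illustrative assumptions, not identifiers from the paper or any code release.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

import numpy as np


@dataclass
class Tubelet:
    """One partial track: per-frame binary masks for a single entity."""
    masks: Dict[int, np.ndarray]   # frame index -> (H, W) boolean mask
    start_frame: int               # first frame in which the tubelet exists


@dataclass
class StateChangeEvent:
    """One detected transformation (an edge bundle in the state graph S)."""
    frame: int                       # change point s
    verb: str                        # action verb naming the transformation
    products: List[Tuple[int, str]]  # (tubelet id, textual label) per post-state fragment


@dataclass
class TASOutput:
    tubelets: List[Tubelet]          # object tracks (nodes of the state graph)
    events: List[StateChangeEvent]   # directed, verb-labeled transformations
```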

2. Methodological Pipeline

TubeletGraph executes in four major stages:

2.1 Video Partitioning into Tubelets

  • Spatio-temporal Partitioning: The entire video is exhaustively decomposed into “tubelets”—partial tracks such that every pixel in every frame belongs to exactly one tubelet.
  • Segmentation (Frame 1): The CropFormer-Hornet-3X model produces non-overlapping entity masks $\{e_1^i\}$ for the first frame; the user prompt $M_1$ is unioned in, yielding $E_1 = \mathrm{CF}(I_1) \cup \{M_1\}$.
  • Tubelet Initialization: Each initial entity $e_1^i \in E_1$ is tracked forward in time via SAM2.1-Large to obtain a tubelet $P_i = \{e_t^i\}_{t=1}^T$, forming the set $P_{\mathrm{init}}$.
  • Coverage-Guided Tubelet Growth: For $t > 1$, a new CropFormer segment $\hat{e}_t^j$ launches a novel tubelet if it is insufficiently covered by existing tubelets; formally, if $\mathrm{cover}(\hat{e}_t^j, \bigcup_{P \in P_{\mathrm{init}}} P_t) < \tau_\mathrm{coverage} = 0.25$ (see the sketch after this list).
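The coverage-guided growth rule can be sketched as follows, assuming boolean NumPy masks and a generic `track_forward` stand-in for SAM2-style propagation; all function and variable names are assumptions made for illustration, not the authors' code.

```python
import numpy as np

TAU_COVERAGE = 0.25


def cover(segment: np.ndarray, existing_union: np.ndarray) -> float:
    """Fraction of `segment` pixels already covered by existing tubelets at this frame."""
    seg_area = segment.sum()
    if seg_area == 0:
        return 1.0  # empty segments never spawn tubelets
    return float(np.logical_and(segment, existing_union).sum()) / float(seg_area)


def grow_tubelets(frame_idx, cropformer_segments, tubelets, track_forward):
    """Spawn a new tubelet for each under-covered CropFormer segment at `frame_idx`.

    `tubelets` maps tubelet id -> {frame index: boolean mask}; `track_forward` is a
    placeholder for propagating a seed mask forward through the video.
    """
    # Union of all existing tubelet masks at this frame.
    existing_union = np.zeros_like(cropformer_segments[0], dtype=bool)
    for masks in tubelets.values():
        if frame_idx in masks:
            existing_union |= masks[frame_idx].astype(bool)

    for seg in cropformer_segments:
        if cover(seg.astype(bool), existing_union) < TAU_COVERAGE:
            new_id = max(tubelets) + 1 if tubelets else 0
            tubelets[new_id] = track_forward(seed_mask=seg, start_frame=frame_idx)
    return tubelets
```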

2.2 Candidate-Tubelet Reasoning with Priors

  • Candidate Selection: Tubelets starting at $t > 1$ are considered as potential transformation products.
  • Spatial Proximity Prior: Valid tubelets appear spatially proximate to where the original object disappeared. The score $S_{\mathrm{prox}}(C,P) = \max_j |c_s \cap m_s^j| / |c_s|$ is computed between the candidate's mask $c_s$ at its start frame $s$ and three alternative SAM2 segmentations $m_s^j$, and must exceed $\tau_\mathrm{prox} = 0.3$.
  • Semantic Consistency Prior: Despite visual change, semantic features of candidate matches and the original should correlate. CLIP-based pooling extracts $f(M, I) = \mathrm{Pool}(\mathrm{CLIP}(I), M)$, and $S_{\mathrm{sem}}(C,P) = \max_{i < s,\, j \geq s} f(p_i, I_i) \cdot f(c_j, I_j)$ must satisfy $S_{\mathrm{sem}}(C,P) > \tau_\mathrm{sem} = 0.7$.
  • Final Validation: Only tubelets satisfying both priors are accepted as valid continuations (a combined sketch of both priors follows this list).
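A minimal sketch of the two priors and the acceptance test, assuming per-frame boolean masks and unit-normalized, CLIP-pooled feature vectors; the function names are illustrative, not taken from the paper's implementation.

```python
import numpy as np

TAU_PROX, TAU_SEM = 0.3, 0.7


def spatial_proximity(candidate_seed: np.ndarray, sam2_alt_masks) -> float:
    """S_prox: best overlap ratio between the candidate's first mask c_s and
    alternative SAM2 segmentations m_s^j of the original object at frame s."""
    c = candidate_seed.astype(bool)
    if c.sum() == 0:
        return 0.0
    return max(
        float(np.logical_and(c, m.astype(bool)).sum()) / float(c.sum())
        for m in sam2_alt_masks
    )


def semantic_consistency(parent_feats, candidate_feats) -> float:
    """S_sem: max similarity between pre-change features of the parent tubelet
    (frames i < s) and post-change features of the candidate (frames j >= s).
    Assumes unit-normalized feature vectors, so the dot product acts as cosine similarity."""
    return max(
        float(np.dot(fp, fc))
        for fp in parent_feats       # f(p_i, I_i), i < s
        for fc in candidate_feats    # f(c_j, I_j), j >= s
    )


def is_valid_continuation(candidate_seed, sam2_alt_masks, parent_feats, candidate_feats):
    """A candidate tubelet is accepted only if both priors pass their thresholds."""
    return (spatial_proximity(candidate_seed, sam2_alt_masks) > TAU_PROX
            and semantic_consistency(parent_feats, candidate_feats) > TAU_SEM)
```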

2.3 State-Change Description and State Graph Construction

  • Event Detection: Each recovered tubelet $C$ (i.e., each validated candidate from the previous stage) marks a state-change event at its start frame $s$.
  • LLM-Powered Semantic Labeling: For each event, the system submits pre/post crops (the originating mask and the resulting fragment) to GPT-4.1 (temperature 0), which outputs (i) an action verb $v$ describing the transformation and (ii) a textual label $d_j$ for each post-state fragment.
  • Graph Assembly: Event tuples $\tau = (s, v, \{c_j, d_j\})$ are organized into a directed state graph $S$, where nodes are tubelets and edges link transformations, labeled with action verbs (see the sketch after this list).
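A dict-based sketch of assembling the state graph from event tuples; the paper does not prescribe a particular graph representation, so the plain-dictionary structure and key names below are assumptions for illustration.

```python
from collections import defaultdict


def build_state_graph(events):
    """Assemble a directed, verb-labeled state graph from event records.

    `events` is an iterable of dicts like
    {"frame": s, "verb": v, "parent": parent_id, "products": [(child_id, label), ...]}.
    Returns (nodes, edges): nodes maps tubelet id -> textual label (None if unlabeled),
    edges maps parent id -> list of {"to", "verb", "frame"} records.
    """
    nodes, edges = {}, defaultdict(list)
    for ev in events:
        nodes.setdefault(ev["parent"], None)   # parent tubelet node
        for child_id, label in ev["products"]:
            nodes[child_id] = label            # post-state fragment with its LLM label d_j
            edges[ev["parent"]].append(
                {"to": child_id, "verb": ev["verb"], "frame": ev["frame"]}
            )
    return nodes, dict(edges)
```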

2.4 Implementation Details

All core modules are used off-the-shelf with no fine-tuning.

  • Entity segmentation: CropFormer-Hornet-3X.
  • Tubelet propagation: SAM2.1-Large.
  • Semantic features: FC-CLIP-COCO pooling.
  • Language module: GPT-4.1.
  • Hyperparameters: $\tau_\mathrm{coverage} = 0.25$, $\tau_\mathrm{prox} = 0.3$, $\tau_\mathrm{sem} = 0.7$ (collected in the configuration sketch below).
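For reference, the off-the-shelf modules and thresholds above can be gathered into a single configuration object; this is a hedged sketch with illustrative field names, not an actual configuration schema from the implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TubeletGraphConfig:
    # Off-the-shelf modules (no fine-tuning).
    entity_segmenter: str = "CropFormer-Hornet-3X"
    tracker: str = "SAM2.1-Large"
    semantic_features: str = "FC-CLIP-COCO"
    language_model: str = "GPT-4.1"
    # Thresholds reported in the paper summary.
    tau_coverage: float = 0.25
    tau_prox: float = 0.3
    tau_sem: float = 0.7
```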

3. Benchmark: VOST-TAS

VOST-TAS extends the VOST benchmark with explicit, dense annotations of state changes.

| Characteristic | Details |
|---|---|
| Number of videos | 57 |
| Video duration | ~22 s (60 fps) |
| Annotated transformations | 108 |
| Post-condition masks/labels | 293 |
| Event representation | $\tau_i = (t_i^s, t_i^e, v_i, \{(M_{i,j}, d_{i,j})\}_{j=1}^{K_i})$ |

Annotations track start/end frames, ground-truth verbs, resulting masks, and their textual descriptions, providing a rigorous resource for spatio-temporal and semantic evaluation.
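A possible Python container for one annotation tuple $\tau_i = (t_i^s, t_i^e, v_i, \{(M_{i,j}, d_{i,j})\})$, with illustrative names; the benchmark's actual annotation file format is not specified in this summary.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class TASAnnotation:
    """One annotated transformation in VOST-TAS (illustrative structure)."""
    t_start: int                                   # t_i^s: frame where the change begins
    t_end: int                                     # t_i^e: frame where the change ends
    verb: str                                      # v_i: ground-truth action verb
    post_conditions: List[Tuple[np.ndarray, str]]  # (M_{i,j}, d_{i,j}): mask + description
```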

4. Evaluation Metrics and Quantitative Results

Tracking and State-Change Metrics

| Metric | Description |
|---|---|
| $\mathcal{J}$ | Jaccard over all frames |
| $\mathcal{J}_{\mathrm{tr}}$ | Jaccard over the last 25% of frames (where tracks are typically lost after a transformation) |
| $\mathcal{P}$, $\mathcal{R}$ | Pixel-level precision and recall |
| $P_\mathrm{TL}$, $R_\mathrm{TL}$ | Temporal-localization precision/recall for state changes |
| $V$ | Semantic verb accuracy, judged by GPT-4.1 |
| $O$ | Object label accuracy: mask correspondence (IoU > 0.5) plus GPT-4.1 label match |
| $R_{\mathrm{ST}}$ | Spatio-temporal recall: fraction of ground-truth changes with matched time and mask overlap |
| $R_{\mathrm{all}}$ | Overall recall: $R_{\mathrm{ST}}$ plus correct verb and object label |
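A minimal sketch of how the region metrics $\mathcal{J}$ and $\mathcal{J}_{\mathrm{tr}}$ could be computed from per-frame masks, assuming $\mathcal{J}_{\mathrm{tr}}$ averages the last 25% of frames; the exact evaluation protocol of the benchmark may differ.

```python
import numpy as np


def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two boolean masks for a single frame."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(inter) / float(union)


def region_metrics(pred_masks, gt_masks):
    """pred_masks, gt_masks: lists of (H, W) boolean arrays, one per frame.

    Returns (J, J_tr): mean Jaccard over all frames, and over the last 25% of frames,
    which is where tracks are typically lost after a transformation.
    """
    per_frame = [jaccard(p, g) for p, g in zip(pred_masks, gt_masks)]
    j_all = float(np.mean(per_frame))
    tail = per_frame[int(0.75 * len(per_frame)):]
    j_tr = float(np.mean(tail)) if tail else j_all
    return j_all, j_tr
```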

Key Results

| Dataset | Method | $\mathcal{J}$ | $\mathcal{J}_{\mathrm{tr}}$ |
|---|---|---|---|
| VOST val | SAM2.1 | 48.4 | 32.4 |
| VOST val | TubeletGraph | 51.0 (+2.6) | 36.9 (+4.5) |
| VSCOS val | TubeletGraph | ~ +3 to +4 over SAM | |
| M³-VOS val | TubeletGraph | 74.2 | |
| M³-VOS val | ReVOS (reference) | 75.6 | |
| M³-VOS val | SAM2.1 (reference) | 71.3 | |
| DAVIS17 | TubeletGraph | 85.6 (best prior: 87.1) | |

For state-change detection (VOST-TAS):

| Metric | Value |
|---|---|
| $P_\mathrm{TL}$ | 43.1% |
| $R_\mathrm{TL}$ | 20.4% |
| $V$ (verb) | 81.8% |
| $O$ (object) | 72.3% |
| $R_\mathrm{ST}$ | 12.0% |
| $R_\mathrm{all}$ | 6.5% |

These results indicate that TubeletGraph improves tracking performance under appearance-changing transformations and supports more grounded state-change understanding.

5. Design Choices, Ablations, and Limitations

Ablation Findings

| Modification | Effect |
|---|---|
| Removing CropFormer (SAM automasks only) | $\mathcal{J}$ decreases by 1.8 |
| Replacing SAM2.1 with Cutie | $\mathcal{J}$ decreases by 3.4 |
| Swapping CLIP for DINOv2 | $\mathcal{J}$ essentially unchanged |
| Replacing GPT-4.1 with Qwen-2.5VL | Verb/object accuracy drops significantly |

Varying $\tau_\mathrm{prox}$ or $\tau_\mathrm{sem}$ over wide intervals changes $\mathcal{J}$ by less than $\pm 0.5$.

Computational and Practical Characteristics

  • Zero-shot Generalization: No fine-tuning is employed; robust across diverse domains and video styles (e.g., egocentric, internet video).
  • Resource Requirements: ~7 seconds per frame on an NVIDIA A6000 GPU, largely due to per-entity SAM2 tracking.
  • Occlusion Robustness: Short occlusions are addressed by additional tubelet generation.
  • Dominant Error Modes: False-positives from SAM2, mislabeling by the LLM (e.g., object ambiguity such as smartphone vs. tape measure).
  • Ethical Risks: Potential for misuse similar to other robust vision-language pipelines, including privacy concerns in surveillance contexts.

6. Qualitative Behavior and Output Structure

TubeletGraph demonstrates the semantic structure of transformations via state graphs. For example:

  • "Pulling foil from a roll": SAM2.1 fails to track the narrow sheet after transformation; TubeletGraph's reasoning based on spatial proximity and semantic alignment recovers the fragment, and the LLM labels the event “pull out,” assigning "sheet of foil" as an object label.
  • "Butterfly Emergence": The chrysalis mask disappears upon metamorphosis in SAM2.1; TubeletGraph recovers the new butterfly tubelet and the language module labels the event as “emerge” and the new object as “butterfly.”

Such outputs enable complex analysis, visualization, and querying of object transformation sequences within video.


TubeletGraph reframes the challenge of tracking through transformations as a discrete tubelet selection task, leveraging spatial and semantic priors, and augments its outputs with structured, natural-language state-change graphs. This approach both improves the robustness of visual tracking under appearance shifts and deepens automatic understanding of transformation events.

