TubeletGraph: Tracking Object Transformations
- TubeletGraph is a zero-shot vision system that tracks objects through drastic visual transformations by recovering missing tubelets and constructing a semantic state graph.
- It utilizes a four-stage pipeline: spatio-temporal partitioning of the video into tubelets, candidate-tubelet reasoning with spatial and semantic priors, LLM-powered event labeling, and state-graph construction.
- The approach achieves state-of-the-art results on benchmarks such as VOST-TAS, improving both tracking robustness and state-change detection accuracy over conventional methods.
TubeletGraph is a zero-shot vision system for tracking user-prompted objects through state transformations characterized by significant appearance changes, such as those caused by cuts, breaks, or metamorphoses. Unlike conventional tracking models that lose the target after radical visual change, TubeletGraph recovers missing objects after transformations and constructs a semantic state graph that represents the sequence of state changes across time. The system relies on spatial and semantic reasoning, as well as natural language understanding, to detect and describe transformation events, and it achieves state-of-the-art results on dedicated benchmarks.
1. Formal Definition and Task Overview
TubeletGraph addresses the "Track Any State" (TAS) task, which requires not only maintaining identity tracking of objects across radical transformations but also detecting, temporally localizing, and naming each transformation event. An input consists of a video and an initial mask at frame 1 indicating the target object. The system outputs a set of object tracks (tubelets), change points (frames where state transitions occur), a state graph with tubelets as nodes, and directed, verb-labeled edges representing the transformations. The challenge arises when the object splits, recomposes, or changes so significantly that standard tracking cannot reliably associate the before- and after-states.
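The input/output contract of the TAS task can be summarized as a small data structure. The sketch below is illustrative only; the class and field names (`Tubelet`, `StateChangeEvent`, `TASResult`) are hypothetical rather than taken from a TubeletGraph codebase.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Tubelet:
    """A partial track: one boolean mask (H, W) per frame, starting at start_frame."""
    track_id: int
    start_frame: int
    masks: list[np.ndarray]


@dataclass
class StateChangeEvent:
    """A detected transformation from one tubelet to a recovered one."""
    frame: int          # change point (frame index)
    source_id: int      # tubelet before the transformation
    target_id: int      # tubelet after the transformation
    verb: str           # e.g. "cut", "emerge"
    object_label: str   # e.g. "sheet of foil"


@dataclass
class TASResult:
    """Output of the Track Any State task for one prompted object."""
    tubelets: dict[int, Tubelet] = field(default_factory=dict)
    events: list[StateChangeEvent] = field(default_factory=list)
```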
2. Methodological Pipeline
TubeletGraph executes in four major stages:
2.1 Video Partitioning into Tubelets
- Spatio-temporal Partitioning: The entire video is exhaustively decomposed into “tubelets”—partial tracks such that every pixel in every frame belongs to exactly one tubelet.
- Segmentation (Frame 1): The CropFormer-Hornet-3X model produces non-overlapping entity masks for the first frame; the user-prompted mask is unioned with these, yielding the initial entity set $\mathcal{E}_1$.
- Tubelet Initialization: Each initial entity $e \in \mathcal{E}_1$ is tracked forward in time via SAM2.1-Large to obtain a tubelet $T_e$, forming the initial tubelet set $\mathcal{T}$.
- Coverage-Guided Tubelet Growth: For each subsequent frame $t > 1$, a new CropFormer segment $s$ launches a novel tubelet if its uncovered-area ratio exceeds a threshold $\tau_{\text{cov}}$; formally, if $\frac{|s \setminus \bigcup_{T \in \mathcal{T}} T_t|}{|s|} > \tau_{\text{cov}}$ (see the sketch after this list).
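A minimal sketch of this coverage test, assuming boolean NumPy masks for the segment and for each existing tubelet at frame $t$; the function name `should_start_tubelet` and the default threshold value are illustrative, not taken from the TubeletGraph implementation.

```python
import numpy as np


def should_start_tubelet(segment: np.ndarray,
                         tubelet_masks_at_t: list[np.ndarray],
                         tau_cov: float = 0.5) -> bool:
    """Return True if the CropFormer segment should spawn a new tubelet at frame t.

    The test checks whether the fraction of the segment's pixels not covered
    by any existing tubelet at this frame exceeds tau_cov (placeholder value).
    """
    area = segment.sum()
    if area == 0:
        return False
    covered = np.zeros_like(segment, dtype=bool)
    for mask in tubelet_masks_at_t:   # union of all tubelet masks at frame t
        covered |= mask
    uncovered_ratio = np.logical_and(segment, ~covered).sum() / area
    return bool(uncovered_ratio > tau_cov)
```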
2.2 Candidate-Tubelet Reasoning with Priors
- Candidate Selection: Tubelets that begin after the first frame (i.e., those launched by coverage-guided growth) are considered potential transformation products.
- Spatial Proximity Prior: Valid tubelets should appear spatially proximate to where the original object disappeared; the overlap between a candidate's initial mask and the target's masks from three alternative SAM2 segmentations must exceed a threshold $\tau_{\text{spatial}}$.
- Semantic Consistency Prior: Despite the visual change, the semantic features of a candidate and of the original object should remain correlated; CLIP-based pooling extracts a feature vector per tubelet, and the similarity between the candidate's and the original's features must exceed a threshold $\tau_{\text{sem}}$.
- Final Validation: Only tubelets satisfying both priors are accepted as valid continuations (see the sketch after this list).
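The two priors amount to a conjunction of an overlap test and a feature-similarity test. The sketch below assumes precomputed boolean masks and pooled feature vectors; the helper names (`iou`, `cosine`, `is_valid_continuation`), the use of IoU and cosine similarity, and the default thresholds are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity of two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))


def is_valid_continuation(candidate_mask: np.ndarray,
                          alt_target_masks: list[np.ndarray],
                          candidate_feat: np.ndarray,
                          target_feat: np.ndarray,
                          tau_spatial: float = 0.3,
                          tau_sem: float = 0.5) -> bool:
    """Accept a candidate tubelet only if it satisfies both priors.

    Spatial prior: the candidate's initial mask must overlap the region where
    the target disappeared (here: best IoU against three alternative SAM2 masks).
    Semantic prior: pooled features of candidate and target must stay similar.
    Thresholds and score definitions are placeholders, not the paper's values.
    """
    spatial_score = max((iou(candidate_mask, m) for m in alt_target_masks),
                        default=0.0)
    semantic_score = cosine(candidate_feat, target_feat)
    return bool(spatial_score > tau_spatial and semantic_score > tau_sem)
```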
2.3 State-Change Description and State Graph Construction
- Event Detection: Each recovered tubelet marks a state-change event at its start frame $t_s$.
- LLM-Powered Semantic Labeling: For each event, the system submits pre- and post-transformation crops (the originating mask and the resulting fragment) to GPT-4.1 (temperature 0), which outputs (i) an action verb describing the transformation and (ii) a textual label for each post-state fragment.
- Graph Assembly: Event tuples are organized into a directed state graph $G = (V, E)$, where nodes are tubelets and directed edges link pre- to post-transformation states, labeled with action verbs (see the sketch below).
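As an illustration of the resulting structure, here is a small sketch that assembles event tuples into a directed graph with networkx; the dictionary fields and the example event are hypothetical, and this is not TubeletGraph's released graph code.

```python
import networkx as nx


def build_state_graph(events: list[dict]) -> nx.DiGraph:
    """Assemble event tuples into a directed state graph.

    Nodes are tubelet ids; each directed edge carries the LLM-produced action
    verb and the frame of the change point, and the post-state node stores the
    object label.
    """
    graph = nx.DiGraph()
    for ev in events:
        graph.add_node(ev["source_id"])
        graph.add_node(ev["target_id"], label=ev["object_label"])
        graph.add_edge(ev["source_id"], ev["target_id"],
                       verb=ev["verb"], frame=ev["frame"])
    return graph


# Example: a roll of foil (tubelet 0) yields a pulled-out sheet (tubelet 3).
graph = build_state_graph([
    {"source_id": 0, "target_id": 3, "frame": 142,
     "verb": "pull out", "object_label": "sheet of foil"},
])
print(graph.edges(data=True))
```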
2.4 Implementation Details
All core modules are used off-the-shelf with no fine-tuning.
- Entity segmentation: CropFormer-Hornet-3X.
- Tubelet propagation: SAM2.1-Large.
- Semantic features: FC-CLIP-COCO pooling.
- Language module: GPT-4.1.
- Hyperparameters: the coverage threshold $\tau_{\text{cov}}$, the spatial proximity threshold $\tau_{\text{spatial}}$, and the semantic similarity threshold $\tau_{\text{sem}}$.
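For orientation, the components and hyperparameters above could be collected into a single configuration object. The sketch below is purely illustrative: the class and field names are hypothetical, and the threshold values are left unset because they are not stated in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TubeletGraphConfig:
    """Illustrative configuration; component names follow the text above,
    threshold values are placeholders (not specified in this section)."""
    entity_segmenter: str = "CropFormer-Hornet-3X"
    tracker: str = "SAM2.1-Large"
    feature_extractor: str = "FC-CLIP-COCO"
    language_model: str = "gpt-4.1"
    llm_temperature: float = 0.0
    tau_cov: Optional[float] = None       # coverage threshold for new tubelets
    tau_spatial: Optional[float] = None   # spatial proximity threshold
    tau_sem: Optional[float] = None       # semantic consistency threshold
```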
3. Benchmark: VOST-TAS
VOST-TAS extends the VOST benchmark with explicit, dense annotations of state changes.
| Characteristic | Details |
|---|---|
| Number of videos | 57 |
| Video duration | 22 s (60 fps) |
| Annotated transformations | 108 |
| Post-condition masks/labels | 293 |
| Event representation | Start/end frame, action verb, and post-condition masks with textual labels |
Annotations track start/end frames, ground-truth verbs, resulting masks, and their textual descriptions, providing a rigorous resource for spatio-temporal and semantic evaluation.
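To make this annotation structure concrete, here is a hypothetical sketch of one annotated transformation as a Python dictionary; the field names and file paths are illustrative and do not reflect the released VOST-TAS format.

```python
# Hypothetical shape of a single annotated transformation (field names and
# paths are illustrative, not the released VOST-TAS format).
example_event_annotation = {
    "video_id": "example_video",
    "start_frame": 310,              # frame where the transformation begins
    "end_frame": 355,                # frame where the transformation ends
    "verb": "cut",                   # ground-truth action verb
    "post_conditions": [             # one entry per resulting fragment
        {"mask_path": "masks/example_video/000355_frag0.png",
         "label": "half of the onion"},
        {"mask_path": "masks/example_video/000355_frag1.png",
         "label": "half of the onion"},
    ],
}
```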
4. Evaluation Metrics and Quantitative Results
Tracking and State-Change Metrics
| Metric | Description |
|---|---|
| $\mathcal{J}$ | Jaccard (region similarity) averaged over all frames |
| $\mathcal{J}_{tr}$ | Jaccard over the last 25% of frames, where tracks are typically lost after a transformation |
| Pixel precision, recall | Pixel-level precision and recall of predicted masks |
| Temporal precision, recall | Temporal localization precision/recall for detected state changes |
| Verb accuracy | Semantic accuracy of the predicted action verb, judged by GPT-4.1 |
| Object accuracy | Object label accuracy, requiring mask correspondence (IoU ≥ 0.5) plus a GPT-4.1 label match |
| Spatio-temporal recall | Fraction of ground-truth changes matched in both time and mask overlap |
| Overall recall | Spatio-temporal match plus correct verb and object label |
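As a concrete reading of the first two rows, here is a minimal NumPy sketch of per-video Jaccard over all frames and over the final 25% of frames (the regime where trackers typically lose transformed objects); the function names are illustrative.

```python
import numpy as np


def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between a predicted and a ground-truth boolean mask."""
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0


def video_jaccard(preds: list[np.ndarray], gts: list[np.ndarray]) -> tuple[float, float]:
    """Return (Jaccard over all frames, Jaccard over the last 25% of frames)."""
    per_frame = [jaccard(p, g) for p, g in zip(preds, gts)]
    tail = per_frame[int(0.75 * len(per_frame)):]
    return float(np.mean(per_frame)), float(np.mean(tail))
```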
Key Results
| Dataset | Method | $\mathcal{J}$ | $\mathcal{J}_{tr}$ |
|---|---|---|---|
| VOST val | SAM2.1 | 48.4 | 32.4 |
| VOST val | TubeletGraph | 51.0 (+2.6) | 36.9 (+4.5) |
| VSCOS val | TubeletGraph | +3 to +4 over SAM2.1 | |
| M³-VOS val | TubeletGraph | 74.2 | |
| M³-VOS val | ReVOS (for reference) | 75.6 | |
| M³-VOS val | SAM2.1 (for reference) | 71.3 | |
| DAVIS17 | TubeletGraph | 85.6 (best prior: 87.1) | |
For state-change detection (VOST-TAS):
| Metric | Value |
|---|---|
| Precision | 43.1% |
| Recall | 20.4% |
| Accuracy (verb) | 81.8% |
| Accuracy (object) | 72.3% |
| Spatio-temporal recall | 12.0% |
| Overall recall | 6.5% |
These results indicate that TubeletGraph improves tracking performance under appearance-changing transformations and supports more grounded state-change understanding.
5. Design Choices, Ablations, and Limitations
Ablation Findings
| Modification | Effect |
|---|---|
| Removing CropFormer (SAM automasks only) | Performance drops by 1.8 points |
| Replacing SAM2.1 with Cutie | Performance drops by 3.4 points |
| Swapping CLIP for DINOv2 | |
| Replacing GPT-4.1 with Qwen-2.5VL | Verb/object accuracy drops significantly |
Varying the prior thresholds over wide intervals changes performance by less than 0.5 points.
Computational and Practical Characteristics
- Zero-shot Generalization: No fine-tuning is employed; robust across diverse domains and video styles (e.g., egocentric, internet video).
- Resource Requirements: Runtime is on the order of seconds per frame on an NVIDIA A6000 GPU, largely due to per-entity SAM2 tracking.
- Occlusion Robustness: Short occlusions are addressed by additional tubelet generation.
- Dominant Error Modes: False-positive masks from SAM2 and mislabeling by the LLM (e.g., object ambiguity such as smartphone vs. tape measure).
- Ethical Risks: Potential for misuse similar to other robust vision-language pipelines, including privacy concerns in surveillance contexts.
6. Qualitative Behavior and Output Structure
TubeletGraph demonstrates the semantic structure of transformations via state graphs. For example:
- "Pulling foil from a roll": SAM2.1 fails to track the narrow sheet after transformation; TubeletGraph's reasoning based on spatial proximity and semantic alignment recovers the fragment, and the LLM labels the event “pull out,” assigning "sheet of foil" as an object label.
- "Butterfly Emergence": The chrysalis mask disappears upon metamorphosis in SAM2.1; TubeletGraph recovers the new butterfly tubelet and the language module labels the event as “emerge” and the new object as “butterfly.”
Such outputs enable complex analysis, visualization, and querying of object transformation sequences within video.
TubeletGraph reframes the challenge of tracking through transformations as a discrete tubelet-selection task guided by spatial and semantic priors, and augments its outputs with structured, natural-language state-change graphs. This approach both improves the robustness of visual tracking under appearance shifts and deepens automatic understanding of transformation events.