TubeletGraph: Tracking Object Transformations
- TubeletGraph is a zero-shot vision system that tracks objects through drastic visual transformations by recovering missing tubelets and constructing a semantic state graph.
- It utilizes a four-stage pipeline: spatio-temporal partitioning of the video into tubelets, candidate-tubelet reasoning with spatial and semantic priors, LLM-powered event labeling, and state-graph construction.
- The approach achieves state-of-the-art results on benchmarks such as VOST-TAS, improving both tracking robustness and state-change detection accuracy over conventional methods.
TubeletGraph is a zero-shot vision system for tracking user-prompted objects through state transformations characterized by significant appearance changes, such as those caused by cuts, breaks, or metamorphoses. Unlike conventional tracking models that lose the target after radical visual change, TubeletGraph recovers missing objects after transformations and constructs a semantic state graph that represents the sequence of state changes across time. The system relies on spatial and semantic reasoning, as well as natural language understanding, to detect and describe transformation events, and it achieves state-of-the-art results on dedicated benchmarks.
1. Formal Definition and Task Overview
TubeletGraph addresses the "Track Any State" (TAS) task, which requires not only maintaining identity tracking of objects across radical transformations but also detecting, temporally localizing, and naming each transformation event. An input consists of a video and an initial mask at frame 1 indicating the target object. The system outputs a set of object tracks (tubelets), change points (frames where state transitions occur), a state graph with tubelets as nodes, and directed, verb-labeled edges representing the transformations. The challenge arises when the object splits, recomposes, or changes so significantly that standard tracking cannot reliably associate the before- and after-states.
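The input/output contract of the TAS task can be summarized as a small data structure. The sketch below is illustrative only; the class and field names (`Tubelet`, `StateChangeEvent`, `TASResult`) are hypothetical rather than taken from a TubeletGraph codebase.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Tubelet:
    """A partial track: one boolean mask (H, W) per frame, starting at start_frame."""
    track_id: int
    start_frame: int
    masks: list[np.ndarray]


@dataclass
class StateChangeEvent:
    """A detected transformation from one tubelet to a recovered one."""
    frame: int          # change point (frame index)
    source_id: int      # tubelet before the transformation
    target_id: int      # tubelet after the transformation
    verb: str           # e.g. "cut", "emerge"
    object_label: str   # e.g. "sheet of foil"


@dataclass
class TASResult:
    """Output of the Track Any State task for one prompted object."""
    tubelets: dict[int, Tubelet] = field(default_factory=dict)
    events: list[StateChangeEvent] = field(default_factory=list)
```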
2. Methodological Pipeline
TubeletGraph executes in four major stages:
2.1 Video Partitioning into Tubelets
- Spatio-temporal Partitioning: The entire video is exhaustively decomposed into “tubelets”—partial tracks such that every pixel in every frame belongs to exactly one tubelet.
- Segmentation (Frame 1): The CropFormer-Hornet-3X model produces non-overlapping entity masks for the first frame; the user-prompted mask is unioned with these, yielding the initial entity set $\mathcal{E}_1$.
- Tubelet Initialization: Each initial entity $e \in \mathcal{E}_1$ is tracked forward in time via SAM2.1-Large to obtain a tubelet $T_e$, forming the initial tubelet set $\mathcal{T}$.
- Coverage-Guided Tubelet Growth: For each subsequent frame $t > 1$, a new CropFormer segment $s$ launches a novel tubelet if its uncovered-area ratio exceeds a threshold $\tau_{\text{cov}}$; formally, if $\frac{|s \setminus \bigcup_{T \in \mathcal{T}} T_t|}{|s|} > \tau_{\text{cov}}$ (see the sketch after this list).
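A minimal sketch of this coverage test, assuming boolean NumPy masks for the segment and for each existing tubelet at frame $t$; the function name `should_start_tubelet` and the default threshold value are illustrative, not taken from the TubeletGraph implementation.

```python
import numpy as np


def should_start_tubelet(segment: np.ndarray,
                         tubelet_masks_at_t: list[np.ndarray],
                         tau_cov: float = 0.5) -> bool:
    """Return True if the CropFormer segment should spawn a new tubelet at frame t.

    The test checks whether the fraction of the segment's pixels not covered
    by any existing tubelet at this frame exceeds tau_cov (placeholder value).
    """
    area = segment.sum()
    if area == 0:
        return False
    covered = np.zeros_like(segment, dtype=bool)
    for mask in tubelet_masks_at_t:   # union of all tubelet masks at frame t
        covered |= mask
    uncovered_ratio = np.logical_and(segment, ~covered).sum() / area
    return bool(uncovered_ratio > tau_cov)
```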
2.2 Candidate-Tubelet Reasoning with Priors
- Candidate Selection: Tubelets that begin after the first frame (i.e., those launched by coverage-guided growth) are considered potential transformation products.
- Spatial Proximity Prior: Valid tubelets should appear spatially proximate to where the original object disappeared; the overlap between a candidate's initial mask and the target's masks from three alternative SAM2 segmentations must exceed a threshold $\tau_{\text{spatial}}$.
- Semantic Consistency Prior: Despite the visual change, the semantic features of a candidate and of the original object should remain correlated; CLIP-based pooling extracts a feature vector per tubelet, and the similarity between the candidate's and the original's features must exceed a threshold $\tau_{\text{sem}}$.
- Final Validation: Only tubelets satisfying both priors are accepted as valid continuations (see the sketch after this list).
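The two priors amount to a conjunction of an overlap test and a feature-similarity test. The sketch below assumes precomputed boolean masks and pooled feature vectors; the helper names (`iou`, `cosine`, `is_valid_continuation`), the use of IoU and cosine similarity, and the default thresholds are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity of two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))


def is_valid_continuation(candidate_mask: np.ndarray,
                          alt_target_masks: list[np.ndarray],
                          candidate_feat: np.ndarray,
                          target_feat: np.ndarray,
                          tau_spatial: float = 0.3,
                          tau_sem: float = 0.5) -> bool:
    """Accept a candidate tubelet only if it satisfies both priors.

    Spatial prior: the candidate's initial mask must overlap the region where
    the target disappeared (here: best IoU against three alternative SAM2 masks).
    Semantic prior: pooled features of candidate and target must stay similar.
    Thresholds and score definitions are placeholders, not the paper's values.
    """
    spatial_score = max((iou(candidate_mask, m) for m in alt_target_masks),
                        default=0.0)
    semantic_score = cosine(candidate_feat, target_feat)
    return bool(spatial_score > tau_spatial and semantic_score > tau_sem)
```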
2.3 State-Change Description and State Graph Construction
- Event Detection: Each recovered tubelet marks a state-change event at its start frame $t_s$.
- LLM-Powered Semantic Labeling: For each event, the system submits pre- and post-transformation crops (the originating mask and the resulting fragment) to GPT-4.1 (temperature 0), which outputs (i) an action verb describing the transformation and (ii) a textual label for each post-state fragment.
- Graph Assembly: Event tuples are organized into a directed state graph $G = (V, E)$, where nodes are tubelets and directed edges link pre- to post-transformation states, labeled with action verbs (see the sketch below).
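As an illustration of the resulting structure, here is a small sketch that assembles event tuples into a directed graph with networkx; the dictionary fields and the example event are hypothetical, and this is not TubeletGraph's released graph code.

```python
import networkx as nx


def build_state_graph(events: list[dict]) -> nx.DiGraph:
    """Assemble event tuples into a directed state graph.

    Nodes are tubelet ids; each directed edge carries the LLM-produced action
    verb and the frame of the change point, and the post-state node stores the
    object label.
    """
    graph = nx.DiGraph()
    for ev in events:
        graph.add_node(ev["source_id"])
        graph.add_node(ev["target_id"], label=ev["object_label"])
        graph.add_edge(ev["source_id"], ev["target_id"],
                       verb=ev["verb"], frame=ev["frame"])
    return graph


# Example: a roll of foil (tubelet 0) yields a pulled-out sheet (tubelet 3).
graph = build_state_graph([
    {"source_id": 0, "target_id": 3, "frame": 142,
     "verb": "pull out", "object_label": "sheet of foil"},
])
print(graph.edges(data=True))
```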
2.4 Implementation Details
All core modules are used off-the-shelf with no fine-tuning.
- Entity segmentation: CropFormer-Hornet-3X.
- Tubelet propagation: SAM2.1-Large.
- Semantic features: FC-CLIP-COCO pooling.
- Language module: GPT-4.1.
- Hyperparameters: the coverage threshold $\tau_{\text{cov}}$, the spatial proximity threshold $\tau_{\text{spatial}}$, and the semantic similarity threshold $\tau_{\text{sem}}$.
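For orientation, the components and hyperparameters above could be collected into a single configuration object. The sketch below is purely illustrative: the class and field names are hypothetical, and the threshold values are left unset because they are not stated in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TubeletGraphConfig:
    """Illustrative configuration; component names follow the text above,
    threshold values are placeholders (not specified in this section)."""
    entity_segmenter: str = "CropFormer-Hornet-3X"
    tracker: str = "SAM2.1-Large"
    feature_extractor: str = "FC-CLIP-COCO"
    language_model: str = "gpt-4.1"
    llm_temperature: float = 0.0
    tau_cov: Optional[float] = None       # coverage threshold for new tubelets
    tau_spatial: Optional[float] = None   # spatial proximity threshold
    tau_sem: Optional[float] = None       # semantic consistency threshold
```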
3. Benchmark: VOST-TAS
VOST-TAS extends the VOST benchmark with explicit, dense annotations of state changes.
| Characteristic | Details |
|---|---|
| Number of videos | 57 |
| Video duration | 22 s (60 fps) |
| Annotated transformations | 108 |
| Post-condition masks/labels | 293 |
| Event representation | Start/end frame, action verb, and post-condition masks with textual labels |
Annotations track start/end frames, ground-truth verbs, resulting masks, and their textual descriptions, providing a rigorous resource for spatio-temporal and semantic evaluation.
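To make this annotation structure concrete, here is a hypothetical sketch of one annotated transformation as a Python dictionary; the field names and file paths are illustrative and do not reflect the released VOST-TAS format.

```python
# Hypothetical shape of a single annotated transformation (field names and
# paths are illustrative, not the released VOST-TAS format).
example_event_annotation = {
    "video_id": "example_video",
    "start_frame": 310,              # frame where the transformation begins
    "end_frame": 355,                # frame where the transformation ends
    "verb": "cut",                   # ground-truth action verb
    "post_conditions": [             # one entry per resulting fragment
        {"mask_path": "masks/example_video/000355_frag0.png",
         "label": "half of the onion"},
        {"mask_path": "masks/example_video/000355_frag1.png",
         "label": "half of the onion"},
    ],
}
```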
4. Evaluation Metrics and Quantitative Results
Tracking and State-Change Metrics
| Metric | Description |
|---|---|
| $\mathcal{J}$ | Jaccard (region similarity) averaged over all frames |
| $\mathcal{J}_{tr}$ | Jaccard over the last 25% of frames, where tracks are typically lost after a transformation |
| Pixel precision, recall | Pixel-level precision and recall of predicted masks |
| Temporal precision, recall | Temporal localization precision/recall for detected state changes |
| Verb accuracy | Semantic accuracy of the predicted action verb, judged by GPT-4.1 |
| Object accuracy | Object label accuracy, requiring mask correspondence (IoU ≥ 0.5) plus a GPT-4.1 label match |
| Spatio-temporal recall | Fraction of ground-truth changes matched in both time and mask overlap |
| Overall recall | Spatio-temporal match plus correct verb and object label |
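As a concrete reading of the first two rows, here is a minimal NumPy sketch of per-video Jaccard over all frames and over the final 25% of frames (the regime where trackers typically lose transformed objects); the function names are illustrative.

```python
import numpy as np


def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between a predicted and a ground-truth boolean mask."""
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0


def video_jaccard(preds: list[np.ndarray], gts: list[np.ndarray]) -> tuple[float, float]:
    """Return (Jaccard over all frames, Jaccard over the last 25% of frames)."""
    per_frame = [jaccard(p, g) for p, g in zip(preds, gts)]
    tail = per_frame[int(0.75 * len(per_frame)):]
    return float(np.mean(per_frame)), float(np.mean(tail))
```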
Key Results
| Dataset | Method | $\mathcal{J}$ | $\mathcal{J}_{tr}$ |
|---|---|---|---|
| VOST val | SAM2.1 | 48.4 | 32.4 |
| VOST val | TubeletGraph | 51.0 (+2.6) | 36.9 (+4.5) |
| VSCOS val | TubeletGraph | +3 to +4 over SAM2.1 | |
| M³-VOS val | TubeletGraph | 74.2 | |
| M³-VOS val | ReVOS (for reference) | 75.6 | |
| M³-VOS val | SAM2.1 (for reference) | 71.3 | |
| DAVIS17 | TubeletGraph | 85.6 (best prior: 87.1) | |
For state-change detection (VOST-TAS):
| Metric | Value |
|---|---|
| Precision | 43.1% |
| Recall | 20.4% |
| Accuracy (verb) | 81.8% |
| Accuracy (object) | 72.3% |
| Spatio-temporal recall | 12.0% |
| Overall recall | 6.5% |
These results indicate that TubeletGraph improves tracking performance under appearance-changing transformations and supports more grounded state-change understanding.
5. Design Choices, Ablations, and Limitations
Ablation Findings
| Modification | Effect |
|---|---|
| Removing CropFormer (SAM automasks only) | Performance drops by 1.8 points |
| Replacing SAM2.1 with Cutie | Performance drops by 3.4 points |
| Swapping CLIP for DINOv2 | |
| Replacing GPT-4.1 with Qwen-2.5VL | Verb/object accuracy drops significantly |
Varying the prior thresholds over wide intervals changes performance by less than 0.5 points.
Computational and Practical Characteristics
- Zero-shot Generalization: No fine-tuning is employed; robust across diverse domains and video styles (e.g., egocentric, internet video).
- Resource Requirements: Runtime is on the order of seconds per frame on an NVIDIA A6000 GPU, largely due to per-entity SAM2 tracking.
- Occlusion Robustness: Short occlusions are addressed by additional tubelet generation.
- Dominant Error Modes: False-positive masks from SAM2 and mislabeling by the LLM (e.g., object ambiguity such as smartphone vs. tape measure).
- Ethical Risks: Potential for misuse similar to other robust vision-language pipelines, including privacy concerns in surveillance contexts.
6. Qualitative Behavior and Output Structure
TubeletGraph demonstrates the semantic structure of transformations via state graphs. For example:
- "Pulling foil from a roll": SAM2.1 fails to track the narrow sheet after transformation; TubeletGraph's reasoning based on spatial proximity and semantic alignment recovers the fragment, and the LLM labels the event “pull out,” assigning "sheet of foil" as an object label.
- "Butterfly Emergence": The chrysalis mask disappears upon metamorphosis in SAM2.1; TubeletGraph recovers the new butterfly tubelet and the language module labels the event as “emerge” and the new object as “butterfly.”
Such outputs enable complex analysis, visualization, and querying of object transformation sequences within video.
TubeletGraph reframes the challenge of tracking through transformations as a discrete tubelet-selection task guided by spatial and semantic priors, and augments its outputs with structured, natural-language state-change graphs. This approach both improves the robustness of visual tracking under appearance shifts and deepens automatic understanding of transformation events.