- The paper introduces a novel zero-shot tracking method using pretrained video diffusion models to propagate a colored marker through video frames.
- It leverages counterfactual modeling and negative prompting via SDEdit to ensure robust marker propagation and effective occlusion handling.
- Experimental results show the best AJ among zero-shot methods on TAP-Vid and competitive occlusion accuracy, highlighting how tracking can be elicited from off-the-shelf generative models.
Point Prompting: Counterfactual Tracking with Video Diffusion Models
Introduction
This paper introduces a novel approach for zero-shot point tracking in videos by leveraging pretrained video diffusion models. The method exploits the generative capabilities of these models to propagate a visually distinctive marker (a colored dot) through video frames, enabling the estimation of a point's trajectory without any additional training. The approach is grounded in counterfactual modeling, where the input video is perturbed by marking the query point, and the model is prompted to generate future frames that maintain the marker's presence. The technique demonstrates strong performance on the TAP-Vid benchmark, outperforming previous zero-shot methods and approaching the accuracy of specialized self-supervised trackers.
Figure 1: Prompting a diffusion model for tracking by inserting a colored dot in the first frame and propagating it through subsequent frames using SDEdit.
Methodology
Video Diffusion Model Manipulation
The core of the method is the manipulation of pretrained video diffusion models via SDEdit: the query point is marked with a colored dot in the first frame, partial noise is added to the video, and the model denoises it so that the dot is carried into subsequent frames while the underlying scene is preserved. Negative prompting discourages generations in which the marker fades or disappears. A minimal sketch of this step follows.
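To make this concrete, here is a minimal sketch of SDEdit-style marker propagation with negative-prompt guidance. The model interface (`model(x, t, text)` returning predicted noise), the prompts, and the DDIM-style update are illustrative assumptions rather than the paper's implementation; the defaults mirror the reported denoising strength of 0.5 and guidance weight λ = 8.

```python
import torch

def propagate_marker(video, model, alphas_cumprod, strength=0.5, guidance=8.0,
                     prompt="a video with a small colored dot on the object",
                     negative_prompt="the colored dot fades or disappears"):
    """SDEdit-style sketch. `video` is a (T, C, H, W) tensor in [-1, 1] with the
    marker already painted onto the query point in frame 0; `model(x, t, text)`
    is assumed to return the predicted noise. Prompts are illustrative."""
    t0 = min(int(strength * len(alphas_cumprod)), len(alphas_cumprod) - 1)
    a0 = alphas_cumprod[t0]
    x = a0.sqrt() * video + (1 - a0).sqrt() * torch.randn_like(video)  # partial noising
    for t in range(t0, -1, -1):  # a real sampler would subsample these steps
        # classifier-free guidance with a negative prompt so the marker is kept, not erased
        eps_neg = model(x, t, negative_prompt)
        eps = eps_neg + guidance * (model(x, t, prompt) - eps_neg)
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else alphas_cumprod.new_tensor(1.0)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # predicted clean video
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps   # DDIM-style update
    return x  # video with the dot propagated through subsequent frames
```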
Tracking Pipeline
The tracking pipeline consists of several stages:
- Color-Based Tracking: The marker's position is detected in each frame using color thresholding in HSV space, with a local search window centered on the previous detection.
- Occlusion Handling: If the marker is not detected, the last known position is propagated and the search radius is expanded until the marker reappears (both steps are sketched in code after this list).
- Coarse-to-Fine Refinement: Initial tracks are refined by inpainting only the regions near the predicted marker positions, correcting for spatial misalignments introduced during denoising.
- Color Rebalancing: The video's color distribution is adjusted to suppress the marker's color in the background, reducing false positives during occlusion.
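A minimal sketch of the color-based detector and the occlusion fallback, assuming OpenCV; the HSV bounds, window logic, and function names are illustrative rather than the paper's:

```python
import cv2
import numpy as np

def detect_marker(frame_bgr, prev_xy, radius, hsv_lo, hsv_hi):
    """Threshold the marker's color inside a window centered on the previous detection."""
    h, w = frame_bgr.shape[:2]
    x, y = prev_xy
    x0, x1 = max(0, x - radius), min(w, x + radius + 1)
    y0, y1 = max(0, y - radius), min(h, y + radius + 1)
    hsv = cv2.cvtColor(frame_bgr[y0:y1, x0:x1], cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, hsv_lo, hsv_hi)
    if mask.sum() == 0:
        return None                                    # marker not visible in the window
    ys, xs = np.nonzero(mask)
    return int(x0 + xs.mean()), int(y0 + ys.mean())    # centroid of the detected color

def track_point(frames, query_xy, base_radius=2,
                hsv_lo=(140, 120, 120), hsv_hi=(170, 255, 255)):
    """Frame-by-frame tracking: on a miss (occlusion), keep the last known position
    and grow the search radius until the marker reappears."""
    lo, hi = np.array(hsv_lo), np.array(hsv_hi)
    traj, pos, radius = [query_xy], query_xy, base_radius
    for frame in frames[1:]:
        hit = detect_marker(frame, pos, radius, lo, hi)
        if hit is None:
            radius += base_radius              # expand the window while occluded
        else:
            pos, radius = hit, base_radius     # reset once the marker is found again
        traj.append(pos)
    return traj
```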
Figure 3: Tracking enhancements including color rebalancing and inpainting-based refinement for improved accuracy.
Experimental Results
Quantitative Evaluation
The method is evaluated on the TAP-Vid DAVIS and Kinetics splits using several recent video diffusion models (Wan2.1, Wan2.2, CogVideoX). Key findings include:
- Zero-Shot Superiority: The proposed approach achieves an AJ score of 42.21 on TAP-Vid DAVIS, outperforming all other zero-shot baselines and even surpassing some self-supervised methods (the AJ metric is sketched below this list).
- Occlusion Robustness: Occlusion accuracy (OA) is competitive with supervised trackers, indicating strong object permanence capabilities.
- Model Scaling: Larger and higher-fidelity video diffusion models (Wan2.1-14B, Wan2.2-14B) yield improved tracking performance.
- Resolution Impact: Upscaling input videos enhances tracking accuracy, with higher-resolution inputs better aligning with the model's training distribution.
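For reference, TAP-Vid's average Jaccard (AJ) averages, over pixel thresholds of 1, 2, 4, 8, and 16 at 256×256 resolution, the ratio of true positives to true positives plus false positives plus false negatives. The sketch below follows that standard definition and is not the benchmark's official code:

```python
import numpy as np

def average_jaccard(pred_xy, pred_vis, gt_xy, gt_vis, thresholds=(1, 2, 4, 8, 16)):
    """pred_xy / gt_xy: (N, T, 2) point coordinates at 256x256 resolution;
    pred_vis / gt_vis: (N, T) boolean visibility flags."""
    dist = np.linalg.norm(pred_xy - gt_xy, axis=-1)
    jaccards = []
    for d in thresholds:
        within = dist <= d
        tp = (gt_vis & pred_vis & within).sum()
        fp = (pred_vis & ~(gt_vis & within)).sum()   # predicted visible, but wrong or occluded in GT
        fn = (gt_vis & ~(pred_vis & within)).sum()   # visible in GT, but missed or too far
        jaccards.append(tp / max(tp + fp + fn, 1))
    return float(np.mean(jaccards))
```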

Figure 4: Ablation results showing the impact of each pipeline stage and the effect of denoising strength and search radius on tracking performance.
Qualitative Analysis
Generated videos demonstrate consistent propagation of the marker, with successful recovery after long occlusions and robust tracking across diverse scenes. Failure cases are primarily attributed to out-of-distribution inputs, ambiguous object boundaries, or insufficient texture.
Figure 5: Point propagation examples showing the marker's trajectory through occlusions and complex motion.
Figure 6: Tracking results visualized as the marker and its trajectory over time, including occlusion handling.
Figure 7: Generation failures such as stationary points, symmetry confusion, and marker disappearance.
Figure 8: Qualitative results on TAP-Vid Kubric, highlighting both successful and failed point propagation in synthetic scenes.
Implementation Details
- Computational Requirements: Generating a 50-frame video for a single query point requires 7–30 minutes on A40/L40S GPUs, depending on model size.
- Hyperparameters: Optimal performance is achieved with a denoising strength of 0.5, a search radius of 2 pixels, and a guidance weight of λ = 8 (collected into a configuration sketch after this list).
- Color Detection: HSV color space is preferred for marker detection; LAB yields slightly lower accuracy.
- Pipeline Efficiency: The method is not optimized for speed but could be distilled into a more efficient model or extended to track multiple points simultaneously.
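For convenience, the reported settings could be bundled into a single configuration object; the field names below are ours, not the paper's:

```python
from dataclasses import dataclass

@dataclass
class TrackerConfig:
    denoise_strength: float = 0.5  # fraction of the noise schedule used by SDEdit
    search_radius: int = 2         # pixels; local window for color detection
    guidance_weight: float = 8.0   # lambda for negative-prompt guidance
    color_space: str = "HSV"       # LAB was reported to be slightly less accurate
    num_frames: int = 50           # frames generated per query point
```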
Limitations
- Per-Point Generation: Each tracked point requires a separate video generation, limiting scalability.
- Failure Modes: The model may fail to interpret the marker as attached to the object, especially in synthetic or low-texture scenes.
- Generalization: Performance drops on synthetic datasets (e.g., Kubric) due to domain mismatch with training data.
Implications and Future Directions
This work demonstrates that pretrained video diffusion models possess emergent tracking capabilities that can be elicited via visual prompting. The approach unifies generative modeling and motion analysis, suggesting that generative models can serve as a source of pretraining for tracking tasks. Future research may explore:
- Efficient Distillation: Compressing the tracking behavior into lightweight models for real-time applications.
- Multi-Point Tracking: Extending the method to simultaneously track multiple points.
- Generalization: Improving robustness to synthetic and out-of-distribution videos.
- Prompting Schemes: Developing more sophisticated visual prompts for other video understanding tasks.
Conclusion
The paper establishes that video diffusion models, when prompted with simple visual cues, can perform zero-shot point tracking with competitive accuracy. The method leverages counterfactual modeling and negative prompting to maintain marker visibility, and achieves strong results on standard benchmarks. This approach opens new avenues for adapting generative models to video analysis tasks and highlights the potential for further integration of generation and tracking in computer vision.