
DiTracker: Robust Video Point Tracking

Updated 30 December 2025
  • DiTracker is a point tracking framework designed to robustly localize corresponding points across video frames using spatio-temporal attention and transformer-based features.
  • It employs a dual-backbone strategy that fuses global consistency from Diffusion Transformers with local precision from ResNet through cost-level fusion.
  • Leveraging LoRA-based fine-tuning, DiTracker achieves state-of-the-art performance with substantially less labeled data and lower computational overhead in dynamic tracking scenarios.

DiTracker is a point tracking framework designed to robustly localize corresponding points across video frames. Leveraging video Diffusion Transformers (DiTs) pretrained with spatio-temporal attention on large-scale real-world data, DiTracker addresses the limitations of prior convolutional approaches by introducing temporal coherence and effective handling of dynamic motions and frequent occlusions. Its architecture integrates a dual-backbone design—coupling the global consistency of DiT features with the local precision of ResNet—along with specialized query-key matching, low-rank adaptation, and cost-level fusion strategies. On benchmarks such as ITTO-MOSE and TAP-Vid, DiTracker achieves state-of-the-art performance, demonstrating resilience in challenging tracking conditions with orders-of-magnitude less labeled data and smaller batch sizes than contemporary baselines (Son et al., 23 Dec 2025).

1. Architectural Design

The core of DiTracker is founded on the Video Diffusion Transformer (DiT), featuring a two-stage encoding and attention mechanism:

  • VAE Encoding: The video input $X \in \mathbb{R}^{F \times H \times W \times 3}$ is mapped into latent representations $\mathbf{z}_{\rm video} \in \mathbb{R}^{f \times h \times w \times d_{\rm video}}$ via a variational autoencoder. This latent encoding preserves the spatio-temporal structure necessary for coherent matching.
  • Spatio-Temporal Transformer: The “denoising” transformer $v_\theta$ applies 3D multi-head self-attention across space and time. Each layer attends jointly over all positions $(h, w, f)$, using the attention formulation (a minimal sketch appears after this list):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V$$

where $Q, K, V \in \mathbb{R}^{N \times d_k}$ and $N = hwf$.

  • Dual-Backbone Strategy: In parallel, DiTracker derives frame-wise features from a standard ResNet-50 backbone. Both the DiT and ResNet streams compute independent matching costs, with DiT emphasizing global, long-range correspondences and ResNet contributing fine-grained local detail. Feature fusion is deliberately postponed to the matching-cost level to preserve the integrity of DiT’s cost distributions, giving this dual-backbone workflow complementary strengths in robustness and detail.
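The following is a minimal PyTorch-style sketch of the joint 3D attention used by the DiT stream. The class name, tensor shapes, and toy dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    """Joint 3D self-attention over all (frame, height, width) positions.

    Illustrative sketch: every latent token attends to every other token
    across space and time, which is what gives the DiT stream its
    long-range temporal consistency.
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, f, h, w, d) latent video from the VAE encoder
        B, f, h, w, d = z.shape
        tokens = z.reshape(B, f * h * w, d)          # N = f*h*w tokens
        out, _ = self.attn(tokens, tokens, tokens)   # softmax(QK^T / sqrt(d_k)) V
        return out.reshape(B, f, h, w, d)

# Toy usage: 4 latent frames at 8x8 spatial resolution, 64-dim features.
z = torch.randn(1, 4, 8, 8, 64)
y = SpatioTemporalAttention(dim=64)(z)
print(y.shape)  # torch.Size([1, 4, 8, 8, 64])
```

Because every token attends to every other token across frames, correspondences remain temporally coherent even under occlusion, at the cost of quadratic attention over $N = f h w$ tokens.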

2. Query–Key Attention Matching Mechanism

DiTracker's point tracking mechanism relies on precise extraction and comparison of local features in the attention maps:

  • Local Neighborhood Embedding: For a query point $\mathbf{p} = (x, y)$ in frame $i$, DiTracker extracts the corresponding spatio-temporal query and key projections $Q_i^{l,m}, K_i^{l,m} \in \mathbb{R}^{hw \times d_{\rm head}}$ from the $l$-th layer and $m$-th head of the DiT.
  • Neighborhood Sampling: Around $\mathbf{p}$, a local neighborhood of $(2\Delta+1)^2$ positions is bilinearly sampled for multi-scale matching:

$$q_i^s = \left\{ Q_i^s\!\left( \tfrac{x}{r\,2^{s-1}} + \delta_x,\; \tfrac{y}{r\,2^{s-1}} + \delta_y \right) \right\}_{\|\delta\|_\infty \leq \Delta}$$

Analogous sampling on the destination frame $j$ yields $k_j^s$.

  • Local 4D Cost Computation: The matching cost at scale $s$ is

$$\mathcal{C}_{i,j}^{s,\mathrm{DiT}} = \mathrm{softmax}\!\left( \frac{q_i^s \, (k_j^s)^\top}{\sqrt{d_{\rm head}}} \right) \in \mathbb{R}^{(2\Delta+1)^2 \times (2\Delta+1)^2}$$

with the softmax taken over the key neighborhood, mirroring the original DiT attention regime.
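A sketch of the neighborhood sampling and local cost computation, assuming PyTorch; the function names, feature-map shapes, and the use of `grid_sample` for bilinear sampling are illustrative choices rather than the paper's exact code.

```python
import torch
import torch.nn.functional as F

def sample_neighborhood(feat, pt, delta=3):
    """Bilinearly sample a (2*delta+1)^2 window of feature vectors around pt.

    feat: (1, d, h, w) query/key projection map for one frame (illustrative).
    pt:   (x, y) location already mapped into the feature-map coordinate frame.
    """
    d, h, w = feat.shape[1:]
    offsets = torch.arange(-delta, delta + 1, dtype=torch.float32)
    dy, dx = torch.meshgrid(offsets, offsets, indexing="ij")
    xs, ys = pt[0] + dx, pt[1] + dy
    # Normalize to [-1, 1], grid_sample's coordinate convention (align_corners=True).
    grid = torch.stack([2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1], dim=-1)
    patch = F.grid_sample(feat, grid.unsqueeze(0), align_corners=True)
    return patch.reshape(d, -1).T                    # ((2*delta+1)^2, d)

def local_cost(q_patch, k_patch, d_head=64):
    """Local 4D matching cost: softmax over the key neighborhood, as in DiT attention."""
    logits = q_patch @ k_patch.T / d_head ** 0.5     # (N_q, N_k)
    return logits.softmax(dim=-1)

# Toy usage with random query/key maps (hypothetical shapes).
Q = torch.randn(1, 64, 60, 90)   # query projections, frame i
K = torch.randn(1, 64, 60, 90)   # key projections, frame j
q = sample_neighborhood(Q, pt=(30.0, 20.0))
k = sample_neighborhood(K, pt=(32.5, 21.0))
print(local_cost(q, k).shape)    # torch.Size([49, 49]) for delta=3
```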

3. LoRA-Based Fine-Tuning Protocol

To efficiently adapt large pretrained DiT models, DiTracker implements low-rank adaptation (LoRA):

  • Parameterization: Instead of updating the full attention weights $W \in \mathbb{R}^{d \times d}$, the adjusted weights are

$$W + \Delta W = W + BA, \quad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times d}$$

with $r \ll d$ (e.g., LoRA rank $r = 128$). Only $B$ and $A$ are learned; $W$ remains fixed (see the sketch after this list).

  • Benefits: LoRA dramatically reduces the trainable parameter count, lowering GPU memory and computational requirements (just $2dr$ parameters per layer). Crucially, it preserves the spatio-temporal priors encoded in the pretrained DiT, mitigating catastrophic forgetting for temporal correspondence.
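A minimal sketch of the low-rank update in PyTorch, assuming the adapter wraps an existing attention projection; the initialization and scaling follow generic LoRA conventions and are not values reported for DiTracker.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained projection W plus a trainable low-rank update BA.

    Sketch of the adaptation described above; only A and B receive gradients,
    so a d x d attention projection needs just 2*d*r extra parameters.
    """
    def __init__(self, base: nn.Linear, rank: int = 128, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # keep pretrained W fixed
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # B = 0, so the adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Wrapping a (hypothetical) pretrained attention projection:
pretrained_q_proj = nn.Linear(1024, 1024)
adapted = LoRALinear(pretrained_q_proj, rank=128)
print(sum(p.numel() for p in adapted.parameters() if p.requires_grad))  # 2 * 1024 * 128
```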

4. Cost Fusion and Embedding Strategy

A distinctive element of DiTracker is cost-level fusion, which avoids potentially disruptive early-layer integration:

  • Separate Matching Costs: Both DiT and ResNet compute independent matching costs, each subjected to its own softmax normalization.
  • Fusion Procedure: At each scale $s$, the cost vectors are flattened and concatenated:

$$\mathcal{C}_{i,j}^{s,\mathrm{fused}} = \left[ \mathrm{Flatten}\big(\mathcal{C}_{i,j}^{s,\mathrm{DiT}}\big),\; \mathrm{Flatten}\big(\mathcal{C}_{i,j}^{s,\mathrm{ResNet}}\big) \right]$$

Aggregated across scales $s = 1, \ldots, S$, these are projected via a multilayer perceptron (MLP) into a unified cost embedding $E_j \in \mathbb{R}^{d_E}$ (a minimal sketch follows this list).

  • Adaptive Weighting: This approach enables the MLP to learn how to weight the global (DiT) versus local (ResNet) costs on a per-point, per-frame basis, replacing manual scalar-weight tuning.
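A sketch of cost-level fusion, assuming PyTorch; the MLP width, embedding size, and the single concatenation over all scales are illustrative simplifications of the per-scale fusion described above.

```python
import torch
import torch.nn as nn

class CostFusion(nn.Module):
    """Cost-level fusion: flatten and concatenate DiT and ResNet costs per scale,
    then project everything with an MLP into a single cost embedding.

    Shapes and layer sizes are illustrative, not the paper's exact configuration.
    """
    def __init__(self, num_scales: int = 4, delta: int = 3, d_embed: int = 256):
        super().__init__()
        n = (2 * delta + 1) ** 2
        in_dim = num_scales * 2 * n * n     # two streams, S scales, n x n costs each
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, d_embed)
        )

    def forward(self, dit_costs, resnet_costs):
        # dit_costs / resnet_costs: lists of (n, n) cost matrices, one per scale
        flat = [c.flatten() for c in dit_costs] + [c.flatten() for c in resnet_costs]
        return self.mlp(torch.cat(flat))    # (d_embed,) fused cost embedding

# Toy usage with 4 scales and delta=3 (49 x 49 local costs per stream).
dit = [torch.rand(49, 49).softmax(-1) for _ in range(4)]
res = [torch.rand(49, 49).softmax(-1) for _ in range(4)]
print(CostFusion()(dit, res).shape)  # torch.Size([256])
```

Letting the MLP see both streams' raw cost distributions is what allows the learned, per-point weighting described above, rather than a single hand-tuned scalar weight.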

5. Training Regimen and Implementation Details

DiTracker deploys a compact and computationally efficient training methodology:

  • Data: Training occurs on the Kubric MOVi-F synthetic point-tracking sequences.
  • Schedule: 36,000 total iterations; batch size of 4 (notably, 8× fewer samples per batch than baselines using batches of 32–64); maximum video sequence length of 46 frames at a resolution of $480 \times 720$.
  • Hyperparameters: Scales $S = 4$, correlation radius $\Delta = 3$, model stride of 4, LoRA rank $r = 128$, head dimension $d_{\rm head} = 64$; optimizer: AdamW with decoupled weight decay. No specialized augmentation is used beyond standard Kubric randomization.
  • Small Batch Performance: The strong initial correspondence provided by the pretrained DiT’s motion priors allows effective adaptation from far fewer synthetic samples (see the configuration sketch below).
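For concreteness, a sketch of the reported training configuration; the learning rate and weight decay are placeholders (the text only states that AdamW with decoupled weight decay is used), and the dummy `model` stands in for the trainable LoRA and fusion parameters.

```python
import torch
import torch.nn as nn

# Values taken from the reported setup; lr and weight_decay are illustrative
# placeholders, not figures stated in the text.
config = {
    "iterations": 36_000,
    "batch_size": 4,
    "max_frames": 46,
    "resolution": (480, 720),
    "num_scales": 4,      # S
    "corr_radius": 3,     # Delta
    "stride": 4,          # model stride
    "lora_rank": 128,
    "d_head": 64,
}

model = nn.Linear(8, 8)   # placeholder for the LoRA + fusion parameters
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=1e-4)
print(config["iterations"], len(trainable))
```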

6. Benchmark Results, Robustness, and Ablations

DiTracker demonstrates resilience and superiority across several difficult tracking scenarios:

Benchmark Performance Summary

| Dataset | Metric | DiTracker | Prior SOTA |
|---|---|---|---|
| ITTO-MOSE | AJ | 43.9% | 42.4% |
| ITTO-MOSE | $\delta_{\rm avg}^x$ | 57.9% | 55.8% |
| ITTO-MOSE | OA | 79.3% | 80.4% |
| TAP-Vid-DAVIS | AJ | 62.7% | |
| TAP-Vid-DAVIS | $\delta_{\rm avg}^x$ | 77.5% | |
| TAP-Vid-DAVIS | OA | 85.2% | |
| TAP-Vid-Kinetics | AJ | 54.3% | |
| TAP-Vid-Kinetics | $\delta_{\rm avg}^x$ | 67.4% | |
| TAP-Vid-Kinetics | OA | 84.5% | |

  • On ITTO-MOSE, DiTracker trained for 36k steps solely on synthetic data surpasses CoTracker3, which used 65k steps plus 15k real videos, in Average Jaccard (AJ) and tracking precision.
  • On the TAP-Vid benchmarks, DiTracker matches or exceeds prior state-of-the-art results.
  • Under severe motion blur (ImageNet-C, severity 5), DiTracker’s video DiT stream achieves a $\delta_{\rm avg}^x$ of 42.5–50.7%, compared to 27.9% for ResNet-only tracking, a gain of roughly 15 percentage points or more.

Ablation Insights

  • LoRA fine-tuning contributes a +14.0% improvement in AJ over a frozen DiT, and cost fusion adds a further +4.0% AJ.
  • Fusion at the cost level (as opposed to feature-level fusion or simple cost summation) preserves the discrete matching distributions and yields the best precision.

Limitations

  • DiTracker’s inference latency (2.4 s for 100 points) is considerably higher than that of pure ResNet trackers (0.08 s), owing to the size and memory footprint of the DiT. Distillation or model pruning may alleviate this bottleneck.

7. Implications and Future Directions

DiTracker establishes video Diffusion Transformers with full 3D attention as a robust foundation for temporal tracking tasks. LoRA-based adaptation enables domain transfer with minimal computational overhead and without disrupting pretrained priors. Cost-level fusion emerges as a critical design choice for balancing global and local correspondence, as evidenced by its improvements over feature-level or simple score fusion.

Potential directions include distilling or pruning DiT layers for increased efficiency, adaptive prompt-tuning for domain-specific applications (e.g., robotics, medical imaging), and generalizing the architecture for dense optical flow, video object segmentation, or 4D reconstruction leveraging the same spatio-temporal attention backbones (Son et al., 23 Dec 2025).

A plausible implication is that the paradigm of pretrained spatio-temporal transformers, fine-tuned via low-rank adaptation and combined with cost-level fusion, may extend beyond point tracking to a range of video correspondence and shape analysis tasks, where both high-level temporal consistency and low-level spatial precision are essential.
