CoTracker3: Simplified Transformer for Point Tracking
- CoTracker3 is a transformer-based point tracking architecture that streamlines traditional 4D correlation and iterative attention with a unified transformer head.
- It leverages a semi-supervised training regime with pseudo-labelling, significantly reducing the required training data while achieving high accuracy and robustness.
- Extensive benchmarks demonstrate its superior handling of occlusions and dynamic scenes, with practical applications in areas like markerless biomechanical strain quantification.
CoTracker3 is a state-of-the-art, transformer-based point tracking architecture distinguished by its simplification of the prevailing "features + 4D correlation + iterative transformer" paradigm. Designed to address the persistent domain gap between synthetic and real video datasets, CoTracker3 achieves high accuracy and robustness in tracking both visible and occluded points while operating with significantly less data than previous models. A notable attribute is its effective semi-supervised training recipe, enabling scalable pseudo-labelling on real videos by leveraging outputs from frozen “teacher” trackers. This has yielded substantial data efficiency gains and facilitated deployments in previously inaccessible domains such as markerless strain quantification in biomechanical systems.
1. Architectural Design and Innovations
CoTracker3 inherits the general template of extracting convolutional multi-scale visual features, constructing local 4D correlation volumes, and employing transformer modules for iterative sequence modeling (Karaev et al., 2024, Asadbeygi et al., 4 Jan 2026). Key simplifications distinguish it from predecessors:
- 4D Correlation Module: Building on LocoTrack’s local 4D correlation concept, CoTracker3 applies a lightweight MLP to the raw correlation vectors, projecting them to a fixed dimension. This replaces LocoTrack’s bespoke correlation network and yields a smaller parameter footprint.
- No Global Matching: In contrast to TAPIR, BootsTAPIR, and LocoTrack, CoTracker3 eliminates the global all-pairs attention stage, relying wholly on iterative local updates and cross-track attention to resolve ambiguities.
- Unified Transformer Heads: All outputs—point location updates, confidences, and visibility—are predicted by a single transformer head, in contrast to CoTracker’s separate sub-networks.
- Simplified Token Grid: Each per-frame, per-point token concatenates the point’s appearance feature, its projected correlation embedding, the current confidence and visibility estimates, and a Fourier-encoded displacement, supplanting the more elaborate mixed token schemes used previously.
- Proxy Tokens & Factorized Attention: Cross-track interactions are efficiently realized via track-specific proxy tokens and factorized computation, keeping attention cost roughly linear in the number of tracked points.
Removed components include the global matching module, ad-hoc correlation networks, and separated occlusion heads, leading to a more streamlined, faster, and resource-efficient model (Karaev et al., 2024).
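The correlation-projection step above can be sketched concretely. The following is a minimal illustration, not the paper's exact configuration: the neighbourhood radius, layer widths, and random initialization are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_correlation(query_feat, feat_map, cy, cx, radius=3):
    """Dot-product correlation between one query feature and a
    (2r+1) x (2r+1) neighbourhood of the frame's feature map."""
    H, W, C = feat_map.shape
    ys = np.clip(np.arange(cy - radius, cy + radius + 1), 0, H - 1)
    xs = np.clip(np.arange(cx - radius, cx + radius + 1), 0, W - 1)
    patch = feat_map[np.ix_(ys, xs)]            # (2r+1, 2r+1, C)
    return patch.reshape(-1, C) @ query_feat    # flattened correlation vector

def mlp_project(corr_vec, W1, b1, W2, b2):
    """Two-layer MLP standing in for the lightweight projection that
    replaces LocoTrack's bespoke correlation network: raw correlations
    are mapped to a fixed-dimension slice of the track token."""
    h = np.maximum(corr_vec @ W1 + b1, 0.0)     # ReLU hidden layer
    return h @ W2 + b2

C, r, d = 64, 3, 128                            # illustrative sizes
feat_map = rng.standard_normal((32, 32, C)).astype(np.float32)
query = rng.standard_normal(C).astype(np.float32)

corr = local_correlation(query, feat_map, cy=10, cx=20, radius=r)
n = (2 * r + 1) ** 2                            # 49 correlation values
W1 = rng.standard_normal((n, 256)).astype(np.float32) * 0.02
b1 = np.zeros(256, np.float32)
W2 = rng.standard_normal((256, d)).astype(np.float32) * 0.02
b2 = np.zeros(d, np.float32)
token_slice = mlp_project(corr, W1, b1, W2, b2)
print(corr.shape, token_slice.shape)            # (49,) (128,)
```

The design point is that the projection is generic: no hand-crafted correlation network, just raw dot products followed by an MLP to a fixed token dimension.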
2. Semi-Supervised Training via Pseudo-Labelling
The core of CoTracker3’s data efficiency is its semi-supervised training regime, which proceeds in two phases:
- Supervised Pre-training: The model is initialized on synthetic video datasets (e.g., Kubric), with all ground-truth tracks and occlusion states available. Losses include an iterative Huber tracking loss and binary cross-entropy losses for confidence and visibility (Karaev et al., 2024).
- Pseudo-Labelling on Real Video: For unlabelled real videos (up to 100k 30-second Internet-style clips), pseudo-labels are generated using predictions from a uniform sample of frozen teacher models (CoTracker3 variants, CoTracker ECCV’24, TAPIR). SIFT keypoints are used to discover “good to track” queries for each batch.
- Loss Freezing Strategy: During pseudo-labelling, the fine-tuning loss penalizes only coordinate errors and leaves the confidence and visibility heads frozen; this avoids catastrophic forgetting of occlusion reasoning abilities.
- Scaling and Efficiency: Data ablations show that CoTracker3 matches or surpasses prior state-of-the-art (e.g., BootsTAPIR trained on 15 million videos) with as little as 15k pseudo-labelled real videos, a 1,000× reduction in required data (Karaev et al., 2024).
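The fine-tuning phase above can be sketched as a loss computation. This is a toy illustration under stated assumptions: the teacher sampling, Huber delta, and noise scale are placeholders, and the key point mirrored from the paper is that only coordinates are penalized while the confidence/visibility heads stay frozen.

```python
import numpy as np

def huber(err, delta=6.0):
    """Elementwise Huber penalty: quadratic near zero, linear in the tails."""
    a = np.abs(err)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

def pseudo_label_loss(student_xy, teacher_xy, delta=6.0):
    """Fine-tuning loss on real video: coordinates only. There is
    deliberately no confidence/visibility term here; those heads are
    frozen to avoid forgetting occlusion reasoning learned on synthetic data."""
    return huber(student_xy - teacher_xy, delta).mean()

# Toy example: an ensemble of frozen teachers; one teacher is sampled
# uniformly per batch to produce the pseudo-label tracks.
rng = np.random.default_rng(1)
teachers = [rng.normal(size=(5, 2)) for _ in range(4)]       # 4 frozen teachers
pseudo = teachers[rng.integers(len(teachers))]               # sampled pseudo-labels
student = pseudo + rng.normal(scale=0.1, size=pseudo.shape)  # student prediction
loss = pseudo_label_loss(student, pseudo)
print(round(float(loss), 4))
```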
3. Benchmark Performance and Ablations
CoTracker3 has been benchmarked on multiple tracking datasets with rigorous evaluation protocols, using the TAP-Vid (Kinetics, DAVIS, RGB-S), DynamicReplica, and RoboTAP suites. Key results include:
- AJ and OA Metrics: On TAP-Vid and RoboTAP, Average Jaccard (AJ) and Occlusion Accuracy (OA) surpassed or matched all previous models. For example, CoTracker3 (online) obtains AJ=76.1 after fine-tuning on 15k real videos, compared to BootsTAPIR’s AJ=75.0; on DynamicReplica, visible/occluded accuracies are 73.3/40.1 versus 69.0/28.0.
- Ablation Findings:
- Cross-track attention yields +5.1 pp improvement for occlusion cases.
- Ensemble use of four teacher models yields best generalization on pseudo-labels.
- Freezing confidence/visibility heads offers +3.9 pp OA on real-video fine-tuning (Karaev et al., 2024).
- Comparison with 3D-Grounded Trackers: On highly dynamic points, CoTracker3’s temporal modeling still leads, but 2-frame 3D pipelines (PointSt3R) close the gap, especially on re-identification-heavy datasets such as EgoPoints (Guerrier et al., 30 Oct 2025).
| Dataset | CoTracker3 AJ | PointSt3R AJ |
|---|---|---|
| TAP-Vid-DAVIS | 76.7 | 73.8 |
| RoboTAP | 78.8 | 78.6 |
| RGB-S | 82.8 | 87.0 |
| EgoPoints | 54.2 | 61.3 |
A consistent trend is a drop of several percentage points for CoTracker3 from static to dynamic tracking subsets, highlighting continued challenges for dynamic scenes (Guerrier et al., 30 Oct 2025).
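For readers unfamiliar with the AJ numbers quoted above, the metric can be sketched in simplified form. This follows the spirit of the TAP-Vid definition (a point is a true positive at threshold t if it is predicted visible, is actually visible, and lies within t pixels of ground truth, averaged over a set of pixel thresholds); the exact benchmark implementation differs in details such as query handling.

```python
import numpy as np

def average_jaccard(pred_xy, pred_vis, gt_xy, gt_vis,
                    thresholds=(1, 2, 4, 8, 16)):
    """Simplified Average Jaccard: mean over pixel thresholds of
    TP / (TP + FP + FN), where correctness requires both visibility
    agreement and positional error below the threshold."""
    dist = np.linalg.norm(pred_xy - gt_xy, axis=-1)
    jaccards = []
    for t in thresholds:
        within = dist < t
        tp = np.sum(pred_vis & gt_vis & within)      # correct, visible
        fp = np.sum(pred_vis & ~(gt_vis & within))   # predicted visible, wrong
        fn = np.sum(gt_vis & ~(pred_vis & within))   # visible but missed
        jaccards.append(tp / max(tp + fp + fn, 1))
    return float(np.mean(jaccards))

gt_xy = np.array([[0.0, 0.0], [10.0, 10.0], [5.0, 5.0]])
gt_vis = np.array([True, True, False])
pred_xy = gt_xy + np.array([[0.5, 0.0], [3.0, 0.0], [0.0, 0.0]])
pred_vis = np.array([True, True, False])
print(round(average_jaccard(pred_xy, pred_vis, gt_xy, gt_vis), 3))  # 0.733
```

The second point is only counted as correct at the 4 px threshold and above, which is why the score averages below 1.0.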
4. Occlusion Handling and Model Variants
Occlusion modeling in CoTracker3 is integrated into the main transformer workflow, producing a continuous visibility score at each iteration. Occlusion supervision employs binary cross-entropy loss, and inference uses the product of confidence and visibility thresholds. Cross-track attention enables observable tracks to inform occluded queries, improving robustness under challenging conditions (e.g., persistent or partial occlusions) (Karaev et al., 2024).
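The inference-time gating described above can be written in a few lines. The threshold values below are illustrative placeholders, not the model's calibrated settings.

```python
import numpy as np

def gate_visibility(confidence, visibility, tau_conf=0.5, tau_vis=0.5):
    """A point is reported as visible only when both the confidence and
    visibility scores clear their thresholds, i.e. the product of two
    indicator gates. Thresholds here are illustrative."""
    return (confidence > tau_conf) & (visibility > tau_vis)

conf = np.array([0.9, 0.8, 0.2, 0.7])
vis  = np.array([0.95, 0.3, 0.9, 0.6])
print(gate_visibility(conf, vis).tolist())  # [True, False, False, True]
```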
Two operational modes are available:
- Online: Processes long sequences causally through a fixed-length sliding window, supports indefinite real-time inference because only the current window is forwarded, and is suitable for live or streaming applications.
- Offline: Handles an entire trimmed sequence in one bidirectional pass, improving occlusion interpolation at the expense of higher memory consumption, often limited by available GPU resources.
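The online mode's frame schedule can be sketched as follows. The window length and stride are illustrative assumptions (overlapping windows are what let tracks hand off across window boundaries); the actual model processes features, not just indices.

```python
def sliding_windows(num_frames, window=16, stride=8):
    """Frame-index schedule for a causal sliding-window tracker:
    overlapping windows advance forward only, so memory stays constant
    no matter how long the stream runs. Sizes are illustrative."""
    start = 0
    while start < num_frames:
        yield list(range(start, min(start + window, num_frames)))
        if start + window >= num_frames:
            break
        start += stride

wins = list(sliding_windows(40, window=16, stride=8))
print(len(wins), wins[0][:3], wins[-1][-1])  # 4 [0, 1, 2] 39
```

The half-window overlap means every frame (except at the boundaries) is seen twice, giving each track context from both its past and its immediate future within the stream.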
5. Practical Implementation and Computational Characteristics
Training and inference leverage scalable deep learning frameworks (PyTorch Lightning, DDP), using mixed precision (bfloat16) and gradient norm clipping to stabilize optimization. Pre-training on synthetic video typically uses 32×A100 GPUs with a batch size of 1 video, followed by real-data fine-tuning on 8 GPUs. Inference throughput on high-resolution frames reaches interactive frame rates for moderate grid sizes (around 50 points) and declines substantially for much denser point grids (Karaev et al., 2024, Asadbeygi et al., 4 Jan 2026).
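Of the stabilization tricks mentioned above, gradient norm clipping is easy to make concrete. The sketch below implements the same contract as `torch.nn.utils.clip_grad_norm_` in plain numpy for illustration; the `max_norm` value is a placeholder, not the paper's setting.

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    """Global gradient-norm clipping: if the joint L2 norm of all
    parameter gradients exceeds max_norm, rescale every gradient by
    the same factor so the joint norm equals max_norm."""
    total = np.sqrt(sum(float(np.sum(g**2)) for g in grads))
    scale = min(1.0, max_norm / max(total, 1e-12))
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([0.0])]   # joint norm = 5
clipped, norm = clip_grad_norm(grads, max_norm=1.0)
print(norm, round(float(np.sqrt(sum(np.sum(g**2) for g in clipped))), 6))
```

Clipping globally (rather than per-tensor) preserves the direction of the update while bounding its magnitude, which matters when occasional hard batches produce gradient spikes.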
6. Scientific Impact and Specialized Applications
CoTracker3 has enabled markerless, speckle-free tracking for fine-scale strain mapping in biomedical imaging, exemplified by its use in quantifying strain fields during active rat bladder contraction. The method achieves:
- Low pixel-tracking RMSE (e.g., 1.31 px under large synthetic deformations) and correspondingly small strain errors,
- Reliable tracking on natural, low-contrast tissue textures without any fine-tuning.
In direct comparison with conventional digital image correlation (DIC) networks, CoTracker3 demonstrates robust generalization on both laboratory and natural scenes, outperforming supervised DIC on large, un-speckled deformations. In the examined bladder application, the method revealed spatially heterogeneous, anisotropic strains with statistically significant directional differences, capturing phenomena (fold formation, buckling) that conventional methods failed to resolve (Asadbeygi et al., 4 Jan 2026).
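Going from tracked points to a strain field involves differentiating the displacement field. The sketch below computes Green-Lagrange strain components from displacements on a regular tracking grid via finite differences; this is a generic illustration and may differ from the exact strain formulation used in the bladder study.

```python
import numpy as np

def green_lagrange_strain(disp_x, disp_y, spacing=1.0):
    """Green-Lagrange strain components from a displacement field
    sampled on a regular grid of tracked points.
    disp_x, disp_y: (H, W) arrays of per-point displacements."""
    dux_dy, dux_dx = np.gradient(disp_x, spacing)   # axis 0 = y, axis 1 = x
    duy_dy, duy_dx = np.gradient(disp_y, spacing)
    Exx = dux_dx + 0.5 * (dux_dx**2 + duy_dx**2)
    Eyy = duy_dy + 0.5 * (dux_dy**2 + duy_dy**2)
    Exy = 0.5 * (dux_dy + duy_dx + dux_dx * dux_dy + duy_dx * duy_dy)
    return Exx, Eyy, Exy

# Sanity check: a uniform 1% stretch along x gives Exx ~ 0.01005
# (0.01 + 0.5 * 0.01^2) and zero Eyy/Exy.
H, W = 8, 8
xs = np.arange(W, dtype=float)
disp_x = np.tile(0.01 * xs, (H, 1))   # u_x = 0.01 * x
disp_y = np.zeros((H, W))
Exx, Eyy, Exy = green_lagrange_strain(disp_x, disp_y)
print(round(float(Exx.mean()), 5))    # 0.01005
```

Because the tracker provides dense, long-range correspondences directly, no speckle pattern is needed to make the displacement field differentiable.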
7. Comparisons, Limitations, and Prospects
Quantitative and qualitative evaluation against alternative approaches such as PointSt3R demonstrates that while CoTracker3’s multi-frame temporal context excels in dynamic, occluded, or ambiguous settings, 2-frame 3D-grounded architectures can match or surpass its accuracy in cases dominated by large camera motion and re-identification requirements. A significant finding is CoTracker3’s accuracy gap of several percentage points between static and dynamic subsets on standard benchmarks, signifying an open challenge despite its temporal modeling capacity (Guerrier et al., 30 Oct 2025).
Recommendations for future point tracking architectures, derived from these comparative studies, include integrating 3D geometric grounding, dynamic-point specific losses, explicit visibility modeling, and continued exploitation of synthetic data for training and evaluation of long-term correspondence robustness.
References:
- (Karaev et al., 2024): CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos
- (Guerrier et al., 30 Oct 2025): PointSt3R: Point Tracking through 3D Grounded Correspondence
- (Asadbeygi et al., 4 Jan 2026): Quantifying Local Strain Field and Deformation in Active Contraction of Bladder Using a Pretrained Transformer Model: A Speckle-Free Approach