ODTrack: Online Dense Temporal Token Learning for Visual Tracking (2401.01686v1)
Abstract: Online contextual reasoning and association across consecutive video frames are critical for perceiving instances in visual tracking. However, most current top-performing trackers rely on sparse temporal relationships between reference and search frames in an offline mode. Consequently, they can only model interactions independently within each image pair and establish limited temporal correlations. To alleviate this problem, we propose a simple, flexible, and effective video-level tracking pipeline, named \textbf{ODTrack}, which densely associates the contextual relationships of video frames through online token propagation. ODTrack accepts video frames of arbitrary length to capture the spatio-temporal trajectory of an instance, and compresses the discriminative features (localization information) of a target into a token sequence to achieve frame-to-frame association. This new solution brings the following benefits: 1) the purified token sequences can serve as prompts for inference on the next video frame, so that past information is leveraged to guide future inference; 2) complex online update strategies are avoided through the iterative propagation of token sequences, yielding a more efficient model representation and computation. ODTrack achieves new \textit{SOTA} performance on seven benchmarks while running at real-time speed. Code and models are available at \url{https://github.com/GXNU-ZhongLab/ODTrack}.
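The online token propagation the abstract describes can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' architecture: the attention step, the update rule, and all names and dimensions (`D`, `N_PATCHES`, `attend`, `track_video`) are hypothetical stand-ins for the paper's learned modules.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16          # token / feature dimension (hypothetical)
N_PATCHES = 64  # patch tokens per search frame (hypothetical)

def attend(query, keys):
    """Single-query dot-product attention over the frame's patch tokens."""
    scores = keys @ query / np.sqrt(D)      # (N_PATCHES,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over patches
    return weights @ keys                   # (D,) weighted sum of patch features

def track_video(frames, init_token):
    """Propagate a target token online through a sequence of frames.

    Each frame is an (N_PATCHES, D) array of patch embeddings. The token
    distilled from frame t serves as the prompt for frame t+1, so past
    information guides future inference without an explicit update module.
    """
    token = init_token
    outputs = []
    for patches in frames:
        attended = attend(token, patches)     # read target cues from this frame
        token = 0.5 * token + 0.5 * attended  # toy online token update
        outputs.append(token.copy())
    return outputs

frames = [rng.standard_normal((N_PATCHES, D)) for _ in range(5)]
init_token = rng.standard_normal(D)           # stand-in for the template token
tokens = track_video(frames, init_token)
print(len(tokens), tokens[0].shape)           # one propagated token per frame
```

The key design point mirrored here is that per-frame state is a single compact token sequence carried forward across frames, rather than a growing memory bank or a periodically re-trained online model.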