ODTrack: Online Dense Temporal Token Learning for Visual Tracking (2401.01686v1)

Published 3 Jan 2024 in cs.CV

Abstract: Online contextual reasoning and association across consecutive video frames are critical to perceive instances in visual tracking. However, most current top-performing trackers persistently lean on sparse temporal relationships between reference and search frames via an offline mode. Consequently, they can only interact independently within each image-pair and establish limited temporal correlations. To alleviate the above problem, we propose a simple, flexible and effective video-level tracking pipeline, named ODTrack, which densely associates the contextual relationships of video frames in an online token propagation manner. ODTrack receives video frames of arbitrary length to capture the spatio-temporal trajectory relationships of an instance, and compresses the discrimination features (localization information) of a target into a token sequence to achieve frame-to-frame association. This new solution brings the following benefits: 1) the purified token sequences can serve as prompts for the inference in the next video frame, whereby past information is leveraged to guide future inference; 2) the complex online update strategies are effectively avoided by the iterative propagation of token sequences, and thus we can achieve more efficient model representation and computation. ODTrack achieves a new SOTA performance on seven benchmarks, while running at real-time speed. Code and models are available at https://github.com/GXNU-ZhongLab/ODTrack.

References (44)
  1. Fully-Convolutional Siamese Networks for Object Tracking. In ECCV Workshops, 850–865.
  2. Learning Discriminative Model Prediction for Tracking. In ICCV, 6181–6190.
  3. TCTrack: Temporal Contexts for Aerial Tracking. In CVPR, 14778–14788.
  4. Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking. In ECCV, 375–392.
  5. SeqTrack: Sequence to Sequence Learning for Visual Object Tracking. In CVPR (arXiv:2304.14394).
  6. Transformer Tracking. In CVPR, 8126–8135.
  7. Siamese Box Adaptive Network for Visual Tracking. In CVPR, 6667–6676.
  8. MixFormer: End-to-End Tracking with Iterative Mixed Attention. In CVPR, 13598–13608.
  9. ATOM: Accurate Tracking by Overlap Maximization. In CVPR, 4660–4669.
  10. Probabilistic Regression for Visual Tracking. In CVPR, 7181–7190.
  11. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR.
  12. LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking. In CVPR, 5374–5383.
  13. STMTrack: Template-Free Visual Tracking With Space-Time Memory Networks. In CVPR, 13774–13783.
  14. AiATrack: Attention in Attention for Transformer Visual Tracking. In ECCV, 146–164.
  15. Generalized Relation Modeling for Transformer Tracking. In CVPR (arXiv:2303.16580).
  16. Graph Attention Tracking. In CVPR, 9543–9552.
  17. Learning Target-aware Representation for Visual Tracking via Informative Interactions. In IJCAI, 927–934.
  18. Learning To Fuse Asymmetric Feature Maps in Siamese Trackers. In CVPR, 16570–16580.
  19. Masked Autoencoders Are Scalable Vision Learners. In CVPR, 15979–15988.
  20. GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild. IEEE Trans. Pattern Anal. Mach. Intell., 43(5): 1562–1577.
  21. The Eighth Visual Object Tracking VOT2020 Challenge Results. In ECCV Workshops, 547–601.
  22. SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks. In CVPR, 4282–4291.
  23. High Performance Visual Tracking With Siamese Region Proposal Network. In CVPR, 8971–8980.
  24. PG-Net: Pixel to Global Matching Network for Visual Tracking. In ECCV, 429–444.
  25. Focal Loss for Dense Object Detection. In ICCV, 2999–3007.
  26. Microsoft COCO: Common Objects in Context. In ECCV, 740–755.
  27. TrackFormer: Multi-Object Tracking with Transformers. In CVPR, 8834–8844.
  28. TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild. In ECCV, 310–327.
  29. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In CVPR, 658–666.
  30. Attention is All you Need. In NIPS, 5998–6008.
  31. Siam R-CNN: Visual Tracking by Re-Detection. In CVPR, 6577–6587.
  32. Transformer meets tracker: Exploiting temporal context for robust visual tracking. In CVPR, 1571–1580.
  33. Towards More Flexible and Accurate Object Tracking With Natural Language: Algorithms and Benchmark. In CVPR, 13763–13773.
  34. Object Tracking Benchmark. IEEE Trans. Pattern Anal. Mach. Intell., 37(9): 1834–1848.
  35. VideoTrack: Learning to Track Objects via Video Transformer. In CVPR, 22826–22835.
  36. Correlation-Aware Deep Tracking. In CVPR, 8741–8750.
  37. Autoregressive Visual Tracking. In CVPR, 9697–9706.
  38. Learning Spatio-Temporal Transformer for Visual Tracking. In ICCV, 10428–10437.
  39. Alpha-Refine: Boosting Tracking Performance by Precise Bounding Box Estimation. In CVPR, 5289–5298.
  40. Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework. In ECCV, 341–357.
  41. Deformable Siamese Attention Networks for Visual Object Tracking. In CVPR, 6727–6736.
  42. MOTR: End-to-End Multiple-Object Tracking with Transformer. In ECCV, 659–675.
  43. Learning the Model Update for Siamese Trackers. In ICCV, 4009–4018.
  44. Ocean: Object-Aware Anchor-Free Tracking. In ECCV, 771–787.

Summary

  • The paper introduces ODTrack, a novel visual tracking method using online dense temporal token learning for comprehensive video-level contextual reasoning.
  • ODTrack employs innovative concatenated and separated token attention mechanisms for efficient propagation of discriminative target features across frames, avoiding complex online updates.
  • ODTrack achieves state-of-the-art performance on seven visual tracking benchmarks, including LaSOT and GOT-10k, demonstrating superior precision and robustness.

Online Dense Temporal Token Learning for Visual Tracking: An Analysis

The paper introduces ODTrack, an approach to visual tracking that uses online dense temporal token learning to enhance contextual reasoning across video sequences. The fundamental problem it addresses is a limitation of prior trackers, which typically model only sparse temporal relationships between a reference frame and a single search frame in an offline fashion, and therefore fail to capture the rich temporal dynamics present in video streams. ODTrack instead proposes a video-level tracking pipeline that improves target perception and tracking efficiency through temporal token propagation: the target's localization cues are compressed into a compact token sequence that is carried from frame to frame, as the sketch below illustrates.
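To make the pipeline concrete, here is a minimal PyTorch sketch of a token-propagation tracking loop. The class and method names are illustrative inventions, not ODTrack's actual API; the sketch only shows the control flow in which a small token sequence is updated at each frame and re-injected at the next.

```python
import torch
import torch.nn as nn

class TokenPropagationTracker(nn.Module):
    """Toy sketch of a video-level tracking loop (names are hypothetical,
    not ODTrack's real modules)."""
    def __init__(self, dim=256, num_tokens=4):
        super().__init__()
        self.init_tokens = nn.Parameter(torch.zeros(1, num_tokens, dim))
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                  batch_first=True)
        self.head = nn.Linear(dim, 4)  # box-regression stub

    def forward_frame(self, frame_feats, tokens):
        # Fuse the propagated tokens with the current frame's patch features.
        x = torch.cat([tokens, frame_feats], dim=1)
        x = self.encoder(x)
        new_tokens = x[:, : tokens.shape[1]]     # purified tokens for the next frame
        box = self.head(new_tokens.mean(dim=1))  # predict the target box
        return box, new_tokens

def track(model, video_feats):
    """video_feats: list of (1, N, dim) per-frame patch embeddings."""
    tokens = model.init_tokens
    boxes = []
    for feats in video_feats:
        # At inference the only state carried across frames is the token sequence.
        box, tokens = model.forward_frame(feats, tokens.detach())
        boxes.append(box)
    return boxes

model = TokenPropagationTracker()
video = [torch.randn(1, 196, 256) for _ in range(5)]
print(len(track(model, video)))  # 5 boxes, one per frame
```

The design point worth noting is that the only inter-frame state is the token sequence itself, which is what lets this style of tracker dispense with hand-crafted template-update heuristics.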

Key Contributions

ODTrack stands apart from existing tracking solutions with several notable contributions:

  1. Video-Level Tracking Pipeline: The development of ODTrack marks a shift from traditional image-level tracking to a comprehensive video-level approach built on dense associations across frames. The model accepts video clips of arbitrary length, capturing the spatio-temporal trajectory of a target in real time.
  2. Temporal Token Propagation Mechanism: Two attention mechanisms are introduced, concatenated token attention and separated token attention, which propagate token sequences efficiently across frames. The tokens compress discriminative target features that guide inference on subsequent frames, avoiding complex online update strategies (a sketch contrasting the two variants follows this list).
  3. State-of-the-Art (SOTA) Performance: ODTrack achieves SOTA results across seven benchmarks, including LaSOT, TrackingNet, GOT-10k, and OTB100. These results indicate strong efficacy across diverse tracking scenarios and underscore ODTrack's potential as a pivotal tool for advancing visual tracking technology.
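The difference between the two token-attention variants can be sketched in a few lines using standard multi-head attention. This is a hedged illustration of the general idea; the paper's actual layer structure may differ.

```python
import torch
import torch.nn as nn

dim, heads, num_tokens = 256, 8, 4
tokens = torch.randn(1, num_tokens, dim)   # temporal tokens propagated from past frames
patches = torch.randn(1, 196, dim)         # patch embeddings of the current search frame

# Concatenated token attention: tokens and image patches are mixed in a
# single self-attention pass, so both streams update jointly.
self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
x = torch.cat([tokens, patches], dim=1)
mixed, _ = self_attn(x, x, x)
tokens_cat = mixed[:, :num_tokens]

# Separated token attention: tokens cross-attend to the patches, reading
# target evidence from the frame without modifying the image features.
cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
tokens_sep, _ = cross_attn(tokens, patches, patches)

print(tokens_cat.shape, tokens_sep.shape)  # both torch.Size([1, 4, 256])
```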

Numerical Outcomes

ODTrack exhibits strong tracking precision and robustness, as reflected in its high AO (Average Overlap) and AUC (Area Under the success-plot Curve) scores across a range of challenging benchmarks; both metrics are sketched below. In particular, the ODTrack-B and ODTrack-L variants deliver notable improvements in precision metrics over existing methods such as SeqTrack and ARTrack.
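For reference, both metrics can be computed from per-frame IoU scores between predicted and ground-truth boxes. This is a generic sketch, not the official benchmark toolkit code:

```python
import numpy as np

def average_overlap(ious):
    """GOT-10k-style AO: the mean IoU over all frames of a sequence."""
    return float(np.mean(ious))

def success_auc(ious, thresholds=np.linspace(0.0, 1.0, 21)):
    """LaSOT-style AUC: area under the success curve, where success(t)
    is the fraction of frames whose IoU exceeds threshold t."""
    success = [(ious > t).mean() for t in thresholds]
    return float(np.mean(success))

ious = np.array([0.81, 0.74, 0.92, 0.33, 0.67])  # toy per-frame IoU values
print(average_overlap(ious), success_auc(ious))
```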

Implications and Future Developments

The practical implications of this research are profound, potentially transforming the landscape of real-time visual tracking in applications ranging from surveillance to autonomous vehicles. The theoretical implications are equally significant, suggesting new pathways for research into more sophisticated temporal modeling and memory constructs within neural architectures.

Looking forward, future developments could aim to optimize computational efficiency while maintaining high performance, potentially integrating lighter frameworks for mobile and embedded systems. Moreover, ongoing exploration into adaptive reinforcement mechanisms for token propagation could unlock further advancements in intelligent and autonomous visual perception systems.

Conclusion

ODTrack represents a strategic advancement in visual tracking technology, providing a compelling framework that addresses previous limitations and opens new directions for future research. The paper provides a comprehensive analysis of the temporal token propagation mechanism's effectiveness, affirming its potential to redefine the capabilities of visual perception models in dynamically shifting environments.