
Explicit Visual Prompts for Visual Object Tracking (2401.03142v1)

Published 6 Jan 2024 in cs.CV

Abstract: How to effectively exploit spatio-temporal information is crucial for capturing target appearance changes in visual tracking. However, most deep learning-based trackers focus on designing a complicated appearance model or template-updating strategy, while neglecting the context between consecutive frames, and thus face the when-and-how-to-update dilemma. To address these issues, we propose a novel explicit visual prompts framework for visual tracking, dubbed EVPTrack. Specifically, we utilize spatio-temporal tokens to propagate information between consecutive frames without updating templates. As a result, we not only alleviate the challenge of when-to-update, but also avoid the hyper-parameters associated with updating strategies. We then use the spatio-temporal tokens to generate explicit visual prompts that facilitate inference on the current frame. The prompts are fed into a transformer encoder together with the image tokens without additional processing, which improves the model's efficiency by avoiding how-to-update. In addition, we incorporate multi-scale information as explicit visual prompts, providing multi-scale template features that enhance EVPTrack's ability to handle target scale changes. Extensive experiments on six benchmarks (LaSOT, LaSOT_ext, GOT-10k, UAV123, TrackingNet, and TNL2K) validate that EVPTrack achieves competitive performance at real-time speed by effectively exploiting both spatio-temporal and multi-scale information. Code and models are available at https://github.com/GXNU-ZhongLab/EVPTrack.
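The key mechanism described in the abstract is that prompt tokens are simply concatenated with image tokens and processed by a shared transformer encoder, with no separate update module. A minimal sketch of that idea is below; it is not the authors' code, and all shapes, token counts, and the single untrained attention layer are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # token dimension (illustrative assumption)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_layer(tokens, d_model):
    # One single-head self-attention layer with random (untrained)
    # projections, standing in for the shared transformer encoder.
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_model))
    return tokens + attn @ v  # residual connection

image_tokens = rng.standard_normal((256, d))   # e.g. 16x16 patch tokens
prompt_tokens = rng.standard_normal((4, d))    # explicit visual prompts
                                               # (hypothetical count)

# Prompts are concatenated with image tokens and encoded jointly,
# with no extra processing or template-update branch.
joint = np.concatenate([prompt_tokens, image_tokens], axis=0)
encoded = encoder_layer(joint, d)

# Only the image-token outputs would feed a tracking head; the prompts
# have already injected spatio-temporal context through attention.
encoded_image = encoded[prompt_tokens.shape[0]:]
```

The point of the sketch is the data flow: because the prompts ride through the same encoder as the image tokens, temporal context is propagated by attention itself rather than by a hand-tuned template-update schedule.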


